Creating the BC Schools Master File

Author

Jorge Andrés Delgado-Ron

Context

The Youth Development Instrument (YDI) is a survey of BC youth collected through school personnel. It is used to inform research and policy at the school, district, health authority, and provincial levels. To do so, each individual record is linked to the corresponding administrative area. In the past, such linkages had been done manually, at least in part. The creation of documents and linkages occurred at different data pipeline stages, creating unnecessary delays. As the YDI continues to expand, a decision was made to create a Master File that includes the codes and names of schools, their districts, and health areas.

Data sources

This document uses three data sources:

  1. The mailing information for all schools from the BC School District and Contact Information website (https://bcschoolcontacts.gov.bc.ca/). Click “Download Schools Info” followed up by “All schools”;

  2. The mailing information for all districts from the same website. Click “Mailing for All Districts (CSV)”; and

  3. BC Stats’ Translation of Place Names into Administrative Areas. You can search this file with quotation marks at the BC Stats website (https://www2.gov.bc.ca/gov/search), but you will have better luck with the direct link (below).

Importing Data

Loading the files:

mailing <- read.csv("data/allschoolcontacts.csv", na.strings = "") %>%
  janitor::clean_names() %>% 
  select(district_number, 
         school_code, 
         school_name, 
         postal_code,
         physical_postal_code,
         school_category, 
         type,
         city,
         grade_10_enrollment, # added 2024-10-01 to
         grade_11_enrollment, # reduce processing times
         grade_12_enrollment) %>% 
  rename(school_district = district_number)
# You may select other variables in the future 
# (e.g., enrollment for some grades)

districts <- read.csv("data/all-mailing.csv") %>%
  janitor::clean_names() %>% 
  select(district_number,
         district_name) %>% 
  rename(school_district = district_number)

admin_areas <- openxlsx::read.xlsx("https://www2.gov.bc.ca/assets/gov/data/geographic/land-use/administrative-boundaries/translation_place_names2019.xlsx", startRow = 4)%>% 
  janitor::clean_names() %>% 
  slice(.,-1) %>% #The first row is empty data
  rename(health_authority_name = x11) 

Data wrangling

Let’s take a look at the mailing dataset:

glimpse(mailing)
Rows: 2,304
Columns: 11
$ school_district      <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
$ school_code          <int> 501007, 501009, 501010, 501017, 502001, 502011, 5…
$ school_name          <chr> "Jaffray Elem-Jr Secondary", "Isabella Dicken Ele…
$ postal_code          <chr> "V0B1T0", "V0B1M0", "V0B2G0", "V0B1H0", "V1C1B2",…
$ physical_postal_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ school_category      <chr> "Public School", "Public School", "Public School"…
$ type                 <chr> "Standard school", "Standard school", "Standard s…
$ city                 <chr> "JAFFRAY", "FERNIE", "SPARWOOD", "ELKFORD", "CRAN…
$ grade_10_enrollment  <chr> "Y", "N", "N", "N", "Y", "N", "N", "N", "N", "N",…
$ grade_11_enrollment  <chr> "N", "N", "N", "N", "Y", "N", "N", "N", "N", "N",…
$ grade_12_enrollment  <chr> "N", "N", "N", "N", "Y", "N", "N", "N", "N", "N",…

It looks like the schools have two postal codes. We need the physical postal code because we are interested in the distance from the school to the student’s house. But most schools only provided this information under “postal code”. So we need to do some cleaning here:

# If the physical_postal_code variable is not empty, 
# I want it to replace the postal_code variable

mailing$postal_code <- ifelse(!is.na(mailing$physical_postal_code), #condition
                              mailing$physical_postal_code, #true 
                              mailing$postal_code) #false

# Now we can delete redundant information
mailing$physical_postal_code <- NULL

We also need the short version of the postal code for later.

# substr (target, first_character, last_character)
mailing$short_postal_code <- substr(mailing$postal_code,1,3)

Now that everything is sorted, we can just add the district name from the districts dataframe and the health authorities from the admin_areas dataframe.

# I am telling R to match the two dataframes by school_district,
# and to find the district_name  variable in districts
mailing$district_name <-
  districts$district_name[match(mailing$school_district,
                                          districts$school_district)]

# Now I repeat the process for health_authority_name in admin_areas

mailing$health_authority <-
  admin_areas$health_authority_name[match(mailing$school_district,
                                          admin_areas$school_district)]

# That is all I needed from these dataframes
admin_areas <- NULL
districts <- NULL

Did everyone get a district?

round(100*prop.table(table(!is.na(mailing$district_name))),1)

TRUE 
 100 

Did everyone get a health authority?

round(100*prop.table(table(!is.na(mailing$health_authority))),1)

FALSE  TRUE 
  2.4  97.6 

It looks like a few entries did not. Let’s find out which districts are missing.

unique(mailing$school_district[is.na(mailing$health_authority)])
[1] 93

Ok, I just googled it!

The Conseil scolaire francophone de la Colombie-Britannique is the French-language school board for all French schools located in British Columbia.

This is an interesting problem because district 93 is the only one that can be located in many health authorities. So, we are going to find the health authority for each school using their 3-character postal code. But first, I want to make sure all missing postal codes can be matched with existing postal codes.

lost_postal_codes <- mailing$short_postal_code[is.na(mailing$health_authority)]
known_postal_codes <- mailing$short_postal_code[!is.na(mailing$health_authority)]

table(lost_postal_codes %in% known_postal_codes) #Verifying no one is left behind

TRUE 
  55 

Success! Now let’s get to work. First I want to create a dataframe with short postal codes and health authorities

HA_finder <- mailing  %>% 
  filter(!is.na(health_authority)) %>% 
  group_by(short_postal_code) %>% 
  reframe(HA = health_authority) %>%
  unique.array() 

glimpse(HA_finder)
Rows: 207
Columns: 2
$ short_postal_code <chr> "V0A", "V0B", "V0C", "V0E", "V0E", "V0G", "V0H", "V0…
$ HA                <chr> "Interior", "Interior", "Northern", "Interior", "Nor…

Next, I want to separate the French schools from the rest. Technically speaking, this is not really required. But it is easier to understand the code this way.

mailing_complete <- mailing %>% filter(!is.na(health_authority))
mailing_missing  <- mailing %>% filter(is.na(health_authority))

Now we can match the health authority by postal code, just like we did before with school districts.

mailing_missing$health_authority <-
  HA_finder$HA[match(mailing_missing$short_postal_code,
                     HA_finder$short_postal_code)]

Finally, we put together the pieces of the broken data frame and save it as a csv file to import it into the caraspace.

master_file <- rbind(mailing_complete,mailing_missing)
write.csv(x = master_file, file = "data/BC_schools_master_file.csv")

Downloads

BC Schools Master File by Jorge Andrés Delgado-Ron and its associated associated metadata are marked with CC0 1.0 CC iconCC0 icon and free to download and use.