<- read.csv("data/allschoolcontacts.csv", na.strings = "") %>%
mailing ::clean_names() %>%
janitorselect(district_number,
school_code,
school_name,
postal_code,
physical_postal_code,
school_category,
type,
city,# added 2024-10-01 to
grade_10_enrollment, # reduce processing times
grade_11_enrollment, %>%
grade_12_enrollment) rename(school_district = district_number)
# You may select other variables in the future
# (e.g., enrollment for some grades)
<- read.csv("data/all-mailing.csv") %>%
districts ::clean_names() %>%
janitorselect(district_number,
%>%
district_name) rename(school_district = district_number)
<- openxlsx::read.xlsx("https://www2.gov.bc.ca/assets/gov/data/geographic/land-use/administrative-boundaries/translation_place_names2019.xlsx", startRow = 4)%>%
admin_areas ::clean_names() %>%
janitorslice(.,-1) %>% #The first row is empty data
rename(health_authority_name = x11)
Creating the BC Schools Master File
Context
The Youth Development Instrument (YDI) is a survey of BC youth collected through school personnel. It is used to inform research and policy at the school, district, health authority, and provincial levels. To do so, each individual record is linked to the corresponding administrative area. In the past, such linkages had been done manually, at least in part. The creation of documents and linkages occurred at different data pipeline stages, creating unnecessary delays. As the YDI continues to expand, a decision was made to create a Master File that includes the codes and names of schools, their districts, and health areas.
Data sources
This document uses three data sources:
The mailing information for all schools from the BC School District and Contact Information website (https://bcschoolcontacts.gov.bc.ca/). Click “Download Schools Info” followed up by “All schools”;
The mailing information for all districts from the same website. Click “Mailing for All Districts (CSV)”; and
BC Stats’ Translation of Place Names into Administrative Areas. You can search this file with quotation marks at the BC Stats website (https://www2.gov.bc.ca/gov/search), but you will have better luck with the direct link (below).
Importing Data
Loading the files:
Data wrangling
Let’s take a look at the mailing
dataset:
glimpse(mailing)
Rows: 2,304
Columns: 11
$ school_district <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
$ school_code <int> 501007, 501009, 501010, 501017, 502001, 502011, 5…
$ school_name <chr> "Jaffray Elem-Jr Secondary", "Isabella Dicken Ele…
$ postal_code <chr> "V0B1T0", "V0B1M0", "V0B2G0", "V0B1H0", "V1C1B2",…
$ physical_postal_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ school_category <chr> "Public School", "Public School", "Public School"…
$ type <chr> "Standard school", "Standard school", "Standard s…
$ city <chr> "JAFFRAY", "FERNIE", "SPARWOOD", "ELKFORD", "CRAN…
$ grade_10_enrollment <chr> "Y", "N", "N", "N", "Y", "N", "N", "N", "N", "N",…
$ grade_11_enrollment <chr> "N", "N", "N", "N", "Y", "N", "N", "N", "N", "N",…
$ grade_12_enrollment <chr> "N", "N", "N", "N", "Y", "N", "N", "N", "N", "N",…
It looks like the schools have two postal codes. We need the physical postal code because we are interested in the distance from the school to the student’s house. But most schools only provided this information under “postal code”. So we need to do some cleaning here:
# If the physical_postal_code variable is not empty,
# I want it to replace the postal_code variable
$postal_code <- ifelse(!is.na(mailing$physical_postal_code), #condition
mailing$physical_postal_code, #true
mailing$postal_code) #false
mailing
# Now we can delete redundant information
$physical_postal_code <- NULL mailing
We also need the short version of the postal code for later.
# substr (target, first_character, last_character)
$short_postal_code <- substr(mailing$postal_code,1,3) mailing
Now that everything is sorted, we can just add the district name from the districts
dataframe and the health authorities from the admin_areas
dataframe.
# I am telling R to match the two dataframes by school_district,
# and to find the district_name variable in districts
$district_name <-
mailing$district_name[match(mailing$school_district,
districts$school_district)]
districts
# Now I repeat the process for health_authority_name in admin_areas
$health_authority <-
mailing$health_authority_name[match(mailing$school_district,
admin_areas$school_district)]
admin_areas
# That is all I needed from these dataframes
<- NULL
admin_areas <- NULL districts
Did everyone get a district?
round(100*prop.table(table(!is.na(mailing$district_name))),1)
TRUE
100
Did everyone get a health authority?
round(100*prop.table(table(!is.na(mailing$health_authority))),1)
FALSE TRUE
2.4 97.6
It looks like a few entries did not. Let’s find out which districts are missing.
unique(mailing$school_district[is.na(mailing$health_authority)])
[1] 93
Ok, I just googled it!
The Conseil scolaire francophone de la Colombie-Britannique is the French-language school board for all French schools located in British Columbia.
This is an interesting problem because district 93 is the only one that can be located in many health authorities. So, we are going to find the health authority for each school using their 3-character postal code. But first, I want to make sure all missing postal codes can be matched with existing postal codes.
<- mailing$short_postal_code[is.na(mailing$health_authority)]
lost_postal_codes <- mailing$short_postal_code[!is.na(mailing$health_authority)]
known_postal_codes
table(lost_postal_codes %in% known_postal_codes) #Verifying no one is left behind
TRUE
55
Success! Now let’s get to work. First I want to create a dataframe with short postal codes and health authorities
<- mailing %>%
HA_finder filter(!is.na(health_authority)) %>%
group_by(short_postal_code) %>%
reframe(HA = health_authority) %>%
unique.array()
glimpse(HA_finder)
Rows: 207
Columns: 2
$ short_postal_code <chr> "V0A", "V0B", "V0C", "V0E", "V0E", "V0G", "V0H", "V0…
$ HA <chr> "Interior", "Interior", "Northern", "Interior", "Nor…
Next, I want to separate the French schools from the rest. Technically speaking, this is not really required. But it is easier to understand the code this way.
<- mailing %>% filter(!is.na(health_authority))
mailing_complete <- mailing %>% filter(is.na(health_authority)) mailing_missing
Now we can match the health authority by postal code, just like we did before with school districts.
$health_authority <-
mailing_missing$HA[match(mailing_missing$short_postal_code,
HA_finder$short_postal_code)] HA_finder
Finally, we put together the pieces of the broken data frame and save it as a csv file to import it into the caraspace.
<- rbind(mailing_complete,mailing_missing)
master_file write.csv(x = master_file, file = "data/BC_schools_master_file.csv")
Downloads
BC Schools Master File by Jorge Andrés Delgado-Ron and its associated associated metadata are marked with CC0 1.0 and free to download and use.