mailing <- read.csv("data/allschoolcontacts.csv", na.strings = "") %>%
janitor::clean_names() %>%
select(district_number,
school_code,
school_name,
postal_code,
physical_postal_code,
school_category,
type,
city,
grade_10_enrollment, # added 2024-10-01 to
grade_11_enrollment, # reduce processing times
grade_12_enrollment) %>%
rename(school_district = district_number)
# You may select other variables in the future
# (e.g., enrollment for some grades)
districts <- read.csv("data/all-mailing.csv") %>%
janitor::clean_names() %>%
select(district_number,
district_name) %>%
rename(school_district = district_number)
admin_areas <- openxlsx::read.xlsx("https://www2.gov.bc.ca/assets/gov/data/geographic/land-use/administrative-boundaries/translation_place_names2019.xlsx", startRow = 4)%>%
janitor::clean_names() %>%
slice(.,-1) %>% #The first row is empty data
rename(health_authority_name = x11) Creating the BC Schools Master File
Context
The Youth Development Instrument (YDI) is a survey of BC youth collected through school personnel. It is used to inform research and policy at the school, district, health authority, and provincial levels. To do so, each individual record is linked to the corresponding administrative area. In the past, such linkages had been done manually, at least in part. The creation of documents and linkages occurred at different data pipeline stages, creating unnecessary delays. As the YDI continues to expand, a decision was made to create a Master File that includes the codes and names of schools, their districts, and health areas.
Data sources
This document uses three data sources:
The mailing information for all schools from the BC School District and Contact Information website (https://bcschoolcontacts.gov.bc.ca/). Click “Download Schools Info” followed up by “All schools”;
The mailing information for all districts from the same website. Click “Mailing for All Districts (CSV)”; and
BC Stats’ Translation of Place Names into Administrative Areas. You can search this file with quotation marks at the BC Stats website (https://www2.gov.bc.ca/gov/search), but you will have better luck with the direct link (below).
Importing Data
Loading the files:
Data wrangling
Let’s take a look at the mailing dataset:
glimpse(mailing)Rows: 2,304
Columns: 11
$ school_district <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
$ school_code <int> 501007, 501009, 501010, 501017, 502001, 502011, 5…
$ school_name <chr> "Jaffray Elem-Jr Secondary", "Isabella Dicken Ele…
$ postal_code <chr> "V0B1T0", "V0B1M0", "V0B2G0", "V0B1H0", "V1C1B2",…
$ physical_postal_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ school_category <chr> "Public School", "Public School", "Public School"…
$ type <chr> "Standard school", "Standard school", "Standard s…
$ city <chr> "JAFFRAY", "FERNIE", "SPARWOOD", "ELKFORD", "CRAN…
$ grade_10_enrollment <chr> "Y", "N", "N", "N", "Y", "N", "N", "N", "N", "N",…
$ grade_11_enrollment <chr> "N", "N", "N", "N", "Y", "N", "N", "N", "N", "N",…
$ grade_12_enrollment <chr> "N", "N", "N", "N", "Y", "N", "N", "N", "N", "N",…
It looks like the schools have two postal codes. We need the physical postal code because we are interested in the distance from the school to the student’s house. But most schools only provided this information under “postal code”. So we need to do some cleaning here:
# If the physical_postal_code variable is not empty,
# I want it to replace the postal_code variable
mailing$postal_code <- ifelse(!is.na(mailing$physical_postal_code), #condition
mailing$physical_postal_code, #true
mailing$postal_code) #false
# Now we can delete redundant information
mailing$physical_postal_code <- NULLWe also need the short version of the postal code for later.
# substr (target, first_character, last_character)
mailing$short_postal_code <- substr(mailing$postal_code,1,3)Now that everything is sorted, we can just add the district name from the districts dataframe and the health authorities from the admin_areas dataframe.
# I am telling R to match the two dataframes by school_district,
# and to find the district_name variable in districts
mailing$district_name <-
districts$district_name[match(mailing$school_district,
districts$school_district)]
# Now I repeat the process for health_authority_name in admin_areas
mailing$health_authority <-
admin_areas$health_authority_name[match(mailing$school_district,
admin_areas$school_district)]
# That is all I needed from these dataframes
admin_areas <- NULL
districts <- NULLDid everyone get a district?
round(100*prop.table(table(!is.na(mailing$district_name))),1)
TRUE
100
Did everyone get a health authority?
round(100*prop.table(table(!is.na(mailing$health_authority))),1)
FALSE TRUE
2.4 97.6
It looks like a few entries did not. Let’s find out which districts are missing.
unique(mailing$school_district[is.na(mailing$health_authority)])[1] 93
Ok, I just googled it!
The Conseil scolaire francophone de la Colombie-Britannique is the French-language school board for all French schools located in British Columbia.
This is an interesting problem because district 93 is the only one that can be located in many health authorities. So, we are going to find the health authority for each school using their 3-character postal code. But first, I want to make sure all missing postal codes can be matched with existing postal codes.
lost_postal_codes <- mailing$short_postal_code[is.na(mailing$health_authority)]
known_postal_codes <- mailing$short_postal_code[!is.na(mailing$health_authority)]
table(lost_postal_codes %in% known_postal_codes) #Verifying no one is left behind
TRUE
55
Success! Now let’s get to work. First I want to create a dataframe with short postal codes and health authorities
HA_finder <- mailing %>%
filter(!is.na(health_authority)) %>%
group_by(short_postal_code) %>%
reframe(HA = health_authority) %>%
unique.array()
glimpse(HA_finder)Rows: 207
Columns: 2
$ short_postal_code <chr> "V0A", "V0B", "V0C", "V0E", "V0E", "V0G", "V0H", "V0…
$ HA <chr> "Interior", "Interior", "Northern", "Interior", "Nor…
Next, I want to separate the French schools from the rest. Technically speaking, this is not really required. But it is easier to understand the code this way.
mailing_complete <- mailing %>% filter(!is.na(health_authority))
mailing_missing <- mailing %>% filter(is.na(health_authority))Now we can match the health authority by postal code, just like we did before with school districts.
mailing_missing$health_authority <-
HA_finder$HA[match(mailing_missing$short_postal_code,
HA_finder$short_postal_code)]Finally, we put together the pieces of the broken data frame and save it as a csv file to import it into the caraspace.
master_file <- rbind(mailing_complete,mailing_missing)
write.csv(x = master_file, file = "data/BC_schools_master_file.csv")Downloads
BC Schools Master File by Jorge Andrés Delgado-Ron and its associated associated metadata are marked with CC0 1.0 and free to download and use.