Piping allows us to break down complex tasks into manageable chunks that can be written and tested one after another. There are several powerful commands in the tidyverse as part of the dplyr package that can help us group, filter, select, mutate and summarise datasets. With this small set of commands we can use piping to convert massive datasets into simple and useful results.
Using the pipe %>% command, we can feed the results from one command into the next command making for reusable and easy to read code.
Note
The pipe command we are using %>% is from the magrittr package which is installed alongside the tidyverse. Recently R introduced another pipe |> which offers very similar functionality and tutorials online might use either. The examples below use the %>% pipe.
Let’s look at an example of using the pipe on the PISA_2022 table to calculate the best performing OECD countries for maths PV1MATH by gender ST004D01T:
line 1 passes the whole PISA_2022 dataset and pipes it into the next line using %>%
2
line 2 filters out any results that are from non-OECD countries by finding all the rows where OECD equals == “Yes”, this is then piped to the next line
3
line 3 groups the data by country CNT and by student gender ST004D01T, this is then piped to the next line
4
line 4-6 the summarise command performs a calculation on the country and gender groupings returning three new columns, each command is described by code on a new line and separated by a comma: the mean value for maths mean_maths, the standard deviation sd_maths, and a column telling us how many students were in each grouping using the n() which returns the number of rows in a group. These new columns and the grouping columns are then piped to the next line, all other columns are dropped
5
line 7 filters out any gender ST004D01T that is NA. First is finds all the students that have NA as their gender by using is.na(ST004D01T), then it NOTs/flips the result using the exclamation mark !, giving those students who don’t have their gender set to NA. The filtered data is then piped to the next line
6
line 8, finally we arrange/sort the results in descending order by the mean_maths column. The default for arrange is ascending order, leave out the desc( ) for the numbers to be ordered in the opposite way.
# A tibble: 74 × 5
# Groups: CNT [37]
CNT ST004D01T mean_maths sd_maths students
<fct> <fct> <dbl> <dbl> <int>
1 Japan Male 540. 99.0 2856
2 Korea Male 534. 112. 3325
3 Japan Female 531. 88.1 2904
4 Korea Female 528. 99.1 3129
5 Estonia Male 515. 87.4 3272
6 Switzerland Male 511. 99.3 3540
7 Estonia Female 510. 81.5 3120
8 Czech Republic Male 502. 100. 4232
9 Switzerland Female 501. 90.9 3289
10 Austria Male 500. 94.5 3110
# ℹ 64 more rows
Across the top few countries, Males get a slightly better maths score PV1MATH than Females, other scores are available, please read §What are Plausible Values? to find out more about the limitations of using a “PV” value.
Note
we met the assignment command earlier <-. Within the tidyverse commands we use the equals sign instead =.
The commands we have just used come from a package within the tidyverse called dplyr, let’s take a look at what they do:
command
purpose
example
select
reduce the dataframe to the fields that you specify
select(<field>, <field>, <field>)
filter
get rid of rows that don’t meet one or more criteria
filter(<field> <comparison>)
group
group fields together to perform calculations
group_by(<field>, <field>))
mutate
add new fields or change values in current fields
mutate(<new_field> = <field> / 2)
summarise
create summary data optionally using a grouping command
summarise(<new_field> = max(<field>))
arrange
order the results by one or more fields
arrange(desc(<field>))
Note
If you want to explore more of the functions of dplyr, take a look at the helpsheet
Adjust the code above to find out the lowest performing countries for reading PV1READ by gender that are not in the OECD
# A tibble: 613,744 × 4
CNT ESCS ST004D01T ST003D02T
<fct> <dbl> <fct> <fct>
1 Albania 1.11 Female May
2 Albania -3.05 Male February
3 Albania -0.187 Male August
4 Albania -3.22 Female July
5 Albania -1.05 Female January
6 Albania 1.09 Male May
7 Albania -0.762 Male May
8 Albania -1.02 Female December
9 Albania -1.17 Female August
10 Albania 0.286 Female September
# ℹ 613,734 more rows
You might also be in the situation where you want to select everything but one or two fields, you can do this with the negative signal -, the below code returns all the fields exceptCNT and OECD:
PISA_2022 %>%select(-CNT, -OECD)
# A tibble: 613,744 × 81
CNTSCHID CNTSTUID REGION LANGTEST_QQQ ST003D02T ST003D03T ST004D01T
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
1 800282 800001 Albania Albanian May 2006 Female
2 800115 800002 Albania Albanian February 2006 Male
3 800242 800003 Albania Albanian August 2006 Male
4 800245 800005 Albania Albanian July 2006 Female
5 800285 800006 Albania Albanian January 2006 Female
6 800172 800007 Albania Albanian May 2006 Male
7 800082 800008 Albania Albanian May 2006 Male
8 800274 800009 Albania Albanian December 2006 Female
9 800057 800010 Albania Albanian August 2006 Female
10 800132 800012 Albania Albanian September 2006 Female
# ℹ 613,734 more rows
# ℹ 74 more variables: ST250Q01JA <fct>, ST250Q02JA <fct>, ST250Q03JA <fct>,
# ST250Q05JA <fct>, ST251Q01JA <fct>, ST251Q06JA <fct>, ST251Q07JA <fct>,
# ST253Q01JA <fct>, ST254Q01JA <fct>, ST254Q02JA <fct>, ST254Q03JA <fct>,
# ST254Q04JA <fct>, ST254Q05JA <fct>, ST254Q06JA <fct>, ST255Q01JA <fct>,
# ST256Q02JA <fct>, ST005Q01JA <fct>, ST007Q01JA <fct>, ST019AQ01T <fct>,
# ST019BQ01T <fct>, ST019CQ01T <fct>, ST125Q01NA <fct>, ST261Q01JA <fct>, …
You might find that you have a vector of column names that you want to select, to do this, we can use the any_of command:
# A tibble: 613,744 × 3
CNT CNTSCHID ST004D01T
<fct> <dbl> <fct>
1 Albania 800282 Female
2 Albania 800115 Male
3 Albania 800242 Male
4 Albania 800245 Female
5 Albania 800285 Female
6 Albania 800172 Male
7 Albania 800082 Male
8 Albania 800274 Female
9 Albania 800057 Female
10 Albania 800132 Female
# ℹ 613,734 more rows
With hundreds of fields available, you might want to focus on fields whose names match a certain pattern, to do this you can use starts_with, ends_with, contains:
# country of birth of student, and father and mother are recorded in ST019___PISA_2022 %>%select(starts_with("ST019"))
# A tibble: 613,744 × 3
ST019AQ01T ST019BQ01T ST019CQ01T
<fct> <fct> <fct>
1 Country of test Country of test Country of test
2 Country of test Country of test Country of test
3 Other country Country of test Country of test
4 Country of test Country of test Country of test
5 Country of test Country of test Country of test
6 Country of test Country of test Country of test
7 Country of test Country of test Country of test
8 Country of test Country of test Country of test
9 Country of test Country of test Country of test
10 Country of test Country of test Country of test
# ℹ 613,734 more rows
When you come to building your statistical models you often need to use numeric data, you can find the columns that have only numbers in them by the following. Be warned though, sometimes there are numeric fields which have a few words in them, so R treats them as characters. Use the PISA codebook to help work out where those numbers are.
Write a select statement to show all the fields that are to do with well being and health, e.g. WB150Q01HA “How is your health?”
answer
PISA_2022 %>%select(starts_with("WB15"))
[EXTENSION] Adjust your answer to Q3 so that you select the gender ST004D01T and the ID CNTSTUID of each student in addition to the ST254____ fields looking at digital devices in the home:
Not only does the PISA_2022 dataset have a huge number of columns, it has hundred of thousands of rows. We want to filter this down to the students that we are interested in, i.e. filter out data that isn’t useful for our analysis. If we only wanted the results that were Male, we could do the following:
# A tibble: 307,906 × 5
CNT ESCS ST004D01T ST003D02T PV1MATH
<fct> <dbl> <fct> <fct> <dbl>
1 Albania -3.05 Male February 308.
2 Albania -0.187 Male August 268.
3 Albania 1.09 Male May 534.
4 Albania -0.762 Male May 382.
5 Albania -1.98 Male October 425.
6 Albania 0.063 Male April 463.
7 Albania -0.170 Male May 236.
8 Albania -2.58 Male August 327.
9 Albania -1.09 Male March 326.
10 Albania -0.334 Male October 428.
# ℹ 307,896 more rows
We can combine filter commands to look for Males born in September and where the PV1MATH figure is greater than 750. We can list multiple criteria in the filter by separating the criteria with commas, using commas mean that all of these criteria need to be TRUE for a row to be returned. A comma in a filter is the equivalent of an AND, :
# A tibble: 52 × 5
CNT ESCS ST004D01T ST003D02T PV1MATH
<fct> <dbl> <fct> <fct> <dbl>
1 Australia 0.994 Male September 770.
2 Australia 0.837 Male September 758.
3 Australia 1.22 Male September 807.
4 Canada 1.06 Male September 752.
5 Canada 1.04 Male September 788.
6 Canada 1.21 Male September 781.
7 Canada 0.804 Male September 750.
8 Chinese Taipei 0.842 Male September 753.
9 Chinese Taipei 0.488 Male September 794.
10 Chinese Taipei 1.23 Male September 794.
# ℹ 42 more rows
Remember to include the == sign when looking to filter on equality; additionally, you can use != (not equals), >=, <=, >, <.
Remember matching is case sensitive, “september” != “September”
Rather than just looking at September born students, we want to find all the students born in the Autumn term. But if we add a couple more criteria on ST003D02T nothing is returned! Why?
The reason is R is looking for individual students born in September AND October AND November AND December. As a student can only have one birth month there are no students that meet this criteria. We need to use OR :
To create an OR in a filter we use the bar | character, the below looks for all students who are “Male” AND were born in “September” OR “October” OR “November” OR “December”, AND have a PV1MATH > 750.
# A tibble: 194 × 5
CNT ESCS ST004D01T ST003D02T PV1MATH
<fct> <dbl> <fct> <fct> <dbl>
1 Australia 0.994 Male September 770.
2 Australia 1.03 Male October 808.
3 Australia 1.30 Male October 762.
4 Australia 1.21 Male November 764.
5 Australia 0.837 Male September 758.
6 Australia 1.22 Male September 807.
7 Belgium 0.746 Male December 770.
8 Belgium 0.939 Male December 779.
9 Belgium 1.25 Male November 832.
10 Belgium 1.11 Male October 776.
# ℹ 184 more rows
It’s neater, maybe, to use the %in% command, which checks to see if the value in a column is present in a vector, this can mimic the OR/| command:
# show the actual unique values in a field# this might be a slightly smaller set of valuesunique(PISA_2022$ST003D02T)
[1] May February August July January December September
[8] October April March June November <NA>
16 Levels: January February March April May June July August ... No Response
# You might also want to read the label of a fieldattr(PISA_2022$ST003D02T, "label")
[1] "Student (Standardized) Birth - Month"
2.1 Questions
Spot the two errors with the following select statement
PISA_2022 %>%select(CNT, ESCS) %>%#1 you have ESCS in the filter, it needs to be in the select as wellfilter(CNT %in%c("France", "Belgium"), #2 Belgium needs a capital letter#3 the %in% command needs percentages#4 you need a comma (or &) at the end of the line ESCS <0)
Use filter to find all the students with Three or more cars in their home ST251Q01JA. How does this compare to those with no None cars?
Adjust your code in Q2. to find the number of students with Three or more cars in their home ST251Q01JA in Italy, how does this compare with Spain?
answer
PISA_2022 %>%select(CNT, ST251Q01JA) %>%filter(ST251Q01JA =="Three or more", CNT =="Italy")PISA_2022 %>%select(CNT, ST251Q01JA) %>%filter(ST251Q01JA =="Three or more", CNT =="Spain")# EXTENSION:# Note we would need to know the percentage of students # in each country with that number of cars to make a proper# comparison. Spain might have more students taking the PISA# test than Italy, or vice-versaPISA_2022 %>%select(CNT, ST251Q01JA) %>%filter(CNT %in%c("Italy", "Spain")) %>%group_by(CNT) %>%mutate(total_stus =n()) %>%filter(ST251Q01JA =="Three or more") %>%summarise(three_more =n(),per_three_more = three_more/unique(total_stus))
Write a filter to create a table for the number of Female students with reading PV1READ scores lower than 400 in the United Kingdom, store the result as read_low_female, repeat but for Male students and store as read_low_male. Use nrow() to work out if there are more males or females with a low reading score in the UK
answer
read_low_female <- PISA_2022 %>%filter(CNT =="United Kingdom", PV1READ <400, ST004D01T =="Female")read_low_male <- PISA_2022 %>%filter(CNT =="United Kingdom", PV1READ <400, ST004D01T =="Male")nrow(read_low_female)nrow(read_low_male)# You could also pipe the whole dataframe into nrow()PISA_2022 %>%filter(CNT =="United Kingdom", PV1READ <400, ST004D01T =="Female") %>%nrow()
How many students in the United Kingdom had no television ST254Q01JAOR no connection to the internet ST250Q05JA. HINT: use levels(PISA_2022$ST254Q01JA) to look at the levels available for each column.
Very often when dealing with datasets such as PISA or TIMSS, the column names can be very confusing without a reference key, e.g. ST004D01T, OCOD3 and ST261Q04JA. To rename columns in the tidyverse we use the rename(<new_name> = <old_name>) command. For example, if you wanted to rename the rather confusingly named student column for gender, also known as ST004D01T, and the column for having a having enough digital resources in school, also known as IC172Q01JA, you could use:
CNT gender
Spain : 30800 Female :305759
United Arab Emirates: 24600 Male :307906
Canada : 23073 Valid Skip : 0
Kazakhstan : 19769 Not Applicable: 0
Indonesia : 13439 Invalid : 0
Australia : 13437 No Response : 0
(Other) :488626 NA's : 79
dig_resources
Agree :183233
Disagree : 70595
Strongly agree : 49276
Strongly disagree: 35223
Valid Skip : 0
(Other) : 0
NA's :275417
If you want to change the name of the column so that it stays when you need to perform another calculation, remember to assign the renamed dataframe back to the original dataframe. But be warned, you’ll need to reload the full dataset to restore the original names:
So far we have looked at ways to return rows that meet certain criteria. Using group_by and summarise we can start to analyse data for different groups of students. For example, let’s look at the number of students who don’t have internet connections at home besides a mobile phone ST250Q05JA:
Line 2 makes groups within PISA_2022 using the unique values of ST250Q05JA
3
Line 3, these groups are then passed to summarise, which creates a new column called student_n and stores the number of rows in each ST250Q05JA group using the n() command. summarise only returns the columns it creates, or are in the group_by, everything else is discarded.
# A tibble: 3 × 2
ST250Q05JA student_n
<fct> <int>
1 Yes 525842
2 No 56968
3 <NA> 30934
What we might want to do is look at this data from a country by country perspective, by adding another field to the group_by() command, we then group by the unique combination of countries CNT and internet access ST250Q05JA, e.g. Albania + Yes; Albania + No; Albania + NA; United Arab Emirates + Yes; etc
# A tibble: 238 × 3
# Groups: CNT [80]
CNT ST250Q05JA student_n
<fct> <fct> <int>
1 Albania Yes 4693
2 Albania No 535
3 Albania <NA> 901
4 United Arab Emirates Yes 22206
5 United Arab Emirates No 1116
6 United Arab Emirates <NA> 1278
7 Argentina Yes 10484
8 Argentina No 938
9 Argentina <NA> 689
10 Australia Yes 12808
# ℹ 228 more rows
summarise can also be used to work out statistics by grouping. For example, if you wanted to find out the max, mean and min science grade PV1SCIE by country CNT, you could do the following:
# A tibble: 80 × 4
CNT sci_max sci_mean sci_min
<fct> <dbl> <dbl> <dbl>
1 Albania 724. 376. 74.6
2 United Arab Emirates 842. 436. 0
3 Argentina 751. 415. 128.
4 Australia 875. 508. 108.
5 Austria 804. 494. 170.
6 Belgium 792. 495. 169.
7 Bulgaria 728. 422. 130.
8 Brazil 784. 406. 106.
9 Brunei Darussalam 762. 445. 135.
10 Canada 893. 499. 107.
# ℹ 70 more rows
Important
group_by() can have unintended consequences in your code if you are saving your pipes to new dataframes. To be safe your can clear any grouping by adding: my_data %>% ungroup()
4.1 Questions
Spot the three errors with the following summarise statement
PISA_2022 %>%group(CNT)summarise(num_stus = n)
answer
PISA_2022 %>%group_by(CNT) %>%#1 group_by NOT group #2 missing pipe %>%summarise(num_stus =n()) #3 = n() not = n
Write a group_by and summarise statement to work out the mean and median cultural capital value ESCS for each student by country CNT
Using summarise work out, Yes or No, by country CNT and gender ST004D01T, whether students “Agree/disagree: There are enough [digital resources] for every student at my school” IC172Q01JA. Filter out any NA values on IC172Q01JA:
Sometimes you will want to adjust the values stored in a field, e.g. converting a distance in miles into kilometres; or compute a new fields based on other fields, e.g. working out a total grade given the parts of a test. To do this we can use mutate. Unlike summarise, mutate retains all the other columns either adding a new column or changing an existing one
mutate(<field> = <field_calculation>)
The PISA_2022 dataset has results for maths PV1MATH, science PV1SCIE and reading PV1READ. We could combine these to create an overall PISA_grade, and PISA_mean:
mutate creates a new field called PV1_total made up by adding together the columns for maths, science and reading. Each column acts like a vector and adding them together is the equivalent of adding each students individual grades together, row by row. See §Vectors for more details on vector addition.
2
inside the same mutate statement, we take the PV1_total calculated on line two and divide it by 3, to give us a mean value, this is then assigned to a new column, PV1_mean.
3
this line selects only the fields that we are interested in, dropping the others
We can use mutate to create subsets of data in fields. For example, if we wanted to see how many students in each country were high performing readers, specified by getting a reading grade of greater than 550, we could do the following:
this line creates a new column called PV1READ_high for every students, storing a boolean value, TRUE or FALSE depending on whether their reading grates PV1READ > 550.
3
a grouping is made on the country and the new field PV1READ_high, e.g. Albania + TRUE, Albania + FALSE, etc.
4
using summarise we can find the number of student rows in each grouping using n(), and drop all the other fields
# A tibble: 159 × 3
# Groups: CNT [80]
CNT PV1READ_high n
<fct> <lgl> <int>
1 Albania FALSE 6055
2 Albania TRUE 74
3 United Arab Emirates FALSE 20711
4 United Arab Emirates TRUE 3889
5 Argentina FALSE 11197
6 Argentina TRUE 914
7 Australia FALSE 9005
8 Australia TRUE 4432
9 Austria FALSE 4461
10 Austria TRUE 1690
# ℹ 149 more rows
Comparisons can also be made between different columns, if we wanted to find out the percentage of Males and Females that got a better grade in their maths test PV1MATH than in their reading test PV1READ:
mutate creates a new field called maths_better made up by comparing the PV1MATH grade with PV1READ and creating a boolean/logical vector for the column.
3
selects a subset of the columns
4
filters out any students that don’t have gender data ST004D01Tand where the calculation on line 2 failed, i.e. PV1MATH or PV1READ was NA
5
group on the gender of the student
6
using the group on line 5, use mutate to calculate the total number of Males and Females by looking for the number of rows in each group n(), store this as students_n
7
re-group the data on gender ST004D01T and whether the student is better at maths than reading maths_better
8
count the number of students, n in each group specified by line 7.
9
create a percentage figure for the number of students in each grouping given by line 7. Use the n value from line 8 and the students_n value from line 6. NOTE: we need to use unique(students_n) to return just one value for each grouping rather than a value for every row of the line 7 grouping
# A tibble: 4 × 4
# Groups: ST004D01T [2]
ST004D01T maths_better n per
<fct> <lgl> <int> <dbl>
1 Female FALSE 180350 0.590
2 Female TRUE 125409 0.410
3 Male FALSE 118341 0.384
4 Male TRUE 189565 0.616
For more information on how to mutate fields using ifelse, see Section 8.1
6 arrange
The results returned by pipes can be huge, so it’s a good idea to store them in objects and explore them in the Environment window where you can sort and search within the output. There might also be times when you want to order/arrange the outputs in a particular way. We can do this quite easily in the tidyverse by using the arrange(<column_name>, <column_name>) function.
# A tibble: 613,744 × 4
CNT LANGN ST004D01T PV1MATH
<fct> <fct> <fct> <dbl>
1 Singapore Invalid Male 943.
2 Chinese Taipei Mandarin Male 917.
3 Singapore Invalid Male 897.
4 Korea Korean Male 893.
5 Korea Korean Female 889.
6 Chinese Taipei Mandarin Male 887.
7 Singapore Invalid Male 883.
8 Hong Kong (China) Cantonese Male 880.
9 Korea Korean Male 868.
10 Singapore Invalid Male 867.
# ℹ 613,734 more rows
7 Bring everyting together
We know that the evidence strongly indicates that repeating a year is not good for student progress, but how do countries around the world differ in terms of the percentage of their students who repeat a year?
uses the PISA_2022 dataframe, note that this line includes <- to store the result opf the piping into a new object called data_repeat
2
filter out any NA values in the REPEAT field
3
group on the country of student CNT
4
create a new column total for total number of rows n() in each country CNT grouping
5
select on the CNT, REPEAT and total columns
6
regroup the data on country CNT and whether a student has repeated a year REPEAT, i.e. Albania+Did not repeat a grade; Albania+Repeated a grade; etc.
7
using the above grouping, count the number of rows in each group n() and assign this to student_n
8
for each grouping keep the total number of students in each country, as calculated on line 4. Note: unique(total) is needed here to return a single value of total, rather than a value for each student in each country
9
using student_n from line 7 and the number of students per country total, from line 4, create a percentage per for each grouping
10
as we have percentages for both Repeated at lease once and Never repeated, we only need to display one of these.
11
finally, we sort the data on the per/percentage column, to show the countries with the highest level of repeating a grade. This data is self-recorded by students, so might not be totally reliable!
12
save the data to your own folder as a csv
# A tibble: 78 × 5
# Groups: CNT [78]
CNT REPEAT student_n total per
<fct> <fct> <int> <int> <dbl>
1 Morocco Repeated at lease once 3156 6796 0.464
2 Colombia Repeated at lease once 2783 7401 0.376
3 Cambodia Repeated at lease once 1697 5156 0.329
4 Guatemala Repeated at lease once 1382 5110 0.270
5 Panama Repeated at lease once 1050 4080 0.257
6 Philippines Repeated at lease once 1793 7071 0.254
7 Dominican Republic Repeated at lease once 1603 6566 0.244
8 Belgium Repeated at lease once 1941 8055 0.241
9 Jamaica Repeated at lease once 851 3600 0.236
10 Netherlands Repeated at lease once 1118 4858 0.230
# ℹ 68 more rows
8 Advanced topics
8.1 Recoding data (ifelse)
Often we want to plot values in groupings that don’t yet exist, for example might want to give all schools over a certain size a different colour from others schools, or flag up students who have a different home language to the language that is being taught in school. To do this we need to look at how we can recode values. A common way to recode values is through an ifelse statement:
ifelse allows us to recode the data. In the example below, we are going to add a new column to the PISA_2022 dataset (using mutate) noting whether a student got a higher grade in their Maths PV1MATH or Reading PV1READ tests. ifPV1MATH is bigger then PV1READ, the maths_better is TRUE, elsemaths_better is FALSE, or in dplyr format:
We now take this new dataset maths_data and look at whether the difference between relative performance in maths and reading is the same for girls and boys:
# A tibble: 4 × 3
# Groups: ST004D01T [2]
ST004D01T maths_better n
<fct> <lgl> <int>
1 Female FALSE 180350
2 Female TRUE 125409
3 Male FALSE 118341
4 Male TRUE 189565
Adjust the code above to work out the percentages of Males and Females ST004D01T in each group. Check to see if the pattern also exists between science PV1SCIE and reading PV1READ:
ifelse statements can get a little complicated when using factors (see: Section 8.2). Take this example. Let’s flag students who have a different home language LANGN to the language that is being used in the PISA assessment tool LANGTEST_QQQ. We make an assumption here that the assessment tool will be the language used at school, so these students will be learning in a different language to their mother tongue. ifLANGN equals LANGTEST_QQQ, the lang_diff is FALSE, elselang_diff is TRUE, this raises an error:
Error in `mutate()`:
ℹ In argument: `lang_diff = ifelse(LANGN == LANGTEST_QQQ, FALSE, TRUE)`.
Caused by error in `Ops.factor()`:
! level sets of factors are different
The levels in each field are different, i.e. the range of home languages is larger than the range of test languages. To fix this, all we need to do is change the datatype of the factors LANGN and LANGTEST_QQQ to characters using as.character(<field>). This will then allow the comparison of the text stored in each row:
We can now look at this dataset to get an idea of which countries have the largest percentage of students learning in a language other than their mother tongue:
# A tibble: 80 × 4
# Groups: CNT [80]
CNT lang_diff n percentage
<fct> <lgl> <int> <dbl>
1 Albania TRUE 720 11.7
2 United Arab Emirates TRUE 13933 56.6
3 Argentina TRUE 788 6.51
4 Australia TRUE 1827 13.6
5 Austria TRUE 1455 23.7
6 Belgium TRUE 2445 29.5
7 Bulgaria TRUE 961 15.7
8 Brazil TRUE 559 5.18
9 Brunei Darussalam TRUE 4831 86.6
10 Canada TRUE 5971 25.9
# ℹ 70 more rows
This looks like a promising dataset, but there are some strange results:
lang_data_diff %>%filter(percentage >92)
# A tibble: 8 × 4
# Groups: CNT [8]
CNT lang_diff n percentage
<fct> <lgl> <int> <dbl>
1 Hong Kong (China) TRUE 5626 95.2
2 Macao (China) TRUE 4384 100
3 Montenegro TRUE 5596 96.6
4 Norway TRUE 6611 100
5 Philippines TRUE 6693 93.0
6 Ukrainian regions (18 of 27) TRUE 3747 96.7
7 Singapore TRUE 6567 99.4
8 Chinese Taipei TRUE 5821 99.4
Exploring data for Ukraine, we can see that a different spelling has been used in each field, Ukrainian and Ukranain, an incorrect spelling.
lang_data %>%filter(CNT =="Ukrainian regions (18 of 27)")
# A tibble: 3,876 × 4
CNT lang_diff LANGTEST_QQQ LANGN
<fct> <lgl> <fct> <fct>
1 Ukrainian regions (18 of 27) TRUE Ukranian Ukrainian
2 Ukrainian regions (18 of 27) TRUE Ukranian Ukrainian
3 Ukrainian regions (18 of 27) TRUE Ukranian Russian
4 Ukrainian regions (18 of 27) TRUE Ukranian Ukrainian
5 Ukrainian regions (18 of 27) TRUE Ukranian Ukrainian
6 Ukrainian regions (18 of 27) TRUE Ukranian Ukrainian
7 Ukrainian regions (18 of 27) TRUE Ukranian Russian
8 Ukrainian regions (18 of 27) TRUE Ukranian Ukrainian
9 Ukrainian regions (18 of 27) TRUE Ukranian Ukrainian
10 Ukrainian regions (18 of 27) TRUE Ukranian Ukrainian
# ℹ 3,866 more rows
ifelse can help here too. If we pick the spelling we want to stick to, we can recode fields to match:
lang_data %>%mutate(LANGTEST_QQQ =ifelse(as.character(LANGTEST_QQQ) =="Ukranian","Ukrainian",as.character(LANGTEST_QQQ))) %>%mutate(lang_diff =ifelse(as.character(LANGN) ==as.character(LANGTEST_QQQ),FALSE, TRUE)) %>%filter(CNT =="Ukrainian regions (18 of 27)")
# A tibble: 3,876 × 4
CNT lang_diff LANGTEST_QQQ LANGN
<fct> <lgl> <chr> <fct>
1 Ukrainian regions (18 of 27) FALSE Ukrainian Ukrainian
2 Ukrainian regions (18 of 27) FALSE Ukrainian Ukrainian
3 Ukrainian regions (18 of 27) TRUE Ukrainian Russian
4 Ukrainian regions (18 of 27) FALSE Ukrainian Ukrainian
5 Ukrainian regions (18 of 27) FALSE Ukrainian Ukrainian
6 Ukrainian regions (18 of 27) FALSE Ukrainian Ukrainian
7 Ukrainian regions (18 of 27) TRUE Ukrainian Russian
8 Ukrainian regions (18 of 27) FALSE Ukrainian Ukrainian
9 Ukrainian regions (18 of 27) FALSE Ukrainian Ukrainian
10 Ukrainian regions (18 of 27) FALSE Ukrainian Ukrainian
# ℹ 3,866 more rows
Unfortunately, if you explore this dataset a little further, the language fields don’t conform well with each other and a lot more work with ifelse will be needed before you could put together any full analysis around students who speak different languages at home and at school.
Tip
It’s possible to nest our ifelse statements, by writing another ifelse where you would have the <value_if_false>, for example we might want to give describe the type of school in England:
# TODO: use PISA for thisplot_data <- schools %>%mutate(sch_type =ifelse(EstablishmentGroup =="Special schools", "Special",ifelse(EstablishmentGroup =="Independent schools", "Independent",ifelse(AdmissionsPolicy=="Selective", "Grammar", "Comprehensive"))))
8.2 Factors and statistical data types
The types of variable will heavily influence what statistical analysis you can perform, e.g. you’ll need numeric values for a t-test. R is there to help by assigning datatypes to each field. We have different sorts of data that can be stored:
Categorical - data that can be divided into groups or categories
Nominal - categorical data where the order isn’t important, e.g. gender, or colours
Ordinal - categorical data that may have order or ranking, e.g. exam grades (A, B, C, D) or lickert scales (strongly agree, agree, disgaree, strongly disagree)
Numeric - data that consists of numbers
Continuous - numeric data that can take any value within a given range, e.g. height (178cm, 134.54cm)
Discrete - numeric data that can take only certain values within a range, e.g. number of children in a family (0,1,2,3,4,5)
But here we are going to look at how R handles factors. Factors have two parts, levels and codes. levels are what you see when you view a table column, codes are an underlying order to the data. Factors allow you to store data that has a known set of values that you might want to display in an order other than alphabetical. For example, if we look at the month field ST003D02T using the levels(<field>) command:
We can see that the months of the year are there along with other possible levels. With this particular column there are levels for missing or wrong responses (“Valid Skip”, “Not Applicable” “Invalid”, “No Response”), though PISA rarely uses them. You are more likely to find that missing/wrong data items are coded as NA, as you can see below:
PISA_2022 %>%count(ST003D02T)
# A tibble: 13 × 2
ST003D02T n
<fct> <int>
1 January 48760
2 February 43030
3 March 48671
4 April 47014
5 May 49235
6 June 48890
7 July 51262
8 August 51681
9 September 51755
10 October 51703
11 November 48179
12 December 48156
13 <NA> 25408
Codes are the underlying numbers/order for each level, in this case 1 = January, 2 = February, etc. R stores factors as codes, then uses the levels to display the data. You can see the codes by using the as.numeric command on a factor:
How can this be useful? Firstly it’s more efficient for R to store data this way, numbers (codes) are smaller and easier to sort/search than text (levels). But it also helps when we come to presenting data. A good example is how plots are made, they will use the codes to give an order to the display of columns, in the plot below, February (2) comes before August (8), even though there were more students born in August and A is before F in the alphabet:
[1] September October August July May June January
[8] March November December April February <NA>
attr(,"label")
[1] Student (Standardized) Birth - Month
16 Levels: January February March April May June July August ... No Response
# get the levels in order and pull/create a vector of themmy_levels <- grph_data %>%arrange(desc(n)) %>%pull(ST003D02T)# reassign the re-ordered levels to the dataframe columngrph_data$ST003D02T <-factor(grph_data$ST003D02T, levels=my_levels)ggplot(data=grph_data, aes(x=ST003D02T, y=n)) +geom_bar(stat ="identity")
Spot the five errors with the following code. Can you make it work? What does it do?
# Work out when science scores are better than mathsPISA_2022_scimath < PISA_2022 %>%rename(gender = ST004D01T) %>%mutate(sci better = PV1SCIE - PV1MATH) %>%filter(is.na(scibetter) %>%group_by(CNT gender) %>%summarise(students = n,sci_win =sum(scibetter >=0),per_scibetter =100*(sci_win/students))
answer
# Work out when more time spent in language lessons than maths lessonsPISA_2022_scimath <- PISA_2022 %>%#1 make sure you have the assignment arrow <-rename(gender = ST004D01T) %>%mutate(sci_better = PV1MATH - PV1SCIE) %>%#2 _ not space in name of fieldfilter(!is.na(sci_better)) %>%#3 this needs to be !is.na, otherwise it'll return nothinggroup_by(CNT, gender) %>%#4 missing commasummarise(students =n(), #5 missing brackets on the n() commandsci_win =sum(sci_better >=0),per_sci_win =100*(sci_win/students))
By country and gender work out the mean, median and standard deviations of STUBMI, order by the descending mean.
To further check your understanding of this section you will be attempting to analyse the 2022 teacher dataset. This dataset includes records for 68054 teachers from 18 countries, including 544 columns, covering attitudinal, demographic and workplace data. You can find the dataset here in the .parquet format.
example loading code
# download the file then# Work out when more time spent in language lessons than maths lessonsPISA_2022_teacher <-read_parquet("C:/Users/Peter/Downloads/PISA_2012_teacher.parquet")
Work out how many teachers are in the dataset for Portugal
For each country CNTRYID by gender TC001Q01NA, what is the mean time that a teacher has been in the teaching profession TC007Q02NA? Include the number of teachers in each group. Order this to show the country with the longest serving workforce:
For each country CNT find out which teachers report that they ‘Help students think critically’ TC199Q07HA. Hin: you’ll need to look at the levels of this question to find the correct filter:
answer
crit_thinking <- PISA_2022_teacher %>%rename(crit_think = TC199Q07HA) %>%group_by(CNT) %>%mutate(teachers=n()) %>%group_by(CNT, crit_think) %>%summarise(n =n(),per =n()/unique(teachers)) %>%arrange(desc(per)) %>%filter(crit_think =="A lot")# interestingly the highest performing countries also # have some of the lowest scores in helping children # think critically. To plot this:left_join(crit_thinking, PISA_2022 %>%group_by(CNT) %>%summarise(maths =mean(PV1MATH))) %>%ggplot(aes(x=per, y=maths)) +geom_point() +geom_smooth()
Explore the data on use of technology in the classroom TC169____