Piping and dplyr

Piping allows us to break down complex tasks into manageable chunks that can be written and tested one after another. There are several powerful commands in the tidyverse as part of the dplyr package that can help us group, filter, select, mutate and summarise datasets. With this small set of commands we can use piping to convert massive datasets into simple and useful results.

Using the pipe %>% command, we can feed the results from one command into the next command making for reusable and easy to read code.

how piping works
Note

The pipe command we are using %>% is from the magrittr package which is installed alongside the tidyverse. Recently R introduced another pipe |> which offers very similar functionality and tutorials online might use either. The examples below use the %>% pipe.

Let’s look at an example of using the pipe on the PISA_2022 table to calculate the best performing OECD countries for maths PV1MATH by gender ST004D01T:

1PISA_2022 %>%
2  filter(OECD == "Yes") %>%
3  group_by(CNT, ST004D01T) %>%
4  summarise(mean_maths = mean(PV1MATH, na.rm=TRUE),
            sd_maths = sd(PV1MATH, na.rm=TRUE),
            students = n()) %>%
5  filter(!is.na(ST004D01T)) %>%
6  arrange(desc(mean_maths))
1
line 1 passes the whole PISA_2022 dataset and pipes it into the next line using %>%
2
line 2 filters out any results that are from non-OECD countries by finding all the rows where OECD equals == “Yes”, this is then piped to the next line
3
line 3 groups the data by country CNT and by student gender ST004D01T, this is then piped to the next line
4
line 4-6 the summarise command performs a calculation on the country and gender groupings returning three new columns, each command is described by code on a new line and separated by a comma: the mean value for maths mean_maths, the standard deviation sd_maths, and a column telling us how many students were in each grouping using the n() which returns the number of rows in a group. These new columns and the grouping columns are then piped to the next line, all other columns are dropped
5
line 7 filters out any gender ST004D01T that is NA. First is finds all the students that have NA as their gender by using is.na(ST004D01T), then it NOTs/flips the result using the exclamation mark !, giving those students who don’t have their gender set to NA. The filtered data is then piped to the next line
6
line 8, finally we arrange/sort the results in descending order by the mean_maths column. The default for arrange is ascending order, leave out the desc( ) for the numbers to be ordered in the opposite way.
# A tibble: 74 × 5
# Groups:   CNT [37]
   CNT            ST004D01T mean_maths sd_maths students
   <fct>          <fct>          <dbl>    <dbl>    <int>
 1 Japan          Male            540.     99.0     2856
 2 Korea          Male            534.    112.      3325
 3 Japan          Female          531.     88.1     2904
 4 Korea          Female          528.     99.1     3129
 5 Estonia        Male            515.     87.4     3272
 6 Switzerland    Male            511.     99.3     3540
 7 Estonia        Female          510.     81.5     3120
 8 Czech Republic Male            502.    100.      4232
 9 Switzerland    Female          501.     90.9     3289
10 Austria        Male            500.     94.5     3110
# ℹ 64 more rows

Across the top few countries, Males get a slightly better maths score PV1MATH than Females, other scores are available, please read §What are Plausible Values? to find out more about the limitations of using a “PV” value.

Note

we met the assignment command earlier <-. Within the tidyverse commands we use the equals sign instead =.

The commands we have just used come from a package within the tidyverse called dplyr, let’s take a look at what they do:

command purpose example
select reduce the dataframe to the fields that you specify select(<field>, <field>, <field>)
filter get rid of rows that don’t meet one or more criteria filter(<field> <comparison>)
group group fields together to perform calculations group_by(<field>, <field>))
mutate add new fields or change values in current fields mutate(<new_field> = <field> / 2)
summarise create summary data optionally using a grouping command summarise(<new_field> = max(<field>))
arrange order the results by one or more fields arrange(desc(<field>))
Note

If you want to explore more of the functions of dplyr, take a look at the helpsheet

Adjust the code above to find out the lowest performing countries for reading PV1READ by gender that are not in the OECD

answer
PISA_2022 %>% 
  filter(OECD == "No") %>%
  group_by(CNT, ST004D01T) %>% 
  summarise(mean_read = mean(PV1READ, na.rm=TRUE),
            sd_read = sd(PV1READ, na.rm=TRUE),
            students = n()) %>%
  filter(!is.na(ST004D01T)) %>%
  arrange(mean_read)

1 select

The PISA_2022 dataset has far too many fields, to reduce the number of fields to focus on just a few of them we can use select

PISA_2022 %>% 
  select(CNT,ESCS, ST004D01T, ST003D02T)
# A tibble: 613,744 × 4
   CNT       ESCS ST004D01T ST003D02T
   <fct>    <dbl> <fct>     <fct>    
 1 Albania  1.11  Female    May      
 2 Albania -3.05  Male      February 
 3 Albania -0.187 Male      August   
 4 Albania -3.22  Female    July     
 5 Albania -1.05  Female    January  
 6 Albania  1.09  Male      May      
 7 Albania -0.762 Male      May      
 8 Albania -1.02  Female    December 
 9 Albania -1.17  Female    August   
10 Albania  0.286 Female    September
# ℹ 613,734 more rows

You might also be in the situation where you want to select everything but one or two fields, you can do this with the negative signal -, the below code returns all the fields except CNT and OECD:

PISA_2022 %>% select(-CNT, -OECD)
# A tibble: 613,744 × 81
   CNTSCHID CNTSTUID REGION  LANGTEST_QQQ ST003D02T ST003D03T ST004D01T
      <dbl>    <dbl> <fct>   <fct>        <fct>     <fct>     <fct>    
 1   800282   800001 Albania Albanian     May       2006      Female   
 2   800115   800002 Albania Albanian     February  2006      Male     
 3   800242   800003 Albania Albanian     August    2006      Male     
 4   800245   800005 Albania Albanian     July      2006      Female   
 5   800285   800006 Albania Albanian     January   2006      Female   
 6   800172   800007 Albania Albanian     May       2006      Male     
 7   800082   800008 Albania Albanian     May       2006      Male     
 8   800274   800009 Albania Albanian     December  2006      Female   
 9   800057   800010 Albania Albanian     August    2006      Female   
10   800132   800012 Albania Albanian     September 2006      Female   
# ℹ 613,734 more rows
# ℹ 74 more variables: ST250Q01JA <fct>, ST250Q02JA <fct>, ST250Q03JA <fct>,
#   ST250Q05JA <fct>, ST251Q01JA <fct>, ST251Q06JA <fct>, ST251Q07JA <fct>,
#   ST253Q01JA <fct>, ST254Q01JA <fct>, ST254Q02JA <fct>, ST254Q03JA <fct>,
#   ST254Q04JA <fct>, ST254Q05JA <fct>, ST254Q06JA <fct>, ST255Q01JA <fct>,
#   ST256Q02JA <fct>, ST005Q01JA <fct>, ST007Q01JA <fct>, ST019AQ01T <fct>,
#   ST019BQ01T <fct>, ST019CQ01T <fct>, ST125Q01NA <fct>, ST261Q01JA <fct>, …

You might find that you have a vector of column names that you want to select, to do this, we can use the any_of command:

my_fields <- c("CNT", "CNTSCHID", "ST004D01T")
PISA_2022 %>% select(any_of(my_fields))
# A tibble: 613,744 × 3
   CNT     CNTSCHID ST004D01T
   <fct>      <dbl> <fct>    
 1 Albania   800282 Female   
 2 Albania   800115 Male     
 3 Albania   800242 Male     
 4 Albania   800245 Female   
 5 Albania   800285 Female   
 6 Albania   800172 Male     
 7 Albania   800082 Male     
 8 Albania   800274 Female   
 9 Albania   800057 Female   
10 Albania   800132 Female   
# ℹ 613,734 more rows

With hundreds of fields available, you might want to focus on fields whose names match a certain pattern, to do this you can use starts_with, ends_with, contains:

# country of birth of student, and father and mother are recorded in ST019___
PISA_2022 %>% select(starts_with("ST019"))
# A tibble: 613,744 × 3
   ST019AQ01T      ST019BQ01T      ST019CQ01T     
   <fct>           <fct>           <fct>          
 1 Country of test Country of test Country of test
 2 Country of test Country of test Country of test
 3 Other country   Country of test Country of test
 4 Country of test Country of test Country of test
 5 Country of test Country of test Country of test
 6 Country of test Country of test Country of test
 7 Country of test Country of test Country of test
 8 Country of test Country of test Country of test
 9 Country of test Country of test Country of test
10 Country of test Country of test Country of test
# ℹ 613,734 more rows

When you come to building your statistical models you often need to use numeric data, you can find the columns that have only numbers in them by the following. Be warned though, sometimes there are numeric fields which have a few words in them, so R treats them as characters. Use the PISA codebook to help work out where those numbers are.

PISA_2022 %>% select(where(is.numeric)) %>% names()
 [1] "CNTSCHID"   "CNTSTUID"   "ST016Q01NA" "WB151Q01HA" "WB152Q01HA"
 [6] "WB156Q01HA" "BELONG"     "BULLIED"    "DISCLIM"    "HOMEPOS"   
[11] "ESCS"       "ATTCONFM"   "ICTEFFIC"   "STUBMI"     "PQMIMP"    
[16] "PARINVOL"   "PV1MATH"    "PV1READ"    "PV1SCIE"   
Tip

If you do want to change the type of a column to numeric you are going to need to:

  • filter out any offending rows, and
  • mutate the column to be numeric: col = as.numeric(col)

1.1 Questions

  1. Spot the three errors with the following select statement
PISA_2022 
  select(CNT BELONG) %>%
answer
PISA_2022 %>%  #1 missing pipe
  select(CNT, BELONG) #2 no comma between column names, #3 stray pipe on end
  1. Write a select statement to display the month ST003D02T and year of birth ST003D03T and the gender ST004D01T of each student.
answer
PISA_2022 %>% 
  select(ST003D02T, ST003D03T, ST004D01T)
  1. Write a select statement to show all the fields that are to do with well being and health, e.g. WB150Q01HA “How is your health?”
answer
PISA_2022 %>% 
  select(starts_with("WB15"))
  1. [EXTENSION] Adjust your answer to Q3 so that you select the gender ST004D01T and the ID CNTSTUID of each student in addition to the ST254____ fields looking at digital devices in the home:
answer
PISA_2022 %>% 
  select(CNTSTUID, ST004D01T, starts_with("ST254")) %>% na.omit()

2 filter

Not only does the PISA_2022 dataset have a huge number of columns, it has hundred of thousands of rows. We want to filter this down to the students that we are interested in, i.e. filter out data that isn’t useful for our analysis. If we only wanted the results that were Male, we could do the following:

PISA_2022 %>% 
  select(CNT, ESCS, ST004D01T, ST003D02T, PV1MATH) %>%
  filter(ST004D01T == "Male")
# A tibble: 307,906 × 5
   CNT       ESCS ST004D01T ST003D02T PV1MATH
   <fct>    <dbl> <fct>     <fct>       <dbl>
 1 Albania -3.05  Male      February     308.
 2 Albania -0.187 Male      August       268.
 3 Albania  1.09  Male      May          534.
 4 Albania -0.762 Male      May          382.
 5 Albania -1.98  Male      October      425.
 6 Albania  0.063 Male      April        463.
 7 Albania -0.170 Male      May          236.
 8 Albania -2.58  Male      August       327.
 9 Albania -1.09  Male      March        326.
10 Albania -0.334 Male      October      428.
# ℹ 307,896 more rows

We can combine filter commands to look for Males born in September and where the PV1MATH figure is greater than 750. We can list multiple criteria in the filter by separating the criteria with commas, using commas mean that all of these criteria need to be TRUE for a row to be returned. A comma in a filter is the equivalent of an AND, :

PISA_2022 %>% 
  select(CNT, ESCS, ST004D01T, ST003D02T, PV1MATH) %>%
  filter(ST004D01T == "Male",
         ST003D02T == "September",
         PV1MATH > 750)
# A tibble: 52 × 5
   CNT             ESCS ST004D01T ST003D02T PV1MATH
   <fct>          <dbl> <fct>     <fct>       <dbl>
 1 Australia      0.994 Male      September    770.
 2 Australia      0.837 Male      September    758.
 3 Australia      1.22  Male      September    807.
 4 Canada         1.06  Male      September    752.
 5 Canada         1.04  Male      September    788.
 6 Canada         1.21  Male      September    781.
 7 Canada         0.804 Male      September    750.
 8 Chinese Taipei 0.842 Male      September    753.
 9 Chinese Taipei 0.488 Male      September    794.
10 Chinese Taipei 1.23  Male      September    794.
# ℹ 42 more rows

You can also write it as an ampersand &

PISA_2022 %>% 
  select(CNT, ESCS, ST004D01T, ST003D02T, PV1MATH) %>%
  filter(ST004D01T == "Male" &
         ST003D02T == "September" &
         PV1MATH > 750)
Important

Remember to include the == sign when looking to filter on equality; additionally, you can use != (not equals), >=, <=, >, <.

Remember matching is case sensitive, “september” != “September”

Rather than just looking at September born students, we want to find all the students born in the Autumn term. But if we add a couple more criteria on ST003D02T nothing is returned! Why?

PISA_2022 %>% 
  select(CNT, ESCS, ST004D01T, ST003D02T, PV1MATH) %>%
  filter(ST004D01T == "Male",
         ST003D02T == "September",
         ST003D02T == "October",
         ST003D02T == "November",
         ST003D02T == "December",
         PV1MATH > 750)
# A tibble: 0 × 5
# ℹ 5 variables: CNT <fct>, ESCS <dbl>, ST004D01T <fct>, ST003D02T <fct>,
#   PV1MATH <dbl>

The reason is R is looking for individual students born in September AND October AND November AND December. As a student can only have one birth month there are no students that meet this criteria. We need to use OR :

To create an OR in a filter we use the bar | character, the below looks for all students who are “Male” AND were born in “September” OR “October” OR “November” OR “December”, AND have a PV1MATH > 750.

PISA_2022 %>% 
  select(CNT, ESCS, ST004D01T, ST003D02T, PV1MATH) %>%
  filter(ST004D01T == "Male",
         (ST003D02T == "September" | ST003D02T == "October" | ST003D02T == "November" | ST003D02T == "December"),
         PV1MATH > 750)
# A tibble: 194 × 5
   CNT        ESCS ST004D01T ST003D02T PV1MATH
   <fct>     <dbl> <fct>     <fct>       <dbl>
 1 Australia 0.994 Male      September    770.
 2 Australia 1.03  Male      October      808.
 3 Australia 1.30  Male      October      762.
 4 Australia 1.21  Male      November     764.
 5 Australia 0.837 Male      September    758.
 6 Australia 1.22  Male      September    807.
 7 Belgium   0.746 Male      December     770.
 8 Belgium   0.939 Male      December     779.
 9 Belgium   1.25  Male      November     832.
10 Belgium   1.11  Male      October      776.
# ℹ 184 more rows

It’s neater, maybe, to use the %in% command, which checks to see if the value in a column is present in a vector, this can mimic the OR/| command:

PISA_2022 %>% 
  select(CNT, ESCS, ST004D01T, ST003D02T, PV1MATH) %>%
  filter(ST004D01T == "Male",
         ST003D02T %in% c("September", "October", "November", "December"),
         PV1MATH > 750)
Tip

When building filters you need to know the range of values that a column can take, we can do this in several ways:

# show the possible levels
levels(PISA_2022$ST003D02T)
 [1] "January"        "February"       "March"          "April"         
 [5] "May"            "June"           "July"           "August"        
 [9] "September"      "October"        "November"       "December"      
[13] "Valid Skip"     "Not Applicable" "Invalid"        "No Response"   
# show the actual unique values in a field
# this might be a slightly smaller set of values
unique(PISA_2022$ST003D02T)
 [1] May       February  August    July      January   December  September
 [8] October   April     March     June      November  <NA>     
16 Levels: January February March April May June July August ... No Response
# You might also want to read the label of a field
attr(PISA_2022$ST003D02T, "label")
[1] "Student (Standardized) Birth - Month"

2.1 Questions

  1. Spot the two errors with the following select statement
PISA_2022
  select(CNT, PV1READ) %>%
  filter(CNT = "Finland")
answer
PISA_2022 %>%  #1 missing pipe command
  select(CNT, PV1READ) %>%
  filter(CNT == "Finland") #2 you need a double equals
  1. Use filter to find all the students with PV1READ grade equal to 333.
answer
PISA_2022 %>% 
  filter(PV1READ == 333)
  1. Use filter to find all the students with PV1READ, PV1SCIE, and PV1MATH grades over 800.
answer
PISA_2022 %>% 
  filter(PV1READ > 800,
         PV1SCIE > 800,
         PV1MATH > 800)
  1. Spot the three errors with the following select statement
PISA_2022 %>% 
  select(CNT) %>%
  filter(CNT in c("France", "belgium")
         ESCS < 0)
answer
PISA_2022 %>% 
  select(CNT, ESCS) %>% #1 you have ESCS in the filter, it needs to be in the select as well
  filter(CNT %in% c("France", "Belgium"), #2 Belgium needs a capital letter
                                          #3 the %in% command needs percentages
                                          #4 you need a comma (or &) at the end of the line
         ESCS < 0)
  1. Use filter to find all the students with Three or more cars in their home ST251Q01JA. How does this compare to those with no None cars?
answer
# cars 3+ 
PISA_2022 %>%
  select(CNT, ST251Q01JA) %>%
  filter(ST251Q01JA == "Three or more")

# cars 0
PISA_2022 %>%
  select(CNT, ST251Q01JA) %>%
  filter(ST251Q01JA == "None")
  1. Adjust your code in Q2. to find the number of students with Three or more cars in their home ST251Q01JA in Italy, how does this compare with Spain?
answer
PISA_2022 %>%
  select(CNT, ST251Q01JA) %>%
  filter(ST251Q01JA == "Three or more",
         CNT == "Italy")

PISA_2022 %>%
  select(CNT, ST251Q01JA) %>%
  filter(ST251Q01JA == "Three or more",
         CNT == "Spain")

# EXTENSION:
# Note we would need to know the percentage of students 
# in each country with that number of cars to make a proper
# comparison. Spain might have more students taking the PISA
# test than Italy, or vice-versa

PISA_2022 %>%
  select(CNT, ST251Q01JA) %>%
  filter(CNT %in% c("Italy", "Spain")) %>%
  group_by(CNT) %>%
  mutate(total_stus = n()) %>%
  filter(ST251Q01JA == "Three or more") %>%
  summarise(three_more = n(),
            per_three_more = three_more/unique(total_stus))
  1. Write a filter to create a table for the number of Female students with reading PV1READ scores lower than 400 in the United Kingdom, store the result as read_low_female, repeat but for Male students and store as read_low_male. Use nrow() to work out if there are more males or females with a low reading score in the UK
answer
read_low_female <- PISA_2022 %>% 
  filter(CNT == "United Kingdom",
         PV1READ < 400,
         ST004D01T == "Female")

read_low_male <- PISA_2022 %>% 
  filter(CNT == "United Kingdom",
         PV1READ < 400,
         ST004D01T == "Male")

nrow(read_low_female)
nrow(read_low_male)

# You could also pipe the whole dataframe into nrow()
PISA_2022 %>% 
  filter(CNT == "United Kingdom",
         PV1READ < 400,
         ST004D01T == "Female") %>% 
  nrow()
  1. How many students in the United Kingdom had no television ST254Q01JA OR no connection to the internet ST250Q05JA. HINT: use levels(PISA_2022$ST254Q01JA) to look at the levels available for each column.
answer
PISA_2022 %>% filter(CNT == "United Kingdom", 
                     ST254Q01JA == "None" |
                     ST250Q05JA == "None")
  1. Which countr[y|ies] had students with NA for Gender, remember to check for NA using is.na()?
answer
PISA_2022 %>% 
  filter(is.na(ST004D01T)) %>%
  select(CNT) %>%
  distinct()

3 renaming columns

Very often when dealing with datasets such as PISA or TIMSS, the column names can be very confusing without a reference key, e.g. ST004D01T, OCOD3 and ST261Q04JA. To rename columns in the tidyverse we use the rename(<new_name> = <old_name>) command. For example, if you wanted to rename the rather confusingly named student column for gender, also known as ST004D01T, and the column for having a having enough digital resources in school, also known as IC172Q01JA, you could use:

PISA_2022 %>%
  rename(gender = ST004D01T,
         dig_resources = IC172Q01JA) %>%
  select(CNT, gender, dig_resources) %>% 
  summary()
                   CNT                    gender      
 Spain               : 30800   Female        :305759  
 United Arab Emirates: 24600   Male          :307906  
 Canada              : 23073   Valid Skip    :     0  
 Kazakhstan          : 19769   Not Applicable:     0  
 Indonesia           : 13439   Invalid       :     0  
 Australia           : 13437   No Response   :     0  
 (Other)             :488626   NA's          :    79  
           dig_resources   
 Agree            :183233  
 Disagree         : 70595  
 Strongly agree   : 49276  
 Strongly disagree: 35223  
 Valid Skip       :     0  
 (Other)          :     0  
 NA's             :275417  

If you want to change the name of the column so that it stays when you need to perform another calculation, remember to assign the renamed dataframe back to the original dataframe. But be warned, you’ll need to reload the full dataset to restore the original names:

PISA_2022 <- PISA_2022 %>%
    rename(gender = ST004D01T,
           dig_resources = IC172Q01JA)

4 group_by and summarise

So far we have looked at ways to return rows that meet certain criteria. Using group_by and summarise we can start to analyse data for different groups of students. For example, let’s look at the number of students who don’t have internet connections at home besides a mobile phone ST250Q05JA:

1PISA_2022 %>%
2  group_by(ST250Q05JA) %>%
3  summarise(student_n = n())
1
Line 1 passes the full PISA_2022 to the pipe
2
Line 2 makes groups within PISA_2022 using the unique values of ST250Q05JA
3
Line 3, these groups are then passed to summarise, which creates a new column called student_n and stores the number of rows in each ST250Q05JA group using the n() command. summarise only returns the columns it creates, or are in the group_by, everything else is discarded.
# A tibble: 3 × 2
  ST250Q05JA student_n
  <fct>          <int>
1 Yes           525842
2 No             56968
3 <NA>           30934

What we might want to do is look at this data from a country by country perspective, by adding another field to the group_by() command, we then group by the unique combination of countries CNT and internet access ST250Q05JA, e.g. Albania + Yes; Albania + No; Albania + NA; United Arab Emirates + Yes; etc

int_by_cnt <- PISA_2022 %>% 
  group_by(CNT, ST250Q05JA) %>%
  summarise(student_n = n())

print(int_by_cnt)
# A tibble: 238 × 3
# Groups:   CNT [80]
   CNT                  ST250Q05JA student_n
   <fct>                <fct>          <int>
 1 Albania              Yes             4693
 2 Albania              No               535
 3 Albania              <NA>             901
 4 United Arab Emirates Yes            22206
 5 United Arab Emirates No              1116
 6 United Arab Emirates <NA>            1278
 7 Argentina            Yes            10484
 8 Argentina            No               938
 9 Argentina            <NA>             689
10 Australia            Yes            12808
# ℹ 228 more rows

summarise can also be used to work out statistics by grouping. For example, if you wanted to find out the max, mean and min science grade PV1SCIE by country CNT, you could do the following:

PISA_2022 %>% 
  group_by(CNT) %>%
  summarise(sci_max  = max(PV1SCIE,  na.rm = TRUE),
            sci_mean = mean(PV1SCIE, na.rm = TRUE),
            sci_min  = min(PV1SCIE,  na.rm = TRUE))
# A tibble: 80 × 4
   CNT                  sci_max sci_mean sci_min
   <fct>                  <dbl>    <dbl>   <dbl>
 1 Albania                 724.     376.    74.6
 2 United Arab Emirates    842.     436.     0  
 3 Argentina               751.     415.   128. 
 4 Australia               875.     508.   108. 
 5 Austria                 804.     494.   170. 
 6 Belgium                 792.     495.   169. 
 7 Bulgaria                728.     422.   130. 
 8 Brazil                  784.     406.   106. 
 9 Brunei Darussalam       762.     445.   135. 
10 Canada                  893.     499.   107. 
# ℹ 70 more rows
Important

group_by() can have unintended consequences in your code if you are saving your pipes to new dataframes. To be safe your can clear any grouping by adding: my_data %>% ungroup()

4.1 Questions

  1. Spot the three errors with the following summarise statement
PISA_2022 %>% 
  group(CNT)
  summarise(num_stus = n)
answer
PISA_2022 %>% 
  group_by(CNT) %>% #1 group_by NOT group #2 missing pipe %>%
  summarise(num_stus = n()) #3 = n() not = n
  1. Write a group_by and summarise statement to work out the mean and median cultural capital value ESCS for each student by country CNT
answer
PISA_2022 %>%
  group_by(CNT) %>%
  summarise(escs_mean = mean(ESCS, na.rm=TRUE),
            escs_median = median(ESCS, na.rm=TRUE))
  1. Using summarise work out, Yes or No, by country CNT and gender ST004D01T, whether students “Agree/disagree: There are enough [digital resources] for every student at my school” IC172Q01JA. Filter out any NA values on IC172Q01JA:
answer
PISA_2022 %>% 
  filter(!is.na(IC172Q01JA)) %>%
  group_by(CNT, ST004D01T, IC172Q01JA) %>% 
  summarise(n=n())

5 mutate

Sometimes you will want to adjust the values stored in a field, e.g. converting a distance in miles into kilometres; or compute a new fields based on other fields, e.g. working out a total grade given the parts of a test. To do this we can use mutate. Unlike summarise, mutate retains all the other columns either adding a new column or changing an existing one

mutate(<field> = <field_calculation>)

The PISA_2022 dataset has results for maths PV1MATH, science PV1SCIE and reading PV1READ. We could combine these to create an overall PISA_grade, and PISA_mean:

PISA_2022 %>%                                       
1  mutate(PV1_total = PV1MATH + PV1SCIE + PV1READ,
2         PV1_mean = PV1_total/3) %>%
3  select(CNT, ST004D01T, ESCS, PV1_total, PV1_mean)
1
mutate creates a new field called PV1_total made up by adding together the columns for maths, science and reading. Each column acts like a vector and adding them together is the equivalent of adding each students individual grades together, row by row. See §Vectors for more details on vector addition.
2
inside the same mutate statement, we take the PV1_total calculated on line two and divide it by 3, to give us a mean value, this is then assigned to a new column, PV1_mean.
3
this line selects only the fields that we are interested in, dropping the others
# A tibble: 613,744 × 5
   CNT     ST004D01T   ESCS PV1_total PV1_mean
   <fct>   <fct>      <dbl>     <dbl>    <dbl>
 1 Albania Female     1.11       763.     254.
 2 Albania Male      -3.05       882.     294.
 3 Albania Male      -0.187      911.     304.
 4 Albania Female    -3.22       809.     270.
 5 Albania Female    -1.05      1335.     445.
 6 Albania Male       1.09      1464.     488.
 7 Albania Male      -0.762     1115.     372.
 8 Albania Female    -1.02       904.     301.
 9 Albania Female    -1.17      1145.     382.
10 Albania Female     0.286     1235.     412.
# ℹ 613,734 more rows

We can use mutate to create subsets of data in fields. For example, if we wanted to see how many students in each country were high performing readers, specified by getting a reading grade of greater than 550, we could do the following:

PISA_2022 %>%
2  mutate(PV1READ_high = PV1READ > 550) %>%
3  group_by(CNT, PV1READ_high) %>%
4  summarise(n = n())
2
this line creates a new column called PV1READ_high for every students, storing a boolean value, TRUE or FALSE depending on whether their reading grates PV1READ > 550.
3
a grouping is made on the country and the new field PV1READ_high, e.g. Albania + TRUE, Albania + FALSE, etc.
4
using summarise we can find the number of student rows in each grouping using n(), and drop all the other fields
# A tibble: 159 × 3
# Groups:   CNT [80]
   CNT                  PV1READ_high     n
   <fct>                <lgl>        <int>
 1 Albania              FALSE         6055
 2 Albania              TRUE            74
 3 United Arab Emirates FALSE        20711
 4 United Arab Emirates TRUE          3889
 5 Argentina            FALSE        11197
 6 Argentina            TRUE           914
 7 Australia            FALSE         9005
 8 Australia            TRUE          4432
 9 Austria              FALSE         4461
10 Austria              TRUE          1690
# ℹ 149 more rows

Comparisons can also be made between different columns, if we wanted to find out the percentage of Males and Females that got a better grade in their maths test PV1MATH than in their reading test PV1READ:

PISA_2022 %>%
2  mutate(maths_better = PV1MATH > PV1READ) %>%
3  select(CNT, ST004D01T, maths_better, PV1MATH, PV1READ) %>%
4  filter(!is.na(ST004D01T), !is.na(maths_better)) %>%
5  group_by(ST004D01T) %>%
6  mutate(students_n = n()) %>%
7  group_by(ST004D01T, maths_better) %>%
8  summarise(n = n(),
9            per = n/unique(students_n))
2
mutate creates a new field called maths_better made up by comparing the PV1MATH grade with PV1READ and creating a boolean/logical vector for the column.
3
selects a subset of the columns
4
filters out any students that don’t have gender data ST004D01T and where the calculation on line 2 failed, i.e. PV1MATH or PV1READ was NA
5
group on the gender of the student
6
using the group on line 5, use mutate to calculate the total number of Males and Females by looking for the number of rows in each group n(), store this as students_n
7
re-group the data on gender ST004D01T and whether the student is better at maths than reading maths_better
8
count the number of students, n in each group specified by line 7.
9
create a percentage figure for the number of students in each grouping given by line 7. Use the n value from line 8 and the students_n value from line 6. NOTE: we need to use unique(students_n) to return just one value for each grouping rather than a value for every row of the line 7 grouping
# A tibble: 4 × 4
# Groups:   ST004D01T [2]
  ST004D01T maths_better      n   per
  <fct>     <lgl>         <int> <dbl>
1 Female    FALSE        180350 0.590
2 Female    TRUE         125409 0.410
3 Male      FALSE        118341 0.384
4 Male      TRUE         189565 0.616

For more information on how to mutate fields using ifelse, see Section 8.1

6 arrange

The results returned by pipes can be huge, so it’s a good idea to store them in objects and explore them in the Environment window where you can sort and search within the output. There might also be times when you want to order/arrange the outputs in a particular way. We can do this quite easily in the tidyverse by using the arrange(<column_name>, <column_name>) function.

PISA_2022 %>%
  select(CNT, ST004D01T, PV1MATH) %>%
  arrange(PV1MATH)
# A tibble: 613,744 × 3
   CNT       ST004D01T PV1MATH
   <fct>     <fct>       <dbl>
 1 Cambodia  Male          0  
 2 Guatemala Female       57.8
 3 Cambodia  Female       84.7
 4 Cambodia  Male         87.9
 5 Cambodia  Male         88.6
 6 Paraguay  Male         90.4
 7 Cambodia  Male         94.4
 8 Cambodia  Female       98.0
 9 Albania   Male         99.7
10 Cambodia  Female      100. 
# ℹ 613,734 more rows

If we’re interested in the highest achieving students we can add the desc() function to arrange:

PISA_2022 %>%
  select(CNT, LANGN, ST004D01T, PV1MATH) %>%
  arrange(desc(PV1MATH))
# A tibble: 613,744 × 4
   CNT               LANGN     ST004D01T PV1MATH
   <fct>             <fct>     <fct>       <dbl>
 1 Singapore         Invalid   Male         943.
 2 Chinese Taipei    Mandarin  Male         917.
 3 Singapore         Invalid   Male         897.
 4 Korea             Korean    Male         893.
 5 Korea             Korean    Female       889.
 6 Chinese Taipei    Mandarin  Male         887.
 7 Singapore         Invalid   Male         883.
 8 Hong Kong (China) Cantonese Male         880.
 9 Korea             Korean    Male         868.
10 Singapore         Invalid   Male         867.
# ℹ 613,734 more rows

7 Bring everyting together

We know that the evidence strongly indicates that repeating a year is not good for student progress, but how do countries around the world differ in terms of the percentage of their students who repeat a year?

1data_repeat <- PISA_2022 %>%
2  filter(!is.na(REPEAT)) %>%
3  group_by(CNT) %>%
4  mutate(total = n()) %>%
5  select(CNT, REPEAT, total) %>%
6  group_by(CNT, REPEAT) %>%
7  summarise(student_n = n(),
8            total = unique(total),
9            per = student_n / unique(total)) %>%
10  filter(REPEAT == "Repeated at lease once") %>%
11  arrange(desc(per))

print(data_repeat)                                  

12write_csv(data_repeat, "<folder_location>/repeat_a_year.csv")
1
uses the PISA_2022 dataframe, note that this line includes <- to store the result opf the piping into a new object called data_repeat
2
filter out any NA values in the REPEAT field
3
group on the country of student CNT
4
create a new column total for total number of rows n() in each country CNT grouping
5
select on the CNT, REPEAT and total columns
6
regroup the data on country CNT and whether a student has repeated a year REPEAT, i.e. Albania+Did not repeat a grade; Albania+Repeated a grade; etc.
7
using the above grouping, count the number of rows in each group n() and assign this to student_n
8
for each grouping keep the total number of students in each country, as calculated on line 4. Note: unique(total) is needed here to return a single value of total, rather than a value for each student in each country
9
using student_n from line 7 and the number of students per country total, from line 4, create a percentage per for each grouping
10
as we have percentages for both Repeated at lease once and Never repeated, we only need to display one of these.
11
finally, we sort the data on the per/percentage column, to show the countries with the highest level of repeating a grade. This data is self-recorded by students, so might not be totally reliable!
12
save the data to your own folder as a csv
# A tibble: 78 × 5
# Groups:   CNT [78]
   CNT                REPEAT                 student_n total   per
   <fct>              <fct>                      <int> <int> <dbl>
 1 Morocco            Repeated at lease once      3156  6796 0.464
 2 Colombia           Repeated at lease once      2783  7401 0.376
 3 Cambodia           Repeated at lease once      1697  5156 0.329
 4 Guatemala          Repeated at lease once      1382  5110 0.270
 5 Panama             Repeated at lease once      1050  4080 0.257
 6 Philippines        Repeated at lease once      1793  7071 0.254
 7 Dominican Republic Repeated at lease once      1603  6566 0.244
 8 Belgium            Repeated at lease once      1941  8055 0.241
 9 Jamaica            Repeated at lease once       851  3600 0.236
10 Netherlands        Repeated at lease once      1118  4858 0.230
# ℹ 68 more rows

8 Advanced topics

8.1 Recoding data (ifelse)

Often we want to plot values in groupings that don’t yet exist, for example might want to give all schools over a certain size a different colour from others schools, or flag up students who have a different home language to the language that is being taught in school. To do this we need to look at how we can recode values. A common way to recode values is through an ifelse statement:

ifelse(<statement(s)>, <value_if_true>, <value_if_false>)

ifelse allows us to recode the data. In the example below, we are going to add a new column to the PISA_2022 dataset (using mutate) noting whether a student got a higher grade in their Maths PV1MATH or Reading PV1READ tests. if PV1MATH is bigger then PV1READ, the maths_better is TRUE, else maths_better is FALSE, or in dplyr format:

maths_data <- PISA_2022 %>%
  mutate(maths_better = 
           ifelse(PV1MATH > PV1READ,
                  TRUE, 
                  FALSE)) %>%
  select(CNT, ST004D01T, maths_better, PV1MATH, PV1READ)

print(maths_data)
# A tibble: 613,744 × 5
   CNT     ST004D01T maths_better PV1MATH PV1READ
   <fct>   <fct>     <lgl>          <dbl>   <dbl>
 1 Albania Female    FALSE           180.    248.
 2 Albania Male      TRUE            308.    258.
 3 Albania Male      FALSE           268.    285.
 4 Albania Female    FALSE           273.    322.
 5 Albania Female    FALSE           435.    464.
 6 Albania Male      TRUE            534.    451.
 7 Albania Male      FALSE           382.    391.
 8 Albania Female    FALSE           273.    308.
 9 Albania Female    FALSE           355.    429.
10 Albania Female    TRUE            430.    420.
# ℹ 613,734 more rows

We now take this new dataset maths_data and look at whether the difference between relative performance in maths and reading is the same for girls and boys:

maths_data %>% 
  filter(!is.na(ST004D01T), !is.na(maths_better)) %>%
  group_by(ST004D01T, maths_better) %>%
  summarise(n = n()) 
# A tibble: 4 × 3
# Groups:   ST004D01T [2]
  ST004D01T maths_better      n
  <fct>     <lgl>         <int>
1 Female    FALSE        180350
2 Female    TRUE         125409
3 Male      FALSE        118341
4 Male      TRUE         189565

Adjust the code above to work out the percentages of Males and Females ST004D01T in each group. Check to see if the pattern also exists between science PV1SCIE and reading PV1READ:

adding percentage column
PISA_2022 %>%
  mutate(maths_better = 
           ifelse(PV1MATH > PV1READ,
                  TRUE, 
                  FALSE)) %>%
  select(CNT, ST004D01T, maths_better, PV1MATH, PV1READ) %>% 
  filter(!is.na(ST004D01T), !is.na(maths_better)) %>%
  group_by(ST004D01T) %>%
  mutate(students_n = n()) %>%
  group_by(ST004D01T, maths_better) %>%
  summarise(n = n(),
            per = n/unique(students_n))
comparing science and reading
PISA_2022 %>%
  mutate(sci_better = 
           ifelse(PV1SCIE > PV1READ,
                  TRUE, 
                  FALSE)) %>%
  select(CNT, ST004D01T, sci_better, PV1SCIE, PV1READ) %>% 
  filter(!is.na(ST004D01T), !is.na(sci_better)) %>%
  group_by(ST004D01T) %>%
  mutate(students_n = n()) %>%
  group_by(ST004D01T, sci_better) %>%
  summarise(n = n(),
            per = n/unique(students_n))
comparing science and maths
PISA_2022 %>%
  mutate(sci_better = 
           ifelse(PV1SCIE > PV1MATH,
                  TRUE, 
                  FALSE)) %>%
  select(CNT, ST004D01T, sci_better, PV1SCIE, PV1MATH) %>% 
  filter(!is.na(ST004D01T), !is.na(sci_better)) %>%
  group_by(ST004D01T) %>%
  mutate(students_n = n()) %>%
  group_by(ST004D01T, sci_better) %>%
  summarise(n = n(),
            per = n/unique(students_n))

ifelse statements can get a little complicated when using factors (see: Section 8.2). Take this example. Let’s flag students who have a different home language LANGN to the language that is being used in the PISA assessment tool LANGTEST_QQQ. We make an assumption here that the assessment tool will be the language used at school, so these students will be learning in a different language to their mother tongue. if LANGN equals LANGTEST_QQQ, the lang_diff is FALSE, else lang_diff is TRUE, this raises an error:

lang_data <- PISA_2022 %>%
  mutate(lang_diff = 
           ifelse(LANGN == LANGTEST_QQQ,
                  FALSE, 
                  TRUE)) %>%
  select(CNT, lang_diff, LANGTEST_QQQ, LANGN)
Error in `mutate()`:
ℹ In argument: `lang_diff = ifelse(LANGN == LANGTEST_QQQ, FALSE, TRUE)`.
Caused by error in `Ops.factor()`:
! level sets of factors are different

The levels in each field are different, i.e. the range of home languages is larger than the range of test languages. To fix this, all we need to do is change the datatype of the factors LANGN and LANGTEST_QQQ to characters using as.character(<field>). This will then allow the comparison of the text stored in each row:

lang_data <- PISA_2022 %>%
  mutate(lang_diff = 
           ifelse(as.character(LANGN) == as.character(LANGTEST_QQQ),
                  FALSE, 
                  TRUE)) %>%
  select(CNT, lang_diff, LANGTEST_QQQ, LANGN)

print(lang_data)
# A tibble: 613,744 × 4
   CNT     lang_diff LANGTEST_QQQ LANGN   
   <fct>   <lgl>     <fct>        <fct>   
 1 Albania FALSE     Albanian     Albanian
 2 Albania FALSE     Albanian     Albanian
 3 Albania FALSE     Albanian     Albanian
 4 Albania FALSE     Albanian     Albanian
 5 Albania FALSE     Albanian     Albanian
 6 Albania FALSE     Albanian     Albanian
 7 Albania FALSE     Albanian     Albanian
 8 Albania FALSE     Albanian     Albanian
 9 Albania FALSE     Albanian     Albanian
10 Albania FALSE     Albanian     Albanian
# ℹ 613,734 more rows

We can now look at this dataset to get an idea of which countries have the largest percentage of students learning in a language other than their mother tongue:

lang_data_diff <- lang_data %>% 
  group_by(CNT) %>%
  mutate(student_n = n()) %>%
  group_by(CNT, lang_diff) %>%
  summarise(n = n(),
            percentage = 100*(n / max(student_n))) %>%
    filter(!is.na(lang_diff),
         lang_diff == TRUE)

print(lang_data_diff)
# A tibble: 80 × 4
# Groups:   CNT [80]
   CNT                  lang_diff     n percentage
   <fct>                <lgl>     <int>      <dbl>
 1 Albania              TRUE        720      11.7 
 2 United Arab Emirates TRUE      13933      56.6 
 3 Argentina            TRUE        788       6.51
 4 Australia            TRUE       1827      13.6 
 5 Austria              TRUE       1455      23.7 
 6 Belgium              TRUE       2445      29.5 
 7 Bulgaria             TRUE        961      15.7 
 8 Brazil               TRUE        559       5.18
 9 Brunei Darussalam    TRUE       4831      86.6 
10 Canada               TRUE       5971      25.9 
# ℹ 70 more rows

This looks like a promising dataset, but there are some strange results:

lang_data_diff %>% filter(percentage > 92)
# A tibble: 8 × 4
# Groups:   CNT [8]
  CNT                          lang_diff     n percentage
  <fct>                        <lgl>     <int>      <dbl>
1 Hong Kong (China)            TRUE       5626       95.2
2 Macao (China)                TRUE       4384      100  
3 Montenegro                   TRUE       5596       96.6
4 Norway                       TRUE       6611      100  
5 Philippines                  TRUE       6693       93.0
6 Ukrainian regions (18 of 27) TRUE       3747       96.7
7 Singapore                    TRUE       6567       99.4
8 Chinese Taipei               TRUE       5821       99.4

Exploring data for Ukraine, we can see that a different spelling has been used in each field, Ukrainian and Ukranain, an incorrect spelling.

lang_data %>% filter(CNT == "Ukrainian regions (18 of 27)")
# A tibble: 3,876 × 4
   CNT                          lang_diff LANGTEST_QQQ LANGN    
   <fct>                        <lgl>     <fct>        <fct>    
 1 Ukrainian regions (18 of 27) TRUE      Ukranian     Ukrainian
 2 Ukrainian regions (18 of 27) TRUE      Ukranian     Ukrainian
 3 Ukrainian regions (18 of 27) TRUE      Ukranian     Russian  
 4 Ukrainian regions (18 of 27) TRUE      Ukranian     Ukrainian
 5 Ukrainian regions (18 of 27) TRUE      Ukranian     Ukrainian
 6 Ukrainian regions (18 of 27) TRUE      Ukranian     Ukrainian
 7 Ukrainian regions (18 of 27) TRUE      Ukranian     Russian  
 8 Ukrainian regions (18 of 27) TRUE      Ukranian     Ukrainian
 9 Ukrainian regions (18 of 27) TRUE      Ukranian     Ukrainian
10 Ukrainian regions (18 of 27) TRUE      Ukranian     Ukrainian
# ℹ 3,866 more rows

ifelse can help here too. If we pick the spelling we want to stick to, we can recode fields to match:

lang_data %>% 
  mutate(LANGTEST_QQQ = 
           ifelse(as.character(LANGTEST_QQQ) == "Ukranian",
                 "Ukrainian",
                 as.character(LANGTEST_QQQ))) %>%
  mutate(lang_diff = 
           ifelse(as.character(LANGN) == as.character(LANGTEST_QQQ),
                  FALSE, 
                  TRUE)) %>%
  filter(CNT == "Ukrainian regions (18 of 27)")
# A tibble: 3,876 × 4
   CNT                          lang_diff LANGTEST_QQQ LANGN    
   <fct>                        <lgl>     <chr>        <fct>    
 1 Ukrainian regions (18 of 27) FALSE     Ukrainian    Ukrainian
 2 Ukrainian regions (18 of 27) FALSE     Ukrainian    Ukrainian
 3 Ukrainian regions (18 of 27) TRUE      Ukrainian    Russian  
 4 Ukrainian regions (18 of 27) FALSE     Ukrainian    Ukrainian
 5 Ukrainian regions (18 of 27) FALSE     Ukrainian    Ukrainian
 6 Ukrainian regions (18 of 27) FALSE     Ukrainian    Ukrainian
 7 Ukrainian regions (18 of 27) TRUE      Ukrainian    Russian  
 8 Ukrainian regions (18 of 27) FALSE     Ukrainian    Ukrainian
 9 Ukrainian regions (18 of 27) FALSE     Ukrainian    Ukrainian
10 Ukrainian regions (18 of 27) FALSE     Ukrainian    Ukrainian
# ℹ 3,866 more rows

Unfortunately, if you explore this dataset a little further, the language fields don’t conform well with each other and a lot more work with ifelse will be needed before you could put together any full analysis around students who speak different languages at home and at school.

Tip

It’s possible to nest our ifelse statements, by writing another ifelse where you would have the <value_if_false>, for example we might want to give describe the type of school in England:

# TODO: use PISA for this
plot_data <- schools %>% 
  mutate(sch_type = 
           ifelse(EstablishmentGroup == "Special schools", "Special",
                  ifelse(EstablishmentGroup == "Independent schools", "Independent",
                         ifelse(AdmissionsPolicy=="Selective", 
                                "Grammar", "Comprehensive"))))

8.2 Factors and statistical data types

The types of variable will heavily influence what statistical analysis you can perform, e.g. you’ll need numeric values for a t-test. R is there to help by assigning datatypes to each field. We have different sorts of data that can be stored:

  • Categorical - data that can be divided into groups or categories
    • Nominal - categorical data where the order isn’t important, e.g. gender, or colours
    • Ordinal - categorical data that may have order or ranking, e.g. exam grades (A, B, C, D) or lickert scales (strongly agree, agree, disgaree, strongly disagree)
  • Numeric - data that consists of numbers
    • Continuous - numeric data that can take any value within a given range, e.g. height (178cm, 134.54cm)
    • Discrete - numeric data that can take only certain values within a range, e.g. number of children in a family (0,1,2,3,4,5)

But here we are going to look at how R handles factors. Factors have two parts, levels and codes. levels are what you see when you view a table column, codes are an underlying order to the data. Factors allow you to store data that has a known set of values that you might want to display in an order other than alphabetical. For example, if we look at the month field ST003D02T using the levels(<field>) command:

levels(PISA_2022$ST003D02T)
 [1] "January"        "February"       "March"          "April"         
 [5] "May"            "June"           "July"           "August"        
 [9] "September"      "October"        "November"       "December"      
[13] "Valid Skip"     "Not Applicable" "Invalid"        "No Response"   

We can see that the months of the year are there along with other possible levels. With this particular column there are levels for missing or wrong responses (“Valid Skip”, “Not Applicable” “Invalid”, “No Response”), though PISA rarely uses them. You are more likely to find that missing/wrong data items are coded as NA, as you can see below:

PISA_2022 %>% count(ST003D02T)
# A tibble: 13 × 2
   ST003D02T     n
   <fct>     <int>
 1 January   48760
 2 February  43030
 3 March     48671
 4 April     47014
 5 May       49235
 6 June      48890
 7 July      51262
 8 August    51681
 9 September 51755
10 October   51703
11 November  48179
12 December  48156
13 <NA>      25408

Codes are the underlying numbers/order for each level, in this case 1 = January, 2 = February, etc. R stores factors as codes, then uses the levels to display the data. You can see the codes by using the as.numeric command on a factor:

as.numeric(PISA_2022$ST003D02T)
 [1]  5  2  8  7  1  5  5 12  8  9 10  4  5  8  7  2  3  6  3  8 10  1 10  2  5
[26]  6  8  1  3  2  5  7 12  4  1 10 10  3  3  9
 [ reached getOption("max.print") -- omitted 613704 entries ]

How can this be useful? Firstly it’s more efficient for R to store data this way, numbers (codes) are smaller and easier to sort/search than text (levels). But it also helps when we come to presenting data. A good example is how plots are made, they will use the codes to give an order to the display of columns, in the plot below, February (2) comes before August (8), even though there were more students born in August and A is before F in the alphabet:

grph_data <- PISA_2022 %>% 
         group_by(ST003D02T) %>% 
         summarise(n=n())

grph_data %>% arrange(desc(n)) %>% pull(ST003D02T)
 [1] September October   August    July      May       June      January  
 [8] March     November  December  April     February  <NA>     
attr(,"label")
[1] Student (Standardized) Birth - Month
16 Levels: January February March April May June July August ... No Response
ggplot(data=grph_data, aes(x=ST003D02T, y=n)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

To re-order the columns to match the number of students in each month, we can either try to do this manually, which is rather cumbersome:

my_levels <- c("September", "October", "August", "July", "May", "June", 
               "January", "March", "November", "December", "April", "February",
               "Valid Skip", "Not Applicable", "Invalid", "No Response")

grph_data$ST003D02T <- factor(grph_data$ST003D02T, levels=my_levels)

ggplot(data=grph_data, aes(x=ST003D02T, y=n)) + 
  geom_bar(stat = "identity")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Or we can get R to do this for us:

# get the levels in order and pull/create a vector of them
my_levels <- grph_data %>% arrange(desc(n)) %>% pull(ST003D02T)

# reassign the re-ordered levels to the dataframe column
grph_data$ST003D02T <- factor(grph_data$ST003D02T, levels=my_levels)

ggplot(data=grph_data, aes(x=ST003D02T, y=n)) + 
  geom_bar(stat = "identity")

To learn a lot more about factors, see Hadley’s chapter

9 Seminar tasks

9.1 Student dataset

  1. How many unique values are there in the OCOD3 field for student intended future occupation? How does the most desired career vary by gender?
answer
PISA_2022$OCOD3 %>% unique() %>% length()

PISA_2022 %>% 
  group_by(ST004D01T, OCOD3) %>%
  summarise(n =n()) %>%
  arrange(desc(n))
  1. write code to work out the mean and median PV1MATH score for each country CNT.
answer
PISA_2022 %>% 
  group_by(CNT) %>%
  summarise(mean_PV1MATH = mean(PV1MATH, na.rm=TRUE),
            median_PV1MATH = median(PV1MATH, na.rm=TRUE)) %>%
  arrange(desc(median_PV1MATH))
  1. what is the fourth most popular language at home LANGN spoken by students in schools in the Ireland, how does this compare to Germany?
answer
PISA_2022 %>% 
  filter(CNT %in% c("Germany", "Ireland")) %>%
  group_by(CNT, LANGN) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
  1. Spot the five errors with the following code. Can you make it work? What does it do?
# Work out when science scores are better than maths
PISA_2022_scimath < PISA_2022 %>%
  rename(gender = ST004D01T) %>%
  mutate(sci better = PV1SCIE - PV1MATH) %>%
  filter(is.na(scibetter) %>%
  group_by(CNT gender) %>%
  summarise(students = n,
            sci_win = sum(scibetter >= 0),
            per_scibetter = 100*(sci_win/students))
answer
# Work out when more time spent in language lessons than maths lessons
PISA_2022_scimath <- PISA_2022 %>%  #1 make sure you have the assignment arrow <-
  rename(gender = ST004D01T) %>%
  mutate(sci_better = PV1MATH - PV1SCIE) %>% #2 _ not space in name of field
  filter(!is.na(sci_better)) %>%  #3 this needs to be !is.na, otherwise it'll return nothing
  group_by(CNT, gender) %>% #4 missing comma
  summarise(students = n(),   #5 missing brackets on the n() command
            sci_win = sum(sci_better >= 0),
            per_sci_win = 100*(sci_win/students))
  1. By country and gender work out the mean, median and standard deviations of STUBMI, order by the descending mean.
answer
pisa_bmi <- PISA_2022 %>%
  rename(gender = ST004D01T) %>%
  group_by(CNT, gender) %>%
  summarise(n=n(),
            bmi_mean = mean(STUBMI, na.rm=TRUE),
            bmi_median = median(STUBMI, na.rm=TRUE),
            bmi_sd = sd(STUBMI, na.rm=TRUE)) %>%
  arrange(desc(bmi_mean))

9.2 Teacher dataset

To further check your understanding of this section you will be attempting to analyse the 2022 teacher dataset. This dataset includes records for 68054 teachers from 18 countries, including 544 columns, covering attitudinal, demographic and workplace data. You can find the dataset here in the .parquet format.

example loading code
# download the file then
# Work out when more time spent in language lessons than maths lessons
PISA_2022_teacher <- read_parquet("C:/Users/Peter/Downloads/PISA_2012_teacher.parquet")
  1. Work out how many teachers are in the dataset for Portugal
answer
PISA_2022_teacher %>% 
  group_by(CNTRYID) %>%
  summarise(n=n()) %>%
  filter(CNTRYID == "Portugal")
  1. For each country CNTRYID by gender TC001Q01NA, what is the mean time that a teacher has been in the teaching profession TC007Q02NA? Include the number of teachers in each group. Order this to show the country with the longest serving workforce:
answer
PISA_2022_teacher %>%
  group_by(CNTRYID, TC001Q01NA) %>%
  summarise(avg_years = mean(TC007Q02NA, na.rm=TRUE),
            n = n()) %>%
  arrange(desc(avg_years))
  1. For each country CNT find out which teachers report that they ‘Help students think critically’ TC199Q07HA. Hin: you’ll need to look at the levels of this question to find the correct filter:
answer
crit_thinking <- PISA_2022_teacher %>% 
  rename(crit_think = TC199Q07HA) %>%
  group_by(CNT) %>%
  mutate(teachers=n()) %>%
  group_by(CNT, crit_think) %>%
  summarise(n = n(),
            per = n()/unique(teachers)) %>%
  arrange(desc(per)) %>%
  filter(crit_think == "A lot")

# interestingly the highest performing countries also 
# have some of the lowest scores in helping children 
# think critically. To plot this:

left_join(crit_thinking,
          PISA_2022 %>% group_by(CNT) %>% summarise(maths = mean(PV1MATH))) %>%
  ggplot(aes(x=per, y=maths)) + 
  geom_point() +
  geom_smooth()
  1. Explore the data on use of technology in the classroom TC169____
answer
PISA_2022_teacher %>% select(CNT, TC001Q01NA, starts_with("TC169"))
  1. Save the results of one of the above questions using write_csv().

  2. [EXTENSION] explore the dataset and find out some more interesting facts to share with your group