06 T-tests

1 Pre-reading and pre-session tasks

1.1 Pre-reading

Before the session, you might if find it helpful to read chapter 7, introduction to t-tests, in Geher and Hall (2014) available here.

1.2 Pre-session task

2 T-tests

T-tests are statistical tests used to determine if the mean of a sample is different to a presumed population mean, or if the means of two different samples are different to one another.

Here are some examples in educational research of when t-tests might be applied:

  1. A school wants to measure whether there is a statistical difference between Year 11’s Progress 8 scores and the national average( which is approximately 0 on average in England). A t-test is used to determine if there is a statistical difference between the mean of the Year 11’s progress 8 scores in the school and the national average.
  2. A researcher implements an intervention with a random sample of pupils from a school. A second random sample is also taken, which is called the control group. After a period of time of teaching, one with the intervention and without, the pupils are tested. A t-test is performed to see if there is a statistical difference between the mean of the control group and the mean of the intervention group.
  3. A teacher wants to see how well a class is able to recall knowledge learned 6 months after they are given a test. The pupils are given similar test questions before and after the period. A t-test is used to see if there is a statistical difference between test scores in the first test and in the test 6 months later.

2.1 Conditions of t-tests

There are six conditions required for a t-test to be used:

  1. the population data is continuous;
  2. the sample size is small;
  3. the sample was randomly sampled;
  4. the variance of the data in each group is similar (homogeneity of variance);
  5. the population data follows a normally distribution;
  6. the observations are independent of one another.
Note

Due to the robustness of the t-test some of the conditions can be adjusted. For example,

  • Discrete interval data can be used (so long as there is sufficient granularity);
  • The sample data may be approximately normally distributed;
  • Variances can be unequal (this is covered by Welch’s t-test);
  • Sample sizes can be large (this does not affect the validity of the test, but with a known population variance a z-test may be more appropriate).
Note

A rule of thumb researchers use is that if the ratio of variances of the larger sample to the smaller is less than 4, then a t-test can be used.

2.2 Hypothesis Testing and t-tests

When using the t-test, we are testing the null hypothesis that there is no difference between the mean of a sample and, depending on the type of t-test, either the hypothesised value of the population mean or the mean of another sample. In other words, if it is equal.

show the code
# Create random data sets of 10,000 boys' and girls' scores
girlscores <- as.data.frame(rnorm(n = 10000, mean = 100, sd = 10))

girlscores <- girlscores %>%
  mutate(gender="female") %>%
  rename(Scores="rnorm(n = 10000, mean = 100, sd = 10)")

boysscores <- as.data.frame(rnorm(n = 10000, mean = 105, sd = 10))

boysscores <- boysscores %>%
  mutate(gender="male") %>%
  rename(Scores="rnorm(n = 10000, mean = 105, sd = 10)")

# Plot the data
totalscores <- rbind(boysscores, girlscores)

ggplot(totalscores) +
  geom_histogram(data = boysscores,
                 aes(x = Scores, fill = gender, alpha=0.5), binwidth = 1) +
  geom_histogram(data=girlscores,
                 aes(x = Scores, fill = gender, alpha=0.5), binwidth = 1) +
  geom_vline(xintercept = mean(boysscores$Scores), colour="turquoise4", 
             linetype = 3) + # Add a vertical line for the mean scores
  geom_vline(xintercept = mean(girlscores$Scores), colour="red2", linetype=3) +
  geom_segment(aes(x = mean(boysscores$Scores), y=440, # Add arrow between means
                   xend = mean(girlscores$Scores),
                   yend = 440), 
               arrow = arrow(length = unit(0.01, "npc"), ends = "both")) +
  geom_text(x = 125, y = 440, label = "Difference in means") + # Add text label
  xlab("score") +
  ylab("Numer of students") +
  guides(alpha="none") # Remove alpha from the legend

The alternative hypothesis then depends on whether we are testing if the mean of a sample is less than, greater than, or not equal to the hypothesised value of the population mean or mean of another sample. In other words, if it less than or greater than, then it’s a one-tailed test (since it is only considering one direction of the mean), and if it’s not equal to, then it’s a two-tailed test (since it is considering both sides of the mean).

Code
library(gridExtra) # for grid.arrange

plot1 <- data.frame(y = rnorm(n = 1000, mean = 100, sd = 10), group = "A")
plot2 <- data.frame(y = rnorm(n = 1000, mean = 130, sd = 10), group = "B")
plot3 <- data.frame(y = rnorm(n = 1000, mean = 70,  sd = 10), group = "C")
plot4 <- data.frame(y = rnorm(n = 1000, mean = 100, sd = 10), group = "D")
plot5 <- plot3

plot5$group<-"E"
plot5$y <- plot5$y + 60
onetail <- rbind(plot1, plot2)
twotail <- rbind(plot3, plot4, plot5)

p1<-ggplot(onetail, aes(x = y, fill = group)) +
  geom_density(alpha = 0.5) +
  geom_segment(aes(x = mean(plot1$y), y = 0.045, # Add an arrow between means
                   xend = mean(plot2$y),
                   yend = 0.045), 
               arrow = arrow(length = unit(0.01, "npc"), ends = "both"))+
  geom_vline(xintercept = mean(plot1$y), linetype="dashed")+
  geom_vline(xintercept = mean(plot2$y), linetype="dashed")+
  ggtitle("One tailed tests look for differences in one direction")

p2 <- ggplot(twotail, aes(x = y, fill = group)) +
  geom_density(alpha = 0.5) +
  geom_segment(aes(x = mean(plot3$y), y = 0.043, # Add an arrow between means
                   xend = mean(plot4$y),
                   yend = 0.043), 
               arrow = arrow(length = unit(0.01, "npc"), ends = "both")) +
  geom_segment(aes(x = mean(plot4$y), y=0.045, # Add an arrow between means
                   xend = mean(plot5$y),
                   yend = 0.045), 
               arrow = arrow(length = unit(0.01, "npc"), ends = "both"))+
  geom_vline(xintercept = mean(plot3$y), linetype="dashed")+
  geom_vline(xintercept = mean(plot4$y), linetype="dashed")+
  geom_vline(xintercept = mean(plot5$y), linetype="dashed")+
  ggtitle("Two tailed tests look for differences in two directions")

grid.arrange(p1, p2)

To determine the outcome of the t-test we need to decide a sufficiently low enough probability (p-value) that, if it’s below this, then it is unlikely that the null hypothesis is true and so we reject the null hypothesis. Typically, this is set to 0.05, or 5%, but in other cases it might be as high as 0.1 or as low as 0.01 or 0.001 depending on the circumstances. Therefore, if the p-value is greater than 0.05 we accept the null hypothesis and if it less than 0.05 we reject the null hypothesis and accept the alternative hypothesis.

When considering the number of tails, we then need to consider the p-value here also. If we set the overall probability to be 0.05 for the threshold that we reject the null hypothesis, then for a one-tailed test we compare our result with 0.05. However, with a two-tailed test, since we are looking at a two-tailed test, then the probability of each of the tails is 0.025 and therefore we compare our value with 0.025 instead.

Conventionally, we use 0.05 (1 in 20) as the cut off for statistical significance (a convention that has been much critiqued e.g. (Jacob Cohen 1994; Baker 2016)).

2.3 Types of t-test

There are a two main types of t-tests:

  • One sample t-test - checks if the mean of a sample is different to a hypothesised value for the population mean.
  • Two sample t-test - checks if the mean of one sample is different to the mean of another sample. There are two main types:
  • Paired tests - compares the means of measurements from the same individual or object (e.g. in a pre- and post-test - the comparison of means and before and after scores compares the same student’s scores in the first test with the second).
  • Unpaired t-tests - compares the means of two unrelated groups (for example, are the mean science scores of boys and girls in a school different).

For more information on t-tests, see chapter 13, in Navaro’s Learning Statistics with R.

2.4 One Sample t-tests

One sample t-tests are used to compare the mean of a sample of a data set with a value. This value is usually a known population mean or a theoretical population mean depending on the context of the test.

For a two-tailed test, the hypotheses are as follows

  • Null hypothesis: sample mean = population mean;
  • Alternative hypothesis: sample mean ≠ population mean.

For a one-tailed test, the hypotheses are as follows (depending on the direction of the test)

Either,

  • Null hypothesis: sample mean = population mean;
  • Alternative hypothesis: sample mean > population mean.

Or,

  • Null hypothesis: sample mean = population mean;
  • Alternative hypothesis: sample mean < population mean.

Example: Are Females’ Reading Scores in the UK the Same or Different on Average to the UK Average?

Let’s look at an example of where we might want to consider a sample, in this case females’ reading scores in the UK, with a presumed population mean, the average reading score in the UK. This is a presumed mean, since we don’t know the true population mean here, but we do know the average reading score from the PISA data, which is 494 (taken from PISA 2022).

Following the steps for performing a hypothesis test, we will start by selecting the appropriate test. Since we are comparing the mean of one sample with a population mean, then we need to use a one sample t-test.

First, you will need to load the PISA data:

# Load PISA Data
PISA_2022 <- read_parquet(r"[add the link to your data file]")

To ensure we can use this test we need to make sure the conditions are met. Since PISA tests are constructed in a way which means they are normally distributed, then it is likely our sample is also normally distributed. However, it is good practice to check the distribution using a histogram (or qqplot) at the very least, and where appropriate, a normality test. Due to the large sample size normality testing isn’t usually required, so we will just consider a histogram (or qqplot) here.

# Define PISA UK Girls Dataset
PISA_2022_Read_Girls <- PISA_2022 %>%
2  select(CNT, PV1READ, ST004D01T) %>%
3  filter(CNT == "United Kingdom" & ST004D01T == "Female")

# Plot histogram
4ggplot(data = PISA_2022_Read_Girls,
5       aes(x = PV1READ)) +
6  geom_histogram(binwidth = 5, fill = "darkseagreen4")
2
select the variables of interest
3
filter for the UK and for girls
4
pass the data to ggplot
5
put reading scores (PV1READ) on the x-axis
6
use geom_histogram to create a histogram. We can define that the reading data is divided into bins of width 5 points (binwidth = 5) and set the colour of the bars (fill = "darkseagreen4")

We can see from the histogram that the sample is approximately normally distributed (as it follows a bell shaped curve).

Next, we state the null and alternative hypotheses. Since we are determining whether the average reading score is the same or different, we are using a two-tailed test. This gives us the following hypotheses:

  • Null hypothesis: mean reading score for females in UK = mean reading score in UK
  • Alternative hypothesis: mean reading score for females in UK ≠ mean reading score in UK

Having set up the hypotheses, we now need to calculate the probability of the null hypothesis being true. To do this, we use R’s t.test function and compare the scores for females in reading in the UK with the OECD average score for reading, which is defined as 476 (OECD website).

OECD_ave_read_UK <-(476)
t.test(PISA_2022_Read_Girls$PV1READ, mu=OECD_ave_read_UK)

    One Sample t-test

data:  PISA_2022_Read_Girls$PV1READ
t = 19.064, df = 6396, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 476
95 percent confidence interval:
 497.7143 502.6918
sample estimates:
mean of x 
  500.203 

We can see from the result that we get a p-value of < 2.2e-16, which is very small. As this is lower than 0.025 (for a two-tailed test) we reject the null hypothesis.

Therefore, we accept the null hypothesis that the average reading score for females in the UK is different to the overall average reading score in the UK. Looking at the average stated in the t-test, the average score is higher for girls, at 500.23.

2.5 Unpaired Two Sample t-tests

Unpaired two sample t-tests are used to compare the means of two, unconnected samples. They are unconnected in the sense that the samples taken are independent and are measurements of two unrelated groups.

For a two-tailed test, the hypotheses are as follows

  • Null hypothesis: sample mean of first group = sample mean of second group;
  • Alternative hypothesis: sample mean of first group ≠ sample mean of second group.

For a one-tailed test, the hypotheses are as follows (depending on the direction of the test)

Either,

  • Null hypothesis: sample mean of first group = sample mean of second group;
  • Alternative hypothesis: sample mean of first group > sample mean of second group.

Or,

  • Null hypothesis: sample mean of first group = sample mean of second group;
  • Alternative hypothesis: sample mean of first group < sample mean of second group.

Example: Are Males’ and Females’ Maths Scores in the USA the Same or Different to One Another?

Let’s say we want to discern if the average maths score differ for males and females in the USA. We want to test to see if there is a statistical difference between the means.

Following the steps for performing a hypothesis test, we will start by selecting the appropriate test. Since we are comparing the means of two, unrelated samples (since they are the measurements of two different groups that are not connected) then we use an unpaired t-test.

Again, we need to check if the groups are approximately normally distributed so check the histogram of each group.

# Create a US maths subsets
USMATH <- PISA_2022 %>%
2  select(CNT, PV1MATH,ST004D01T) %>%
3  filter(CNT == "United States")

USMATHGIRLS <- USMATH %>%
4  filter(ST004D01T == "Female")

USMATHBOYS <- USMATH %>%
6  filter(ST004D01T == "Male")
# Plot histogram
# For girls
7ggplot(data = USMATHGIRLS,
8       aes(x = PV1MATH)) +
  geom_histogram(binwidth = 5, fill = "darkseagreen4")
2
select the variables of interest, country, maths scores, and gender
3
filter for the US
4
create a new data frame USMATHGIRLS of the data for girls
6
pass the data to ggplot
7
put maths scores (PV1MATH) on the x-axis
8
use geom_histogram to create a histogram. We can define that the reading data is divided into bins of width 5 points (binwidth = 5) and set the colour of the bars (fill = "darkseagreen4")

# For boys
ggplot(data=USMATHBOYS,
       aes(x=PV1MATH)) +
  geom_histogram(binwidth = 5, fill = "red")

Again, we can see both samples of males and females are approximately normally distributed.

Next, we state the null and alternative hypotheses. Since we are determining whether the average maths scores are the same or different, we are using a two-tailed test. This gives us the following hypotheses:

  • Null hypothesis: mean maths score for males in US = mean maths score for females in UK
  • Alternative hypothesis: mean maths score for males in US ≠ mean maths score for females in UK.

Next, we calculate the probability of the null hypothesis being true. As before, we use R’s t.test function and compare the mean scores of males and females in the US.

# Plot a two-sided, unpaired t-test
t.test(PV1MATH ~ ST004D01T, data = USMATH,
       alternative = "two.sided")

    Welch Two Sample t-test

data:  PV1MATH by ST004D01T
t = -3.9455, df = 4499.5, p-value = 8.086e-05
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -16.59477  -5.57744
sample estimates:
mean in group Female   mean in group Male 
            456.1248             467.2109 

We can see from the t-test result that the p-value returned is 8.086e-05. As this is lower than 0.025 (for a two-tailed test) we reject the null hypothesis.

As such, we accept the alternative hypothesis that the average scores for males and females in the US are statistically different, with the average score for males being higher than average score for females.

2.6 Paired Two Sample t-tests

Paired two sample t-tests compare the mean of a sample to the mean of a related sample. Usually, this is where there is a repeated measure taken of the same individuals or objects, such as a pre and post test. It can also be used for comparing a pair of measurements taken of the same individuals or objects.

For a two-tailed test, the hypotheses are as follows

  • Null hypothesis: sample mean of first group = sample mean of second group;
  • Alternative hypothesis: sample mean of first group ≠ sample mean of second group.

For a one-tailed test, the hypotheses are as follows (depending on the direction of the test)

Either,

  • Null hypothesis: sample mean of first group = sample mean of second group;
  • Alternative hypothesis: sample mean of first group > sample mean of second group.

Or,

  • Null hypothesis: sample mean of first group = sample mean of second group;
  • Alternative hypothesis: sample mean of first group < sample mean of second group.

Example: Are there differences between science and maths scores on average in the UK?

Let’s say we want to compare the science and maths scores of students in the UK to see if there is a difference in their average score or not.

Following the steps for performing a hypothesis test, we will start by selecting the appropriate test. Since we are comparing the means of two, related samples (since they are two measurements of the same students, therefore related) then we use a paired t-test.

Again, we need to check if the groups are approximately normally distributed so check the histogram of each group.

# Create a US maths subsets
UKMATH<-PISA_2022 %>%
  select(CNT, PV1MATH,ST004D01T) %>%
  filter(CNT == "United Kingdom")

UKSCIENCE<-PISA_2022 %>%
  select(CNT, PV1SCIE, ST004D01T) %>%
  filter(CNT == "United Kingdom")

# Plot histogram
ggplot(data = UKMATH,
       aes(x = PV1MATH)) +
  geom_histogram(binwidth = 5, fill = "darkseagreen4")

ggplot(data = UKSCIENCE,
       aes(x = PV1SCIE)) +
  geom_histogram(binwidth = 5, fill = "red")

Again, we can see the two samples are approximately distributed so can use the t-test.

Next, we state the null and alternative hypotheses. Since we are determining whether the average maths score is the same or different to the average science score, we are using a two-tailed test. This gives us the following hypotheses:

  • Null hypothesis: mean maths score in UK = mean science score in UK
  • Alternative hypothesis: mean maths score in UK ≠ mean science score in UK

Having set up the hypotheses, we now need to calculate the probability of the null hypothesis being true. Again, we use R’s t.test function and compare the average maths scores with the average science scores.

# Perform a two-sided, paired t-test
t.test(UKMATH$PV1MATH, UKSCIENCE$PV1SCIE, 
       paired = TRUE, alternative = "two.sided")

    Paired t-test

data:  UKMATH$PV1MATH and UKSCIENCE$PV1SCIE
t = -21.651, df = 12971, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -10.602602  -8.842174
sample estimates:
mean difference 
      -9.722388 

We return a p-value of < 2.2e-16, which is less than 0.025 (for a two-tailed test), meaning we reject the null hypothesis that there is no difference between the average score in maths and science.

Therefore, we accept the alternative hypothesis that there is a difference in scores in maths and science in the UK. This difference is approximately 1.5 points.

Tip

In R, there are often multiple ways of doing the same tests. Note with t.test there are two forms you can use, depending on the format of the data you want to test.

First, in the example above, we had two data frames, for math scores, and science scores, so we can pass two data frames to t.test.

# Carry out a two-sided, paired t-test
t.test(UKMATH$PV1MATH, UKSCIENCE$PV1SCIE, 
       paired = TRUE, alternative = "two.sided")

    Paired t-test

data:  UKMATH$PV1MATH and UKSCIENCE$PV1SCIE
t = -21.651, df = 12971, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -10.602602  -8.842174
sample estimates:
mean difference 
      -9.722388 

Alternatively, our data might be in one data frame, with a variable indicating the two groups. For example, if we wanted to perform a test of UK science scores, we might create a single data frame in which the two groups are indicated by the column ST004D01T, gender. In that case, we can use the ~ operator to do a t.test on the PISAUK$PV1SCIE (i.e. the science scores) by (~) gender.

UKSci<-PISA_2022 %>%
  select(CNT, PV1SCIE, ST004D01T) %>%
  filter(CNT=="United Kingdom")

# Plot a two-sided, unpaired t-test
t.test(UKSci$PV1SCIE ~ ST004D01T, data = UKSci,
       alternative = "two.sided")

    Welch Two Sample t-test

data:  UKSci$PV1SCIE by ST004D01T
t = -3.9298, df = 12964, p-value = 8.547e-05
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -10.553500  -3.529142
sample estimates:
mean in group Female   mean in group Male 
            488.6962             495.7375 

3 Anova (multiple group comparisons)

We have performed t-tests of the difference in mathematics scores between boys and girls in the UK. We may now want to perform the same t-test for all the countries in the sample. This can create a problem. Remember that a p-value report the probability that an outcome is due to chance. The more times we repeat a t-test, the more likely we are to see an effect, even when none-exists.

To avoid potential artefacts from doing multiple t-tests, when we want to compute the means of more than two groups, we use an anova (analysis of variance) test.

To determine if there is a statistically significant difference between groups we run an anova calculation, using the aov function in R. The dependent variable is the one we are interested in explaining, the independent variables are the factors we think might explain the variance.

aov(data, dependent_var ~ independent_var + independent_var + ...)

For example, we used t-tests to look at differences in scores between a pair of countries (e.g the UK and the US). We can now compare across all the countries.

We pass the function the data we wish to focus on (PV1MATH) and then indicate we wish to look for variation in the PV1MATH, by CNT. We then summarise the result to get a table.

ALLPISAMATH<-PISA_2022 %>%
1  select(CNT, PV1MATH)

2resaov <- aov(data = ALLPISAMATH, PV1MATH ~ CNT)
3sumresaov <- summary(resaov)
4sumresaov
1
select the variables of interest, country, and maths scores
2
use aov to perform the anova calculation. We set the data we are passing (data = ALLPISAMATH) and set that we want to determine the variance in maths scores by country (PV1MATH ~ CNT) - we put the output into a new variable resaov
3
we produce a summary of the output summary(resaov)
4
print the result
                Df    Sum Sq  Mean Sq F value Pr(>F)    
CNT             79 1.811e+09 22927902    3089 <2e-16 ***
Residuals   613664 4.554e+09     7421                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova looks to determine if there is a difference in the variance of means across countries. After determining any variation of means from those two variables, any remaining variation is associated with residuals - you can think of this as the unexplained variation that isn’t associated with the vectors we specified (i.e. the country).

First, look at the Pr(>F) column. This is a test of significance and reports if the groups in that row show statistically significant variation. In our case, country returns Pr(>F) <2e-16 *** - which is lower than 0.05. This suggests there are significant differences across countries.

This is not terribly interesting so far(!) - but we can do two post-hoc tests (i.e. tests after the first test, to find out the effect size and between which countries differences exist)

3.1 Eta-squared

If we have found out, from an anova that there are significant differences between the two groups, we can next determine how much variation is explained by each group To do this we calculate a variable called eta-squared.

Eta-squared gives the proportion of variance explained by each variable. The eta squared function is in the package lsr so we will install lsr, and then use the result of our anova (resaov) to calculate the eta squared variable using the function etaSquared. To report the value of eta as a percentage we need to multiply the output of eta by 100.

Eta squared tells us the proportion of the total variance that is explained by a variable (you can also think of it as an effect size). An eta squared value of 1 indicates all the variance of a sample is explain by some variable and 0 means the variable is not responsible for any of the observed variance. We can multiply the eta-squared score by 100 to get a value for the percentage of variance explained.

The percentage of variance explained is a useful figure. For example, it has been reported that schools only account for 14% of the variance in progress 8 scores, whilst family explains 43% (Wilkinson, Bryson, and Stokes (2018)).

library(lsr)
eta <- as.data.frame(etaSquared(resaov))
eta <- eta*100
eta
      eta.sq eta.sq.part
CNT 28.45522    28.45522

The important column here is the eta.sq column - it tells us that the country explains 28% of the variance in test scores. (The second column contains information about partial eta-squared, which we won’t go into, but are used when the results of each measure are not independent i.e. one result influences another).

Note

Interpretations of effect size vary. In some fields, 2% - 13% is read as small, 13% -26% medium and over 26% large (Shevlin and Miles 2000; J. Cohen 1988). In educational research, effect sizes are typically small, hence different categorisations are applied.

Interpretation Effect Size(Kraft 2020) Effect Size (Sanders, Mitchell, and Ni Chonaire 2020)
Small < 5% 3% - 10%
Medium 5% to 20% 10% - 20%
Large > 20% > 20%

3.2 Tukey’s HSD

When an anova test reports that are some statistically significant differences between groups, it does not imply there are statistically significant differences between all subgroups. For example, if the anova reports statistically significant differences by age, statistically significant differences might exist between 11, 12 and 13-year old students, but not between 13 and 14-year olds.

We can use an additional anova test, Tukey’s Honest Significant Difference test (or Tukey’s HSD for short), to find out which pairs of subgroups have statistically significant differences in means.

Consider the question: Are there statistically significant differences in the variance of science scores of the UK, US, France and Germany? To determine if such differences exist, we create a new subset for those countries’ science scores, and then run an anova test by country, reporting the eta squared value.

PISAMULTI <- PISA_2022 %>%
1  select(CNT, PV1SCIE) %>%
2  filter(CNT == "United Kingdom" | CNT == "United States" |
           CNT == "Germany" | CNT == "France")

3resaov<-aov(PV1SCIE ~ CNT, data = PISAMULTI)
4sumresaov<-summary(resaov)
sumresaov

5eta<-as.data.frame(etaSquared(resaov))
6eta<-eta*100
7eta
1
select the variables of interest, country, and maths scores
2
filter for countries of interest. Note a shorter way to do the filter is filter(CNT %in% c("United Kingdom", "United States", "Germany", "France"))
3
use aov to perform the anova calculation. We set the data we are passing (data = ALLPISAMATH) and set that we want to determine the variance in maths scores by country (PV1MATH ~ CNT) - we put the output into a new variable resaov
4
produce a summary of resaov
5
use etaSquared(resaov) to find the eta squared value. We convert this to a data frame so we can perform the next steps
6
to convert the eta squared value into a percentage of variance explained, we multiply by 100
7
print the result
               Df    Sum Sq Mean Sq F value Pr(>F)    
CNT             3    993358  331119   30.24 <2e-16 ***
Residuals   30406 332902262   10949                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
       eta.sq eta.sq.part
CNT 0.2975057   0.2975057

The anova results tell us there are significant differences between the countries, which account for 0.3 % of variance in scores. We can then run a Tukey HSD test to determine which countries have significant differences between mean scores.

TukeyHSD(resaov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = PV1SCIE ~ CNT, data = PISAMULTI)

$CNT
                                   diff        lwr       upr     p adj
France-Germany               -13.871343 -18.613536 -9.129150 0.0000000
United Kingdom-Germany        -2.989572  -7.159141  1.179997 0.2536226
United States-Germany          2.995899  -2.266149  8.257948 0.4602611
United Kingdom-France         10.881771   6.851391 14.912151 0.0000000
United States-France          16.867243  11.714786 22.019699 0.0000000
United States-United Kingdom   5.985472   1.354627 10.616316 0.0049773

From that we get a table with p values (p adj) for different pairs of countries. Note that these are below 0.05 for France-Germany, UK-France, US-UK and US-France, but greater than 0.05 for UK-Germany, US-Germany. So we can conclude there are significant differences in science scores between France-Germany, UK-France, US-UK and US-France but not between the UK and Germany, and the US and Germany.

4 Seminar Tasks

4.1 Task 1 - Choice of tests

Consider the following scenarios. Which (if any) of the t-tests would be most of appropriate, including whether it would be a one or two tailed test. Your answer should choose from one of the following options:

  • unpaired, one-tailed test
  • unpaired, two-tailed test
  • paired, one-tailed test
  • paired, two-tailed test
Note

One and two-tailed tests

Remember that a one-tailed test only looks for differences in one direction from the mean (for example, that one sample has a higher mean than the other). A two-tailed tests tests for the possibility of the means of the two groups being higher or lower than each other.

  1. An educational researcher wants to test a teaching approach with a group of 12 students to see if an intervention increases performance in their year 6 SATs results. Pupils are given a SATS paper before the intervention and again after. The results of these are normalised and then compared.

  2. A teacher wants to see if their pupils’ GCSE computer science test scores are in line with the national average or not. The tests are out of 180 marks.

  3. The DfE want to compare uptake of STEM subjects pre and post pandemic to see if there has been any change in the percentage of students studying at least one STEM subject for A level.

  4. A school wants to compare salaries of staff between this academic year and last academic year to see whether there has been a significant increase.

  5. The same school as in d) wants to compare salaries of support staff and teaching staff to see if there is a significant difference.

  6. A researcher wants to find out how much time pupils spend looking at screens during lesson time, to see if there is a difference between year 7 and year 10 students. 10 pupils in year 7 and 10 pupils in year 10 are randomly selected across a school and observed throughout a school day in November. The times are recorded to the nearest minute.

  7. A business manager wants to check if the financial expenditure on STEM teachers per year is similar to the national average (based on average numbers of pupils on role)

Answer
# Answers

# a)  An educational researcher wants to test a teaching approach with a group of 12 students to see if an intervention increases performance in their year 6 SATs results. Pupils are given a SATS paper before the intervention and again after. The results of these are normalised and then compared.

# Paired (because we compare pairs of results (before and after) for the same student) and two-tailed (because we are interested in the means being higher or lower)

# b)  A teacher wants to see if their pupils' GCSE computer science test scores are in line with the national average or not. The tests are out of 180 marks.

# Unpaired (the teacher compares their students with another pool) and two-tailed (because we are interested in the means being higher or lower)

# c)  The DfE want to compare uptake of STEM subjects pre and post pandemic to see if there has been any change in the percentage of students studying at least one STEM subject for A level.

# Unpaired (the students taking STEM subjects before and after the pandemic will be different individuals from different cohorts) and two-tailed (because we are interested in the means being higher or lower)

# d)  A school wants to compare salaries of staff between this academic year and last academic year to see whether there has been a significant increase.

# Possibly paired (if the staff body has remained the same), and two-tailed (because we are interested in the means being higher or lower)

# e)  The same school as in d) wants to compare salaries of support staff and teaching staff to see if there is a significant difference.

# Unpaired (because there are two different groups, teachers and support staff), and two-tailed (because we are interested in the mean salaries being higher or lower)

# f)  A researcher wants to find out how much time pupils spend looking at screens during lesson time, to see if there is a difference between year 7 and year 10 students. 10 pupils in year 7 and 10 pupils in year 10 are randomly selected across a school and observed  throughout a school day in November. The times are recorded to the nearest minute.

# Unpaired (because there are two different groups, Y7 and Y10), and two-tailed (because we are interested in the means of times being higher or lower)

# g)  A business manager wants to check if the financial expenditure on STEM teachers per year is similar to the national average (based on average numbers of pupils on role)

# Unpaired (because there are two different groups, the school and the national cohort), and two-tailed (because we are interested in the expenditure means being higher or lower)

4.2 Task 2 - One Sample T-tests

  1. Use the PISA dataset and the PV1SCIE (Science score) variable to perform a two-sided, one sample t-test to determine if students in the UK perform similarly or differently from the OECD average of 468 points.
Answer
# Are there differences between the mean scores of UK students and the OECD average of 468 in science?

# Select the science score columns (PV1SCIE)
# Filter the data to select UK responses

SCIEUK <- PISA_2022 %>%
  select(CNT, PV1SCIE) %>%
  filter(CNT == 'United Kingdom')

# The conditions to do a t-test include that the data are normally distributed
# Let us check the conditions are met by calculating if the data sets is normally
# distributed using a histogram or qqplot
ggplot(data=SCIEUK,
       aes(x=PV1SCIE)) +
  geom_histogram(binwidth = 5, fill="darkseagreen4")

qqnorm(SCIEUK$PV1SCIE)
qqline(SCIEUK$PV1SCIE, col = "red")

# The plots produces relatively straight lines so the distributions can be assumed to be normal

# Plot a two-sided one sample t-test to compare with OECD Science average of 468
t.test(SCIEUK$PV1SCIE, mu = 468)


# The p-value is <0.05 (2.2e-16) meaning we reject the null hypothesis that there is no statistical differences between the UK and OECD average and that instead accept the alternative hypothesis that there is a statistical difference
  1. Choose a different country of interest to their scores with one of the OECD averages for either PV1MATH (472 points), PV1READ (476 points) or PV1SCIE (468 points).
Answer
# Are there differences between the mean score in Korea and the OECD in mathematics?
# Select the maths score columns (PV1MATH)
# Filter the data to select Korea responses

MATHKOR <- PISA_2022 %>%
  select(CNT, PV1MATH) %>%
  filter(CNT == 'Korea')

# The conditions to do a t-test include that the data are normally distributed
# Let us check the conditions are met by calculating if the data sets is normally
# distributed using a histogram or qqplot
ggplot(data=MATHKOR,
       aes(x=PV1MATH)) +
  geom_histogram(binwidth = 5, fill="darkseagreen4")

qqnorm(MATHKOR$PV1MATH)
qqline(MATHKOR$PV1MATH, col = "red")

# The plots produces relatively straight lines so the distributions can be assumed to be normal

# Plot a two-sided one sample t-test to compare with OECD Science average of 468
t.test(MATHKOR$PV1MATH, mu = 472)


# The p-value is <0.05 (2.2e-16) meaning we reject the null hypothesis that there is no statistical differences between Korea and OECD average and that instead accept the alternative hypothesis that there is a statistical difference

4.3 Task 3 - Two Sample Unpaired T-tests

  1. Use the PISA dataset and the gender ST004D01T and PV1SCIE (Science score) variables to perform a two-sided, unpaired t-test to determine if boys and girls perform differently on the test in the UK.
Answer
# Are there differences between the mean scores of UK boys and girls in PISA mathematics?

# Select the gender (ST004D01T) and maths score columns (PV1MATH)
# Filter the data to select UK responses

MaleUK <- PISA_2022 %>%
  select(CNT, ST004D01T, PV1SCIE) %>%
  filter(CNT == 'United Kingdom',
         ST004D01T == 'Male')

FemaleUK <- PISA_2022 %>%
  select(CNT, ST004D01T, PV1SCIE) %>%
  filter(CNT == 'United Kingdom',
         ST004D01T == 'Female')

# The conditions to do a t-test include that the data are normally distributed
# and there is homogeneity (similarity) of the variances (the squared standard deviations)
# Let us check the conditions are met by calculating first if the data sets are normally
# distributed using a histogram or qqplot
ggplot(data=MaleUK,
       aes(x=PV1SCIE)) +
  geom_histogram(binwidth = 5, fill="darkseagreen4")

ggplot(data=FemaleUK,
       aes(x=PV1SCIE)) +
  geom_histogram(binwidth = 5, fill="darkseagreen4")

qqnorm(MaleUK$PV1SCIE)
qqline(MaleUK$PV1SCIE, col = "red")

qqnorm(FemaleUK$PV1SCIE)
qqline(FemaleUK$PV1SCIE, col = "red")

# The plots produces relatively straight lines so the distributions can be assumed to be normal
#
# We will then check the variances of the two data sets

VarM <- var(MaleUK$PV1SCIE, na.rm = TRUE)
VarF <- var(FemaleUK$PV1SCIE, na.rm = TRUE)
VarM / VarF

# The variance ratio is close to 1 (1.1)
# So our two conditions are met and can we can perform the t-test


# Plot a two-sided, unpaired t-test
t.test(MaleUK$PV1SCIE, FemaleUK$PV1SCIE, 
       paired = FALSE, alternative = "two.sided")

# The p-value is <0.05 (8.547e-05) meaning we reject the null hypothesis that there is no statistical differences between boys and girls on average and accept the null hypothesis that there is a statistical difference.
  1. Choose a different country of interest to compare performance in males and females in that country in either mathematics, reading or science.
Answer
# Are there differences between the mean scores of boys and girls in Japan PISA mathematics?
#
# Select the gender (ST004D01T) and math score columns (PV1MATH)
# Filter the data to select UK responses

MaleJapan <- PISA_2022 %>%
  select(CNT, ST004D01T, PV1MATH) %>%
  filter(CNT == 'Japan', ST004D01T == 'Male')

FemaleJapan <- PISA_2022 %>%
  select(CNT, ST004D01T, PV1MATH) %>%
  filter(CNT == 'Japan', ST004D01T == 'Female')

# The conditions to do a t-test include that the data are normally distributed


qqnorm(MaleJapan$PV1MATH)
qqline(MaleJapan$PV1MATH, col = "red")

qqnorm(FemaleJapan$PV1MATH)
qqline(FemaleJapan$PV1MATH, col = "red")


# The plots produces relatively straight lines so the distributions can be # assumed to be normal
#
# We will then check the variances of the two data sets

VarM<-var(MaleJapan$PV1MATH, na.rm = TRUE)
VarF<-var(FemaleJapan$PV1MATH, na.rm = TRUE)
VarM/VarF

# The variance ratio is close to 1 (1.26)
# So our two conditions are met and can we can perform the t-test

# Create a Japan Math subset
CHIMATH <- PISA_2022 %>%
  select(CNT, PV1MATH, ST004D01T) %>%
  filter(CNT == "Japan")
# Plot a two-sided, unpaired t-test
t.test(MaleJapan$PV1MATH, FemaleJapan$PV1MATH,  
       paired = FALSE, alternative = "two.sided")

# The p-value is <0.025 (0.0004128) suggesting there are statistically
# differences between boys and girls in mathematics in Japan
  1. Now try to compare the performance of males and females in Maths in all the OECD countries in the PISA data. You may need to think about how to do this using R without repeating the test for each country.
Answer
MaleOECD <- PISA_2022 %>%
  select(OECD, ST004D01T, PV1MATH) %>%
  filter(OECD == 'Yes', ST004D01T == 'Male')

FemaleOECD <- PISA_2022 %>%
  select(OECD, ST004D01T, PV1MATH) %>%
  filter(OECD == 'Yes', ST004D01T == 'Female')

qqnorm(MaleOECD$PV1MATH)
qqline(MaleOECD$PV1MATH, col = "red")

qqnorm(FemaleOECD$PV1MATH)
qqline(FemaleOECD$PV1MATH, col = "red")


# The plots produces relatively straight lines so the distributions can be assumed to be normal

# We will then check the variances of the two data sets

VarM <- var(MaleOECD$PV1MATH, na.rm = TRUE)
VarF <- var(FemaleOECD$PV1MATH, na.rm = TRUE)
VarM / VarF

# The variance ratio is close to 1 (1.18)
# So our two conditions are met and can we can perform the t-test

# The p-value is <0.05 (<2.2e-16) suggesting there are statistically
# differences between boys and girls

OECD <- PISA_2022 %>%
  select(OECD, ST004D01T, PV1MATH) %>%
  filter(OECD=='Yes')

t.test(MaleOECD$PV1MATH, FemaleOECD$PV1MATH, 
       paired = FALSE, alternative = "two.sided")

4.4 Task 4 - Two Sample Paired T-tests

  1. Use the PISA dataset and the PV1MATH (Maths score) and PV1SCIE (Science score) variables to perform a two-sided, two sample paired t-test to determine if there is a difference in mathematics and science scores for students or not in the United Kingdom.
Answer
# Are there differences between scores in maths and science for students in the UK or not.

# Select the maths score column (PV1MATH) and the science score columns (PV1SCIE)
# Filter the data to select UK responses

MATHSCIUK <- PISA_2022 %>%
  select(CNT, PV1MATH,PV1SCIE) %>%
  filter(CNT == 'United Kingdom')

# The conditions to do a t-test include that the data are normally distributed
# and there is homogeneity (similarity) of the variances (the squared standard deviations)
# Let us check the conditions are met by calculating first if the data sets are normally
# distributed using a histogram or qqplot
ggplot(data=MATHSCIUK,
       aes(x=PV1MATH)) +
  geom_histogram(binwidth = 5, fill="darkseagreen4")

ggplot(data=MATHSCIUK,
       aes(x=PV1SCIE)) +
  geom_histogram(binwidth = 5, fill="darkseagreen4")

qqnorm(MATHSCIUK$PV1MATH)
qqline(MATHSCIUK$PV1MATH, col = "red")

qqnorm(MATHSCIUK$PV1SCIE)
qqline(MATHSCIUK$PV1SCIE, col = "red")

# The plots produces relatively straight lines so the distributions can be assumed to be normal
#
# We will then check the variances of the two data sets

VarM <- var(MATHSCIUK$PV1MATH, na.rm = TRUE)
VarF <- var(MATHSCIUK$PV1SCIE, na.rm = TRUE)
VarM / VarF

# The variance ratio is close to 1 -ish (0.86)
# So our two conditions are met and can we can perform the t-test 

# Conduct a two-sided two sample paired t-test to compare maths and science scores
t.test(MATHSCIUK$PV1MATH,MATHSCIUK$PV1SCIE,paired=TRUE)

# The p-value is <0.05 (2.2e-16) meaning we reject the null hypothesis that there is no differences between maths and science scores for students in the UK, meaning we accept the alternative hypothesis that there is a difference on average.
  1. Use the PISA dataset and the PV1MATH (Maths score) and PV1SCIE (Science score) variables to perform a two-sided, two sample paired t-test to determine if there is a difference in mathematics and science scores for students in the whole PISA set.
Answer
# Are there differences between scores in maths and science for students in the UK or not.

# Select the maths score column (PV1MATH) and the science score columns (PV1SCIE)

MATHSCIPISA <- PISA_2022 %>%
  select(PV1MATH,PV1SCIE) 

# The conditions to do a t-test include that the data are normally distributed
# and there is homogeneity (similarity) of the variances (the squared standard deviations)
# Let us check the conditions are met by calculating first if the data sets are normally
# distributed using a histogram or qqplot
ggplot(data=MATHSCIPISA,
       aes(x=PV1MATH)) +
  geom_histogram(binwidth = 5, fill="darkseagreen4")

ggplot(data=MATHSCIPISA,
       aes(x=PV1SCIE)) +
  geom_histogram(binwidth = 5, fill="darkseagreen4")

qqnorm(MATHSCIPISA$PV1MATH)
qqline(MATHSCIPISA$PV1MATH, col = "red")

qqnorm(MATHSCIPISA$PV1SCIE)
qqline(MATHSCIPISA$PV1SCIE, col = "red")

# The plots produces relatively straight lines so the distributions can be assumed to be normal
#
# We will then check the variances of the two data sets

VarM <- var(MATHSCIPISA$PV1MATH, na.rm = TRUE)
VarF <- var(MATHSCIPISA$PV1SCIE, na.rm = TRUE)
VarM / VarF

# The variance ratio is close to 1 -ish (0.94)

# Conduct a two-sided two sample paired t-test to compare maths and science scores
t.test(MATHSCIPISA$PV1MATH,MATHSCIPISA$PV1SCIE,paired=TRUE)

# The p-value is <0.05 (2.2e-16) meaning we reject the null hypothesis that there is no differences between maths and science scores, meaning we accept the alternative hypothesis that there is a difference on average.

4.5 Task 5 - Multiple groups comparisons with anova

  1. For the UK, France, Thailand, and the US determine if there are statistically significant differences in variation in mathematics scores.
answer
PISAMULTI<-PISA_2022 %>%
  select(CNT, PV1MATH)%>%
  filter(CNT == "United Kingdom" | CNT == "Thailand" | CNT == "United States" |
         CNT == "France")


resaov<-aov(PV1MATH ~ CNT, data = PISAMULTI)
sumresaov<-summary(resaov)
sumresaov
               Df    Sum Sq Mean Sq F value Pr(>F)    
CNT             3  24352745 8117582   905.3 <2e-16 ***
Residuals   32785 293984339    8967                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
answer
eta<-as.data.frame(etaSquared(resaov))
eta<-eta*100
eta
      eta.sq eta.sq.part
CNT 7.649987    7.649987
answer
# the differences between countries explains 7.6 % of the variance in scores

TukeyHSD(resaov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = PV1MATH ~ CNT, data = PISAMULTI)

$CNT
                                   diff       lwr        upr     p adj
United Kingdom-France         14.447781  10.80031  18.095251 0.0000000
Thailand-France              -53.254801 -57.21819 -49.291415 0.0000000
United States-France          -6.221682 -10.88462  -1.558739 0.0034071
Thailand-United Kingdom      -67.702582 -71.09801 -64.307154 0.0000000
United States-United Kingdom -20.669463 -24.86035 -16.478575 0.0000000
United States-Thailand        47.033119  42.56457  51.501668 0.0000000
answer
# There are statistically significant differences in mathematics scores for all pairs of countries
  1. For the UK, are there statistically significant differences in reading score by students who speak different languages (LANGN)
answer
# Create a data frame with Reading and school type variables, filte for the uK
UKLangRead<-PISA_2022 %>%
  select(CNT, PV1READ, LANGN) %>%
  filter(CNT == "United Kingdom")

# Perform an anova test for reading scores, school type

resaov<-aov(PV1READ ~ LANGN, data = UKLangRead)
sumresaov<-summary(resaov)
sumresaov
               Df    Sum Sq Mean Sq F value Pr(>F)    
LANGN           9   3653767  405974   37.84 <2e-16 ***
Residuals   12962 139057822   10728                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
answer
# p<2e-16 so there are statistically significant differences in reading scores between speakers of different languages 

# Calculate eta squared

eta<-as.data.frame(etaSquared(resaov))
eta<-eta*100
eta
        eta.sq eta.sq.part
LANGN 2.560245    2.560245
answer
# Differences between language speakers explains 2.6% of the total variance in scores


# Perform the post-hoc Tukey test to determine between which language speaker groups significant differences lie

TukeyHSD(resaov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = PV1READ ~ LANGN, data = UKLangRead)

$LANGN
                                                                     diff
English-Scots                                                 62.84014574
Welsh-Scots                                                  -12.93526998
Scottish Gaelic-Scots                                         57.85860281
Irish-Scots                                                  -33.91929005
Other European languages (QSC)-Scots                          51.39604436
Ulster Scots-Scots                                            20.20519166
A non-European Union language-Scots                           23.91065526
Another language (QUK)-Scots                                  51.46413115
Missing-Scots                                                 31.77456179
Welsh-English                                                -75.77541571
                                                                     lwr
English-Scots                                                  45.851374
Welsh-Scots                                                   -45.517499
Scottish Gaelic-Scots                                         -67.131182
Irish-Scots                                                   -98.058036
Other European languages (QSC)-Scots                           20.170235
Ulster Scots-Scots                                            -33.622454
A non-European Union language-Scots                            -9.506853
Another language (QUK)-Scots                                   31.207521
Missing-Scots                                                  13.175387
Welsh-English                                                -103.973070
                                                                    upr
English-Scots                                                 79.828918
Welsh-Scots                                                   19.646959
Scottish Gaelic-Scots                                        182.848388
Irish-Scots                                                   30.219456
Other European languages (QSC)-Scots                          82.621854
Ulster Scots-Scots                                            74.032838
A non-European Union language-Scots                           57.328163
Another language (QUK)-Scots                                  71.720741
Missing-Scots                                                 50.373737
Welsh-English                                                -47.577761
                                                                 p adj
English-Scots                                                0.0000000
Welsh-Scots                                                  0.9629532
Scottish Gaelic-Scots                                        0.9059913
Irish-Scots                                                  0.8106915
Other European languages (QSC)-Scots                         0.0000086
Ulster Scots-Scots                                           0.9743577
A non-European Union language-Scots                          0.4128165
Another language (QUK)-Scots                                 0.0000000
Missing-Scots                                                0.0000029
Welsh-English                                                0.0000000
 [ reached getOption("max.print") -- omitted 35 rows ]
answer
# There are statistically significant differences in reading scores between speakers of some languages. For example, English-Scots (p = 0.0000000), Irish-English (p=0.0000354) but not for others, for example, Welsh-Scots (p=0.9629532)
  1. For the UK, use ifelse on the HOMEPOS variable to create categories for four quartiles (divide the data into top; upper-mid; lower-mid; and bottom). Use anova to determine between which groups there are statistically significant differences in science scores
HINT

HINT: You can use quantiles <- quantile(UKSci$PV1SCIE, prob=c(.25,.5,.75), na.rm=TRUE) to give the values of PV1SCIE that account for 25%, 50% and 75% of the responses.

answer
# Create a data frame of UK science scores and Wealth data
# Drop levels to remove other countries
UKSci <- PISA_2022 %>%
  select(CNT, PV1SCIE, HOMEPOS) %>%
  filter(CNT == "United Kingdom") %>%
  na.omit() %>%
  droplevels() %>%
  select(PV1SCIE, HOMEPOS)

# Find the cut of points for 25%, 50% and 75% (quantiles) of HOMEPOS

quantiles <- quantile(UKSci$HOMEPOS, prob = c(.25,.5,.75), na.rm=TRUE)

# Note the results are stored in quantiles[1] (25%), quantiles[2] 
# (50%), quantiles[3] (75%)

# Mutate the HOMEPOS column, using ifelse, to substitute the labels, "Bottom",
# "Mid", "upper-mid" and "top" for the quantiles.

UKSci<-UKSci%>%
  mutate(HOMEPOS = ifelse(HOMEPOS<quantiles[1], "Bottom", 
                        ifelse(HOMEPOS < quantiles[2], "Lower-mid",
                               ifelse(HOMEPOS < quantiles[3],
                                      "Upper-mid","Top"))))

# Perform the avova for science scores by wealth, categories into 4 groups

resaov<-aov(PV1SCIE ~ HOMEPOS, data = UKSci)
sumresaov<-summary(resaov)
sumresaov
               Df    Sum Sq Mean Sq F value Pr(>F)    
HOMEPOS         3  13965393 4655131   484.5 <2e-16 ***
Residuals   11483 110321087    9607                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
answer
# p is <2e-16 so significant differences exist between wealth groups on science
# scores

# perform the post hoc Tukey test to find out where statistically signficant differences exist

TukeyHSD(resaov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = PV1SCIE ~ HOMEPOS, data = UKSci)

$HOMEPOS
                         diff       lwr       upr p adj
Lower-mid-Bottom     30.44388  23.79735  37.09042     0
Top-Bottom           94.24677  87.60024 100.89330     0
Upper-mid-Bottom     59.32648  52.67995  65.97302     0
Top-Lower-mid        63.80289  57.15693  70.44885     0
Upper-mid-Lower-mid  28.88260  22.23665  35.52856     0
Upper-mid-Top       -34.92029 -41.56624 -28.27433     0
answer
# There are statistically significant differences in science scores in the UK between all groups by wealth

5 Extension and additional reading

5.1 Recreating a paper in R

Looking at Cook (2014), consider the following

  1. Is it possible to replicate the results using the PISA 2018 data set? Make a graph showing the difference in male and female PV1MATH results for each country CNT. To do this we are going to have to:

    • work out the mean() maths score PV1MATH for each country CNT and gender ST004D01T grouping, call this meanmath and create a new data frame to store this
    • from this data frame, create two new data frames, one for males only and one for females only
    • rename() (see §renaming columns) the meanmath score in each data frame to male_mean and female_mean
    • bind the data frames together using using column bind function cbind(<male_df>, <female_df>) and store this in a new dataframe called Mathgendergap. NOTE: cbind only accepts tables with different names, so you’ll need to select select(CNT, male_mean) from the male dataframe and select(female_mean) from the female dataframe.
    • use mutate (see §mutate) to calculate the difference in male and female mean maths scores for each country
    • plot the results for each country
creating difference dataset
# A relatively simple recreation (without significance testing)
Mathgendergap <- PISA_2022 %>%
  select(CNT, PV1MATH, ST004D01T) %>%
  group_by(CNT, ST004D01T) %>%
  summarise(meanmath=mean(PV1MATH)) %>%
  ungroup()

# alternative, using column binding, cbind():
Mathgendergap<-  cbind(Mathgendergap %>% 
                           filter(ST004D01T == "Male") %>%
                           rename(male_mean = meanmath) %>%
                             select(CNT, male_mean),
                        Mathgendergap %>%
                           filter(ST004D01T == "Female") %>%
                             rename(female_mean = meanmath) %>%
                             select(female_mean))

# alternatively you can use pivot_wider
# Mathgendergap <- pivot_wider(Mathgendergap, names_from = ST004D01T, values_from = meanmath)

Mathgendergap <- Mathgendergap %>%
  mutate(difference = female_mean - male_mean) %>%
  arrange(desc(difference))

The above code doesn’t include the t-test results for each country, to do this we need to run some more complex code, you can see how it works below:

code
# A fuller recreation with t-tests
# conduct a ttest across countries on a specified column
library(broom)

ttest_by_country <- function(data, column = PV1MATH){

  # work out which countries have full 30+ datasets for this ttest
  countries <- data %>% ungroup() %>%
    filter(!is.na({{column}})) %>%    # {{column}} allows you to change the field of focus
    select(CNT, ST004D01T, {{column}}) %>%
    group_by(CNT) %>%
    filter(n() > 30) %>%
    pull(CNT) %>%  # the pull command returns the column as a vector, not a table
    unique()
 
  # list the countries that don't meet that criteria
  message("dropping: ", setdiff(unique(data$CNT), countries), " as too few entries for ttest")
 
  # reduce the dataset to only those countries with 30+ entries
  data <- data %>%
    filter(CNT %in% countries)

  # for each country in this new dataset perform a set of calculations
  test_result <- map_df(unique(data$CNT),
                        function(x){
   
                            # make a subset of the data just for that country
                            df <- data %>% filter(CNT == x)
                           
                            # get the results pull({{column}}) for females and males as two separate vectors
                            f_data <- df %>% filter(ST004D01T == "Female") %>% pull({{column}})
                            m_data <- df %>% filter(ST004D01T == "Male") %>% pull({{column}})
                           
                            # tell us the number of results
                            message(x, " f:", length(f_data), " m:", length(m_data))
                           
                            # work out the means of each vector
                            f_mean <- mean(f_data)
                            m_mean <- mean(m_data)
                           
                            t.test(m_data, f_data) %>%  # conduct a ttest on the male and female results
                              tidy() %>%      # convert the ttest result into a dataframe
                              mutate(CNT = x,           # add columns to record the country
                                     f_mean = f_mean,   # the mean female grade
                                     m_mean = m_mean,   # the mean male grade
                                     gender_diff = m_mean - f_mean,  # the difference between the two
                                     prop_male = length(m_data) / (length(m_data) + length(f_data)))
                                     # and the proportion who are male in the dataset
    })
  return(test_result)
}

plot_ttest_by_country <- function(data, column = "PV1MATH"){

  ggplot(data %>% mutate(sig = p.value < 0.05),
         aes(x=reorder(CNT, gender_diff), y=gender_diff, colour=sig))+
    geom_point(aes(size = prop_male)) +  
    geom_hline(yintercept = 0, lty=2) +  # add a line on 0
    coord_flip() +     # rotate the graph
    xlab("country") +
    ylab("mean(male - female)") +
    ggtitle(paste("gender differences for:", column)) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5,
                                     hjust=1, size=5))
}

# run the first function using 2018 data and the PV1MATH column
ttest_results <- ttest_by_country(PISA_2022, PV1MATH)

#plot the results
plot_ttest_by_country(ttest_results, "PV1MATH")

code
# Alternatively, you could run the following:
# run a ttest for each country
# ttest_results <- PISA_2022 %>%
#     filter(!is.na(PV1MATH)) %>% # Vietnam no results?!
#     select(CNT, ST004D01T, PV1MATH) %>%
#     group_by(CNT) %>%
#     nest(data = c(ST004D01T, PV1MATH)) %>% #create a dataframe of gender results for each country
#     summarise(tt = map(data, function(df){ # apply a ttest to each country
#       t.test(df %>% filter(ST004D01T == "Female") %$% PV1MATH,
#              df %>% filter(ST004D01T == "Male") %$% PV1MATH) %>%
#         tidy() # convert results into a dataframe
#     })) %>%
#     unnest(tt)
  1. What issues are there with using a t-test for the context given in the paper?
  2. How do your findings from question 3 and the 2018 dataset compare with those in the paper? Are there any differences or disagreements with your findings?
  3. How could the paper be improved?

6 Doing t-tests in R

You can find the code from the video below:

Show the Code
# Introduction to t-tests in R
#
# Download data from /Users/k1765032/Library/CloudStorage/GoogleDrive-richardandrewbrock@gmail.com/.shortcut-targets-by-id/1c3CkaEBOICzepArDfjQUP34W2BYhFjM4/PISR/Data/PISA/subset/Students_2018_RBDP_none_levels.rds
# You want the file: Students_2018_RBDP_none_levels.rds
# and place in your own file system
# change loc to load the data directly. Loading into R might take a few minutes
install.packages("nortest")

library(tidyverse)
library(nortest)
loc <- "/Users/k1765032/Library/CloudStorage/GoogleDrive-richardandrewbrock@gmail.com/.shortcut-targets-by-id/1c3CkaEBOICzepArDfjQUP34W2BYhFjM4/PISR/Data/PISA/subset/Students_2022_RBDP_none_levels.rds"
PISA_2022 <- read_rds(loc)

# Are there differences between the mean scores of UK boys and girls in PISA mathematics?
#
# Select the gender (ST004D01T) and math score columns (PV1MATH)
# Filter the data to select UK responses

MaleUK <- PISA_2022 %>%
  select(CNT, ST004D01T, PV1MATH) %>%
  filter(CNT == 'United Kingdom') %>%
  filter(ST004D01T == 'Male')

FemaleUK <- PISA_2022 %>%
  select(CNT, ST004D01T, PV1MATH) %>%
  filter(CNT == 'United Kingdom') %>%
  filter(ST004D01T == 'Female')

# The conditions to do a t-test include that the data are normally distributed
# and there is homogeneity (similarity) of the variances (the squared standard deviations)
# Let us check the conditions are met by calculating first if the data sets are normally
# distributed using the Pearson test of normality from the nortest package

pearson.test(as.numeric(MaleUK$PV1MATH))
pearson.test(as.numeric(FemaleUK$PV1MATH))

# The p-values are over 0.05 so both distriburtions are normal
# Pearson chi-square normality test
#
# data:  as.numeric(MaleUK$PV1MATH)
# P = 75.714, p-value = 0.1936
# Pearson chi-square normality test
#
# data:  as.numeric(FemaleUK$PV1MATH)
# P = 74.06, p-value = 0.2589
#
# We will then check the variances of the two data sets

VarM <- var(MaleUK$PV1MATH)
VarF <- var(FemaleUK$PV1MATH)
VarM/VarF

# The variance ratio is close to 1 (1.1)
# So our two conditions are met and can we can perform the t-test

t.test(MaleUK$PV1MATH, FemaleUK$PV1MATH)

# The p-value is <0.05 (4.061e-08) suggesting there are statistically
# differences between boys and girls

References

Baker, M. 2016. “Statisticians Issue Warning over Misuse of p Values.” Nature.
Cohen, J. 1988. “Statistical Power Analysis for the Behavioral Sciences 2nd Ed.(hillsdale, NJ: L. Erlbaum Associates).”
Cohen, Jacob. 1994. “The Earth Is Round (p<. 05).” American Psychologist 49 (12): 997.
Cook, Di. 2014. “Visiphilia: The Gender Gap in Math Is Not Universal.” Chance 27 (4): 48–52. https://www.tandfonline.com/doi/full/10.1080/09332480.2014.988961.
Geher, Glenn, and Sara Hall. 2014. Straightforward Statistics: Understanding the Tools of Research. Oxford University Press, USA.
Kraft, Matthew A. 2020. “Interpreting Effect Sizes of Education Interventions.” Educational Researcher 49 (4): 241–53.
Sanders, Michael, Chris Mitchell, and Aisling Ni Chonaire. 2020. “Effect Sizes in Education Trials in England.” Available at SSRN 3532325.
Shevlin, Mark, and Jeremy Miles. 2000. “Applying Regression and Correlation: A Guide for Students and Researchers.” Applying Regression and Correlation, 1–272.
Wilkinson, David, Alex Bryson, and Lucy Stokes. 2018. “Assessing the Variance in Pupil Attainment: How Important Is the School Attended?” National Institute Economic Review 243 (1): R4–16. https://docs.iza.org/dp11372.pdf.