Annex A3. Technical notes on analyses in this volume

Standard errors, confidence intervals, significance tests and p-values

The statistics in this report represent estimates based on samples of students, rather than values that could be calculated if every student in every country had answered every question. Consequently, it is important to measure the degree of uncertainty of the estimates. In PISA, each estimate has an associated degree of uncertainty, which is expressed through a standard error. The use of confidence intervals provides a way to make inferences about the population parameters (e.g. means and proportions) in a manner that reflects the uncertainty associated with the sample estimates. If numerous different samples were drawn from the same population, according to the same procedures as the original sample, then in 95 out of 100 samples the calculated confidence interval would encompass the true population parameter. For many parameters, sample estimators follow a normal distribution and the 95 % confidence interval can be constructed as the estimated parameter, plus or minus 1.96 times the associated standard error.
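As an illustration, a 95 % confidence interval of this kind can be computed directly from a mean estimate and its standard error. The figures below are hypothetical, not actual PISA values:

```python
# Illustrative 95 % confidence interval for a country mean.
# The mean estimate and standard error are hypothetical values.
mean_estimate = 487.0   # estimated country mean score (hypothetical)
standard_error = 2.5    # standard error of that mean (hypothetical)

z = 1.96  # normal quantile for a 95 % confidence level
lower = mean_estimate - z * standard_error
upper = mean_estimate + z * standard_error
print(f"95 % CI: [{lower:.1f}, {upper:.1f}]")  # [482.1, 491.9]
```

If many samples were drawn and this interval recomputed each time, about 95 out of 100 such intervals would contain the true population mean.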

In many cases, readers are primarily interested in whether a given value in a particular country is different from a second value in the same or another country, e.g. whether girls in a country perform better than boys in the same country. In the tables and figures used in this report, differences are labelled as statistically significant when a difference of that size or larger, in either direction, would be observed less than 5 % of the time in samples, if there were actually no difference in corresponding population values. Throughout the report, significance tests were undertaken to assess the statistical significance of the comparisons made.

Some analyses in this volume explicitly report p-values (e.g. Table I.B1.10). A p-value represents the probability, under a specified model, that a statistical summary of the data would be equal to or more extreme than its observed value (Wasserstein and Lazar, 2016[1]). For example, in Table I.B1.10, the p-value represents the likelihood of observing, in PISA samples, a trend equal to or more extreme (in either direction) than what is reported, when in fact the true trend for the country is flat (equal to 0).
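As a sketch of this idea, a two-sided p-value for a trend estimate can be approximated from the estimate and its standard error under a normal sampling distribution. The trend and standard error below are hypothetical, not taken from Table I.B1.10:

```python
import math

def two_sided_p_value(estimate, standard_error):
    """P-value for a value at least as extreme as the estimate, in either
    direction, if the true population value were 0 (normal approximation)."""
    z = abs(estimate / standard_error)
    # math.erfc at z / sqrt(2) gives the two-sided tail probability
    # of a standard normal beyond +/- z
    return math.erfc(z / math.sqrt(2))

# Hypothetical trend: +4.0 score points per 3-year period, standard error 2.0
p = two_sided_p_value(4.0, 2.0)
print(f"p = {p:.3f}")  # about 0.046, significant at the 5 % level
```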

Range of ranks (confidence interval for rankings of countries)

An estimate of the rank of a country mean, across all country means, can be derived from the estimates of the country means from student samples. However, because mean estimates have some degree of uncertainty, this uncertainty should also be reflected in the estimate of the rank. While mean estimates from samples follow a normal distribution, this is not the case for the rank estimates derived from them. Therefore, in order to construct a confidence interval for ranks, simulation methods were used.

Data are simulated assuming that alternative mean estimates for each relevant country follow a normal distribution around the estimated mean, with a standard deviation equal to the standard error of the mean. Some 10 000 simulations are carried out and, based on the alternative mean estimates in each of these simulations, 10 000 possible rankings for each country are produced. For each country, the counts of each rank are accumulated, from the most frequent rank to the least frequent, until they total 9 750 or more. The range of ranks reported for each country includes all the ranks so accumulated (this procedure assumes unimodality of the distribution of rank estimates from samples, but makes no other assumption about this distribution). This means that the range-of-ranks estimates reported in Chapter 4 represent a 97.5 % confidence interval for the rank statistic.
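The simulation procedure described above can be sketched as follows, using hypothetical country means and standard errors (not actual PISA estimates):

```python
# Sketch of the rank-simulation procedure, with hypothetical country
# means and standard errors (not actual PISA estimates).
import numpy as np

rng = np.random.default_rng(42)
means = np.array([503.0, 500.0, 498.0, 490.0])  # hypothetical means
ses = np.array([2.5, 3.0, 2.8, 2.6])            # hypothetical standard errors

n_sim = 10_000
# Alternative mean estimates: one row per simulation, one column per country
draws = rng.normal(means, ses, size=(n_sim, len(means)))
# Rank countries within each simulation (rank 1 = highest mean)
ranks = (-draws).argsort(axis=1).argsort(axis=1) + 1

# For each country, accumulate rank counts from the most frequent rank
# down, until at least 97.5 % of the simulations are covered
for c in range(len(means)):
    counts = np.bincount(ranks[:, c], minlength=len(means) + 1)
    kept, total = [], 0
    for r in np.argsort(counts)[::-1]:
        kept.append(int(r))
        total += counts[r]
        if total >= 0.975 * n_sim:
            break
    print(f"country {c}: range of ranks {min(kept)} to {max(kept)}")
```

Because the distribution of simulated ranks is assumed unimodal, the accumulated ranks form a contiguous interval, which is reported as the range of ranks.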

The main difference between the range of ranks (e.g. Table I.4.4) and the comparison of countries’ mean performance (e.g. Table I.4.1) is that the former takes account of the multiple comparisons involved in ranking countries/economies, while the latter does not. Therefore, the range of ranks sometimes differs slightly from what would be obtained by counting the number of countries above a given country, based on pairwise comparisons of the selected countries’ performance. For instance, based on Table I.4.1, the OECD countries Australia, Denmark, Japan and the United Kingdom have similar mean performance and the same set of countries whose mean score is not statistically different from theirs; but the range of ranks amongst OECD countries for the United Kingdom and Japan can be restricted, with 97.5 % confidence, to between 7th and 15th, while the range of ranks for Australia and Denmark is narrower (between 8th and 14th for Australia; between 9th and 15th for Denmark) (Table I.4.4). When interest lies in examining countries’ rankings, this range of ranks should be used.

The confidence level of 97.5 % for the range-of-ranks estimate was chosen to limit paradoxical situations. Indeed, Tables I.4.1, I.4.2 and I.4.3 determine statistical significance using two-tailed tests, as is usual when testing for statistical significance of mean differences.

When interest lies in ranking two countries relative to each other, however, it is more appropriate to use one-tailed tests, as the procedure described above implicitly does. All cases where the mean score of country A ranks above the mean score of country B result in the same ranking between the two countries, regardless of how far A lies above B’s mean score. For example, the estimate of the mean score of Beijing, Shanghai, Jiangsu and Zhejiang (China) (hereafter “B-S-J-Z [China]”) is higher than the estimate of the mean score of Singapore in reading, and the p-value for observing a difference of that size (or larger, but in the same direction) is 3.4 %. In this situation, a two-tailed test for the difference in mean reading performance between B-S-J-Z (China) and Singapore cannot reject the null hypothesis of equal means at conventional levels of significance (the two-tailed 95 %-confidence interval includes 0), but a one-tailed test would reject equality at the 95 % level. When only two countries are involved in the comparison, a simple way of ensuring consistency between the range of ranks (one-tailed tests) and the comparison of countries’ mean performance (two-tailed tests) is to set the confidence level for the confidence interval on rank statistics at 97.5 %.
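The arithmetic behind this consistency argument can be made explicit: under a symmetric (normal) sampling distribution, the two-tailed p-value is simply twice the one-tailed p-value, so the 3.4 % one-tailed p-value quoted above corresponds to a 6.8 % two-tailed p-value:

```python
# Relation between one-tailed and two-tailed p-values for a mean
# difference, using the 3.4 % one-tailed p-value quoted in the text.
p_one_tailed = 0.034
p_two_tailed = 2 * p_one_tailed  # symmetric (normal) sampling distribution

print(f"one-tailed p = {p_one_tailed:.3f}: rejects equality at the 5 % level")
print(f"two-tailed p = {p_two_tailed:.3f}: cannot reject at the 5 % level")
```

This is why setting the confidence level for the rank statistic at 97.5 % (rather than 95 %) aligns the one-tailed ranking comparisons with the two-tailed mean comparisons.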

Parity index

The parity index for an indicator is used by the UNESCO Institute of Statistics to report on Target 4.5 of the Sustainable Development Goals. It is defined as the ratio of the indicator value for one group to the value for another group. Typically, the group more likely to be disadvantaged is in the numerator, and the parity index takes values between 0 and 1 (with 1 indicating perfect parity). However, in some cases the group in the numerator has a higher value on the indicator. To restrict the range of the parity index between 0 and 2, and to make its distribution symmetrical around 1, an adjusted parity index is defined in these cases.

For example, the gender parity index for the share of students reaching Level 2 proficiency on the PISA scale is computed from the share of boys (pb) and the share of girls (pg) reaching Level 2 proficiency as follows:

Equation I.A3.1.

parity index = pg / pb, if pg ≤ pb
parity index = 2 - (pb / pg), if pg > pb

The “parity index” reported in Tables I.10.2 and I.B1.50 corresponds to the adjusted parity index as defined by the UNESCO Institute of Statistics (UNESCO Institute of Statistics, 2019[2]).
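A minimal sketch of the adjusted parity index as described above (the function name and the example shares are illustrative, not taken from the report):

```python
# Sketch of the adjusted parity index described above: the plain ratio
# when girls have the lower share, and a reflected ratio otherwise,
# keeping the index within [0, 2] and symmetric around 1.
def adjusted_parity_index(p_girls, p_boys):
    if p_girls <= p_boys:
        return p_girls / p_boys
    return 2 - p_boys / p_girls

# Hypothetical shares of students reaching Level 2 proficiency
print(f"{adjusted_parity_index(0.72, 0.80):.2f}")  # girls below boys: 0.90
print(f"{adjusted_parity_index(0.80, 0.72):.2f}")  # girls above boys: 1.10
```

Note the symmetry: swapping the two groups maps an index of 0.90 to 1.10, both at the same distance from perfect parity (1).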

References

[2] UNESCO Institute of Statistics (2019), Adjusted parity index, http://uis.unesco.org/en/glossary-term/adjusted-parity-index (accessed on 8 October 2019).

[1] Wasserstein, R. L. and N. Lazar (2016), “The ASA Statement on p-Values: Context, Process, and Purpose”, The American Statistician, Vol. 70/2, pp. 129-133, http://dx.doi.org/10.1080/00031305.2016.1154108.