Edward 'Skip' Kifer is chair of the Department of Educational Policy Studies and Evaluation and Director of the Office of Research and Graduate Studies in the College of Education, University of Kentucky, 131 TEB, Lexington, KY, 40606, USA. He specializes in quantitative methods and evaluation and tries to promote exploratory data analysis.
JCS invites comments on this paper for publication on the JCS website. Address comments to Ian Westbury, General Editor of JCS, at Westbury@uiuc.edu. All such comments on this paper, and on other papers in the journal, can be accessed at the website.
This paper is copyright © 1997 Taylor & Francis Ltd. ISSN 0022-0272. Copies may be made under the normal terms of copyright law.
When I told people in my department of educational policy studies that I had an opportunity to write a piece about why I like test scores, they, being the skeptics about empiricism one expects from a group in foundations, wondered what journal would allow such a thing. When I told them the journal did not matter, what counted was that a recent issue quoted Groucho Marx (see Hunter and Benson 1997: 87), they understood immediately.
Although there is a rather long tradition of using test scores to make inferences about the efficacy of a curriculum, modern testing and assessment seem to have forgotten it. Tyler (1950), for example, emphasized coherence among goals, curricular experiences, and testing outcomes, broadly construed. Bloom's (1968) mastery learning, with its insistence on formative evaluation, placed a premium on testing what is taught. Nowadays, US educational critics use highly aggregated test data to bash schools, teachers, or students and spend not so much time talking about curriculum. Instead, they have a perverse notion that if one defines standards it matters little what people do – just so they reach the standards. Their confusion only starts with the ends-means problem, but that is an issue for another day.
I am going to hearken back to the good old days and give some examples where results of testing provide conjectures about what is in a curriculum and/or how it has changed. I am going to do this mainly by using more recent methods of exploratory data analysis and visual displays. I understand that I am flying high over many extremely important curricular issues -- fine-grained analyses are called for. The portrayals are a success, however, if you say 'Yes, that raises an interesting curriculum issue'.
International Association for the Evaluation of Educational Achievement (IEA) results
My first example comes from the Second International Mathematics Study (SIMS), where fairness of the comparisons among countries is a crucial issue. My notion is that students in various countries should be exposed to similar curricula if one is to rank order their performance in anything resembling a fair way. If a test unfairly samples a hypothetical international mathematics curriculum, then the comparisons are of curricula, not of students' performance, the ostensible reason for the comparison. The pie charts presented in figure 1 for Population A students (8th grade in most systems; 7th grade in Japan) depict one kind of unfairness.
Figure 1. Variance components for the United States and Japan.
The charts in figure 1 present results obtained by applying a statistical technique called variance decomposition to student scores. One estimates variation within classrooms, between classrooms within schools, and between schools. If schools and classrooms within schools were equal in the sense of delivering mathematics curricula, there would be only variation between students. There can be variation between schools for reasons such as unequal distribution of resources or different types of students attending schools. Differences between classrooms, because these are pretest data, come from the practice of tracking students into 'homogeneous' groups. There are always individual differences among students within classrooms for any of a variety of reasons.
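The three-way split described above can be illustrated with a minimal sketch. It is a descriptive sum-of-squares partition and not necessarily the estimator behind figure 1; the function name `variance_shares` and every score below are made up for illustration and are not SIMS data.

```python
from statistics import mean

def variance_shares(scores):
    """Split the total sum of squares of student scores into three shares.

    scores: {school: {classroom: [student scores]}} (hypothetical layout).
    Returns the share between schools, between classrooms within schools,
    and between students within classrooms.
    """
    all_scores = [x for classes in scores.values()
                  for cls in classes.values() for x in cls]
    grand = mean(all_scores)
    ss_school = ss_class = ss_student = 0.0
    for classes in scores.values():
        school_scores = [x for cls in classes.values() for x in cls]
        school_mean = mean(school_scores)
        # Each student carries the school's deviation from the grand mean.
        ss_school += len(school_scores) * (school_mean - grand) ** 2
        for cls in classes.values():
            class_mean = mean(cls)
            # Each student carries the class's deviation from its school mean.
            ss_class += len(cls) * (class_mean - school_mean) ** 2
            # What remains is variation among students in the same class.
            ss_student += sum((x - class_mean) ** 2 for x in cls)
    total = ss_school + ss_class + ss_student
    return {
        "between_schools": ss_school / total,
        "between_classrooms": ss_class / total,
        "within_classrooms": ss_student / total,
    }

# Made-up scores: a tracked system and an untracked one.
tracked = {
    "A": {"remedial": [8, 10, 12], "algebra": [28, 30, 32]},
    "B": {"general": [14, 16, 18], "algebra": [27, 29, 31]},
}
untracked = {
    "A": {"c1": [10, 20, 30], "c2": [11, 19, 30]},
    "B": {"c1": [9, 21, 30], "c2": [12, 20, 28]},
}

tracked_shares = variance_shares(tracked)
untracked_shares = variance_shares(untracked)
print(tracked_shares)
print(untracked_shares)
```

In the made-up tracked system nearly all variation sits between classrooms, while in the untracked one it sits between students, which is the contrast the pie charts are meant to convey.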
Since almost all the variation in performance in Japan is between students, one can infer that conditions for mathematics learning are pretty much the same regardless of the school a student attends or the classroom she is in. That is not true of the US. The huge between-classroom component suggests quite different things. We have additional analyses that show the major source of variation in the US is due to tracking of students and that the various tracks – remedial, general, pre-algebra and algebra – are given differential opportunities to learn mathematics (Kifer 1992, Kifer, Wolfe, and Schmidt 1992). Westbury (1992) discusses the advantages of being placed in an algebra class and its curricular implications, because the flip side of tracking is differential curricular experiences. In the US, some students are exposed to mathematics in eighth grade that others simply do not see and may never see.
One could argue, I imagine, that tracking is not a curriculum issue. The issue, after all, is that some students are more talented than others and can move through a curriculum at a more rapid pace. Figure 2 speaks to that issue.
Figure 2. Box plots of arithmetic performance by type of eighth grade mathematics class.
(The widths of the boxes are proportional to the size of the sample. About six percent of students are in Mathtype 1 -- Remedial; 54 percent in Mathtype 2 -- General; 28 percent in Mathtype 3 -- Pre-Algebra; and 12 percent in Mathtype 4 -- Algebra.)
Arithmetic performance is a proxy for selection into the different types of class. If that were the only prerequisite and one took the 25th percentile (bottom of the box) of performance in the algebra class as a cut-off for getting into algebra, then over half of the pre-algebra students, half of the general students, and five percent of the remedial students would be in algebra. That would change the percentages in the groups to 5, 27, 12 and 56. Even if one believes in homogeneous grouping (which I do not), these results suggest that the curriculum differentiation is much greater than differences in ability or prior achievement would call for.
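The reallocation arithmetic can be retraced with a short sketch. The starting percentages come from the note to figure 2, and the 5 percent and 50 percent fractions come from the text; the value 0.57 for 'over half' of the pre-algebra students is an assumption chosen for illustration, as is the function name `reallocate`.

```python
# Percent of students in each class type, from the note to figure 2.
current = {"remedial": 6, "general": 54, "pre_algebra": 28, "algebra": 12}

# Fraction of each group at or above the algebra 25th percentile.
# 0.05 and 0.50 are from the text; 0.57 is an ASSUMED stand-in for "over half".
above_cutoff = {"remedial": 0.05, "general": 0.50, "pre_algebra": 0.57}

def reallocate(current, above_cutoff):
    """Move the qualifying fraction of each group into algebra."""
    new = {group: float(share) for group, share in current.items()}
    for group, fraction in above_cutoff.items():
        moved = current[group] * fraction
        new[group] -= moved
        new["algebra"] += moved
    return new

new_shares = reallocate(current, above_cutoff)
for group, share in new_shares.items():
    print(f"{group}: {share:.1f}%")
```

Under that assumption the sketch gives roughly 6, 27, 12 and 55 percent, close to the figures quoted in the text; the small differences would come from rounding and from the assumed pre-algebra fraction.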
I suppose I should make my assumptions explicit. Tracking in mathematics is wrong in the American school system because it denies opportunities for students to be exposed to what is perceived to be the best mathematics we offer. I can imagine other systems and other curricula where such tracking might be defensible. Right now I cannot think of one, however.
I will give two more displays of the international data that I think show curriculum issues. These plots, and the remainder of the paper, are based on the percent of students who answer an item correctly. That is, the unit being displayed is not student scores as in the above pictures. Rather it is a property of test items -- the percent of respondents who answer them correctly.
The pictures presented in figure 3 compare percentage correct on a pretest in September with percentage correct on the same questions given in May, i.e. they represent changes or growth in student performance across a school year. The plot symbols represent broad content areas – arithmetic, algebra, measurement, and geometry – of the international mathematics test. The most intriguing features of this picture are the 'strange' data, that is, the points way above the diagonal. They represent items where there has been dramatic growth from September to May. Notice that these items are dominated by the algebra and geometry content areas.
Figure 3. Post-test percent correct plotted on pretest with content symbols.
Figure 4 gives these same data with a slightly different spin. This time growth is plotted on pretest values and the symbols represent the types of 'behaviour' students need to answer the question correctly. Those are 'calculation', 'estimating', and 'thinking'. I assume that the least cognitively complex response is calculating and the most complex is thinking. Again the data away from the pack are the most interesting. Notice how those items where growth was highest are the ones requiring calculation. There are no dramatic increases in estimating or thinking items.
Figure 4. Growth plotted on pretest values with symbols for type of behaviour needed to answer a question correctly.
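The quantity plotted in figures 3 and 4 is simple to compute: growth for an item is its post-test percent correct minus its pretest percent correct. The sketch below flags items whose growth stands well above the pack using a median-based rule, one common exploratory choice and not necessarily the one used for the figures. The item labels, content codes, percentages, and the threshold `k` are all made up for illustration.

```python
from statistics import median

# (label, content area, pretest % correct, post-test % correct) -- made-up values
items = [
    ("item_01", "arithmetic", 55, 60),
    ("item_02", "arithmetic", 62, 66),
    ("item_03", "measurement", 40, 46),
    ("item_04", "geometry", 30, 37),
    ("item_05", "algebra", 20, 58),
    ("item_06", "arithmetic", 70, 73),
    ("item_07", "geometry", 25, 31),
]

def growth_by_item(items):
    """Return {label: post-test minus pretest percent correct}."""
    return {label: post - pre for label, _content, pre, post in items}

def unusually_high(growth, k=3.0):
    """Flag items whose growth exceeds median + k * MAD.

    MAD is the median absolute deviation from the median; k is an
    arbitrary illustrative threshold. If MAD is zero, every item above
    the median is flagged.
    """
    values = list(growth.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    return [label for label, v in growth.items() if v > centre + k * mad]

growth = growth_by_item(items)
flagged = unusually_high(growth)
print(growth)
print(flagged)
```

With these made-up numbers only the algebra item is flagged, which mirrors the kind of point that sits way above the diagonal in figure 3.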
Taken together these pictures yield a tempting generalization about what students in this international study learn most dramatically: it is newly introduced content (algebra and geometry) where a solution requires little more than a rote calculation. The item on the test which produced the largest growth across eight systems was -2*-3 = .
One can imagine a student attempting this question without having learned the content or rule for its solution. It would be extremely difficult to figure out that the product is +6. On the other hand, students who have been given a rule – a negative times a negative is positive – would solve this problem easily. If the content implied by the item is part of the curriculum, the item is easy. In the US, this is content normally in eighth grade algebra. It is also a topic studied in the pre-algebra course. Students may or may not be exposed to such material in either a remedial course or the general course. In Japan, all students are exposed to algebra. So, the question is, when one compares Japanese performance to US performance, is the comparison a knowledge comparison or a curriculum coverage comparison? You choose and the implication follows.
National Assessment of Educational Progress (NAEP) results
NAEP is called in the US 'the Nation's Report Card' because it draws representative national samples of students for its assessments. Some of the assessments began almost 30 years ago and NAEP has collected trend data at various time points. The next picture is of mathematics trend data for nine-year-old students from 1978 to 1992. Are there changes in performance on the items that would lend themselves to curricular interpretations? Figure 5 shows differences between 1992 and 1978 plotted on the 1978 values.
Figure 5. Differences between performance in 1992 and 1978 plotted on 1978 performance.
Notice the four points represented by a triangle pointing up. They are item results that have changed between 15 and 30 points over the 14 years. That is compared to an average change somewhere around 4 to 5 percent. Table 1 presents the results for those four items.
| Short description of item | Difference | 1992 | 1978 |
|---|---|---|---|
| Compute using data in table | 9.3 | 64.2 | 54.9 |
| Read data in bar graph | 29.0 | 82.8 | 53.9 |
| Interpret data in bar graph | 19.5 | 43.4 | 23.9 |
| Compute with data in bar graph | 29.2 | 57.8 | 28.6 |
It seems likely that changes in emphasis in the curriculum would account for these large differences. Of course, there are other questions of interest in this plot, especially those at the bottom (triangle pointing down) where performance is getting worse by over 10 percent. The short texts for those items are 'Relate part to whole' and 'Solve number sentence'. I need more information in order to interpret those results. I will bet there is a curricular implication, however.
It is fun to see how these changes occur across time. Figure 6 presents a matrix plot of the items across each of the measurement time points. Look at the plot in the upper right-hand corner of figure 6. It is 1992 values plotted on 1978 ones. Notice the triangles in the far right column. Those are the ones with big positive changes. One can go down that column and see how the items emerged from the pack. For example, the bottom right plot is the 1982 data on 1978. The item values line up on this plot. The big change items start to move in 1986, move more in 1990 and are more or less stable from 1990 to 1992. Could the introduction of the National Council of Teachers of Mathematics (NCTM) standards (1986) have anything to do with these changes? Since there was an emphasis in the standards on data and statistics, I would have to bet that these results reflect curricular changes.
I am going to end my picture drawing with a testing example that shows no curriculum effect. That phenomenon is of interest, too. Figure 7 comes from the NAEP Trial State Assessment. In the matrix plot I have the percent correct for students in the US states of Kentucky, California, and Connecticut in grade 4 and grade 8. Again the unit being plotted is percent correct on individual items. The striking thing about this plot is how tight the scatter is for similar grade levels and how loose it is across grade levels. One can see this by comparing the upper left corner (plots for fourth graders) and the bottom right corner (plots for eighth graders) to those in the other corners (either fourth on eighth or vice versa). How could it be that Kentucky eighth grade performance is more highly related to the other states' eighth grade performance than it is to its own fourth grade performance? Notice that the result is true for the other states as well and, I know from other analyses, for those not portrayed in this picture.


If one had a national curriculum without state-by-state variation, this kind of result would be possible. Or, if one had test items that were not related to curricula but to general ability, it might be possible to get such results. In general, however, one would expect grade four performance to predict grade eight performance.
I doubt that we in the US have a national curriculum or that for these grade levels there is no state-by-state variation in curricular emphasis. Therefore, one has to ask what state-by-state comparisons (the purpose for state NAEP) might mean. I am not sure I know the answer so I am not sure either about the comparisons.
I have presented some results from tests that I think raise important issues about curriculum and in turn about test score comparisons, be they international or within the US. Lurking behind these pictures, to my way of thinking, is the formal curriculum. What these test items do is provide an imperfect, selective reflection of curricula writ large. I hope that they are not too imperfect or selective. Rather I hope the results are fascinating and provocative. I think they raise important curricular issues. I hope you do, too. Who knows, I might even gather some converts to the love of test scores.
Acknowledgements
This research was supported by a grant from the American Educational Research Association which receives funds for its 'AERA Grants Program' from the National Science Foundation and the National Center for Education Statistics (U.S. Department of Education) under NSF Grant #RED-9452861. The opinions expressed here reflect those of the author and do not necessarily reflect those of the granting agencies.
References
Bloom, B. S. (1968) Learning for mastery. Evaluation Comment, 1 (2) (Los Angeles: Center for the Study of Evaluation of Instructional Programs, University of California).
Hunter, W. J. and Benson, G. D. (1997) Arrows in time: the misapplications of chaos theory to education. Journal of Curriculum Studies, 29 (1), 87-100.
Kifer, E. (1993) Opportunities, talents and participation. In L. Burstein (ed.), The IEA Study of Mathematics III: Student Growth and Classroom Processes (Oxford: Pergamon), 279-307.
Kifer, E., Wolfe, R. G. and Schmidt, W. (1993) Understanding patterns of student growth. In L. Burstein (ed.), The IEA Study of Mathematics III: Student Growth and Classroom Processes (Oxford: Pergamon), 101-127.
Tyler, R. W. (1950) Basic Principles of Curriculum and Instruction (Chicago: University of Chicago Press).
Westbury, I. (1992) Comparing American and Japanese achievement: is the United States really a low achiever? Educational Researcher, 21 (5), 18-24.