11 Interpreting test scores

11.1 Frequency distribution

Marks awarded by counting the number of correct answers on a test script are known as raw marks. '15 marks out of a total of 20' may appear a high mark to some, but in fact the statement is virtually meaningless on its own. For example, the tasks set in the test may have been extremely simple and 15 may be the lowest mark in a particular group of scores. Conversely, the test may have been extremely difficult, in which case 15 may well be a very high mark. Numbers still exert a strange and powerful influence on our society, but the shibboleth that 40 per cent should always represent a pass mark is nevertheless both surprising and disturbing.

TABLE 1
Testee  Mark      Testee  Mark
A       20        N       27
B       25        O       27
C       33        P       29
D       35        Q       25
E       29        R       23
F       25        S       30
G       30        T       26
H       26        U       22
I       19        V       23
J       27        W       33
K       26        X       26
L       32        Y       24
M       34        Z       26

TABLE 2
Testee  Mark    Rank
D       35      1
M       34      2
C       33      3.5 (or 3=)
W       33      3.5 (or 3=)
L       32      5
G       30      6.5 (or 6=)
S       30      6.5 (or 6=)
E       29      8.5 (or 8=)
P       29      8.5 (or 8=)
J       27      11 (or 10=)
N       27      11 (or 10=)
O       27      11 (or 10=)
H       26      15 (or 13=)
K       26      15 (or 13=)
T       26      15 (or 13=)
X       26      15 (or 13=)
Z       26      15 (or 13=)
B       25      19 (or 18=)
F       25      19 (or 18=)
Q       25      19 (or 18=)
Y       24      21
R       23      22.5 (or 22=)
V       23      22.5 (or 22=)
U       22      24
A       20      25
I       19      26

TABLE 3
Mark    Tally   Frequency
40
39
38
37
36
35      /       1
34      /       1
33      //      2
32      /       1
31
30      //      2
29      //      2
28
27      ///     3
26      /////   5
25      ///     3
24      /       1
23      //      2
22      /       1
21
20      /       1
19      /       1
18
17
16
15
                Total 26

The tables above contain the imaginary scores of a group of 26 students on a particular test consisting of 40 items. Table 1 conveys very little, but Table 2, containing the students' scores in order of merit, shows a little more. Table 3 contains a frequency distribution showing the number of students who obtained each mark awarded; the strokes on the left of the numbers (e.g. ////) are called tallies and are included simply to illustrate the method of counting the frequency of scores. Note that normally the frequency list would have been
compiled without the need for Tables 1 and 2; consequently, as the range of highest and lowest marks would then not be known, all the possible scores would be listed and a record made of the number of students obtaining each score in the scale (as shown in the example). Note that where ties occur in Table 2, two ways of rendering these are shown. The usual classroom practice is that shown in the parentheses. Where statistical work is to be done on the ranks, it is essential to record the average rank (e.g. testees J, N and O, each with the same mark, occupy places 10, 11 and 12 in the list, averaging 11). A frequency polygon, plotting the frequency of each mark from 19 to 36, can be drawn to illustrate the distribution of the scores.

11.2 Measures of central tendency

Mode
The mode refers to the score which most candidates obtained: in this case it is 26, as five testees have scored this mark.

Median
The median refers to the score gained by the middle candidate in the order of merit. In the case of the 26 students here (as in all cases involving even numbers of testees), there can obviously be no middle person, and thus the score halfway between the lowest score in the top half and the highest score in the bottom half is taken as the median. The median score in this case is also 26.

Mean
The mean score of any test is the arithmetical average, i.e. the sum of the separate scores divided by the total number of testees. The mode, median, and mean are all measures of central tendency. The mean is the most efficient measure of central tendency, but it is not always appropriate. In the following Table 4 and formula, note that the symbol x is used to denote the score, N the number of testees, and m the mean. The symbol f denotes the frequency with which a score occurs, and the symbol Σ means 'the sum of'.

TABLE 4
x       f       fx
35      1       35
34      1       34
33      2       66
32      1       32
30      2       60
29      2       58
27      3       81
26      5       130
25      3       75
24      1       24
23      2       46
22      1       22
20      1       20
19      1       19
        Total Σfx = 702

m = Σfx / N

Note that Σfx =
702 is the total number of items which the group of 26 students got right between them. Dividing by N = 26 (as the formula m = Σfx/N states) obviously gives the average: 702/26 = 27. It will be observed that in this particular case there is a fairly close correspondence between the mean (27) and the median (26). Such a close correspondence is not always found; it has occurred in this case because the scores tend to cluster symmetrically around a central point.

11.3 Measures of dispersion

Whereas the previous section was concerned with measures of central tendency, this section is related to the range or spread of scores. The mean by itself enables us to describe an individual student's score by comparing it with the average set of scores obtained by a group, but it tells us nothing at all about the highest and lowest scores and the spread of marks.

Range
One simple way of measuring the spread of marks is based on the difference between the highest and lowest scores. Thus, if the highest score on a 50-item test is 43 and the lowest 21, the range is from 21 to 43, i.e. 22. If the highest score, however, is only 39 and the lowest 29, the range is 10. (Note that in both cases the mean may be 32.) The range of the 26 scores given in Section 11.1 is 35 - 19 = 16.

Standard deviation
The standard deviation (s.d.) is another way of showing the spread of scores. It measures the degree to which the group of scores deviates from the mean; in other words, it shows how all the scores are spread out and thus gives a fuller description of test scores than the range, which simply describes the gap between the highest and lowest marks and ignores the information provided by all the remaining scores. Abbreviations used for the standard deviation are s.d., σ (the Greek letter sigma) or s. One simple method of calculating s.d. is shown below:

s.d. = √(Σd²/N)

N is the number of scores and d the deviation of each score from the mean. Thus, working from the 26 previous results, we proceed to:

1.
find out the amount by which each score deviates from the mean (d);
2. square each result (d²);
3. total all the results (Σd²);
4. divide the total by the number of testees (Σd²/N); and
5. find the square root of this result (√(Σd²/N)).

Score   Deviation from mean of 27 (d)   Squared (d²) (Step 2)
35      8                               64
34      7                               49
33      6                               36
33      6                               36
32      5                               25
30      3                               9
30      3                               9
29      2                               4
29      2                               4
27      0                               0
27      0                               0
27      0                               0
26      -1                              1
26      -1                              1
26      -1                              1
26      -1                              1
26      -1                              1
25      -2                              4
25      -2                              4
25      -2                              4
24      -3                              9
23      -4                              16
23      -4                              16
22      -5                              25
20      -7                              49
19      -8                              64
                        (Step 3) Total Σd² = 432

(Step 4) s.d. = √(432/26)
(Step 5) s.d. = √16.62
         s.d. = 4.077 = 4.08

Note: If deviations (d) are taken from the mean, their sum (taking account of the minus signs) is zero: +42 - 42 = 0. This affords a useful check on the calculations involved here.

A standard deviation of 4.08, for example, shows a smaller spread of scores than, say, a standard deviation of 8.96. If the aim of the test is simply to determine which students have mastered a particular programme of work or are capable of carrying out certain tasks in the target language, a standard deviation of 4.08 or any other denoting a fairly narrow spread will be quite satisfactory provided it is associated with a high average score. However, if the test aims at measuring several levels of attainment and making fine distinctions within the group (as perhaps in a proficiency test), then a broad spread will be required.
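The calculations above can be checked quickly in code. The following is a minimal sketch in Python using the standard library's statistics module on the 26 scores from Section 11.1; pstdev divides by N, matching the s.d. = √(Σd²/N) formula used here (the variable names are my own):

```python
# The 26 imaginary scores from Tables 1-3 (testees A to Z).
from statistics import mean, median, mode, pstdev

scores = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34,
          27, 27, 29, 25, 23, 30, 26, 22, 23, 33, 26, 24, 26]

mode_score = mode(scores)                  # 26: five testees scored it
median_score = median(scores)              # 26: halfway point of an even-sized group
mean_score = mean(scores)                  # 27 = 702 / 26
score_range = max(scores) - min(scores)    # 16 = 35 - 19
sd = round(pstdev(scores), 2)              # 4.08 = sqrt(432 / 26), rounded
```

Note the choice of pstdev (population standard deviation) rather than stdev: the latter divides by N - 1, which is appropriate when estimating the spread of a larger population from a sample, not when describing this group's own scores.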
Standard deviation is also useful for providing information concerning the characteristics of different groups. If, for example, the standard deviation on a certain test is 4.08 for one class, but 8.96 on the same test for another class, then it can be inferred that the latter class is far more heterogeneous than the former.

11.4 Item analysis

Earlier, careful consideration of objectives and the compilation of a table of test specifications were urged before the construction of any test was attempted. What is required now is a knowledge of how far those objectives have been achieved by a particular test. Unfortunately, too many teachers think that the test is finished once the raw marks have been obtained. But this is far from the case, for the results obtained from objective tests can be used to provide valuable information concerning:
- the performance of the students as a group, thus (in the case of class progress tests) informing the teacher about the effectiveness of the teaching;
- the performance of individual students; and
- the performance of each of the items comprising the test.

Information concerning the performance of the students as a whole and of individual students is very important for teaching purposes, especially as many test results can show not only the types of errors most frequently made but also the actual reasons for the errors being made. As shown in earlier chapters, the great merit of objective tests arises from the fact that they can provide an insight into the mental processes of the students by showing very clearly what choices have been made, thereby indicating definite lines on which remedial work can be given.
The performance of the test items themselves is of obvious importance in compiling future tests. Since a great deal of time and effort are usually spent on the construction of good objective items, most teachers and test constructors will be desirous of either using them again without further changes or else adapting them for future use. It is thus useful to identify those items which were answered correctly by the more able students taking the test and badly by the less able students. The identification of certain difficult items in the test, together with a knowledge of the performance of the individual distractors in multiple-choice items, can prove just as valuable in its implications for teaching as for testing. All items should be examined from the point of view of (1) their difficulty level and (2) their level of discrimination.

Item difficulty
The index of difficulty (or facility value) of an item simply shows how easy or difficult the particular item proved in the test. The index of difficulty (FV) is generally expressed as the fraction (or percentage) of the students who answered the item correctly. It is calculated by using the formula:

FV = R / N

R represents the number of correct answers and N the number of students taking the test. Thus, if 20 out of 26 students tested obtained the correct answer for one of the items, that item would have an index of difficulty (or a facility value) of .77 or 77 per cent:

FV = 20/26 = .77

In this case, the particular item is a fairly easy one since 77 per cent of the students taking the test answered it correctly.
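The FV formula is simple enough to script directly. A minimal sketch in Python, with the function name facility_value my own invention and the figures taken from the worked example above:

```python
def facility_value(correct: int, total: int) -> float:
    """Index of difficulty FV = R / N, rounded to two places."""
    return round(correct / total, 2)

fv = facility_value(20, 26)   # 0.77: a fairly easy item
```

The same function applied to, say, 13 correct answers out of 26 gives the FV of .5 often aimed at for a test as a whole.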
Although an average facility value of .5 or 50 per cent may be desirable for many public achievement tests and for a few progress tests (depending on the purpose for which one is testing), the facility value of a large number of individual items will vary considerably. While aiming for test items with facility values falling between .4 and .6, many test constructors may be prepared in practice to accept items with facility values between .3 and .7. Clearly, however, a very easy item, on which 90 per cent of the testees obtain the correct answer, will not distinguish between above-average students and below-average students as well as an item which only 60 per cent of the testees answer correctly. On the other hand, the easy item will discriminate amongst a group of below-average students; in other words, one student with a low standard may show that he or she is better than another student with a low standard through being given the opportunity to answer an easy item. Similarly, a very difficult item, though failing to discriminate among most students, will certainly separate the good student from the very good student. A further argument for including items covering a range of difficulty levels is that provided by motivation. While the inclusion of difficult items may be necessary in order to motivate the good student, the inclusion of very easy items will encourage and motivate the poor student. In any case, a few easy items can provide a 'lead-in' for the student - a device which may be necessary if the test is at all new or unfamiliar or if there are certain tensions surrounding the test situation. Note that it is possible for a test consisting of items each with a facility value of approximately .5 to fail to discriminate at all between the good and the poor students. If, for example, half the items are answered correctly by the good students and incorrectly by the poor students while the remaining items are answered incorrectly by the good students but correctly by the
poor students, then the items will work against one another and no discrimination will be possible. The chances of such an extreme situation occurring are very remote indeed; it is highly probable, however, that at least one or two items in a test will work against one another in this way.

Item discrimination
The discrimination index of an item indicates the extent to which the item discriminates between the testees, separating the more able testees from the less able. The index of discrimination (D) tells us whether those students who performed well on the whole test tended to do well or badly on each item in the test. It is presupposed that the total score on the test is a valid measure of the student's ability (i.e. the good student tends to do well on the test as a whole and the poor student badly). On this basis, the score on the whole test is accepted as the criterion measure, and it thus becomes possible to separate the 'good' students from the 'bad' ones in performances on individual items. If the 'good' students tend to do well on an item (as shown by many of them doing so - a frequency measure) and the 'poor' students badly on the same item, then the item is a good one because it distinguishes the 'good' from the 'bad' in the same way as the total test score. This is the argument underlying the index of discrimination.

There are various methods of obtaining the index of discrimination; all involve a comparison of those students who performed well on the whole test and those who performed poorly on the whole test. However, while it is statistically most efficient to compare the top 27½ per cent with the bottom 27½ per cent, it is enough for most purposes to divide small samples (e.g. class scores on a progress test) into halves or thirds. For most classroom purposes, the following procedure is recommended:

1. Arrange the scripts in rank order of total score and divide into two groups of equal size (i.e. the top half and the bottom half). If there is an odd number of
scripts, dispense with one script chosen at random.
2. Count the number of those candidates in the upper group answering the first item correctly; then count the number of lower-group candidates answering the item correctly.
3. Subtract the number of correct answers in the lower group from the number of correct answers in the upper group, i.e. find the difference in the proportion passing in the upper group and the proportion passing in the lower group.
4. Divide this difference by the total number of candidates in one group:

D = (Correct U - Correct L) / n

(D = discrimination index; n = number of candidates in one group*; U = upper half and L = lower half. The index D is thus the difference between the proportion passing the item in U and L.)

5. Proceed in this manner for each item.

The following item, which was taken from a test administered to 40 students, produced the results shown:

I left Tokyo ......... Friday morning.
A. in   (B) on   C. at   D. by

D = 9/20 = .45

Such an item with a discrimination index of .45 functions fairly effectively, although clearly it does not discriminate as well as an item with an index of .6 or .7. Discrimination indices can range from +1 (= an item which discriminates perfectly, i.e. it shows perfect correlation with the testees' results on the whole test) through 0 (= an item which does not discriminate in any way at all) to -1 (= an item which discriminates in entirely the wrong way). Thus, for example, if all 20 students in the upper group answered a certain item correctly and all 20 students in the lower group got the wrong answer, the item would have an index of discrimination of 1.0.

* The reader should carefully distinguish between n (= the number of candidates in either the U or L group) and N (= the number in the whole group) as used previously. Obviously n = ½N.

If, on the other hand, only 10 students in the upper group answered it correctly and furthermore 10 students in the lower group also got correct answers, the discrimination index would be 0.
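The classroom procedure above (rank the scripts, split into halves, compare correct answers) can be sketched in Python. The function name and the sample figures below are invented for illustration, not taken from the tests discussed in this chapter:

```python
def discrimination_index(total_scores, item_correct):
    """D = (Correct U - Correct L) / n, after ranking testees by
    total test score and splitting them into upper and lower halves."""
    ranked = sorted(zip(total_scores, item_correct),
                    key=lambda pair: pair[0], reverse=True)   # step 1
    n = len(ranked) // 2          # an odd middle script is dropped
    upper, lower = ranked[:n], ranked[-n:]
    correct_u = sum(ok for _, ok in upper)    # step 2
    correct_l = sum(ok for _, ok in lower)
    return (correct_u - correct_l) / n        # steps 3 and 4

# Invented results for one item, six testees: all three stronger
# testees answer it correctly; only one of the three weaker ones does.
totals = [38, 35, 31, 24, 20, 17]
item = [True, True, True, False, True, False]
d = discrimination_index(totals, item)   # (3 - 1) / 3, i.e. .67
```

An item every upper-half testee gets right and every lower-half testee gets wrong returns 1.0 from the same function, and the reverse pattern returns -1.0, matching the range of values described above.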
However, if none of the 20 students in the upper group got a correct answer and all the 20 students in the lower group answered it correctly, the item would have a negative discrimination, shown by -1.0. It is highly inadvisable to use again, or even to attempt to amend, any item showing negative discrimination. Inspection of such an item usually shows something radically wrong with it.

Again, working from actual test results, we shall now look at the performance of three items. The first of the following items has a high index of discrimination; the second is a poor item with a low discrimination index; and the third example is given as an illustration of a poor item with negative discrimination.

1. High discrimination index

NEARLY    When Jim crossed the road, he ^ ran into a car.

D = 15/20 = .75    FV = 25/40 = .62

(The item is at the right level of difficulty and discriminates well.)

2. Low discrimination index

If you ......... the bell, the door would have been opened.
A. would ring   (B) had rung   C. would have rung   D. were ringing

D = 4