The Reliability of Intelligence Tests

Creator: Walter Lippmann (author)
Date: November 8, 1922
Publication: The New Republic
Source: Available at selected libraries

SUPPOSE, for example, that our aim was to test athletic rather than intellectual ability. We appoint a committee consisting of Walter Camp, Percy Haughton, Tex Rickard and Bernard Darwin, and we tell them to work out tests which will take no longer than an hour and can be given to large numbers of men at once. These tests are to measure the true athletic capacity of all men anywhere for the whole of their athletic careers. The order would be a large one, but it would certainly be no larger than the pretensions of many well known intelligence testers.


Our committee of athletic testers scratch their heads. What shall be the hour's test, they wonder, which will "measure" the athletic "capacity" of Dempsey, Tilden, Sweetser, Siki, Suzanne Lenglen and Babe Ruth, of all sprinters, Marathon runners, broad jumpers, high divers, wrestlers, billiard players, marksmen, cricketers and pogo bouncers? The committee has courage. After much guessing and some experimenting the committee works out a sort of condensed Olympic games which can be held in any empty lot. These games consist of a short sprint, one or two jumps, throwing a ball at a bull's eye, hitting a punching machine, tackling a dummy and a short game of clock golf. They try out these tests on a mixed assortment of champions and duffers and find that on the whole the champions do all the tests better than the duffers. They score the result and compute statistically what is the average score for all the tests. This average score then constitutes normal athletic ability.


Now it is clear that such tests might really give some clue to athletic ability. But the fact that in any large group of people sixty percent made an average score would be no proof that you had actually tested their athletic ability. To prove that, you would have to show that success in the athletic tests correlated closely with success in athletics. The same conclusion applies to the intelligence tests. Their statistical uniformity is one thing; their reliability another. The tests might be a fair guess at intelligence, but the statistical result does not show whether they are or not. You could get a statistical curve very much like the curve of "intelligence" distribution if instead of giving each child from ten to thirty problems to do you had flipped a coin the same number of times for each child and had credited him with the heads. I do not mean, of course, that the results are as chancy as all that. They are not, as we shall soon see. But I do mean that there is no evidence for the reliability of the tests as tests of intelligence in the claim, made by Terman, (1) that the distribution of intelligence quotients corresponds closely to "the theoretical normal curve of distribution (the Gaussian curve)." He would in a large enough number of cases get an even more perfect curve if these tests were tests not of intelligence but of the flip of a coin.

(1) Stanford Revision Binet-Simon Scale, p. 42.
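
The coin-flipping comparison can be tried directly. What follows is a minimal sketch, not anything drawn from Terman or the army data: the function names, the ten thousand imaginary children and the twenty flips apiece are all assumptions chosen for illustration. Each child is "scored" by counting heads, and the scores are drawn as a rough text histogram.

```python
import random
from collections import Counter


def coin_flip_scores(n_children, n_problems, seed=0):
    """Score each imaginary child by counting heads in n_problems fair flips.

    Hypothetical helper: no child's ability enters the score at any point.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(1 for _ in range(n_problems) if rng.random() < 0.5)
        for _ in range(n_children)
    ]


def text_histogram(scores, width=50):
    """Return one line per score value, with a bar proportional to its count."""
    counts = Counter(scores)
    peak = max(counts.values())
    lines = []
    for score in range(min(counts), max(counts) + 1):
        bar = "#" * round(width * counts[score] / peak)
        lines.append(f"{score:3d} {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed sizes: 10,000 children, 20 "problems" (coin flips) each.
    scores = coin_flip_scores(n_children=10_000, n_problems=20)
    print(text_histogram(scores))
```

The bars should pile up around ten heads and thin out symmetrically on either side, which is the bell shape in question, although nothing about any child has been measured.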


Such a statistical check has its uses of course. It tends to show, for example, that in a large group the bias and errors of the tester have been cancelled out. It tends to show that the gross result is reached in the mass by statistically impartial methods, however wrong the judgment about any particular child may be. But the fairness in giving the tests and the reliability of the tests themselves, must not be confused. The tests may be quite fair applied in the mass, and yet be poor tests of individual intelligence.


We come then to the question of the reliability of the tests. There are many different systems of intelligence testing and, therefore, it is important to find out how the results agree if the same group of people take a number of different tests. The figures given by Yoakum and Yerkes (2) indicate that people who do well or badly in one are likely to do more or less equally well or badly in the other tests. Thus the army test for English-speaking literates, known as Alpha, correlates with Beta, the test for non-English speakers or illiterates, at .80. Alpha with a composite test of Alpha, Beta, and Stanford-Binet gives .94. Alpha with Trabue B and C completion-tests combined gives .72. On the other hand, as we noted in the first article of this series, the Stanford-Binet system of calculating "mental ages" is in violent disagreement with the results obtained by the army tests.

(2) Army Mental Tests, p. 20.
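
For readers who want to see what a figure such as .80 expresses, here is a small sketch of the ordinary (Pearson) correlation coefficient. It assumes that this is the kind of coefficient being reported, which the text does not state. The scores are simulated and entirely invented: two hypothetical tests share a common factor, with spreads chosen so that they agree at about .80, and a third quantity is unrelated to both. None of it is the army data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


def simulate_two_tests(n_people, seed=0):
    """Invented data: two tests sharing one factor, plus an unrelated quantity."""
    rng = random.Random(seed)
    test_a, test_b, unrelated = [], [], []
    for _ in range(n_people):
        shared = rng.gauss(0, 2)               # whatever both tests pick up
        test_a.append(shared + rng.gauss(0, 1))
        test_b.append(shared + rng.gauss(0, 1))
        unrelated.append(rng.gauss(0, 1))      # independent of both tests
    return test_a, test_b, unrelated


if __name__ == "__main__":
    a, b, other = simulate_two_tests(5_000)
    print("test A with test B:       ", round(pearson(a, b), 2))
    print("test A with unrelated item:", round(pearson(a, other), 2))
```

In this sketch the two tests agree closely with each other while agreeing with the third quantity hardly at all. Agreement among tests shows that they pick up something in common; it does not by itself show what that something is.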


Nevertheless, in a rough way the evidence shows that the various tests in the mass are testing the same capacities. Whether these capacities can fairly be called intelligence, however, is not yet proved. The tests are all a good deal alike. They all derive from a common stock, and it is entirely possible that they measure only a certain kind of ability. The type of mind which is very apt in solving Sunday newspaper puzzles, or even in playing chess, may be specially favored by these tests. The fact that the same people always do well with puzzles would in itself be no evidence that the solving of puzzles was a general test of intelligence. We must remember, too, that the emotional setting plays a large role in any examination. To some temperaments the atmosphere of the examination room is highly stimulating. Such people "outdo themselves" when they feel they are being tested; other people "cannot do themselves justice" under the same conditions. Now in a large group these differences of temperament may neutralize each other in the statistical result. But they do not neutralize each other in the individual case.
