Chapter 16






The quality of a test is usually judged by its validity and reliability, two properties that characterize all tests. A test is valid to the extent that its scores permit one to draw appropriate, meaningful, and useful inferences, and it is reliable to the extent that whatever it measures, it measures consistently. Of the two properties, validity is the more important. It is much more difficult, however, to determine a test’s validity than its reliability.

       The concepts are related in such a way that it is sometimes possible to know something about a test’s reliability if one already knows how valid the test is.





The Meaning Of Test Validity

A test is valid only for some purpose and with some group of people. Throughout the following discussion of the relationship between validity and reliability, the term validity will be used to mean that scores on a test permit appropriate inferences to be made about a specific group of people for a specific purpose.


The Meaning Of Test Reliability

What does it mean when we say a test is reliable? We are saying that the test provides a consistent measure. Reliability refers to consistency of measurement; the more reliable a test is, the more consistent the measure.


The Relationship Between Validity And Reliability

If we know a test is valid, can we say anything about its reliability? Yes, we can. If a test provides scores that permit us to draw appropriate inferences, it must be doing so consistently. Consequently, a valid test must be a reliable test. Suppose, on the other hand, we know that a test is not valid. Can we say anything about the reliability of the test? No, we cannot. The fact that the scores on a test do not permit us to draw appropriate inferences does not mean that the test is not measuring anything consistently.

       Consider a different set of circumstances. Suppose we know that a test is reliable. Can we say anything about its validity? No, we cannot. All we know is that the test is measuring something consistently, but we have no way of knowing whether the test scores permit us to draw appropriate inferences. The fact that a test is reliable tells us nothing about the test’s validity. A reliable test may or may not be valid. Suppose we know that a test is not reliable. Can we say anything about its validity? Yes, we can. If the test is not measuring anything consistently, we cannot possibly use the test scores to make appropriate inferences. If the test is not reliable, it is not valid.


The relationship between validity and reliability


If a test is known to be:                 One can conclude:

(1)    Valid          -------------->     the test must be reliable

(2)    Not valid      -------------->     nothing about the reliability of the test

(3)    Reliable       -------------->     nothing about the validity of the test

(4)    Not reliable   -------------->     the test cannot be valid


Test-Retest Reliability

The procedure of checking a test’s reliability by administering the same test twice is called the test-retest method. The same people must take the test twice. Administering the test once to one group of people and again to another group of people does not give any information about the test’s reliability.
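
As a rough illustration of the test-retest method, the sketch below correlates the scores the same people earned on two administrations. Using the Pearson correlation as the reliability coefficient is the usual approach but is an assumption here, since the text above does not name a statistic. The function name `pearson_r` and the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores: the SAME five people, listed in the same order,
# tested on two occasions.
first_administration = [12, 15, 9, 20, 17]
second_administration = [13, 14, 10, 19, 18]

test_retest_r = pearson_r(first_administration, second_administration)
print(f"test-retest reliability estimate: {test_retest_r:.2f}")
```

The two lists must be paired person by person. Scores from two different groups cannot be paired this way, which is why testing different groups says nothing about reliability.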


Parallel Forms Reliability

A second way researchers overcome the impossibility of completely erasing test takers’ memories between two administrations of the same test is to develop two forms of the same test. In other words, two tests are developed to measure the same variable; such tests are usually called parallel forms or equivalent forms.

       The advantage of using parallel forms rather than the test-retest method is that, with parallel forms, the problem of memory is completely eliminated because the two forms consist of different sets of items. The disadvantage of using parallel forms, however, is that, precisely because the sets of items on the two forms are different, one cannot be completely sure that they are really measuring the same thing.


Split-Half Reliability

One problem associated with both parallel forms and the test-retest method is the need to get the same group of people together twice to administer the tests. Often it is difficult to get a group of people together once for a test, and getting the same people together twice can be extremely difficult.

        To determine the reliability of a test without having to arrange for testing the same group of people twice, researchers have developed a method known as the split-half reliability method. The split-half method is probably the most commonly used way of determining a test’s reliability. It is used so frequently because it is so easy to do. The statistical calculations are exactly the same as those used with the test-retest method and with parallel forms, but the decided advantage is that the test needs to be administered only once. Furthermore, the split-half method also overcomes the problem of test takers remembering questions from one test to another.
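
A minimal sketch of the split-half idea follows, assuming the test is split into odd-numbered and even-numbered items (one common choice; the text above does not say how the halves are formed). The two half-test totals are correlated just as two administrations would be. The Spearman-Brown step at the end is a widely used adjustment for the fact that each half is only half as long as the full test; it is not mentioned in the text above and is included here as an assumption. All names and data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half(item_scores):
    """item_scores: one list per person, one score per item.

    Returns (correlation between halves, Spearman-Brown estimate).
    """
    # Items 1, 3, 5, ... go to one half; items 2, 4, 6, ... to the other.
    odd_totals = [sum(person[0::2]) for person in item_scores]
    even_totals = [sum(person[1::2]) for person in item_scores]
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)


# Hypothetical data: six people, six items scored 1 (correct) or 0 (incorrect),
# collected in a SINGLE administration.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
]

r_half, r_full = split_half(responses)
print(f"half-test correlation: {r_half:.2f}, stepped-up estimate: {r_full:.2f}")
```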


The Kuder-Richardson Formulas and Cronbach’s Alpha

Researchers sometimes determine the reliability of a test by using one of two statistical formulas called Kuder-Richardson Formula Number 20 and Kuder-Richardson Formula Number 21. The statistical calculations involved with K-R 20 are so complicated that K-R 21 was developed to allow a reasonably close approximation of the reliability coefficient generated by K-R 20 without having to carry out so many calculations.

       The Kuder-Richardson formulas may be used only with measuring instruments whose items are scored either correct or incorrect. The items on some measuring instruments, however, may take on several scores, such as questionnaires that consist of statements to which respondents are instructed to indicate their extent of agreement or disagreement. The possible responses for each statement often consist of strongly agree, agree, undecided, disagree, and strongly disagree. Consequently, the score a respondent receives on each statement may take on one of five values, depending on which response was selected. Cronbach’s alpha, which is conceptually related to the Kuder-Richardson formulas, may be used to determine the reliability of such instruments.

       The closer the reliability coefficient is to +1.00, the more reliable the test. The closer the reliability coefficient is to 0, the less reliable the test.
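
The two Kuder-Richardson formulas can be sketched as follows. The text above does not give the formulas, so the standard textbook forms are assumed: K-R 20 uses the proportion passing each item, while K-R 21 needs only the number of items, the mean total score, and the variance of the total scores. Population variance (dividing by N) is used throughout as a simplifying assumption, and the function names and data are hypothetical.

```python
def population_variance(values):
    """Variance computed with N (not N - 1) in the denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def kr20(item_scores):
    """K-R 20 for items scored 1 (correct) or 0 (incorrect)."""
    n_people = len(item_scores)
    k = len(item_scores[0])
    total_var = population_variance([sum(person) for person in item_scores])
    # Sum of p * q over items, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in item_scores) / n_people
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_var)


def kr21(item_scores):
    """K-R 21: a shortcut that treats all items as equally difficult."""
    k = len(item_scores[0])
    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / len(totals)
    total_var = population_variance(totals)
    return (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * total_var))


# Hypothetical data: six people, six right/wrong items.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
]

print(f"K-R 20: {kr20(responses):.2f}")
print(f"K-R 21: {kr21(responses):.2f}")
```

Because K-R 21 ignores differences in item difficulty, it never exceeds K-R 20 computed on the same data, which is consistent with its role as an approximation.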


Cronbach’s alpha / coefficient alpha

A measure of internal consistency based on information about (a) the number of items on the test, (b) the variance of the scores of each item, and (c) the variance of the total test scores. Mathematically speaking, it is equivalent to the average of the reliability estimates for all possible splits. When items are dichotomously scored, Cronbach’s alpha results are equal to those of K-R 20, which is why K-R 20 is considered a special case of Cronbach’s alpha.



There are two basic principles to follow in designing questionnaires:

a.       Make the items as relevant as possible;

b.      Word questions in such a way that they are likely to be interpreted the same way by every respondent.

The longer the questionnaire, the less likely it is that many people will take the time to complete it. Before administering a questionnaire, it is best to have a clear idea of how the data are to be analyzed. Include only questions that yield information pertinent to the data analysis.

       Open-ended questions that permit people to answer in their own words are likely to be interpreted differently by different people. It is best to follow each question with a set of possible responses so that people need only place an X next to the response of their choice. It is also best to include among the possible responses a category called other, so that people have the opportunity to give a response that was unanticipated by the researcher.


One Question Per Item.

Make sure that each item asks only one question. It is better to have several items, each of which asks one question, than one item that asks several questions. For example, it is better to use Items 1 and 2 rather than Item 3.


     Item 1. Did you find learning how to use the computer interesting?

     Item 2. Did you find learning how to use the computer easy?

     Item 3. Did you find learning how to use the computer easy and interesting?


Item 3 cannot be answered accurately by persons who found learning how to use the computer interesting but difficult, or by persons who found it easy but uninteresting. Items 1 and 2, however, permit such people to respond accurately.


Eliminating Ambiguous Wording

Responses that call for people to indicate how frequently they do something are particularly prone to ambiguous wording. Consider the following item:


            How often do you check your child’s homework?



        It is likely that different people will interpret the same response choice differently. The term rarely might be interpreted by one person as “less than once a month” and by another person as “less than twice a week.” One way to eliminate the ambiguity of responses is to express the choices in terms of actual units of time. Instead of using a term such as rarely, for example, use a term such as less than twice a month.

       Another way is to include a unit of time in the question itself. You might change the wording of the previous item to “How many days last week did you check your child’s homework?”

       Notice the ambiguity of the phrase check your child’s homework. Some parents might interpret the term check to mean “make sure the child has spent time working on homework.” Others might interpret it to mean “go over the child’s work thoroughly.” In short, it is desirable to use terms that are as precise as possible.


Attitudinal Scales

There are numerous ways to measure attitudes. Researchers probably use Likert scales most frequently, however.


Figure 16.1:    Examples of Likert Scale Items Measuring Attitudes Toward Mathematics



Directions:        Circle the response that best indicates the extent to which
                        you agree or disagree with each statement below, where

                        SA        =          Strongly Agree
                        A         =          Agree
                        U         =          Undecided
                        D         =          Disagree
                        SD        =          Strongly Disagree


            1.       Math is my favorite class ………….            SA    A    U    D    SD

            2.       Learning math is a waste of time ………     SA    A    U    D    SD





Likert Scales.

Likert scales are not a particular set of attitudinal scales but rather a technique by which attitudinal scales are constructed. You have probably already had the experience of completing a measuring instrument composed of Likert scale items. Such instruments consist of a series of statements, each of which is followed by a range of responses going, for example, from “Strongly Agree” to “Strongly Disagree.” Subjects select for each item the response that best reflects their feelings.

       The subject’s response to each item is scored, and the scores for all items are summed to get a total score that represents the subject’s attitude. An example of an item on a Likert scale measuring attitude toward math is “Math is my favorite class.” The statement is positively worded, so agreement with the statement indicates a positive attitude toward math. A response of “Strongly Agree” would receive a score of +2, “Agree” a +1, “Undecided” a 0, “Disagree” a -1, and “Strongly Disagree” a -2.

       Another example of an item on the same attitudinal measure is “Learning math is a waste of time.” This item is worded in such a way that agreement with the statement indicates a negative attitude toward math. Consequently, the scoring system for responses to the item is reversed. “Strongly Agree” receives a score of -2 rather than +2, “Agree” a -1, “Undecided” a 0, “Disagree” a +1, and “Strongly Disagree” a +2. Some people prefer to use ratings that range from 1 to 5 rather than from -2 to +2 to avoid negative numbers. Either set of ratings is acceptable.

        The use of ratings ranging from 1 to 5 rather than from -2 to +2 does not affect the results of a study based on data collected with a set of Likert scales. Suppose one were using 20 items similar to those in Figure 16.1 to determine whether there is a significant difference between females’ and males’ attitudes toward mathematics. If one used ratings ranging from 1 to 5, the lowest score a student could receive would be 20 (i.e., the student received a rating of 1 on each of the 20 items). The highest possible score would be 100 (i.e., 20 x 5).

       If one used ratings ranging from -2 to +2, the lowest score a student could receive would be -40 (i.e., the student received a rating of -2 on each of the 20 items). The highest possible score would be +40 (i.e., 20 x +2). The important point is that regardless of which set of ratings is used, the difference between the lowest and highest possible scores remains 80.
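
The scoring rules described above, including the reversal for negatively worded items, can be sketched as follows. The function and variable names are hypothetical. The two statements are the ones from Figure 16.1, and the respondent's answers are made up for illustration.

```python
# Ratings for a POSITIVELY worded statement under each scoring system.
SIGNED = {"SA": 2, "A": 1, "U": 0, "D": -1, "SD": -2}
ONE_TO_FIVE = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}


def score_item(response, scale, negatively_worded):
    """Score one response, reversing the rating for negatively worded items."""
    value = scale[response]
    if negatively_worded:
        # Reflect around the midpoint: -2..+2 flips sign, 1..5 becomes 6 - value.
        value = min(scale.values()) + max(scale.values()) - value
    return value


def total_score(responses, negatively_worded_flags, scale):
    """Sum the item scores to get the respondent's attitude score."""
    return sum(
        score_item(r, scale, neg)
        for r, neg in zip(responses, negatively_worded_flags)
    )


# Item 1: "Math is my favorite class."          (positively worded)
# Item 2: "Learning math is a waste of time."   (negatively worded)
flags = [False, True]
answers = ["SA", "SD"]  # hypothetical respondent who likes math

print(total_score(answers, flags, SIGNED))       # 4
print(total_score(answers, flags, ONE_TO_FIVE))  # 10

# With 20 positively worded items, the possible totals span 80 points either way.
twenty_flags = [False] * 20
for scale in (SIGNED, ONE_TO_FIVE):
    lowest = total_score(["SD"] * 20, twenty_flags, scale)
    highest = total_score(["SA"] * 20, twenty_flags, scale)
    print(lowest, highest, highest - lowest)
```

Each 1-to-5 rating is simply the corresponding -2-to-+2 rating plus 3, so every respondent's total shifts by the same constant (3 times the number of items), and differences between respondents or groups are unchanged.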



Semantic Differential Scales.

Another way of measuring attitudes is to use semantic differential scales. Such scales consist of two bipolar adjectives (i.e., adjectives that are antonyms), separated by a line divided into seven parts. The person completing the scale is instructed to place an X in the interval corresponding to his or her attitude toward some topic.


Figure 16.2:      Example of Semantic Differential Scales Measuring Attitudes Toward Mathematics.



Interesting   ___ : ___ : ___ : ___ : ___ : ___ : ___   Boring


Useful        ___ : ___ : ___ : ___ : ___ : ___ : ___   Useless



Checklists and Rating Scales.

Researchers who carry out observational studies often rely on checklists or rating scales to obtain data. A checklist is merely a listing of the kinds of behavior the researcher is interested in studying. Researchers may themselves observe and record subjects’ behavior on the checklist or have someone familiar with the subjects, such as a teacher, use a checklist to record students’ behavior.


Thu, 12 May 2011 @12:47








Copyright © 2019 bejo sutrisno · All Rights Reserved