by Saul McLeod published 2013
The term reliability in psychological research refers to the consistency of a research study or measuring test.
For example, if a person weighs themselves several times during the course of a day, they would expect to see a similar reading each time. Scales which measured weight differently each time would be of little use.
The same analogy could be applied to a tape measure which measured inches differently each time it was used. It would not be considered reliable.
If findings from research are replicated consistently they are reliable. A correlation coefficient can be used to assess the degree of reliability. If a test is reliable it should show a high positive correlation.
Of course, it is unlikely the exact same results will be obtained each time as participants and situations vary, but a strong positive correlation between the results of the same test indicates reliability.
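As a rough illustration of the point above, the sketch below computes a Pearson correlation coefficient between two sets of scores from the same test. The scores and the helper name `pearson_r` are invented for the example and do not come from any study; a value close to +1 would indicate that participants who scored high the first time also scored high the second time.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each list
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five participants who took the same test twice.
first_sitting = [12, 18, 25, 31, 40]
second_sitting = [14, 17, 27, 30, 41]

r = pearson_r(first_sitting, second_sitting)
print(f"r = {r:.2f}")
```

The two lists are not identical, yet the coefficient is still strongly positive, which is the pattern described above for a reliable test.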
There are two types of reliability – internal and external reliability.
Internal reliability assesses the consistency of results across items within a test. External reliability refers to the extent to which a measure varies from one use to another.
The split-half method assesses the internal consistency of a test, such as a psychometric test or questionnaire. It measures the extent to which all parts of the test contribute equally to what is being measured.
This is done by comparing the results of one half of a test with the results from the other half. A test can be split in half in several ways, e.g. first half and second half, or by odd and even numbers. If the two halves of the test provide similar results this would suggest that the test has internal reliability.
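The odd/even split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the response data are made up, and `split_half_r` and `pearson_r` are hypothetical helper names, not part of any standard psychometrics package. The sketch reports the raw correlation between the two half-test totals.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_r(responses):
    """Correlate odd-item totals with even-item totals across participants."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


# Hypothetical data: each row is one participant's answers to six items
# that are all assumed to measure the same construct (1-5 scale).
responses = [
    [1, 2, 1, 1, 2, 1],
    [2, 2, 3, 2, 2, 3],
    [3, 4, 3, 3, 4, 3],
    [4, 4, 5, 4, 4, 5],
    [5, 5, 4, 5, 5, 5],
]

half_r = split_half_r(responses)
print(f"split-half r = {half_r:.2f}")
```

Splitting by odd and even items, rather than first half and second half, is one of the options mentioned above; swapping the slices for `row[:3]` and `row[3:]` would give the first-half/second-half version.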
The reliability of a test could be improved by using this method. For example, any items on separate halves of a test which have a low correlation (e.g. r = .25) should either be removed or rewritten.
The split-half method is a quick and easy way to establish reliability. However, it can only be effective with large questionnaires in which all questions measure the same construct. This means it would not be appropriate for tests which measure different constructs.
For example, the Minnesota Multiphasic Personality Inventory has subscales measuring different behaviors such as depression, schizophrenia, and social introversion. Therefore the split-half method would not be an appropriate method to assess reliability for this personality test.
The test-retest method assesses the external consistency of a test. Examples of appropriate tests include questionnaires and psychometric tests. It measures the stability of a test over time.
A typical assessment would involve giving participants the same test on two separate occasions. If the same or similar results are obtained, then external reliability is established. A disadvantage of the test-retest method is that it takes a long time for results to be obtained.
Beck et al. (1996) studied the responses of 26 outpatients at two separate therapy sessions one week apart. They found a correlation of .93, demonstrating high test-retest reliability of the depression inventory.
This is an example of why reliability in psychological research is necessary: without reliable tests, some individuals may not be correctly diagnosed with disorders such as depression and consequently may not be given appropriate therapy.
The timing of the test is important: if the interval between the two tests is too brief, participants may recall information from the first test, which could bias the results. Alternatively, if the interval is too long, the participants could have changed in some important way, which could also bias the results.
Inter-rater reliability refers to the degree to which different raters give consistent estimates of the same behavior. It can be used for interviews.
Note that it can also be called inter-observer reliability when referring to observational research. Here researchers observe the same behavior independently (to avoid bias) and compare their data. If the data are similar, then the measure is reliable.
Where observer scores do not significantly correlate, reliability can be improved by operationalizing the behavior categories, that is, defining them objectively.
For example, if two researchers are observing ‘aggressive behavior’ of children at nursery, they would both have their own subjective opinion regarding what aggression comprises. In this scenario it would be unlikely that they would record aggressive behavior in the same way, and the data would be unreliable.
However, if they were to operationalize the behavior category of aggression, this would be more objective and make it easier to identify when a specific behavior occurs.
For example, while “aggressive behavior” is subjective and not operationalized, “pushing” is objective and operationalized. Thus researchers could simply count how many times children push each other over a certain duration of time.
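To make the counting idea concrete, the sketch below compares the tallies of two observers who each counted pushes in the same observation intervals. Everything in it is an assumption made for illustration: the tallies are invented, the six ten-minute intervals are arbitrary, and `pearson_r` is a hypothetical helper rather than a standard library function.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical tallies: number of pushes each observer recorded,
# independently, in six consecutive ten-minute intervals.
observer_a = [3, 0, 5, 2, 4, 1]
observer_b = [3, 1, 5, 2, 3, 1]

inter_observer_r = pearson_r(observer_a, observer_b)
print(f"inter-observer r = {inter_observer_r:.2f}")
```

Because "pushing" is easy to recognize, the two sets of counts differ only slightly and the correlation between them is high; a vaguer category such as "aggressive behavior" would be expected to produce less similar tallies.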
Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Manual for the Beck Depression Inventory. San Antonio, TX: The Psychological Corporation.
Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.
McLeod, S. A. (2007). What is reliability? Retrieved from www.simplypsychology.org/reliability.html