Criterion Validity: Definition & Examples

Criterion validity is a way of confirming that a test, scale, or survey is actually effective by comparing its results against an established, trusted standard, often called the “criterion.”

It includes concurrent validity (correlation with existing measures) and predictive validity (predicting future outcomes).

Key Takeaways

  • Definition: Criterion validity confirms how effectively a new measure aligns with or predicts an existing, trusted external measure (the criterion).
  • Criterion: Selecting a suitable and reliable criterion is often the most challenging step because if the criterion itself is flawed, the entire validity assessment of the new test will be compromised.
  • Types: Researchers primarily use two forms: predictive validity, which assesses how well a test forecasts a future outcome, and concurrent validity, which assesses correlation with an outcome measured at the same time.
  • Relevance: This form of validity is especially crucial in applied fields like education, clinical psychology, and human resources, where tests are used to make high-stakes decisions about people, such as university admission or job suitability.
  • Measured: The strength of criterion validity is statistically reported as a correlation coefficient, where a higher absolute value indicates a stronger, more desirable relationship between the test scores and the criterion scores.

Types of Criterion Validity

An appropriate criterion must be a reliable and valid measure of the real-world outcome or construct the test is designed to predict or relate to.

It often serves as a “gold standard” benchmark.

Criterion validity is often divided into subtypes based on the timing of the criterion measurement:

  • Concurrent validity: This examines the relationship between a test score and a criterion measured at the same time.
  • Predictive validity: This assesses the relationship between a test score and a criterion measured in the future.

Concurrent (same time)

Concurrent criterion validity is established by demonstrating that a measure correlates with an external criterion assessed simultaneously.

This can be shown when scores on a new test correlate highly with scores on an established test measuring similar constructs (Barrett et al., 1981).

This approach is valuable for:

  1. Measuring similar but not perfectly overlapping constructs, where the new measure should predict variance beyond existing measures.
  2. Evaluating practical outcomes rather than theoretical constructs (Barrett et al., 1981).

Examples:

  • Comparing a new, shorter depression inventory score with a clinician’s formal diagnosis or a score on an established, validated long-form inventory at the same point in time.
  • Comparing scores on a new measure of job-related knowledge with current manager performance ratings for existing employees.

While correlational analyses are most common, researchers may also use regression.
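For illustration, a minimal sketch of such an analysis in Python (made-up scores for ten people; SciPy assumed to be installed) might correlate the new measure with the established one and then regress the criterion on the new test:

```python
# Minimal sketch of a concurrent-validity check: correlation plus simple
# linear regression. All scores below are invented for illustration.
import numpy as np
from scipy import stats

new_test = np.array([12, 18, 9, 22, 15, 20, 7, 17, 14, 19])       # new, shorter measure
established = np.array([30, 44, 25, 51, 38, 47, 22, 41, 35, 45])  # established criterion

# Validity coefficient: correlation between the two sets of scores.
r, p = stats.pearsonr(new_test, established)
print(f"Concurrent validity coefficient r = {r:.2f} (p = {p:.3f})")

# Regression: predict the criterion score from the new test score.
slope, intercept, r_value, p_value, stderr = stats.linregress(new_test, established)
print(f"Predicted criterion score ≈ {intercept:.1f} + {slope:.2f} × new test score")
```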

Validation methods include comparing responses between new and established measures given to the same group, or comparing responses to expert judgments (Fink, 2010).

Note that concurrent validity does not guarantee predictive validity.

Predictive (looking ahead)

Predictive validity demonstrates that a test score can predict future performance on another criterion (Cohen & Swerdlik, 2005).

Here, the criterion is measured in the future, after the test has been taken. The test is used to predict what will happen later.

Good predictive validity is important when choosing measures for employment or educational purposes, as it increases the likelihood of selecting individuals who will perform well.

Examples:

  • Entrance exam scores predicting future grades
  • Aptitude tests predicting job performance
  • School tests predicting dropout rates
  • Early assessments predicting later academic success

Predictive criterion validity is established by demonstrating that a measure correlates with an external criterion measured at a later point in time.

The correlation between scores on standardized tests like the SAT or ACT and a student’s first-year GPA is often used as evidence for the predictive validity of these tests.

These tests aim to predict future academic performance, and a strong positive correlation between test scores and subsequent GPA would support their ability to do so.
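A rough sketch of how this predictive design might be analysed in Python (invented student IDs, exam scores, and later GPAs; pandas and SciPy assumed to be installed):

```python
# Sketch of a predictive-validity analysis: test scores collected now,
# the criterion (first-year GPA) collected later, matched by student ID.
import pandas as pd
from scipy import stats

exam = pd.DataFrame({"student_id": [1, 2, 3, 4, 5, 6],
                     "exam_score": [1200, 1450, 1100, 1350, 1500, 1250]})
gpa = pd.DataFrame({"student_id": [1, 2, 3, 4, 5, 6],
                    "first_year_gpa": [3.1, 3.8, 2.7, 3.4, 3.9, 3.0]})

# Pair each student's earlier test score with their later criterion score.
paired = exam.merge(gpa, on="student_id")

r, p = stats.pearsonr(paired["exam_score"], paired["first_year_gpa"])
print(f"Predictive validity coefficient r = {r:.2f} (p = {p:.3f})")
```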

How to measure criterion validity

Measuring criterion validity is a process of checking if your new test score matches or predicts a trusted, real-world outcome (the criterion).

The procedure involves a set of clear steps, primarily using a statistical tool called a correlation coefficient.

1. Defining the Test and the Criterion

Before any calculation, you must precisely define the two things you are comparing:

  • The Test (The Predictor): The new tool, questionnaire, or assessment you are trying to validate (e.g., a pre-employment aptitude test, a new inventory of anxiety).
  • The Criterion (The Trusted Outcome): The external, established measure of the construct you are interested in. This is the real-world result you are trying to match or predict.
| Type of Validity | The Criterion | Example |
|---|---|---|
| Concurrent validity | The trusted, current measure (measured at the same time as the test). | An existing, long-form personality scale; a clinician’s immediate diagnosis. |
| Predictive validity | The future outcome (measured later). | Sales figures in six months; first-year college GPA; re-offense rates. |

Examples of Appropriate Criteria

Criteria can range from objective, verifiable records to subjective, expert judgments, provided they are reliable and relevant to the construct being measured.

  • Objective Institutional Outcomes: Factory production records, sales figures, college grades, or dichotomous outcomes (e.g., retention/turnover, dropping out).
  • Clinical/Diagnostic Judgments: Formal psychiatric diagnoses, recommendations for commitment, or clinical assessments rendered by professionals.
  • Alternative Measures (Gold Standards): An existing, validated, and established measure to which a new test is expected to correlate highly.

Critical Check: Avoid Contamination

You must ensure that the person measuring the criterion does not know the scores from the test. If a manager knows an employee scored high on a hiring test, their performance rating (the criterion) might be unconsciously biased, creating an artificially high correlation.


2. The Measurement Process

Step 1: Collect Paired Data

Administer your Test and measure the Criterion for the same group of people.

  • Concurrent Example: Give 100 people a new, short anxiety scale and immediately have them complete the older, established anxiety scale.
  • Predictive Example: Give 100 job applicants a hiring test. Hire them on some other basis, so the test itself does not determine who is selected. Six months later, collect their managers’ job performance ratings.

You will end up with two scores for every person.

Step 2: Choose the Right Correlation Statistic

The choice of statistic depends on the type of data (scale) you have for both your test and your criterion.

  • Pearson’s r (Most Common): Used when both the Test score and the Criterion score are continuous (like scores from 0-100, or GPA). This is the standard method.
  • Spearman’s rho: Used when one or both variables are ordinal (based on rank, like ranking employees from 1st to 10th).
  • Point-Biserial or Phi: Used when the data include a dichotomous variable (e.g., a simple Yes/No criterion such as “Employed”/“Not Employed” or “Pass”/“Fail”). Point-biserial applies when the other variable is continuous; phi applies when both variables are dichotomous.
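As a sketch of how these choices translate into an analysis (illustrative made-up data; SciPy assumed to be installed):

```python
# Sketch: matching the correlation statistic to the type of criterion data.
import numpy as np
from scipy import stats

test_scores = np.array([55, 72, 63, 80, 47, 68, 75, 59])  # predictor (continuous)

# Continuous criterion (e.g., GPA): Pearson's r
gpa = np.array([2.8, 3.5, 3.0, 3.8, 2.5, 3.2, 3.6, 2.9])
r_pearson, _ = stats.pearsonr(test_scores, gpa)

# Ordinal criterion (e.g., supervisor rank, 1 = best): Spearman's rho
rank = np.array([7, 3, 5, 1, 8, 4, 2, 6])
rho, _ = stats.spearmanr(test_scores, rank)

# Dichotomous criterion (e.g., retained = 1, left = 0): point-biserial
retained = np.array([0, 1, 1, 1, 0, 1, 1, 0])
r_pb, _ = stats.pointbiserialr(retained, test_scores)

print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {rho:.2f}, point-biserial = {r_pb:.2f}")
```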

3. Calculation and Interpretation

The correlation calculation results in a single number between -1.0 and +1.0.

Step 3: Calculate the Correlation Coefficient (r)

(Statistical software like Excel, R, or SPSS is typically used for this calculation).
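For reference, the Pearson coefficient these packages compute follows the standard formula below, where the x values are the test scores, the y values are the criterion scores, and bars denote means:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$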

Step 4: Interpret the Result

The value of r tells you two things: Strength and Direction.

There is no single, universally “acceptable” cut-off, as the required benchmark depends on the context, the construct, and the practical utility of the test.

However, general guidelines for interpreting the magnitude of the correlation coefficient, r, are often cited in psychometrics:

| Result (r) | Interpretation of Strength | Interpretation of Direction | Evidence for Validity |
|---|---|---|---|
| Close to +1.0 | Strong relationship: data points cluster tightly around a straight line. | Positive: high scores on the Test predict high scores on the Criterion. | Strong validity |
| Around 0.40 | Moderate relationship: points are somewhat scattered. | Positive: high scores on the Test tend to predict somewhat higher scores on the Criterion. | Acceptable/useful validity |
| Close to 0.0 | No linear relationship: points are randomly scattered. | None: the Test score does not predict the Criterion score. | Poor validity |
| Close to -1.0 | Strong relationship: data points cluster tightly around a straight, downward-sloping line. | Negative: high scores on the Test predict low scores on the Criterion (e.g., high anxiety test score → low coping skills). | Strong validity |

Step 5: Check for Statistical Significance

A correlation is considered evidence of validity only if it is statistically significant.

This means the correlation you found is highly unlikely to have occurred by random chance.

Software will provide a p-value.

If the p-value is small (typically less than 0.05), the correlation is considered statistically significant, strengthening your claim of criterion validity.
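As a small illustration (hypothetical scores; SciPy assumed), the same function used to compute r also returns the p-value needed for this check:

```python
# Sketch of the significance check: pearsonr returns both r and its p-value.
from scipy import stats

test = [55, 72, 63, 80, 47, 68, 75, 59]       # hypothetical test scores
criterion = [60, 78, 65, 85, 50, 70, 80, 62]  # hypothetical criterion scores

r, p = stats.pearsonr(test, criterion)
if p < 0.05:
    print(f"r = {r:.2f}, p = {p:.3f}: statistically significant evidence of criterion validity")
else:
    print(f"r = {r:.2f}, p = {p:.3f}: the correlation could plausibly be due to chance")
```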


Examples of criterion-related validity

Intelligence tests

Researchers developing a new, shorter intelligence test might administer it concurrently alongside a well-established test, such as the Stanford-Binet.

If there is a high correlation between the scores from the two tests, it suggests the new test measures the same construct (intelligence), supporting its concurrent validity.

Risk assessment and dental treatment

Bader et al. (2005) studied the predictive validity of a subjective method for dentists to assess patients’ caries risk.

They analyzed data from practices that had used this method for several years to see if the risk categorization predicted the subsequent need for caries-related treatment.

Their findings showed that patients categorized as high-risk were four times more likely to receive treatment than those categorized as low-risk, while those categorized as moderate-risk were twice as likely.

This supports the predictive validity of this assessment method.

Minnesota Multiphasic Personality Inventory

The initial validation of the MMPI involved identifying items that differentiated between individuals with specific psychiatric diagnoses and those without, contributing to the development of scales for various psychopathologies.

This method of establishing validity, where the test is compared to an existing criterion measured at the same time, exemplifies concurrent validity.

Limitations and Evaluation of Criteria

The fundamental requirement for a sound validation procedure is that the criterion measure must itself be assessed for its quality and independence from the test being validated.

The major threats to the validity of a criterion measure can be organized into three overlapping areas:

1. Threats to Conceptual and Measurement Integrity

These issues concern how well the chosen criterion captures the theoretical construct it is supposed to embody, a concept often framed in terms of criterion bias.

Criterion Contamination (Construct-Irrelevant Variance)

Criterion contamination happens when the outcome (criterion) score is affected by things that aren’t actually part of what it’s supposed to measure.

Importantly, these extra factors are often linked to the predictor test scores, which can make the relationship between the test and the outcome seem stronger or weaker than it really is.

  • Bias from Predictor Knowledge: The primary form of contamination arises when knowledge of the test scores biases the measurement of the criterion. To ensure validity, the criterion measure must be determined independently of the test score.
    • For instance, if supervisors rate job performance (the criterion) knowing the employees’ aptitude scores (the predictor), their ratings may be biased, leading to a spurious correlation.
  • Irrelevant Elements: Contamination results from including irrelevant elements in the criterion measure. In terms of statistical weights, contamination means assigning positive weights to components that should theoretically receive zero weight.

Criterion Deficiency (Construct Underrepresentation)

Criterion deficiency occurs when the criterion measure fails to capture important aspects of the theoretical construct it is intended to measure.

Consequently, the criterion is an incomplete or “deficient” representation of the desired outcome.

  • Imperfect Sampling: Criteria often function merely as an imperfect sample of the full range of the disposition or attribute the test is designed to measure.
  • Omitted Elements: Deficiency is formally characterized by giving zero weight to elements or behaviors that should theoretically receive substantial weight.
  • Reliance on Single Measures: A major source of this problem is relying on only one type of criterion measure to represent a complex domain, leading to the omission of crucial facets.

2. Threats to Quality and Consistency

These issues relate to the internal consistency, stability, and susceptibility of the criterion measure to random variation.

Unreliability

A criterion measure must be reliable (consistent).

If the criterion measure is unreliable, the presence of random measurement error attenuates (reduces) the magnitude of the observed test-criterion correlation.

This often results in a test being falsely judged as having low predictive power, thereby underestimating the actual “true” relationship between the predictor and the criterion.
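This attenuating effect is commonly expressed with Spearman’s correction for attenuation, which estimates the correlation that would be observed if both measures were perfectly reliable (here $r_{xy}$ is the observed test-criterion correlation, and $r_{xx}$ and $r_{yy}$ are the reliabilities of the test and the criterion):

$$r_{\text{corrected}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$$

For example, an observed correlation of .30 with a test reliability of .90 and a criterion reliability of .50 implies a disattenuated correlation of about $.30 / \sqrt{0.45} \approx .45$.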

Arbitrary Distinction

In research focused on assessing a latent attribute, the theoretical distinction between the “merit of the test and criterion variables” is often artificial, as both measures are conceptually derived from a “common factor” (the underlying construct).

Treating the criterion as an unquestionable standard against which the test is validated can be flawed, as the criterion may be no more valid than the test itself.

3. Statistical and Practical Limitations

The utility of a criterion measure can be compromised by statistical artifacts inherent in the sampled population and practical difficulties in measurement.

Restriction of Range

The practical maximum predictive validity of a test is limited by the variability of scores in both the test and the criterion.

When the validation sample has already been screened (for example, only hired applicants or admitted students are available), the resulting restriction of range in the test and criterion scores reduces the variance available for prediction, thus attenuating the observed validity coefficient.
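A commonly cited adjustment (often presented as Thorndike’s Case 2 formula) estimates what the correlation would have been in the unrestricted group, given the observed (restricted) correlation r and the ratio U of the unrestricted to the restricted standard deviation of the predictor:

$$r_{c} = \frac{rU}{\sqrt{1 + r^{2}\left(U^{2} - 1\right)}}, \qquad U = \frac{S_{\text{unrestricted}}}{S_{\text{restricted}}}$$

For example, an observed r of .25 in a screened sample with U = 1.5 corresponds to an estimated r of roughly .36 in the full applicant pool.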

Unsuitability or Impracticality

The complexity or timing requirements of the theoretical criterion may render it impractical for actual use.

Criteria may be dismissed as inadequate if they are deemed too amorphous or are temporally too far in the future to measure effectively.

Furthermore, when criterion data rely on human judgment (e.g., supervisors’ or teachers’ ratings), researchers must take care to develop an effective rating system that mitigates human error and bias.

FAQs

What is the difference between criterion and construct validity?

Criterion validity examines the relationship between test scores and a specific external criterion the test aims to measure or predict.

This criterion is a separate, independent measure of the construct of interest.

This approach emphasizes practical applications and focuses on demonstrating that the test scores are useful for predicting or estimating a particular outcome.

Construct validity seeks to establish whether the test actually measures the underlying psychological construct it is designed to measure.

It goes beyond simply predicting a criterion and aims to understand the test’s theoretical meaning.

How do you increase criterion validity?

There are several ways to increase criterion validity, including (Fink, 2010):

– Making sure the content of the test is representative of what will be measured in the future
– Using well-validated measures
– Ensuring good test-taking conditions
– Training raters to be consistent in their scoring

References

Aboraya, A., France, C., Young, J., Curci, K., & LePage, J. (2005). The validity of psychiatric diagnosis revisited: the clinician’s guide to improve the validity of psychiatric diagnosis. Psychiatry (Edgmont), 2(9), 48.

Bader, J. D., Perrin, N. A., Maupomé, G., Rindal, B., & Rush, W. A. (2005). Validation of a simple approach to caries risk assessment. Journal of Public Health Dentistry, 65(2), 76-81.

Barrett, G. V., Phillips, J. S., & Alexander, R. A. (1981). Concurrent and predictive validity designs: A critical reanalysis. Journal of Applied Psychology, 66(1), 1.

Conte, J. M. (2005). A review and critique of emotional intelligence measures. Journal of Organizational Behavior, 26(4), 433-440.

Fink, A. (2010). Survey research methods. In G. McCulloch & D. Crook (Eds.), The Routledge international encyclopedia of education. Routledge.

Prince, M. (2012). Epidemiology. In P. Wright, J. Stern, & M. Phelan (Eds.), Core psychiatry. Elsevier Health Sciences.

Schmidt, F. L. (2012). Cognitive tests used in selection can have content validity as well as criterion validity: A broader research review and implications for practice. International Journal of Selection and Assessment, 20(1), 1-13.

Cohen, R. J., & Swerdlik, M. E. (2005). Psychological testing and assessment: An introduction to tests and measurement. McGraw-Hill.
