Assessing the Validity of a Psychological Assessment

March 22, 2017 | By Paul Barrett

Test-publisher materials, psychological assessment training, and assessment/test review ‘guidelines’ present construct, content, face, predictive, concurrent, and ecological validity as the definitive ‘types of validity’.

However, these ‘types’ were introduced in the mid-20th century and should have been quietly retired in the late 1990s to early 2000s. In that period, three philosophers of science and measurement produced near-simultaneous but quite independent streams of publications that revolutionised how we should think about validity, and about the use of the word ‘quantitative’ when applied to psychological attribute assessment.

So who are these “Big Three” of Psychological Measurement & Validity?

One is Mike Maraun (1998), who demonstrated so clearly the huge logical mistake made by Cronbach and Meehl when they initially proposed their definition of construct validity.

Another is Joel Michell (1997), who laid out the facts and axioms which constitute the definition of a quantity and quantitative measurement, as embodied in all sciences for two or more centuries.

Finally, it was Denny Borsboom (and colleagues) in 2004 who provided a simple definition of validity which clarified the distinction between the ‘scientific’ perspective and the pragmatic-practical perspective. Note how this scientific definition is focused on measurement:

“Validity is not complex, faceted, or dependent on nomological networks and social consequences of testing. It is a very basic concept and was correctly formulated, for instance, by Kelley (1927, p. 14) when he stated that a test is valid if it measures what it purports to measure.

A test is valid for measuring an attribute if and only if (a) the attribute exists and (b) variations in the attribute causally produce variations in the outcomes of the measurement procedure.” (p. 1061)

Investigating Validity from a Scientific Perspective

It’s all about phenomenon detection and/or an initial theory-claim proposing an attribute’s ‘existence’ and qualities, then proposing and testing the ‘rules’ by which the attribute variations are claimed to be measurable. Those ‘rules’ embody particular axiomatic equalities/inequalities which must hold if a rule is to be adjudged valid. This investigative work requires careful experimentation and a ‘metrology’ mind-set, in contrast to the iterations of descriptive correlational workups and covariance analyses conducted by psychologists, which are already predicated upon an assumption of quantity.

And therein lies the problem for all psychological attribute measurement which attempts to claim it is making ‘quantitative’ measurement of an attribute. There is no evidence, to date, that any psychological attribute varies as a quantity. For examples of physical quantities, think of length, mass, or electrical current.

Investigating Validity from a Pragmatic Perspective

What this actually boils down to is ‘validation’; as Borsboom and colleagues (2004) put it:

“Validity is a property [of tests], whereas validation is an activity. In particular, validation is the kind of activity researchers undertake to find out whether a test has the property of validity. Validity is a concept like truth: It represents an ideal or desirable situation. Validation is more like theory testing: the muddling around in the data to find out which way to go.” (p. 1063)

In applied practice, test publishers/authors need to convey the answers to certain obvious, relevant-to-deployment questions to those wishing to use their assessments. Why? Because these answers form the ‘evidence-base’ which others will use to form a validation judgement, i.e. an informed judgement as to whether the assessment will be fit-for-purpose, a priori justifiable, and ultimately legally defensible.

So, what are some of these questions?

1. Do those who use this test find it of benefit; if so, what are these benefits?

Such information has to be acquired via a few standard but open-ended survey questions asked of assessment users (whether by phone call, personal meeting, or online survey). That qualitative information can be formally categorized and summarised, then written up as a simple one-page infographic. If the previous deployments of an assessment are adjudged favourably by users for the various reasons they provide, that’s ‘good enough’ preliminary evidence of pragmatic validity. Why? Because if the assessment were producing random results which made no coherent or consistent sense, no user would give it a positive rating.

2. Does it assess what it is claimed it assesses?

This is all about presenting information which justifies a claim that an assessment assesses magnitudes of a particular attribute, or class-category types. In practice this is more about developing a line of plausible reasoning based upon some empirical and logical analytical workups rather than referring to some abstract notion of ‘concurrent validity’. For example, we already know that in personality research, there is only moderate agreement between assessments claiming to assess the same-named ‘constructs’ (Pace and Brannick, 2010). And, as we know from Mike Maraun’s expositions, given we have no ‘technical’ definition of any attribute and no evidence that any psychological attribute varies as a quantity, we are simply looking for ‘good enough’ justifications here. This is not physics or chemistry, no matter how many wish it to be so.

3. If I give the same assessment tomorrow or next week to the same candidate, will they attain more-or-less the same results?

I know, I can hear you say “but surely this is about reliability!” And so it is. But when forming a judgement about whether an assessment is appropriately justified/validated for your particular deployment, you need to know the answer to this question. Obviously, if what you propose to assess is something you expect to change dramatically on a day-to-day basis, this question is irrelevant. But for the vast majority of assessment applications used in the workplace, we are looking at attributes which comprise a stable feature of individuals over short periods of time.
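As a rough illustration of answering the “more-or-less the same results” question in the metric of the scores themselves, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the invented scores, and the tolerance of 2 score points are hypothetical and do not come from any particular assessment. It reports a Pearson correlation alongside the mean absolute test-retest difference and the proportion of candidates whose two scores fall within the chosen tolerance.

```python
from math import sqrt
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of scores."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


def retest_summary(first, second, tolerance):
    """Summarise test-retest agreement in the original score metric.

    `tolerance` is a hypothetical, user-chosen number of score points
    within which two results count as 'more-or-less the same'.
    """
    diffs = [abs(x - y) for x, y in zip(first, second)]
    return {
        "pearson_r": pearson_r(first, second),
        "mean_abs_diff": mean(diffs),
        "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
    }


# Invented scores: every candidate scores exactly 5 points higher on retest.
first = [10, 12, 15, 18, 20]
second = [x + 5 for x in first]

summary = retest_summary(first, second, tolerance=2)
print(summary)
```

In this invented case the rank order of candidates is preserved perfectly, so the correlation is at its maximum, yet no candidate’s retest score lands within 2 points of the first result. What counts as an acceptable tolerance is a judgement for the user to make for their own deployment.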

4. If an assessment is being used on the basis that its ‘scores’ or ‘indicators’ predict certain outcomes, do they actually do so? i.e. What is its predictive accuracy?

The usual Pearson-correlations-as-validity-coefficients used by many to answer this question are assumption-laden (yes, that quantity assumption again!) estimates of monotonic agreement in which all magnitude information has been carefully removed by the computations forming the estimate. In short, mildly amusing but not what a user really wants to know here. Indexing predictive accuracy requires analyses conducted in the metric of the observations themselves: looking at observed vs predicted magnitude discrepancies, counting successes/failures, producing mis-classification tables and rates, and above all, using V-fold or holdout-sample cross-validation of any model which claims to be ‘predictive’ of an outcome.
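The kind of analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the predictive model is a simple straight-line fit chosen for brevity, and the names `fit_line`, `v_fold_cv`, and the outcome cutoff of 3.5 are hypothetical. The sketch runs V-fold cross-validation, reports the mean absolute discrepancy between observed and predicted outcomes in the outcome’s own metric, and builds a mis-classification table against the cutoff.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def v_fold_cv(xs, ys, cutoff, v=5, seed=0):
    """V-fold cross-validation of the straight-line model.

    Returns the mean absolute observed-vs-predicted discrepancy and a
    classification table for 'outcome at or above cutoff'.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]
    abs_errors = []
    table = {"hit": 0, "miss": 0, "false_alarm": 0, "correct_reject": 0}
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training folds only, then predict the held-out cases.
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in held_out:
            predicted = intercept + slope * xs[i]
            abs_errors.append(abs(ys[i] - predicted))
            predicted_success = predicted >= cutoff
            actual_success = ys[i] >= cutoff
            if predicted_success and actual_success:
                table["hit"] += 1
            elif predicted_success:
                table["false_alarm"] += 1
            elif actual_success:
                table["miss"] += 1
            else:
                table["correct_reject"] += 1
    return mean(abs_errors), table


# Simulated data (an assumption): assessment scores and later outcome ratings.
rng = random.Random(42)
scores = [rng.uniform(0, 100) for _ in range(100)]
ratings = [2.0 + 0.03 * s + rng.gauss(0, 0.5) for s in scores]

mae, table = v_fold_cv(scores, ratings, cutoff=3.5)
rate = (table["miss"] + table["false_alarm"]) / len(scores)
print(f"mean absolute discrepancy: {mae:.2f} rating points")
print(f"classification table: {table}")
print(f"mis-classification rate: {rate:.2%}")
```

Because every prediction is made for a case the model did not see during fitting, the discrepancy and mis-classification figures describe out-of-sample accuracy, which is the quantity a user deploying the assessment is actually asking about.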

Practitioners do not set out to evaluate the property of validity of a test or assessment by looking for evidence justifying: “variations in the attribute causally produce variations in the outcomes of the measurement procedure” (the scientific question). Rather, they want to make an informed personal judgement based upon the ‘validation’ information associated with an assessment that is directly relevant to its deployment in the workplace. To that end they need clear answers to some basic questions about if and how an assessment has been validated for productive use in the workplace. Put more simply, does it do what it says on the box?


Borsboom, D., Mellenbergh, G.J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 4, 1061-1071.

Borsboom, D., Cramer, A.O.J., Kievit, R.A., Scholten, A.Z., & Franic, S. (2009). The end of construct validity. In Lissitz, R.W. (Ed.). The Concept of Validity: Revisions, New Directions, and Applications (Chapter 7, pp. 135-170). Charlotte: Information Age Publishing.

Maraun, M.D. (1998). Measurement as a Normative Practice: Implications of Wittgenstein’s Philosophy for Measurement in Psychology. Theory & Psychology, 8, 4, 435-461.

Michell, J. (1997). Quantitative science and the definition of measurement in Psychology. British Journal of Psychology, 88, 3, 355-383.

Michell, J. (2009). Invalidity in Validity. In Lissitz, R.W. (Ed.). The Concept of Validity: Revisions, New Directions, and Applications (Chapter 6, pp. 111-133). Charlotte: Information Age Publishing.

Pace, V.L., & Brannick, M.T. (2010). How similar are personality scales of the “same” construct? A meta-analytic investigation. Personality and Individual Differences, 49, 7, 669-676.