What does the "medical screening base-rate" cognitive-reasoning question test?

Note on framing: This is the cog_sample_2 item-level explainer for the AIEH cognitive-reasoning sample-test family. The item draws on the canonical Kahneman & Tversky base-rate-neglect literature and is structurally similar to the well-known mammography problem documented in Eddy 1982 and Casscells, Schoenberger & Graboys 1978.

This item presents a medical-screening scenario: a disease has a 1-in-1000 prevalence in the screened population, the screening test has 99% sensitivity and 95% specificity, and a randomly-selected person tests positive. The candidate is asked the probability that the person actually has the disease. The intuitive answer is something like 95% or 99%, matching the test’s sensitivity or specificity. The correct answer — derivable from Bayes’s theorem — is closer to 2%. The scenario probes base-rate neglect, the systematic tendency to under-weight prior probabilities when updating beliefs in light of new evidence. Base-rate neglect is one of the most-replicated findings in cognitive psychology, and performance on items like this one is among the stronger predictors of on-the-job performance in evidence-evaluation roles.

What this question tests

The item targets Bayesian-reasoning competence with a deliberately counterintuitive base-rate setup. The construct measures three related capacities. First, the respondent must recognize that the disease’s base rate matters at all — many respondents anchor entirely on the test’s sensitivity and specificity and never reach for the prior. Second, the respondent must execute the Bayesian update correctly, combining prior probability with likelihood ratio. Third, the respondent must trust the counterintuitive result rather than reverting to the intuitive 95%-or-99%-feeling answer.

The skill matters for hiring because base-rate-neglect errors compound across most evidence-evaluation roles. Medical screening, fraud detection, security alerting, hiring assessment itself, and AI-output evaluation all involve interpreting positive signals against low-base-rate populations, and all are systematically miscalibrated by evaluators who under-weight base rates. Casscells et al.’s 1978 finding that even physicians systematically misjudge the equivalent of this scenario is one of the most-cited results in clinical-reasoning research, and the pattern replicates across professional populations including lawyers, financial analysts, and software engineers.

The base-rate-neglect construct sits at the intersection of Carroll’s fluid-reasoning factor (Gf) and Stanovich’s rational-thinking dispositions. Schmidt & Hunter’s 1998 meta-analysis identifies general mental ability as the single strongest predictor of job performance, but the applied component most-correlated with on-the-job evidence-evaluation work is specifically the Bayesian-reasoning competence that base-rate-neglect items target.

Why this is the right answer (concrete worked example)

The correct probability is approximately 1.94% (often rounded to 2%). The Bayesian computation proceeds in three steps using a 100,000-person reference population.

Step 1: Identify the prior. With a 1-in-1000 prevalence, out of 100,000 people, 100 have the disease and 99,900 do not.

Step 2: Apply the test characteristics. With 99% sensitivity, 99 of the 100 sick people test positive (and 1 false negative). With 95% specificity, 5% of the 99,900 healthy people test positive — that is 4,995 false positives. The total positive-test count is 99 + 4,995 = 5,094.

Step 3: Compute the posterior. Of the 5,094 people who test positive, only 99 actually have the disease. The probability of disease given positive test is 99 / 5,094 = 0.0194, or 1.94%.
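The three steps above can be sketched in a few lines of Python (the variable names are illustrative, not part of the item):

```python
# Natural-frequency walk-through of the screening scenario.
population = 100_000
prevalence = 1 / 1000          # 1-in-1000 base rate
sensitivity = 0.99             # P(positive | disease)
specificity = 0.95             # P(negative | no disease)

sick = population * prevalence                  # 100 people with the disease
healthy = population - sick                     # 99,900 people without it

true_positives = sick * sensitivity             # 99 sick people test positive
false_positives = healthy * (1 - specificity)   # 4,995 healthy people test positive

ppv = true_positives / (true_positives + false_positives)
print(round(ppv * 100, 2))  # → 1.94
```

The final line reproduces the 99 / 5,094 ratio from Step 3; the reference-population size cancels out, so any convenient round number works.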

The counterintuitive nature of the result comes from the asymmetry between the rare-event base rate (0.1%) and the test’s false-positive rate against the common-event healthy population (5%). Even though the test is “95% accurate,” the absolute number of false positives from the much-larger healthy population vastly exceeds the absolute number of true positives from the small sick population. A respondent who reasons through the 100,000-person reference frame recovers the right answer; a respondent who anchors on the test’s sensitivity or specificity without computing the absolute counts misses by a factor of 50.

The worked example illustrates a generalizable lesson: when the base rate is rare and the test’s false-positive rate is non-trivial, even a high-sensitivity test produces many more false positives than true positives. The implication for downstream decision-making is that a positive test result is informative but not definitive, and the appropriate response is confirmatory testing rather than treatment. The same logic applies to fraud alerts, security incident triage, hiring decisions on rare-skill candidates, and AI-output anomaly flagging.
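To make the generalizable lesson concrete, a quick sweep (illustrative, not part of the item) holds the test characteristics fixed and varies only the base rate:

```python
def ppv(prevalence, sensitivity=0.99, specificity=0.95):
    """Positive predictive value via Bayes's theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for p in (0.001, 0.01, 0.1):
    print(f"prevalence {p:>5.1%}: P(disease | positive) = {ppv(p):.1%}")
```

The same 99%-sensitive, 95%-specific test yields a posterior of roughly 2% at 0.1% prevalence, roughly 17% at 1%, and roughly 69% at 10% — which is why the triage response to a positive flag (confirmatory test vs. immediate action) should depend on the population being screened.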

What the wrong answers reveal

The distractors map to the canonical base-rate-neglect failure modes documented in Kahneman & Tversky 1973:

  • “99% — that’s the test’s sensitivity.” This response represents the most common base-rate-neglect failure: the respondent anchors on the sensitivity figure as if it were the post-test probability of disease. Sensitivity is the probability of a positive test given disease, not the probability of disease given a positive test. The conditional has been inverted.
  • “95% — that’s the test’s specificity.” This response signals an even more confused mental model: specificity is the probability of a negative test given no disease, which is even less directly related to the post-test probability of disease than sensitivity is. The respondent has pattern-matched to “high accuracy number” without distinguishing which conditional probability the question asks about.
  • “50% — chance of disease or not.” This response represents a flat-prior failure: the respondent treats the prior as 50/50 in the absence of explicit guidance, ignoring the stated 1-in-1000 base rate entirely. It is a sub-failure of base-rate neglect: not just under-weighting the base rate, but discarding it entirely.
  • “Cannot be determined from the information given.” This response signals miscalibrated caution rather than a reasoning failure: the respondent recognizes that Bayesian inference is required but does not trust their ability to compute it, even though the stated prevalence, sensitivity, and specificity fully determine the answer. Strong respondents trust the 100,000-reference-frame computation and produce the correct answer.
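The first two distractors both invert a conditional. A short check (hypothetical variable names) makes the gap between the two directions of the conditional concrete:

```python
# P(positive | disease) vs. P(disease | positive) for the stated item.
p_disease = 0.001
p_pos_given_disease = 0.99                  # sensitivity
p_pos_given_healthy = 0.05                  # 1 - specificity

# Total probability of a positive test, then Bayes's theorem.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_pos_given_disease)             # 0.99  — what the "99%" distractor reports
print(round(p_disease_given_pos, 4))   # 0.0194 — what the question actually asks
```

The two quantities differ by a factor of about 50, which is exactly the miss described in the worked example.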

How the sample test scores you

In the AIEH 5-item cognitive-reasoning sample test, this item contributes one of five datapoints aggregated into the single cognitive_reasoning score via the W3.2 normalize-by-count threshold. Scoring is graded: full credit for the 2% answer (or any value in the 1-3% range, accommodating reasonable rounding), partial credit for recognizing that the answer is closer to 2% than to 95%, and zero credit for the canonical sensitivity/specificity distractors.
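A minimal sketch of that rubric as code — the function name and the partial-credit cutoff are illustrative readings of the description above, not the published production grader:

```python
def grade_base_rate_item(answer_pct):
    """Hypothetical grading sketch for the base-rate item.

    Bands assumed from the rubric description: full credit for 1-3%,
    partial credit for any answer closer to 2% than to 95%, zero
    credit otherwise (which catches the 50/95/99 distractors).
    """
    if 1.0 <= answer_pct <= 3.0:
        return 1.0   # full credit: correct within reasonable rounding
    if answer_pct < 48.5:  # midpoint of 2 and 95: "closer to 2% than 95%"
        return 0.5   # partial credit: right region, wrong value
    return 0.0       # zero credit: sensitivity/specificity/flat-prior answers

print(grade_base_rate_item(2.0))   # 1.0
print(grade_base_rate_item(10.0))  # 0.5
print(grade_base_rate_item(95.0))  # 0.0
```

Note that under the "closer to 2% than 95%" reading, the 50% flat-prior distractor also scores zero, since 50 is nearer to 95 than to 2.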

Data Notice: Sample-test results are directional indicators only. Base-rate-neglect performance has moderate test-retest reliability and high item-specific variance because trained respondents who learn the 100,000-reference-frame technique solve all isomorphic problems while untrained respondents may solve some framings and fail others. For a verified Skills Passport credential, take the full 50-item assessment.

See the scoring methodology for how cognitive-reasoning scores map onto the AIEH 300–850 Skills Passport scale.

Key concepts

  • Bayes’s theorem. The mathematical statement of how prior probability and likelihood combine to produce posterior probability: P(D|+) = P(+|D)·P(D) / [P(+|D)·P(D) + P(+|¬D)·P(¬D)]. The base-rate-neglect failure mode is what happens when respondents skip the prior term.
  • Natural-frequency framing. Gigerenzer & Hoffrage 1995 documented that the same Bayesian problem is substantially easier when stated in natural-frequency format (“100 of 100,000 people have the disease”) than in conditional-probability format (“the prevalence is 0.1%”). Strong respondents reformulate to natural frequencies as a problem-solving heuristic.
  • Positive predictive value (PPV). The clinical-reasoning name for the post-test probability of disease given a positive test. PPV is what the base-rate-neglect item asks about; sensitivity and specificity are different conditional probabilities that PPV depends on but is not identical to.
  • Schmidt & Hunter 1998 cognitive-ability validity. The meta-analysis establishing that cognitive-ability measures predict job performance with corrected validity ~0.65 for complex professional roles. Bayesian-reasoning items are a high-fidelity applied-skill probe of the underlying construct.

For context on cognitive-ability validity in hiring, see the cognitive ability in hiring overview, the skills-based hiring evidence page, and the hiring bias mitigation page for how base-rate-neglect interacts with assessment fairness.


Sources

  • Carroll, J. B. (1993). Human Cognitive Abilities: A Survey of Factor-Analytic Studies. Cambridge University Press.
  • Casscells, W., Schoenberger, A., & Graboys, T. B. (1978). Interpretation by physicians of clinical laboratory results. New England Journal of Medicine, 299(18), 999–1001.
  • Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80(4), 237–251.
  • Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274.

Try the question yourself

This explainer covers what the item measures. To see how you score on the full cognitive reasoning family, take the free 5-question sample.

Take the cognitive reasoning sample