Surveying Health Outcomes That Patients Are Most Qualified to Evaluate: Comments on Past, Present and Future Methods

By John Ware | February 3, 2025

More than 50 years ago, while training to be a psychometrician, I was blessed (or cursed) with a favorable response to a research proposal to improve the tools that researchers increasingly used for surveying health outcomes but rarely evaluated psychometrically. Knowing little about health, I found the WHO’s 1948 definition of it very useful: “a state of complete physical, mental, and social well-being and not merely the absence of disease or infirmity.” Three points were particularly useful: health has distinct components, those components extend beyond the absence of sickness, and health is a complete state of well-being.

Skipping ahead and looking back, it is fair to say that results from the “psychometric era” of health survey research, launched in the U.S. in the 1970s and worldwide thereafter, support at least the physical and mental components of WHO’s definition of health. For psychometric methodologists reading this, research trends over the years also reveal consistency of content, psychometric support for various methods of satisfying scaling assumptions, and evidence that domain scores respond to differences in the underlying components that cause them.

This first blog in a planned series attempts to identify some of the lessons from prior work. This is not just because it is crucial to avoid perpetuating legacy shortcomings, but also because the lessons are highly relevant in the present context. Hopefully, these takeaways will be helpful in understanding the blogs to follow and in navigating among legacy and improved population and patient outcome survey methods.

A quick dive into tools for surveying health outcomes

The figure below classifies the contents of widely used health surveys constructed and scored using psychometric and/or utility methods. It includes the three most recent 6-domain PROMIS profiles differing in the number of items per domain and also utility measure advances. The rows list frequently surveyed health domains including physical, social and role functioning, mental health (primarily psychological distress), general health evaluation, pain, and vitality (primarily fatigue). The columns show the widely used health surveys (defined in the figure’s footnotes), listed in order of publication.

Not shown here are the different operational definitions, including functioning (what people are able to do), feelings (both ill- and well-being), and evaluations (excellent-poor). For example, the Sickness Impact Profile (SIP) is entirely behavioral, asking about “crying” and not “feeling like crying.” One of the reasons for measurement success is the use of all three approaches, which is particularly crucial for getting more information from single items. Other crucial considerations (e.g., ceiling effects) that are only hinted at in the domain labels will be the subject of future blogs. In the context of the apparent emerging conceptual consensus, it is noteworthy that NIH-sponsored PROMIS chose to represent five of the historically most frequently studied domains in three short forms.

Lessons from surveying health outcomes in two major studies

Figure. Comparison of the Generic Health Domains Represented in Legacy and Contemporary

Two of the columns show the health domains monitored in the famous 5-year Health Insurance Experiment (HIE) and the 4-year quasi-experimental Medical Outcomes Study (MOS). The need to quantify population health outcomes fueled these major studies to inform growing debates about the structure (providers, organizations, and finances) and quality of medical care processes.

From the HIE and MOS, health survey designers and methodologists learned valuable lessons about the feasibility, preferability, and reduced administration costs of self-administered surveys. We also learned that baseline health survey measures are among the most valid predictors of health care costs, job loss, and mortality.

Evolutions in health domains and short form surveys

The HIE arguably provided the most comprehensive general population outcome survey data of its time. From this study, classical factor analyses of correlations among 20 multi-item scales hypothesized to measure physical, mental, and social functioning and well-being yielded three principal components with patterns of rotated factor loadings consistent with WHO’s physical, mental, and social components. However, the absence of significant correlations between social measures (e.g., work, job, boss) raised questions about whether such reported social/role differences are, in fact, health-related. Results from other HIE analyses of social contact frequencies (e.g., visits to and by friends and family, phone contacts) also raised questions about whether they are health-related.

The transition from the much longer HIE survey booklets (hundreds of items per booklet annually), called a medical care outcomes “gold standard” in a recent Wall Street Journal editorial, also required more efficient items. HIE surveys relied upon the dichotomous physical and role disability items in use at that time despite their huge general population ceiling effects (e.g., 70%+ for a 25-item physical functioning, or PF, scale). The shorter 10-item MOS PF scale reduced the ceiling effect to 30%, and the MOS used a “shorter” HIE-influenced 149-item baseline survey covering only 40 domains. The 100+ HIE items for the mental and general distress and well-being domains were based upon experimental items from the 1976 HANES survey, long before “item banks.” In time, in the more practical MOS 4-year outcome survey, the 36 short-form (SF-36) items achieved psychometrically sound estimates of eight domains.
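To make the idea of a ceiling effect concrete, the following is a minimal Python sketch, assuming the common working definition of a ceiling effect as the share of respondents who obtain the highest possible scale score. The function name `ceiling_effect` and the scores are hypothetical and are not HIE or MOS data.

```python
def ceiling_effect(scores, max_score):
    """Return the fraction of respondents who obtained the maximum possible score."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_ceiling = sum(1 for s in scores if s == max_score)
    return at_ceiling / len(scores)


# Hypothetical scores (illustration only): ten respondents on a scale
# built from 25 dichotomous items, so the possible range is 0 to 25.
hypothetical_scores = [25, 25, 25, 25, 25, 25, 25, 20, 18, 11]

share = ceiling_effect(hypothetical_scores, max_score=25)
print(f"{share:.0%} of respondents are at the ceiling")
```

When most respondents sit at the ceiling, the scale cannot distinguish among them or register any improvement in their health, which is one reason a lower ceiling percentage is desirable.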

Game-changing discovery and application of summary health measures

During MOS tests of construct validity, our research team discovered physical and mental summary components capturing 80% of the reliable variance across eight domains, in support of WHO. This was a “game changer.” For the first time, the team scored two summary components, previously used only in validation, to simplify public reporting and interpretation of the most important medical care outcomes in health policy research and clinical trials. Also, because fewer items were necessary for estimating physical and mental health, this enabled a much shorter version of the tool. We released a new version with just 1-2 items per domain of health (i.e., the SF-12). Finally, improved social and role items that made explicit attributions to the reasons for functional limitations improved discrimination between physical and mental health outcomes.

The power of attributions to specific health limitations

In the MOS, when respondents were asked whether various role limitations (e.g., accomplishing less) were caused by physical or mental (emotional) health problems or both, and their answers were scored accordingly, the patterns of convergent and discriminant validity of the scores were completely reversed in psychometric and clinical tests. For example, multi-item role functioning scores for physical attributions correlated highly (r=0.76) with the physical component of health and weakly (r=0.24) with the mental component. The researchers observed the opposite pattern (r=0.29 with the physical component and r=0.87 with the mental component) for the same role limitations items making attributions to emotional problems.
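The reversal can be read directly from the four correlations reported above. The following Python sketch is only an illustration: it takes those reported values, identifies the component each scale correlates with most strongly (the convergent pattern), and squares each correlation to show the proportion of variance shared with each component. The dictionary layout and the function name `validity_pattern` are assumptions made for this example, not part of the MOS analyses.

```python
# Correlations as reported in the text above (role scale vs. summary component).
reported = {
    "role limitations, physical attribution": {"physical": 0.76, "mental": 0.24},
    "role limitations, emotional attribution": {"physical": 0.29, "mental": 0.87},
}


def validity_pattern(correlations):
    """Return the component with the highest correlation and r squared per component."""
    convergent = max(correlations, key=correlations.get)
    shared_variance = {name: round(r ** 2, 4) for name, r in correlations.items()}
    return convergent, shared_variance


for scale, correlations in reported.items():
    convergent, shared_variance = validity_pattern(correlations)
    print(f"{scale}: converges with the {convergent} component; r^2 = {shared_variance}")
```

Squaring the correlations makes the contrast easier to see: a correlation of 0.76 corresponds to roughly 58% shared variance, while 0.24 corresponds to about 6%.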

The series will continue

This evidence is of great importance in the context of the ongoing debate regarding how to construct and score widely used summary measures. A later blog will discuss whether we should construct physical and mental health summary measures to have correlations between scores > 0.60 or < 0.30, and what the implications are for interpreting their outcomes. The latest generation of MOS measurement model-based short forms that make this distinction for role and social domains will also be a future blog topic.

A blog will also address another implication of the discovery of the power of a simple change in words, attributing functional limitations or feelings of ill-being to a specific disease (e.g., asthma vs. osteoarthritis). Such changes have also turned out to be a game changer for improving the usefulness of surveying health outcomes that are disease specific. Disease-specific attributions can improve validity in responding to clinically defined differences in a specific condition, even for those with co-morbid conditions.

John Ware

Professor Emeritus, Population and Quantitative Health Sciences, UMass Medical School at John Ware Research Group
Dr. Ware is Professor Emeritus, Department of Population and Quantitative Health Sciences, UMass Chan Medical School. He studied psychology at Pepperdine University and psychometrics at Southern Illinois University. He is an internationally recognized and frequently cited developer and interpreter of patient-reported outcomes (PROs) and an elected member of the National Academy of Medicine. His noteworthy activities over the past 50 years include: leading the development of PROs in the 5-year RAND Health Insurance Experiment; serving as Principal Investigator for the 4-year quasi-experimental Medical Outcomes Study, for which he developed the SF-36 Health Survey; initiating and leading the International Quality of Life Assessment Project's SF-36 and SF-12 translations for clinical and population PRO research use worldwide; and founding NIH-sponsored small businesses that were among the first to apply “modern” psychometric methods to make generic and disease-specific PROs more efficient. His current work pursues the shortest possible (including single-item per domain/disease) comprehensive generic and disease-specific measures and their standardization and integration to make patient screening and outcomes monitoring more practical and useful.