This post is for the measurement methodologists in the house, although the study results have real implications for anyone reading macro-level studies of healthcare services and economics. Sure, this is a bit of inside baseball, but it involves a fundamentally important issue at the center of healthcare research and policy that relies on Medicare claims data (the billing codes used to apply for reimbursement).
First things first: myocardial infarction is a major public health issue, with more than 700,000 events each year in the US, but the question of how well surveillance systems support research has implications beyond this one condition. This is about the quality of the science we use to drive healthcare decision making. And as it turns out, researchers and their audiences may want to think twice before blindly accepting analyses of billing data.
Why should anyone care about the accuracy of how we decide who is a case (who has the disease)? It boils down to what epidemiologists, data scientists, and their measurement-nerd kin refer to as “information bias,” or more specifically, measurement bias. Systematic differences in disease detection rates can have severe consequences for the findings of a research program, creating apparently significant results where none exist and obscuring real ones. Were there such a thing, one of the commandments of healthcare services research would be: thou shalt test thy measurement strategy.
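To make that concrete, here is a minimal, purely hypothetical simulation (not based on any of the studies discussed here) showing how unequal detection alone can manufacture an apparent difference between two groups that have identical true event rates. The group labels, rates, and sensitivities are all made up for illustration.

```python
# Toy sketch: two groups with the SAME true MI rate, but a surveillance
# system that detects cases with different sensitivity in each group.
# The observed rates differ even though the underlying truth does not.
import random

random.seed(42)

TRUE_RATE = 0.10                       # identical true event rate in both groups (assumed)
SENSITIVITY = {"A": 0.90, "B": 0.50}   # hypothetical differential detection
N = 100_000                            # people per group

for group, sens in SENSITIVITY.items():
    detected = 0
    for _ in range(N):
        has_mi = random.random() < TRUE_RATE
        # A true case is only recorded if the surveillance system catches it;
        # false positives are ignored to keep the sketch minimal.
        if has_mi and random.random() < sens:
            detected += 1
    print(f"Group {group}: observed rate = {detected / N:.3f} (true rate = {TRUE_RATE})")

# Typical output: Group A ~0.090, Group B ~0.050 -- an apparent two-fold
# difference produced entirely by unequal case detection.
```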
As far back as 1990, the Institute of Medicine released a report outlining the strengths and limitations of using Medicare claims data. It painted a rosy picture of where the research was headed. However, because of limitations in accuracy and overall data quality, the authors of that report also described the purpose of such research as limited to hypothesis generation and called for further work to improve “the database and the training of individuals capable of using it.”
Heart disease is the leading cause of death [pdf] for both men and women in the US, and there is still a lot of work to do in understanding the origins and impacts of the disease. However, reliable surveillance systems must be shown to be (equally) effective at capturing cases and non-cases if we are going to study and learn about what factors are driving this horrible condition. Otherwise, the reported findings could (some might say should) be called into question.
A group of researchers, Colantonio and colleagues, participating in the Reasons for Geographic and Racial Differences in Stroke (REGARDS) study recently set out to evaluate their surveillance approach by comparing two different methods for detecting myocardial infarction (MI). Their primary method was direct contact with their study cohort combined with follow-up medical chart reviews; their secondary method was the billing codes from participants’ healthcare encounters.
They compared the ability of the Medicare claims to identify the MIs found by their study team. For the most part, their analysis relied on the sensitivity and positive predictive value (PPV) of Medicare claims relative to the cases obtained through the study’s research procedures (a quick sketch of both metrics follows the list below).
- Sensitivity was used to measure the proportion of cases identified in the REGARDS study that were detected by the claims data.
- PPV was used to measure the proportion of cases identified by the Medicare claims that were also detected by REGARDS study procedures.
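Both metrics fall out of the same cross-classification of claims-flagged events against the study’s adjudicated events. Here is a minimal sketch treating the REGARDS adjudication as the reference standard; the counts are arbitrary placeholders, not figures from the paper.

```python
# Sensitivity and PPV from one cross-tabulation of claims vs. reference standard.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of reference-standard cases that claims also flagged."""
    return true_pos / (true_pos + false_neg)

def ppv(true_pos: int, false_pos: int) -> float:
    """Proportion of claims-flagged cases confirmed by the reference standard."""
    return true_pos / (true_pos + false_pos)

# Placeholder counts for illustration only:
tp = 80   # MIs detected by both REGARDS and claims
fn = 20   # MIs detected by REGARDS but absent from claims
fp = 10   # MIs flagged in claims but not confirmed by REGARDS

print(f"Sensitivity: {sensitivity(tp, fn):.1%}")  # 80.0%
print(f"PPV:         {ppv(tp, fp):.1%}")          # 88.9%
```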
The authors also compared their original study findings to analyses restricted to the MI events detected by the REGARDS study procedures, by the Medicare claims data, or by both combined.
The REGARDS study procedures detected far more MI cases than the Medicare claims data. The sensitivity of Medicare claims was deemed ‘low’ at just 49.0%! That means Medicare data captured only about half of the cases identified by directly contacting the subjects in REGARDS. If the researchers used only the first-listed code in the Medicare record, the sensitivity dropped even lower, to 40.1%.
The Medicare claims data also contained cases not detected by the REGARDS study procedures, although there were fewer of these. If an MI was listed in the primary position of the billing codes (i.e., as the primary reason for the visit), there was an 89.7% chance that it was also detected by the REGARDS study (the PPV). If a diagnosis of MI appeared in any of the included codes, the PPV dropped to 84.3%. Both the sensitivity and PPV findings mirrored those of at least two other large cohort studies to which the authors compared REGARDS.
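For a sense of scale, here is a back-of-envelope calculation applying the reported any-position figures (sensitivity 49.0%, PPV 84.3%) to a made-up round number of adjudicated events; the cohort size is hypothetical and only the percentages come from the study.

```python
# Rough arithmetic showing what the reported percentages imply.

regards_mis = 1_000        # hypothetical count of REGARDS-adjudicated MIs
sens, ppv = 0.490, 0.843   # reported any-position sensitivity and PPV

caught_by_claims = regards_mis * sens              # ~490 events found by both systems
missed_by_claims = regards_mis - caught_by_claims  # ~510 events claims never see

# If those ~490 shared events are 84.3% of everything claims flagged, the claims
# data also contain events that REGARDS did not confirm:
total_claims_cases = caught_by_claims / ppv                  # ~581 claims-flagged MIs
unconfirmed_claims = total_claims_cases - caught_by_claims   # ~91 extra events

print(f"Missed by claims:        {missed_by_claims:.0f}")
print(f"Flagged only in claims:  {unconfirmed_claims:.0f}")
```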
Worse, the sensitivity and PPV varied across groups, including by gender, by the type of MI being coded, and by where the MI occurred. However, there was some good news. In spite of these shortcomings, the findings of associations between MI and various characteristics remained more or less the same under any of these surveillance combinations, or at least did not differ significantly.
Other projects have performed this kind of surveillance system evaluation, testing the ability of claims data to detect MI events in their samples. A study of three European countries validated a randomly selected sample and found that the type of diagnostic code used affected the ability to accurately match records, with “best-case scenario” PPVs ranging from 100% for ICD-10, to 96.6% for ICD-9, to 20-60% for free-text coding. In the US, the Women’s Health Initiative demonstrated that agreement between MI diagnoses in their study and in Medicare claims data was far from perfect, but far better than chance. The Cardiovascular Health Study similarly tested Medicare claims against their system for direct collection of MI events. The investigators showed that the claims data had a PPV of 90.6%, but the sensitivity was only 53.8%, similar to the findings of Colantonio and colleagues.
Clearly, neither approach is perfect. Using multiple methods seems to be an improvement over relying on any single one. That said, the low sensitivity identified in this research and reported by other large cohort studies is a cause for concern. While much higher, a PPV around 90% still raises questions about a study’s ability to capture all of the outcomes of interest. The recent finding that these metrics may vary between demographic groups is especially troubling, since it raises the specter of systematic bias.
The authors relied on just two of the many metrics available for assessing accuracy, although both are critical tests of a successful surveillance system. It is also worth considering that because claims data is used for billing, codes that do not change the reimbursement rate or “bundle” associated with a visit may never be included for that encounter.
Fundamentally, if either system can find cases that were not detected by the other, then what exactly is the “gold standard” supposed to be? According to these findings and those from the CHS and WHI, research looking for differences in MI rates and associations with risk factors can be safely performed with either strategy. This supports a common claim from researchers that limitations in the method used to identify and collect morbidity data shouldn’t affect the results of their research. Still, it never hurts to check!