In Part 1 of this two-part series (originally published Aug. 19, 2021), we lay out arguments for and shortcomings of imputing race/ethnicity from the perspective of health equity. In Part 2, we’ll talk about evidence gaps and research needed, as well as a few alternative approaches.
The Biden administration is focusing on health equity and improved data collection to measure and analyze disparities and inequities. Imputation is a method of inferring or assigning values, or a vector of probabilities, to missing data. How does imputing race and/or ethnicity fit with the administration’s efforts, as well as the broader reckoning with the racial equity imperative?
We (researchers, policymakers) often want to know every person’s race/ethnicity because it helps us understand and track quality, costs, access to care, and outcomes for different groups. For instance, many have asked for COVID-19-related data to be released by race/ethnicity, so we can measure the disproportionate impact on Black, Indigenous, and other People of Color (BIPOC) communities. Moreover, the Federal government requires that various entities collect and report race and ethnicity data. Recent efforts have focused on reporting quality measures and other outcomes stratified by race/ethnicity. However, people sometimes leave race/ethnicity questions blank on surveys, and they are more likely to do so if they don’t feel they fit into any of the answer categories.
Beyond Black and White
Race/ethnicity have historically been measured in a variety of ways. Since race and ethnicity are social constructs, rather than biological ones, definitions are bound to change. In earlier years, races and ethnicities aside from white and Black were measured inconsistently, if at all. In fact, Social Security categories for race were just Black, White, and “Other” until 1980.
From 1790 to 2020, every US Census has asked about race – using different categories nearly every time. Here’s one illustration: the Census categorized people from the Indian subcontinent as “Hindu” from 1920–1940, as “White” in the next three censuses, and as “Asian” since 1980.
In another example, some people of Middle Eastern and North African (MENA) descent have lobbied to be included in the “White” category on the Census. Others from MENA communities have lobbied to have their own category, arguing that being lumped into the white category erases their community.
A small study in two diverse clinics nearly 20 years ago found that “many patients became angry when asked about race/ethnicity, and some did not understand the question… many respondents identified with a national origin instead of a race or ethnicity.” Is it any wonder that the government’s standard categories, as seen on many surveys, are contentious? Relying on self-reporting thus means dealing with under-reporting and missing data. Increasing self-reporting takes rebuilding broken trust, which is not a quick or easy thing.
Imputing Missing Race/Ethnicity Data Is a Long-Established and Common Practice
Attempts to address missing data began as early as the 1950s. Early approaches assigned the mean or mode of the observed values to the missing data. In more complex analyses, researchers used variables that weren’t missing to predict the missing values, in what is known as single regression imputation.
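To make these older approaches concrete, here is a minimal Python sketch of mean imputation and single regression imputation; the data and column names are invented for illustration.

```python
# A minimal sketch of mean imputation and single regression imputation.
# The DataFrame and column names are toy examples, not a real dataset.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age": [34, 51, 29, 62, 45],
    "income": [52_000, None, 38_000, None, 61_000],
})

# Mean imputation: fill every missing income with the observed mean.
df["income_mean_imputed"] = df["income"].fillna(df["income"].mean())

# Single regression imputation: predict missing income from a non-missing
# covariate (age), then fill in only the rows that were missing.
observed = df[df["income"].notna()]
model = LinearRegression().fit(observed[["age"]], observed["income"])
predicted = pd.Series(model.predict(df[["age"]]), index=df.index)
df["income_reg_imputed"] = df["income"].fillna(predicted)
```

Note that both approaches yield a single “completed” dataset and ignore the uncertainty introduced by imputation, a limitation the multiple imputation methods discussed below were designed to address.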
Some Federal data resources use hot-deck imputation. This approach involves imputing data by randomly selecting the value from a similar record. The Medical Expenditure Panel Survey (MEPS), for example, imputes missing data on income and employment in this manner, but not on disability or race/ethnicity. For race/ethnicity, MEPS creates edited/imputed versions of the race/ethnicity indicators, filling in from other data sources (where available) and the race/ethnicity of family members.
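A bare-bones sketch of the hot-deck idea might look like the following; the grouping variable and values are invented stand-ins, not MEPS’s actual procedure.

```python
# A toy sketch of hot-deck imputation: each missing value is filled by
# randomly drawing a donor value from respondents in the same cell
# (here, an age group). Illustrative only, not MEPS's actual procedure.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age_group": ["18-34", "18-34", "35-64", "35-64", "35-64"],
    "income": [40_000, None, 55_000, 72_000, None],
})

def hot_deck(group: pd.Series) -> pd.Series:
    # Donors are the non-missing values within the same cell; a real
    # implementation would also handle cells with no donors.
    donors = group.dropna().to_numpy()
    return group.apply(lambda v: rng.choice(donors) if pd.isna(v) else v)

df["income_imputed"] = df.groupby("age_group")["income"].transform(hot_deck)
```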
Methods Have Evolved Over the Last Few Decades
In recent years, multiple imputation by chained equations (MICE) has addressed key limitations of single regression approaches. MICE uses information from multiple regression models and random, bootstrapped samples. Bayesian and random forest-based regression approaches have also shown promise in reducing misclassification bias.
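As a rough illustration of the chained-equations idea (not a recommended recipe), scikit-learn’s IterativeImputer, which is modeled on MICE, can generate several differently completed datasets from random draws:

```python
# A sketch of chained-equations imputation with scikit-learn's
# IterativeImputer, which is modeled on MICE. The toy matrix and the
# number of imputations are arbitrary illustrative choices.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [34.0, 52_000.0, 120.0],
    [51.0, np.nan, 135.0],
    [29.0, 38_000.0, np.nan],
    [62.0, np.nan, 150.0],
    [45.0, 61_000.0, 128.0],
])

# sample_posterior=True draws each imputed value from its predictive
# distribution, so different seeds yield different completed datasets.
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
# In a full analysis, each completed dataset is analyzed separately and the
# results pooled (e.g., via Rubin's rules) to reflect imputation uncertainty.
```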
A complete description of approaches to imputation is beyond the scope of this blog post. However, it’s worth noting that modern approaches often generate estimated probabilities for statistical modeling, rather than assigning people to specific categories. This avoids the potentially problematic issue of directly assigning people to the wrong categories. However, when using these probabilities in models, their coefficients cannot be interpreted the same way as the coefficients estimated with categorical race/ethnicity data.
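As a simple illustration of the difference, with entirely invented numbers: group-level estimates can weight each person’s outcome by their membership probabilities instead of assigning anyone to a single category.

```python
# A toy illustration of using estimated probabilities instead of hard
# labels: each person's outcome is weighted by their probability of group
# membership. All numbers are invented for illustration.
import numpy as np

# Per-person membership probabilities for two hypothetical groups
# (each row sums to 1) and a toy binary outcome (e.g., received a service).
p = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
])
outcome = np.array([1.0, 0.0, 1.0, 0.0])

# Probability-weighted group means: sum_i p[i, g] * y[i] / sum_i p[i, g].
group_means = (p * outcome[:, None]).sum(axis=0) / p.sum(axis=0)
print(group_means)  # one estimate per hypothetical group
```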
An argument can be made that, done correctly, imputation is imperfect but better than nothing. It reduces bias and variance and improves the quality of the data. Multiple imputation methods also account for uncertainty in the imputed data. Indirect estimation is certainly less “burdensome” – from the government’s perspective – than gathering this information directly.
Shortcomings of Imputing Race/Ethnicity from a Health Equity Perspective
From a health equity perspective, however, it is worth digging deeper. Can a statistical model actually be constructed to predict race/ethnicity that satisfies different kinds of validity – including face validity, construct validity, replicability, and predictive validity?
One major issue with imputation is that it assumes the data are missing at random and/or that the missing values follow the same patterns as the nonmissing data. However, research shows that people who do not volunteer identification data tend to come from underrepresented groups.
Statistically speaking, imputing race/ethnicity introduces misclassification bias, which is particularly problematic in this context. If we assess the impact of the healthcare system on health outcomes through stratification by race/ethnicity, using an algorithm that introduces bias into a variable so closely related to our outcome(s) of interest seems ill-advised unless it corrects more bias than it introduces.
Ethics and Identities
Ethically, we should be concerned about filling in information that has been withheld deliberately. For example, someone who agreed to provide sensitive financial or health data may not have done so if plans to impute race or ethnicity to their data were disclosed. Choosing not to answer is a valid response category. Imputation should only be done for truly missing responses.
So much of an individual’s experience of the healthcare system can be shaped by their race/ethnicity because of systemic bias and structural racism. Racial/ethnic identities are also associated with the effects of segregation, a relative lack of generational wealth, and many other things. Race and ethnicity are very different from other kinds of characteristics that could be imputed, like cholesterol levels.
Race/ethnicity are essential parts of our identities, our cultures, and our experiences. Understandably given historical precedents, minority racial and ethnic identities also may be correlated with mistrust of the medical profession and mistrust of government. Racism and stigma are – independent of economics – factors in the care that people receive. For example, care providers often underestimate and undertreat the physical pain felt by Black people. Those with chronic illnesses face additional stigma that worsens their quality of life.
Given this, some argue that self-report is the only standard for personal identification – not a benchmark for validation, nor merely the best of many ways to determine a person’s identity. Algorithms that impute racial/ethnic data could exacerbate racial/ethnic biases in clinical decision-making and public policy-making. If imputed race and ethnicity variables do not accurately predict actual race and ethnicity, the conclusions policymakers draw from the imputed data could lead to misinformed policy choices that harm BIPOC populations.
Under-representation
The imputation methodologies currently in use, by their very nature, perpetuate underrepresentation. Less-represented identifications are going to be less likely to be assigned (by definition) and BIPOC representation will continue to suffer. This seems backward: shouldn’t the point be to understand the experiences of those least likely to be identified? Echoing the language of the disability-rights movement – “nothing about us without us” – how can we help inform good policy without good data on those who are known to experience worse care and outcomes?
For example, electronic health records are the source of race/ethnicity in some cancer registries. One recent study found that American Indians were frequently miscategorized in those registries as white. Similarly, death certificates often misidentify individuals with multiple racial/ethnic identities and Indigenous people. Some healthcare facilities are better than others in collecting race/ethnicity accurately. In these cases, how are we to analyze the care provided to under-represented groups if our data misidentifies them?
In another example, researchers used people’s surnames and where they live to assign probabilities for different races/ethnicities. The method of using surnames has obvious shortcomings: people change their surnames at marriage, can be adopted by parents of a different racial identification, and so on. The researchers note that the accuracy of their approach ranges from 88% to 95% for Hispanic, Black, white, and Asian/Pacific Islander people. However, American Indian/Alaska Native and multiracial people had much lower correlations between imputed and self-reported information: 12% to 54%.
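For readers curious about the mechanics: the method described here, Bayesian Improved Surname Geocoding (BISG), combines surname-based and geography-based information through Bayes’ rule. The sketch below uses tiny made-up lookup tables in place of the Census surname file and neighborhood composition data.

```python
# A toy sketch of the BISG calculation: the posterior probability of each
# race/ethnicity given surname and location is proportional to
# P(race | surname) * P(location | race). The lookup tables below are
# invented stand-ins for the Census surname list and tract-level data.
import numpy as np

GROUPS = ["White", "Black", "Hispanic", "API"]

# P(race | surname), as might come from the Census surname file (toy values).
P_RACE_GIVEN_SURNAME = {
    "GARCIA": np.array([0.05, 0.01, 0.92, 0.02]),
    "SMITH":  np.array([0.70, 0.23, 0.03, 0.04]),
}

# Values proportional to P(tract | race): each group's share of its
# national population living in the tract (toy values).
P_TRACT_GIVEN_RACE = {
    "tract_A": np.array([0.8, 1.5, 0.6, 0.4]),
    "tract_B": np.array([1.1, 0.4, 1.6, 2.0]),
}

def bisg(surname: str, tract: str) -> np.ndarray:
    """Return a normalized posterior probability vector over GROUPS."""
    posterior = P_RACE_GIVEN_SURNAME[surname] * P_TRACT_GIVEN_RACE[tract]
    return posterior / posterior.sum()

print(dict(zip(GROUPS, bisg("GARCIA", "tract_B").round(3))))
```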
The shortcomings of imputation approaches could be magnified with more use of algorithms. Algorithmic bias can be hard to detect and understand. We need more research on the implications of this issue.
Conclusions
Take-away messages: Don’t impute race/ethnicity crudely or thoughtlessly; think carefully about the validity of your models. Doing it “wrong” has potential repercussions, and a better-than-nothing approach could be dangerous. Poor care and poor outcomes stemming from systemic racism can hardly be mitigated by math. The onus needs to remain on improving the collection of self-reported data on race and ethnicity, as well as other relevant factors of interest.
Imputing missing race/ethnicity information is routinely done – but just because it’s common doesn’t mean it’s right. Many practices that were once common in health and medicine have gone by the wayside. Someday, imputing race/ethnicity may be seen as another archaic practice from a less-enlightened era.
The question of how to handle missing data on race/ethnicity is not a simple matter. In Part 2, we will talk about the need for more and better data to inform health equity analyses. We will also discuss where we could be heading as a field, including approaches involving population-level and neighborhood-level data.
I agree that self-reported race/ethnicity is the gold standard and that its collection needs to be improved. I also agree that race/ethnicity should not be imputed crudely or thoughtlessly. Accurate forms of imputation, when limited to inferences about groups for which there is high demonstrated accuracy, can measure disparities more accurately than approaches that are limited to data with high rates of selective nonresponse or administrative error. Carefully applied imputation can complement and enrich the insights from accumulating self-reported data. Below I elaborate on and clarify several points raised in this blog post.
Haas et al. (2019) describe the MBISG 2.0 method, which uses a variety of administrative data elements reported to CMS and SSA, including but not limited to first and last names and residential address. The current version, MBISG 2.1, has 96-99% concordance with self-reported race/ethnicity for Asian/Pacific Islander, Black, Hispanic, and White race and ethnicity (Martino et al., 2021). MBISG probabilities can be used via multiple imputation, and under some circumstances via linear regression. Using probabilities avoids the bias in disparity estimates that results when any method with less than 100% concordance uses classifications rather than probabilities (McCaffrey & Elliott, 2008; Grundmeier et al., 2015).
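To illustrate the multiple-imputation route with a generic sketch (this is not CMS’s implementation, and all numbers are invented): a category can be drawn from each person’s probability vector in each of M completed datasets, and the estimates pooled across draws.

```python
# A generic sketch of multiple imputation from probability vectors: in each
# of M completed datasets, draw a category per person from their probability
# vector, compute group estimates, and pool across draws. All values are toy.
import numpy as np

rng = np.random.default_rng(2021)
groups = ["API", "Black", "Hispanic", "White"]

p = np.array([          # each row sums to 1
    [0.05, 0.80, 0.05, 0.10],
    [0.70, 0.05, 0.05, 0.20],
    [0.02, 0.03, 0.90, 0.05],
    [0.10, 0.10, 0.10, 0.70],
])
outcome = np.array([1.0, 0.0, 1.0, 1.0])  # toy binary quality measure

M = 50  # number of imputed datasets
estimates = np.full((M, len(groups)), np.nan)
for m in range(M):
    assigned = np.array([rng.choice(len(groups), p=row) for row in p])
    for g in range(len(groups)):
        if (assigned == g).any():
            estimates[m, g] = outcome[assigned == g].mean()

# Pool the per-group estimates across imputations; Rubin's rules would also
# combine within- and between-imputation variance (omitted for brevity).
pooled = np.nanmean(estimates, axis=0)
```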
MBISG guidance is that imputed race/ethnicity should be used only to make inferences about groups to measure and improve equity, and that only direct self-report should be used to make attributions to individuals. The use of probabilities recognizes and respects the difference between self-report and imputation. Notably, in many applications of methods such as MBISG, race/ethnicity information has not been “withheld deliberately,” but rather a respondent was not allowed to self-report their race/ethnicity using categories other than “Black,” “White,” and “Other.” In such cases, imputation seeks to align responses with what people select themselves when allowed a full set of responses.
Grundmeier RW, Song L, Ramos MJ, Fiks AG, Pace WD, Fremont A, Elliott MN, Wasserman RC, Localio AR. Imputing missing race/ethnicity in pediatric electronic health records: reducing bias with use of US Census location and surname data. Health Serv Res. 2015; 50(4): 946-960.
Martino SC, Elliott MN, Dembosky JW, Hambarsoomian K, Klein DJ, Gildner J, Haviland AM. Racial, Ethnic, and Gender Disparities in Health Care in Medicare Advantage. Baltimore, MD: CMS Office of Minority Health; 2021. Available at: https://www.cms.gov/About-CMS/Agency-Information/OMH/research-and-data/statistics-and-data/stratified-reporting.
McCaffrey D, Elliott MN. Power of tests for a dichotomous independent variable measured with error. Health Serv Res. 2008;43(3): 1085-1101.
Steven Martino
Properly used, imputation is a tool to identify and address both health inequities and algorithmic bias. Much like the recommended uses of MBISG to accurately measure quality of care by race and ethnicity, the Census Bureau uses imputation at the group level to ensure that there isn’t systematic underrepresentation of groups that include Black and Hispanic people; see E. Bazelon and M. Wines, New York Times, 8/15/21 [https://www.nytimes.com/2021/08/12/sunday-review/census-redistricting-trump-immigrants.html]. The Census Bureau seeks and uses direct person-level data where possible. But they also know that stopping there would lead to systematic undercounting and underrepresentation of underserved groups and the misallocation of federal resources in a way that would disadvantage many groups, including Black and Hispanic Americans. One of the main reasons we know how America’s population has changed in the last 10 years is that the Census Bureau uses indirect methods, such as imputation, alongside direct self-report to ensure accurate and equitable overall measurement.
John Adams, Kaiser Permanente Center for Effectiveness & Safety Research
In my group’s work, we use the BISG method in almost every multivariate analysis we do. We do this even though we have very good race and ethnicity self-report (typically more than 85% complete). We use self-report where we have it and fill in the BISG probabilities where we don’t. This is the best of the available practical alternatives. The two other possibilities are complete case analysis and using an “unknown” category. The perils of complete case analysis are well catalogued in the literature. An unknown category is functionally equivalent to using an ordinary missing indicator, a method with known flaws.* Of course, sensitivity analysis and consideration of fitness for the decision-making purpose are essential.
* Greenland S, Finkle WD. A Critical Look at Methods for Handling Missing Covariates in Epidemiologic Regression Analyses. Am J Epidemiol. 1995;142(12):1255-1264. https://doi.org/10.1093/oxfordjournals.aje.a117592
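A minimal sketch of the fill-in approach Adams describes, with hypothetical column names and a placeholder standing in for a real BISG lookup:

```python
# A toy sketch of combining self-report with BISG: self-reported categories
# become one-hot probability vectors; BISG-style probabilities fill the
# gaps. Column names and the placeholder lookup are hypothetical.
import numpy as np
import pandas as pd

GROUPS = ["White", "Black", "Hispanic", "API"]

def bisg_probabilities(surname: str, tract: str) -> np.ndarray:
    """Placeholder for a real BISG lookup (returns toy uniform values)."""
    return np.full(len(GROUPS), 1.0 / len(GROUPS))

def race_probabilities(row: pd.Series) -> np.ndarray:
    if pd.notna(row["self_report"]):
        vec = np.zeros(len(GROUPS))      # self-report is taken as certain
        vec[GROUPS.index(row["self_report"])] = 1.0
        return vec
    return bisg_probabilities(row["surname"], row["tract"])

df = pd.DataFrame({
    "self_report": ["Black", None, "Hispanic"],
    "surname": ["SMITH", "GARCIA", "LOPEZ"],
    "tract": ["A", "B", "C"],
})
probs = np.vstack([race_probabilities(row) for _, row in df.iterrows()])
```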