Imputing Race & Ethnicity: Part 2

By Lisa M. Lines and Jamie Humphrey | August 26, 2021

Part 1 of this two-part series laid out arguments for and shortcomings of imputing race/ethnicity from the perspective of health equity. In this post, we’ll talk about gaps in the evidence and a few alternatives to imputation, including approaches involving population-level and neighborhood-level data.

Imputation is a common way to deal with “the missing-data problem.” However, few researchers have studied the implications of imputing race/ethnicity from a health equity perspective.

We need better data. How should we collect these data? What additional research could inform the next evolution in dealing with this perennial problem?

Even more importantly, researchers and policymakers alike need to examine their own internal assumptions. Are you using race/ethnicity as a proxy for something else? Do you have data available, or could you collect data, to measure that thing? Either way, be explicit about why you are including race/ethnicity in your models.

We Need More and Better Data

Discrimination in healthcare occurs on the basis of a million things: skin color and other physical characteristics; less-than-perfect English or stigmatized accents; sexual orientation; gender identity and presentation; poverty; lack of education or literacy; a history of substance use; disability; obesity; and various other health conditions. Truly, if the goal is health equity for all, we need to take all kinds of potential gaps into consideration.

One person of color interviewed on the topic of racial identification questions on a health-related survey said:

What is the point of asking me [my race on this survey]? If it is [about] experiencing discrimination, why aren’t you asking if I’ve experienced discrimination because of my race? Why does it even matter what race I am? If the point is to uncover quality [issues], then that should matter regardless of race.

Following this argument, collecting more specific data on inequities, mistreatment, and gaps in high-quality care should be a priority. Similarly, every survey instrument that asks about race/ethnicity ought to allow respondents to select “Prefer not to answer” as a valid response option.

Ask the Experts

What if we asked a racially diverse panel of individuals to weigh in on what should happen if they leave the race/ethnicity question blank on a survey? Shockingly, it does not appear that anyone has published research on this to date.

We should ask them, as the experts, to weigh in on current practices and approaches. How would they feel about being dropped from an analysis altogether? Being lumped into a missing/unknown or “Other” category? Being assigned race/ethnicity probabilities based on their last name and where they live?

We should also ask about the race/ethnicity questions themselves. How would they feel about open-ended race and ethnicity questions? What about asking people about their ancestry or family origins instead of their race/ethnicity? We need qualitative data to get more perspectives on this issue.

Consulting the Census

The Census has conducted extensive focus groups on race/ethnicity survey collection in recent years, but it’s not clear from the report [PDF] whether the participants shed light on these specific issues. Per a comment on Part 1, the Census used imputation to address missing race/ethnicity data in the most recent decennial count. According to their explainer: “…if race is reported for a parent, we could use that information to fill in their child’s missing race. If no information is available within the household, we would impute the information using data from similar nearby households.”
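To make the quoted two-step logic concrete, here is a minimal sketch of a household-first, then-neighbor donor rule. It is an illustration only, not the Census Bureau's actual procedure: the `Person` fields, the use of a shared `block_id` as a stand-in for "similar nearby households," and the placeholder label "A" are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    household_id: int
    block_id: int          # hypothetical stand-in for "nearby"
    race: Optional[str]    # None means the question was left blank


def impute_race(people: list[Person]) -> list[Optional[str]]:
    """Return one value per person, filling blanks only where a donor exists."""
    filled = []
    for p in people:
        if p.race is not None:
            filled.append(p.race)
            continue
        # Step 1: borrow a reported value from someone in the same household.
        donor = next(
            (q.race for q in people
             if q is not p
             and q.household_id == p.household_id
             and q.race is not None),
            None,
        )
        # Step 2: otherwise borrow from a different household on the same block.
        if donor is None:
            donor = next(
                (q.race for q in people
                 if q.block_id == p.block_id
                 and q.household_id != p.household_id
                 and q.race is not None),
                None,
            )
        # If no donor is found, the value stays missing (None).
        filled.append(donor)
    return filled


# Hypothetical toy data: "A" is a placeholder label, not a real category.
people = [
    Person(1, 10, "A"),
    Person(1, 10, None),   # filled from the same household
    Person(2, 10, None),   # filled from a nearby household
    Person(3, 20, None),   # no donor available, stays missing
]
print(impute_race(people))
```

Note that donors are always reported values, never previously imputed ones, so one guess cannot cascade into another. Even so, every filled value is a guess about how a person would have identified themselves.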

Data Linkages and Improved Enrollment

In addition to more qualitative data on this issue, we also need more quantitative data. Self-reported racial/ethnic identification data collected via surveys, clinical assessments (such as in nursing homes), registries, and electronic health records (EHRs) are linkable to administrative data. Ideally, we would have a national healthcare blockchain system (or at least interoperability) to protect privacy and confidentiality.

Failing that, the Federal government has some power to facilitate better data linkages and collection. For example, the Medicare enrollment form could collect more than just the basics. The Centers for Medicare & Medicaid Services (CMS) could link survey, assessment, registry, and EHR data with enrollment data to improve the accuracy of older, Social-Security-supplied race data. CMS could work with other Federal and state agencies to leverage both qualitative and quantitative approaches to improve race/ethnicity measurement.

Even so, some states limit when and how race and ethnicity can be collected. Others may push back against collecting such data for ideological reasons concerning the scope and limits of the government’s role in citizens’ lives. These and other obstacles to collecting race/ethnicity will continue to stymie efforts to promote health equity.

Alternatives to Imputation

Admittedly, we currently lack good alternatives to imputation in the analysis phase. This is one reason why the best alternative is to collect better data to begin with.

When running regression models in many software packages, individuals with missing predictor data are, by default, removed from the model. This is known as “complete-case analysis” or “listwise deletion.” This approach decreases information and sample size and can introduce bias if data are not missing completely at random [pdf]. Still, many studies in the literature have used this approach.

Another approach is to narrow the analysis to only people with known race/ethnicity, dropping all others or putting them in an “Other” category. While this sidesteps the missing data issue, it leads to a new problem: lack of generalizability. Similarly, creating a separate category of people with missing data makes it hard to draw conclusions about people in that category.
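As a rough illustration of the two approaches just described, the sketch below applies listwise deletion and a separate "Missing" category to a tiny, entirely hypothetical dataset. The group labels, outcome values, and function names are made up for this example. The people with missing group information are deliberately given different outcomes, so the data are not missing completely at random.

```python
from statistics import mean

# Hypothetical records: (group, outcome). A group of None means missing.
records = [
    ("A", 1.0), ("A", 2.0), ("A", 3.0),
    ("B", 5.0),
    (None, 7.0), (None, 9.0),
]


def complete_cases(rows):
    """Listwise deletion: drop every row with a missing predictor."""
    return [row for row in rows if row[0] is not None]


def missing_as_category(rows, label="Missing"):
    """Keep every row, but move missing values into their own category."""
    return [(g if g is not None else label, y) for g, y in rows]


def mean_by_group(rows):
    """Average outcome within each group."""
    groups = {}
    for g, y in rows:
        groups.setdefault(g, []).append(y)
    return {g: mean(ys) for g, ys in groups.items()}


kept = complete_cases(records)
print(len(records), len(kept))                 # sample shrinks from 6 to 4
print(mean([y for _, y in records]))           # overall mean, all rows
print(mean([y for _, y in kept]))              # overall mean, complete cases
print(mean_by_group(missing_as_category(records)))
```

In this toy example, dropping incomplete rows lowers the overall mean from 4.5 to 2.75, because the dropped people differed from everyone else. The "Missing" category keeps them in the analysis, but its average tells us nothing about which racial/ethnic groups those people belong to.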

More Alternatives

For research in which proportional representation is a primary concern, we could consider designs that sample portions of more-represented groups, rather than add imputed data to less-represented groups. This approach would be based on an understanding that statistical results based on groups with greater representation – such as white males – would be robust even with fewer observations.

In a design based on relative representation, researchers could – based on reliable sources of population characteristics – under-sample over-represented groups while including all (or nearly all) of the least-represented group, most of the next-least represented group, and so on. We still assume that people from underrepresented groups who have missing racial/ethnic information are roughly the same (with regard to outcomes) as people from the same group whose information is reported, which may not be the case.

This design would not avoid the need for more, and more reliable, data on the health and health outcomes of underserved populations. However, it could allow for fairer proportional representation without requiring the imputation of identification data. This would also be a decision made at the design stage of the study, rather than the analysis phase.
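One way to sketch such a design is shown below. This is our own minimal illustration under stated assumptions, not an established procedure: the group labels, the sample counts, the population shares, and the function names `retention_targets` and `undersample` are all hypothetical. The idea is to find the largest sample that matches population proportions without adding anyone, then randomly subsample the over-represented groups down to their targets.

```python
import random
from fractions import Fraction


def retention_targets(sample_counts, population_shares):
    """How many people to keep per group so the kept sample is proportional.

    The total is capped by whichever group has the fewest observations
    relative to its population share; that group is kept in full.
    """
    max_total = min(
        Fraction(sample_counts[g]) / population_shares[g] for g in sample_counts
    )
    # int() truncates, so a target never exceeds what is available.
    return {g: int(max_total * population_shares[g]) for g in sample_counts}


def undersample(rows, targets, seed=0):
    """Randomly keep targets[g] rows from each group g."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_group = {}
    for row in rows:
        by_group.setdefault(row["group"], []).append(row)
    kept = []
    for g, members in by_group.items():
        kept.extend(rng.sample(members, targets[g]))
    return kept


# Hypothetical sample counts and population shares (exact fractions).
counts = {"A": 800, "B": 150, "C": 40}
shares = {"A": Fraction(3, 5), "B": Fraction(3, 10), "C": Fraction(1, 10)}

targets = retention_targets(counts, shares)
print(targets)  # {'A': 240, 'B': 120, 'C': 40}

rows = [{"group": g, "id": i} for g, n in counts.items() for i in range(n)]
print(len(undersample(rows, targets)))  # 400
```

In this example, all 40 people in group C are kept, along with 120 of the 150 in group B and 240 of the 800 in group A. The sketch only handles people whose group is known, so the caveat above about people with missing racial/ethnic information still applies.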

Place-based Measures

Another alternative is to use geographic, population-level data instead of, or in addition to, individual-level data. Measures could capture residential segregation and isolation, dissimilarity indices, and historical redlining, reflecting the place-based idea of health. The Census produces a number of these measures, along with a good explainer.

Area-based measures are often independently associated with health outcomes, and including them in analyses can support health equity work by better accounting for social factors.
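To make one of the measures named above concrete, here is a minimal sketch of the two-group dissimilarity index, computed as half the sum, across small areas, of the absolute difference between each area's share of group A and its share of group B. The tract counts below are hypothetical, and a real analysis would use published Census tract or block-group counts.

```python
def dissimilarity_index(tracts):
    """Two-group dissimilarity index for a larger area.

    tracts: list of (count_a, count_b) pairs, one pair per small area.
    Returns a value from 0 (both groups spread identically across areas)
    to 1 (the groups share no areas at all).
    """
    total_a = sum(a for a, _ in tracts)
    total_b = sum(b for _, b in tracts)
    return 0.5 * sum(abs(a / total_a - b / total_b) for a, b in tracts)


# Hypothetical two-tract examples.
print(dissimilarity_index([(100, 0), (0, 100)]))   # 1.0, fully segregated
print(dissimilarity_index([(50, 50), (50, 50)]))   # 0.0, evenly distributed
print(dissimilarity_index([(80, 20), (20, 80)]))   # about 0.6
```

The index is often read as the share of either group that would have to move to a different area for the two groups to be evenly distributed. Because it is computed for a place rather than a person, it does not require knowing any individual's race/ethnicity.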

Conclusions

Take-away messages: We need more and better data to prevent the need to impute in the first place. Also, we must not use race/ethnicity as a proxy for some other exposure or confounder.

Have you seen a better approach to handling missing data on race/ethnicity? Do you have input to share? The APHA Medical Care Section’s Health Equity Research Collective is interested in your perspective. You can leave a comment below or send an email to hltheq@outlook.com.

 

All opinions are the authors’. We gratefully acknowledge the following RTI staff for their helpful feedback on this series: Jane Allen, Dan Barch, Anupa Bir, Susan Haber, and Pam Spain.

 

Lisa M. Lines

Senior health services researcher at RTI International
Lisa M. Lines, PhD, MPH is a senior health services researcher at RTI International, an independent, non-profit research institute. She is also an Assistant Professor in Population and Quantitative Health Sciences at the University of Massachusetts Chan Medical School. Her research focuses on social drivers of health, quality of care, care experiences, and health outcomes, particularly among people with chronic or serious illnesses. She is co-editor of TheMedicalCareBlog.com and serves on the Medical Care Editorial Board. She served as chair of the APHA Medical Care Section's Health Equity Committee from 2014 to 2023. Views expressed are the author's and do not necessarily reflect those of RTI or UMass Chan Medical School.

Jamie Humphrey

Jamie Humphrey is a Health Geographer in RTI International’s Center for Health Analytics, Media, and Policy. She is also a Research Associate in Drexel University’s Dornsife School of Public Health. She has more than 10 years of experience using interdisciplinary quantitative methods to conduct innovative public health research. Using foundations from health geography as well as social and spatial epidemiology, her research is focused on neighborhoods and health and the intersection of social and environmental impacts on health and well-being in urban communities. She has published on topics including neighborhoods and child/adolescent health; measurement and identification of neighborhoods; the role of mobility in exposure to socioeconomic contexts; the intersection of indoor air quality, lung function, and socioeconomic status; climate change and indoor air quality; and the moderating impact of community-level violence on the air pollution-cardiovascular disease relationship.

One thought on “Imputing Race & Ethnicity: Part 2”

  1. Jess Williams

    Great post! Imputation, especially regarding race and ethnicity, is not something to take lightly. When I was in econ grad school I worked on a project collecting vital statistics about infant mortality before these records were digitized starting in the 1960s. Buried in the documentation of the early records (early 1900s) was a footnote that if race was missing from the birth certificate, the state recorder would assign the race of the certificate that came before it in the pile.
