Non-reporting and inconsistent reporting of race and ethnicity in articles that claim associations among genotype, outcome, and race or ethnicity
H Shanawani (1), L Dame (2), D A Schwartz (3), R Cook-Deegan (2)

  1. Department of Biostatistics and Research Epidemiology, Henry Ford Hospitals, Detroit, MI, USA
  2. Duke University Center for Genome Ethics, Law, and Policy, Durham, North Carolina, USA
  3. National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA

Correspondence to: Hasan Shanawani, Department of Biostatistics and Research Epidemiology, Henry Ford Hospitals, 1 Ford Place, 5C-69 Detroit, MI 48202, USA; hshanawani{at}


Background: The use of race as a category in medical research is the focus of an intense debate, complicated by the inconsistency of presumed independent variables, race and ethnicity, on which analysis depends. Interpretation is made difficult by inconsistent methods for determining the race or ethnicity of a participant. The failure to specify how race or ethnicity was determined is common in the published literature.

Hypothesis: Published articles do not report the criteria by which research participants are assigned to racial or ethnic categories.

Methods: We reviewed the methods used to assign race and ethnicity to research participants in 268 published reports of associations among race (or ethnicity), health outcome and genotype.

Results: Of the 268 published reports reviewed, it was found that 192 (72%) did not explain their methods for assigning race or ethnicity as an independent variable. This was despite the fact that 180 (67%) of those reports reached conclusions about associations among genetics, health outcome and race or ethnicity.

Conclusions: More attention needs to be given to the definition of race and ethnicity in genetic studies, especially in those diseases where health disparities are known to exist.

Different socially identified groups have different rates of diseases,1–4 leading to intense study, debate on the possible causes of such health disparities and controversy about the use of racial classification in medicine, biomedical research in general5–7 and genetics research in particular.8,9 Although some suggest that there may be no biological basis for race and therefore racial categories will be misleading in the search for biological differences underlying health disparities,10 others believe that race may correlate to some genetic factors associated with disease states, and that too much important information may be missed if race is not considered. Social epidemiologists and other investigators express concern that social factors in health disparity groups may be missed if race is not studied.11–14 Although many investigators pursue issues of racism and health inequality as the source of health disparities, they are also interested in genetic and other biological factors in causal networks associated with health disparities. Some investigators of genetics believe that it is crucial to continue to draw categories along racial lines15 and offer guidelines on how to assign membership of a research participant to a particular social group.

For research that seeks to understand the genetic and environmental causes of diseases where health disparities exist, generalisability and external validity are important goals.16,17 An important step towards attaining that external validity is the ability of other investigators to replicate studies reported in the scientific literature. We believe that ambiguous definitions of race coupled with generalisations made about race are a principal impediment to successful genetics research.

Investigators studying the use of race in biomedical research reports have found inconsistent reporting practices.18 Journals have been noted to have inconsistent editorial policies and variable enforcement of those policies.19,20 The definitions of terms used to refer to race and ethnicity have also varied from article to article in biomedical and social science journals.21,22 We are, however, unaware of any attempt to characterise the use of race in genetics research articles, where the use of race is, in our opinion, the most challenging and controversial.

We hypothesised that clinicians and research participants use disparate definitions of race, and that published reports do not state the criteria by which participants were assigned to a cohort. We set out to investigate the reporting practices of investigators who published findings on race, health outcome and genetic polymorphisms.


We completed a Medline search to identify articles investigating genetic phenomena and health outcomes that included keywords relating to “race” and “ethnicity”. Our Medline search included articles published from January 2003 to December 2003 (search completed in April 2004). To improve search sensitivity, we used a broad search strategy in Ovid (New York City, New York, USA) (fig 1). Articles were then obtained and individually reviewed. Articles were included if they identified a racial or ethnic cohort, a genetic procedure or test and an identifiable health outcome.

Figure 1: Ovid search strategy

We reviewed each article with a standard evaluation form prepared for this investigation. We recorded the terms used by articles that reported the study of any group and used a term commonly understood to refer to a race, ethnicity, tribe, or national or geographical origin. We did not attempt to distinguish whether the group terms used would be considered to refer to a race, ethnicity or other classification, as this was not the focus of our study. We recorded what groups were studied. We then recorded the reporting of any term referring to race, ethnicity or geographical origin in the Methods section of the published paper. If there was such a term, we recorded whether the Methods section reported any method to determine the participant’s inclusion or exclusion from the group referred to. We separately recorded if the method used self-identification or self-report as part of it. If the Methods section referred to a previous publication, we obtained that reference and studied it as part of the review of the paper found in the original search. Finally, we determined if any conclusion of the paper explicitly included a “three-way association” between the health outcome studied, the genetic phenomenon measured, and the racial, ethnic or geographical cohort studied.

In articles that reported a three-way association, we reviewed the original hypothesis to determine if the goals of the study specified the three elements that were the basis of the three-way association reported.

We also collected information on the genetic variants studied, type of study conducted, how participants or samples were obtained for the study, the funding source, geographical location of the first author of the paper, whether the journal was a genetics journal and whether the groups studied in the paper were explicitly referred to as “minority”.

We classified the journals found in our search as “genetics” journals or “non-genetics” journals. This determination was based on querying the title of the journal and whether it included a term “gene”, “genetics”, “heredity” or “mutation”.
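The title-based rule above can be written as a short function. The following Python sketch is illustrative only: the function name `is_genetics_journal` and the choice of case-insensitive whole-word matching are assumptions, as the text does not say how partial matches (for example, "gene" inside "general") were handled.

```python
import re

# The four title terms named in the text.
GENETICS_TERMS = {"gene", "genetics", "heredity", "mutation"}


def is_genetics_journal(title: str) -> bool:
    """Return True if the journal title contains one of the four terms.

    Assumption: terms are matched as whole words, ignoring case, so that
    "General" does not count as a match for "gene".
    """
    words = re.findall(r"[a-z]+", title.lower())
    return any(word in GENETICS_TERMS for word in words)


if __name__ == "__main__":
    for title in ("The American Journal of Human Genetics", "Neurology"):
        print(title, "->", is_genetics_journal(title))
```

Under a stricter or looser matching rule (for example, also accepting "genetic" or "genes"), some journals would be classified differently, so the rule chosen should be stated alongside the results.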

We recorded, as a yes/no variable, whether the Methods section explained how a participant’s membership of a particular racial or ethnic group was determined. If the answer was yes, we additionally recorded “how” as a categorical variable. We recorded where any three-way association was reported as both a binary (yes/no) variable and a categorical variable (“mention in title”, “abstract”, “other”, “not at all”). The data were stratified by whether the study was published in a genetics or non-genetics journal.
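The coding scheme described above can be pictured as one record per article. The Python sketch below is a hypothetical representation, not the authors' evaluation form: the class `ArticleRecord`, its field names, the helper `stratify` and the example records are all invented for illustration. Only the category labels for the location of the three-way association come from the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Category labels for where a three-way association appeared, as listed in the text.
LOCATIONS = ("mention in title", "abstract", "other", "not at all")


@dataclass
class ArticleRecord:
    genetics_journal: bool      # journal classified as "genetics" or "non-genetics"
    method_reported: bool       # yes/no: was a method of assignment described?
    method_how: Optional[str]   # categorical; set only when method_reported is True
    three_way_location: str     # one of LOCATIONS

    @property
    def three_way_reported(self) -> bool:
        # Yes/no view of the same variable: any location other than "not at all".
        return self.three_way_location != "not at all"


def stratify(records):
    """Count articles by (genetics_journal, method_reported)."""
    return Counter((r.genetics_journal, r.method_reported) for r in records)


# Hypothetical records, for illustration only (not data from the study).
example = [
    ArticleRecord(True, True, "self-identification", "abstract"),
    ArticleRecord(True, False, None, "mention in title"),
    ArticleRecord(False, False, None, "not at all"),
]
counts = stratify(example)
```

Keeping the yes/no variable as a derived property of the categorical one is a design choice made here so that the two cannot disagree.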

To determine whether our Medline search strategy missed articles that met our inclusion criteria, we hand searched five journals by using methods described previously.23–26 We selected The American Journal of Human Genetics; The American Journal of Medical Genetics; Cancer Epidemiology, Biomarkers, and Prevention; Neurology; and The American Journal of Epidemiology. We chose two genetics journals, two medical specialty journals and one epidemiology journal. These particular journals were selected because they represented the largest number of citations in our original search strategy.

The five journals selected accounted for 42 of the 268 articles from our original search. Our hand search of all titles and abstracts of articles published in those journals yielded an additional 25 articles, which were not found in our original Ovid search but would have been included in our study based on the title or abstract. To determine why these articles had not been found in our Medline search, we searched for those articles in Medline and found that they had no MeSH heading related to race or ethnicity, but did include the MeSH term “Geographic Locations”, which was not included in our original search strategy. This term is defined (PubMed) as follows:

 All of the continents and every country situated within, the UNITED STATES and each of the constituent states arranged by region, CANADA and each of its provinces, AUSTRALIA and each of its states, the major bodies of water and major islands on both hemispheres, and selected major cities. Although the geographic locations are not printed in INDEX MEDICUS as main headings, in indexing they are significant in epidemiologic studies and historical articles and for locating administrative units in education and the delivery of health care.

We used our Ovid strategy once again, including the exploded term “geographic locations” in our search. This more than doubled our citations (741 to 1571; search completed in September 2004). These new citations were not included in further analysis, as we believed that prevailing hypotheses of population genetics were not based on diplomatic or political borders between or within nations, which we understood to be the essence of this particular MeSH heading. Conversely, we noted whether the articles in our original strategy based their cohort on a municipality, citizenship in a country or political boundary other than tribal membership.

To gain qualitative insight and to supplement the literature review, HS conducted unstructured interviews on a convenience sample of nine authors in the US, who were selected from reviewed papers that did not specify their method of determining race or ethnicity.

This work was completed as part of a fellowship funded by an independent Duke Endowment as well as an institutional T32 grant.


We excluded 300 of the 568 articles identified by our initial Medline search (table 1). We excluded articles that did not report the study of a genetic phenomenon, or were review articles, meta-analyses, letters to the editor, editorials and news reporting in scientific journals, non-human studies, case reports, case series and articles in languages other than English. Ten citations were duplicates and were included only once. Our final review was of the remaining 268 articles.

Table 1: Excluded articles (n = 300)

We sought to determine if the study authors included an explanation on inclusion or exclusion criteria of their research participants, which could be tested for external validity—that is, were we to attempt to duplicate their study, could we establish their population frame based on their scientific report. Only 76 (28%) of the 268 articles in our analysis reported a method by which the research participant’s race, ethnicity or membership to another studied cohort was assigned (tables 2, 3). Of the remaining 192 articles, 113 referred to the population term that was the topic of their article in their Methods section, but did not report any method by which participants could be included or excluded from that population they were reporting to have studied; 61 articles had no term in their Methods section, but reported on race or ethnicity in the Results section; 12 articles referred to another paper, but the articles they referenced were reviewed and were also found to lack a described method for assigning race or ethnicity.

Table 2: Articles reporting a three-way association: location of three-way association and relationship to reported hypothesis of study

Table 3: Method cited in articles reporting a method of identifying race or ethnicity (n = 76)

Of the 268 articles, 180 (67%) reported a three-way association between (1) race, ethnicity or geographical origin, (2) a genetic variant and (3) a health outcome (table 4). Of these 180 articles, 126 did not describe how race or ethnicity was determined, and 137 (76%) reported the three-way association in the title or abstract of the paper (table 2), suggesting that it was an important finding.

Table 4: Distribution of articles reporting on race or ethnicity

Of the 180 articles, 76 (42%) articles did not explicitly frame a hypothesis that was the basis of their reported findings, suggesting that in at least some cases the findings were reported when carrying out subpopulation analysis, but studying racial or ethnic differences may not have been the original purpose.
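The percentages quoted in the Results up to this point follow directly from the reported counts. The Python sketch below recomputes them; the helper name `pct` and the variable names are assumptions, and rounding to the nearest whole per cent is assumed to match the reporting in the text.

```python
def pct(part: int, whole: int) -> int:
    """Percentage of `whole` represented by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)


identified = 568                         # articles found by the initial Medline search
excluded = 300                           # articles excluded (table 1)
reviewed = identified - excluded         # articles in the final review

method_reported = 76                     # described how race or ethnicity was assigned
no_method = reviewed - method_reported   # did not describe a method

three_way = 180                          # reported a three-way association
in_title_or_abstract = 137               # of those, reported it in the title or abstract
no_explicit_hypothesis = 76              # of those, framed no hypothesis for the finding

print("reviewed:", reviewed, "no method:", no_method)
print("method reported (%):", pct(method_reported, reviewed))
print("no method (%):", pct(no_method, reviewed))
print("three-way association (%):", pct(three_way, reviewed))
print("in title or abstract (%):", pct(in_title_or_abstract, three_way))
print("no explicit hypothesis (%):", pct(no_explicit_hypothesis, three_way))
```

Note that the denominators differ: the first three percentages are of all reviewed articles, whereas the last two are of the articles reporting a three-way association.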

The 76 articles that explained how race and ethnicity were assigned to participants were studied further (table 3). The most common method was self-identification, used in 25 (33%) of the 76 articles, 20 of which did so as part of initial recruitment or by offering a choice of categories on a questionnaire. The remaining five studies had open recruitment and did not specify the method other than it occurred during an interview.

Of the 76 studies, 5 (7%) used data found during a clinical record review of the research participant. An illustrative example is included:27

 For each patient, records were assessed for clinical status, date and place of birth, ethnic background, family history, and other affected family members. DM patients were grouped as followed: the Ashkenazi Jews originated from European countries (excluding the Balkan countries) and North America. The Sephardim/Oriental Jews originated from North Africa, Asia, and the Balkans and the Yemenite Jews who originated from Yemen. Patients, whose origin was different, such as India, were classified as others.

Four studies relied on language to frame their study population, although they did not clearly state “self-reported” language use. For example, one study28 reported that the DNA samples used were extracted from “tumor tissues … [obtained from] Xhosa-speaking esophageal cancer patients”. How the investigators determined language ability of the subjects was, however, never reported.

Of the 76 articles, 17 (22%) framed their study population by geography. For example:29

 The subjects in the study came from a non-urban community of the dikgale district in the northern province of South Africa. This district is situated 15 km northeast of the university of the north and 40 km from the province’s capital, Polokwane (Pietersberg).

An additional nine studies relied on referral patterns to the medical centre where the study was performed. (“Patients over 16 years old with tuberculosis were identified at the clinic in the southwestern region of Croatia.”30)

Although inclusion was often initially reported in a straightforward fashion, details were often missing—for example, one report noted that “Only indigenous Zulu-speaking Black African women were included in the study and those with known nonindigenous relatives were excluded.”31 The method of determination that a relative was “non-indigenous” was, however, never reported.

Methods differed among studies of defined populations. One study on Native Americans, for example, took care to sample “from different areas of the reservation, to obtain a representative and random sample of the Navajo Tribe”,32 whereas another used self-report of tribal membership.33 In two studies on Ashkenazim in Israel, one collected data on European origin,34 whereas the other35 reported no further demographic query.

Only five (6.5%) studies stated that their demographic determination was based on more than one generation in the participant’s family; of these, four relied on racial or ethnic background of three generations in the participant’s family, as recommended previously.36

The nine authors who took part in our unstructured interviews were asked how demographic data on research participants were obtained and recorded. We found that investigators intended to use self-identification, and this was usually based on pre-established categories. Many participants were, however, neither asked directly nor given the opportunity to state their race or ethnicity. In those cases, participants were classified on the basis of clinical chart review, second-hand information from clinicians asked by study recruiters, or the assumption of the study recruiter without asking the participant. Some study cohorts combined participants from different studies that used different methods of categorisation. In these instances, some investigators “inherited” data or a cohort from the second study or from coinvestigators for analysis. Despite publishing an article, some investigators simply did not know how classifications were done.


Genomic studies illustrate the complexities and ambiguities of group labels in biomedical research. As most (72%) of the articles in our sample included no descriptions of how race or ethnicity was determined, we cannot draw conclusions on the consistency or applicability of definitions in published investigations. Also, we cannot build empirically based categories to describe how race and ethnicity are “measured”. We were unsure, for example, as to whether to classify Chinese as an ethnic group,37 a geographical population38 or a nationality.39 In these three examples, the respective article offered no guidance. The assessment of race and ethnicity in epidemiological research is fundamental to any effort to reduce excesses in poor health among racial or ethnic groups.40 The ambiguous use of terms, however, poses a challenge to this assessment.

The National Institutes of Health or other US federal agencies funded 44% of the studies. These studies would therefore fall under Office of Management and Budget (OMB) Directive 15, which states that investigators should generally rely on “self-reporting or self-identification” to establish the race or ethnicity of a research participant. Although US-funded studies in our sample were more likely (33%) to report methods, this difference was not statistically significant. This is consistent with findings in other, non-genetic studies41 in which race and ethnicity were not reported at all in clinical studies of diseases with known health disparities.

About two thirds of the articles reported a three-way association among the genetic phenomenon, health outcome, and racial or ethnic cohort studied, and more than three quarters of these did so in the title or abstract of their article, suggesting that it was a central feature of their report. We are concerned that uncritical acceptance of such claims, especially with ambiguous definitions of the cohort studied, may lead to unnecessary and potentially harmful social or biological exaggerations of population differences. This concern has been expressed previously.17 The opposite is likewise true: lack of methodological consistency can also lead to understating of genetic or social risks, and missed opportunities in genetics research of complex diseases.42

Conflating race or ethnicity as a social concept and as a label for regional populations sows seeds of confusion and breeds controversy. As terms of race and ethnicity are used inconsistently to define social cohorts and distinct populations exhibiting possible genetic differences (whether differences in alleles or differences in allele frequency), questions of the relationship among race, genes and population health become harder to characterise, more complex to study and increasingly controversial. An important first step is to define terms and populations of study as precisely as possible. This should occur during study design and recruitment, as suggested by leading investigators.

Tolerance of ambiguous definitions may also lead to more ominous threats to health. It has been argued that “conservative foundations are seeking to frame debates over determinants of racial/ethnic health disparities as a matter of ‘politically correct’ unscientific ideology … vs scientific yet ‘politically incorrect’ expertise rooted in biological facts”.43 The argument over the existence of biological categories threatens to distract societies from well-described social causes of health disparities.44–47 We and others48 believe that a common ground exists between geneticists and social scientists. We argue that achieving this common ground will depend on methodological consistency between the two research communities, which starts with the consistent measuring of the populations and studying of ecological settings.

Growing sophistication in descriptive epidemiology has allowed for more precise modelling of risk factor exposures, race and ethnicity, and disease rates. Genetics investigators have an important assignment: recognising the degree of genetic variation within and between racial and ethnic groups, linking it to variations in health outcome, and tying differences to social risk factors, biological risk factors, environmental risk factors and other nodes in a complex causal network. Studies that do not clearly specify the independent variable of race or ethnicity make it difficult, if not impossible, to know exactly what factors are being studied and how to interpret results.


We thank G Corbie-Smith, J Reardon, E Hauser and W Kraus for their contribution to the important background work for this project and A Powers and C Whitener for assistance with the Medline search strategy.



  • Competing interests: None.

  • Contributors: HS generated the idea for this project and wrote the protocol that was further elaborated by the other two authors. He also obtained the data from the articles. All authors interpreted the data and analyses. HS wrote the final draft. LD, DAS and RCD commented on it critically.