Big data has come to epidemiology. The promise is exciting: biobanks full of measurements of toxic chemicals, metabolites, and early disease markers for huge populations, combined with deep medical records databases, permit investigation of thousands of potentially important causal links between exposures and diseases. And, just in time, machine learning algorithms have arrived to accelerate the search through these data for promising new treatment and prevention strategies. These developments would seem to transform the ‘exposome’ from an abstraction into a reality: we can now hunt through the entire set of environmental factors to which we are exposed over our lifetimes to find keys to longevity and good health. And we are already seeing the fruits, if not yet in medical breakthroughs, then at least in an explosion of studies in the medical literature.

Suchak and colleagues1 recently documented this phenomenon, searching the literature for studies based on one of many publicly available databases, the US National Health and Nutrition Examination Survey (NHANES).2 This survey is conducted on a sample of 10,000 US adults every 2 years, and in its most recent iteration gathered about 700 variables for each subject. Their literature search identified 341 peer-reviewed papers published between 2014 and 2024 that used NHANES data and reported at least one association between a predictor and a health condition. What a marvellous resource to improve health!

But. It is easy to predict what happens when thousands of X-Y associations are calculated: quite a few of them will be ‘statistically significant’, even if we use some standard method to compensate for multiple comparisons. Most of these ‘hits’ will be false, and the ones that might truly represent real risks are likely to be discredited for emerging from a methodologically suspect search.3  

A current critique of science holds that its utility for guiding social policies has been damaged by a “reproducibility crisis” – too many studies claim to provide useful knowledge about the world, only to fail to be supported in repeated studies. In other words, there are too many false positive findings.4 In our post-pandemic world, public trust in experts has been seriously eroded and the alleged reproducibility crisis is just one contributing factor.5

What is to be done?

How can epidemiologists restore public trust in our work and reduce the risk of false positive findings, while accelerating the identification of real health risks? Guidance for a path out of this dilemma should start from two fundamental principles of epidemiology, principles that appear in the first chapter of every epidemiology textbook but are then too often forgotten.

First, epidemiology is a branch of biomedical science (and not of data science). It may seem too obvious to state, but a symptom, disease, or death occurs inside a human body. In identifying causes of these outcomes, we must bring to bear all we know about the underlying pathophysiology, which imposes constraints but also provides valuable evidence about which ‘risk factors’ might be real and which might be false positives. This goes beyond the well-known Bradford Hill consideration that a causal relationship must be ‘biologically plausible’.6 The explosion of genomic research over the past 30 years provides a growing library of specific information about metabolic pathways by which exposures may lead to diseases, bringing a molecular understanding of ‘plausibility’ that was previously lacking.

One example of a powerful source of evidence on biologic mechanisms of exposure-disease associations is the ‘meet in the middle’ approach.7 Large-scale metabolomics assays are used to identify metabolic pathways that are active both when a potential toxic exposure is present in an organism and when a particular disease process has been triggered (not necessarily in the same organism, but, for example, in other mammalian species thought to be relevant to human pathology). This kind of evidence can go well beyond the simple descriptive ‘plausibility’ of an exposure-disease association.

Insisting that epidemiology is a biomedical science reminds us that no matter how ‘statistically significant’ an association may be, and no matter how large the sample size or precise the measurements, we cannot provide useful evidence for disease prevention without an understanding of how the risk factor might lead to the disease.

This leads naturally to the second fundamental principle of epidemiology: our science exists solely to provide evidence for improving health. Again, an obvious point, but important because this purpose distinguishes our work from the ‘normal science’ carried out by our colleagues in fields like geology, astronomy, physics, or chemistry.8 For all scientists, causality is elusive and uncertainty inevitable. But if one’s purpose is the pure pursuit of knowledge to better understand the world, then the standard and eminently reasonable response to uncertainty is to do another study… and wait. Waiting for additional evidence in public health may mean delaying preventive action that could save lives. Of course, acting in haste also risks the false positive mistake of a policy that will do no good, but impose costs.

Because of our single focus on improving health, we should not be asking: “Do we have proof of causality?”, but rather: “Do we have enough evidence to act as if this is a causal association?”.9 This reframing makes it clear that the appropriate strength of evidence depends on what will be done with the finding. What are the costs and benefits of acting now or waiting for more evidence? The attacks on science because of the reproducibility crisis focus only on false positive findings. But in public health, we are at least as concerned with false negatives – waiting for ‘enough’ evidence before acting.

Conflicts of interest: none declared.

References

  1. Suchak T, Aliu AE, Harrison C, Zwiggelaar R, Geifman N, Spick M. Explosion of formulaic research articles, including inappropriate study designs and false discoveries, based on the NHANES US national health database. PLoS Biol 2025;23(5):e3003152. doi: 10.1371/journal.pbio.3003152
  2. NHANES: National Health and Nutrition Examination Survey homepage. 24.09.2024. Available from: https://www.cdc.gov/nchs/nhanes/index.htm
  3. Savitz DA, Wellenius GA. Consequential (and inconsequential) environmental epidemiology. Environ Epidemiol 2025;9(6):e433. doi: 10.1097/EE9.0000000000000433
  4. Fanelli D. Opinion: Is science really facing a reproducibility crisis, and do we need it to? Proc Natl Acad Sci U S A 2018;115(11):2628-31. doi: 10.1073/pnas.1708272114
  5. Schulson M. The Cultural and Political Moment for Toxins Research. Undark Magazine. 06.11.2025. Available from: https://undark.org/2025/11/06/toxins-research-maha/
  6. Bradford Hill A. The Environment and Disease: Association or Causation? Proc R Soc Med 1965;58(5):295-300. doi: 10.1177/003591576505800503
  7. Suthar H, Tanghal RB, Chatzi L, et al. Metabolic Perturbations Associated with both PFAS Exposure and Perinatal/Antenatal Depression in Pregnant Individuals: A Meet-in-the-Middle Scoping Review. Curr Environ Health Rep 2024;11(3):404-15. doi: 10.1007/s40572-024-00451-w
  8. Ravetz J. The post-normal science of precaution. Futures 2004;36(3):347-57. doi:10.1016/S0016-3287(03)00160-5
  9. Kriebel D. How much evidence is enough? Conventions of causal inference. Law Contemp Probl 2009;72(1):121-36.
