Current Microbiome Analyses Turning Up High Number Of False-Positives

By Deborah Borfitz

March 16, 2023 | Insufficient taxonomic information in public sequence records is a longstanding problem that was again brought to light in a recent study of simulated microbial communities. Researchers in Spain, using standard techniques to explore virtual microbiome communities, found that results obtained from DNA analyses bear little resemblance to the real-world bacterial populations being imitated and many of the detected species are not actually present.

The study, published in PLOS ONE (DOI: 10.1371/journal.pone.0280391), demonstrates for the first time significant flaws in the way microbial communities are currently identified, says Clemente Fernández Arias, at the Biological Research Center of Madrid. Results of microbiome studies should therefore be interpreted with caution.

Although the research team was examining the limitations of genomic analyses using the larvae of the greater wax moth, the study’s conclusions apply to any microbial community, says Arias. “One of the variables considered in our models is the amount of information available in the databases used in the analyses, regardless of whether this information is from humans or insects.”

Confusion over how to name an individual bacterium and how that name pairs with its genetic makeup arose in 2021 when the National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine, announced changes to much of the bacterial nomenclature, says Andrea Merchak, a doctoral candidate in neuroscience at the University of Virginia. “The further problem is that bacteria evolve very quickly, so we have new ones popping up all the time.”

The significant limitations in characterizing microbes described in the PLOS ONE article are daily challenges in her research, Merchak says. They are also central to how researchers go about answering questions.

During one recent study, where she was looking at the gut microbiome of a mouse model of multiple sclerosis, she felt compelled to supplement information from 16S sequencing with a lot of metabolomics data before drawing conclusions. Merchak says she’s certain that everyone working in the microbiome field feels likewise. It’s “disheartening” that fewer species are annotated than not on any given list.

But it is the nature of science to work with the available tools “without overstepping the bounds,” even if those tools aren’t yet fully developed, she quickly adds. Database limitations will be overcome with greater consensus on nomenclature and on how to define different bacterial species, as well as continued work by many people to get more species deposited to fill current voids.

Overlooked Limitations

Researchers in Madrid generated virtual bacterial populations to test their hypothesis that the limited information available in genomic databanks is reducing the accuracy of microbiome analyses, although this constraint is “normally ignored” in such studies, according to Arias. Their motivation was a previous project looking at microorganisms in the guts of insects that can degrade plastic, where different analyses turned up almost uniformly different results.

To create their virtual model, the team simulated the previously described ecology of microbial communities in terms of the specific abundances and distributions of different species (Nature Communications, DOI: 10.1038/s41467-020-18529-y). They did this using an algorithm that allowed them to choose random species while still complying with the macroecological laws followed by real-world microbial populations, explains Arias.
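The idea of drawing random species while respecting an abundance pattern can be illustrated with a minimal sketch. It is not the team's algorithm: the lognormal abundance distribution, the `simulate_community` function, the `sigma` parameter, and the placeholder species names are all assumptions made here for illustration.

```python
import random


def simulate_community(species_pool, n_species, sigma=1.5, seed=0):
    """Pick random species and give each a relative abundance.

    Assumption (illustrative only): raw abundances are drawn from a
    lognormal distribution and then normalized to sum to 1.
    """
    rng = random.Random(seed)
    # Choose distinct species at random from the available pool.
    chosen = rng.sample(species_pool, n_species)
    # Draw one positive raw abundance per chosen species.
    raw = [rng.lognormvariate(0.0, sigma) for _ in chosen]
    total = sum(raw)
    # Normalize so the values can be read as relative abundances.
    return {name: value / total for name, value in zip(chosen, raw)}


# Hypothetical pool of 1,000 placeholder species names.
pool = [f"species_{i}" for i in range(1000)]
community = simulate_community(pool, n_species=50)
```

Because the true composition of such a virtual community is known exactly, any species reported by an analysis pipeline but missing from `community` can be counted as a false positive.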

The high number of false-positives is related to the fact that the identification approach involves looking for a genetic match to a small fragment of DNA that could be similar to a sequence from different species, he explains. Yet it’s the method “used every day in many different contexts and it seems we are not really aware of the limitations of these kinds of studies.”
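A toy example shows why a short fragment can be ambiguous. The sequences, species names, and the exact-substring matching rule below are invented for illustration. Real tools use far larger databases and more sophisticated alignment, so this is only a sketch of the underlying problem.

```python
def matching_species(fragment, reference_db):
    """Return every species whose reference sequence contains the fragment."""
    return sorted(name for name, seq in reference_db.items() if fragment in seq)


# Hypothetical reference database: species A and B share an 11-base stretch.
reference_db = {
    "species_A": "ATGCGTACGTTAGCCTAGGA",
    "species_B": "TTGACGTACGTTAGCAAGTC",
    "species_C": "GGCATCCGATTAACGGTCAA",
}

short_read = "CGTACGTTAGC"      # falls entirely inside the shared stretch
long_read = "ATGCGTACGTTAGCC"   # extends beyond it into species_A only

print(matching_species(short_read, reference_db))  # ['species_A', 'species_B']
print(matching_species(long_read, reference_db))   # ['species_A']
```

If the sample contained only species A, reporting every match for the short read would also list species B, which is the kind of false positive described above.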

That only a small part of a microbial community can reasonably be sampled is only part of the problem, says Arias. “Even when you get the right DNA from the sample, genetics tools are not of sufficient grade to ensure the results of the analysis are actually similar to reality.” Interestingly, 16S (amplicon) and whole genome sequencing each proved capable of identifying species and genera that could not be found by the other.

Although the scientific implications here are potentially huge, Arias says he isn’t expecting a groundswell of concern or discouragement from all this. Multiple journals have seen these study results and written them off as immaterial.

“Conclusions of the [virtual microbiome] model are not restricted to insects,” says Arias, noting that previous studies on human-derived samples revealed discrepancies in results based on the analysis technique. Proving that any microbial community has the same limitations as the ones found in the greater wax moth would simply be a matter of re-doing the simulation exercise with a different set of microbial species.

Researchers are aware of the problem, he adds, but not its magnitude. The biggest surprise for many may be that they cannot trust that the species found by sequencing analyses are present in their samples.

The key difficulty is that the databases used by metagenomic tools are much smaller than the volume of genomic information amassed over the last decade (the number of sequence records in the NCBI databank runs into the billions), owing to the limited amount of information that current computational tools can analyze.

NCBI Response

The public DNA databases maintained by NCBI rely on information provided by submitters. But the agency tells Bio-IT World it has taken several steps to improve the veracity of database contents, including a major change in its approach that was published in 2018 (International Journal of Systematic and Evolutionary Microbiology, DOI: 10.1099/ijsem.0.002809). Specifically, records from reference material (type strains) published with new species names are now used to validate genomes and flag misidentified ones through computational comparisons.

The same journal publishes the International Code of Nomenclature of Prokaryotes, which contains the relevant code for most microbes, the NCBI reports. In advance of implementing recent nomenclature changes, several NCBI Insights blogs (one highlighting that taxonomic names will include phylum rank and another that the update would affect 42 taxa) were posted to help prepare internal and external stakeholders for the new processes. “It is also important to note that all the old, informal names will remain searchable and visible in various displays and will not affect species names attached to records,” the NCBI says in a written response.

As the NCBI continues to make improvements to its data curation practices, it remains “quite dependent on the submission of reliable records that we can flag as exemplars [e.g., bacterial genomes from type strains],” the agency continues. Its actions to find and validate genomes obtained from bacterial type strains are described in a recently published paper (International Journal of Systematic and Evolutionary Microbiology, DOI: 10.1099/ijsem.0.005707). “Part of these efforts included communicating bacterial species where we urgently need more submission from data obtained from type strains,” states the NCBI.