Mapping the Trillion-Gene Universe: The Startup That Wants to Teach AI the Language of Life

By Allison Proffitt

March 30, 2026 | Basecamp Research, the AI-biology lab, has announced the Trillion Gene Atlas: a plan to collect genomic data from more than 100 million species across 31 countries, expand the known genetic universe by a factor of 100, and compress what it estimates would be 20 years of data processing into under two years. Partners include Anthropic, sequencing hardware makers Ultima Genomics and PacBio, and NVIDIA for compute.

“Trillion” is not a hypothetically large number, said Glen Gowers, co-founder of Basecamp Research. He’s counting. The Trillion Gene Atlas is a scaling up and extension of the company’s existing proprietary database, BaseData, introduced in June 2025, that already holds 10 billion genes drawn from one million newly discovered species — more than ten times the size of all public genomic resources combined. The Trillion Gene Atlas is a plan to scale that by another factor of 100, in two phases of roughly 10x each. Each scaling phase will take about a year, Gowers said, including a stage of data gathering and generation and an assessment stage to incorporate learnings before the next scaling phase.
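The arithmetic behind the name can be checked in a few lines. The sketch below is illustrative only: it takes the figures quoted above (10 billion genes today, two phases of roughly 10x each) at face value, and the variable and function names are made up for this example rather than anything Basecamp has published.

```python
# Figures quoted in the article; all names here are illustrative assumptions.
BASEDATA_GENES = 10_000_000_000  # 10 billion genes reported in BaseData
PHASES = 2                       # two planned scaling phases
GROWTH_PER_PHASE = 10            # "roughly 10x each", treated as exactly 10


def projected_genes(start, growth, phases):
    """Return the gene count at baseline and after each scaling phase."""
    counts = [start]
    for _ in range(phases):
        counts.append(counts[-1] * growth)
    return counts


counts = projected_genes(BASEDATA_GENES, GROWTH_PER_PHASE, PHASES)
for phase, genes in enumerate(counts):
    print(f"after phase {phase}: {genes:,} genes")
```

Under those assumptions, the second phase lands on one trillion genes, a 100-fold increase over the current database.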

The partnerships Basecamp announced with the atlas will be key to the project’s success. “We’re scaling three things at the same time,” Gowers explained: samples, sequencing, and compute.

Sample collection, he said, is one place Basecamp has a competitive advantage. In Iceland, the company built off-grid DNA extraction labs that can run on solar power from a tent. Today, local scientists in 31 partner countries process samples on-site, extracting DNA within an hour of collection to preserve its native state, before shipping the stable DNA itself. Access and Benefit-Sharing agreements with each partner country — aligned with emerging Digital Sequence Information regulations — govern how economic value flows back to source nations.

For sequencing, the partnership with Ultima Genomics and PacBio addresses a specific technical challenge: you need both short-read and long-read sequencing at industrial scale. Short reads deliver throughput; PacBio’s HiFi long reads capture regulatory context, epigenetic information, and subspecies-level resolution that short reads miss entirely.

Historically, “short-read sequencing engaged because of the scale, not necessarily because it’s the most complete answer,” said Christian Henry, PacBio’s CEO. “Long-read sequencing now achieves scale that’s material to the model-building.”

PacBio is doing its sequencing at its facility in Menlo Park, California, and Henry said PacBio has increased capacity to be ready to start. “We have an applications lab that we have scaled, and what’s great about that is we can use our latest technologies even before they hit the market to deliver even more scale and lower costs,” he said.

The compute capability needs dramatic scale as well. Even with samples and sequencing in hand, Basecamp estimated that running its existing bioinformatics pipeline on a trillion genes would take over 20 years. NVIDIA’s accelerated computing infrastructure — including Parabricks for metagenomic assembly — is what brings that timeline below two years.
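Those two figures imply a minimum speedup, which the short sketch below works out. It assumes the stated bounds (more than 20 years before, under two years after) are the only inputs; the names are hypothetical, and the result is a lower bound rather than a benchmark from Basecamp or NVIDIA.

```python
# Bounds quoted in the article; names are illustrative assumptions.
BASELINE_YEARS = 20.0  # existing pipeline: "over 20 years"
TARGET_YEARS = 2.0     # accelerated pipeline: "below two years"


def min_speedup(baseline_years, target_years):
    """Lower bound on the speedup needed to go from baseline to target."""
    return baseline_years / target_years


speedup = min_speedup(BASELINE_YEARS, TARGET_YEARS)
print(f"required speedup: more than {speedup:.0f}x")
```

Because the baseline is a floor and the target is a ceiling, the real factor would have to exceed this value.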

From there, the final partner, Anthropic, comes in. “Without revealing more than we can say right now,” Gowers teased, “there is a very interesting angle where we combine natural language reasoning—which the models like Claude and all the other ones are getting really good at natural language reasoning and human-based languages—and we combine that with bio-native models that can speak a language of biology. No one has put those two things together before, really,” he said. A successful marrying of the two languages will have utility for life sciences, pharma, diagnostics, animal health, agriculture, environmental biology, and more, Gowers predicted.

Developing EDEN

Henry describes Basecamp’s approach to sequencing as “unprecedented.”

“To start off by thinking about the whole notion of looking at the tree of life to develop the context, that’s actually a novel concept in building scaled models right now,” Henry said. Historically, PacBio focused on biodiverse sequencing, not just human, and Henry sees great value in Basecamp’s preference for the broad diversity of genomic information when approaching drug discovery.

“Humans have evolved from the tree of life, but [we’ve been] focusing on the narrow sliver of human data to drive our understanding of the genome and understanding of how to build models,” he observed. The Basecamp approach, he thinks, will be better.

In January, Basecamp published EDEN, a 28-billion-parameter foundation model trained on the 10-billion-gene version of BaseData, and the results of the Trillion Gene Atlas will be incorporated into new versions of EDEN.

EDEN was posted on bioRxiv and has not been peer-reviewed, but according to Basecamp, EDEN was used to design antimicrobial peptides against drug-resistant pathogens simply by typing the pathogen ID into the model. Thirty-two of 33 designed peptides showed functional activity — a 97% hit rate — without any further optimization, the company reports. Those peptides are now being tested in mouse models.

EDEN also demonstrated what Basecamp calls AI-Programmable Gene Insertion (aiPGI): the model identified over 10,000 disease-related genomic locations and designed CAR T-cells that showed over 90% tumor-cell clearance in laboratory assays — without using any human or clinical data in training.

Of course, in vivo mouse results are a long way from Phase I clinical trials, but automated labs are speeding up target validation, Gowers pointed out. Basecamp’s commercialization plan is to bring models to all areas of science that the genome can inform: life sciences, pharma, diagnostics, animal health, agriculture, environmental biology, and more.

Gowers was clear that the company does not intend to run clinical trials itself. “I believe the future of AI drug discovery will be companies that can spin up new drug entities very quickly with a very high amount of predictability over how those drugs will behave and let others who are great at running clinical trials take them on board.”

As such, the datasets are proprietary, developed with close partners. “We want to enable as much research as possible. In fact, we have partnerships with tons of researchers in different institutions … But just releasing data out there into the ether, that doesn’t solve a lot of problems.”

Neither of the two papers Basecamp has published on bioRxiv since June 2025 has yet been peer reviewed. “We keep our North Star on solving human health problems and planetary health problems,” Gowers said. “Our North Star is not publishing.”

The approach, so far, is fine with VCs. Gowers reported that the venture-backed company has $110 million in funding to date. Publicly, it has had a Series A in December 2022 that raised about $21.6 million and a Series B in October 2024 that raised $60 million.

While Gowers compares the Trillion Gene Atlas to the Human Genome Project, an effort that cost $3 billion in the 1990s, the costs today have changed considerably.

“I really think this is the type of project and the scale of project that wouldn’t have been possible even two years ago. It is a function of these scaling curves coming together,” he said. Now sequencing, compute, and sample-gathering are all much, much cheaper. The most expensive component today, Gowers said, is storage—an unusual bottleneck that reflects the sheer scale of the data being generated.

Emergent Properties

Gowers says what the model would be capable of at the trillion-gene scale is “uncharted territory” and that makes it exciting. “In the same way that when GPT models started scaling up, you have these things called emergent properties and capability overhangs.” The trillion-gene equivalent, he suggested, could include multi-drug combinatorial therapy design, diagnostic capabilities layered on top of therapeutic design, or modalities no one has yet thought to test.

Henry, who has spent his career in genomics, agreed. “Most of us got into science not just to build businesses, but to change the world. To be part of something like that is always inspiring.”