Researchers Call for an Open AI Virtual Cell, Outline Challenges to Overcome

December 18, 2024

By Bio-IT World Staff 

In a Perspectives paper published last week in Cell, researchers from Stanford University, Genentech, and the Chan Zuckerberg Initiative issued a call to build a virtual cell using AI. The technology is ripe for a biologically valuable virtual cell, they say, outlining their views of the priorities and opportunities.

“Advances in AI and omics offer groundbreaking opportunities to create an AI virtual cell (AIVC), a multi-scale, multi-modal large-neural-network-based model that can represent and simulate the behavior of molecules, cells, and tissues across diverse states,” the authors write in the paper (DOI: 10.1016/j.cell.2024.11.015). They define “virtual cell” as a computational model that simulates the biological functions and interactions of a cell.  

But they are not limiting their vision to a model of a single cell. “The AIVC can capture cell biology at three distinct physical scales by representing (1) molecules and their structures found within individual cells, (2) individual cells, as spatial collections of those interacting molecules and structures, and (3) how individual cells interact with one another and the non-cellular environment in a tissue,” they write.  

Cell models exist but have been rules-based, built on assumptions about underlying biological mechanisms gleaned from observational data. AI, however, learns patterns and processes directly from data without needing explicit rules or human annotation, the authors write. And recent AI advances now “satisfy the trifecta of being predictive, generative, and queryable, which are key utilities for advancing biological research and understanding,” they write.

Based on extensive conversations in the industry, the authors believe that AI is mature enough to build an AIVC that will launch “a new era of high-fidelity simulation in biology.” Corresponding advances in generating omics data mean that there is now sufficient data to train such a model. For instance, NIH’s Sequence Read Archive now contains more than 14 petabytes of data, a thousand times larger than the dataset used to train ChatGPT.

“Modeling human cells can be considered the holy grail of biology,” said Emma Lundberg, associate professor of bioengineering and of pathology in the schools of Engineering and Medicine at Stanford and senior author of the paper, in a press release. “AI offers the ability to learn directly from data and to move beyond assumptions and hunches to discover the emergent properties of complex biological systems.”

Lundberg’s fellow senior authors include two Stanford colleagues, Stephen Quake, a professor of bioengineering and science director at the Chan Zuckerberg Initiative, and Jure Leskovec, a professor of computer science in the School of Engineering, as well as Theofanis Karaletsos, head of artificial intelligence for science at the Chan Zuckerberg Initiative, and Aviv Regev, executive vice president of research at Genentech.

Six Data Challenges  

In the paper, the authors outline six major challenges to be tackled in the creation of an AIVC. First among them: designing evaluation frameworks. Multiple foundation models in biology perform some of the needed capabilities for a virtual cell, the authors say. But it will be “important to define what the core capabilities of AIVCs should be and how those capabilities can be evaluated.”  

Second, the authors highlight the daunting task of building models that are self-consistent across their many modalities. For instance, “interactions between molecules should have consistent effects when measuring binding affinity, gene expression, cell-cell communication, or tissue organization,” they write. Models should be agnostic to both their inputs and outputs.

Third, a useful model must balance interpretability and biological utility. “AIVC models will ultimately be judged on their ability to expand our understanding of biology, either by providing novel insights to biological processes or by accelerating the scientific process,” they write. To that end, models must make highly accurate, well-calibrated predictions that are also explainable.

The authors championed collaboration in their fourth highlighted challenge. “We foresee a future where AIVC platforms function as open, interconnected hubs for collaborative development and broad deployment of cell models to researchers and as education hubs delivering training to researchers, as well as providing engagement activities for educators, patients, and the public,” they write. They challenged the community to invest in infrastructure for collaboration.  

The fifth challenge the authors noted will be ethical representation and responsible use of a model that reflects human diversity. “Developers will have to use the utmost care to ensure these datasets are used ethically and transparently while building AIVCs and develop strategies to mitigate risks of model contamination with falsified data,” they say. New regulatory requirements could help with the responsible use of AIVCs.  

Finally, the authors envision a universal cell model and highlight challenges in knowing which data to include. “These data will need to encompass the breadth of biology,” they write, “in different species, domains, and modalities, representing the heterogeneity of life, while maintaining depth sufficient to distinguish true signals from noise.”  

Opportunities Galore 

But if these challenges can be overcome, the authors predict astonishing opportunities.  

With an AIVC, therapeutics could be tested in silico, enabling virtual phenotypic screening. An AIVC could improve the precision of cell engineering. An AIVC could identify tumor microenvironment (TME) interactions and match treatments to TME cell states. An AIVC could extend spatial profiling to a universal, pan-cancer framework, and then personalize findings to individual patients, creating a digital twin. A virtual cell could take on hypothesis generation and facilitate active learning.

“This is a mammoth project, comparable to the genome project, requiring collaboration across disciplines, industries, and nations, and we understand that fully functional models might not be available for a decade or more,” Lundberg asserted. “But, with today’s rapidly expanding AI capabilities and our massive and growing datasets, the time is ripe for science to unite and begin the work of revolutionizing the way we understand and model biology.”