Biology’s Role As A Driver of the Future of Computation Development

By Allison Proffitt

June 9, 2021 | At the DECODE: AI for Pharmaceuticals forum yesterday, Puneet Batra, director of machine learning at the Broad Institute, outlined the mission of the new Eric and Wendy Schmidt Center at the Broad: to position biology to drive the next era of computing.

The EWSC was launched at the end of March thanks to a $150 million endowment from Eric and Wendy Schmidt that was matched by The Broad Foundation. “The Eric and Wendy Schmidt Center seeks to understand the programs of life and how they’re organized across three biological scales: cells, tissues, and organisms,” Batra explained. “We are doing this—and promoting this—by convening a community of computational scientists and biologists. The goal is not just to bring the tools of modern machine learning to bear on biological discovery—though that’s a great goal—it’s also to make biology a key driver of advances in computation itself,” he added.

The group is international by design. Twenty-five percent of the combined $300 million endowment is committed to use outside of Boston, and the community of collaborating computational scientists, biologists, and clinicians is already fairly extensive. Beyond the Broad Institute, MIT and Harvard communities, collaborators include Mila (Quebec AI Institute), led by Yoshua Bengio; European Laboratory for Learning and Intelligent Systems, Tuebingen, led by Bernhard Schoelkopf; The Alan Turing Institute, directed by Sir Adrian Smith; Oxford Big Data Institute, directed by Cecilia Lindgren; clinicians and researchers at Mayo Clinic and Geisinger; biopharmaceutical companies, including Genentech (a member of the Roche Group), AstraZeneca, and Novartis; technology and research companies focused on scientific inquiry, including DeepMind, Google Research, and Microsoft; Mikhail Belkin (UC San Diego), David Blei (Columbia University), Marzyeh Ghassemi (University of Toronto), Jennifer Listgarten (UC Berkeley), and Mihaela van der Schaar (Cambridge University).

The Schmidt Center is co-directed by Caroline Uhler, Associate Professor of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT and an associate member of the Broad Institute; and Anthony Philippakis, Broad’s chief data officer.

Biology as Driver

Biology and machine learning can together untangle some of the fundamental questions about the programs of life, Batra said. “We think there are a lot of neat problems in biology that need to be addressed by new developments in machine learning,” he said. For instance: how do genes interact to form cell types? How do cell types give rise to tissues? And, at the organism level, how do genotypes map to phenotypes?

There are two great revolutions of the 21st century: the explosion in data technologies (machine learning, cloud, etc.) and the blossoming of biological technologies (sequencing, single-cell genomics, medical imaging, etc.). These two revolutions are converging, Batra said, and together they will open a new door on biological research.

But Batra insisted that the goal is not simply to apply machine learning to biological questions. Machine learning, thus far, has been driven by image recognition and predictive accuracy. Driverless cars are a prime example. But for biological questions, he argued, the aim is to understand natural laws. Machine learning needs to move from predictive accuracy to causal modeling, from “what?” to “why?”

Biology, and these unique biological questions, can serve as a key driver of advances in computing.

ML4H: Machine Learning for Health

From its beginning, EWSC partnered with the Machine Learning for Health (ML4H) project at Broad. Batra co-leads this 25-person, cross-disciplinary team of computer scientists and clinician researchers. The three pillars of what ML4H does, Batra said, are looking at rich data and outcomes, applying deep learning, and driving clinical questions. “We are not just interested in applying models to data, but we’re really interested in changing how patients are treated,” he said. “Understanding not just the machine learning side, but also [what it means] to improve patient care is a critical driver for the problems we choose to work on.”

ML4H work is driven by the deep phenotype hypothesis: the supposition that there are many more phenotypes with genetic bases waiting to be discovered in rich data sets. “Once we have these phenotypes, it’ll accelerate clinical impact in a variety of ways,” Batra believes. He expects these deep phenotypes to enable biological discovery, identifying precise genetic architecture of disease and its progression; create new biomarkers and improve trial selection; and improve patient screening, predicting who will get sick and to which therapeutics he or she will respond.

To support the search for these deep phenotypes, Batra and his group are using real-world data from a 500,000-patient primary care cohort with an average of seven years of follow-up per patient. “It’s deep-learning scale,” Batra said: the dataset comprises 60 billion word tokens, 7 million ECGs, rich imaging echoes, and tens of thousands of longitudinal outcomes including stroke, heart attack, heart failure, and more.

“These are the kinds of datasets that one needs to be able to build these deep phenotypes and be able to validate their impact on outcomes,” he added.

Big Data, New Strategies

But even this large dataset still requires thoughtful, new machine learning strategies. Batra outlined some of the steps ML4H takes to maximize the machine learning impact.

First, events and data labels are precious, Batra said. “We don’t have them at unlimited scales, so we found that it’s really important to apply pre-trained models from either outside the healthcare industry or inside that we’ve built ourselves,” he said, citing BERT, DenseNet, and PCLR as examples. “When you use [pre-trained models], it really improves data efficiency to make sure your models are learning as fast as they can on the limited data you have.”

Second, choice of data representation is crucial. “We often take these rich datasets and reduce them to a smaller dimension. When that happens, we need to make sure that the representation is faithful: it captures the biology and doesn’t lose it, but also doesn’t enhance bias and protocol differences,” he said.
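One simple way to picture what a "faithful" representation means is to ask how much of an embedding's variation tracks biology versus how much tracks where or how the data were collected. The sketch below is illustrative only and is not ML4H's method: the one-dimensional embedding, the diagnosis labels, and the site labels are all hypothetical, and the statistic used (eta-squared, the between-group share of total variance) is just one convenient choice.

```python
from statistics import mean

def variance_explained(values, groups):
    """Fraction of variance in `values` explained by group membership (eta-squared)."""
    grand = mean(values)
    total = sum((v - grand) ** 2 for v in values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    # Between-group sum of squares: group size times squared distance of group mean from grand mean
    between = sum(len(vs) * (mean(vs) - grand) ** 2 for vs in by_group.values())
    return between / total if total else 0.0

# Hypothetical one-dimensional embedding of eight recordings (made-up numbers)
embedding = [1.0, 1.2, 3.0, 3.2, 1.1, 1.3, 3.1, 3.3]
diagnosis = ["healthy", "healthy", "disease", "disease",
             "healthy", "healthy", "disease", "disease"]
site = ["A", "A", "A", "A", "B", "B", "B", "B"]  # stand-in for protocol differences

biology_share = variance_explained(embedding, diagnosis)
protocol_share = variance_explained(embedding, site)
print(f"explained by diagnosis: {biology_share:.3f}")
print(f"explained by site:      {protocol_share:.3f}")
```

In this toy case almost all of the variation follows diagnosis and almost none follows site, which is the pattern one would hope for. A representation where the site share is large would be a warning sign that it is encoding protocol differences rather than biology.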

Third, Batra has not found clinical relevance to be captured by the area under the curve and believes that carefully considering how the models or tools will be applied in the clinic and in trials is essential. “That gives you a different way of evaluating whether it’s useful or not, which is why the collaboration with clinicians is so important.”
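A small example can show why the area under the curve (AUC) may miss clinical usefulness. The sketch below is an illustration under stated assumptions, not the evaluation ML4H uses: the labels and scores are made up, the 20% treatment threshold is arbitrary, and net benefit (from decision curve analysis) is swapped in as one example of a metric tied to how a model would be acted on.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def net_benefit(scores, labels, threshold):
    """Net benefit of treating every patient whose predicted risk is >= threshold."""
    n = len(labels)
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability
    return tp / n - (fp / n) * (threshold / (1 - threshold))

# Hypothetical cohort of ten patients, two of whom have the outcome
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.3, 0.15, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05]
# Same ranking as model_a, but every predicted risk is ten times smaller
model_b = [s / 10 for s in model_a]

threshold = 0.2  # assumed risk level at which a clinician would intervene
for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    print(name, "AUC:", auc(scores, labels),
          "net benefit:", round(net_benefit(scores, labels, threshold), 3))
```

Both models rank patients identically, so their AUC is the same, yet only the first one ever crosses the intervention threshold. A ranking metric alone cannot distinguish them, while a metric that encodes the clinical decision can.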

It’s these practical observations that will drive EWSC’s vision along with the work ML4H has been doing.

“Our vision is not just to combine these two fields, to bring modern machine learning to bear on biology, which is happening in many places,” Batra said. “It’s also to start to make the central questions biology needs to address, this causal aspect, this mechanistic aspect, to make those key needs drivers of additional advances in computing.”