This AI Tool Could Predict the Next Coronavirus Variant

The model, which uses machine learning to track the fitness of different viral strains, accurately predicted the rise of Omicron’s BA.2 subvariant and the Alpha variant

By Sara Reardon

A model that uses artificial intelligence could help predict new variants of SARS-CoV-2, the virus that causes COVID (shown in gold).

Image Point FR/NIH/NIAID/BSIP/Universal Images Group via Getty Images

Despite having been around for less than three years, the COVID-causing virus SARS-CoV-2 is perhaps the most studied and genetically sequenced pathogen in history. Disease surveillance teams around the world have uploaded millions of viral sequences to public databases that allow researchers to track how the virus spreads.

A new computational model mined this unprecedented amount of data—more than 6.4 million SARS-CoV-2 sequences—to find patterns among the mutations that help a new viral strain spread throughout the world. The model, called PyR₀, analyzed how different viral lineages arose and spread between December 2019 and January 2022. From these data, it learned how to identify the combinations of mutations and amount of time required for variants such as Delta or Omicron to become predominant. The model, which a team of researchers described in Science in May, could give public health programs advance notice about which lineages are potentially dangerous and allow officials to plan ahead.

PyR₀ used data leading up to mid-December 2021 to correctly predict that Omicron’s BA.2 subvariant, which was rare in much of the world at the time, would soon spread rapidly. By March 2022, BA.2 had become the dominant strain globally. If the model had been run in November 2020, it also would have correctly predicted that the Alpha variant would soon become dominant: the World Health Organization did not identify Alpha as a variant of concern until December of that year.

Most COVID vaccines target the virus’s spike protein, which it uses to enter cells. Mutations in this protein appear to allow certain variants to escape the body’s immune response to the virus from vaccination or prior infection. The PyR₀ model found that simply having numerous spike protein mutations didn’t necessarily make a strain more evolutionarily fit. But a few specific spike mutations in late 2021 helped the Omicron subvariants BA.1 and BA.2 evade the immune system.

PyR₀ also found that a set of nonspike mutations in BA.2’s genome that affect how the virus replicates might contribute to its rapid spread. The model’s ability to quickly analyze entire genomes, the researchers say, might help scientists know which areas of the virus’s genome to study in order to develop future therapeutics.

Scientific American spoke with study co-author Jacob Lemieux, an infectious disease researcher at the Broad Institute of the Massachusetts Institute of Technology and Harvard University and a physician at Massachusetts General Hospital in Boston, about how algorithms that “learn” from large data sets can predict the pandemic’s future.

[An edited transcript of the interview follows.]

What can PyR₀ tell us about the next predominant variants?

We can’t necessarily say what’s going to happen next in terms of mutations. We can say what’s going to happen next in terms of which lineages are most likely to increase in frequency.

In other words, if one car is traveling at 70 miles an hour, and another car’s traveling at 35 miles an hour, we can make a prediction that in a certain amount of time, the 70-mile-an-hour car is going to catch up and overtake the other car. But those predictions are only good in the near future because the way the pandemic works is that, all of a sudden, there’s a 210-mile-an-hour car that comes out of nowhere and completely changes the dynamics.
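[Editor’s note: The car analogy can be made concrete with a small Python sketch. It assumes two lineages that each grow exponentially, so the faster lineage's share of all cases follows a logistic curve. The function names and the example numbers are hypothetical illustrations, not values or code from the study.]

```python
import math


def share_at(initial_share: float, growth_advantage: float, days: float) -> float:
    """Share of the faster lineage after `days`.

    Assumes two exponentially growing lineages, so the odds
    share / (1 - share) are multiplied by exp(growth_advantage * days).
    `growth_advantage` is the difference in per-day growth rates.
    """
    if not 0.0 < initial_share < 1.0:
        raise ValueError("initial_share must be strictly between 0 and 1")
    odds = initial_share / (1.0 - initial_share) * math.exp(growth_advantage * days)
    return odds / (1.0 + odds)


def days_to_overtake(initial_share: float, growth_advantage: float) -> float:
    """Days until the faster lineage reaches a 50 percent share."""
    if not 0.0 < initial_share < 1.0:
        raise ValueError("initial_share must be strictly between 0 and 1")
    if growth_advantage <= 0.0:
        raise ValueError("growth_advantage must be positive")
    initial_odds = initial_share / (1.0 - initial_share)
    # Odds of 1 correspond to a 50 percent share; already-dominant lineages need 0 days.
    return max(0.0, -math.log(initial_odds) / growth_advantage)


if __name__ == "__main__":
    # Hypothetical example: a lineage at 1 percent with a 0.1-per-day advantage.
    t = days_to_overtake(0.01, 0.1)
    print(f"overtakes after about {t:.1f} days; share then = {share_at(0.01, 0.1, t):.2f}")
```

As in the analogy, such a projection holds only while no faster lineage appears: a newcomer with a larger growth advantage restarts the calculation.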

The amazing thing is that it’s happened over and over again. First, it was the D614G variant, then it was Alpha, then it was Delta, then it was Omicron; now it’s Omicron BA.2 and its close cousins BA.4 and BA.5. So this kind of dynamic seems to be a general feature of the pandemic.

But the things that allow the cars to go fast—the properties that confer this fitness advantage—seem to have changed over time. Omicron in particular seems to be very immune-evasive, particularly by escaping the human antibody response. That property has been increasingly important for the virus, and that makes sense because so many people have either had COVID or been vaccinated, or both.

It seems like this increasing immune evasion has been brewing continuously throughout the pandemic, and now it has really reached its full expression. This isn’t the first study to show that, but it demonstrates it systematically. And it seems likely that such immune escape is going to continue to be a part of what makes a lineage grow. We can’t predict, within the context of this study, what mutations are going to arise in the future and confer additional immune escape.

How does your model help predict and track new variants?

What we’re modeling is how different combinations of mutations in different lineages affect the growth rate of individual viral variants in the population. [Editor’s note: A lineage is a group of variants with a common ancestor.] Because each new lineage has a constellation of mutations—some of which we’ve seen before in other lineages—we can start to ask the question “Which mutations are driving this?”
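[Editor’s note: One simple way to picture this is to assume that each mutation adds a fixed amount to a lineage's growth rate, so lineages that share a mutation also share its effect. The Python sketch below illustrates that assumption only. The mutation names, lineage names and effect sizes are hypothetical placeholders, and the published model estimates such effects statistically rather than taking them as given.]

```python
def lineage_growth_rate(mutations, effects):
    """Sum the per-mutation effects for one lineage.

    Mutations with no estimated effect are treated as neutral (0.0).
    """
    return sum(effects.get(mutation, 0.0) for mutation in mutations)


def rank_lineages(lineages, effects):
    """Return lineage names ordered from fastest to slowest growing."""
    return sorted(
        lineages,
        key=lambda name: lineage_growth_rate(lineages[name], effects),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical effect sizes: positive values speed growth, negative values slow it.
    effects = {"mut_A": 0.3, "mut_B": 0.1, "mut_C": -0.05}
    # Hypothetical lineages, each described by the set of mutations it carries.
    lineages = {
        "lineage_1": {"mut_A", "mut_C"},
        "lineage_2": {"mut_B"},
        "lineage_3": {"mut_A", "mut_B"},
    }
    for name in rank_lineages(lineages, effects):
        print(name, round(lineage_growth_rate(lineages[name], effects), 3))
```

Because effects are attached to mutations rather than to whole lineages, a newly observed lineage made up of already-seen mutations can be scored before it has spread widely.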

We’re modeling this question in lots of different regions of the world and then essentially aggregating the information into a single model. The reason we’re able to do this is because people from all around the world are sequencing the virus, and they’re labeling the sequences with the date and region of the collection. So we know, in different regions, which lineages are increasing in frequency relative to the others. This information is incredibly valuable—we wouldn’t have been able to create our model without this kind of information.

It’s a real computational challenge to actually implement that model and fit it to the data. Lead study author Fritz Obermeyer had come to the Broad Institute from Uber AI, where researchers had developed a programming language and a software framework that uses machine learning to model probabilities and apply them to large datasets. It was really amazing to be able to apply these methods to a scale of data we’ve never had before.

We’re trying to improve the model, and we have a new version of it. We actually think successful lineages are driven by a small number of mutations, and the others are just sort of along for the ride. A related challenge is trying to study the genetic or statistical interaction among mutations. Maybe Mutation 1 makes the virus more fit; maybe Mutation 2 makes it more fit. But maybe the combination of 1 and 2 together actually makes it less fit. Those kinds of interactions are really hard to handle because the number of them grows so quickly.
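[Editor’s note: The growth in the number of interactions is a matter of counting. Among n mutations there are n(n-1)/2 possible pairs, and far more triples. The short Python sketch below does the arithmetic. The mutation counts are arbitrary examples, not figures from the study.]

```python
import math


def interaction_count(num_mutations: int, order: int) -> int:
    """Number of distinct groups of `order` mutations among `num_mutations`.

    order=2 counts pairwise interactions, order=3 counts three-way ones.
    """
    return math.comb(num_mutations, order)


if __name__ == "__main__":
    # Arbitrary example sizes chosen only to show how fast the counts climb.
    for n in (10, 100, 1000):
        pairs = interaction_count(n, 2)
        triples = interaction_count(n, 3)
        print(f"{n} mutations: {pairs} pairs, {triples} triples")
```

With 1,000 mutations there are already 499,500 possible pairs, which is why a model cannot simply add one parameter for every combination.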

How can this model help us plan our response to the pandemic?

One of the things we’re learning is that genome sequencing of emerging viruses is part of the outbreak response. We’re seeing a lot of genome sequencing, for example, with the monkeypox outbreak that’s going on right now.

There’s so much data that we can’t have a human just sifting through all of it. We need systematic, statistical machine learning programs that aid in the detection of new variants by humans. As a disease surveillance support tool, this kind of approach can be really useful. We’re trying to automate this model so we can run it on a regular basis and see if we can flag things that we should be worried about.

We found that by modeling mutations instead of just lineages, the model was smarter, and it learns faster. And the faster you learn about a lineage’s properties, the more you know how concerned you should be.

I don’t think this model is a replacement for well-structured programs—such as those run by governments and international organizations—for conducting disease surveillance. It’s a support tool for such programs to allow them to systematically screen and rank lineages that are rising. I would think this kind of approach will be doable in the future as data accumulates for influenza and other viruses.