Data Science & AI, NLP

Viral Escape

Written by DSL

Published on August 17, 2021

It’s almost here.
Just about everyone in the Netherlands has now been given the chance to get vaccinated, the 1.5-meter rule is disappearing and the cabinet is striving to loosen the other measures as well.
That sounds like the end of corona, doesn’t it?
Unfortunately, like the flu virus, SARS-CoV-2 (the coronavirus that causes COVID-19) mutates rapidly.
In addition to thousands of individual mutations, we now know of several variants of SARS-CoV-2, including the alpha, beta, gamma and (currently the most common) delta variants.
Fortunately, most vaccines still protect well against the new variants of the virus, but what if that is no longer the case?
A “viral escape” is exactly that doomsday scenario: the virus mutates just enough that existing antibodies no longer recognize it.
The consequences are serious: such a dangerous mutation will bypass the immune systems of people who have been vaccinated (or were previously infected).
In short, we will be back to square one.

How Artificial Intelligence can help

To avoid this scenario, it is important to discover which of the thousands of mutations could actually pose a major hazard, so that vaccine development can anticipate them as early as possible.
Artificial Intelligence (AI) can help with this.
In fact, MIT researchers have devised a new way to model “viral escapes”, based on models originally developed to analyze language: the well-known Natural Language Processing (NLP) models.
The idea behind using an NLP model for this case is as follows: viruses mutate in a way that follows the biological rules of protein structure, but that is also as favorable to the virus as possible.
For example, SARS-CoV-2 mutates in its spike protein, which can be seen in the image below.
Currently, people who have recently had corona or have been vaccinated have antibodies that fit the spike protein.
This prevents the virus particles from making contact with human cells and infecting the person.
The goal of the virus is therefore to mutate the spike protein so that the antibodies no longer fit, but the receptors of the human cell still do.
In other words, the virus wants to mutate in such a way that it escapes the human immune system (the antibodies) without dying or losing the ability to multiply.
An NLP model looks at language in a similar way: a sentence has a meaning (semantics), and it must also obey the rules of grammar (syntax).
In the analogy, grammar stands for the biological rules a virus must obey to stay viable, and meaning stands for how the virus appears to the immune system.
Using these same two principles, researchers creatively adapted NLP models to observe changes in the genetic code of viruses.

An example

How the NLP model estimates which mutations of the virus can cause a viral escape is illustrated by the example below.
The first sentence represents the virus before it undergoes a mutation.
The second sentence (from the left) shows a small mutation.
The meaning of the sentence has hardly changed and the sentence is grammatically correct.
For this mutation, the new virus is still similar enough to the original that the immune system would recognize and attack it.
Thus, no new antibodies are needed in this process.
The third sentence is one that is not grammatically correct.
Therefore, in the language of the virus, such a mutation will be seen as a failed mutation.
The last sentence is where the danger lies.
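This sentence is grammatically correct, but its meaning has changed substantially.
In the language of the virus, this is a mutation that stays viable but is no longer recognized by existing antibodies, and these are the exceptions that the NLP model estimates could cause a viral escape.
The researchers call the search for these exceptions “constrained semantic change search” (CSCS).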
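The search described above can be sketched in a few lines of Python. The sketch assumes every candidate mutation already has two numbers, a semantic change and a grammaticality, and orders the candidates by the sum of their ranks on both, weighted by a factor `beta`. The candidate names and values are invented for illustration and mirror the three mutated sentences of the example. The rank-sum combination follows the idea described by Hie et al., but this is a simplified sketch and not the researchers' code.

```python
def rank_ascending(values):
    """Return the rank of each value (1 = smallest value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def cscs_scores(semantic_change, grammaticality, beta=1.0):
    """Combine both criteria: rank(semantic change) + beta * rank(grammaticality)."""
    sem_ranks = rank_ascending(semantic_change)
    gram_ranks = rank_ascending(grammaticality)
    return [s + beta * g for s, g in zip(sem_ranks, gram_ranks)]


# Hypothetical toy values, one entry per mutated "sentence" from the example.
candidates = ["small change", "ungrammatical", "escape candidate"]
semantic_change = [0.1, 0.5, 0.9]    # how much the "meaning" shifts
grammaticality = [0.7, 0.01, 0.6]    # how well the mutation fits the "grammar"

scores = cscs_scores(semantic_change, grammaticality)

# Highest combined score first: these are the mutations to inspect first.
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(name, score)
```

Using ranks instead of the raw values keeps the two criteria comparable even though they live on different scales. A candidate only ends up at the top if it does well on both.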

NLP model and training

Of course, the real NLP model was not trained on sentences, but on amino acid sequences: the chains of building blocks that make up the spike proteins of various coronaviruses.
A total of just under 1000 sequences of the SARS-CoV-2 spike protein and another 3000 spike amino acid sequences from other types of coronaviruses were present in the training set.
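To make the idea of treating amino acid sequences as text more concrete, the sketch below encodes a short sequence as integer tokens, much like words are turned into tokens for a language model, and then lists every variant that differs from it by a single amino acid. The 20-letter alphabet is the standard set of amino acids. The fragment `ACDKLM` is an arbitrary example rather than a real spike sequence, and the helper names are chosen for illustration only.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_OF = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Turn an amino acid string into a list of integer tokens."""
    return [TOKEN_OF[aa] for aa in sequence]


def single_substitutions(sequence):
    """List every sequence that differs from the input at exactly one position."""
    mutants = []
    for pos, original in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa != original:
                mutants.append(sequence[:pos] + aa + sequence[pos + 1:])
    return mutants


fragment = "ACDKLM"  # arbitrary illustrative fragment, not real spike data
tokens = encode(fragment)
mutants = single_substitutions(fragment)

print(tokens)
print(len(mutants))  # 6 positions x 19 alternative amino acids
```

Each position can be replaced by 19 other amino acids, so the number of candidates grows quickly with sequence length. For a protein of roughly 1,270 amino acids, such as the SARS-CoV-2 spike protein, that is already about 24,000 single substitutions to assess.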
The figure below shows the NLP model.
Internally, the model constructs a semantic representation, also called an “embedding,” for a given amino acid sequence.
The output of the model shows how well an amino acid fits into the “grammar” of the sequence.
In the case of the image, the amino acid that fits grammatically best into the sequence is marked by the capital letter A.
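The two quantities the model provides can be illustrated with plain Python. In the sketch below, grammaticality is the probability that a softmax assigns to an amino acid at one position, and semantic change is taken to be the L1 distance between the embedding of the original sequence and that of the mutated sequence. The raw scores and the three-dimensional embeddings are invented numbers that stand in for the output of a trained model; a real model scores all 20 amino acids and uses much larger embeddings.

```python
import math


def softmax(scores):
    """Convert raw scores per amino acid into probabilities that sum to 1."""
    highest = max(scores.values())  # subtract the maximum for numerical stability
    exps = {aa: math.exp(s - highest) for aa, s in scores.items()}
    total = sum(exps.values())
    return {aa: e / total for aa, e in exps.items()}


def l1_distance(a, b):
    """Sum of absolute differences between two embeddings."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical raw model scores for four amino acids at one position.
raw_scores = {"A": 2.0, "G": 0.5, "K": -1.0, "W": -2.0}
grammaticality = softmax(raw_scores)
best = max(grammaticality, key=grammaticality.get)

# Hypothetical embeddings of the original and the mutated sequence.
original_embedding = [0.2, -0.1, 0.7]
mutant_embedding = [0.1, 0.4, 0.2]
semantic_change = l1_distance(original_embedding, mutant_embedding)

print(best)
print(semantic_change)
```

With these toy numbers, `A` receives the highest probability, which corresponds to the amino acid marked in the image. A mutation to `W` would count as far less grammatical.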

Testing

Of the 891 different coronavirus spike amino acid sequences the researchers examined with the model, one came from a strain that re-infected someone who had recovered from COVID-19 last year.
The CSCS gave this sequence a high score.
Furthermore, only three other sequences in the set were found to show both higher semantic change and higher so-called grammaticality.
The researchers also fed some of the new variants into their algorithm and found that both the South African and British strains scored “quite high” in terms of their escape probability.
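The observation that only a few sequences show both higher semantic change and higher grammaticality than a given strain comes down to a simple count. The sketch below assumes every sequence has already been scored on both quantities. The numbers are placeholders chosen for illustration, not the researchers' data.

```python
def count_stronger(target, others):
    """Count sequences that exceed the target on both semantic change and grammaticality."""
    target_sem, target_gram = target
    return sum(
        1 for sem, gram in others
        if sem > target_sem and gram > target_gram
    )


# Hypothetical (semantic change, grammaticality) pairs.
target = (0.80, 0.60)  # the sequence of interest
others = [
    (0.90, 0.70),  # higher on both
    (0.85, 0.40),  # more semantic change, but less grammatical
    (0.30, 0.90),  # more grammatical, but little semantic change
    (0.95, 0.65),  # higher on both
    (0.10, 0.10),  # lower on both
]

stronger = count_stronger(target, others)
print(stronger)
```

The fewer sequences that beat the target on both criteria at once, the more the target stands out as an escape candidate.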

How to prevent the viral escape

Now what?
If the new mutations are properly tracked and the algorithm is used to estimate the hazard of the new mutations, researchers can test suspicious strains in the laboratory as soon as possible and adjust vaccines accordingly.
Testing then proceeds as follows.
Suspect strains and antibodies are put together.
If the antibodies are found not to adhere to the spike proteins, the current antibodies no longer provide protection.
How much time vaccine developers actually save with an AI-based approach such as this is still unclear.
What we do know is that in a pandemic as big as this one, every second counts.

References

Hie, Brian, et al. “Learning the language of viral evolution and escape.” Science 371.6526 (2021): 284-288. DOI: 10.1126/science.abd7331
https://singularityhub.com/2021/01/19/a-language-ai-is-accurately-predicting-covid-19-escape-mutations/
https://spectrum.ieee.org/ai-predicts-most-potent-covid-19-mutations
https://news.mit.edu/2021/model-viruses-escape-immune-0114
https://qz.com/africa/1995639/scientists-use-algorithms-to-predict-new-covid-19-variants/