Making the most out of genomic epidemiology

Marco Galardini
23 August 2023

Even though the lab is focused on bacterial pathogens, we (Marco and Gabriel) took a look at the “2020 main character” pathogen, SARS-CoV-2, and just put out a preprint on the topic of real-time identification of epistatic interactions.

A particularly interesting development during the pandemic has been the unprecedented amount of genome sequencing, and its use in very quickly developing diagnostics (e.g. PCR primers) and treatments (i.e. sequence-based vaccines and their updates).

Other applications closer to our field of study are the identification of virus variants with a fitness advantage and predicting the impact of genetic variants on genes and viral fitness. We asked ourselves: can we find another use for this genomic data?

Plot with the number of SARS-CoV-2 genomes in GISAID as a function of time

The application we were looking for would ideally be computed quickly, so that it could be part of the public health decision process, and should leverage the metadata associated with each sequence so that context would be taken into account.

We decided to look at a generally overlooked aspect in the evolution of this virus: epistatic interactions. Generally speaking, two positions in the genome can be said to interact epistatically if the impact of a double mutation deviates from the “sum” of the effect of the single ones.

A particularly interesting example of macroscopic epistatic interaction can be found in the history of scurvy.

We already know that these interactions are important for the evolution of the Omicron variant (BA.1), has it has been elegantly showed through experimental work.

These experimental studies are however relatively slow to perform and restricted to certain regions of the genome (e.g. the Spike RBD). Could we come up with a fast computational method that scales to millions of sequences?

We reworked a method based on mutual information (MI) and applied to a large set of SARS-CoV-2 sequences (~4M), finding 474 interactions across the genome, the majority in the Spike gene (185).

Predicted epistatic interactions across SARS-CoV-2 genome

Even though we made the method somewhat scalable, it took ~36 hours to complete using several nodes on HZI’s cluster. But luckily we can obtain similar results with as few as 10k sequences (a bit better if closer to 100k), which only takes 2 hours to complete.

Prediction quality when subsampling

How do these interactions change over time, and how quickly can we spot known ones, such as those found in Omicron BA.1? We further modified the method to reduce the influence of older sequences and thus highlight emerging interactions.

And indeed we were able to identify a cornerstone epistatic interaction in the Spike protein between residue 501 and 498 as early as 6 (!) Omicron sequences were present in the dataset, which speaks to the sensitivity of the method.

Predicted Spike epistatic interactions across SARS-CoV-2 genome

The flip side of this sensitivity is that genomes with incorrect metadata (i.e. date) will make epistatic interactions appear at odd times, which was the case before we used the excellent community resources put together by Emma Hodcroft.

Predicted Spike epistatic interactions including artifacts

We hope this work demonstrates that pathogen genomic sequencing is here to stay, as the data can be used to generate many useful predictions that can in turn guide public health decisions.

This work was started during the first lockdown, but it was significantly advanced by Gabriel Innocenti, who was supported first by a FEMS research and training grant and then by RESIST. A real pleasure to work with him!

P.S. as with our other papers, we have also shared the analysis code so that the study can be reproduced.