Interpretable detection of novel human viruses from genome sequencing data
pmid: 33554119
pmc: PMC7849996
ABSTRACT: Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing (NGS) reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example the SARS-CoV-2 coronavirus, which was unknown before it caused the COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages that enable analysis of NGS datasets without requiring any deep learning skills, while also allowing advanced users to easily train and explain new models for genomics.
- Freie Universität Berlin Germany
- Robert Koch Institute Germany
- University of Potsdam Germany
- Hasso Plattner Institute Germany
- Hasso Plattner Institute Germany
viral host prediction, interpretability tools, deep neural architectures, Standard Article, Hasso-Plattner-Institut für Digital Engineering GmbH, ddc:570, ddc:610, 500 Natural sciences and mathematics::570 Life sciences; Biology, 610 Medicine and health
citations: 49
popularity: Top 1%
influence: Top 10%
impulse: Top 10%
