Annotation of Essential Viral Genes and Identification of Conserved Gene Sets in OrthoDB

Toward a scalable framework for identifying viral essential genes through orthology annotation, evolutionary conservation, and predictive modeling.

Viruses rely on highly compact genomes, yet the genes indispensable for replication, information flow, and host adaptation remain poorly characterized at scale. This project addresses that gap by combining large-scale orthology-based annotation, cross-domain conservation analysis, and predictive modeling to identify high-confidence candidates for viral essential genes across diverse viral genomes.

Affiliation: School of Pharmaceutical Sciences, Wuhan University

Supervisor: Prof. Fengbiao Guo

Project Lead: Linyi Jiang

Viral Genomics, Bioinformatics, Essential Gene Prediction, OrthoDB, Evolutionary Conservation, EggNOG-mapper, Machine Learning, Protein Language Models

7,962 viral genomes

129,962 representative genes annotated

0.8434 best nucleotide AUROC

Background

Why Viral Essential Genes Remain Difficult to Predict

Viral essential genes are the core functional units required to sustain replication, assembly, and successful infection. Identifying them is important not only for understanding viral life cycles, but also for discovering potential broad-spectrum antiviral targets. However, this task remains difficult at scale because experimentally validated viral essentiality data are scarce, public repositories are heavily centered on cellular organisms, and most existing predictors were originally built for bacterial or eukaryotic systems.

As a result, viral genomes sit in a methodological blind spot: they are small, diverse, rapidly evolving, and poorly covered by gold-standard labels. Direct transfer from cellular essentiality models is therefore unreliable, which motivates a virus-oriented framework grounded in orthology, conservation, and adaptable predictive modeling.

Viral essentiality labels are limited and taxonomically uneven.
Existing frameworks mainly reflect bacterial or eukaryotic assumptions.
Viral sequence diversity creates strong domain mismatch.
A virus-specific prediction strategy is therefore necessary.

A compact overview of the current data gap and the limitations of existing prediction frameworks.

Few validated viral gold-standard datasets; strong methodological mismatch with cellular predictors.
