Functional Databases

Introduction to Functional Databases

Functional databases are essential tools in bioinformatics, providing comprehensive resources for the annotation, characterization, and analysis of genes, proteins, and other biomolecules. These databases offer curated datasets that facilitate the study of biological functions, molecular mechanisms, and evolutionary relationships across various organisms. The integration of functional data is critical for advancing our understanding of complex biological processes, enabling researchers to predict functions, discover novel interactions, and explore the roles of genes and proteins in health and disease.

Applications of Functional Databases

Gene and Protein Annotation

Functional databases serve as critical resources for the annotation of genes and proteins, providing detailed insights into their roles within cellular processes. Pfam and KEGG are particularly valuable: Pfam catalogs protein families and their conserved domains, while KEGG organizes gene products into biochemical pathways. Classifying a protein by its functional domains and mapping it to pathways helps predict its involvement in metabolic and signaling processes, which is essential for elucidating gene function and regulation across different biological systems.

Drug Resistance and Pathogenicity Studies

In antimicrobial resistance and pathogenicity research, databases such as CARD and VFDB play a pivotal role. These resources catalog genes associated with drug resistance and virulence, facilitating the identification of resistance mechanisms and pathogenic traits in microbial genomes. This information is crucial for developing new therapeutic strategies and managing infectious diseases.

Metagenomic and Functional Genomic Analysis

Functional databases are also extensively utilized in metagenomic studies to predict the functional capabilities of microbial communities. For instance, the eggNOG and CAZy databases offer annotations that aid in understanding the metabolic potential of microbial consortia in environmental samples. These analyses enhance our knowledge of microbial ecology and the role of microorganisms in ecosystem processes.
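As a minimal sketch of how such a functional profile might be tallied once genes have been annotated, the Python snippet below computes the share of each functional category among all annotations in one sample. The gene IDs, the example labels, and the `functional_profile` helper are hypothetical and are not tied to the output format of any particular eggNOG or CAZy tool.

```python
from collections import Counter


def functional_profile(annotations):
    """Return each category's share of all annotation assignments.

    `annotations` maps a gene ID to a list of category labels;
    an empty list means the gene received no annotation.
    """
    counts = Counter(cat for cats in annotations.values() for cat in cats)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}


# Hypothetical annotations for four predicted genes in one sample.
sample = {
    "gene_1": ["GH5"],
    "gene_2": ["GH5", "CBM2"],
    "gene_3": [],
    "gene_4": ["GT2"],
}
profile = functional_profile(sample)
```

Comparing such profiles between samples is one simple way to contrast the metabolic potential of different communities; real analyses usually also normalize for gene length and sequencing depth.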

Essential Functional Databases

eggNOG

The evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG) database is a key resource in functional genomics. It provides orthologous groups and functional annotations across a broad spectrum of organisms, computed at multiple taxonomic levels, which makes it well suited to studying gene evolution and functional conservation. By organizing genes into orthologous groups, eggNOG enables researchers to investigate evolutionary relationships and to predict the function of an uncharacterized gene from the annotated orthologs in its group. This database is instrumental in elucidating the evolutionary history and functional roles of genes across species.

Figure 1. Taxonomic levels for orthologous groups (OGs) computed for (A) prokaryotic, (B) eukaryotic, and (C) viral genomes. Taxonomic levels introduced in new eggNOG versions are highlighted in blue. Data presented include the number of OGs per level (red), species coverage (black), and functional annotation coverage (green).

CAZy

The Carbohydrate-Active enZYmes (CAZy) database is a specialized resource that focuses on enzymes involved in the synthesis, modification, and breakdown of carbohydrates. It is widely used in studies of microbial metabolism, bioenergy production, and the development of biotechnological applications involving carbohydrate-active enzymes.

CARD

The Comprehensive Antibiotic Resistance Database (CARD) is an extensively curated repository of genes linked to antibiotic resistance. This database plays a crucial role in the identification of resistance determinants within microbial genomes. By offering detailed information on resistance mechanisms, CARD supports research into the molecular basis of antimicrobial resistance and aids in the development of strategies to mitigate the growing threat of antibiotic-resistant infections.

dbCAN

dbCAN (DataBase for automated Carbohydrate-active enzyme ANnotation) is a resource for annotating carbohydrate-active enzymes (CAZymes). It offers tools for identifying CAZymes in genomic and metagenomic datasets and assigning them to CAZy families. dbCAN is widely used in microbial ecology, bioenergy, and biotechnology research, as it facilitates the functional characterization of enzymes involved in carbohydrate metabolism.
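To illustrate what classifying a CAZyme involves, the sketch below splits a CAZy family label into its class, family, and optional subfamily. The six class abbreviations follow the CAZy classification; the `parse_cazy_family` helper and the assumption that labels arrive in the form `GH5` or `GH5_2` are illustrative and do not describe the output of any specific tool.

```python
import re

# The CAZy classes and their standard abbreviations.
CAZY_CLASSES = {
    "GH": "Glycoside Hydrolases",
    "GT": "GlycosylTransferases",
    "PL": "Polysaccharide Lyases",
    "CE": "Carbohydrate Esterases",
    "AA": "Auxiliary Activities",
    "CBM": "Carbohydrate-Binding Modules",
}

# Class prefix, family number, and an optional "_<subfamily>" suffix.
_FAMILY_RE = re.compile(r"^(GH|GT|PL|CE|AA|CBM)(\d+)(?:_(\d+))?$")


def parse_cazy_family(label):
    """Split a label such as 'GH5_2' into class, family, and subfamily.

    Returns None when the label does not look like a CAZy family.
    """
    match = _FAMILY_RE.match(label.strip())
    if match is None:
        return None
    prefix, family, subfamily = match.groups()
    return {
        "class": CAZY_CLASSES[prefix],
        "family": f"{prefix}{family}",
        "subfamily": int(subfamily) if subfamily else None,
    }


parsed = parse_cazy_family("GH5_2")
```

Grouping hits by the `class` or `family` field is a common first step when summarizing the carbohydrate-degrading capacity of a genome.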

Pfam

Pfam is a comprehensive database of protein families. Each family is represented by a profile hidden Markov model (HMM) that is used to identify conserved domains and motifs in protein sequences. Pfam is widely used for protein classification, function prediction, and evolutionary studies, providing a foundational resource for functional genomics.
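In practice, Pfam models are commonly searched with HMMER, whose `hmmscan --domtblout` option writes one whitespace-delimited row per domain hit. The sketch below parses such rows and keeps hits under an E-value cutoff. The `parse_domtblout` helper, the cutoff of 1e-5, and the two sample rows (including their accessions) are assumptions made for illustration, not recommended settings or real results.

```python
def parse_domtblout(lines, max_ievalue=1e-5):
    """Collect domain hits from HMMER domtblout rows.

    Keeps only hits whose independent E-value (column 13) is at or
    below `max_ievalue`. Comment lines starting with '#' are skipped.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # 22 fixed columns, then a free-text description.
        fields = line.split(maxsplit=22)
        if len(fields) < 22:
            continue
        i_evalue = float(fields[12])
        if i_evalue > max_ievalue:
            continue
        hits.append({
            "protein": fields[3],      # query name (for hmmscan)
            "domain": fields[0],       # target (model) name
            "accession": fields[1],    # target accession
            "i_evalue": i_evalue,
            "ali_from": int(fields[17]),
            "ali_to": int(fields[18]),
        })
    return hits


# Two made-up rows: one strong hit and one weak hit.
sample_rows = [
    "# target name accession tlen query name ...",
    "Dom_A PF99999.1 190 prot_A - 250 1.2e-40 135.0 0.1 1 1 "
    "3.0e-44 2.5e-40 134.0 0.1 2 188 10 200 8 203 0.95 Hypothetical domain A",
    "Dom_B PF99998.1 120 prot_B - 300 0.8 8.5 0.0 1 1 "
    "0.2 0.5 7.9 0.0 5 40 100 140 95 150 0.70 Hypothetical domain B",
]
hits = parse_domtblout(sample_rows)
```

Only the strong hit survives the filter here. Pfam also publishes curated per-family gathering thresholds, which are often preferred over a single global E-value cutoff.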

Figure 2. Graph showing relationships among Pfam entries within the Glutaminase I clan (accession: CL0014). Each node (circle) represents an entry, with diameter proportional to the number of sequences in the alignment. Edges indicate similarity based on HHsearch results, with width proportional to the E-value (E-values ≤ 0.01 are significant).

Resfams

Resfams is a specialized database focused on protein families associated with antibiotic resistance. Complementing CARD, it provides a curated collection of hidden Markov models (HMMs) for identifying resistance genes within microbial genomes. Resfams is particularly valuable for high-throughput screening of resistance determinants in metagenomic datasets.

KEGG

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a comprehensive database that integrates genomic, chemical, and systemic functional information. It provides pathway maps, ortholog clusters, and molecular interaction networks, which are essential for the in-depth study of metabolic and signaling pathways. KEGG serves as a crucial resource for understanding complex biological systems, enabling researchers to explore the molecular interactions and regulatory mechanisms that underpin cellular processes.
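One simple use of pathway maps and ortholog clusters is to ask how much of a pathway is represented in a genome. The sketch below computes, for each pathway, the fraction of its ortholog identifiers found among a genome's annotations. All identifiers and the `pathway_coverage` helper are hypothetical placeholders, and this naive fraction is an assumption for illustration; it ignores alternative enzymes and reaction order.

```python
def pathway_coverage(genome_orthologs, pathway_to_orthologs):
    """Return the fraction of each pathway's orthologs present in a genome.

    `genome_orthologs` is a set of ortholog IDs annotated in the genome;
    `pathway_to_orthologs` maps a pathway ID to the set of ortholog IDs
    assigned to that pathway. Pathways with no orthologs are skipped.
    """
    coverage = {}
    for pathway, orthologs in pathway_to_orthologs.items():
        if not orthologs:
            continue
        present = orthologs & genome_orthologs
        coverage[pathway] = len(present) / len(orthologs)
    return coverage


# Hypothetical pathway definitions and genome annotation.
pathways = {
    "pathway_demo_1": {"K_demo_1", "K_demo_2", "K_demo_3", "K_demo_4"},
    "pathway_demo_2": {"K_demo_5", "K_demo_6"},
}
genome = {"K_demo_1", "K_demo_2", "K_demo_3", "K_demo_6", "K_demo_9"}
coverage = pathway_coverage(genome, pathways)
```

With these placeholder sets, the first pathway is three-quarters covered and the second half covered.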

Figure 3. Workflow for the organization and annotation of GENES databases.

GO

The Gene Ontology (GO) database provides a structured vocabulary for the annotation of genes and proteins based on their biological processes, cellular components, and molecular functions. GO is widely used in functional genomics, enabling consistent annotations across different species and facilitating the integration of functional data.
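Because GO terms are arranged in a hierarchy in which a term can have more than one parent, an annotation to a specific term also implies its more general ancestors. The sketch below propagates annotations upward through a small parent map. The term IDs, the toy hierarchy, and the helper names are hypothetical; the sketch also treats every parent link alike, whereas real GO distinguishes several relationship types.

```python
def ancestors(term, parents):
    """Return all terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(gene_terms, parents):
    """Extend each gene's direct terms with all of their ancestors."""
    return {
        gene: set(terms) | {a for t in terms for a in ancestors(t, parents)}
        for gene, terms in gene_terms.items()
    }


# Toy hierarchy: a child term maps to its list of parent terms.
parents = {
    "T:carb_catabolism": ["T:carb_metabolism", "T:catabolism"],
    "T:carb_metabolism": ["T:metabolism"],
    "T:catabolism": ["T:metabolism"],
    "T:metabolism": ["T:root"],
}
direct = {"gene_1": {"T:carb_catabolism"}, "gene_2": {"T:catabolism"}}
full = propagate(direct, parents)
```

After propagation, both genes share the general terms, which is what allows annotations made at different levels of detail to be compared across species.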

PHI-base

The Pathogen-Host Interactions Database (PHI-base) is a specialized resource that catalogs experimentally verified pathogenicity, virulence, and effector genes from pathogens. It is used to study pathogen-host interactions and to identify potential targets for disease control in agriculture and medicine.

TCDB

The Transporter Classification Database (TCDB) provides information on the classification and functional characterization of membrane transport proteins. It is an essential resource for understanding the roles of transporters in cellular processes, including nutrient uptake, ion transport, and drug resistance.
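TCDB identifies each transport system with a five-part TC number such as 2.A.1.1.1, whose components denote class, subclass, family, subfamily, and individual system. The sketch below parses such an identifier so that hits can be grouped at the family level. The `TCNumber` type and `parse_tc_number` helper are hypothetical names introduced for this example.

```python
import re
from typing import NamedTuple, Optional


class TCNumber(NamedTuple):
    tc_class: int   # e.g. 2
    subclass: str   # e.g. "A"
    family: int     # e.g. 1
    subfamily: int  # e.g. 1
    system: int     # e.g. 1

    @property
    def family_id(self) -> str:
        """Class, subclass, and family joined, e.g. '2.A.1'."""
        return f"{self.tc_class}.{self.subclass}.{self.family}"


_TC_RE = re.compile(r"^(\d)\.([A-Z])\.(\d+)\.(\d+)\.(\d+)$")


def parse_tc_number(text: str) -> Optional[TCNumber]:
    """Parse a full five-part TC number; return None if malformed."""
    match = _TC_RE.match(text.strip())
    if match is None:
        return None
    cls, sub, fam, subfam, system = match.groups()
    return TCNumber(int(cls), sub, int(fam), int(subfam), int(system))


tc = parse_tc_number("2.A.1.1.1")
```

Here `tc.family_id` evaluates to `"2.A.1"`, so transporters can be counted per family without regard to the finer subfamily and system levels.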

VFDB

The Virulence Factor Database (VFDB) focuses on genes related to bacterial virulence. It provides detailed annotations of virulence factors, supporting research in pathogenic microbiology, vaccine development, and the identification of targets for antimicrobial therapy.

Figure 4. Evolution of the VFDB showing Release 1 (R1) as the core, and Releases 2 (R2) and 3 (R3) expanding data and adding comparative genomics features.

Conclusion

Functional databases are indispensable resources in the study of molecular biology, providing comprehensive annotations and insights into the functions of genes and proteins. These databases enable researchers to explore the complex relationships between genetic information and biological functions, contributing to advancements in various fields, including genomics, metagenomics, and systems biology. As research progresses, these databases will continue to play a crucial role in elucidating the molecular mechanisms underlying health, disease, and environmental processes.

