Lost in the Database: The Global Gap in Genomic Information

Genomic data in the major genetic databases remain far from complete, leaving a wide gap between known biodiversity and sequenced genomes. A 2018 study estimated that Earth hosts around 10 to 15 million eukaryotic species, of which about 2.3 million have been described, yet fewer than 15,000 genomes had been sequenced at that time. By 2025, NCBI GenBank contained sequences of some kind from approximately 581,000 species, which is about a quarter of the described species and roughly 4 to 6% of the estimated global total. Whole-genome coverage at the kingdom level is far more limited: only 1,482 plant species have nuclear genome data, just 0.4% of the 391,000 known species. Similarly, GenBank holds genomic data for about 3,278 animal species, just 0.2% of the 1.66 million described. Microbial representation is somewhat broader but still incomplete. Although there may be hundreds of thousands to millions of prokaryotic taxa worldwide, only about 200,000 bacterial and archaeal genomes are available, which corresponds to an estimated 2.1% of global prokaryotic diversity. Coverage is also uneven among plants, animals, and fungi, and similar imbalances exist within many groups. Moreover, even among the genomes that are available, quality varies widely, with many existing only as draft assemblies rather than complete, well-annotated references.
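The kingdom-level percentages above are simple ratios of sequenced species to described species. As a minimal sketch for checking them, the snippet below recomputes the plant and animal figures and the 2018 genome count from the numbers quoted in the paragraph. The function name `coverage_percent` and the dictionary layout are illustrative choices, not part of any database tool.

```python
# Figures quoted in the paragraph above: (species with genome data, described species).
# The dictionary layout and all names here are illustrative assumptions.
KINGDOM_FIGURES = {
    "plants": (1_482, 391_000),
    "animals": (3_278, 1_660_000),
}


def coverage_percent(sequenced: int, total: int) -> float:
    """Return the share of `total` species covered by `sequenced`, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * sequenced / total


for kingdom, (sequenced, described) in KINGDOM_FIGURES.items():
    share = coverage_percent(sequenced, described)
    print(f"{kingdom}: {share:.2f}% of described species")

# 2018 snapshot: fewer than 15,000 genomes, so these shares are upper bounds.
GENOMES_2018 = 15_000
print(f"described (2.3 million): {coverage_percent(GENOMES_2018, 2_300_000):.2f}%")
print(f"estimated (10 million): {coverage_percent(GENOMES_2018, 10_000_000):.2f}%")
print(f"estimated (15 million): {coverage_percent(GENOMES_2018, 15_000_000):.2f}%")
```

Keeping the counts and denominators side by side makes it easy to update the shares as new database snapshots are released.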

Several factors contribute to this limited coverage. First, technological and resource constraints remain major obstacles. Although sequencing technologies have become faster and cheaper, many species have extremely large or complex genomes, such as those of polyploid plants and amphibians, which are difficult to assemble. Obtaining high-quality DNA and applying advanced bioinformatics tools are also challenges, particularly in regions with limited resources. Second, funding priorities often favor economically valuable or model organisms such as humans, mice, rice, and maize, leaving wild or less commercially relevant species overlooked. Broad coverage is also expensive: the Earth BioGenome Project alone is projected to require around 4.7 billion USD. Third, research bias skews heavily toward vertebrates and domesticated animals, even though arthropods account for about 78.5% of all animal species. Geographic bias also plays a major role, as about 95.5% of animal genome publications in GenBank originate from institutions in North America, Europe, and Asia, with 70% of contributions concentrated in the United States, China, and Switzerland. As a result, biodiversity-rich tropical regions remain seriously underrepresented. Fourth, specimen collection is often difficult, especially for species inhabiting remote ecosystems such as deep oceans or tropical rainforests. Rare and endangered species pose additional challenges because collecting samples risks harming fragile populations, and limited taxonomic expertise can lead to misidentification. Finally, legal and policy barriers complicate international sharing of genetic resources: the Nagoya Protocol requires benefit-sharing agreements, and some countries prohibit the export of raw DNA, which further restricts collaboration.

The imbalance in genomic data has serious implications for science, conservation, and innovation. Reference genomes are essential for research in phylogenetics, population genetics, and adaptation, yet many taxa remain in “biological darkness” without them. For example, Saccharum, the genus of sugarcane, the world’s largest crop by production volume, currently has only 12 genomic entries in NCBI. Similarly, Coffea, the genus of coffee, has just 11 entries, and Camellia, the genus of tea, has only 36, even though coffee and tea are among the most widely consumed beverages in the world after water. Musa, the genus of banana and one of the most consumed fruits globally, is also underrepresented, with only 19 genomic entries available. Such gaps show that even economically and culturally significant plants are poorly represented, which limits agricultural research and broader scientific progress. Genomic data are also vital for conservation, as they reveal hidden species diversity, genetic variation, and population structure, all of which can inform more effective management strategies. Whole-genome sequencing can help uncover cryptic species, monitor genetic erosion, and predict responses to climate change. Without such information, conservation policies are less precise and less effective. Furthermore, many advances in biotechnology draw on the genetic diversity of wild organisms, as with anticancer compounds discovered in marine sponges or the yeasts used in food production. Thus, incomplete genomic coverage not only slows basic research but also hinders innovation and weakens biodiversity protection worldwide.
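Genus-level counts like those quoted above can, in principle, be re-checked against NCBI's public E-utilities `esearch` endpoint. The sketch below is an assumption-laden illustration: the text does not say which NCBI database its counts came from, so the default `db="assembly"` is a guess, the `[Organism]` search field is one reasonable way to query a genus, and the helper names are hypothetical. Live results will differ from the figures in the text as records are added.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(genus: str, db: str = "assembly") -> str:
    """Build an esearch URL for a genus (the db default is an assumption)."""
    params = {"db": db, "term": f"{genus}[Organism]", "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(payload: str) -> int:
    """Extract the total hit count from an esearch JSON response body."""
    return int(json.loads(payload)["esearchresult"]["count"])


def fetch_count(genus: str, db: str = "assembly") -> int:
    """Query NCBI over the network and return the number of matching records."""
    with urlopen(build_esearch_url(genus, db), timeout=30) as response:
        return parse_count(response.read().decode("utf-8"))
```

Calling `fetch_count("Musa")` performs a live request, so repeated calls should be spaced out to respect NCBI's usage limits. Separating URL construction and response parsing from the network call keeps the first two testable offline.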

At the same time, these gaps present major opportunities for impactful research. Sequencing previously unstudied species and samples can lead to groundbreaking discoveries, as demonstrated by the first whole-genome sequence of an ancient Egyptian individual, recovered from teeth roughly 4,500 to 4,800 years old, which offered new insights into the ancestry of early Egyptian populations. Similar opportunities exist for endemic, rare, or economically overlooked species. Researchers can focus on biodiversity-rich regions such as tropical rainforests or conduct in-depth studies in “genomic dead zones,” which remain vastly underexplored. Joining large international consortia such as the Earth BioGenome Project or the Darwin Tree of Life Project provides access to infrastructure, funding, and collaborative networks that can accelerate progress. In addition, complementary approaches such as DNA barcoding, metabarcoding, and metagenomics can expand biodiversity surveys and support species identification in the field. Publishing data openly in genomic repositories and journals further ensures that new findings are available globally. Strengthening international collaboration and establishing joint funding mechanisms are also critical to building genomic capacity in developing countries, allowing their researchers to contribute more fully to closing the global data gap.

Several global initiatives are already working to reduce this inequality. The Earth BioGenome Project aims to sequence, catalog, and characterize the genomes of all eukaryotic species on Earth within about a decade, starting with representatives of around 9,500 families. The Darwin Tree of Life Project seeks to sequence 70,000 eukaryotic species found in Britain and Ireland. Other projects focus on specific groups, such as the Vertebrate Genomes Project, the Bird 10,000 Genomes Project (B10K), Bat1K, and i5K for arthropods. Large-scale plant-focused projects, such as the Ten Thousand Plant Genomes Project led by BGI, are also contributing significantly. Complementary initiatives such as the Earth Microbiome Project and the International Barcode of Life aim to add large amounts of microbial community and DNA barcode data, expanding taxonomic coverage beyond full genomes. Together, these efforts reflect a global shift toward building a comprehensive catalog of life’s genetic diversity. With continued funding, equitable collaboration, and sustained commitment, the scientific community has a real chance to close the genomic data gap and move closer to a complete representation of Earth’s genetic heritage.
