Gene Mining

Gene mining has proved to be one of the most powerful tools of the post-genomic age, driven by the rapid growth of sequencing technology and the resulting avalanche of biological data. As the cost of genome sequencing has fallen and throughput has grown exponentially over the last decade, the major challenge in the life sciences has shifted from producing data to interpreting it. Gene mining emerged in response to this change: a collection of computational and experimental approaches for identifying genes associated with desirable traits, biochemical processes, or disease pathways by searching genomic and multi-omics data. Rather than analysing genes one at a time, gene mining takes a genome-wide, integrative view, which helps investigators bridge the traditional gap between genotype and phenotype.

Gene mining has its conceptual origins in classical genetics and comparative biology, but its current form cannot be discussed without reference to progress in bioinformatics. Early gene discovery relied on forward genetics, mutagenesis and phenotypic screening, which were powerful but slow and usually limited to model systems. Whole-genome sequencing altered this paradigm radically. It became possible to identify candidate genes in silico through comparative genomics, pan-genome analyses and genome-wide association studies, and then to confirm them experimentally. Studies from the past decade indexed in the Scopus database show that gene mining relies increasingly on combining multiple layers of omics data, such as genomics, transcriptomics, proteomics, metabolomics and epigenomics. This integration allows not only the detection of candidate genes but also the inference of their regulatory functions, interaction patterns and evolutionary dynamics.

In essence, gene mining rests on the fact that biological function leaves signatures in data. Genes that play related roles are likely to share conserved sequences, co-expression patterns or network connectivity. Bioinformatic techniques can therefore exploit these patterns to rank genes associated with traits of interest. Comparative genomics identifies genes that are conserved or lineage-specific, association studies identify genetic variants linked to phenotype, and transcriptomic studies reveal genes that are dynamically regulated under different environmental or physiological conditions. In recent years, machine-learning techniques have moved to the centre of this workflow. Random forests, support vector machines and deep neural networks are increasingly used to process high-dimensional data, capture non-linear relationships and rank candidate genes from among thousands of possibilities. Such tools are not substitutes for biological insight; rather, they augment it by operating at data scales that would otherwise be intractable.
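The idea that co-expression can be used to rank candidates can be illustrated with a minimal "guilt by association" sketch. It assumes a small, made-up expression table and a set of seed genes already linked to the trait; the gene names, the values and the helper `rank_by_coexpression` are all hypothetical, and mean Pearson correlation is only one of many possible scoring choices.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_by_coexpression(expression, seed_genes):
    """Rank non-seed genes by their mean correlation with the seed genes."""
    seeds = [g for g in seed_genes if g in expression]
    if not seeds:
        raise ValueError("no seed gene is present in the expression table")
    scores = {}
    for gene, profile in expression.items():
        if gene in seeds:
            continue
        scores[gene] = sum(pearson(profile, expression[s]) for s in seeds) / len(seeds)
    # Highest mean correlation first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical expression values across four conditions (illustrative only).
expression = {
    "seedA": [1.0, 2.0, 3.0, 4.0],
    "seedB": [2.0, 4.1, 5.9, 8.0],
    "cand1": [1.1, 2.1, 2.9, 4.2],  # rises with the seeds
    "cand2": [4.0, 3.0, 2.0, 1.0],  # falls as the seeds rise
    "cand3": [2.0, 2.0, 2.0, 2.0],  # flat profile
}

ranking = rank_by_coexpression(expression, ["seedA", "seedB"])
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

In practice the ranking would be computed over thousands of genes and many samples, and the top candidates would still need experimental confirmation.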

In agriculture, gene mining has become established in modern crop improvement. Climate change, population growth and scarcity of resources have intensified the demand for resilient, productive and nutritionally enriched crops. Mining plant genomes and pan-genomes is now routine practice for identifying genes linked to drought tolerance, heat tolerance, disease resistance and yield stability. Transcriptomic and co-expression network methods applied under realistic field conditions have brought notable improvements, allowing scientists to whittle large gene lists down to small, high-confidence sets. Those genes can then be carried forward into marker-assisted breeding or targeted with genome-editing technologies. Consequently, gene mining has helped convert genomic information into practical breeding outcomes.

Industrial biotechnology is another sphere in which gene mining has been transformative. Microorganisms harbour vast biochemical diversity, yet most cannot be cultured by standard techniques. By making this hidden diversity accessible through direct sequencing of environmental DNA, metagenomics has opened access to previously unknown genes encoding enzymes with industrial utility, and bioinformatic pipelines are used to extract them. Over the last ten years, large-scale gene mining studies have identified enormous repertoires of enzymes active in biomass degradation, polymer breakdown and specialised chemical transformations. Genome mining of biosynthetic gene clusters has also transformed natural-product discovery, enabling researchers to predict and access new antibiotics, pigments and bioactive compounds without relying solely on traditional culture-based screening.

Gene mining is also taking on an important role in food science and technology. Fermented foods are shaped by complex microbial communities, whose metabolic activities are encoded in their genomes. New sequencing and transcriptomic technologies can now be used to mine these communities for genes involved in flavour development, vitamin biosynthesis and the production of bioactive peptides. By associating gene content and expression with fermentation outcomes, researchers can optimise fermentation processes and select starter cultures rationally. At the same time, gene mining in crop plants has enabled the identification of genes that regulate micronutrient levels and lipid profiles, thereby facilitating the production of biofortified foods and healthier edible oils.

In medicine, gene mining forms a basis for precision medicine. Sequencing of patient cohorts produces vast datasets that have to be analysed to determine disease-related genes and pathways. Network-based methods combine genetic variation with gene expression, protein-protein interaction and regulatory data to rank candidate disease genes, even for complex diseases with a polygenic architecture. Pharmacogenomics offers a direct medical application of gene mining: variants in drug-metabolising enzymes and transporters can be identified to predict an individual's response to drug treatment and to inform individualised therapy. Over the past ten years, improvements in both computational methods and curated genomic resources have substantially increased the clinical utility of these analyses.
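One common family of network-based ranking methods propagates evidence from known disease genes across an interaction network. The sketch below uses a random walk with restart as a representative example; it is not a method named in the text above, and the toy network, the gene labels (`D1`, `G1`, and so on), the restart probability and the function name are all illustrative assumptions.

```python
def random_walk_with_restart(edges, seeds, restart=0.3, iterations=50):
    """Score nodes of an undirected network by proximity to the seed nodes."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    nodes = sorted(neighbours)
    seeds_in_network = [s for s in seeds if s in neighbours]
    if not seeds_in_network:
        raise ValueError("no seed gene is present in the network")

    # The walker starts (and restarts) uniformly on the seed genes.
    start = {n: 0.0 for n in nodes}
    for s in seeds_in_network:
        start[s] = 1.0 / len(seeds_in_network)

    prob = dict(start)
    for _ in range(iterations):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            # Spread the remaining probability evenly over the neighbours of n.
            share = (1.0 - restart) * prob[n] / len(neighbours[n])
            for m in neighbours[n]:
                nxt[m] += share
        prob = nxt
    return prob


# Hypothetical interaction network: D1 and D2 are known disease genes,
# G1..G4 are candidates at increasing distance from them.
edges = [
    ("D1", "G1"), ("D2", "G1"), ("D1", "D2"),
    ("G1", "G2"), ("G2", "G3"), ("G3", "G4"),
]
seeds = ["D1", "D2"]

scores = random_walk_with_restart(edges, seeds)
ranked = sorted((g for g in scores if g not in seeds), key=scores.get, reverse=True)
print(ranked)
```

Candidates that sit closer to the seed genes in the network receive higher scores. Real analyses would typically weight the edges by evidence from expression or regulatory data rather than treating every interaction equally.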

New technologies continue to extend the reach and power of gene mining. Deep learning and artificial intelligence have delivered outstanding results in predicting gene function, regulatory elements and protein structure from sequence information. The higher resolution offered by single-cell and spatial genomics allows gene mining to detect cell-type-specific and context-dependent signals that are not evident in bulk analyses. Collectively, these developments are making gene mining a more predictive and generative field.

Despite its successes, gene mining still faces a number of challenges, including data heterogeneity, incomplete functional annotation and the need for experimental validation. Attention must also be paid to the interpretability of complex machine-learning models and to the ethical aspects of genomic data. Nevertheless, the direction of the field is clear. Gene mining has become a unifying framework for extracting value from biological big data and for the systematic discovery of genes in agriculture, industry, food science and healthcare. As computational capabilities and data-integration strategies continue to improve, gene mining is likely to become increasingly instrumental in translating genomic information into practical scientific and social value.