Abstract
Bioinformatics is an amalgamation of biology, mathematics and computer science. It gathers biological information at the molecular level and applies informatics techniques to understand and organize these data in a useful manner. With the help of bioinformatics, experimental data are stored in several online databases, such as nucleotide databases, protein databases, GenBank and others. The data stored in these databases are used as a reference for experimental evaluation and validation. To date, several online tools have been developed to analyze genomic, transcriptomic, proteomic, epigenomic and metabolomic data. Some of them include Human Splicing Finder (HSF), Exonic Splicing Enhancer Mutation taster, and others. A number of SNPs are observed in non-coding, intronic regions and play a role in the regulation of genes, which may or may not directly affect protein expression. Many mutations are thought to influence the splicing mechanism by affecting existing splice sites or creating new sites. HSF was developed to predict the effect of a mutation (SNP) on the splicing mechanism/signal. The tool is thus helpful in predicting the effect of mutations on splicing signals and can provide data for a better understanding of intronic mutations, which can be further validated experimentally. Additionally, rapid advances in proteomics have steered researchers to organize the study of protein structure, function, relationships, and dynamics in space and time. The effective integration of all of these technologies will eventually lead to next-generation systems biology, which will provide valuable biological insights for research, diagnostics, therapeutics and the development of personalized medicine.
Introduction
The emergence of “innovative biology” has been accompanied by the rise of other sciences, such as computational biology and bioinformatics, which share an interface with molecular biology. Because of the large datasets generated, data management and storage have become critically important. Different databases therefore came into existence to organise, store and process large amounts of biological information and make it accessible to the scientific community (Ritchie et al., 2015). The increasing amount of data has been accompanied by an increase in the number of biological databases (Pevsner, 2015). Public databases usually accumulate large amounts of information and are categorised into primary and secondary databases. Primary databases are composed of experimental findings that are reported without any critical analysis in relation to previous publications (Luscombe et al., 2001; Prosdocimi, 2010). In secondary databases, by contrast, the data are collected and interpreted, a process called content curation. In addition, there are various functional databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome, that allow the analysis and interpretation of metabolic maps. Primary databases such as the DNA Database of Japan (DDBJ), GenBank at the National Center for Biotechnology Information (NCBI), and the European Molecular Biology Laboratory (EMBL) database remain the main databases of nucleotide and protein sequences. The International Nucleotide Sequence Database Collaboration (INSDC) is the parent organisation of these databases, which share deposited information with each other daily (Prosdocimi et al., 2002; Amaral et al., 2007; Pevsner, 2015).
The last two decades have witnessed great advances in molecular biology. Data analysis procedures were established at a fast pace to enable interpretation of the large amount of information produced mainly by DNA sequencing technologies, which drove an exponential growth in genomic, transcriptomic and proteomic information. Although genomics and proteomics are considered recent domains, they have emerged interdependently and, coupled with innovations in computational resources, have produced huge volumes of biological data and data analyses that can enhance and accelerate developments in medical science (Verli, 2014). At present, the ‘-omics’ suffix covers genomics, transcriptomics, proteomics, phylogenomics, metabolomics and metagenomics, all of which are associated with large-scale biological data and the allied bioinformatics analysis. The emergence of high-throughput sequencing innovations, starting with improvements in Sanger sequencing and continuing with NGS technologies and next-generation proteomics, has resulted in novel findings in clinical settings (Zhou et al., 2010).
Genome-Wide Approach—From Genome to Proteome
DNA sequencing plays a crucial role in the progress of molecular biology, not only changing the landscape of genome research but also opening up new opportunities in the therapeutic arena and personalised medicine.
Genomics
Generally, genomics is the domain that aims to uncover the structure, function, and novel aspects of genomes by applying bioinformatics tools to explore sequenced genomes (Altmann et al., 2012).
The pioneering work of Paul Berg (Jackson et al., 1972), Frederick Sanger (Sanger and Coulson, 1975), and Walter Gilbert (Maxam and Gilbert, 1977) on DNA sequencing enabled several developments that opened up completely new possibilities for DNA analysis, including Sanger’s ‘chain-termination’ sequencing technology, more commonly known as Sanger sequencing (Sanger et al., 1977). Further technological advances led to the release of the first automated DNA sequencer (ABI PRISM AB370A) in 1986, allowing a draft of the human genome to be completed during the next decade (Venter et al., 2001). Newer methods have since been developed that are meant to supplement and eventually replace Sanger sequencing (Figure 1). This technology is commonly known as next-generation sequencing (NGS) or massively parallel sequencing (MPS) and encompasses a wide range of methodologies. With this technology, huge amounts of data can be generated, and each instrument run is faster and more cost-effective. The NGS market is currently expanding, with the worldwide market expected to reach 21.62 billion US dollars by 2025, up around 20% from 2017 (BCC Research, 2019). As a result, multiple companies are currently competing in this business, including BGI Genomics, Illumina, Ion Torrent (Thermo Fisher Scientific), PacBio and Oxford Nanopore Technologies. All of them provide distinct approaches to the same task: the generation of sequencing data. Second-generation sequencing relies on massively parallel sequencing of clonally amplified molecules (by PCR, the polymerase chain reaction) (Shendure and Ji, 2008), whereas third-generation sequencing depends on sequencing single molecules without prior clonal amplification (Schadt et al., 2010; van Dijk et al., 2018; Ameur et al., 2019). The NGS process includes the following steps:
FIGURE 1
1) NGS library preparation: In next-generation sequencing, a library comprises DNA/RNA fragments that represent the full genome/transcriptome or a region of interest. Although each NGS platform has its own unique features, the production of an NGS library generally begins with fragmentation of the DNA/RNA, followed by ligation of sequencing adaptors to the fragments to permit their enrichment. A good library should have high sensitivity and specificity: all relevant fragments should be properly represented in the library, and there should be no random errors (non-specific products). This is easier said than done, because genomic regions are not all equally amenable to sequencing, which makes the creation of a sensitive and specific library difficult and cumbersome (Aird et al., 2011).
2) NGS Platforms
Platforms for Second-Generation Sequencing
Second-generation systems belong to the category of cyclic-array sequencing technologies (Amaral et al., 2007). The core workflow for second-generation platforms comprises library preparation and amplification (from RNA/DNA samples), clonal expansion, sequencing, and analysis. Ion Torrent and Illumina are the two best-known providers of second-generation sequencing systems (Kircher et al., 2011; Quail et al., 2012).
3) Platforms for Third-Generation Sequencing:
Third-generation NGS technology made it possible to avoid the limitations of PCR-based methods, such as nucleotide misincorporation by the polymerase, chimaera formation, and allelic drop-out resulting in a false homozygosity call (Thompson and Steinmann, 2010). The Helicos Genetic Analysis System was the first commercial third-generation sequencer (Pushkarev et al., 2009). Pacific Biosciences (PacBio RS II sequencer) established the notion of single-molecule real-time (SMRT) sequencing in 2011 (McCarthy, 2010). This method allows the sequencing of long reads (up to 30 kb on average). Individual DNA polymerases are attached to zero-mode waveguide (ZMW) wells, which are nanoholes in which a single DNA polymerase molecule can be placed directly (McCarthy, 2010). PacBio has since released the Sequel II System, which is claimed to cut project costs and timelines compared with previous versions and to deliver highly accurate individual long reads (HiFi reads) of up to 175 kb (Pereira et al., 2020).
Merker and co-workers were the first to demonstrate the use of PacBio long-read genome sequencing to find a pathogenic variant in patients with Mendelian disease, indicating that this method has considerable potential for identifying structural variation (Merker et al., 2018). The Chromium instrument, which uses gel beads in emulsion (GEMs) technology, was released by 10X Genomics in 2016 (Pereira et al., 2020). The benefit of GEMs technology is that it reduces time, starting material, and cost (Zheng et al., 2016; Zheng et al., 2017; Pereira et al., 2020). With low false-positive rates and high throughput, the Chromium system can also perform single-cell genomic and transcriptional profiling, immunological profiling, and chromatin accessibility studies at single-cell resolution. As a result, intriguing new applications are emerging, particularly in the areas of epigenetics research, de novo genome assembly, and long sequencing reads (Delaneau et al., 2019; Laurentino et al., 2019; Wang et al., 2019).
4) Innovative bioinformatics approach: Sequencing platforms are improving, and it is now possible to sequence a human genome in as little as a week or two. The huge volume of data generated requires bioinformatics and computational expertise to organise, analyse, and interpret it. As a result, NGS bioinformatics is undergoing significant development, which depends on improving computational capabilities (hardware) as well as algorithms and applications (software) to streamline all required steps, from the processing of raw data to detailed data analysis and variant interpretation in a clinical setting.
Analysis of the NGS data: NGS bioinformatics is usually classified into three categories: primary, secondary, and tertiary analysis (Pereira et al., 2020).
Primary data analysis comprises the detection and evaluation of raw data (signal analysis), the generation of legible sequencing reads (base calling), and the estimation of base quality (Ledergerber and Dessimoz, 2011). It typically produces a FASTQ file (Illumina) or an unmapped binary alignment map (uBAM) file (Ion Torrent).
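To illustrate how base quality is carried in the output of primary analysis, the following is a minimal sketch of reading FASTQ records and decoding their quality strings. It assumes simple four-line records and the Phred+33 encoding used by current Illumina instruments; the function names and the example read are hypothetical and are not taken from any specific pipeline.

```python
from statistics import mean


def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from FASTQ-formatted text.

    Assumes each record spans exactly four lines (no wrapped sequences).
    """
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual  # drop the leading '@' of the header


def phred_scores(qual, offset=33):
    """Decode a quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in qual]


def error_probability(q):
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical single-read example.
example = "@read1\nACGT\n+\nII5!\n"
for read_id, seq, qual in parse_fastq(example):
    scores = phred_scores(qual)
    print(read_id, seq, scores, "mean Q =", mean(scores))
```

A Phred score of 20 corresponds to an estimated error probability of 1 in 100, and a score of 40 to 1 in 10,000, which is why per-base quality is commonly used to trim or discard reads before alignment.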
Secondary analysis, which involves read alignment against the reference human genome (usually hg19 or hg38) and variant calling, is the next step in the NGS data analysis workflow.
Sequencing reads can be mapped in two ways: read alignment, which aligns the sequenced fragments (processed data) against a reference genome, or de novo assembly, which constructs a genome from scratch without the use of external data. The availability or absence of a reference genome may be enough to decide between the two techniques. Nonetheless, mapping to a reference sequence is the preferred method for most NGS applications, particularly in clinical genetics (Flicek and Birney, 2009). De novo assembly, on the other hand, is mainly limited to more specialised tasks, such as correcting flaws in the reference genome and improving the detection of structural variants (SVs), other complex rearrangements, and other novel findings (Ameur et al., 2018).
In the context of human clinical genetics, the third main phase of the NGS analysis pipeline addresses the essential issue of “making sense” or data interpretation, which requires finding the basic link between variant data and the observed phenotype in a patient. The tertiary analysis starts with variant annotation, which adds a fresh layer of data to predict the functional impact of all variants found during the variant calling procedure. Variant filtering, prioritisation, and data visualisation approaches are utilised after variant annotation. These procedures can be carried out utilising a number of software suites, which must be updated on a regular basis to reflect the most recent scientific findings, necessitating ongoing maintenance and development on the part of the developers. The generalised workflow of NGS is shown in Figure 2.
FIGURE 2
Variant annotation is a crucial first step in the assessment of sequencing variants. As previously indicated (Scherer et al., 2007), variant calling generates a VCF file. Each line in such a file contains high-level information about a variant, such as its genomic position and its reference and alternate bases, but no information on its biological implications. Variant annotation provides this biological context for all discovered variants. Because of the large amount of NGS data, annotation is performed automatically. Several programmes are currently available for variant annotation, each of which uses distinct approaches and databases. Tools such as Sorting Intolerant from Tolerant (SIFT) (Ng and Henikoff, 2003), PolyPhen-2 (Adzhubei et al., 2010), Combined Annotation Dependent Depletion (CADD) (Kircher et al., 2014) and Condel (González-Pérez and López-Bigas, 2011) compute an impact score for each variant based on various criteria, such as sequence homology, conservation of amino acid residues, evolutionary conservation, protein structure, or statistical prediction based on known mutations, and these scores are integrated into annotation tools. Furthermore, annotation can be used to query disease variant databases such as ClinVar and HGMD for information on clinical associations. Among the many annotation tools available, Annotate Variation (ANNOVAR) (Yang and Wang, 2015), Variant Effect Predictor (VEP) (McLaren et al., 2010), Single Nucleotide Polymorphism Effect (snpEff) (Cingolani et al., 2012), and SeattleSeq (Ng et al., 2009) are the most extensively used. ANNOVAR is a command-line tool that can annotate SNPs, INDELs, and copy number variations (CNVs). It compares variants and reports their functional consequences for genes and other genomic elements (Wang et al., 2010a). The total number of variants obtained from a WES-derived VCF file may range between 30,000 and 50,000.
Filtering algorithms are therefore required to find the variant(s) responsible for a particular disorder; some examples are listed in Table 1. When beginning the tertiary level of NGS analysis, it is strongly advised to first eliminate false-positive calls and variant-calling errors on the basis of quality parameters or prior knowledge of artefacts. The population frequency filter is one of the most widely used NGS filters. It relies on the minor allele frequency (MAF), which sorts variants into different categories: rare variants (MAF <0.5%, usually selected for Mendelian diseases), low-frequency variants (MAF between 0.5 and 5%), and common variants (MAF >5%) (Consortium et al., 2010). MAF information not only aids in identifying disease alleles but also in understanding population migrations, relationships, origins, admixtures, and population size changes, which may be useful in understanding various disease patterns (Stoneking and Krause, 2011). The most extensively used population databases are the 1000 Genomes Project (Siva, 2008), the Exome Aggregation Consortium (ExAC) (Lek et al., 2016), and the Genome Aggregation Database (gnomAD; http://gnomad.broadinstitute.org/). This filter, however, has limits and may result in the incorrect exclusion of variants.
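As an illustration of the high-level information carried by each VCF data line, the sketch below parses the eight fixed, tab-separated columns of a record and labels each alternate allele by type. It is a simplified reading of the format that handles plain sequence alleles only (not symbolic alleles such as structural-variant tags); the `Variant` container, the function names, and the example line are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: list
    info: dict


def parse_vcf_line(line):
    """Parse the fixed columns of one VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, _, value = entry.partition("=")
        # Flag-style INFO entries carry no '=value' part; store them as True.
        info_dict[key] = value if value else True
    return Variant(chrom, int(pos), ref, alt.split(","), info_dict)


def variant_type(ref, alt):
    """Rough classification of a sequence allele by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "INDEL"
    return "MNV"  # same length, more than one base


# Hypothetical record with two alternate alleles.
record = parse_vcf_line("chr1\t12345\t.\tA\tG,AT\t50\tPASS\tDP=30;DB")
for alt in record.alts:
    print(record.chrom, record.pos, record.ref, alt, variant_type(record.ref, alt))
```

Note that nothing in the parsed record says whether the change is benign or damaging; that context is what the annotation tools described above add.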
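The population frequency filter can be sketched in a few lines. The example below assumes the MAF categories are rare (<0.5%), low frequency (0.5 to 5%) and common (>5%), with MAF expressed in percent; the variant records, field names, and the choice to retain variants that are absent from the population database are illustrative assumptions rather than the behaviour of any particular tool.

```python
def classify_by_maf(maf_percent):
    """Return the frequency category for a minor allele frequency in percent."""
    if not 0.0 <= maf_percent <= 50.0:
        raise ValueError("MAF must lie between 0 and 50 percent")
    if maf_percent < 0.5:
        return "rare"
    if maf_percent <= 5.0:
        return "low frequency"
    return "common"


def keep_rare(variants):
    """Keep variants that are rare or not observed in the population database."""
    # Assumption: a missing frequency (None) means 'not observed', so it is kept.
    return [
        v for v in variants
        if v["maf_percent"] is None or classify_by_maf(v["maf_percent"]) == "rare"
    ]


# Hypothetical annotated variants.
variants = [
    {"id": "var1", "maf_percent": 0.01},
    {"id": "var2", "maf_percent": 2.5},
    {"id": "var3", "maf_percent": 12.0},
    {"id": "var4", "maf_percent": None},
]
print([v["id"] for v in keep_rare(variants)])
```

A hard cut-off of this kind is exactly where incorrect exclusion can arise, for example when a causal allele is more frequent in one population than the chosen threshold allows.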
TABLE 1
| S.No | Software | Description | Ref |
|---|---|---|---|
| 1 | PhyloP (Phylogenetic p-values) | Investigates patterns of conservation (positive scores) or acceleration (negative scores) for various annotation classes and clades of interest using a neutral evolution model | Pollard et al. (2010) |
| 2 | SIFT (Sorting Intolerant from Tolerant) | Predicts, based on sequence homology, whether an amino acid (AA) change would affect protein function and possibly alter the phenotype. A variant with a score of less than 0.05 is considered deleterious | Ng and Henikoff, (2003) |
| 3 | PolyPhen-2 (Polymorphism Phenotyping v2) | Predicts the functional impact of an AA substitution from its individual properties using a naive Bayes classifier. Two models are included: HumDiv (intended for complex phenotypes) and HumVar (designed for Mendelian disease diagnosis). Higher scores (>0.85) predict damaging variants with more confidence | Adzhubei et al. (2010) |
| 4 | CADD (Combined Annotation Dependent Depletion) | Scores all human SNVs and indels using a combination of genomic annotations. It prioritises functional, deleterious, and disease-causing variants across functional categories, effect sizes, and genetic architectures. A cut-off score of 10 or above is suggested for identifying pathogenic variants | Kircher et al. (2014) |
| 5 | MutationTaster | Evaluates evolutionary conservation, splice-site alterations, protein loss, and changes that could affect mRNA levels. Variants are classified as either polymorphisms or disease-causing | Schwarz et al. (2010) |
| 6 | nsSNPAnalyzer | Extracts structural and evolutionary information from a query nsSNP and predicts its phenotypic effect using a machine learning method (Random Forest). The variant is classified as either neutral or disease-associated | Bao et al. (2005) |
| 7 | TopoSNP (Topographic mapping of SNP) | Analyses SNPs based on their geometric location and conservation information, producing an interactive visualisation of disease and non-disease SNPs | Stitziel et al. (2004) |
| 8 | ANNOVAR* (Annotate Variation) | Annotates variants based on a variety of criteria, including whether SNPs or CNVs affect protein function (gene-based), locating variants in specified genomic regions outside protein-coding regions (region-based), and locating known variants in public and licensed databases (filter-based) | Yang and Wang, (2015) |
| 9 | VEP* (Variant Effect Predictor) | Determines the impact of variants (SNPs, insertions, deletions, CNVs, or structural variants) on genes, transcripts, and protein sequences, as well as on regulatory regions | McLaren et al. (2010) |
| 10 | snpEff* | Annotates and classifies SNVs based on their effects on annotated genes, such as synonymous/nonsynonymous changes, start or stop codon gains or losses, genomic position, and so on. Considered a structure-based annotation tool | Cingolani et al. (2012) |
| 11 | SeattleSeq | Provides dbSNP rs IDs, gene names and accession numbers, variant functions, protein positions and AA changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical associations for SNVs and small indels | Ng et al. (2009) |
A list of commonly used tools for performing NGS functional filtering, with examples.
The bold values are the names of software/tools.
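The score cut-offs quoted in Table 1 (SIFT below 0.05, PolyPhen-2 above 0.85, CADD of 10 or above) can be combined into a simple functional filter. The sketch below only counts how many predictors flag a variant; the dictionary keys, the function name, and the idea of a vote count are assumptions made for illustration, not a published scoring scheme.

```python
# Cut-offs as quoted in Table 1; the names are illustrative.
SIFT_DELETERIOUS_BELOW = 0.05
POLYPHEN_DAMAGING_ABOVE = 0.85
CADD_PATHOGENIC_FROM = 10.0


def damaging_votes(scores):
    """Return (votes, tools_with_scores) for one variant.

    `scores` maps a tool name ('sift', 'polyphen2', 'cadd') to its score;
    tools without a score (missing or None) are skipped, not counted as benign.
    """
    checks = {
        "sift": lambda s: s < SIFT_DELETERIOUS_BELOW,
        "polyphen2": lambda s: s > POLYPHEN_DAMAGING_ABOVE,
        "cadd": lambda s: s >= CADD_PATHOGENIC_FROM,
    }
    votes = 0
    available = 0
    for tool, is_damaging in checks.items():
        score = scores.get(tool)
        if score is None:
            continue
        available += 1
        if is_damaging(score):
            votes += 1
    return votes, available


# Hypothetical annotated variants.
print(damaging_votes({"sift": 0.01, "polyphen2": 0.95, "cadd": 25.0}))
print(damaging_votes({"sift": 0.40, "polyphen2": 0.10}))
```

Reporting the number of tools that returned a score alongside the votes keeps a variant with missing predictions from being mistaken for one that every tool considers benign.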
Even though functional annotation offers significant information for filtering, the most critical questions to answer, especially in the context of gene discovery, are whether a given variant or mutant gene is the disease-causing one and what its frequency is in the different populations studied globally. To address this difficult issue, a new generation of tools is being created that rank variants and allow them to be prioritised, rather than merely filtering them out (MacArthur et al., 2012; Lelieveld et al., 2016; Harper, 2017). Various approaches have been suggested. For example, PHIVE examines the similarities between human disease phenotypes and those derived from knockout experiments in animal model organisms (Robinson et al., 2014). Other methods approach the problem differently, by computing a lethal score (also known as a burden score) for each gene using data from population variation databases (Eilbeck et al., 2017).
Phevor, which uses data from other relevant ontologies, such as the Gene Ontology (GO), to propose novel gene–disease connections, can also be employed for the identification of novel genes (Singleton et al., 2014). The fundamental purpose of these tools is to provide a small number of variants that can be validated using molecular techniques (Pereira et al., 2019a; Pereira et al., 2019b). VarSeq/VSClinical (Golden Helix), Ingenuity Variant Analysis (Qiagen), Alamut® software (Interactive Biosoftware), and VarElect are recently developed commercial software packages for the interpretation and prioritisation of variants in a clinical context, intended for clinicians, geneticists, and researchers (Stelzer et al., 2016). Apart from the tools that aid in variant analysis and interpretation, clinicians now have access to medical genetics companies such as Invitae (https://www.invitae.com/en/) and CENTOGENE (https://www.centogene.com/) that provide a precise medical diagnosis.
5) Third-generation sequencing technologies are capable of sequencing single molecules with average read lengths of >10,000 bp to 100,000 bp or even more. The advent of this technology has eliminated the requirement for DNA amplification (PCR), and it provides real-time results (Pereira et al., 2020). Third-generation sequencing is offered by Pacific Biosciences (PacBio), which uses the single-molecule real-time (SMRT) platform with fluorescent nucleotide detection, and by Oxford Nanopore Technologies (MinION), which uses nanopore methodology: an ionic current passes through the flow cell, and nucleotide bases are identified by the characteristic changes they produce in the current as they pass through the nanopores (Xiao and Zhou, 2020).
The bioinformatic tools required to analyse data obtained from third-generation sequencing technologies need to be more specific and tolerant of sequencing errors. Some tools are listed in Table 2.
TABLE 2
| S.No | Software | Description | Ref |
|---|---|---|---|
| 1 | MinHash Alignment Process (MHAP) | Detects long read overlaps | Berlin et al. (2015) |
| 2 | Minimap/miniasm | De novo assembler for long reads | Li, (2016) |
| 3 | DALIGN | Finds overlaps and local alignments in very noisy long-read DNA sequencing data sets | Li, (2016) |
| 4 | Graphmap | Enables single-nucleotide variant calling on the human genome with sensitivity increased by 15%; provides precise detection of structural variants of 100 bp to 4 kbp in length | Sović. (2016) |
| 5 | BLASR | Maps long reads influenced by insertion and deletion errors | Chaisson and Tesler, (2012) |
| 6 | Nanocorrect | Error correction in long reads | Loman et al. (2015) |
| 7 | PBJelly | For gap closing in genome assembly | English et al. (2012) |
| 8 | HGAP | De novo assembly | Chin et al. (2013) |
| 9 | PoreSeq | Variant calling | Szalay and Golovchenko, (2015) |
| 10 | Nanocorr | Error correction/de novo assembly/de novo mutation or SNPs detection | Goodwin et al. (2015) |
| 11 | Nanocall | Variant calling | David et al. (2017) |
| 12 | DeepNano | Base caller | Boža et al. (2017) |
| 13 | Nanopolish | Enhances the base quality | Loman et al. (2015) |
Software used for the analysis of third-generation sequencing data.
Limitations: Although third-generation sequencing technology is fast and provides real-time results, NGS is still preferred because its error rate is lower than that of third-generation sequencing, which is ∼15%. Because of this high error rate, the technology can miss SNPs/point mutations and is not best suited for mutational analyses. The methodology requires improvement. Moreover, more bioinformatic tools and algorithms need to be developed for downstream data analysis, which remains a challenge for researchers for the time being (Ozsolak, 2012).
Transcriptomics
Compared with other methods, cDNA sequencing, or RNA-seq, allows more accurate mapping of reads and quantification at the transcript level. Examples of high-throughput applications include differential expression analysis, identification of isoforms arising from mRNA splicing, NGS of small non-coding RNAs, and the discovery and characterisation of novel transcripts (Marioni et al., 2008; Wang, 2009; Montgomery et al., 2010).
Small non-coding RNA NGS: The research community has shown a significant increase in interest in biomarkers, which aid in the prediction, early detection and prevention of disease. Biomarker research helps the scientific and clinical community significantly in improving clinical outcomes (Lopez et al., 2015). Non-coding RNAs (ncRNAs) have become a hotspot of biomarker research in the field of disease identification and treatment. MicroRNAs (miRNAs) are the type of ncRNA most widely explored for a potential biomarker role (Lopez et al., 2015). To date, ncRNA studies have been performed mainly by qRT-PCR, in situ hybridization, or microarray techniques. NGS has opened a new way to detect and analyze the RNA molecules present in biological samples. NGS offers several methodological advantages over other technologies, such as increased throughput, decreased RNA input, good consistency and quality of data, higher detection depth, analysis of all RNA populations, and discovery of novel molecules (Liu et al., 2021). A typical RNA-sequencing experiment consists of the following steps:
Together, these possibilities have allowed us to learn more about the genome’s organisation, the molecular constituents of cells and tissues, and the complexities of regulatory systems (Zhou et al., 2010; Sims et al., 2014). Many investigations, both fundamental and applied, have focused on mRNA splicing. Splicing occurs in every eukaryotic cell, between transcription and translation. Pre-mRNA transcripts may be spliced differently depending on the tissue and/or developmental stage, allowing multiple transcripts, and hence distinct proteins, to be generated from the same gene (Burge et al., 1999; Nilsen, 2003). The divergence of splice site sequences from the consensus has been linked to the generation of alternative transcripts. Furthermore, such highly degenerate motifs are found in most introns of higher eukaryotes. Pseudo-exons are intronic sequences of typical exon size that outnumber real exons and are flanked by sequences that fit the requirements of 5′ and 3′ splicing signals, but are never recognized as proper exons by the spliceosome. To distinguish true exons and splice sites from pseudo-exons, the splicing machinery must rely on auxiliary sequence features such as intronic and exonic cis-elements (Jacob and Gallinaro, 1989).
Exonic Splicing Enhancers (ESEs) are the most extensively studied of these elements. They are short nucleotide sequences that are primarily targeted by Serine/Arginine-rich (SR) proteins, which then help to define exons (Blencowe, 2000). Exonic Splicing Silencers (ESSs), on the other hand, help the spliceosome to ignore pseudo-exons and decoy splice sites. They serve as binding sites for proteins that promote exon exclusion (mostly hnRNP proteins) (Zhu et al., 2001). Several bioinformatics approaches have been created and are now available to examine or predict splice signals (Zhang et al., 2005). One of the most important of these tools is HSF (Human Splicing Finder). HSF was built with the 4D package (4D S.A.), which handles data management, algorithm design and the online interface. The HSF database was created from all human genes containing introns and exons, using an Ensembl dataset of about 22 000 genes and 46 000 transcripts from Homo sapiens. Because the matrices and methods were specifically built for the human genome, the HSF database exclusively contains human genes (Flicek et al., 2008). HSF also includes data taken from the Ensembl Variation Database (EVD), which can be used to investigate the impact of SNPs on splicing. A Perl script using the Ensembl Perl API allows HSF to access the EVD directly and retrieve SNPs in human genes.
On the other hand, ESEs can be disrupted by nonsense, missense, and even translationally silent mutations, causing the splicing machinery to skip the mutant exon, with significant consequences for gene structure. The frequency of mutations whose major consequence is aberrant splicing has been significantly underestimated, because the effects of mutations are most often predicted purely from genomic sequence information (Cartegni et al., 2002). ESEs are found in both alternative and constitutive exons, where they serve as binding sites for Ser/Arg-rich proteins (SR proteins), a family of conserved splicing factors involved in a variety of splicing stages (Graveley, 2000). SR proteins bind through their RNA-binding domain and promote exon definition by recruiting spliceosomal components via protein–protein interactions mediated by their RS domain and/or by antagonizing the function of neighbouring splicing silencers. Multiple categories of ESE consensus motifs have been described, and different SR proteins have varying substrate specificities (Graveley, 2000; Cartegni et al., 2002; Fairbrother et al., 2002). ESEfinder searches query sequences for potential ESEs using weight matrices for four different human SR proteins. The matrices are based on frequency values derived from the alignment of winning sequences obtained in functional SELEX studies, corrected for the background nucleotide frequency of the initial SELEX library, which was created by chemical synthesis (Liu et al., 1998; Liu et al., 2000). Query sequences can be entered directly into the input box or submitted as a text file. Multiple sequences can be processed at the same time if each is preceded by a FASTA-format description line (starting with ‘>’). Although ESEfinder is a tool for RNA analysis, it only accepts standard DNA nomenclature (A, C, G, and T, not U).
Any character other than the letters A, C, G, and T, including spaces and paragraph breaks, is ignored by the programme. Both upper and lower case are accepted, but the output lines are written in upper case. The user can choose from one to four matrices to be used at the same time. The result for each matrix is a series of scores computed in 1-nt increments along the sequence. Only the ‘hits’ or ‘high-score motifs’ are displayed in the initial output window (Figure 3), which lists the position of the first nucleotide, the matching motif sequence, and the calculated score. A score is deemed a high score when it exceeds the threshold value set on the input page.
FIGURE 3
By choosing the ‘custom’ button and entering the required value into the box, any score can be used as the cutoff threshold. As a result, ESEfinder may be used to identify potential ESEs, and the prime application is the accurate interpretation of the impact of disease-associated variants. It has been previously demonstrated that ESEs predicted by this matrix-based method cluster in places where natural enhancers have been empirically localized and are more common in exons than in introns (Cartegni et al., 2003).
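The matrix-based scan described above can be sketched as follows. This is a minimal illustration of weight-matrix scoring with a sliding window and a user-defined threshold, not the ESEfinder implementation: the three-position `TOY_MATRIX` and its values are invented for the example (the real SR protein matrices are derived from SELEX data), positions are assumed to be reported 1-based, and the function names are hypothetical.

```python
# Invented 3-position weight matrix: one {base: weight} column per motif position.
TOY_MATRIX = [
    {"A": 1.0, "C": -1.0, "G": 0.5, "T": -0.5},
    {"A": -1.0, "C": 1.5, "G": 0.0, "T": -0.5},
    {"A": 0.5, "C": -0.5, "G": 1.0, "T": -1.0},
]


def clean_sequence(raw):
    """Upper-case the input and drop every character other than A, C, G, T."""
    return "".join(base for base in raw.upper() if base in "ACGT")


def find_hits(raw_sequence, matrix, threshold):
    """Score every window in 1-nt increments and return those above threshold.

    Each hit is (position of first nucleotide, motif sequence, score).
    """
    seq = clean_sequence(raw_sequence)
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(column[base] for column, base in zip(matrix, window))
        if score > threshold:
            hits.append((start + 1, window, score))
    return hits


# Lower case, spaces, line breaks and 'u' are ignored, as for the query input.
for position, motif, score in find_hits("acg u ACG\n", TOY_MATRIX, threshold=2.0):
    print(position, motif, score)
```

Raising or lowering `threshold` plays the role of the custom cut-off: a lower value reports more candidate motifs at the cost of more false positives.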
CircRNAs: In contrast to messenger RNAs, circular RNAs (circRNAs) are physiologically active nucleic acid molecules that form covalently closed loops and lack polyadenylated tails. CircRNAs are classified as non-coding RNA (ncRNA), yet some circRNAs can code for proteins. CircRNAs were first identified in plant viroids in the 1970s and then in the cytoplasm of eukaryotic cells in the 1980s. Because linear RNAs predominate, early progress in this field was modest, and circRNAs were thought to be a by-product of RNA splicing. Recent advances in next-generation sequencing and related bioinformatics technologies, however, have sped up research, and circRNAs have now been found in humans, mice, nematodes, plants, and archaea (Chen et al., 2021). Various tools employed for the analysis of circRNAs are summarized in Table 3.
TABLE 3
| Tool name | TT | Installation Type | ATMR | PL | CV | Platform | Ref |
|---|---|---|---|---|---|---|---|
| CIRCexplorer | De novo; annotation | pip, Conda, Docker | STAR, BWA | Python | v2.3.8 | Unix/Linux | Zhang et al. (2014a) |
| CircPro | De novo; annotation | MID | BWA (CIRI2) | Perl | — | Unix/Linux | Meng et al. (2017) |
| MapSplice | De novo; annotation | Conda | Bowtie | Python | v2.2.1 | Unix/Linux | Wang et al. (2010b) |
| circRNA_finder | De novo | MID | STAR | Perl, AWK | v1.2 | Unix/Linux | (Westholm et al., 2014; Jia et al., 2019) |
| CircRNAFisher | De novo | MID | Bowtie2 | Perl | v0.1 | Unix/Linux | Westholm et al. (2014) |
| miARma | De novo | Docker, Virtual box image | BWA (CIRI) | Perl, Python, R | v1.7.5 | Unix/Linux, Windows | Andrés-León et al. (2016) |
| CIRI | De novo | MID | BWA | Perl | v2.0.6 | Unix/Linux | (Gao et al., 2015; Gao et al., 2018; Zheng et al., 2019) |
| ACFS | De novo | MID | BWA, BLAT | Perl | v2.0 | Unix/Linux | You and Conrad (2016) |
| CircDBG | Annotation | CR | k-mer (no aligner needed) | C++ | - | Unix/Linux | Li and Wu (2020) |
Bioinformatics software tools used in circRNA analysis.
Header abbreviations: TT, tool type; IT, installation type; CV, current version; Ref, reference; ATMR, aligner, tools, or method required; PL, programming language.
Proteomics
Understanding the molecular processes that mediate cellular physiology requires the identification, quantification, and characterization of a cell’s whole protein content (Schmidt et al., 2014; Jensen et al., 2006). Rapid advances in proteomics have led researchers to organize the study of protein structure, function, relationships, and dynamics in space and time. The groundbreaking revelation that DNA contains all of the genetic instructions required to build an organism gave rise to molecular biology’s central dogma, which describes a one-way flow of information from DNA to RNA to proteins. This simple view has been challenged by more recent discoveries. Epigenetic marks, alternative splicing, non-coding RNAs (including microRNAs), protein–protein interaction (PPI) networks, and post-translational modifications (PTMs) are only a few examples showing that phenotype is not determined solely by the information in the genome (Nagaraj et al., 2011; Beck et al., 2011; Baker, 2012). Proteomics is the global study of proteins, which are the key functional entities in the cell. This analysis is arguably the most important level of information required to understand how cells work. Compared with data collection at the genomic and transcriptomic levels, proteomic data acquisition has proven difficult. Global protein analysis is a difficult analytical task, in part because amino acids, the building blocks of proteins, have such a wide range of physicochemical properties. Furthermore, in comparison to the genome, the proteome is made more complex by alternative splicing and by a wide range of protein modifications and degradation, and this complexity is heightened by the interconnectivity of proteins into complexes and signaling networks that are highly divergent in time and space (Figure 4; Cox and Mann, 2011).
A decade ago, sequencing and identifying a single protein was a major challenge; today’s high-throughput technology, however, allows for the identification and quantification of essentially all expressed proteins in a single experiment. Similarly, 10 years ago, MS-based phosphoproteomics could only identify a few hundred phosphosites, whereas currently more than 30,000 phosphosites can be quantitatively monitored. This current approach is referred to as “next-generation proteomics” to reflect its ability to characterize practically the whole proteome as a result of advances in technology. Proteomics technologies, particularly MS-based protein identification, have advanced tremendously in recent years as a result of cumulative technological breakthroughs in instrumentation, sample preparation and computational analysis (Ficarro et al., 2002; Lemeer and Heck, 2009; Lundby et al., 2012).
FIGURE 4
Proteomics using mass spectrometry (MS) generates a large quantity of information about the expression, post-translational modifications (PTMs), and interactions of thousands of proteins. The data obtained must be supplied to the scientific community in a format that is suitable and curated, as well as retrievable and interpretable. Proteomics data should be made freely available to the public, which also helps ensure that quality standards are maintained in the field. The long-term storage of unprocessed raw data is a first level of distribution for proteomics data. Understanding the proteome’s complex and dynamic interactions necessitates the creation of physical interaction maps.
Proteins frequently interact with one another in stable or transient multi-protein complexes of varying composition, with the human interactome containing an estimated 130,000 binary interactions, the majority of which have yet to be mapped. Proteins can also interact with other molecules such as RNA, DNA and metabolites. These complexes play crucial roles in regulatory processes, signalling cascades, and cellular functions, and a failure of the proteins to interact can result in loss of function (Altelaar et al., 2012; Ma and Johnson, 2012). Tranche is one of the few public repositories that can currently manage raw data of this kind, and it is based on an encrypted peer-to-peer system that stores data on numerous servers across the world. Raw data, however, are in closed formats, which makes them difficult to share. As a result, attempts are being made to standardise formats that preserve all necessary information (Smith et al., 2011). The European Bioinformatics Institute’s PRIDE database exemplifies this effort, as it enables the storage of both conventional MS data formats (XML) and the associated peptide and protein identifications. Furthermore, including additional data (such as species, fragmentation procedures, and proteases) allows for a global meta-analysis of proteomic data sets (Perez-Riverol et al., 2019).
Protein sequence alignment compares two or more sequences and aids in the identification of homologous regions, revealing the evolutionary and structural relationships among sequences. It plays a crucial role in bioinformatics and supports the querying and construction of databases and the prediction of a protein’s primary, secondary and tertiary structure and biological function. Many tools have been developed for sequence alignment and sequence analysis, including PROSITE, Pfam, BLAST, FASTA, Clustal Omega, T-Coffee, MUSCA, ALIGN, DIALIGN, ProbCons, HMMER3 phmmer and many more (Pruess and Apweiler, 2003; Sievers et al., 2011; Singh et al., 2016a).
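To make the idea of pairwise alignment concrete, the following is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch scheme). It is a generic textbook technique rather than the method of any tool listed above, and the match, mismatch, and gap scores are arbitrary assumptions; real protein aligners use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global pairwise alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell is the best of diagonal, up, or left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

The aligned strings have equal length, with `-` marking gaps, so homologous positions can be read column by column.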
Primary-structure analysis can be performed with the ProtParam tool from ExPASy (Expert Protein Analysis System) (Gasteiger et al., 2005), which computes the physicochemical properties of a given protein from its sequence. The parameters that can be computed include molecular weight, amino acid and atomic composition, isoelectric point, estimated half-life, grand average of hydropathicity (GRAVY) and more. Many tools have been developed to predict secondary structure, including the Chou-Fasman algorithm (a statistical approach based on the propensity of each residue to form an α-helix or β-strand), GOR, and Jpred. Similarly, for tertiary protein structure prediction, PHYRE2 (Protein Homology/analogY Recognition Engine) (Kelley et al., 2015) and I-TASSER are available (Yang et al., 2015).
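Two of the sequence-derived parameters mentioned above, amino acid composition and GRAVY, are simple enough to sketch directly. The example below is an illustrative re-implementation, not ProtParam's own code; it assumes the Kyte-Doolittle hydropathy scale, and the function names and the example peptide are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def validate(seq):
    """Upper-case the sequence and reject empty or non-standard input."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return seq


def composition(seq):
    """Return the fraction of each amino acid present in the sequence."""
    seq = validate(seq)
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    seq = validate(seq)
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    peptide = "MKTAYIAKQR"  # hypothetical example peptide
    print(composition(peptide))
    print(gravy(peptide))
```

A positive GRAVY value indicates a predominantly hydrophobic sequence, and a negative value a predominantly hydrophilic one.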
Apart from the above-mentioned software suites, other tools are helpful for protein analysis; some of them are listed in Table 4.
TABLE 4
| S.No | Software | Description | Ref |
|---|---|---|---|
| 1 | Expasy | A molecular server dedicated to protein and nucleic acid sequence analysis | Gasteiger et al. (2003) |
| 2 | Frame plot | Protein-coding region prediction in bacterial DNA | Ishikawa and Hotta (1999) |
| 3 | MPEx | Membrane Protein Explorer (MPEx) is a tool that uses hydropathy plots based on thermodynamic principles to explore the topology and other properties of membrane proteins | Snider et al. (2009) |
| 4 | Predict Protein | Predict Protein is an online service that analyses protein sequences and predicts their structure and function. Predict Protein offers numerous sequence alignments, PROSITE sequence motifs, low-complexity regions (SEG), nuclear localization signals, regions lacking regular structure (NORS), and secondary structure predictions after users submit protein sequences or alignments | Bernhofer et al. (2021) |
| 5 | ProDom | ProDom is a database of protein domain families built by grouping homologous regions. The ProDom construction procedure, MKDOM2, relies on recursive PSI-BLAST searches. Non-fragmentary protein sequences from the SWISS-PROT and TrEMBL databases were used as the starting point | Bru et al. (2005) |
| 6 | ProtScale | ProtScale computes and visualises the profile generated by any amino acid scale on a given protein. An amino acid scale assigns a numerical value to each type of amino acid | Gasteiger et al. (2005) |
| 7 | Sequence Manipulation Suite (SMS) | The Sequence Manipulation Suite is a set of JavaScript tools for generating, formatting, and analysing short DNA and protein sequences in BioSyn’s Gizmo Tools | Stothard (2000) |
| 8 | Worldwide Protein Data Bank (wwPDB) | The wwPDB hosts a single Protein Data Bank Archive of macromolecular structural data that is freely and openly accessible to the entire world | Berman et al. (2007) |
Protein sequence analysis tools.
To study post-translational modifications, tools such as GlycoMod (Cooper et al., 2001), NetPhos (Trost and Kusalik, 2011), NetPicoRNA (Smits et al., 2013), FindMod (Gasteiger et al., 2003), ScanProsite (De Castro et al., 2006) and others are available online. For protein interaction analyses, STRING can be used (Szklarczyk et al., 2021). To visualize the 3-D structure of proteins, tools such as PyMOL and Jmol can be used. PyMOL is also used to visualize protein-ligand docking, predicted binding sites, protein interactions and more (DeLano, 2002; Herráez, 2006).
The identification of protein biomarkers with prognostic or diagnostic significance is currently one of the most difficult applications of proteomics (Figure 5).
FIGURE 5
As previously mentioned, recent technical advancements have resulted in the development of comprehensive pipelines that incorporate the discovery and validation phases, allowing plasma biomarkers to be identified for many diseases (Addona et al., 2011; Whiteaker et al., 2011). Despite the introduction of some successful biomarkers for clinical application, many (if not most) claimed biomarkers have weak reliability or lack rigorous confirmation, leading to scepticism among clinicians. The primary flaws in many biomarker studies are the lack of proper controls in the discovery phase, the failure to use appropriate statistical tools for biomarker definition, and the absence of independent validation steps in large patient cohorts to certify the legitimacy of the biomarker unambiguously; such flaws lead to claimed biomarkers that are rarely directly related to disease biology (Poste, 2011).
Metabolomics: Beacon for the 21st Century
After genomics, transcriptomics and proteomics, metabolomics is the newest of the “omics” sciences, combining high-throughput analytical techniques with bioinformatics. It is concerned with the quantitative and qualitative evaluation of metabolites, which are key metabolic intermediates and end products (Zhang et al., 2014b). The purpose of this approach is not only to determine which pathological processes or disturbances are at the root of a specific disease entity, but also to anticipate how those conditions will respond to treatment interventions. Metabolomic analysis helps discriminate between normal and abnormal pathways, which aids in disease diagnosis and prognosis prediction (Zhang et al., 2015). A noteworthy benefit of the metabolome over the genome is its potential to reflect environmental effects and to provide a snapshot of the individual’s pathophysiological status at a certain point in time (Shah et al., 2015; Zhang et al., 2015). The prime concern of researchers and clinicians is to better understand the disrupted biochemical and pathological processes and to inform the creation of more effective therapeutic medicines for the treatment of those disease states in humans. Metabolomic tools have the benefits of being quick, inexpensive, and sensitive. Metabolomics can be studied using a variety of techniques, including mass spectrometry (MS), nuclear magnetic resonance (NMR) spectroscopy, and Fourier-transform infrared (FTIR) spectroscopy. Metabolomic fingerprinting, metabolic profiling, metabolic footprinting, target analysis, and flux analysis are examples of approaches that all play important roles in understanding toxicological mechanisms and disease processes in living organisms (Tripathi et al., 2013; Zhang et al., 2013; Zhang et al., 2014b).
Metabolomics is also critical in discovering new drugs and biomarkers for the early diagnosis of diseases such as rheumatoid arthritis or osteoarthritis (Carlson et al., 2018; Takahashi et al., 2019; Dudka et al., 2021), osteoporosis, cardiovascular disease, and Alzheimer’s disease (AD); in cancer prognosis, diagnosis, and treatment (Pushkarev et al., 2009; McCarthy, 2010; Thompson and Steinmann, 2010; Kircher et al., 2011; Quail et al., 2012; Zheng et al., 2016; Zheng et al., 2017; Merker et al., 2018; Pereira et al., 2020); in inborn errors of metabolism (IEM); and in a variety of other applications (Carlson et al., 2018).
Pharmacogenomics/Pharmacogenetics: in-Silico Approach
Pharmacogenomics is described as the study of how genes affect an individual’s response to medications. It is an emerging discipline that combines pharmacology (the branch of science that studies drugs) with genomics (the branch of science that studies genes) to develop effective doses and safe pharmaceuticals tailored to an individual patient’s genetic makeup. One of the most important programs through which researchers are learning about genetic relationships and their impact on the body’s reaction to drugs is the Human Genome Project. Differences in genetic makeup influence pharmaceutical effectiveness, making it possible to anticipate how effective a medication will be for an individual and to investigate the likelihood of adverse drug reactions (Caldwell et al., 2007).
Because of the wide range of individual responses to drug therapy, predicting the degree of effectiveness of a medication for a certain patient is difficult. Along with clinical factors, pharmacological factors such as variations in metabolism, drug distribution, and drug-targeted proteins play a significant role (Wattanachai et al., 2017). Table 5 describes various software resources used in pharmacogenomics.
TABLE 5
| S.No | Software name | Software Description | Ref |
|---|---|---|---|
| 1 | Pharmacogenomics Knowledge Base (PharmGKB) | A comprehensive resource that compiles information on the impact of genetic variation on drug response, such as dosing guidelines, drug labels, gene-drug connections, and genotype-phenotype links | Thorn et al. (2013) |
| 2 | The Drug Gene Interaction Database | DGIdb is a database and web interface for identifying drug-gene interactions, both known and unknown | Freshour et al. (2020) |
| 3 | Side Effect Resource (SIDER 2) | It covers data on marketed drugs and any adverse medication reactions that have been reported. Public documents and package inserts were used to gather the data. Side effect frequency, drug and side effect categories, and connections to additional information, such as drug–target relationships, are all included in the available data | Kuhn et al. (2016) |
| 4 | DrugBank | DrugBank Online is a comprehensive, free-to-use online database of drug and drug target information | Wishart et al. (2018) |
| 5 | Search Tool for Interactions of Chemicals (STITCH) | It uses data from the scientific literature and new research findings to describe chemical interactions with genes and proteins, as well as links between diseases and chemicals and between diseases and genes/proteins in humans | Kuhn et al. (2008) |
| 6 | Genomics of Drug Sensitivity in Cancer | The database contains data on the link between tumour cell genomes and anti-cancer drug sensitivity. The sensitivity patterns of human cancer cell lines to a wide range of anti-cancer treatments were compared to genomic and expression data in order to find genetic factors that are predictive of sensitivity | Yang et al. (2013) |
In silico resources used in pharmacogenomics.
The bold values are the names of software/tools.
Epigenomics—complex diseases: An enigma
Understanding the causes and mechanisms of complex non-Mendelian diseases remains a major concern, despite substantial effort. Although various molecular genetic linkage and association studies have been carried out to explain the heritable predisposition to complex disorders, the results are often inconclusive and even contentious. Similarly, determining the environmental factors that cause a disorder is difficult (Singh Nanda et al., 2016). Emphasis has therefore shifted to epigenetic misregulation as a primary etiopathogenic element, which offers a novel interpretation of the “genes plus environment” paradigm.
Various non-Mendelian irregularities of complex diseases, such as the presence of clinically indistinguishable sporadic and familial cases, sexual dimorphism, relatively late age of onset and peaks of susceptibility to some diseases, discordance of monozygotic twins, and major fluctuations in the course of disease severity, are consistent with epigenetic mechanisms. It has also been claimed that stochastic epigenetic processes in the cell may account for a significant percentage of the phenotypic diversity formerly attributed to environmental factors. It is proposed that using epigenetic strategies in conjunction with traditional genetic strategies can greatly speed up the discovery of etiopathogenic processes in complex disorders (Lacal and Ventura, 2018). Epigenetic microarray technologies and in silico approaches, such as those listed in Table 6, will considerably enhance epigenetic investigations of complex disorders.
TABLE 6
| S.No | Software name | Software Description | Ref |
|---|---|---|---|
| 1 | DMRichR | An R package and executable for the statistical analysis and visualization of differentially methylated regions (DMRs) from CpG count matrices (Bismark genome-wide cytosine reports). It primarily employs the dmrseq and bsseq algorithms for upstream pre-processing, downstream analysis, and data display | Laufer et al. (2020) |
| 2 | CpG_Me | A whole genome bisulfite sequencing (WGBS) pipeline for DNA methylation alignment and quality control that starts with raw reads (FastQ) and ends with a CpG count matrix (Bismark genome-wide cytosine reports) | Laufer et al. (2022) |
| 3 | RnBeads | A Bioconductor (R) package for comprehensive analysis of DNA methylation data from Illumina Infinium arrays (450 K and EPIC) and BS-seq. MeDIP-seq and MBD-seq are also supported after some external processing | Müller et al. (2019) |
| 4 | MEDIPS | A Bioconductor (R) package for MeDIP (methylated DNA immunoprecipitation) and sequencing research (MeDIP-seq) | Lienhard et al. (2014) |
| 5 | minfi | A Bioconductor (R) package for Illumina Infinium arrays (450 K and EPIC) that enables complete analysis and takes cellular heterogeneity into account | Aryee et al. (2014) |
| 6 | DMRcate | A Bioconductor (R) package for the identification of DMR from the human genome using WGBS and Illumina Infinium array (450 K and EPIC) data | Peters et al. (2015) |
| 7 | FEM | Integrative analysis of DNA methylation and gene expression data | Gentleman et al. (2004) |
| 8 | coMET | Visualization of Epigenome-Wide Association Study (EWAS) results from a genomic region | Martin (2014) |
In silico approaches used in epigenomics.
The bold values are the names of software/tools.
Pathway/Enrichment Analysis framework: omics Data
Comprehensive quantification of DNA, RNA, and protein in biological materials is now commonplace. The generated data are rapidly accumulating, and their analysis aids researchers in discovering new biological functions, genotype–phenotype correlations, and disease causes (Lander, 2011; Stephens et al., 2015). Many researchers, however, find that analysing and interpreting these data is a major challenge. Long lists of genes often emerge from analyses, requiring an impractically large amount of manual literature research to interpret.
Scientists can use pathway enrichment analysis to acquire mechanistic insight into gene lists generated by genome-scale (omics) investigations. This approach finds biological pathways that are more enriched in a gene list than is expected by chance (Nguyen et al., 2019). Pathway enrichment analysis methodologies provide step-by-step guidance for interpreting gene lists generated by RNA-seq and genome-sequencing research. The approach proceeds in three stages: defining a gene list using omics data, determining statistically enriched pathways, and visualizing and interpreting the results. This technique can be applied to expressed genes and to cancer genes that have been altered; however, the idea can be extended to a wide range of omics data (Paczkowska et al., 2020). Various enrichment tools exist; a few of them are summarized in Table 7.
TABLE 7
| S.No | Software name | Software Description | Ref |
|---|---|---|---|
| 1 | Singular enrichment analysis (SEA) | The enrichment P-value for each term from the pre-selected interesting gene list is calculated. The enriched terms are then listed in a simple linear text format. This is the most traditional algorithm, and the majority of enrichment analysis tools still rely on it | Huang et al. (2009) |
| 2 | Gene set enrichment analysis (GSEA) | The enrichment analysis takes into account all genes (without pre-selection) and their related experimental values. The distinguishing characteristics of this strategy are: (i) unlike the other classes, there is no requirement to pre-select interesting genes; (ii) experimental values are integrated into the P-value computation | Subramanian et al. (2005) |
| 3 | Modular enrichment analysis (MEA) | This approach carries on the spirit of SEA. The term–term/gene–gene associations, however, are taken into account when calculating the enrichment P-value. The benefit of this technique is that the term–term/gene–gene interaction may contain biological meaning that is not captured by a single term or gene. This type of network/modular analysis is more in line with the structure of biological data | Tabas-Madrid et al. (2012) |
Classes of enrichment analysis tools.
The bold values are the names of software/tools.
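The enrichment P-value that singular enrichment analysis computes for a pre-selected gene list can be illustrated with a one-sided hypergeometric test, one common choice for this kind of over-representation analysis. The sketch below is a generic illustration under that assumption and does not reproduce any specific tool in Table 7; the function name, its parameters, and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, pathway_size, list_size, overlap):
    """One-sided hypergeometric P-value: P(X >= overlap).

    population   -- number of genes in the background (e.g. all annotated genes)
    pathway_size -- number of background genes annotated to the pathway
    list_size    -- number of genes in the pre-selected gene list
    overlap      -- number of list genes that belong to the pathway
    """
    upper = min(pathway_size, list_size)
    total = comb(population, list_size)
    # Sum the probabilities of observing `overlap` or more pathway genes
    # when `list_size` genes are drawn at random without replacement.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 background genes, a 100-gene pathway,
    # a 200-gene list, and 10 genes shared between list and pathway.
    print(enrichment_p_value(20000, 100, 200, 10))
```

In practice one such P-value is computed per pathway, so a multiple-testing correction is normally applied afterwards; that step is omitted here for brevity.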
Single-Cell Genomics “Cancer Research/Pan-Cancer Biomarkers”
Single-cell sequencing refers to the sequencing of a single-cell genome or transcriptome in order to gather genomic, transcriptomic, or other multi-omics information that can be used to reveal differences between cell populations and evolutionary relationships between cells, as in many cancers. Traditional sequencing methods can only obtain an average over many cells, making it impossible to study a small number of cells and resulting in the loss of cellular heterogeneity data (Wen and Tang, 2018).
Compared with classical sequencing technology, single-cell methods have the advantages of detecting variability among individual cells, differentiating a small number of cells, and outlining cell maps (Pennisi, 2012).
Multimodal analysis, the ability to measure several data types simultaneously from the same cell, is an exciting new frontier in single-cell genomics that needs novel computational methods to describe biological states based on numerous data sources. Weighted nearest neighbor (WNN) analysis, an unsupervised technique for learning the information content of each modality in each cell and defining cellular state based on a weighted combination of both modalities, was introduced in Seurat v4. Recent research has used WNN to create a multimodal reference of human PBMCs from a CITE-seq dataset with matched transcriptome and 228 surface protein measurements. WNN can be used to analyse multimodal data from several technologies, such as CITE-seq, ASAP-seq, 10X Genomics ATAC + RNA, and SHARE-seq (Ensslin, 2008) (Tables 8, 9).
TABLE 8
| S.No | Level of Analysis | Description | Method of Analysis |
|---|---|---|---|
| 1 | Genome | Complete set of genes of an organism or its organelles | WGS, WES, DNA microarray |
| 2 | Transcriptome | Complete set of messenger RNA molecules present in a cell, tissue or organ | RNA-sequencing; expression microarray; spatially resolved transcriptomics |
| 3 | Proteome | Complete set of protein molecules present in a cell, tissue or organ | Peptide/protein microarrays (RPPA); mass spectrometry; imaging mass cytometry |
| 4 | Metabolome | Complete set of metabolites (low-molecular-weight intermediates) in a cell, tissue or organ | Nuclear magnetic resonance spectroscopy; mass spectrometry; infrared spectroscopy |
| 5 | Methylome | Complete set of methylation sites within a genome | Bisulfite-Sequencing, ChIP-Seq |
| 6 | Microbiome | Complete set of genes of all microbes (bacteria, fungi, protozoa and viruses) in a cell, tissue or organ | DNA sequencing; 16S rRNA sequencing |
| 7 | Lipidome | Complete set of all biomolecules defined as lipids | Mass Spectrometry |
Different omics levels of gene-function relationship.
WGS, Whole-genome Sequencing; WES, Whole-exome sequencing; ChIP, chromatin immunoprecipitation.
TABLE 9
| S.No | Tool name | Description | Ref |
|---|---|---|---|
| 1 | SCI-seq | Construction of single-cell libraries and detection of cell copy number variation | Vitak et al. (2017) |
| 2 | LIANTI | Finding the copy number variation and disease-related mutation | Brierley et al. (2002) |
| 3 | scCOOL-seq | Uncovering of chromatin status/nucleosome localization, DNA methylation, copy number variation and ploidy | Guo et al. (2017) |
| 4 | Microwell-seq | Enhances the detection abundance of single cell sequencing technology | Han et al. (2018) |
| 5 | SPLiT-seq | Single-cell transcriptome sequencing | Rosenberg et al. (2018) |
| 6 | Single-Nucleus RNA-Seq + DroNc-Seq | A variety of cells can be accurately analyzed. It may be used in the Human Cell Atlas Project in the future | Habib et al. (2017) |
Single-cell sequencing technologies.
Deep Learning in Genomics
Because genomics generates large amounts of data, many bioinformatics algorithms use machine learning and, more recently, deep learning to discover patterns, make predictions, and model disease progression or treatment. Deep learning (DL) advances have sparked a surge of interest in biomedical informatics, spawning new bioinformatics and computational biology research areas. Deep learning models are anticipated to deliver higher accuracy in specific genomics tasks than current state-of-the-art methods. Given the growing trend of using deep learning architectures in genomics research, deep learning is expected to accelerate progress in genomics. Deep learning is a type of AI technique that is used to process vast and complicated genomic datasets in particular fields, such as clinical genomics (Koumakis, 2020). Various deep learning architectures have been designed to date, including Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
Artificial Neural Networks (ANN): The neurons and networks that make up the human brain served as the inspiration for Artificial Neural Networks (ANN). An ANN is made up of a set of fully connected nodes (neurons) that simulate the transmission of stimuli across brain synapses, with each node either firing or not. These DL architectures can be used for feature selection, classification, dimensionality reduction, or as a submodule of a more complex design such as a convolutional neural network (Zurada, 1992).
Convolutional Neural Networks (CNN): The Convolutional Neural Network (CNN) is a deep neural network architecture that is most typically used to analyse visual images. It was intended as a completely automated image analysis network for classifying handwritten characters. CNNs build on the multilayer perceptron approach, in which each node/neuron in one layer is (fully) connected to all nodes in the following layer; in the convolutional layers, however, each node is connected only to a local region of the input, and the same weights are shared across positions (LeCun et al., 1998).
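In genomics, the same convolution idea is usually applied along a one-dimensional sequence rather than a two-dimensional image. The sketch below shows a single convolutional filter sliding over a one-hot encoded DNA sequence, followed by a ReLU activation and global max pooling. It is a toy illustration in plain Python under stated assumptions: the filter weights and bias are hand-set to respond to the hypothetical motif TATA, whereas a real network would learn them from data.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element vectors (order A, C, G, T).

    Characters outside the alphabet (e.g. N) become all-zero vectors.
    """
    return [[1.0 if nt == base else 0.0 for base in ALPHABET] for nt in seq.upper()]


def conv1d(encoded, kernel, bias=0.0):
    """Slide one filter along the sequence and apply a ReLU activation."""
    width = len(kernel)
    feature_map = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(len(ALPHABET)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        feature_map.append(max(0.0, total))  # ReLU: negative responses become 0
    return feature_map


def global_max_pool(feature_map):
    """Keep only the strongest filter response anywhere in the sequence."""
    return max(feature_map)


# Hypothetical hand-set filter that responds to the motif TATA.
TATA_KERNEL = [
    [0.0, 0.0, 0.0, 1.0],  # T
    [1.0, 0.0, 0.0, 0.0],  # A
    [0.0, 0.0, 0.0, 1.0],  # T
    [1.0, 0.0, 0.0, 0.0],  # A
]

if __name__ == "__main__":
    responses = conv1d(one_hot("GGTATAGG"), TATA_KERNEL, bias=-3.0)
    print(responses)
    print(global_max_pool(responses))
```

Because the same small filter is reused at every position, the number of weights does not grow with sequence length, which is the weight sharing that distinguishes a convolutional layer from a fully connected one.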
Recurrent neural networks (RNN): Recurrent neural networks (RNN) are related to ordinary feedforward neural networks (FNN), but the connections between their nodes form a directed graph along a temporal sequence. This allows RNNs to exhibit temporal dynamic behavior and to maintain an internal memory. Recurrent networks can remember information from previously processed states thanks to their short-term memory, making them well suited to sequential signal processing and prediction models. The ability of RNNs to relate information from a previous step to the current task is one of their strengths (Williams and Zipser, 1989a). Table 10 lists various deep learning (AI) tools in genomics.
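The internal memory of a recurrent network can be made concrete with a minimal sketch of a simple (Elman-style) recurrent cell, in which the hidden state at each step depends on the current input and on the previous hidden state. The weights below are arbitrary, hypothetical values chosen for illustration; a trained network would learn them, and practical models typically use gated variants.

```python
import math


def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One recurrent update: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)."""
    h_new = []
    for j in range(len(h_prev)):
        total = bias[j]
        total += sum(w_in[j][i] * x[i] for i in range(len(x)))
        total += sum(w_rec[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, w_in, w_rec, bias):
    """Process a sequence of input vectors and return the final hidden state."""
    h = [0.0] * len(bias)  # the memory starts empty
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec, bias)
    return h


# Hypothetical weights for 2 input features and 2 hidden units.
W_IN = [[0.5, -0.5], [0.3, 0.8]]
W_REC = [[0.1, 0.0], [0.0, 0.1]]
BIAS = [0.0, 0.0]

if __name__ == "__main__":
    forward = run_rnn([[1.0, 0.0], [0.0, 1.0]], W_IN, W_REC, BIAS)
    backward = run_rnn([[0.0, 1.0], [1.0, 0.0]], W_IN, W_REC, BIAS)
    print(forward)
    print(backward)
```

The two runs see the same inputs in opposite order and end in different hidden states, which shows how the carried-over state lets the network take sequence order into account.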
TABLE 10
| S.No | Tools | Prediction | Ref |
|---|---|---|---|
| 1 | DeepTarget | Target prediction | Lee (2016) |
| 2 | DeepMirGene | miRNA target | Park (2016) |
| 3 | Deep Net | Case control pre-processing step for clustering. Prediction of transcriptomic machinery | (Gupta et al., 2015; Dombi et al., 2017) |
| 4 | D-GEX | Gene expression inference | Chen et al. (2016) |
| 5 | DeepChrome | Classification of gene expression | Singh et al. (2016b) |
| 6 | DeepFIGV | Prediction of quantitative epigenetic variation | Hoffman et al. (2019) |
| 7 | Deepathology | Prediction of tissue of origin, normal or disease state, and cancer type | Azarkhalili et al. (2019) |
| 8 | DeepCpG | Prediction of missing methylation states and detection of sequence motifs | Angermueller et al. (2017) |
| 9 | DanQ | Prediction of the function of DNA directly from sequence alone | Quang and Xie (2016) |
| 10 | FBGAN | Optimization of synthetic gene sequences | Gupta and Zou (2019) |
Deep learning tools in genomics.
The bold values are the names of software/tools.
Conclusion and Future Perspectives
The introduction of massively parallel sequencing has changed genetics and genomics research forever because of its widespread adoption and numerous applications, massively parallel sequencing is projected to play a vital role in the medical industry in the next years. It is worth noting that NGS as a research tool faces major challenges in terms of manufacturing, data management and downstream analysis.
➢ Over the past decade, rapid advances in high-throughput technologies, backed by lower costs, have opened up new ways of interrogating a biological system at several regulatory levels and have provided an unprecedented picture of it. Integrating genomic, proteomic, transcriptomic, metabolomic and epigenomic data with relevant information obtained at other levels nevertheless remains difficult.
➢ Nonetheless, new sequencing technologies addressing genomic, proteomic, transcriptomic, metabolomic and epigenomic data clearly have tremendous research potential; in the hands of researchers, their capabilities will surely speed our understanding of genomics, medical science and allied domains.
➢ Advances in data generation, analysis and the interpretation of results point to a bright future, and rapid progress in all fields of science has introduced novel analytical methodologies. As we continue to learn more about how the body functions, we should shift our focus from molecular to systemic and analytic techniques, which have the potential to revolutionize our understanding of how complex biological systems are regulated.
➢ Data integration, on the other hand, is not the end. Although the bioinformatics challenges posed by NGS are significant, a variety of software tools and algorithms have been created to aid data management, short-read alignment, and sequence variant identification. The high throughput of NGS necessitates the use of automated pipelines, which aid in the transition from novel sequencing technology
➢ This scenario emphasizes the need for scientists who are experts in a variety of fields, as well as the effectiveness of multidisciplinary research groups, in which complementary skills will allow for significant scientific advancements and contributions. Addressing system-wide biological questions requires integrated biology techniques. Routine integration, in turn, will require the maturation and alignment of various post-genome technologies, as well as cross-communication across scientific communities. The effective integration of all of these technologies will eventually lead to next-generation systems biology, which will provide valuable biological insights and promote the adoption of high-throughput research and publication.
Statements
Author contributions
DA and GRB conceived the concept. GRB, IS, and DA wrote the manuscript. BR and RK technically refined the manuscript. All authors approved the final manuscript.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1
Addona, T. A., Shi, X., Keshishian, H., Mani, D. R., Burgess, M., Gillette, M. A., et al. (2011). A Pipeline that Integrates the Discovery and Verification of Plasma Protein Biomarkers Reveals Candidate Markers for Cardiovascular Disease. Nat. Biotechnol. 29 (7), 635–643. doi:10.1038/nbt.1899
2
Adzhubei, I. A., Schmidt, S., Peshkin, L., Ramensky, V. E., Gerasimova, A., Bork, P., et al. (2010). A Method and Server for Predicting Damaging Missense Mutations. Nat. Methods 7 (4), 248–249. doi:10.1038/nmeth0410-248
3
Aird, D., Ross, M. G., Chen, W. S., Danielsson, M., Fennell, T., Russ, C., et al. (2011). Analyzing and Minimizing PCR Amplification Bias in Illumina Sequencing Libraries. Genome Biol. 12 (2), R18–R14. doi:10.1186/gb-2011-12-2-r18
4
Altelaar, A. F. M., Navarro, D., Boekhorst, J., van Breukelen, B., Snel, B., Mohammed, S., et al. (2012). Database Independent Proteomics Analysis of the Ostrich and Human Proteome. Proc. Natl. Acad. Sci. U.S.A. 109 (2), 407–412. doi:10.1073/pnas.1108399108
5
Altmann, A., Weber, P., Bader, D., Preuß, M., Binder, E. B., and Müller-Myhsok, B. (2012). A Beginners Guide to SNP Calling from High-Throughput DNA-Sequencing Data. Hum. Genet. 131 (10), 1541–1554. doi:10.1007/s00439-012-1213-z
6
Amaral, A., Reis, M., and Silva, F. (2007). O programa BLAST: guia prático de utilização. Lisboa, Portugal: EMBRAPA. Documentos, 224.
7
Ameur, A., Che, H., Martin, M., Bunikis, I., Dahlberg, J., Höijer, I., et al. (2018). De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data. Genes 9 (10), 486. doi:10.3390/genes9100486
8
Ameur, A., Kloosterman, W. P., and Hestand, M. S. (2019). Single-molecule Sequencing: towards Clinical Applications. Trends Biotechnology 37 (1), 72–85. doi:10.1016/j.tibtech.2018.07.013
9
Andrés-León, E., Núñez-Torres, R., and Rojas, A. M. (2016). miARma-Seq: a Comprehensive Tool for miRNA, mRNA and circRNA Analysis. Scientific Rep. 6 (1), 1–8.
10
Angermueller, C., Lee, H. J., Reik, W., and Stegle, O. (2017). Erratum to: DeepCpG: Accurate Prediction of Single-Cell DNA Methylation States Using Deep Learning. Genome Biol. 18 (1), 90–13. doi:10.1186/s13059-017-1233-z
11
Aryee, M. J., Jaffe, A. E., Corrada-Bravo, H., Ladd-Acosta, C., Feinberg, A. P., Hansen, K. D., et al. (2014). Minfi: a Flexible and Comprehensive Bioconductor Package for the Analysis of Infinium DNA Methylation Microarrays. Bioinformatics 30 (10), 1363–1369. doi:10.1093/bioinformatics/btu049
12
Azarkhalili, B., Saberi, A., Chitsaz, H., and Sharifi-Zarchi, A. (2019). DeePathology: Deep Multi-Task Learning for Inferring Molecular Pathology from Cancer Transcriptome. Sci. Rep. 9 (1), 16526–16614. doi:10.1038/s41598-019-52937-5
13
Baker, M. (2012). The Interaction Map. Nature 484 (7393), 271–275. doi:10.1038/484271a
14
Bao, L., Zhou, M., and Cui, Y. (2005). nsSNPAnalyzer: Identifying Disease-Associated Nonsynonymous Single Nucleotide Polymorphisms. Nucleic Acids Res. 33 (Suppl. 2), W480–W482. doi:10.1093/nar/gki372
15
Beck, M., Schmidt, A., Malmstroem, J., Claassen, M., Ori, A., Szymborska, A., et al. (2011). The Quantitative Proteome of a Human Cell Line. Mol. Syst. Biol. 7 (1), 549. doi:10.1038/msb.2011.82
16
Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M., and Phillippy, A. M. (2015). Assembling Large Genomes with Single-Molecule Sequencing and Locality-Sensitive Hashing. Nat. Biotechnol. 33 (6), 623–630. doi:10.1038/nbt.3238
17
Berman, H., Henrick, K., Nakamura, H., and Markley, J. L. (2007). The Worldwide Protein Data Bank (wwPDB): Ensuring a Single, Uniform Archive of PDB Data. Nucleic Acids Res. 35, D301–D303. doi:10.1093/nar/gkl971
18
Bernhofer, M., Dallago, C., Karl, T., Satagopam, V., Heinzinger, M., Littmann, M., et al. (2021). PredictProtein - Predicting Protein Structure and Function for 29 Years. Nucleic Acids Res. 49 (W1), W535–W540. doi:10.1093/nar/gkab354
19
Blencowe, B. J. (2000). Exonic Splicing Enhancers: Mechanism of Action, Diversity and Role in Human Genetic Diseases. Trends Biochemical Sciences 25 (3), 106–110. doi:10.1016/s0968-0004(00)01549-8
20
Boža, V., Brejová, B., and Vinař, T. (2017). DeepNano: Deep Recurrent Neural Networks for Base Calling in MinION Nanopore Reads. PloS one 12 (6), e0178751. doi:10.1371/journal.pone.0178751
21
Brierley, A. S., Fernandes, P. G., Brandon, M. A., Armstrong, F., Millard, N. W., McPhail, S. D., et al. (2002). Antarctic Krill under Sea Ice: Elevated Abundance in a Narrow Band Just South of Ice Edge. Science 295 (5561), 1890–1892. doi:10.1126/science.1068574
22
Bru, C., Courcelle, E., Carrère, S., Beausse, Y., Dalmar, S., and Kahn, D. (2005). The ProDom Database of Protein Domain Families: More Emphasis on 3D. Nucleic Acids Res. 33, D212–D215. doi:10.1093/nar/gki034
23
Burge, C. B., Tuschl, T., and Sharp, P. A. (1999). Splicing of Precursors to mRNAs by the Spliceosomes. Cold Spring Harbor Monogr. Ser. 37, 525–560.
24
Caldwell, M. D., Berg, R. L., Zhang, K. Q., Glurich, I., Schmelzer, J. R., Yale, S. H., et al. (2007). Evaluation of Genetic Factors for Warfarin Dose Prediction. Clin. Med. Res. 5 (1), 8–16. doi:10.3121/cmr.2007.724
25
Carlson, A. K., Rawle, R. A., Adams, E., Greenwood, M. C., Bothner, B., and June, R. K. (2018). Application of Global Metabolomic Profiling of Synovial Fluid for Osteoarthritis Biomarkers. Biochem. biophysical Res. Commun. 499 (2), 182–188. doi:10.1016/j.bbrc.2018.03.117
26
Cartegni, L., Chew, S. L., and Krainer, A. R. (2002). Listening to Silence and Understanding Nonsense: Exonic Mutations that Affect Splicing. Nat. Rev. Genet. 3 (4), 285–298. doi:10.1038/nrg775
27
Cartegni, L., Wang, J., Zhu, Z., Zhang, M. Q., and Krainer, A. R. (2003). ESEfinder: A Web Resource to Identify Exonic Splicing Enhancers. Nucleic Acids Res. 31 (13), 3568–3571. doi:10.1093/nar/gkg616
28
Chaisson, M. J., and Tesler, G. (2012). Mapping Single Molecule Sequencing Reads Using Basic Local Alignment with Successive Refinement (BLASR): Application and Theory. BMC bioinformatics 13 (1), 238–318. doi:10.1186/1471-2105-13-238
29
Chen, L., Wang, C., Sun, H., Wang, J., Liang, Y., Wang, Y., et al. (2021). The Bioinformatics Toolbox for circRNA Discovery and Analysis. Brief. Bioinformatics 22 (2), 1706–1728. doi:10.1093/bib/bbaa001
30
Chen, Y., Li, Y., Narayan, R., Subramanian, A., and Xie, X. (2016). Gene Expression Inference with Deep Learning. Bioinformatics 32 (12), 1832–1839. doi:10.1093/bioinformatics/btw074
31
Chin, C.-S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., et al. (2013). Nonhybrid, Finished Microbial Genome Assemblies from Long-Read SMRT Sequencing Data. Nat. Methods 10 (6), 563–569. doi:10.1038/nmeth.2474
32
Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., et al. (2012). A Program for Annotating and Predicting the Effects of Single Nucleotide Polymorphisms, SnpEff. Fly 6 (2), 80–92. doi:10.4161/fly.19695
33
International HapMap 3 Consortium, Altshuler, D. M., Gibbs, R. A., Peltonen, L., et al. (2010). Integrating Common and Rare Genetic Variation in Diverse Human Populations. Nature 467 (7311), 52–58. doi:10.1038/nature09298
34
Cooper, C. A., Gasteiger, E., and Packer, N. H. (2001). GlycoMod - A Software Tool for Determining Glycosylation Compositions from Mass Spectrometric Data. Proteomics 1 (2), 340–349. doi:10.1002/1615-9861(200102)1:2<340::aid-prot340>3.0.co;2-b
35
Cox, J., and Mann, M. (2011). Quantitative, High-Resolution Proteomics for Data-Driven Systems Biology. Annu. Rev. Biochem. 80, 273–299. doi:10.1146/annurev-biochem-061308-093216
36
David, M., Dursi, L. J., Yao, D., Boutros, P. C., and Simpson, J. T. (2017). Nanocall: an Open Source Basecaller for Oxford Nanopore Sequencing Data. Bioinformatics 33 (1), 49–55. doi:10.1093/bioinformatics/btw569
37
De Castro, E., Sigrist, C. J., Gattiker, A., Bulliard, V., Langendijk-Genevaux, P. S., Gasteiger, E., et al. (2006). ScanProsite: Detection of PROSITE Signature Matches and ProRule-Associated Functional and Structural Residues in Proteins. Nucleic Acids Res. 34 (Suppl. 2), W362–W365. doi:10.1093/nar/gkl124
38
Delaneau, O., Zagury, J. F., Robinson, M. R., Marchini, J. L., and Dermitzakis, E. T. (2019). Accurate, Scalable and Integrative Haplotype Estimation. Nat. Commun. 10 (1), 5436–5510. doi:10.1038/s41467-019-13225-y
39
DeLano, W. L. (2002). PyMOL.
40
Dombi, J., Jónás, T., and Tóth, Z. E. (2017). “A Pliant Arithmetic-Based Fuzzy Time Series Model,” in International Work-Conference on Artificial Neural Networks (Springer).
41
Dudka, I., Chachaj, A., Sebastian, A., Tański, W., Stenlund, H., Gröbner, G., et al. (2021). Metabolomic Profiling Reveals Plasma GlycA and GlycB as a Potential Biomarkers for Treatment Efficiency in Rheumatoid Arthritis. J. Pharm. Biomed. Anal. 197, 113971. doi:10.1016/j.jpba.2021.113971
42
Eilbeck, K., Quinlan, A., and Yandell, M. (2017). Settling the Score: Variant Prioritization and Mendelian Disease. Nat. Rev. Genet. 18 (10), 599–612. doi:10.1038/nrg.2017.52
43
English, A. C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., et al. (2012). Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PloS one 7 (11), e47768. doi:10.1371/journal.pone.0047768
44
Ensslin, A. (2008). Introduction to Multimodal Analysis by David Machin. Wiley Online Library.
45
Fairbrother, W. G., Yeh, R.-F., Sharp, P. A., and Burge, C. B. (2002). Predictive Identification of Exonic Splicing Enhancers in Human Genes. Science 297 (5583), 1007–1013. doi:10.1126/science.1073774
46
Ficarro, S. B., McCleland, M. L., Stukenberg, P. T., Burke, D. J., Ross, M. M., Shabanowitz, J., et al. (2002). Phosphoproteome Analysis by Mass Spectrometry and its Application to Saccharomyces cerevisiae. Nat. Biotechnol. 20 (3), 301–305. doi:10.1038/nbt0302-301
47
Flicek, P., and Birney, E. (2009). Sense from Sequence Reads: Methods for Alignment and Assembly. Nat. Methods 6 (11), S6–S12. doi:10.1038/nmeth.1376
48
Flicek, P., Aken, B. L., Beal, K., Ballester, B., Caccamo, M., Chen, Y., et al. (2008). Ensembl 2008. Nucleic Acids Res. 36 (Database issue), D707–D714. doi:10.1093/nar/gkm988
49
Freshour, S. L., Kiwala, S., Cotto, K. C., Coffman, A. C., McMichael, J. F., Song, J. J., et al. (2020). Integration of the Drug-Gene Interaction Database (DGIdb 4.0) with Open Crowdsource Efforts. Nucleic Acids Res. 49 (D1), D1144–D1151. doi:10.1093/nar/gkaa1084
50
Gao, Y., Wang, J., and Zhao, F. (2015). CIRI: an Efficient and Unbiased Algorithm for De Novo Circular RNA Identification. Genome Biol. 16 (1), 4–16. doi:10.1186/s13059-014-0571-3
51
Gao, Y., Zhang, J., and Zhao, F. (2018). Circular RNA Identification Based on Multiple Seed Matching. Brief. Bioinformatics 19 (5), 803–810. doi:10.1093/bib/bbx014
52
Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R. D., and Bairoch, A. (2003). ExPASy: The Proteomics Server for In-Depth Protein Knowledge and Analysis. Nucleic Acids Res. 31 (13), 3784–3788. doi:10.1093/nar/gkg563
53
Gasteiger, E., Hoogland, C., Gattiker, A., Duvaud, S. e., Wilkins, M. R., Appel, R. D., et al. (2005). “Protein Identification and Analysis Tools on the ExPASy Server,” in The Proteomics Protocols Handbook, 571–607. doi:10.1385/1-59259-890-0:571
54
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., et al. (2004). Bioconductor: Open Software Development for Computational Biology and Bioinformatics. Genome Biol. 5 (10), R80–R16. doi:10.1186/gb-2004-5-10-r80
55
González-Pérez, A., and López-Bigas, N. (2011). Improving the Assessment of the Outcome of Nonsynonymous SNVs with a Consensus Deleteriousness Score, Condel. Am. J. Hum. Genet. 88 (4), 440–449.
56
Goodwin, S., Gurtowski, J., Ethe-Sayers, S., Deshpande, P., Schatz, M. C., and McCombie, W. R. (2015). Oxford Nanopore Sequencing, Hybrid Error Correction, and De Novo Assembly of a Eukaryotic Genome. Genome Res. 25 (11), 1750–1756. doi:10.1101/gr.191395.115
57
Graveley, B. R. (2000). Sorting Out the Complexity of SR Protein Functions. RNA 6 (9), 1197–1211. doi:10.1017/s1355838200000960
58
Guo, F., Li, L., Li, J., Wu, X., Hu, B., Zhu, P., et al. (2017). Single-cell Multi-Omics Sequencing of Mouse Early Embryos and Embryonic Stem Cells. Cell Res. 27 (8), 967–988. doi:10.1038/cr.2017.82
59
Gupta, A., Wang, H., and Ganapathiraju, M. (2015). “Learning Structure in Gene Expression Data Using Deep Architectures, with an Application to Gene Clustering,” in IEEE international conference on bioinformatics and biomedicine (BIBM) (IEEE).
60
Gupta, A., and Zou, J. (2019). Feedback GAN for DNA Optimizes Protein Functions. Nat. Mach Intell. 1 (2), 105–111. doi:10.1038/s42256-019-0017-4
61
Habib, N., Avraham-Davidi, I., Basu, A., Burks, T., Shekhar, K., Hofree, M., et al. (2017). Massively Parallel Single-Nucleus RNA-Seq with DroNc-Seq. Nat. Methods 14 (10), 955–958. doi:10.1038/nmeth.4407
62
Han, X., Wang, R., Zhou, Y., Fei, L., Sun, H., Lai, S., et al. (2018). Mapping the Mouse Cell Atlas by Microwell-Seq. Cell 172 (5), 1091–1107. doi:10.1016/j.cell.2018.02.001
63
Harper, P. S. (2017). The European Society of Human Genetics: Beginnings, Early History and Development over its First 25 Years. United Kingdom: European Journal of Human Genetics, 1–8.
64
Herráez, A. (2006). Biomolecules in the Computer: Jmol to the Rescue. Biochem. Mol. Biol. Educ. 34 (4), 255–261. doi:10.1002/bmb.2006.494034042644
65
Hoffman, G. E., Bendl, J., Girdhar, K., Schadt, E. E., and Roussos, P. (2019). Functional Interpretation of Genetic Variants Using Deep Learning Predicts Impact on Chromatin Accessibility and Histone Modification. Nucleic Acids Res. 47 (20), 10597–10611. doi:10.1093/nar/gkz808
66
Huang, D. W., Sherman, B. T., and Lempicki, R. A. (2009). Bioinformatics Enrichment Tools: Paths toward the Comprehensive Functional Analysis of Large Gene Lists. Nucleic Acids Res. 37 (1), 1–13. doi:10.1093/nar/gkn923
67
Ishikawa, J., and Hotta, K. (1999). FramePlot: a New Implementation of the Frame Analysis for Predicting Protein-Coding Regions in Bacterial DNA with a High G+C Content. FEMS Microbiol. Lett. 174 (2), 251–253. doi:10.1111/j.1574-6968.1999.tb13576.x
68
Jackson, D. A., Symons, R. H., and Berg, P. (1972). Biochemical Method for Inserting New Genetic Information into DNA of Simian Virus 40: Circular SV40 DNA Molecules Containing Lambda Phage Genes and the Galactose Operon of Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 69 (10), 2904–2909. doi:10.1073/pnas.69.10.2904
69
Jacob, M., and Gallinaro, H. (1989). The 5′ Splice Site: Phylogenetic Evolution and Variable Geometry of Association with U1RNA. Nucl. Acids Res. 17 (6), 2159–2180. doi:10.1093/nar/17.6.2159
70
Jensen, L. J., Saric, J., and Bork, P. (2006). Literature Mining for the Biologist: from Information Retrieval to Biological Discovery. Nat. Rev. Genet. 7 (2), 119–129. doi:10.1038/nrg1768
71
Jia, G.-y., Wang, D.-l., Xue, M.-z., Liu, Y.-w., Pei, Y.-c., Yang, Y.-q., et al. (2019). CircRNAFisher: a Systematic Computational Approach for De Novo Circular RNA Identification. Acta Pharmacol. Sin. 40 (1), 55–63. doi:10.1038/s41401-018-0063-1
72
Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N., and Sternberg, M. J. E. (2015). The Phyre2 Web Portal for Protein Modeling, Prediction and Analysis. Nat. Protoc. 10 (6), 845–858. doi:10.1038/nprot.2015.053
73
Kircher, M., Heyn, P., and Kelso, J. (2011). Addressing Challenges in the Production and Analysis of Illumina Sequencing Data. BMC genomics 12 (1), 382–414. doi:10.1186/1471-2164-12-382
74
Kircher, M., Witten, D. M., Jain, P., O'Roak, B. J., Cooper, G. M., and Shendure, J. (2014). A General Framework for Estimating the Relative Pathogenicity of Human Genetic Variants. Nat. Genet. 46 (3), 310–315. doi:10.1038/ng.2892
75
Koumakis, L. (2020). Deep Learning Models in Genomics; Are We There yet? Comput. Struct. Biotechnol. J. 18, 1466–1473. doi:10.1016/j.csbj.2020.06.017
76
Kuhn, M., von Mering, C., Campillos, M., Jensen, L. J., and Bork, P. (2008). STITCH: Interaction Networks of Chemicals and Proteins. Nucleic Acids Res. 36, D684–D688. doi:10.1093/nar/gkm795
77
Kuhn, M., Letunic, I., Jensen, L. J., and Bork, P. (2016). The SIDER Database of Drugs and Side Effects. Nucleic Acids Res. 44 (D1), D1075–D1079. doi:10.1093/nar/gkv1075
78
Lacal, I., and Ventura, R. (2018). Epigenetic Inheritance: Concepts, Mechanisms and Perspectives. Front. Mol. Neurosci. 11, 292. doi:10.3389/fnmol.2018.00292
79
Lander, E. S. (2011). Initial Impact of the Sequencing of the Human Genome. Nature 470 (7333), 187–197. doi:10.1038/nature09792
80
Laufer, B. I., Hwang, H., Jianu, J. M., Mordaunt, C. E., Korf, I. F., Hertz-Picciotto, I., et al. (2020). Low-pass Whole Genome Bisulfite Sequencing of Neonatal Dried Blood Spots Identifies a Role for RUNX1 in Down Syndrome DNA Methylation Profiles. Hum. Mol. Genet. 29 (21), 3465–3476. doi:10.1093/hmg/ddaa218
81
Laufer, B. I., Neier, K., Valenzuela, A. E., Yasui, D. H., Schmidt, R. J., Lein, P. J., et al. (2022). Placenta and Fetal Brain Share a Neurodevelopmental Disorder DNA Methylation Profile in a Mouse Model of Prenatal PCB Exposure. Cel Rep. 38 (9), 110442. doi:10.1016/j.celrep.2022.110442
82
Laurentino, S., Heckmann, L., Di Persio, S., Li, X., Meyer Zu Hörste, G., Wistuba, J., et al. (2019). High-resolution Analysis of Germ Cells from Men with Sex Chromosomal Aneuploidies Reveals Normal Transcriptome but Impaired Imprinting. Clin. Epigenetics 11 (1), 127–213. doi:10.1186/s13148-019-0720-3
83
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based Learning Applied to Document Recognition. Proc. IEEE 86 (11), 2278–2324. doi:10.1109/5.726791
84
Ledergerber, C., and Dessimoz, C. (2011). Base-calling for Next-Generation Sequencing Platforms. Brief. Bioinformatics 12 (5), 489–497. doi:10.1093/bib/bbq077
85
Lee, B. (2016). “deepTarget: End-To-End Learning Framework for microRNA Target Prediction Using Deep Recurrent Neural Networks,” in Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics.
86
Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T., et al. (2016). Analysis of Protein-Coding Genetic Variation in 60,706 Humans. Nature 536 (7616), 285–291. doi:10.1038/nature19057
87
Lelieveld, S. H., Veltman, J. A., and Gilissen, C. (2016). Novel Bioinformatic Developments for Exome Sequencing. Hum. Genet. 135 (6), 603–614. doi:10.1007/s00439-016-1658-6
88
Lemeer, S., and Heck, A. J. (2009). The Phosphoproteomics Data Explosion. Curr. Opin. Chem. Biol. 13 (4), 414–420. doi:10.1016/j.cbpa.2009.06.022
89
Li, H. (2016). Minimap and Miniasm: Fast Mapping and De Novo Assembly for Noisy Long Sequences. Bioinformatics 32 (14), 2103–2110. doi:10.1093/bioinformatics/btw152
90
Li, X., and Wu, Y. (2020). Detecting Circular RNA from High-Throughput Sequence Data with de Bruijn Graph. BMC genomics 21 (1), 749–811. doi:10.1186/s12864-019-6154-7
91
Lienhard, M., Grimm, C., Morkel, M., Herwig, R., and Chavez, L. (2014). MEDIPS: Genome-wide Differential Coverage Analysis of Sequencing Data Derived from DNA Enrichment Experiments. Bioinformatics 30 (2), 284–286. doi:10.1093/bioinformatics/btt650
92
Liu, H.-X., Chew, S. L., Cartegni, L., Zhang, M. Q., and Krainer, A. R. (2000). Exonic Splicing Enhancer Motif Recognized by Human SC35 under Splicing Conditions. Mol. Cel Biol. 20 (3), 1063–1071. doi:10.1128/mcb.20.3.1063-1071.2000
93
Liu, H. X., Zhang, M., and Krainer, A. R. (1998). Identification of Functional Exonic Splicing Enhancer Motifs Recognized by Individual SR Proteins. Genes Dev. 12 (13), 1998–2012. doi:10.1101/gad.12.13.1998
94
Liu, Q., Ding, C., Lang, X., Guo, G., Chen, J., and Su, X. (2021). Small Noncoding RNA Discovery and Profiling with sRNAtools Based on High-Throughput Sequencing. Brief. Bioinformatics 22 (1), 463–473. doi:10.1093/bib/bbz151
95
Loman, N. J., Quick, J., and Simpson, J. T. (2015). A Complete Bacterial Genome Assembled De Novo Using Only Nanopore Sequencing Data. Nat. Methods 12 (8), 733–735. doi:10.1038/nmeth.3444
96
Lopez, J. P., Diallo, A., Cruceanu, C., Fiori, L. M., Laboissiere, S., Guillet, I., et al. (2015). Biomarker Discovery: Quantification of microRNAs and Other Small Non-coding RNAs Using Next Generation Sequencing. BMC Med. Genomics 8 (1), 35–18. doi:10.1186/s12920-015-0109-x
97
Lundby, A., Secher, A., Lage, K., Nordsborg, N. B., Dmytriyev, A., Lundby, C., et al. (2012). Quantitative Maps of Protein Phosphorylation Sites across 14 Different Rat Organs and Tissues. Nat. Commun. 3 (1), 876–910. doi:10.1038/ncomms1871
98
Luscombe, N. M., Greenbaum, D., and Gerstein, M. (2001). What Is Bioinformatics? An Introduction and Overview. Yearb. Med. Inform. 10 (01), 83–100. doi:10.1055/s-0038-1638103
99
Ma, B., and Johnson, R. (2012). De Novo Sequencing and Homology Searching. Mol. Cel Proteomics 11 (2), O111–O014902. doi:10.1074/mcp.O111.014902
100
MacArthur, D. G., Balasubramanian, S., Frankish, A., Huang, N., Morris, J., Walter, K., et al. (2012). A Systematic Survey of Loss-Of-Function Variants in Human Protein-Coding Genes. Science 335 (6070), 823–828. doi:10.1126/science.1215040
101
Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y. (2008). RNA-seq: an Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays. Genome Res. 18 (9), 1509–1517. doi:10.1101/gr.079558.108
102
Martin, T. C. (2014). The coMET User Guide.
103
Maxam, A. M., and Gilbert, W. (1977). A New Method for Sequencing DNA. Proc. Natl. Acad. Sci. U.S.A. 74 (2), 560–564. doi:10.1073/pnas.74.2.560
104
McCarthy, A. (2010). Third Generation DNA Sequencing: Pacific Biosciences' Single Molecule Real Time Technology. Chem. Biol. 17 (7), 675–676. doi:10.1016/j.chembiol.2010.07.004
105
McLaren, W., Pritchard, B., Rios, D., Chen, Y., Flicek, P., and Cunningham, F. (2010). Deriving the Consequences of Genomic Variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26 (16), 2069–2070. doi:10.1093/bioinformatics/btq330
106
Meng, X., Chen, Q., Zhang, P., and Chen, M. (2017). CircPro: an Integrated Tool for the Identification of circRNAs with Protein-Coding Potential. Bioinformatics 33 (20), 3314–3316. doi:10.1093/bioinformatics/btx446
107
Merker, J. D., Wenger, A. M., Sneddon, T., Grove, M., Zappala, Z., Fresard, L., et al. (2018). Long-read Genome Sequencing Identifies Causal Structural Variation in a Mendelian Disease. Genet. Med. 20 (1), 159–163. doi:10.1038/gim.2017.86
108
Montgomery, S. B., Sammeth, M., Gutierrez-Arcelus, M., Lach, R. P., Ingle, C., Nisbett, J., et al. (2010). Transcriptome Genetics Using Second Generation Sequencing in a Caucasian Population. Nature 464 (7289), 773–777. doi:10.1038/nature08903
109
Müller, F., Scherer, M., Assenov, Y., Lutsik, P., Walter, J., Lengauer, T., et al. (2019). RnBeads 2.0: Comprehensive Analysis of DNA Methylation Data. Genome Biol. 20 (1), 55. doi:10.1186/s13059-019-1664-9
110
Nagaraj, N., Wisniewski, J. R., Geiger, T., Cox, J., Kircher, M., Kelso, J., et al. (2011). Deep Proteome and Transcriptome Mapping of a Human Cancer Cell Line. Mol. Syst. Biol. 7 (1), 548. doi:10.1038/msb.2011.81
111
Ng, P. C., and Henikoff, S. (2003). SIFT: Predicting Amino Acid Changes that Affect Protein Function. Nucleic Acids Res. 31 (13), 3812–3814. doi:10.1093/nar/gkg509
112
Ng, S. B., Turner, E. H., Robertson, P. D., Flygare, S. D., Bigham, A. W., Lee, C., et al. (2009). Targeted Capture and Massively Parallel Sequencing of 12 Human Exomes. Nature 461 (7261), 272–276. doi:10.1038/nature08250
113
Nguyen, T. M., Shafi, A., Nguyen, T., and Draghici, S. (2019). Correction to: Identifying Significantly Impacted Pathways: a Comprehensive Review and Assessment. Genome Biol. 20 (1), 234–315. doi:10.1186/s13059-019-1882-1
114
Nilsen, T. W. (2003). The Spliceosome: the Most Complex Macromolecular Machine in the Cell? Bioessays 25 (12), 1147–1149. doi:10.1002/bies.10394
115
Ozsolak, F. (2012). Third-generation Sequencing Techniques and Applications to Drug Discovery. Expert Opin. Drug Discov. 7 (3), 231–243. doi:10.1517/17460441.2012.660145
116
Paczkowska, M., Barenboim, J., Sintupisut, N., Fox, N. S., Zhu, H., Abd-Rabbo, D., et al. (2020). Integrative Pathway Enrichment Analysis of Multivariate Omics Data. Nat. Commun. 11 (1), 735–816. doi:10.1038/s41467-019-13983-9
117
Park, S. (2016). deepMiRGene: Deep Neural Network Based Precursor microRNA Prediction. arXiv preprint arXiv:1605.00017.
118
Pennisi, E. (2012). Single-cell Sequencing Tackles Basic and Biomedical Questions. American Association for the Advancement of Science.
119
Pereira, R., Barbosa, T., Gales, L., Oliveira, E., Santos, R., Oliveira, J., et al. (2019). Clinical and Genetic Analysis of Children with Kartagener Syndrome. Cells 8 (8), 900. doi:10.3390/cells8080900
120
Pereira, R., Oliveira, J., and Sousa, M. (2020). Bioinformatics and Computational Tools for Next-Generation Sequencing Analysis in Clinical Genetics. J. Clin. Med. 9 (1), 132. doi:10.3390/jcm9010132
121
Pereira, R., Oliveira, M. E., Santos, R., Oliveira, E., Barbosa, T., Santos, T., et al. (2019). Characterization of CCDC103 Expression Profiles: Further Insights in Primary Ciliary Dyskinesia and in Human Reproduction. J. Assist. Reprod. Genet. 36 (8), 1683–1700. doi:10.1007/s10815-019-01509-7
122
Perez-Riverol, Y., Csordas, A., Bai, J., Bernal-Llinares, M., Hewapathirana, S., Kundu, D. J., et al. (2019). The PRIDE Database and Related Tools and Resources in 2019: Improving Support for Quantification Data. Nucleic Acids Res. 47 (D1), D442–D450. doi:10.1093/nar/gky1106
123
Peters, T. J., Buckley, M. J., Statham, A. L., Pidsley, R., Samaras, K., V Lord, R., et al. (2015). De Novo Identification of Differentially Methylated Regions in the Human Genome. Epigenetics Chromatin 8 (1), 6–16. doi:10.1186/1756-8935-8-6
124
Pevsner, J. (2015). Bioinformatics and Functional Genomics. John Wiley & Sons.
125
Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R., and Siepel, A. (2010). Detection of Nonneutral Substitution Rates on Mammalian Phylogenies. Genome Res. 20 (1), 110–121. doi:10.1101/gr.097857.109
126
Poste, G. (2011). Bring on the Biomarkers. Nature 469 (7329), 156–157. doi:10.1038/469156a
127
Prosdocimi, F. (2010). Introdução à Bioinformática. Curso Online.
128
Prosdocimi, F., Cerqueira, G. C., Binneck, E., Silva, A. F., Reis, A. N., Junqueira, A. C. M., et al. (2002). Bioinformatics: User Manual - Biotechnology Science & Development.
129
Pruess, M., and Apweiler, R. (2003). Bioinformatics Resources for In Silico Proteome Analysis. J. Biomed. Biotechnol. 2003 (4), 231–236. doi:10.1155/s1110724303209219
130
Pushkarev, D., Neff, N. F., and Quake, S. R. (2009). Single-molecule Sequencing of an Individual Human Genome. Nat. Biotechnol. 27 (9), 847–850. doi:10.1038/nbt.1561
131
Quail, M. A., Smith, M., Coupland, P., Otto, T. D., Harris, S. R., Connor, T. R., et al. (2012). A Tale of Three Next Generation Sequencing Platforms: Comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq Sequencers. BMC genomics 13 (1), 341–413. doi:10.1186/1471-2164-13-341
132
Quang, D., and Xie, X. (2016). DanQ: a Hybrid Convolutional and Recurrent Deep Neural Network for Quantifying the Function of DNA Sequences. Nucleic Acids Res. 44 (11), e107. doi:10.1093/nar/gkw226
133
Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A., and Kim, D. (2015). Methods of Integrating Data to Uncover Genotype-Phenotype Interactions. Nat. Rev. Genet. 16 (2), 85–97. doi:10.1038/nrg3868
134
Robinson, P. N., Köhler, S., Oellrich, A., Wang, K., Mungall, C. J., Lewis, S. E., et al. (2014). Improved Exome Prioritization of Disease Genes through Cross-Species Phenotype Comparison. Genome Res. 24 (2), 340–348. doi:10.1101/gr.160325.113
135
Rosenberg, A. B., Roco, C. M., Muscat, R. A., Kuchina, A., Sample, P., Yao, Z., et al. (2018). Single-cell Profiling of the Developing Mouse Brain and Spinal Cord with Split-Pool Barcoding. Science 360 (6385), 176–182. doi:10.1126/science.aam8999
136
Sanger, F., and Coulson, A. R. (1975). A Rapid Method for Determining Sequences in DNA by Primed Synthesis with DNA Polymerase. J. Mol. Biol. 94 (3), 441–448. doi:10.1016/0022-2836(75)90213-2
137
Sanger, F., Nicklen, S., and Coulson, A. R. (1977). DNA Sequencing with Chain-Terminating Inhibitors. Proc. Natl. Acad. Sci. U.S.A. 74 (12), 5463–5467. doi:10.1073/pnas.74.12.5463
138
Schadt, E. E., Turner, S., and Kasarskis, A. (2010). A Window into Third-Generation Sequencing. Hum. Mol. Genet. 19 (R2), R227–R240. doi:10.1093/hmg/ddq416
139
Scherer, S. W., Lee, C., Birney, E., Altshuler, D. M., Eichler, E. E., Carter, N. P., et al. (2007). Challenges and Standards in Integrating Surveys of Structural Variation. Nat. Genet. 39 (7), S7–S15. doi:10.1038/ng2093
140
Schmidt, A., Forne, I., and Imhof, A. (2014). Bioinformatic Analysis of Proteomics Data. BMC Syst. Biol. 8 Suppl 2 (2), S3–S7. doi:10.1186/1752-0509-8-S2-S3
141
Schwarz, J. M., Rödelsperger, C., Schuelke, M., and Seelow, D. (2010). MutationTaster Evaluates Disease-Causing Potential of Sequence Alterations. Nat. Methods 7 (8), 575–576. doi:10.1038/nmeth0810-575
142
Shah, N. J., Sureshkumar, S., and Shewade, D. G. (2015). Metabolomics: a Tool Ahead for Understanding Molecular Mechanisms of Drugs and Diseases. Ind. J. Clin. Biochem. 30 (3), 247–254. doi:10.1007/s12291-014-0455-z
143
Shendure, J., and Ji, H. (2008). Next-generation DNA Sequencing. Nat. Biotechnol. 26 (10), 1135–1145. doi:10.1038/nbt1486
144
Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., et al. (2011). Fast, Scalable Generation of High‐quality Protein Multiple Sequence Alignments Using Clustal Omega. Mol. Syst. Biol. 7, 539. doi:10.1038/msb.2011.75
145
Sims, D., Sudbery, I., Ilott, N. E., Heger, A., and Ponting, C. P. (2014). Sequencing Depth and Coverage: Key Considerations in Genomic Analyses. Nat. Rev. Genet. 15 (2), 121–132. doi:10.1038/nrg3642
146
Singh Nanda, J., Kumar, R., and Raghava, G. P. (2016). dbEM: A Database of Epigenetic Modifiers Curated from Cancerous and Normal Genomes. Sci. Rep. 6 (1), 19340–19346. doi:10.1038/srep19340
147
Singh, N., Upadhyay, S., Jaiswar, A., and Mishra, N. (2016a). In Silico Analysis of Protein. J. Bioinform Genomics Proteomics 1 (2), 1007.
148
Singh, R., Lanchantin, J., Robins, G., and Qi, Y. (2016b). DeepChrome: Deep-Learning for Predicting Gene Expression from Histone Modifications. Bioinformatics 32 (17), i639–i648. doi:10.1093/bioinformatics/btw427
149
Singleton, M. V., Guthery, S. L., Voelkerding, K. V., Chen, K., Kennedy, B., Margraf, R. L., et al. (2014). Phevor Combines Multiple Biomedical Ontologies for Accurate Identification of Disease-Causing Alleles in Single Individuals and Small Nuclear Families. Am. J. Hum. Genet. 94 (4), 599–610. doi:10.1016/j.ajhg.2014.03.010
150
Siva, N. (2008). 1000 Genomes Project. Nat. Biotechnol. 26 (3), 256. doi:10.1038/nbt0308-256b
151
Smith, B. E., Hill, J. A., Gjukich, M. A., and Andrews, P. C. (2011). Tranche Distributed Repository and ProteomeCommons.Org. Methods Mol. Biol. 696, 123–145. doi:10.1007/978-1-60761-987-1_8
152
Smits, S. L., Raj, V. S., Oduber, M. D., Schapendonk, C. M. E., Bodewes, R., Provacia, L., et al. (2013). Metagenomic Analysis of the Ferret Fecal Viral Flora. PLoS One 8 (8), e71595. doi:10.1371/journal.pone.0071595
153
Snider, C., Jayasinghe, S., Hristova, K., and White, S. H. (2009). MPEx: a Tool for Exploring Membrane Proteins. Protein Sci. 18 (12), 2624–2628. doi:10.1002/pro.256
154
Sović, I., Šikić, M., Wilm, A., Fenlon, S. N., Chen, S., and Nagarajan, N. (2016). Fast and Sensitive Mapping of Nanopore Sequencing Reads with GraphMap. Nat. Commun. 7, 11307. doi:10.1038/ncomms11307
155
Stelzer, G., Plaschkes, I., Oz-Levi, D., Alkelai, A., Olender, T., Zimmerman, S., et al. (2016). VarElect: the Phenotype-Based Variation Prioritizer of the GeneCards Suite. BMC genomics 17 Suppl 2 (2), 444–206. doi:10.1186/s12864-016-2722-2
156
Stephens, Z. D., Lee, S. Y., Faghri, F., Campbell, R. H., Zhai, C., Efron, M. J., et al. (2015). Big Data: Astronomical or Genomical? Plos Biol. 13 (7), e1002195. doi:10.1371/journal.pbio.1002195
157
Stitziel, N. O., Binkowski, T. A., Tseng, Y. Y., Kasif, S., and Liang, J. (2004). topoSNP: a Topographic Database of Non-synonymous Single Nucleotide Polymorphisms with and without Known Disease Association. Nucleic Acids Res. 32 (Suppl. 1), D520–D522. doi:10.1093/nar/gkh104
158
Stoneking, M., and Krause, J. (2011). Learning about Human Population History from Ancient and Modern Genomes. Nat. Rev. Genet. 12 (9), 603–614. doi:10.1038/nrg3029
159
Stothard, P. (2000). The Sequence Manipulation Suite: JavaScript Programs for Analyzing and Formatting Protein and DNA Sequences. Biotechniques 28 (6), 1102–1104. doi:10.2144/00286ir01
160
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene Set Enrichment Analysis: a Knowledge-Based Approach for Interpreting Genome-wide Expression Profiles. Proc. Natl. Acad. Sci. U.S.A. 102 (43), 15545–15550. doi:10.1073/pnas.0506580102
161
Szalay, T., and Golovchenko, J. A. (2015). De Novo Sequencing and Variant Calling with Nanopores Using PoreSeq. Nat. Biotechnol. 33 (10), 1087–1091. doi:10.1038/nbt.3360
162
Szklarczyk, D., Gable, A. L., Nastou, K. C., Lyon, D., Kirsch, R., Pyysalo, S., et al. (2021). Correction to 'The STRING Database in 2021: Customizable Protein-Protein Networks, and Functional Characterization of User-Uploaded Gene/measurement Sets'. Nucleic Acids Res. 49 (18), 10800. doi:10.1093/nar/gkab835
163
Tabas-Madrid, D., Nogales-Cadenas, R., and Pascual-Montano, A. (2012). GeneCodis3: a Non-redundant and Modular Enrichment Analysis Tool for Functional Genomics. Nucleic Acids Res. 40, W478–W483. doi:10.1093/nar/gks402
164
Takahashi, S., Saegusa, J., Onishi, A., and Morinobu, A. (2019). Biomarkers Identified by Serum Metabolomic Analysis to Predict Biologic Treatment Response in Rheumatoid Arthritis Patients. Rheumatology 58 (12), 2153–2161. doi:10.1093/rheumatology/kez199
165
Thompson, J. F., and Steinmann, K. E. (2010). Single Molecule Sequencing with a HeliScope Genetic Analysis System. Curr. Protoc. Mol. Biol. Chapter 7 (1), Unit7–10. doi:10.1002/0471142727.mb0710s92
166
Thorn, C. F., Klein, T. E., and Altman, R. B. (2013). PharmGKB: the Pharmacogenomics Knowledge Base. Methods Mol. Biol. (Clifton, N.J.) 1015, 311–320. doi:10.1007/978-1-62703-435-7_20
167
TripathiP.SomashekarB. S.PonnusamyM.GurskyA.DaileyS.KunjuP.et al (2013). HR-MAS NMR Tissue Metabolomic Signatures Cross-Validated by Mass Spectrometry Distinguish Bladder Cancer from Benign Disease. J. Proteome Res.12 (7), 3519–3528. 10.1021/pr4004135
168
TrostB.KusalikA. (2011). Computational Prediction of Eukaryotic Phosphorylation Sites. Bioinformatics27 (21), 2927–2935. 10.1093/bioinformatics/btr525
169
van DijkE. L.JaszczyszynY.NaquinD.ThermesC. (2018). The Third Revolution in Sequencing Technology. Trends Genet.34 (9), 666–681. 10.1016/j.tig.2018.05.008
170
VenterJ. C.AdamsM. D.MyersE. W.LiP. W.MuralR. J.SuttonG. G.et al (2001). The Sequence of the Human Genome. science291 (5507), 1304–1351. 10.1126/science.1058040
171
VerliH. (2014). Bioinformática: da biologia à flexibilidade molecular.
172
VitakS. A.TorkenczyK. A.RosenkrantzJ. L.FieldsA. J.ChristiansenL.WongM. H.et al (2017). Sequencing Thousands of Single-Cell Genomes with Combinatorial Indexing. Nat. Methods14 (3), 302–308. 10.1038/nmeth.4154
173
WangJ. (2009). Protein Structure Prediction by Comparative Modeling: An Analysis of Methodology.
174
WangK.LiM.HakonarsonH. (2010). ANNOVAR: Functional Annotation of Genetic Variants from High-Throughput Sequencing Data. Nucleic Acids Res.38 (16), e164. 10.1093/nar/gkq603
175
WangK.SinghD.ZengZ.ColemanS. J.HuangY.SavichG. L.et al (2010). MapSplice: Accurate Mapping of RNA-Seq Reads for Splice junction Discovery. Nucleic Acids Res.38 (18), e178. 10.1093/nar/gkq622
176
WangX.XiongX.CaoW.ZhangC.WerrenJ. H.WangX. (2019). Genome Assembly of the A-Group Wolbachia in Nasonia oneida Using Linked-Reads Technology. Genome Biol. Evol.11 (10), 3008–3013. 10.1093/gbe/evz223
177
WattanachaiN.KaewmoongkunS.PussadhammaB.MakarawateP.WongvipapornC.KiatchoosakunS.et al (2017). The Impact of Non-genetic and Genetic Factors on a Stable Warfarin Dose in Thai Patients. Eur. J. Clin. Pharmacol.73 (8), 973–980. 10.1007/s00228-017-2265-8
178
WenL.TangF. (2018). Boosting the Power of Single-Cell Analysis. Nat. Biotechnol.36 (5), 408–409. 10.1038/nbt.4131
179
WestholmJ. O.MiuraP.OlsonS.ShenkerS.JosephB.SanfilippoP.et al (2014). Genome-wide Analysis of drosophila Circular RNAs Reveals Their Structural and Sequence Properties and Age-dependent Neural Accumulation. Cel Rep.9 (5), 1966–1980. 10.1016/j.celrep.2014.10.062
180
WhiteakerJ. R.LinC.KennedyJ.HouL.TruteM.SokalI.et al (2011). A Targeted Proteomics-Based Pipeline for Verification of Biomarkers in Plasma. Nat. Biotechnol.29 (7), 625–634. 10.1038/nbt.1900
181
WilliamsR.ZipserD. D. (1989a). A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Comput.1. 10.1162/neco.1989.1.2.270
182
WishartD. S.FeunangY. D.GuoA. C.LoE. J.MarcuA.GrantJ. R.et al (2018). DrugBank 5.0: a Major Update to the DrugBank Database for 2018. Nucleic Acids Res.46 (D1), D1074–D1082. 10.1093/nar/gkx1037
183
XiaoT.ZhouW. (2020). The Third Generation Sequencing: the Advanced Approach to Genetic Diseases. Transl Pediatr.9 (2), 163–173. 10.21037/tp.2020.03.06
184
YangH.WangK. (2015). Genomic Variant Annotation and Prioritization with ANNOVAR and wANNOVAR. Nat. Protoc.10 (10), 1556–1566. 10.1038/nprot.2015.105
185
YangJ.YanR.RoyA.XuD.PoissonJ.ZhangY. (2015). The I-TASSER Suite: Protein Structure and Function Prediction. Nat. Methods12 (1), 7–8. 10.1038/nmeth.3213
186
YangW.SoaresJ.GreningerP.EdelmanE. J.LightfootH.ForbesS.et al (2013). Genomics of Drug Sensitivity in Cancer (GDSC): a Resource for Therapeutic Biomarker Discovery in Cancer Cells. Nucleic Acids Res.41, D955–D961. 10.1093/nar/gks1111
187
YouX.ConradT. O. (2016). Acfs: Accurate circRNA Identification and Quantification from RNA-Seq Data. Sci. Rep.6 (1), 38820–38911. 10.1038/srep38820
188
ZhangA.SunH.WangX. (2014). Urinary Metabolic Profiling of Rat Models Revealed Protective Function of Scoparone against Alcohol Induced Hepatotoxicity. Sci. Rep.4 (1), 6768–8. 10.1038/srep06768
189
ZhangA.SunH.YanG.WangP.WangX. (2015). Metabolomics for Biomarker Discovery: Moving to the Clinic. Biomed. Res. Int.2015, 354671. 10.1155/2015/354671
190
ZhangA.-h.SunH.HanY.YanG.-l.YuanY.SongG.-c.et al (2013). Ultraperformance Liquid Chromatography-Mass Spectrometry Based Comprehensive Metabolomics Combined with Pattern Recognition and Network Analysis Methods for Characterization of Metabolites and Metabolic Pathways from Biological Data Sets. Anal. Chem.85 (15), 7606–7612. 10.1021/ac401793d
191
ZhangX.-O.WangH.-B.ZhangY.LuX.ChenL.-L.YangL. (2014). Complementary Sequence-Mediated Exon Circularization. Cell159 (1), 134–147. 10.1016/j.cell.2014.09.001
192
ZhangX. H.-F.LeslieC. S.ChasinL. A. (2005). Computational Searches for Splicing Signals. Methods37 (4), 292–305. 10.1016/j.ymeth.2005.07.011
193
ZhengG. X.TerryJ. M.BelgraderP.RyvkinP.BentZ. W.WilsonR.et al (2017). Massively Parallel Digital Transcriptional Profiling of Single Cells. Nat. Commun.8 (1), 14049–14112. 10.1038/ncomms14049
194
ZhengG. X. Y.LauB. T.Schnall-LevinM.JaroszM.BellJ. M.HindsonC. M.et al (2016). Haplotyping Germline and Cancer Genomes with High-Throughput Linked-Read Sequencing. Nat. Biotechnol.34 (3), 303–311. 10.1038/nbt.3432
195
ZhengY.JiP.ChenS.HouL.ZhaoF. (2019). Reconstruction of Full-Length Circular RNAs Enables Isoform-Level Quantification. Genome Med.11 (1), 2–20. 10.1186/s13073-019-0614-1
196
ZhouX.RenL.MengQ.LiY.YuY.YuJ. (2010). The Next-Generation Sequencing Technology and Application. Protein Cell1 (6), 520–536. 10.1007/s13238-010-0065-3
197
ZhuJ.MayedaA.KrainerA. R. (2001). Exon Identity Established through Differential Antagonism between Exonic Splicing Silencer-Bound hnRNP A1 and Enhancer-Bound SR Proteins. Mol. Cel.8 (6), 1351–1361. 10.1016/s1097-2765(01)00409-9
198
ZuradaJ. (1992). Introduction to Artificial Neural Systems. Wuhan , China: West Publishing Co.
Summary
Keywords
Single nucleotide polymorphisms (SNPs), Human Splicing Finder (HSF), next-generation sequencing (NGS), in silico, bioinformatics
Citation
Bhat GR, Sethi I, Rah B, Kumar R and Afroze D (2022) Innovative in Silico Approaches for Characterization of Genes and Proteins. Front. Genet. 13:865182. doi: 10.3389/fgene.2022.865182
Received
29 January 2022
Accepted
11 April 2022
Published
18 May 2022
Volume
13 - 2022
Edited by
Prashanth N Suravajhala, Amrita Vishwa Vidyapeetham University, India
Reviewed by
Christos K. Kontos, National and Kapodistrian University of Athens, Greece
George Potamias, Foundation for Research and Technology Hellas (FORTH), Greece
Indra Mani, University of Delhi, India
Copyright
© 2022 Bhat, Sethi, Rah, Kumar and Afroze.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Dil Afroze, afrozedil@gmail.com
This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.