<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<?covid-19-tdm?>
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="brief-report" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Virol.</journal-id>
<journal-title>Frontiers in Virology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Virol.</abbrev-journal-title>
<issn pub-type="epub">2673-818X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fviro.2022.1028335</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Virology</subject>
<subj-group>
<subject>Brief Research Report</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>SARSNTdb database: Factors affecting SARS-CoV-2 sequence conservation</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Orgera</surname>
<given-names>John</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kelley</surname>
<given-names>James J.</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/2057459"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bar</surname>
<given-names>Omri</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Vaidhyanathan</surname>
<given-names>Sathyanarayanan</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1976794"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Grigoriev</surname>
<given-names>Andrey</given-names>
</name>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/187350"/>
</contrib>
</contrib-group>
<aff id="aff1">
<institution>Biology Department and Center for Computational and Integrative Biology, Rutgers University</institution>, <addr-line>Camden, NJ</addr-line>, <country>United States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Evangelia Georgia Kostaki, National and Kapodistrian University of Athens, Greece</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Ye Qiu, Hunan University, China; Divya Mishra, Kansas State University, United States; Timokratis Karamitros, University of Oxford, United Kingdom</p>
</fn>
<fn fn-type="corresp" id="fn001">
<p>*Correspondence: Andrey Grigoriev, <email xlink:href="mailto:andrey.grigoriev@rutgers.edu">andrey.grigoriev@rutgers.edu</email>
</p>
</fn>
<fn fn-type="other" id="fn002">
<p>This article was submitted to Bioinformatic and Predictive Virology, a section of the journal Frontiers in Virology</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>01</day>
<month>12</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>2</volume>
<elocation-id>1028335</elocation-id>
<history>
<date date-type="received">
<day>25</day>
<month>08</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>11</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Orgera, Kelley, Bar, Vaidhyanathan and Grigoriev</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Orgera, Kelley, Bar, Vaidhyanathan and Grigoriev</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>SARSNTdb offers a curated, nucleotide-centric database for users of varying levels of SARS-CoV-2 knowledge. Its user-friendly interface enables querying coding regions and coordinate intervals to find the various functional and selective constraints that act upon the corresponding nucleotides and amino acids. Users can easily obtain information about viral genes and proteins, functional domains, repeats, secondary structure formation, intragenomic interactions, and mutation prevalence. Currently, many databases are focused on the phylogeny and amino acid substitutions, mainly in the spike protein. We took a novel, more nucleotide-focused approach, as RNA does more than just code for proteins and many insights can be gleaned from its study. For example, RNA-targeted drug therapies for SARS-CoV-2 are currently being developed and it is essential to understand the features only visible at that level. This database enables the user to identify regions that are more prone to forming secondary structures that drugs can target. SARSNTdb also provides illustrative mutation data from a subset of ~25,000 patient samples with reliable read coverage across the whole genome (from different locations and time points in the pandemic). Finally, the database allows for comparing SARS-CoV-2 and SARS-CoV domains and sequences. SARSNTdb can serve the research community as a curated repository of information that gives a jump start to analyzing a mutation&#x2019;s effect far beyond just determining synonymous/non-synonymous substitutions in protein sequences.</p>
</abstract>
<kwd-group>
<kwd>SARS-CoV-2</kwd>
<kwd>SARS-CoV</kwd>
<kwd>database</kwd>
<kwd>genome analysis</kwd>
<kwd>bioinformatics</kwd>
</kwd-group>
<contract-sponsor id="cn001">National Science Foundation<named-content content-type="fundref-id">10.13039/100000001</named-content>
</contract-sponsor>
<counts>
<fig-count count="3"/>
<table-count count="1"/>
<equation-count count="0"/>
<ref-count count="21"/>
<page-count count="7"/>
<word-count count="3347"/>
</counts>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<title>Introduction</title>
<p>The COVID-19 pandemic caused by the SARS-CoV-2 virus has generated a massive amount of data, which needs to be organized in order to understand the virus biology. These data include sequences, papers, medical information, and proteomics data. Several databases related to SARS-CoV-2 have studied and reported on the evolution of the virus and identified variants (<xref ref-type="bibr" rid="B1">1</xref>). For example, GISAID (<xref ref-type="bibr" rid="B2">2</xref>), the important primary repository of assembled SARS-CoV-2 genomes, has over 11 million genome sequence samples submitted at the time of writing and lists on its website several databases that use these new data to track the evolution of the virus over time. However, interpreting a reported nucleotide or amino acid substitution often requires sifting through pieces of information related to the affected proteins of the virus. Such information is scattered throughout the web in many papers and databases. If a mutation is found at a certain coordinate, a thorough investigation delving into multiple papers is required to understand the functional importance of that one nucleotide and its surroundings. To remedy this, we created SARSNTdb, a compact database of highly interlinked data records that allows the user to rapidly navigate from genome positions to functional/selective constraints on the corresponding nucleotides and amino acids.</p>
<p>Genome databases typically list coordinates of coding and non-coding regions and provide their annotations per such region. In contrast, SARSNTdb is nucleotide-centric: it allows querying annotations for every position in the genome from the perspective of potential selection factors affecting the corresponding nucleotide. Public attention to the SARS-CoV-2 virus has generally focused on mutations occurring in its genome (and their impact on vaccine efficacy and virus spread). Most frequently, SARS-CoV-2 mutations are viewed through the prism of immune system evasion (<xref ref-type="bibr" rid="B3">3</xref>, <xref ref-type="bibr" rid="B4">4</xref>). While this is relevant for the (most widely known) viral spike protein, the general public and scientific community are often at a loss when other substitutions are considered, especially silent ones or short insertions/deletions (indels). Given the significant interest in variants of concern (VOC), a strong focus on selection would provide complementary functional context for the respective VOC mutations, beyond the trivial synonymous/non-synonymous designations. Examples of such context include repeats, secondary structure formation, intragenomic interactions, nucleotide and amino acid conservation, and mutation prevalence. For example, it is known that repeats and their variations play a critical role in the production of subgenomic mRNAs in coronaviruses (<xref ref-type="bibr" rid="B5">5</xref>), and recent VOCs, such as Omicron, display a large number of both spike (<xref ref-type="bibr" rid="B6">6</xref>) and non-spike substitutions or indels.</p>
<p>To ensure the consistent cataloguing of the nucleotide and amino acid substitutions, we re-evaluated mutations across ~25,000 patient samples, for which raw metatranscriptome datasets of sufficient quality were available in NCBI&#x2019;s SRA (<xref ref-type="bibr" rid="B7">7</xref>). We avoided taking mutations reported in GISAID, which contained already assembled genomes. Analysis of raw data for some of these genomes reveals cases of incomplete and often peculiar patterns of genome coverage (<xref ref-type="supplementary-material" rid="SM1">
<bold>Supplementary Figure&#xa0;1</bold>
</xref>). Such genomes often contained segments of low genome coverage (jeopardizing mutation calling or producing massive sections with missing data) and it was not possible to tell how reliable these assemblies were. We did not aim to detect (and did not expect to find) new variants but focused on cataloguing mutations in representative well-sequenced samples. By calling mutations from genomes with controlled coverage, we increased the consistency of the collected substitution data, which are provided in SARSNTdb to illustrate substitution trends. In addition to selection acting upon immunity evasion or similar fitness gains, such trends may reveal interesting patterns related to a balance of selective forces versus random mutations related to basic viral processes, as shown in SARS-CoV (<xref ref-type="bibr" rid="B8">8</xref>).</p>
</sec>
<sec id="s2" sec-type="materials|methods">
<title>Material and methods</title>
<sec id="s2_1">
<title>Sequencing data collection and processing</title>
<p>We downloaded sequencing data as SRA files from NCBI&#x2019;s SRA using the prefetch feature. Selection of files was complicated by the divergent numbers of samples submitted in different projects, so we selected them from several labs across the world that provided large batches of samples over extended periods of time. We reasoned that such labs would likely have perfected the sequencing process and provided reliable samples for different VOCs.</p>
<p>Thus we obtained &gt;35,000 samples and filtered them as follows. Samples under 10 MB in size were excluded as they had low read depth overall, preventing reliable detection of single-nucleotide variants (SNVs). We selected samples with &lt;700 nt of zero read depth over the whole genome, to avoid the effects of biased coverage and of lack of coverage for potential SNVs (<xref ref-type="supplementary-material" rid="SM1">
<bold>Supplementary Figure&#xa0;1</bold>
</xref>). To process the data in a consistent way, we used fastq files (when available) or unaligned the reads from BAM files using Samtools (<xref ref-type="bibr" rid="B9">9</xref>) to convert them to fastq files. We then aligned the fastq files to the SARS-CoV-2 reference sequence (<xref ref-type="bibr" rid="B10">10</xref>) (NC_045512.2) using BWA mem.</p>
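<p>The two sample-level filters described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function names are hypothetical, per-position read depths are assumed to be already available as a list, and 10 MB is assumed to mean 10,000,000 bytes.</p>
<preformat>
```python
GENOME_LENGTH = 29903               # length of the NC_045512.2 reference
MIN_FILE_BYTES = 10 * 1000 * 1000   # assumption: 10 MB taken as 10^7 bytes
MAX_ZERO_DEPTH_NT = 700             # samples must have fewer zero-depth positions


def count_zero_depth(depths):
    """Count genome positions that no read covers."""
    return sum(1 for depth in depths if depth == 0)


def passes_filters(file_bytes, depths):
    """Return True if a sample meets both criteria described in the text.

    file_bytes: size of the downloaded sample file (hypothetical input)
    depths: read depth at every reference position (hypothetical input)
    """
    if MIN_FILE_BYTES > file_bytes:
        return False  # too small, so overall read depth is likely low
    return MAX_ZERO_DEPTH_NT > count_zero_depth(depths)


# Toy example: 100 uncovered positions, the rest covered 30-fold.
example_depths = [0] * 100 + [30] * (GENOME_LENGTH - 100)
print(passes_filters(20 * 1000 * 1000, example_depths))
```
</preformat>
<p>In practice the depth list would be derived from the alignments; how that is done is left open here.</p>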
<p>We then used GROM (<xref ref-type="bibr" rid="B11">11</xref>) to find SNVs in the data. GROM was run using default settings and the &#x201c;remove duplicates&#x201d; option, to minimize the effect of PCR duplicates. In total, we found ~25,000 unique SNVs across &gt;25,000 samples. Relevant data from the output VCF files were then consolidated into SQL files detailing the sequencing platform, coordinates, and alternate nucleotides for each sample.</p>
<p>To identify repeats in the SARS-CoV-2 genome, we analyzed the Wuhan reference sequence (<xref ref-type="bibr" rid="B10">10</xref>) using UGENE (<xref ref-type="bibr" rid="B12">12</xref>) with the default settings and selected repeats &gt;5 nt long. We then used in-house scripts to identify super-repeats, i.e., repeats that are superstrings of shorter repeats.</p>
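<p>The relation between repeats and super-repeats can be illustrated with a short sketch. It is not the in-house script or the UGENE algorithm: it uses plain exact k-mer matching as a stand-in, the function names are hypothetical, and the input is a toy sequence rather than the viral genome.</p>
<preformat>
```python
from collections import defaultdict


def find_repeats(sequence, k):
    """Map each k-mer that occurs more than once to its 1-based start coordinates.

    Exact matching only; this is a simplification, not UGENE's repeat finder.
    """
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i + 1)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}


def super_repeats(repeat, candidates):
    """Return the candidates that strictly contain `repeat` as a substring."""
    return sorted(c for c in candidates if repeat in c and c != repeat)


# Toy sequence (not viral data) in which AACAGGA occurs twice.
toy = "AACAGGATTAACAGGACC"
six_mers = find_repeats(toy, 6)
seven_mers = find_repeats(toy, 7)
print(super_repeats("AACAGG", list(six_mers) + list(seven_mers)))
```
</preformat>
<p>Here the 7-mer AACAGGA is reported as a super-repeat of AACAGG because it contains the shorter string and is itself repeated.</p>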
</sec>
<sec id="s2_2">
<title>Data on protein and RNA structure</title>
<p>
<xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref> describes the data types, sources, and tools used to generate the data that populate the database. Protein structures were obtained from the Zhang group, who used I-TASSER to predict the structures of all SARS-CoV-2 proteins (<xref ref-type="bibr" rid="B13">13</xref>). Their predictions are highly accurate for the SARS-CoV-2 proteins despite relatively few homologous sequences with available protein structures.</p>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>A display of the types, sources, and tools used to generate data that populate the database.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Data Type</th>
<th valign="top" align="center">Tool Used</th>
<th valign="top" align="center">Source</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Protein Structure Visualizations</td>
<td valign="top" align="center">I-TASSER &#x2013; M.L. based protein structure predictor</td>
<td valign="top" align="center">Zhang group (<xref ref-type="bibr" rid="B13">13</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">SHAPE reactivities of RNA</td>
<td valign="top" align="center">SHAPE-MaP</td>
<td valign="top" align="center">Yang et&#xa0;al. (<xref ref-type="bibr" rid="B14">14</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">SHAPE reactivities of RNA</td>
<td valign="top" align="center">icSHAPE</td>
<td valign="top" align="center">Sun et&#xa0;al. (<xref ref-type="bibr" rid="B15">15</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">Normalized SHAPE reactivities of RNA</td>
<td valign="top" align="center">SHAPE-MaP and DMS-MaPseq</td>
<td valign="top" align="center">Manfredonia et&#xa0;al. (<xref ref-type="bibr" rid="B16">16</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">Intragenome RNA interactions</td>
<td valign="top" align="center">SPLASH</td>
<td valign="top" align="center">Yang et&#xa0;al. (<xref ref-type="bibr" rid="B14">14</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">Repeat Detection and Coordinates</td>
<td valign="top" align="center">UGENE</td>
<td valign="top" align="center">Scripts run in-house</td>
</tr>
<tr>
<td valign="top" align="left">SNV Data</td>
<td valign="top" align="center">GROM</td>
<td valign="top" align="center">Produced in-house</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To show the secondary structure of SARS-CoV-2 genomic RNA, we collected several datasets from groups that have measured viral RNA accessibility at single-base resolution. The first of these was taken from Manfredonia et&#xa0;al. (<xref ref-type="bibr" rid="B16">16</xref>), who used SHAPE and DMS mutational profiling to produce secondary structure maps with single-base resolution. Yang et&#xa0;al. (<xref ref-type="bibr" rid="B14">14</xref>) used SHAPE-MaP to find the reactivities of the reference sequence as well as of a delta variant sequence. Finally, Sun et&#xa0;al. (<xref ref-type="bibr" rid="B15">15</xref>) used icSHAPE to map reactivities.</p>
<p>Data presented as <italic>Intragenome Interaction Data</italic> represent regions of pairwise RNA interactions across the genome. Such regions were detected <italic>via</italic> proximity ligation sequencing, performed using SPLASH in infected Vero-E6 cells (<xref ref-type="bibr" rid="B14">14</xref>).</p>
</sec>
<sec id="s2_3">
<title>Gene, protein and functional domain data</title>
<p>We obtained the coordinates of the viral non-coding regions, genes, and proteins, along with their respective nucleotide and amino acid sequences, from the NCBI record of SARS-CoV-2 (NC_045512.2). Information for SARS-CoV was retrieved in the same way from the NCBI record of the Tor2 reference sequence (NC_004718.3). We then performed a thorough literature review (across hundreds of papers) of proteins in SARS-CoV-2 and SARS-CoV to obtain their functional descriptions. Next, we identified the available coordinates of functional domains in both viruses. Using BLAST (<xref ref-type="bibr" rid="B17">17</xref>) and CLUSTAL-W (<xref ref-type="bibr" rid="B18">18</xref>), we further performed pairwise alignments of the proteins of SARS-CoV-2 and SARS-CoV to evaluate the levels of amino acid identity of the homologous functional domains. Where the coordinates of such homologous domains differed between studies, we manually curated them, produced reconciled coordinates, and transferred the domain annotations; the respective pages also list the publications describing them.</p>
</sec>
</sec>
<sec id="s3" sec-type="results">
<title>Results and discussion</title>
<p>In brief, users retrieve the data in the database <italic>via</italic> two main query hubs. One is the <italic>Genome Search</italic> page, which comprises several datasets and information retrieved from the literature. The other is the <italic>Mutation Search</italic> page (together with the <italic>Repeat</italic> page), which presents the results of our re-analysis of &gt;25,000 patient samples obtained from NCBI&#x2019;s publicly available SRA SARS-CoV-2 datasets. We interlinked these sections comprehensively to give the user an easy way to carry the findings gained in one section over to another.</p>
<p>See <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref> (top) for an overview of the data sources and functionality of the database.</p>
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Overview of the data sources and functionalities (top). A walkthrough of the case study described in text (bottom).</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fviro-02-1028335-g001.tif"/>
</fig>
<sec id="s3_1">
<title>Web app implementation and user interface</title>
<p>Users can access SARSNTdb at <uri xlink:href="https://grigoriev-lab.camden.rutgers.edu/sarsntdb/">https://grigoriev-lab.camden.rutgers.edu/sarsntdb/</uri>
</p>
<p>The website is implemented in PHP (version 7.4.29), and the SQL server is run through mysql (version 15.1) with MariaDB (distribution 10.3.34).</p>
<p>The interface of the database consists of several tabs. The <italic>Search</italic> tab has a dropdown menu that brings the user to the <italic>Genome Search</italic>, <italic>Mutation Search</italic>, and <italic>Repeat Search</italic> pages. These searches are interconnected to allow the user to take information gleaned from one search into another. The <italic>Help</italic> tab instructs the user on how to use the website by providing an example. The <italic>Reference</italic> tab brings the user to this article, where they can learn about the data sources and how the website was constructed.</p>
</sec>
<sec id="s3_2">
<title>Accessing gene, protein and functional domain details</title>
<p>The <italic>Genome Search</italic> page allows the user to specify nucleotide coordinate intervals and find information about functionally relevant regions of the SARS-CoV-2 virus that overlap or are contained between these coordinate pairs. Such regions most often correspond to genes and the functional domains they encode. This search also reports nearby repeats and intragenomic interactions obtained using the SPLASH technique (<xref ref-type="bibr" rid="B14">14</xref>).</p>
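<p>Conceptually, a coordinate query of this kind reduces to an interval-overlap test. The sketch below illustrates the idea only and does not reproduce the PHP/SQL implementation of the website: the <italic>Feature</italic> type and function name are hypothetical, and the three coordinate pairs are illustrative values that should be checked against the NC_045512.2 record before any reuse.</p>
<preformat>
```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Illustrative coordinates only; verify against the reference record.
FEATURES = [
    Feature("S", 21563, 25384),
    Feature("N", 28274, 29533),
    Feature("ORF9b", 28284, 28577),
]


def overlapping(features, query_start, query_end):
    """Return names of features sharing at least one position with the query."""
    return [
        f.name
        for f in features
        if f.end >= query_start and query_end >= f.start
    ]


# A single coordinate is a query interval of length one.
print(overlapping(FEATURES, 28311, 28311))
```
</preformat>
<p>Two closed intervals overlap exactly when each one starts no later than the other ends, which is why a single position can be reported in two overlapping genes.</p>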
<p>One can also select a single ORF or Nsp from a menu to get to such genes. Their protein products are described on the <italic>Protein Detail</italic> page (<xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref>). In addition to images of the predicted structure of the SARS-CoV-2 proteins, their functional domains, smaller motifs, and certain amino acid residues with annotated functionality are also displayed graphically. At the bottom of this page are the relevant RNA and protein sequences derived from the respective NCBI reference.</p>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>The <italic>Genome Detail</italic> page of the S protein. Data presented includes a protein structure simulation, functional detail, coordinates for start and end, domain map/ranges, and more. This page links to the SARS-CoV vs SARS-CoV-2 comparison page and reference sequences for the gene and protein.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fviro-02-1028335-g002.tif"/>
</fig>
<p>Similar to evaluations of SARS-CoV-2 mutations using variant effect predictors (<xref ref-type="bibr" rid="B19">19</xref>), a link to online analysis of all possible amino acid substitutions for a given protein by SNAP2 (<xref ref-type="bibr" rid="B20">20</xref>) is also provided here. A single click copies the protein FASTA sequence and redirects to the SNAP2 server, so users just need to paste it there, start the analysis, and view the results.</p>
<p>Knowing that a substitution occurred in a protein domain with a certain function may provide more specific information on the functional effect. Since SARS-CoV-2 domains are typically derived from the previous body of work on SARS-CoV, we devoted a special page to each protein in both viruses for comparing domains. This page is linked from the <italic>Protein</italic> page and contains a table detailing the similarities of the two viruses and an alignment of both protein sequences created using CLUSTALW (<xref ref-type="bibr" rid="B18">18</xref>) and BLAST (<xref ref-type="bibr" rid="B17">17</xref>). The coordinates in the table are derived from primary literature and review papers (which can be accessed by clicking the hyperlinks on the coordinates) and sometimes they differ, despite being reasonably well aligned. These pages also illustrate the degree of conservation between the two viruses, and links to mutations for each domain are also provided.</p>
</sec>
<sec id="s3_3">
<title>Visualization of mutation and RNA structure details</title>
<p>The <italic>Mutation Search</italic> page allows the user to search for mutations in a nucleotide range or within a gene. The search results are bar graphs depicting the number and type of substitutions in the range. The bars are also subdivided by sequencing platform. If a more granular view of the mutations is needed, the user can click on <italic>Mutation Detail</italic> to see expanded information related to the mutations in that nucleotide range. Below the <italic>Substitution Frequency</italic> table there is a histogram that displays SNV frequency across the selected nucleotide range.</p>
<p>Also on this page are SHAPE data that may help the user understand why certain regions may be conserved due to secondary structure constraints. When the searched region is large, the SHAPE data are displayed in intervals, with the SHAPE value averaged across each interval. If the searched interval is under 100 nucleotides, the SHAPE value of each position is displayed individually. A SHAPE value above 0.5 is displayed in blue, indicating high reactivity, while a value below 0.5 is displayed in red, indicating low reactivity. We selected several SHAPE datasets and displayed them in separate graphs. These data make up our <italic>Mutation Search</italic> page and are visualized on the page using CanvasJS (Fenopix Inc.).</p>
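<p>The display rule for SHAPE values can be sketched as below. This is an illustration under stated assumptions rather than the code behind the website: the function names and the default interval width of 50 nt are hypothetical, a value of exactly 0.5 is arbitrarily treated as low reactivity, and positions without a measured value are not handled.</p>
<preformat>
```python
HIGH_REACTIVITY_THRESHOLD = 0.5
PER_POSITION_LIMIT = 100  # shorter searches are shown nucleotide by nucleotide


def reactivity_color(value):
    """Blue marks high reactivity, red marks low reactivity.

    Treating exactly 0.5 as low is an assumption of this sketch.
    """
    return "blue" if value > HIGH_REACTIVITY_THRESHOLD else "red"


def summarize_shape(values, start, bin_size=50):
    """Return (first_coord, last_coord, mean_value, color) rows for display.

    values: SHAPE reactivities for consecutive positions (hypothetical input)
    start: genome coordinate of the first value
    bin_size: interval width for long searches (hypothetical default)
    """
    if PER_POSITION_LIMIT > len(values):
        bin_size = 1  # short interval: one row per nucleotide
    rows = []
    for offset in range(0, len(values), bin_size):
        chunk = values[offset:offset + bin_size]
        mean = sum(chunk) / len(chunk)
        first = start + offset
        rows.append((first, first + len(chunk) - 1, mean, reactivity_color(mean)))
    return rows


# Two positions only, so each value is reported on its own row.
print(summarize_shape([0.2, 0.8], start=100))
```
</preformat>
<p>Averaging trades resolution for readability, so a low interval mean can still hide individual highly reactive positions; the per-nucleotide view is the one to consult for a specific coordinate.</p>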
</sec>
<sec id="s3_4">
<title>Repeats in the genome</title>
<p>The <italic>Repeat Page</italic> (<xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3</bold>
</xref>) allows the user to search the SARS-CoV-2 reference sequence for repeats of 6 nucleotides or longer. Displayed on this page is the genome schematic with proteins colored distinctly. When a repeat is found, red lines appear on the genome indicating the repeat locations, and a table is displayed listing the coordinates of the repeats and the proteins in which they appear. Also available are repeats that are super-strings containing the searched repeat; these are deemed super-repeats. For example, the repeat AACAGGA is a super-repeat of AACAGG as the former is a super-string of (i.e., contains) the latter. Clicking on these super-repeats brings the user to a <italic>Repeat Page</italic> for the super-repeat (with their respective super-repeats, if available). For the default search on the <italic>Repeat Page</italic> and a clear biological example, we provide the minimal repeat of the transcription regulatory sequence (TRS) from the SARS-CoV-2 virus (<xref ref-type="bibr" rid="B5">5</xref>), with all locations of canonical TRS visualized throughout the genome for the user.</p>
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>Visualization of the leader sequence repeat ACGAAC across the genome. This page visualizes the location of repeats across the genome visible as red ticks on top of the genome. In addition, the locations of repeats within coding regions are listed in a table where each protein name is a link to its genome detail page.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fviro-02-1028335-g003.tif"/>
</fig>
</sec>
<sec id="s3_5">
<title>Case study</title>
<p>As stated previously, there are many databases tracking the waves of VOCs and their typical mutations. The virus continues to evolve, and even the general public is made aware of new substitutions in the best-annotated spike protein. When new mutations appear, it is important to be able to quickly identify where they occur and analyze their effects by detecting genome features nearby. Furthermore, substitutions take place not only in the spike protein, yet those affecting other parts of the genome are typically ignored in the databases and in many analyses.</p>
<p>In contrast, SARSNTdb could be an excellent starting point for such quick evaluation, and a schematic walkthrough is shown in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref> (bottom). Consider the mutation C28311T, found in Omicron. Let us first go to our <italic>Genome Search</italic> page, input the coordinate 28311, and find that it is part of two overlapping genes, encoding the Nucleocapsid (N) protein as well as ORF9b. In N it is located in the N-terminal arm/intrinsically disordered region. In ORF9b we see that it is part of the site that interacts with the host protein NEMO. We also see that it is a part of some common repeats and has intragenomic interactions at the 5&#x2032; end of the protein as well as with a region 200 nt away. These close intragenomic interacting regions could form pockets that may become therapeutic targets (<xref ref-type="bibr" rid="B15">15</xref>). Clicking <italic>View Details</italic> for ORF9b, we find its function and see that it suppresses the innate immune system through regulating Mitochondrial Antiviral Signaling pathways (<xref ref-type="bibr" rid="B21">21</xref>). In comparing it to SARS-CoV, we find that this domain is not well conserved, with only 63% similarity overall. In the paper linked <italic>via</italic> the domain coordinates in the table, we find that this region, when deleted, resulted in a loss of function of the protein and of its interaction with NEMO (<xref ref-type="bibr" rid="B21">21</xref>). If this nucleotide change results in a non-synonymous mutation, it could affect the function of the protein. By clicking <italic>Mutations</italic> on the table, we are brought to the mutation page showing the NEMO interaction region&#x2019;s mutation frequencies, SHAPE scores and, if we click Detail, a breakdown showing whether that specific mutation has been found. If it has been found, the detail page will also show the type of variant it creates. In this case the SNV has been found previously in thousands of samples, where it changed a proline to a serine.
In addition, the SHAPE score of this nucleotide is low according to all datasets, indicating that it may be prone to forming secondary structures within the RNA. Overall, from all these results we can conclude that this mutation should be monitored, as it has been persisting over time and now, with Omicron spreading rapidly, may be gaining increased prevalence. This mutation could affect the ability of ORF9b to suppress the innate immune system through interacting with NEMO, and its effects should be explored further.</p>
</sec>
<sec id="s3_6">
<title>Conclusion</title>
<p>SARSNTdb is a database for users of varying levels of knowledge about virology or genomics. It provides nucleotide-level functional information about various aspects of the SARS-CoV-2 genome. It features a quick and easy coordinate-based search for SARS-CoV-2 gene and protein functions, mutations found in patient samples, structural and sequence elements of the virus RNA and several other features. We reviewed, analyzed, and provided visualization for data that could help users to better understand the virus, and to do this rapidly. We will continue to add mutation data as we process other representative samples from the NCBI to detect SNVs in those with GROM on a regular basis. In addition, should new domain definitions be discovered they will also be added to the database.</p>
</sec>
</sec>
<sec id="s4" sec-type="data-availability">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="supplementary-material" rid="SM1">
<bold>Supplementary Material</bold>
</xref>. Further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s5" sec-type="author-contributions">
<title>Author contributions</title>
<p>JO - developed code and server, produced content, tested database, wrote paper; JK - collected and analyzed patient samples, analyzed repeats, tested database; OB - produced domain annotations and alignments, contributed to Help section, tested database; SV - contributed to domain annotations, tested database; AG - conceived project, obtained funding, oversaw project execution, tested database, wrote paper. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s6" sec-type="funding-information">
<title>Funding</title>
<p>The work in AG&#x2019;s lab is supported by the National Science Foundation [MCB-2027611 to AG] and National Institutes of Health [R15CA220059 to AG]. Funding for open access publication fees - National Science Foundation [MCB-2027611 to AG].</p>
</sec>
<sec id="s7" sec-type="acknowledgment">
<title>Acknowledgments</title>
<p>We thank Jim Schmincke for excellent technical help and Sudheshna Bodapati for contributions to an early prototype.</p>
</sec>
<sec id="s8" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s9" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<sec id="s10" sec-type="supplementary-material">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fviro.2022.1028335/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fviro.2022.1028335/full#supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet_1.docx" id="SM1" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<label>1</label>
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Hodcroft</surname> <given-names>EB</given-names>
</name>
</person-group>. <source>CoVariants: SARS-CoV-2 mutations and variants of interest</source> (<year>2021</year>). Available at: <uri xlink:href="https://covariants.org/">https://covariants.org/</uri>.</citation>
</ref>
<ref id="B2">
<label>2</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Khare</surname> <given-names>S</given-names>
</name>
<name>
<surname>Gurry</surname> <given-names>C</given-names>
</name>
<name>
<surname>Freitas</surname> <given-names>L</given-names>
</name>
<name>
<surname>Schultz</surname> <given-names>MB</given-names>
</name>
<name>
<surname>Bach</surname> <given-names>G</given-names>
</name>
<name>
<surname>Diallo</surname> <given-names>A</given-names>
</name>
<etal/>
</person-group>. <article-title>GISAID's role in pandemic response</article-title>. <source>China CDC Wkly</source> (<year>2021</year>) <volume>3</volume>(<issue>49</issue>):<page-range>1049&#x2013;51</page-range>. doi: <pub-id pub-id-type="doi">10.46234/ccdcw2021.255</pub-id>
</citation>
</ref>
<ref id="B3">
<label>3</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cao</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>J</given-names>
</name>
<name>
<surname>Jian</surname> <given-names>F</given-names>
</name>
<name>
<surname>Xiao</surname> <given-names>T</given-names>
</name>
<name>
<surname>Song</surname> <given-names>W</given-names>
</name>
<name>
<surname>Yisimayi</surname> <given-names>A</given-names>
</name>
<etal/>
</person-group>. <article-title>Omicron escapes the majority of existing SARS-CoV-2 neutralizing antibodies</article-title>. <source>Nature</source> (<year>2022</year>) <volume>602</volume>(<issue>7898</issue>):<page-range>657&#x2013;63</page-range>. doi: <pub-id pub-id-type="doi">10.1038/s41586-021-04385-3</pub-id>
</citation>
</ref>
<ref id="B4">
<label>4</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karim</surname> <given-names>SSA</given-names>
</name>
<name>
<surname>Karim</surname> <given-names>QA</given-names>
</name>
</person-group>. <article-title>Omicron SARS-CoV-2 variant: a new chapter in the COVID-19 pandemic</article-title>. <source>Lancet</source> (<year>2021</year>) <volume>398</volume>(<issue>10317</issue>):<page-range>2126&#x2013;8</page-range>. doi: <pub-id pub-id-type="doi">10.1016/S0140-6736(21)02758-6</pub-id>
</citation>
</ref>
<ref id="B5">
<label>5</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname> <given-names>D</given-names>
</name>
<name>
<surname>Lee</surname> <given-names>JY</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>JS</given-names>
</name>
<name>
<surname>Kim</surname> <given-names>JW</given-names>
</name>
<name>
<surname>Kim</surname> <given-names>VN</given-names>
</name>
<name>
<surname>Chang</surname> <given-names>H</given-names>
</name>
</person-group>. <article-title>The architecture of SARS-CoV-2 transcriptome</article-title>. <source>Cell</source> (<year>2020</year>) <volume>181</volume>(<issue>4</issue>):<fpage>914</fpage>&#x2013;<lpage>21.e10</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.cell.2020.04.011</pub-id>
</citation>
</ref>
<ref id="B6">
<label>6</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gobeil</surname> <given-names>SMC</given-names>
</name>
<name>
<surname>Henderson</surname> <given-names>R</given-names>
</name>
<name>
<surname>Stalls</surname> <given-names>V</given-names>
</name>
<name>
<surname>Janowska</surname> <given-names>K</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>X</given-names>
</name>
<name>
<surname>May</surname> <given-names>A</given-names>
</name>
<etal/>
</person-group>. <article-title>Structural diversity of the SARS-CoV-2 omicron spike</article-title>. <source>Mol Cell</source> (<year>2022</year>) <volume>82</volume>(<issue>11</issue>):<fpage>2050</fpage>&#x2013;<lpage>68.e6</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.molcel.2022.03.028</pub-id>
</citation>
</ref>
<ref id="B7">
<label>7</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sayers</surname> <given-names>EW</given-names>
</name>
<name>
<surname>Bolton</surname> <given-names>EE</given-names>
</name>
<name>
<surname>Brister</surname> <given-names>JR</given-names>
</name>
<name>
<surname>Canese</surname> <given-names>K</given-names>
</name>
<name>
<surname>Chan</surname> <given-names>J</given-names>
</name>
<name>
<surname>Comeau</surname> <given-names>DC</given-names>
</name>
<etal/>
</person-group>. <article-title>Database resources of the national center for biotechnology information</article-title>. <source>Nucleic Acids Res</source> (<year>2022</year>) <volume>50</volume>(<issue>D1</issue>):<fpage>D20</fpage>&#x2013;<lpage>6</lpage>. doi: <pub-id pub-id-type="doi">10.1093/nar/gkab1112</pub-id>
</citation>
</ref>
<ref id="B8">
<label>8</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grigoriev</surname> <given-names>A</given-names>
</name>
</person-group>. <article-title>Mutational patterns correlate with genome organization in SARS and other coronaviruses</article-title>. <source>Trends Genet</source> (<year>2004</year>) <volume>20</volume>(<issue>3</issue>):<page-range>131&#x2013;5</page-range>. doi: <pub-id pub-id-type="doi">10.1016/j.tig.2004.01.009</pub-id>
</citation>
</ref>
<ref id="B9">
<label>9</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>H</given-names>
</name>
<name>
<surname>Handsaker</surname> <given-names>B</given-names>
</name>
<name>
<surname>Wysoker</surname> <given-names>A</given-names>
</name>
<name>
<surname>Fennell</surname> <given-names>T</given-names>
</name>
<name>
<surname>Ruan</surname> <given-names>J</given-names>
</name>
<name>
<surname>Homer</surname> <given-names>N</given-names>
</name>
<etal/>
</person-group>. <article-title>The Sequence Alignment/Map format and SAMtools</article-title>. <source>Bioinformatics</source> (<year>2009</year>) <volume>25</volume>(<issue>16</issue>):<page-range>2078&#x2013;9</page-range>. doi: <pub-id pub-id-type="doi">10.1093/bioinformatics/btp352</pub-id>
</citation>
</ref>
<ref id="B10">
<label>10</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname> <given-names>F</given-names>
</name>
<name>
<surname>Zhao</surname> <given-names>S</given-names>
</name>
<name>
<surname>Yu</surname> <given-names>B</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>YM</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>W</given-names>
</name>
<name>
<surname>Song</surname> <given-names>ZG</given-names>
</name>
<etal/>
</person-group>. <article-title>A new coronavirus associated with human respiratory disease in China</article-title>. <source>Nature</source> (<year>2020</year>) <volume>579</volume>(<issue>7798</issue>):<page-range>265&#x2013;9</page-range>. doi: <pub-id pub-id-type="doi">10.1038/s41586-020-2008-3</pub-id>
</citation>
</ref>
<ref id="B11">
<label>11</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Smith</surname> <given-names>SD</given-names>
</name>
<name>
<surname>Kawash</surname> <given-names>JK</given-names>
</name>
<name>
<surname>Grigoriev</surname> <given-names>A</given-names>
</name>
</person-group>. <article-title>Lightning-fast genome variant detection with GROM</article-title>. <source>GigaScience</source> (<year>2017</year>) <volume>6</volume>(<issue>10</issue>):<fpage>gix091</fpage>. doi: <pub-id pub-id-type="doi">10.1093/gigascience/gix091</pub-id>
</citation>
</ref>
<ref id="B12">
<label>12</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Okonechnikov</surname> <given-names>K</given-names>
</name>
<name>
<surname>Golosova</surname> <given-names>O</given-names>
</name>
<name>
<surname>Fursov</surname> <given-names>M</given-names>
</name>
<collab>UGENE team</collab>
</person-group>. <article-title>Unipro UGENE: a unified bioinformatics toolkit</article-title>. <source>Bioinformatics</source> (<year>2012</year>) <volume>28</volume>(<issue>8</issue>):<page-range>1166&#x2013;7</page-range>. doi: <pub-id pub-id-type="doi">10.1093/bioinformatics/bts091</pub-id>
</citation>
</ref>
<ref id="B13">
<label>13</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zheng</surname> <given-names>W</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>C</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Pearce</surname> <given-names>R</given-names>
</name>
<name>
<surname>Bell</surname> <given-names>EW</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Y</given-names>
</name>
</person-group>. <article-title>Folding non-homologous proteins by coupling deep-learning contact maps with I-TASSER assembly simulations</article-title>. <source>Cell Rep Methods</source> (<year>2021</year>) <volume>1</volume>(<issue>3</issue>):<fpage>100014</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.crmeth.2021.100014</pub-id>
</citation>
</ref>
<ref id="B14">
<label>14</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname> <given-names>SL</given-names>
</name>
<name>
<surname>DeFalco</surname> <given-names>L</given-names>
</name>
<name>
<surname>Anderson</surname> <given-names>DE</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Aw</surname> <given-names>JGA</given-names>
</name>
<name>
<surname>Lim</surname> <given-names>SY</given-names>
</name>
<etal/>
</person-group>. <article-title>Comprehensive mapping of SARS-CoV-2 interactions <italic>in vivo</italic> reveals functional virus-host interactions</article-title>. <source>Nat Commun</source> (<year>2021</year>) <volume>12</volume>(<issue>1</issue>):<fpage>5113</fpage>. doi: <pub-id pub-id-type="doi">10.1038/s41467-021-25357-1</pub-id>
</citation>
</ref>
<ref id="B15">
<label>15</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname> <given-names>L</given-names>
</name>
<name>
<surname>Li</surname> <given-names>P</given-names>
</name>
<name>
<surname>Ju</surname> <given-names>X</given-names>
</name>
<name>
<surname>Rao</surname> <given-names>J</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>W</given-names>
</name>
<name>
<surname>Ren</surname> <given-names>L</given-names>
</name>
<etal/>
</person-group>. <article-title><italic>In vivo</italic> structural characterization of the SARS-CoV-2 RNA genome identifies host proteins vulnerable to repurposed drugs</article-title>. <source>Cell</source> (<year>2021</year>) <volume>184</volume>(<issue>7</issue>):<fpage>1865</fpage>&#x2013;<lpage>83.e20</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.cell.2021.02.008</pub-id>
</citation>
</ref>
<ref id="B16">
<label>16</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Manfredonia</surname> <given-names>I</given-names>
</name>
<name>
<surname>Nithin</surname> <given-names>C</given-names>
</name>
<name>
<surname>Ponce-Salvatierra</surname> <given-names>A</given-names>
</name>
<name>
<surname>Ghosh</surname> <given-names>P</given-names>
</name>
<name>
<surname>Wirecki</surname> <given-names>TK</given-names>
</name>
<name>
<surname>Marinus</surname> <given-names>T</given-names>
</name>
<etal/>
</person-group>. <article-title>Genome-wide mapping of SARS-CoV-2 RNA structures identifies therapeutically-relevant elements</article-title>. <source>Nucleic Acids Res</source> (<year>2020</year>) <volume>48</volume>(<issue>22</issue>):<page-range>12436&#x2013;52</page-range>. doi: <pub-id pub-id-type="doi">10.1093/nar/gkaa1053</pub-id>
</citation>
</ref>
<ref id="B17">
<label>17</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Camacho</surname> <given-names>C</given-names>
</name>
<name>
<surname>Coulouris</surname> <given-names>G</given-names>
</name>
<name>
<surname>Avagyan</surname> <given-names>V</given-names>
</name>
<name>
<surname>Ma</surname> <given-names>N</given-names>
</name>
<name>
<surname>Papadopoulos</surname> <given-names>J</given-names>
</name>
<name>
<surname>Bealer</surname> <given-names>K</given-names>
</name>
<etal/>
</person-group>. <article-title>BLAST+: architecture and applications</article-title>. <source>BMC Bioinf</source> (<year>2009</year>) <volume>10</volume>:<fpage>421</fpage>. doi: <pub-id pub-id-type="doi">10.1186/1471-2105-10-421</pub-id>
</citation>
</ref>
<ref id="B18">
<label>18</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thompson</surname> <given-names>JD</given-names>
</name>
<name>
<surname>Higgins</surname> <given-names>DG</given-names>
</name>
<name>
<surname>Gibson</surname> <given-names>TJ</given-names>
</name>
</person-group>. <article-title>CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice</article-title>. <source>Nucleic Acids Res</source> (<year>1994</year>) <volume>22</volume>(<issue>22</issue>):<page-range>4673&#x2013;80</page-range>. doi: <pub-id pub-id-type="doi">10.1093/nar/22.22.4673</pub-id>
</citation>
</ref>
<ref id="B19">
<label>19</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mishra</surname> <given-names>D</given-names>
</name>
<name>
<surname>Suri</surname> <given-names>GS</given-names>
</name>
<name>
<surname>Kaur</surname> <given-names>G</given-names>
</name>
<name>
<surname>Tiwari</surname> <given-names>M</given-names>
</name>
</person-group>. <article-title>Comparative insight into the genomic landscape of SARS-CoV-2 and identification of mutations associated with the origin of infection and diversity</article-title>. <source>J Med Virol</source> (<year>2021</year>) <volume>93</volume>(<issue>4</issue>):<page-range>2406&#x2013;19</page-range>. doi: <pub-id pub-id-type="doi">10.1002/jmv.26744</pub-id>
</citation>
</ref>
<ref id="B20">
<label>20</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hecht</surname> <given-names>M</given-names>
</name>
<name>
<surname>Bromberg</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Rost</surname> <given-names>B</given-names>
</name>
</person-group>. <article-title>Better prediction of functional effects for sequence variants</article-title>. <source>BMC Genomics</source> (<year>2015</year>) <volume>16</volume>(<issue>8</issue>):<fpage>S1</fpage>. doi: <pub-id pub-id-type="doi">10.1186/1471-2164-16-S8-S1</pub-id>
</citation>
</ref>
<ref id="B21">
<label>21</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname> <given-names>J</given-names>
</name>
<name>
<surname>Shi</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Pan</surname> <given-names>X</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>S</given-names>
</name>
<name>
<surname>Hou</surname> <given-names>R</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Y</given-names>
</name>
<etal/>
</person-group>. <article-title>SARS-CoV-2 ORF9b inhibits RIG-I-MAVS antiviral signaling by interrupting K63-linked ubiquitination of NEMO</article-title>. <source>Cell Rep</source> (<year>2021</year>) <volume>34</volume>(<issue>7</issue>):<fpage>108761</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.celrep.2021.108761</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>