GRAND: An Integrated Genome, Transcriptome Resources, and Gene Network Database for Gossypium

With the increasing amount of cotton omics data, breeding scientists are confronted with the question of how to use massive cotton datasets to mine effective breeding information. Here, we construct the Gossypium Resource And Network Database (GRAND), which integrates 18 cotton genome sequences, genome annotations, two sets of cotton genome variation data, and four transcriptomes for Gossypium species. GRAND allows users to explore and mine these data with a toolbox that comprises a flexible search system, a BLAST and BLAT suite, orthologous gene ID lookup, networks of co-expressed genes, primer design, Gbrowse and Jbrowse, and drawing tools. GRAND provides important information regarding Gossypium resources and can hopefully accelerate the breeding of cotton varieties.


INTRODUCTION
Cotton (Gossypium spp.) produces natural fiber for the textile industry worldwide and is also an important source of edible oil. The Gossypium genus includes more than 50 species and is an excellent model for studying genome evolution and polyploidization. Moreover, multiple high-quality de novo assembled genomes of Gossypium have been reported in recent years. These genomes show considerable improvements in quality and contiguity compared to previously assembled draft genomes. For example, the high-quality genomes of Gossypium arboreum (Du et al., 2018), Gossypium australe (Cai et al., 2019), Gossypium raimondii and Gossypium turneri (Udall et al., 2019), and Gossypium davidsonii and Gossypium thurberi (Yang et al., 2021) were sequenced and released in 2018, 2019, 2019, and 2021, respectively. Genomes of tetraploid Gossypium barbadense and Gossypium hirsutum were de novo sequenced and released by Hu et al. (2019) and Wang et al. (2019), respectively. Yang et al. (2019) also sequenced and assembled the genomes of two upland cotton cultivars, TM-1 and zhongmiansuo24 (ZM24). The assembly of these diploid and tetraploid cotton genomes has moved Gossypium research into the genomics and pan-genomics era. However, how to effectively integrate and use the large number of cotton datasets to mine valuable information for cotton researchers has become an important research focus.
Several online cotton databases have been developed worldwide. CottonGen (Yu et al., 2014) is a relatively comprehensive cotton database with a collection of cotton genomes, genetic markers, and breeding germplasm accessions, but it is not always user-friendly, and its functional modules need further expansion. ccNet (You et al., 2017) is a co-expression network database of diploid G. arboreum and polyploid G. hirsutum. CottonFGD is a cotton functional genome database that integrates cotton genomes and transcriptomes as well as sequence retrieval, analysis, and visualization modules, but it does not contain genetic data, such as molecular markers. MaGenDB (Wang et al., 2020) focuses on constructing an integrative database of 13 Malvaceae species, including cotton, to enable users to jointly compare and analyze relevant data. COTTONOMICS (http://cotton.zju.edu.cn/) is a comparative genomics platform and variation database for G. hirsutum and G. barbadense. CottonGVD (Peng et al., 2021) is a cotton database specifically focused on the visualization of trait-associated loci. Therefore, it is necessary to build a cotton database that systematically brings together the latest cotton genomes, transcriptomes, and molecular marker data.
To meet this goal, we developed GRAND, a comprehensive cotton database that integrates high-quality genomic and transcriptomic resources of cotton and provides tools for multi-level integrative analysis. GRAND offers a systematic view of genomic and transcriptomic information and integrates gene searching, gene list analysis, and visualization tools (such as Expression Visualization, Heatmap Draw, KEGG Dot Plot, and the Annotation function). All data in this omics database for cotton (Gossypium spp.) can be freely accessed and downloaded.

Data Sources
The sequences of 18 cotton genome assemblies representing 14 Gossypium species and their respective gene annotation data, together with four transcriptomes, were downloaded directly from relevant databases or sequenced by our laboratory and further used in GRAND (  Du et al. (2018) and Yang et al. (2019) of the Institute of Cotton Research (ICR), respectively. Illumina reads were aligned to the references G. hirsutum_TM-1_ICR (Yang et al., 2019), G. hirsutum_ZM24_ICR (Yang et al., 2019), and G. arboreum_ICR (Du et al., 2018), respectively, using TopHat 2.1.1 (Kim et al., 2013). Gene expression was then quantified with Cufflinks version 2.2.1. Although these data are freely shared on their source websites, no tools are provided there for analyzing them online, and the inconsistent formats of the different datasets make it more difficult to use them jointly. GRAND addresses both problems, and all the data are available for free download from GRAND.
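As a point of reference for the expression values (FPKM) reported later in the database, the unit can be illustrated with its textbook definition: fragments per kilobase of transcript per million mapped fragments. The sketch below is only that definition in code. Cufflinks itself estimates abundances with a statistical model, so this is not the computation performed by the pipeline, and the function name and the example numbers are hypothetical.

```python
def fpkm(fragment_count: int, gene_length_bp: int, total_mapped_fragments: int) -> float:
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments.

    Illustrative only; Cufflinks derives its estimates from a statistical model.
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical numbers: 500 fragments on a 2,000 bp transcript
# in a library of 20 million mapped fragments.
example = fpkm(500, 2_000, 20_000_000)
print(example)
```

Normalizing by both transcript length and library size is what makes values comparable across genes and across samples of different sequencing depth.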

Development of Database and Website
The GRAND database runs on the Linux operating system, using J2EE as the framework, MySQL as the back-end database, and Apache Tomcat as the server. Genome sequence, annotation, expression, and variation data are stored in the MySQL database. A web interface based on JavaServer Pages (JSP), HTML5, and CSS3 enables end-users to access GRAND data through any modern browser on any kind of device. The GRAND database is hosted on a server equipped with eight 14-core Intel Xeon Gold 5120 processors.
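To make the storage layer more concrete, the following minimal sketch shows how gene records could be kept in a relational table and retrieved by chromosomal region. It rests on several assumptions: the actual GRAND schema is not described here, so the table name, column names, and demo records are all hypothetical, and SQLite (from the Python standard library) stands in for MySQL only so that the example is self-contained.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL back end (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gene (               -- hypothetical table, not the GRAND schema
        gene_id    TEXT PRIMARY KEY,
        assembly   TEXT NOT NULL,
        chrom      TEXT NOT NULL,
        start_bp   INTEGER NOT NULL,
        end_bp     INTEGER NOT NULL,
        annotation TEXT
    )
    """
)
# An index on the region columns keeps region lookups fast on large tables.
conn.execute("CREATE INDEX idx_gene_region ON gene (assembly, chrom, start_bp, end_bp)")

# Made-up demo records; identifiers and coordinates are placeholders.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("demo_gene_1", "demo_assembly", "A01", 1000, 5000, "placeholder annotation"),
        ("demo_gene_2", "demo_assembly", "A01", 8000, 9500, "placeholder annotation"),
        ("demo_gene_3", "demo_assembly", "D03", 2000, 3000, "placeholder annotation"),
    ],
)


def genes_in_region(conn, assembly, chrom, start_bp, end_bp):
    """Return IDs of genes overlapping [start_bp, end_bp], ordered by start."""
    cur = conn.execute(
        "SELECT gene_id FROM gene "
        "WHERE assembly = ? AND chrom = ? AND start_bp <= ? AND end_bp >= ? "
        "ORDER BY start_bp",
        (assembly, chrom, end_bp, start_bp),
    )
    return [row[0] for row in cur]


hits = genes_in_region(conn, "demo_assembly", "A01", 4000, 8500)
print(hits)
```

The overlap condition (gene start not after the region end, gene end not before the region start) returns genes that only partially fall inside the queried interval, which is usually what a region search is expected to do.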
Based on gene expression data, networks of co-expressed genes for three cotton assemblies (G. arboreum_ICR, G. hirsutum_TM-1_ICR, and G. hirsutum_ZM24_ICR) were constructed using Pearson's correlation coefficient (PCC) values between pairs of genes and visualized with the JavaScript library Cytoscape.js. For a given query gene, the network of the top 20 target genes with the highest correlation values with the query gene is shown. In addition, a summary table of all co-expressed genes and their corresponding functional annotations is provided below the network.
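The ranking step described above can be sketched as follows. This is a minimal illustration, not the GRAND implementation: the function names, the toy expression vectors, and the choice to rank by the signed (rather than absolute) correlation value are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant expression: correlation undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def top_coexpressed(expression, query, top_n=20, min_pcc=0.0):
    """Return up to top_n (gene, PCC) pairs with the highest PCC to the query gene."""
    query_profile = expression[query]
    scored = [
        (gene, pearson(query_profile, profile))
        for gene, profile in expression.items()
        if gene != query
    ]
    kept = [(gene, r) for gene, r in scored if r >= min_pcc]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


# Toy expression profiles across four samples (made-up values).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
neighbours = top_coexpressed(expression, "geneA", top_n=2, min_pcc=0.5)
print(neighbours)
```

The resulting (gene, PCC) pairs map directly onto the edges of a star-shaped network centered on the query gene, which is the structure a viewer such as Cytoscape.js would render.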

Overview of Website Structure and Function
To provide users with a wealth of information about cotton, the GRAND database was built containing the latest and most comprehensive Gossypium genomic/transcriptomic datasets (including 18 assembled genomes and four transcriptomes; Figure 1; Table 1). The main structure of GRAND is shown in Figure 1 with four major modules: Browse, Search, Tools, and Download. GRAND provides search functions for various genomic information, including gene annotation information (KOG, GO, KEGG, and NR), gene sequences, genome variations (SNPs and INDELs), expression profiles, and gene families, by entering a chromosomal region or a longest-transcript ID. GRAND also integrates the genome visualization tools Gbrowse (Stein et al., 2002) and Jbrowse (Buels et al., 2016), allowing users to instantly browse, visualize, and retrieve sequence data, and offers gene co-expression networks for different developmental stages and tissues/organs. Moreover, GRAND provides a toolbox for online analysis, such as BLAST- and BLAT-based sequence comparisons, orthologous gene ID lookup across different species, and PCR primer design (Untergasser et al., 2012). In addition, users can download cotton data selectively or in full. Tutorials for using all the tools in the database are provided in the Help module. This information in GRAND will be useful for both dry lab and wet lab biologists.
FIGURE 1 | Schematic of GRAND database structure and web interface features. GRAND gathers 18 genomes, four transcriptomes of cotton and associated genome variations, and annotations data. All data are stored in a MySQL database.

Browse Functions in GRAND
The browse detail page mainly includes the following modules: SNP Variation, INDEL Variation, Gene Annotation, and Gene Family. Users can search for the Nr, TrEMBL, KOG, KEGG, and GO annotations using the Gene Annotation module. Through cross-links, users can also retrieve the genome variations related to each gene, together with the gene's location, genome sequence, CDS sequence, transcript sequence, and peptide sequence (Figure 2A). Users can also quickly find related information about a gene family by searching with keywords for the target gene (Figure 2B). The SNP and INDEL Variation modules show the genome variation data (SNPs or INDELs) on each chromosome for a group of individuals. Users can filter the data by reference genome and by SNP or INDEL type (Figure 2C).
For example, the gene "Gh_D03G071700" is located on chromosome D03 of the G. hirsutum_TM-1_ICR genome and is annotated as the hAT family C-terminal dimerization region, which can also be found on the results page by searching the Gene Family module with PF05699. The SNP and INDEL variations can be queried directly by clicking on the Variation option and visualized by clicking on Gbrowse and Jbrowse (Figure 2), which are tools for displaying the variations (SNPs and INDELs) and gene structures of the cotton individuals on chromosomes. The Gbrowse detail page includes the following basic information about this gene: name, position in the scaffold, length, CDS parts, and sequence. The detail page of Jbrowse is displayed in a popup window showing information about the gene, SNPs, and INDELs in the 30 kb region around this gene.

Search Functions in GRAND
GRAND allows users to perform both BLAST and BLAT searches to rapidly align sequences to the genome. The BLAST search, implemented in GRAND using SequenceServer (Priyam et al., 2019), provides an interface with text-based and interactive visual outputs for searching against nucleotide and/or protein sequences with the BLASTn, BLASTp, BLASTx, tBLASTx, and tBLASTn programs. GRAND currently has a BLAST database of whole-genome sequences, CDSs, and predicted proteins for each reference genome assembly. Users can either paste DNA/protein sequences into the query box or upload a FASTA file. The result page comprises two parts: "Graphical view" and "List view" (Figure 3A). The Graphical view presents a brief chart of the BLAST results. The List view is a table showing detailed information on the alignments produced by the BLAST program, such as gene ID, total score, e-value, and length. Sequences in FASTA files can be annotated by comparison against databases (Nr_vs_GO, KEGG, COG, SwissProt, TrEMBL, KOG, and Pfam; Ashburner et al., 2000; Tatusov et al., 2000; Koonin et al., 2004; El-Gebali et al., 2019) in the "Anno function" section. The results of the KEGG annotation can be visualized in the "KEGG Dot Plot" section.
To make it easier for users to quickly find data of interest, the current version of GRAND has four submodules under the "Search" module. (i) Multicriteria Search. Search the genome variations (SNPs and INDELs) of each individual in this database by gene, region, or variation. Data can be filtered by variation type and genotype. (ii) Phenotype Search. Search the phenotype data of fiber-related traits, floral traits, seed-related traits, and other traits for cotton species. (iii) Comparative Search. Search and compare the genome variations (SNPs and INDELs) of two or more cotton individuals in this database by gene, region, or variation. (iv) Gene Search. Retrieve gene annotations (KOG, GO, KEGG, and NR), gene structures, and sequences by entering a chromosomal region or gene ID. The SNPs and INDELs can also be searched through cross-links (Figure 3B). After searching, a new webpage pops up and displays all the matched results. The details of each matched result can be viewed by clicking on it.
The Orthologous Gene ID function retrieves orthologous gene IDs among different cotton species and different cotton genome versions. For example, the orthologous gene IDs of gene "Ga01G0003" are "Gh_A01G000300," "Gh_D03G199000," "Ghicr24_A01G000500," and "GB_A10G2858" in other cotton species, respectively (Figure 3C). This result was consistent with the result of the BLAST search above.
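The idea behind the Comparative Search submodule can be illustrated with a short sketch that lists the variant sites in a region where two individuals carry different genotypes. This is an assumption-laden illustration rather than GRAND's own logic: the data layout, the function name, the sample names, and the genotype calls are all hypothetical.

```python
def differing_variants(genotypes, sample_a, sample_b, chrom, start_bp, end_bp):
    """List sites in [start_bp, end_bp] on chrom where two samples differ.

    genotypes maps (chrom, position) -> {sample_name: genotype_string}.
    Sites where either sample has no call are skipped.
    """
    hits = []
    for (site_chrom, pos), calls in sorted(genotypes.items()):
        if site_chrom != chrom or not (start_bp <= pos <= end_bp):
            continue
        call_a, call_b = calls.get(sample_a), calls.get(sample_b)
        if call_a is None or call_b is None:
            continue  # missing genotype: cannot be compared
        if call_a != call_b:
            hits.append((site_chrom, pos, call_a, call_b))
    return hits


# Made-up genotype calls for two hypothetical individuals.
calls = {
    ("A01", 1200): {"sample1": "A/A", "sample2": "A/G"},
    ("A01", 1800): {"sample1": "C/C", "sample2": "C/C"},  # identical genotypes
    ("A01", 2500): {"sample1": "T/T"},                    # sample2 missing
    ("A01", 9000): {"sample1": "G/G", "sample2": "T/T"},  # outside the region
    ("D03", 1500): {"sample1": "A/A", "sample2": "C/C"},  # other chromosome
}
diff = differing_variants(calls, "sample1", "sample2", "A01", 1000, 3000)
print(diff)
```

Skipping sites with a missing call is a deliberate choice in this sketch; a real comparison tool might instead report them separately so that users can distinguish "same genotype" from "not genotyped".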

Tools for Online Analysis in GRAND
In addition to the modules mentioned above, GRAND also offers several additional tools. The "Expression Visualization" section shows the expression profiles in different tissues, and users can run the analysis on a selection of data sets or on all of them. The results are presented as heatmaps, and the expression values (FPKM) for each data set are displayed in the table at the bottom (Figure 4A). Users can also import the results generated above into the "Heatmap Draw" section for further adjustment and embellishment. Gene network analysis can be used to identify related genes in the same biological processes or pathways. Networks of co-expressed genes are constructed from gene expression data using Pearson's correlation coefficient (PCC) between genes (Langfelder and Horvath, 2008). Users can enter a gene ID and set the PCC threshold to visualize the top 20 target genes with the highest correlation values with the query gene, and can click on any co-expressed gene in the network to view its own co-expression network. In addition, a summary table of all co-expressed genes and their corresponding functional annotations is provided below the network. Links to basic information about the genes are created for each target gene in the summary table (Figure 4B). GRAND provides a primer design function based on gene sequences from cotton (Figure 4C). Users can also design primers for CRISPR/Cas9/Cpf1 genome editing using the targetDesign tool via the website link (Figure 4D; Xie et al., 2017). Additionally, we provide an FTP server to store all the publicly released datasets used in GRAND, with an enhanced user interface, text preview, and directory download.

Download Functions in GRAND
The download page provides users with selective FTP download of genome sequences and their annotation information, transcriptomic data, CDS and protein sequences, and other data.

Limitations
GRAND still has some limitations. For example, only cotton genomic, transcriptomic, and phenotypic data have been collected so far; additional data, such as cotton molecular markers and metabolic data, need to be added in the future. Moreover, the current database has no sequence feature extraction tool.

CONCLUSION AND PERSPECTIVES
GRAND provides access to various data for cotton, such as genomic, transcriptomic, and phenotypic data, which can be browsed, mined, analyzed, and downloaded. Moreover, GRAND provides an interface to visualize genomes, annotated genes, gene expression, and networks of co-expressed genes. The abundant available data support high-resolution comparative genomics studies that shed light on the evolution and diversification of the various cotton species. In subsequent upgrades, a sequence feature extraction tool and newly generated cotton data will be added to the GRAND database.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
ZZ and LF wrote the initial draft. ZZ, ZhY, ZuY, MC, and LF collected, curated, and formatted the data, and tested GRAND and the usage examples. All authors were involved in reviewing and editing the manuscript.

FUNDING
This work was supported by funding from the National Natural Science Foundation of China (grant 31621005) and Xinjiang Changji Hui Autonomous Prefecture Science and Technology Projects (grant 2021Z01).

ACKNOWLEDGMENTS
We wish to thank all researchers who have generated invaluable cotton genomic resources that are gathered in the GRAND database. We thank Biomarker Technology Co., Ltd. for assisting in GRAND construction.