RIGD: A Database for Intronless Genes in the Rosaceae

Most eukaryotic genes are interrupted by one or more introns, whereas prokaryotic genomes are composed mainly of single-exon genes without introns. Because they lack introns, intronless genes in eukaryotes have become important material for comparative genomics and evolutionary biology. Although many databases on exons and introns exist, no single resource currently brings together intronless genes in plants. In this study, we constructed the Rosaceae Intronless Genes Database (RIGD), which offers a user-friendly web interface for exploring and collecting information on intronless genes from different plants. Six Rosaceae species, Pyrus bretschneideri, Pyrus communis, Malus domestica, Prunus persica, Prunus mume, and Fragaria vesca, are included in the current release of the RIGD. Sequence data and gene annotations were collected from different databases and integrated. The main purpose of this study is to provide gene sequence data. In addition, attribute analysis, functional annotations, subcellular localization prediction, and GO analysis are reported. The RIGD allows users to browse, search, and download data with ease. Blast and comparative analyses are also provided through this online database, which is available at http://www.rigdb.cn/.

only found in eukaryotes (Tine et al., 2011). Furthermore, studies on intronless genes help to address several evolutionary questions, including (1) the main factors leading to the emergence of intronless genes (gene duplication, inheritance from ancient prokaryotes, retroposition, or other mechanisms), (2) the evolutionary significance of retroposition (retrogenes are considered to be intronless), and (3) the biological origin of introns (whether the introns-early or the introns-late hypothesis is better supported).
In eukaryotes, the proportion of intronless genes varies from 2.7 to 97.7% of the genome (Louhichi et al., 2011). Researchers have so far identified intronless genes in some species of mammals, deuterostomes, bony fish, and plants (Agarwal and Gupta, 2005; Sakharkar et al., 2006; Jain et al., 2008; Zou et al., 2011). Jain et al. (2006) studied the early auxin response SAUR (small auxin-up RNA) gene family in rice and found that all 58 members of the family were intronless. While studying the functions of gene families in Arabidopsis, researchers also found a large number of intronless genes in the F-box protein family, the DEAD box RNA helicase family, and the PPR (pentatricopeptide repeat) gene family (Aubourg et al., 1999; Lecharny et al., 2003; Lurin et al., 2004). In addition, some of the largest families, such as the G-protein receptor family and the olfactory receptor family, are also composed of intronless genes (Gentles and Karlin, 1999; Takeda et al., 2002). Currently, the most studied intronless genes are the histone genes in the human genome. By studying these gene families, researchers aim to explore the role of intronless genes in life processes.
Because studying intronless genes in eukaryotes can help researchers better understand the evolutionary mechanisms of related genes and genomes, these genes have attracted increasing attention. In recent years, as research on intronless genes has progressed, the construction of intronless gene databases has also drawn considerable interest. Such databases provide important data resources for functional and evolutionary studies and make related research easier to carry out. So far, four main databases on intronless genes have been described: GENOME SEGE, IGD (Louhichi et al., 2011), PIGD (Yan et al., 2014), and IGDD (Yan et al., 2016). GENOME SEGE contains NCBI data on the intronless genes of eukaryotes; however, the website is no longer updated, and users are unable to access it. The IGD database, which includes 687 human intronless genes, was published in 2011. PIGD provides a platform for the collection, integration, and analysis of intronless genes in Poaceae. IGDD provides a comprehensive platform for researchers to explore intronless genes in dicot plants.
To build a centralized platform, we present the Rosaceae Intronless Genes Database (RIGD; http://www.rigdb.cn/). This database, with a user-friendly web interface, covers a collection of intronless genes from six genome-sequenced Rosaceae species. The RIGD integrates functional and evolutionary annotations, making it easy for researchers to find content of interest and download detailed information. The RIGD provides a comparative analysis of genome data from six species in conjunction with the Blast

Identification of Rosaceae Intronless Genes
A set of strict criteria was used to identify Rosaceae intronless genes. First, we used a Perl script to scan the genome annotation files (GFF/GFF3 format) and extract every gene with exactly one "exon" row; these genes served as candidate intronless genes for further screening. The rationale was that a single "exon" row in the annotation indicates that the coding sequence is not interrupted by an intron. Because mitochondrial and chloroplast genes generally lack spliceosomal introns, genes annotated as "Mt" (mitochondria) or "Pt" (chloroplast) were rejected. In addition, genes that were not mapped to a chromosome were removed. Genes defined as "pseudogene" or "transposable element" in the annotation files were also deleted, because a pseudogene usually cannot be transcribed or translated and is therefore not functional. Through these steps, we obtained a non-redundant set of intronless genes for the six Rosaceae species. Using the identifiers of these genes, we then used a Perl script to extract the protein and CDS sequences of the intronless genes and renumbered them according to a fixed naming rule.
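The screening steps above can be sketched as follows. This is only an illustration in Python, not the authors' Perl script, which is not shown in the text. It assumes a GFF3 file in which top-level rows have the type "gene", transcripts and exons point to their parent through the Parent attribute, and pseudogenes or transposable elements carry a feature type other than "gene". The function name `single_exon_genes` and the `chromosomes` argument are hypothetical.

```python
from collections import Counter

ORGANELLE_SEQIDS = {"Mt", "Pt"}  # organelle labels named in the text


def parse_attributes(field):
    """Turn a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def single_exon_genes(gff_lines, chromosomes):
    """Return the sorted IDs of genes that have exactly one exon row.

    Rows typed 'pseudogene' or similar are never candidates, because only
    rows typed 'gene' are collected (an assumption about the annotation).
    """
    genes = {}         # gene ID -> sequence (chromosome) name
    parent_of = {}     # transcript ID -> gene ID
    exon_parents = []  # Parent value of every exon row
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, feature_type = cols[0], cols[2]
        attrs = parse_attributes(cols[8])
        if feature_type == "gene" and "ID" in attrs:
            genes[attrs["ID"]] = seqid
        elif feature_type == "exon":
            exon_parents.extend(attrs.get("Parent", "").split(","))
        elif "ID" in attrs and "Parent" in attrs:
            parent_of[attrs["ID"]] = attrs["Parent"]
    # Map each exon to its gene, directly or through its transcript.
    exon_rows = Counter(parent_of.get(p, p) for p in exon_parents)
    return sorted(
        gene_id
        for gene_id, seqid in genes.items()
        if exon_rows[gene_id] == 1
        and seqid in chromosomes          # drops unplaced scaffolds
        and seqid not in ORGANELLE_SEQIDS  # drops Mt/Pt genes
    )
```

Reading the whole file before counting avoids depending on parent rows appearing before their children, which GFF3 files do not guarantee.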

Intronless Gene Annotation
We established the following procedure to analyze each intronless gene stored in the RIGD (Figure 1):
(1) A Perl script was used to extract the position of each intronless gene on its chromosome and to calculate the length of its protein sequence.
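As an illustration of the length calculation in step (1), the sketch below reads protein sequences in FASTA format and reports the number of residues per record. It is written in Python rather than the Perl used by the authors, and the function name `protein_lengths` is hypothetical. It assumes that the record identifier is the first word of the header line and that a trailing "*" marks a stop codon and should not be counted.

```python
def protein_lengths(fasta_lines):
    """Map each FASTA record ID to the length of its protein sequence."""
    lengths = {}
    current = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            lengths[current] = 0
        elif current is not None:
            # Ignore a terminal stop symbol, if the file includes one.
            lengths[current] += len(line.rstrip("*"))
    return lengths
```

The function accepts any iterable of lines, so an open file handle can be passed directly.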

Comparisons Between Rosaceae Species
In addition to analyzing the intronless genes of each of the six Rosaceae species, we compared the intronless genes among species in terms of (1) the number and percentage of intronless genes on each chromosome, (2) the distribution of protein length, (3) the distribution of isoelectric point (pI), (4) the distribution of molecular weight (Mw), (5) the statistics of subcellular localization, and (6) the statistics of GO classifications.
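The first comparison, the number and percentage of intronless genes per chromosome, can be sketched as below. The text does not state which denominator was used, so this Python example assumes the percentage is the share of intronless genes among all annotated genes on the same chromosome. The function name `intronless_share` and the input layout are hypothetical.

```python
from collections import Counter


def intronless_share(all_genes, intronless_ids):
    """Count intronless genes per chromosome and their share of all genes.

    all_genes: iterable of (gene_id, chromosome) pairs for every annotated gene.
    intronless_ids: IDs of the genes classified as intronless.
    Returns {chromosome: (count, percentage)}.
    """
    wanted = set(intronless_ids)
    total = Counter()
    hits = Counter()
    for gene_id, chromosome in all_genes:
        total[chromosome] += 1
        if gene_id in wanted:
            hits[chromosome] += 1
    return {
        chromosome: (hits[chromosome], 100.0 * hits[chromosome] / total[chromosome])
        for chromosome in sorted(total)
    }
```

Reporting both the count and the percentage matters here, because a chromosome with many genes can hold many intronless genes while its proportion stays close to that of the others.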

The RIGD Implementation and Web Interface
As a web-based platform, the RIGD is hosted on a Tencent cloud server running Ubuntu Server 16.04.1 LTS (64-bit). The RIGD combines the MySQL (version 8.0.17) database management system with a dynamic web interface based on PHP (version 7.3.9-1), Laravel (version 5.8), Nginx (version 1.10.3), and Perl (version 5.22.1) scripts (Figure 2).

Web Interface
The web interface of the RIGD is designed to comprise the following seven components: Home, Species, Search, Blast, Statistics, Upload&Download, and Contact Us. The RIGD provides a user-friendly interaction experience (Figure 3).

Home
The navigation bar at the top of the RIGD has seven entries. Scrolling through the home page reveals large photos of the six Rosaceae species. Each photo has a "detail" button that links to the "Species" page for that species. The RIGD's project description, author information, and contact information are also available on the home page.

Species
The Species entry opens a drop-down menu with the names of the six Rosaceae species covered in the RIGD. Selecting a species opens a page with a detailed description and a picture of that species. The page also contains a table of download links for most of the data, including the CDS and protein sequences of intronless genes, the predicted isoelectric point and protein molecular weight, the results of sequence comparison against the nr database, the results of protein domain analysis, the results of subcellular localization prediction, the results of protein function prediction, and the results of GO analysis. Several statistical charts are also shown on the page, such as the number and proportion of intronless genes on each chromosome, the statistics of subcellular localization prediction, the distribution of pI and Mw, and the statistics of GO classifications (Figure 4).

Search
In the Search interface, users can search by species name, chromosome number, predicted subcellular location, or GO number. The RIGD searches the database for matching intronless genes and lists them, and users can then click an entry to view its detailed information. In addition, the RIGD assigns each intronless gene a new identifier after processing. The rule is: abbreviation of the species name + "IG" + chromosome number + order number of the gene (starting from 1). The gene identifier from the original data is retained, and either the RIGD identifier or the original gene identifier can be used for searching (Figure 5).
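The numbering rule can be illustrated with a short Python sketch. Only the order of the parts (species abbreviation, "IG", chromosome number, order number starting from 1) comes from the text. The zero-padding widths, the ordering of genes by start position, the abbreviation "Pbr", and the function name `assign_rigd_style_ids` are assumptions made for this example and may differ from the identifiers the RIGD actually issues.

```python
def assign_rigd_style_ids(genes, species_abbr, chrom_width=2, order_width=4):
    """Build new identifiers of the form <abbr>IG<chromosome><order>.

    genes: iterable of (original_id, chromosome_number, start_position).
    Returns {original_id: new_id}; the padding widths are illustrative only.
    """
    by_chromosome = {}
    for original_id, chromosome, start in genes:
        by_chromosome.setdefault(chromosome, []).append((start, original_id))
    mapping = {}
    for chromosome in sorted(by_chromosome):
        ordered = sorted(by_chromosome[chromosome])  # assumed: order by start position
        for order, (_, original_id) in enumerate(ordered, start=1):
            mapping[original_id] = (
                f"{species_abbr}IG{chromosome:0{chrom_width}d}{order:0{order_width}d}"
            )
    return mapping
```

Keeping the result as a mapping from original to new identifier makes it straightforward to build the reverse index as well, so that a search can accept either identifier, as the text describes.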

Blast
Blast software is installed on the RIGD server, and the intronless gene CDS and protein sequences of the six Rosaceae species stored in the RIGD were formatted into local Blast databases. In the Blast interface, users can paste a sequence or upload a FASTA-format file to search against the RIGD's local Blast databases and find putative homologous sequences of these intronless genes in different species. Users can select the database to search (CDS or protein) and the Blast program (Blastn or Blastp), and can enter an e-value threshold (Figure 6).
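A local setup of this kind can be sketched with the NCBI BLAST+ command-line programs. The Python example below only builds the command lines for `makeblastdb` and for `blastn` or `blastp`; the file names, database names, default e-value, and helper names are hypothetical, and the RIGD's own server-side code is not described in the text.

```python
import subprocess

# Database kind offered in the interface -> (BLAST+ program, makeblastdb type).
PROGRAM_FOR_DB = {"cds": ("blastn", "nucl"), "protein": ("blastp", "prot")}


def makeblastdb_command(fasta_path, db_name, db_kind):
    """Command that formats a FASTA file into a local BLAST database."""
    _, dbtype = PROGRAM_FOR_DB[db_kind]
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blast_command(query_path, db_name, db_kind, evalue=1e-5):
    """Command that searches a query file against a local database.

    '-outfmt 6' requests tab-separated hits, which are easy to parse.
    """
    program, _ = PROGRAM_FOR_DB[db_kind]
    return [program, "-query", query_path, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6"]


def run(command):
    """Run a command and return its standard output (requires BLAST+ installed)."""
    return subprocess.run(command, check=True, capture_output=True, text=True).stdout
```

Passing the arguments as a list rather than a shell string keeps user-supplied file names from being interpreted by a shell, which is worth the extra care in a web-facing tool.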

Statistics
The results of comparative analysis among the six species are shown on the Statistics interface with statistical charts. Four pictures investigate the general trends in protein length

Upload and Download
In this interface, users can download the following data from the "Download" section, organized by species name and chromosome number: the CDS and protein sequences of intronless genes, the predicted pI and Mw, the results of sequence comparison against the nr database, the results of protein domain analysis, the results of subcellular localization prediction, the results of protein function prediction, and the results of GO analysis. In the "Upload" section, users can upload intronless gene sequence files of other species or analysis result files to the RIGD server to expand the RIGD in the future.

Tools
We designed the Tools interface to collect some tools for intronless gene analysis or other practical bioinformatics analysis that will be developed in the future. The tool now available on this interface is a program that can batch submit sequences to ExPASy for pI/Mw prediction.

Contact Us
The Contact us interface is divided into "Contact us" and "Links." Users can email the RIGD's administrator in the "Contact us" interface to ask any questions or provide valuable suggestions. The "Links" interface contains links to external databases and analysis tools that the RIGD references.

CASE STUDY
The Results of Comparative Analysis Among the Six Species in Rosaceae
Table 2). The distribution of intronless genes on chromosomes was uneven in the different species (Supplementary Figures 1-6). Although the number of intronless genes varied greatly from chromosome to chromosome, the proportion of intronless genes on each chromosome did not vary much among species (Supplementary Figure 7). The average protein length was ∼333.4 amino acids (aa) in Pyrus bretschneideri, 258.7 aa in Pyrus communis, 321.4 aa in Malus domestica, 277.5 aa in Prunus persica, 351.5 aa in Prunus mume, and 275.0 aa in Fragaria vesca (Figure 7A). The distribution of pI had three peaks (Figure 7B), and the distribution of Mw was concentrated at the low end of the range; most predicted protein molecular weights were below 100,000 Da (Figure 7C). The largest number of intronless genes were predicted to be cytoplasmic (Figure 8). In all six species, pentatricopeptide repeat was the most frequent predicted protein function among intronless genes. The second most frequent was the AP2/ERF domain in Pyrus bretschneideri, Leucine-rich repeat in Pyrus communis, Zinc finger (RING-type) in Malus domestica, Prunus persica, and Fragaria vesca, and the protein kinase domain in Prunus mume. The 10 most frequent protein functions among intronless genes are shown in Figure 9. Among GO categories, the largest number of intronless genes were classified under biological process. The largest proportion of intronless genes in the six species were classified as cell and cell part (Table 3 and Supplementary Figures 8-13).

Analysis of Intronless Pentatricopeptide Repeat Gene Family in Pyrus bretschneideri
In Pyrus bretschneideri, the largest intronless gene family is the pentatricopeptide repeat (PPR) gene family. The PPR gene family is also one of the largest families found in most plants and plays a wide and crucial role in plant growth and development. We searched the RIGD using the Pfam IDs of the PPR gene family (PF01535, PF13041, and PF13812), and used the predicted protein function to decide whether each hit was a PPR gene. The isoelectric point, protein molecular weight, and subcellular localization results were obtained from the RIGD through the Search interface. We downloaded the protein sequences and used the MEME SUITE (Bailey et al., 2009) and TBtools (Chen et al., 2020) to analyze the motifs of the intronless PPR genes in Pyrus bretschneideri. We identified 120 intronless PPR genes in Pyrus bretschneideri. The molecular weight of the encoded proteins ranged from 11.5 to 113.7 kDa. LOC103927494 had the smallest molecular weight, while that of LOC103947845 was far higher than that of the other genes: about 10 times the minimum and more than twice the average of the 120 protein sequences. The predicted theoretical isoelectric points ranged from 5.2 to 9.47. For 46.3% of the members, the isoelectric point was below 7, indicating acidic proteins, while the other 53.7% were basic proteins (Supplementary Table 1). Subcellular localization prediction showed that most of the proteins were located in chloroplasts, some in mitochondria and the cytoplasm, and a few in the nucleus, plastids, endoplasmic reticulum, and extracellular regions. These results show that intronless PPR genes retain the typical localization in semi-autonomous organelles, which is consistent with the localization of PPR proteins in other plants (Figure 10).
We identified three sequence motifs: Motif1 (GIKPDVEHYGCMVDLLGRAGRLEEAEELIKEMPFK), Motif2 (IRVVKNLRVCGDCHSAIKLISKVVGREIIVRDANRFHHFKDGSCSCGDYW), and Motif3 (FVGNALIDMYAKCGSLEEARKVFDEMPERNVVSWNAMISGYAQ). Motif1 was present in all 120 intronless PPR genes and was highly conserved. Thirty-three genes contained only Motif1 (27.5%), 58 genes contained Motif1 and Motif3 (48.3%), and 28 genes contained all three motifs (23.3%). It is worth noting that Motif3 occurred only at the end of the amino acid sequence. In addition, LOC103956483 was the only one of the 120 intronless PPR genes that contained Motif1 and Motif2 but not Motif3 (Figure 11).
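The motif categories reported above account for all 120 genes, which the short Python check below makes explicit. The counts come from the text; the dictionary labels and variable names are chosen only for this example.

```python
# Number of intronless PPR genes per motif combination, as reported in the text.
motif_groups = {
    "Motif1 only": 33,
    "Motif1 + Motif3": 58,
    "Motif1 + Motif2 + Motif3": 28,
    "Motif1 + Motif2": 1,  # LOC103956483
}

total = sum(motif_groups.values())
shares = {label: round(100 * count / total, 1) for label, count in motif_groups.items()}

for label, count in motif_groups.items():
    print(f"{label}: {count} genes ({shares[label]}%)")
print(f"Total: {total} genes")
```

The single gene with Motif1 and Motif2 but no Motif3 corresponds to roughly 0.8% of the family, which completes the percentages given for the other three groups.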

DISCUSSION
Intronless genes in eukaryotes lack the intron structure found in most eukaryotic genes, so studying the functions and evolutionary characteristics of these genes can help us understand how related genes and genomes evolve. The study of intronless genes can also help researchers examine, from the opposite perspective, the effects of introns and alternative splicing mechanisms on eukaryotes.
Because of the importance of intronless genes in comparative genomics and evolutionary biology, intronless genes in eukaryotes have long been a focus of research. It is necessary to establish a centralized data platform for the integration, comparison, and analysis of the function and evolution of intronless genes on a larger scale. Little work has been done in this direction: only a few databases exist, and Genome SEGE and IGDD have stopped providing services. IGD was limited to human intronless genes, which were annotated in different databases. PIGD focused on the intronless genes of Poaceae species and conducted a systematic comparative analysis from the perspective of comparative genomics, but its retrieval service no longer works. As a result, users can only download the original data of the intronless gene sequences and the results of the analysis. The RIGD, as the latest intronless gene database, integrates the intronless gene data of six species of Rosaceae and provides a systematic comparative analysis. The RIGD was designed as a simple, easy-to-use, and esthetically pleasing website that provides feature-rich, user-friendly, integrated data and analysis tools. The Species interface provides downloads of the original data classified by chromosome number and analysis method. The Statistics interface presents the results of the systematic comparative genomics analysis of the six species in the form of graphs. The Search interface allows users to search for data on intronless genes of interest. In addition, NCBI Blast, a common bioinformatics tool, is embedded in the RIGD to help researchers annotate new sequences and predict homology with genes in the RIGD. The RIGD also provides multiple interactive platforms, including Up&Down, Contact us, and Links. Through these platforms, users can learn about the RIGD's analytical methods, download data of interest, and upload their important scientific findings to facilitate communication and data sharing among researchers in the same research field.
The RIGD is built on a Tencent cloud server with stable service and convenience for long-term maintenance and updating. In the future, we hope to update and expand the RIGD by communicating with researchers. The number of species collected is expected to increase, and more detailed annotation information on intronless genes, such as spatio-temporal expression data of intronless genes in different growth stages and tissues of plants, homologous genes in the genome, metabolic pathways of genes, and more, are expected to be added. This information will allow researchers to further explore the function and evolutionary mechanisms of intronless genes. Moreover, we are also committed to developing powerful comparative analysis tools to make the RIGD a centralized platform for intronless gene information and analysis, enabling researchers to use the database for data mining and analysis in various aspects.

CONCLUSION
With the development of sequencing technology, an increasing number of plant genomes are being sequenced and annotated, and the amount of data on intronless genes will continue to grow. It is therefore feasible to integrate, compare, and analyze the function and evolution of intronless genes on a broad scale. We developed the RIGD platform, collected and systematically analyzed the data on intronless genes in six species of Rosaceae, and provided a series of tools that let users search for intronless genes of interest and communicate with us. With the support of researchers, we eventually hope to develop a platform that integrates data on eukaryotic intronless genes with tools for comparative genomics analysis. Such a platform could greatly promote research on intronless genes in plants, help mine valuable genomic resources, and help researchers make new discoveries.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: ftp://ftp.bioinfo.wsu.edu/www.rosaceae.

AUTHOR CONTRIBUTIONS
TC designed the study, constructed the platform, wrote the website back-end and analysis programs, took part in the bioinformatics analysis, and drafted the manuscript. DM carried out most of the bioinformatics analysis, prepared the figures and tables, and participated in the design and updating of the platform. XL, XC, HW, QJ, and XX collected and collated the data, helped with the design and updating of the database, and provided suggestions and criticism for improving the manuscript and website. YCo and YCi participated in the design, helped write the manuscript, and supervised the whole project. All authors read and approved the final manuscript.