An Updated Functional Annotation of Protein-Coding Genes in the Cucumber Genome

Background: Although the cucumber reference genome and its annotation were published several years ago, the functional annotation of predicted genes, particularly protein-coding genes, still requires further improvement. In general, accurately determining orthologous relationships between genes allows for better and more robust functional assignments of predicted genes. As one of the most reliable strategies, the determination of collinearity information may facilitate reliable orthology inferences among genes from multiple related genomes. Currently, the identification of collinear segments has mainly been based on conservation of gene order and orientation. Over the course of plant genome evolution, various evolutionary events have disrupted or distorted the order of genes along chromosomes, making it difficult to use those genes as genome-wide markers for plant genome comparisons. Results: Using the localized LASTZ/MULTIZ analysis pipeline, we aligned 15 genomes, including cucumber and other related angiosperm plants, and identified a set of genomic segments that are short in length, stable in structure, uniform in distribution and highly conserved across all 15 plants. Compared with protein-coding genes, these conserved segments were more suitable for use as genomic markers for detecting collinear segments among distantly divergent plants. Guided by this set of identified collinear genomic segments, we inferred 94,486 orthologous protein-coding gene pairs (OPPs) between cucumber and 14 other angiosperm species, which were used as proxies for transferring functional terms to cucumber genes from the annotations of the other 14 genomes. In total, 10,885 protein-coding genes were assigned Gene Ontology (GO) terms which was nearly 1,300 more than results collected in Uniprot-proteomic database. Our results showed that annotation accuracy would been improved compared with other existing approaches. Conclusions: In this study, we provided an alternative resource for the functional annotation of predicted cucumber protein-coding genes, which we expect will be beneficial for the cucumber's biological study, accessible from http://cmb.bnu.edu.cn/functional_annotation. Meanwhile, using the cucumber reference genome as a case study, we presented an efficient strategy for transferring gene functional information from previously well-characterized protein-coding genes in model species to newly sequenced or “non-model” plant species.


INTRODUCTION
Cucumber (Cucumis sativus L.) (Huang et al., 2009) is an economically important crop as well as a model system for sex determination studies and plant vascular biology (Tanurdzic and Banks, 2004). The whole genome sequence and genome structure annotation of cucumber were published several years ago (Huang et al., 2009;Li et al., 2011). Approximately 23,248 protein-coding genes were predicted in cucumber reference genome, but only a few genes' functions had been verified by experiments (Shang et al., 2014). Recently, the whole genomes of three Cucurbitaceae species have been sequenced and assembled (Huang et al., 2009;Garcia-Mas et al., 2012;Guo et al., 2013). Together with the currently available genomes of other related angiosperms, it provided an opportunity to perform a more accurate functional annotation for these predicted protein-coding genes in cucumber.
Traditional approaches for automatic functional annotation of protein-coding genes in "non-model" species or in newly sequenced genomes rely on homology transfer based on sequence similarity, including Blast2GO (Conesa et al., 2005), identification of conserved domains (Rentzsch and Orengo, 2013) and InterPro2GO (Burge et al., 2012). In addition, OrthoMCL (Li et al., 2003) has been used to identify orthologous relationships between several related genomes using the Bidirectional Best Hits (BBH) strategy and then to transfer annotations of known genes to newly predicted genes. Accurately determining orthologous relationships between genes allows for a better and more robust functional assignment of predicted genes described as the "orthology-function conjecture" (Nehrt et al., 2011). Furthermore, the collinearity information could facilitate reliable orthology inference among multiple related genomes (Zheng et al., 2005). Currently, identification of collinear segments is primarily based on the conservation of genes order (Tang et al., 2008). Unfortunately, many events during plant genome evolution, including whole genome duplications (WGD) (Bowers et al., 2003), reshuffling of short DNA segments by mobile elements and horizontal gene transfer (HGT), have disrupted or distorted the genes order along chromosomes, thus making it difficult to use genes as markers for genomewide comparisons of plant genomes (Tang et al., 2008). In fact, several comparative genomic studies have demonstrated that smaller units, such as evolutionarily stable domains or segments, are more effective genomic markers than coding genes for whole-genome comparisons (Gabaldón and Koonin, 2013).
In this study, using the cucumber genome as a case study, we annotated functions of the protein-coding genes by integrating information about the collinearity of conserved DNA segments originating from multiple alignments of several related plant genomes.

Data Source
In this study, we used three different types of information as detailed in Table 1, including (1) genome sequences for multiple whole-genome alignments, (2) known functional annotations of genes in 14 non-cucumber genomes for functional transfer, and (3) 10 cucumber gene expression datasets used to validate the functional annotation of predicted genes.

Multiple Whole-Genome Alignment
In addition to the cucumber genome, previously published genomes of 14 other related angiosperms were used to perform a multiple whole-genome alignment. To ensure high quality genome assembly and alignment efficiency, we only used sequences that met at least one of the two following requirements: (1) genome sequences were assembled onto chromosomes; (2) N 50 of the scaffolds was >260 kbp. At the same time, scaffolds shorter than 2,000 bp were removed before alignment. As illustrated in Figure 1, we obtained a phylogenetic tree covering all 15 included species with the evaluated branch lengths by phyloFit (Siepel et al., 2005), and the topology was from an angiosperm super-tree (Davies et al., 2004). To generate a pairwise whole-genome alignment against the cucumber genome (Huang et al., 2009), we used LASTZ (Harris, 2007), a local alignment algorithm optimized for whole-genome alignments, to locally compare the cucumber reference genome with all of the sequences of each query genome. Then, the pairwise alignments were passed through the alignment "chaining" and "netting" pipeline as described by Kent et al. (2003) to ensure that each base of the reference genome was aligned with at most one base in each other genome, using synteny to guide the selection procedure. The resulting pairwise alignments of each query genome to the cucumber reference were joined using MULTIZ (Blanchette et al., 2004), guided by the phylogenetic tree topology in Figure 1. The 15-way multiple alignments can be viewed as a series of conserved blocks that exist in all 15 species that contain the best match within the cucumber reference. We termed these 15way blocks conserved across 15 angiosperms genomes as the "multiple alignment anchors" (MAAs).

Collinearity Detection Among 15 Angiosperms Plant Genomes
Two different types of genomic markers, including MAAs identified by multiple alignments and a set of protein-coding gene families constructed by OrthoMCL (Li et al., 2003), were used individually to detect positional orthologous regions (also known as collinear segments) among the 15 genomes using the i-ADHoRe v3.0 (Proost et al., 2012) program. The running parameters for i-ADHoRe were set as follows: "alignment_method = gg4, anchor_points = 3, gap_size = (MAAs-based as 30, protein-based as 10), prob_cutoff = 0.01, level_2_only = false". First, i-ADHoRe identified homologous regions (segments) between two genomes containing at least three homologous genes ("anchor points"), with the anchor points separated by at most 10 non-homologous genes or 30 MAAs ("gap_size"). Then, two types of collinear segments between cucumber and the other 14 angiosperm genomes were identified at the n-way and 2way-d levels using in-house Perl scripts based on the i-ADHoRe raw results. The n-way (n∈ {3,4, . . . , 15}) collinear segments indicate the group of species consisting of cucumber and the other species that were sequentially incorporated based on the topology of the tree in Figure 1, with cucumber as the origin. These segments represent the multiple species level of collinear segments. Each 2way-d (d∈ {2, 3, . . . , 15}, where d is the species index) collinear segment represented a pairwise alignment with one of the 14 non-cucumber species indexed by d, where d was incremented with the degree of divergence from cucumber according to the phylogenetic tree (Figure 1)

Orthology-Inference Rule
Based on the collinear segments among cucumber and the other angiosperms plants identified by using MAAs as markers, g 1 and g 2 (protein-coding genes from species 1 and 2, respectively) were designated as an orthologous protein-coding gene pairs (OPPs) if they met two essential conditions: (i) g 1 and g 2 shared certain sequence similarity (all-vs-all BLASTP, E-value = 1E-05); (ii) at least 50% of the length of the g 1 and g 2 sequences were located in the same collinear segment. We also applied two optional conditions: (iii) g 1 and g 2 were BBH (all-vs-all BLASTP, E-value = 1E-05); (iv) g 1 and g 2 also met condition (ii) when collinear segments were identified using protein-coding genes as genomic markers.
For each OPP, we calculated an orthologous pair support score (OPSS) from three parameters as follows: (i) the M-score, indicating the degree of the MAAs-based collinear segment (2∼15-way) supporting the orthologous relationship; (ii) the Pscore, with weight 0.5, indicating the degree of the proteinbased collinear segment (2∼15-way) supporting the orthologous relationship; and (iii) the B-score indicating the BBH relationship (0: not BBH; 2: BBH).

Six Existing Pipelines for Gene Functional Annotation
To evaluate the quality of the functional annotations from our collinearity-based pipeline, we performed analyses with 6 other commonly-used annotation pipelines, using default parameters, including Blast2GO (Conesa et al., 2005), OrthoMCL (Li et al., 2003), InterPro2GO (Jones et al., 2014), Trinotate-Blast (Grabherr et al., 2011), Trinotate-Pfam (Grabherr et al., 2011), and the UniProt resource, as shown in Table 2. The detailed pipeline parameters settings and procedures were described in the Supplementary Material.

Comparison of the Different Annotation Results
To assess the similarity of the different annotation results, we drew on the pipeline comparison strategy from Amar (Amar et al., 2014). We described each annotation result as a triplet (P, G, GO), where P represents the set of all of the annotated gene-GO term pairs, G represents the set of genes covered by P, and GO is the set of GO terms covered by P. Given the two annotation results from pipelines A = (P A , G A , GO A ) and B = (P B , G B , GO B ), we used three types of similarity indices for comparisons: (i) the Jaccard coefficient (Jaccard, 1912) between P A and P B , which was calculated as the ratio of the intersection of P A and P B to the union of P A and P B , was used to measure the degree of overlap between two annotation results. This is a structurefree index since it does not consider the hierarchical structure of GO, (ii) the Jaccard coefficient between G A and G B was used to measure the tendency of pipelines A and B to annotate the same gene set and is a structure-free index, (iii) the semantic similarity of the same gene annotated by GO A and GO B was calculated by Wang's method (Wang et al., 2007) as a structure-based index considering the parent-child inheritance relationships of GO. The final semantic similarity score for pipelines A and B was calculated as the average of their gene-wise semantic similarity scores.

RNA-Seq Data Processing and Gene Co-expression Calculation
Guided by the cucumber reference genome, the RNA-seq datasets from ten cucumber tissues (Li et al., 2011) were processed by the traditional Tophat (Trapnell et al., 2009) and Cufflinks (Trapnell et al., 2010) pipelines to obtain an FPKM (Fragments Per Kilobase of Transcript Per Million Mapped Reads) value for each gene. Given a gene set U associated with a specific GO term and a gene expression matrix X with genes as rows and the ten tissues as columns, Pearson correlation coefficients were calculated for all of the pairs of genes in U using their expression profiles in X. To evaluate whether the correlations in U tend to be higher than expected by chance, the correlation of randomlysampled gene pairs in X was calculated to obtain the distribution of random correlation scores. Then, the Kolmogorov-Smirnov test (KS test) was used to compare the observed correlation scores of U to the random correlation scores. To improve robustness, this process was repeated 100 times for each gene set U, and the mean p-value was used as the final p-value for that specific GO term.

Alignment of 14 Angiosperm Genomes to the Cucumber Reference Genome
The popular LASTZ (Harris, 2007)/MULTIZ (Blanchette et al., 2004)  The summary of the pairwise alignment is presented in Table 3, in which the species were ordered by the degree of divergence from cucumber based on the phylogenetic tree (Figure 1). The topology of tree was derived from a previously published angiosperm super-tree (Davies et al., 2004) and the branches were evaluated by phyloFit (Siepel et al., 2005). Although there were differences in genome size, sequencing quality and assembly quality, we observed a trend in which the whole genome alignability decreased with increasing divergence from cucumber, ranging from 91.28% (with C. melo) to 24.46% (with O. sativa). For coding regions, the alignability ratios were much higher, ranging from 98.13% (with C. melo) to 67.26% (with O. sativa). In addition, similar patterns were observed when calculating the alignability ratios with different genomic features, as illustrated in Figure 2. Additionally, even when using different reference genomes and including different species for multiple alignments (Hupalo and Kern, 2013), the genome alignability ratios showed a similar decreasing pattern with increasing levels of divergence, as shown in Figure 2.

Multiple Alignment Anchors Conserved Across All 15 Plant Species
The 15-way multiple alignments can be viewed as a series of conserved blocks that exist in all 15 species that contain the best match within the cucumber reference. We termed these 15way blocks conserved across 15 angiosperms genomes as the "multiple alignment anchors" (MAAs), which can potentially be used as genome-wide markers for detecting genomic homology. The rationale for finding MAAs is that due to long enough divergence time any identified regions of sequence similarity are the result of purifying selection rather than neutral carryover. With the cucumber genome as a reference, we assembled the largest comparative genomic dataset in plants to date using whole-genome sequences spanning the breadth of flowering plants. In total, 138,893 MAAs were identified by our local LASTZ/MULTIZ pipeline. The distribution of the MAAs lengths from chromosome 1∼7 is shown in Figure S1. The MAAs ranged in length from 1 to 3,513 bp, with a mean length of 115 bp and a median of 83 bp as showed in Table S3 with cucumber as reference. In total, MAAs covered approximately 8.13% of the cucumber reference genome, which was similar to the proportion of conserved elements in the human genome (3-8%) estimated by (Siepel et al., 2005). The normal composition of genomic features in cucumber was illustrated in Figure 3 and served as a reference to which the MAAs could be compared. The distribution of MAAs in genomic features as illustrated in Figure 3 could be contrasted with the normal distribution, revealing an expansion in the proportion of proteincoding sequence. Meanwhile, inter-genic regions and introns also contained many MAAs, suggesting a substantial number of sequences with undiscovered functionality in the genomes of cucumber and other plants (Hupalo and Kern, 2013).
The identified MAAs were more numerous, shorter in length and more widely distributed than protein-coding genes ( Figure  S2). Interestingly, the GC content of MAAs was significantly (P-value < 2.2E-16) higher than the genome-wide average of cucumber ( Figure S3). Previous studies have shown that the GC content was closely related to DNA structural stability and  (Siepel et al., 2005).
FIGURE 2 | Alignment of Cucumis sativus genome features with corresponding features in species at increasing phylogenetic distances. The solid red line indicates the genome alignability while the blue dashed line represents previously described CDS regions (Hupalo and Kern, 2013); CDS, coding sequences; UTR, untranslated region; ncRNA, noncoding RNA; TE, transposable elements; TR, tandem repeats. MBE indicates the results from one previous study (Hupalo and Kern, 2013) published in molecular biology and evolution. "Substitutions per Site" lists the divergence from cucumber based on the phylogeny in Figure 1.
nucleosome formation (Jansen and Verstrepen, 2011), which implies that the identified MAAs may be from stable, potentially functional regions of the genome. Evaluation of the distribution of genomic markers across the cucumber genome by KS test ( Figure S4) showed that MAAs were more uniformly distributed than protein-coding genes.

Atlas of Collinear Segments Between
Cucumber and Other Angiosperms Inferred by MAAs i-ADHoRe (Proost et al., 2012) was used to infer collinear regions among 15 angiosperm genomes using two different types of genome-wide markers: the MAAs and protein-coding gene families constructed by OrthoMCL (Li et al., 2003). In total, 1,983 n-way collinear MAA-based segments were identified with a mean length of 16,426 bp, while only 487 n-way collinear protein-coding genes-based segments were identified; however, as expected, they had a much longer mean length of 124,405 bp (Tables S1, S2 in Supplementary Material). For the 2wayd collinear segments between cucumber and each of the other 14 angiosperm genomes, we identified 80,910 MAAs-based segments with a mean length of 8,934 bp, while there were only 10,632 protein-coding genes-based segments with a mean length of 152,737 bp. The characteristics of the collinear MAAs-based segments including total length, segments number and segment length are summarized in Figure 4. The mean length of the n-way collinear MAA-based segments dramatically decreased as the number of included species was increased (Figure 4A), while the distribution of the corresponding segments' total length and segment numbers displayed a U-like shape ( Figure 4B). Interestingly, a similar U-like pattern has also been observed in phylogenomic studies of closely related E. coli strains (Touchon et al., 2009;Zhang and Lin, 2012). By contrast, the n-way collinear protein-coding genes-based segment number decreased more dramatically with the increase in the number of included species ( Figure S5). Importantly, we identified 988 15-way collinear segments by the MAAs-based method, with a mean length of 2,549 bp. Despite the short length of these 15-way segments, they may be valuable for inferring gene orthology and identifying candidate noncoding regulatory elements. Next, Jaccard similarity measures were calculated to compare the collinear MAAs-based segments with segments derived from protein-coding genes. As shown in Table S2 and Figure S6, only a weak relationship was observed at the multiple species level (n-way), although there were pairwise relationships (2wayd) with very high similarity especially between closely related genomes. Though the Jaccard similarity between these two sets of collinear genomic segments decreased rapidly with increasing divergence from cucumber, we found that the MAAs-based method identified a greater number of specific collinear segments than the method based on protein-coding genes. Our results showed that, as one type of potential genome-wide markers, MAAs were more suitable for identifying collinear segments among distantly related genomes than protein-coding genes.

Characterization of Orthologous Genes Between Cucumber and Other Plants
The term "ortholog" originally referred to genes in different species that were derived from the same locus in their last common ancestor. Biologists always, even completely, transferred gene functional information from model species to newly sequenced genomes (Gabaldón and Koonin, 2013) based on the "orthology-function conjecture" rule (Koonin, 2005). This hypothesis stated that orthologs tend to retain similar molecular and biological functions, while paralogs tend to diverge over time to perform different functions via sub-or neo-functionalization. The idea that collinearity among related species may be one of the most reliable methods for orthology inferences has been previously reviewed (Kristensen et al., 2011). Based on the collinear segments identified previously between cucumber and 14 other angiosperms, we inferred 94,486 OPPs according to the orthology-inference rule, as described in the materials and methods. All of the OPPs were shown in Table S11 with MAAs as markers and Table S12 with protein-coding genes as markers. The numerical distribution of the OPPs inferred from different levels of collinear segments was shown in Figure 5.
Each OPPs was further assigned an OPSS that integrated multiple evidences, including the level of collinearity detected with MAAs or protein-coding genes as genomic markers and sequence similarity (Table S13). Orthologous pair support score calculations were performed as described in the materials and  Figure 1, with cucumber as the origin. These segments represent the multiple species level of collinear segments. Each 2way-d (d∈ {2,3,…,15}, where d is the species index) collinear segment represented a pairwise alignment with one of the 14 non-cucumber species indexed by d, where d was incremented with the degree of divergence from cucumber according to the phylogenetic tree ( Figure 1). Thus, d represents, in ascending order, Cucumis melo, Citrullus lanatus, Malus domestica, Glycine max, Populus trichocarpa, Citrus sinensis, Brassica rapa, Arabidopsis thaliana, Arabidopsis lyrata, Vitis vinifera, Solanum tuberosum, Setaria italica, Brachypodium distachyon, and Oryza brachyantha. Figure 6, we found that approximately 75% of OPPs had OPSS values between 5 and 15, indicating that these OPPs were deduced from multiple species level of collinear segments (such as the OPP consisting of Csa1P002120.1 in cucumber and XP_006581876.1 in G. max, which was inferred by a 5-way collinear segment) or multiple evidences (such as the OPP consisting of Csa1P050030.1 in cucumber and XP_002864842.1 in A. thaliana, which was inferred from a 10-way collinear segment and was also supported by both a 2way-10 collinear segment using protein-coding genes-based method and the BBH strategy). Only 10% of OPPs had OPSS values smaller than 3, indicating that these OPPs were uniquely inferred from a 2way-2 collinear segment identified by the MAAs-based method. Interestingly, 31% of OPPs were validated by the BBH strategy, and 25% of OPPs were deduced from more than one level of collinear segments. For example, the orthologous gene pair Csa1P050470.1 from cucumber and NP_199849.2 from A. thaliana was deduced from four types FIGURE 5 | Distribution of OPPs (orthologous protein-coding genes pairs) inferred by different levels of collinear segments. Red bar indicates the results using MAAs as genomic markers; Blue bar is from results using protein-coding genes as markers. The n-way(n∈ {3,4, …, 15}) collinear segments indicate group of species consisting of cucumber and other related species which were sequentially incorporated based on the topology of the tree in Figure 1, with cucumber as the origin. These segments represent the multiple species level of collinear segments. Each 2way-d (d∈ {2,3,…,15}, where d is the species index) collinear segment represented a pairwise alignment with one of the other 14 non-cucumber species indexed by d, where d was incremented with the degree of divergence from cucumber according to the phylogenetic tree ( Figure 1). Thus, d represents, in ascending order, Cucumis melo, Citrullus lanatus, Malus domestica, Glycine max, Populus trichocarpa, Citrus sinensis, Brassica rapa, Arabidopsis thaliana, Arabidopsis lyrata, Vitis vinifera, Solanum tuberosum, Setaria italica, Brachypodium distachyon, and Oryza brachyantha.

methods. As shown by the OPSS distribution in
of evidences, including 15-way and 2way-10 collinear segments identified by the MAAs-based method, a 2way-10 collinear segment identified by the protein-coding genes-based method, and the BBH strategy. The distribution of OPPs shown in Figure 6 suggested that the accuracy of the orthology inferences based on collinearity was improved by validation through multiple lines of evidence.

Functional Annotation of Predicted Protein-Coding Genes in the Cucumber Genome
Guided by the OPPs, we collected functional information of cucumber protein-coding genes whose orthologous genes had been annotated in the UniProt-reference-proteomes database or TAIR (Lamesch et al., 2012). Approximately 44.84% of proteincoding genes (10,885) in the cucumber genome were assigned at least one GO term (Table S4). To reduce noise, we ignored GO terms with >300 or fewer than five associated genes. Overall, each cucumber gene was annotated with an average of 4.66 GO terms, and each GO term was associated with an average of 30.47 genes, as illustrated in Figure 7. In addition, based on the OPPs between cucumber and A. thaliana, 4,230 cucumber genes were assigned at least one growth-related Plant Ontology (PO) term (Cooper and Jaiswal, 2016) involving 32 developmental stages or conditions, and 4,797 cucumber genes were annotated with at least one structure-related PO term involving 66 distinct gene expression locations (Table S14).

Improved Functional Annotation Quality for Cucumber Protein-Coding Genes Than Traditional Methods
We compared the annotations derived from a total of eight pipelines, including six traditional methods and our two collinearity-based methods, named opp2go (MAAs-based) and opp2go (protein coding genes-based), with respect to the following aspects: (i) annotated gene coverage and related numbers of GO terms; (ii) the mean number of GO terms per gene and mean number of genes per GO term; and (iii) similarity comparison among these eight annotation results. All the results produced by the eight pipelines were shown in the Tables S4-S10, S15. As illustrated in Figure 7, our opp2go (MAAs-based) pipeline successfully annotated approximately 11,000 protein-coding genes (almost half of the protein-coding genes in cucumber), which was comparable to the results from the 6 traditional methods. By contrast, fewer than 8,000 genes were annotated by the opp2go (protein coding genebased) method, which was probably due to the weak collinearity identified with protein-coding genes as genomic markers, leading to a smaller number of inferred orthologous genes. Comparing with UniProt resource related with cucumber, the number of genes successfully annotated by opp2go (MAAs-based) was 2,730 more than by UniProt, as shown in Figure 7. Interestingly, there were two valuable observations from the functional annotations in Figure 7. First, the results from our opp2go (MAAs-based) pipeline covered an intermediate number of GO FIGURE 6 | Bar plot of OPSS (orthologous protein-coding genes pair support score) distributions for OPPs (orthologous protein-coding gene pairs). Vertical axis indicates the number of OPPs, and the horizontal ordinate represents OPSS value intervals. The green curve is the fitted density.
FIGURE 7 | Summary of annotations from 8 pipelines. To compare our annotation results with traditional methods, we selected six widely used pipelines, using default parameters, to assign GO terms to cucumber protein-coding genes, including Blast2GO (abbreviated as b2go), InterProScan (abbreviated as ips2go), OrthoMCL (abbreviated as orthomcl2go), Trinotate-Blast (abbreviated as trib2go), Trinotate-Pfam (abbreviated as trip2go), and the UniProt resource (abbreviated as uniprot). Detailed settings for each pipeline are described in materials and methods. Our pipelines, using MAAs as genomic markers [abbreviated as opp2go (MAAs)] or protein-coding genes [abbreviated as opp2go (protein-coding genes)]. (A) Coverage of annotated genes; (B) numbers of GO terms covered by annotations; (C) average number of GO terms associated with each gene; (D) average number of genes annotated by each GO term. "Subset" indicates that GO terms associated with > 300 genes or < 5 genes were filtered out. terms, significantly fewer than b2go but more than the two Trinotate-related pipelines and ips2go. In addition, the 4.66 GO terms per gene from opp2go (MAAs-based) was significantly lower than b2go (17.80), but higher than the two Trinotaterelated pipelines and ips2go, which had <2 GO terms per gene.
Drawing on the experience of Amar (Amar et al., 2014), we compared these eight pipelines with three types of similarity indices, including gene-GO term pairs, annotated gene sets and overall sematic similarity, as described in the materials and methods. MF (structure-free) of Figure 8 showed the pair-wise Jaccard similarity values for the eight pipelines, measuring the overall similarity between their gene-GO term pairs, using the Molecular Function (MF) subset of GO terms. Though there was a low overall similarity with a mean score of 0.2254, the two collinearity-based pipelines clustered together with UniProt, as did the two Trinotate-related pipelines with ips2go, and b2g clustered with orthomcl2go, which may be partly explained by the similar approaches or shared annotation resources between the members of each cluster. Gene-set similarity of Figure 8 showed the pair-wise similarity values of the gene sets across all of the annotation results, which measured the tendency of any two pipelines to annotate the same genes. Our two collinear-based pipelines and UniProt clustered together and were relatively distant from the other five traditional pipelines. From this, we inferred that collinearity-based pipelines cover different gene sets from traditional methods, which were mainly based on protein sequence similarity. However, even within the same cluster, the gene set similarity between opp2go (MAAsbased) and opp2go (protein-coding genes-based) was only 0.46, indicating that the collinear genomic regions identified with MAA markers were different from that with protein-coding genes as markers. MF and BP (structure-based) in Figure 8 showed the structure-based pair-wise similarity values of gene-GO term pairs among eight pipelines for both MF and Biological Process (BP) GO terms. Structure-based similarity accounts for the parent-children inheritance relationships in the GO hierarchical structure and represents the functional semantic similarity between different pipelines. Overall, the structurebased sematic similarity values were significantly higher than those of structure-free similarity as MF subset in Figure 8, because the structure-based method obtained higher scores with seemingly different but biologically similar predictions. The CC subset also showed similar observation in Figure S7. Our two collinearity-based pipelines formed one cluster with average similarity scores of 0.98 for MF and 0.93 for BP. We also found that the orthomcl2go pipeline (based on gene family information) could be clustered with opp2go (MAAs-based), with sematic similarity scores of 0.83 for MF and 0.67 for BP. By contrast, b2go was the farthest from opp2go (MAAsbased), with similarity scores of 0.70 for MF and 0.55 for BP. Taken together, these observations suggested that, compared with traditional pipelines, our collinearity-based annotation method could annotate protein-coding genes that was comparable to that of b2go and it may capture gene family information from our set of 15 related angiosperm species.
Due to the lack of a high-quality gene functional annotation dataset in cucumber, we validated the annotation results using the gene co-expression patterns from expression data of 11,754 genes in 10 different tissues. The Pearson correlation coefficient was used to measure the co-expression between pairs of genes as FIGURE 9 | Validation of annotations based on gene co-expression. opp_MAAs, our pipeline using MAAs as genomic markers; opp_protein, our pipeline using protein-coding genes as genomic markers; b2g, Blast2GO; ips, InterProScan pipeline; orthomcl, OrthoMCL pipeline; trib, Trinotate-Blast; trip, Trinotate-pfam, and uniprot, UniProt resource. Given a set of cucumber genes linked to a BP/MF/CC term by a specific pipeline, the average Pearson correlation coefficient for co-expression of genes was compared to that of a random gene set. For each pipeline the number of GO terms with p-value < 0.001 was indicated. detailed in materials and methods. To reduce the GO annotation bias, we removed GO terms associated with more than 300 genes or fewer than 5 genes. Figure 9 showed the ratio of specific GO terms with P-values < 0.001 for each pipeline. The opp2go method (MAAs-based) showed the best performance among the eight pipelines for the total GO terms and both the BP and Cellular Component (CC) GO term subsets. However, for the MF subset of GO terms, the ratios of significantly overrepresented terms (P-value < 0.001) among opp2go (MAA-based) and the other methods were all <0.2. We also observed that all eight pipelines showed better performance for the CC subset than BP and MF. One explanation might be that co-expression between genes was evaluated using RNA expression datasets from 10 specific tissues, in which they might contain some degree of tissue-specific and cellular component information. In summary, the annotation quality of opp2go (MAA-based) obtaineds a partial improvement over those of traditional methods, which might be due to its integration of collinearity between related species into the functional annotation process. Our results also demonstrated that inter-species collinearity was a reliable basis for orthologous gene inference.

DISCUSSION
Using a LASTZ/MULTIZ analysis pipeline, we identified a class of DNA segments (MAAs) that was highly conserved across 15 angiosperm genomes. These MAAs were more numerous, shorter in length, more widely and uniformly distributed than protein-coding genes in cucumber. Using MAAs as genomic markers, we identified multiple levels of collinear segments between cucumber and 14 related species. According to our survey, the mean length of collinear segments decreased dramatically with increasing genetic divergence from cucumber. However, with increasing divergence from cucumber, the number of segments gradually increased in the MAAs-based results, while a nearly opposite trend was observed with proteincoding genes. As a possible explanation, we hypothesized that long collinear segments might have been split into multiple smaller segments due to genomic events that disturbed these plant genomic segments over the course of long-term evolution. In this study, no 9-or 15-way collinear segments were identified by the protein-coding genes-based method, which suggests that using MAAs as genomic markers may be a viable alternative to protein-coding genes for detecting collinear segments among distantly related plants. More importantly, due to the colocalization of genes and their respective regulatory elements, the collinear segments may be used to detect potential cis-regulatory elements for nearby target genes as reviewed previously (Levy et al., 2001;Wittkopp and Kalay, 2011).
With these identified collinear segments, we inferred putative OPPs between cucumber and each of the other 14 species. Each of these orthologs to cucumber genes was then used as a proxy to transfer annotation information to its corresponding cucumber gene if its biological function was known. Although the annotation coverage was comparable with traditional methods, our functional annotation strategy greatly reduced the annotation redundancy and obtained a partial improvement of annotation accuracy evaluated by gene co-expression profiles.
However, the criteria for MAA inclusion should be refined by additional steps, including phylogenetic and evolutionary model analyses, such as BinCons (Margulies et al., 2003) or GERP (Cooper et al., 2005), sequence characters and lineage-specific information. Additionally, it is important to consider that genes tend to diverge over time to perform different functions at the domain level, rather than across the whole gene (Rentzsch and Orengo, 2013). In the future, we plan to improve our annotation pipeline by performing functional annotation of protein-coding genes based on domain level. The comparison of plant genomes has been complicated by recurrent polyploidy and extensive genome rearrangements (Tang et al., 2008), which strongly impact genome alignment and the identification of collinear segments. So, our current results should be interpreted cautiously until these inferred collinear segments have been supported by additional lines of evidence.
In summary, using the cucumber as a case study, we provided a potential alternative resource for the functional annotation of cucumber protein-coding genes that was an alternative strategy for transferring functional information from previously well-characterized protein-coding genes in model species to genes in "non-model" plant species guided by genomic collinearity, accessible from http://cmb.bnu.edu.cn/functional_ annotation. In addition, our study offers a pipeline for identifying collinear segments across multiple related plant genomes based on conserved DNA segments, which can potentially be used as genomic markers instead of protein-coding genes.

AUTHOR CONTRIBUTIONS
EP and KL: contributed the central idea; HS and EP: designed the study; HS: analyzed most of the data and wrote the initial draft of the paper; JH: contributed to refining the ideas. All authors discussed the results and revised the manuscript.