Landscape of the Dark Transcriptome Revealed Through Re-mining Massive RNA-Seq Data

The “dark transcriptome” can be considered the multitude of sequences that are transcribed but not annotated as genes. We evaluated expression of 6,692 annotated genes and 29,354 unannotated open reading frames (ORFs) in the Saccharomyces cerevisiae genome across diverse environmental, genetic and developmental conditions (3,457 RNA-Seq samples). Over 30% of the highly transcribed ORFs have translation evidence. Phylostratigraphic analysis infers most of these transcribed ORFs would encode species-specific proteins (“orphan-ORFs”); hundreds have mean expression comparable to annotated genes. These data reveal unannotated ORFs most likely to be protein-coding genes. We partitioned a co-expression matrix by Markov Chain Clustering; the resultant clusters contain 2,468 orphan-ORFs. We provide the aggregated RNA-Seq yeast data with extensive metadata as a project in MetaOmGraph (MOG), a tool designed for interactive analysis and visualization. This approach enables reuse of public RNA-Seq data for exploratory discovery, providing a rich context for experimentalists to make novel, experimentally testable hypotheses about candidate genes.

Each organism contains species-specific genes (denoted here as "orphan genes"). The challenge of distinguishing orphan genes in genomes and predicting their functions is immense, resulting in an under-appreciation of their importance. The emergence of novel protein coding genes specific to a single species (orphans) is a vital mechanism that allows organisms to survive a changing environment (Domazet-Lošo et al., 2007;Tautz and Domazet-Lošo, 2011;Carvunis et al., 2012;Arendsee et al., 2014;Schlötterer, 2015;McLysaght and Hurst, 2016). Over generations, those orphan genes that continue to provide a survival advantage will be maintained. Orphan genes can be identified from within a list of genes by phylostratigraphy, the classification of each gene according to its inferred age of emergence (Domazet-Lošo et al., 2007;Tautz and Domazet-Lošo, 2011). Two general mechanisms enable orphan gene emergence:(1) de novo evolution and (2) divergence of proteins of existing genes beyond recognition in a short time frame.
Orphan genes can evolve de novo from non-coding sequence in regions of the genome lacking genes entirely or as new reading frames within existing genes (Tautz and Domazet-Lošo, 2011;Ruiz-Orera et al., 2015;Arendsee et al., 2019a;Van Oss and Carvunis, 2019). Indeed, transcriptional and translational "noise" has been suggested as a mechanism that facilitates novel gene emergence (Chen et al., 2013;Hoen and Bureau, 2015;Landry et al., 2015;Gubala et al., 2017;Ruiz-Orera et al., 2018;Xie et al., 2019). This hypothesis is borne out by in vitro and in vivo synthetic biology research demonstrating that novel peptides are often able to bind small molecules (e.g., ATP, and metals)  and induce beneficial phenotypes when expressed Neme et al., 2017). If information on the expression of the dark transcriptome was more easily accessible, the potential roles of expressed transcripts could be better considered.
Evolution of orphan genes from existing protein-coding genes has been estimated to account for about 18% (human), 25% (Drosophila), and 45% (yeast) of annotated taxonomically restricted genes (Vakirlis et al., 2020). However, this estimate can consider only the that can be compared across species, i.e., those that are located within syntenic intervals of related genomes; about ∼50% of genes for yeast (Arendsee et al., 2019b).
A systematic analysis of current computational methods for genome annotation indicates many orphan genes may be missed in annotation projects (Li et al., 2021). This is because genes are often identified from sequenced genomes by combining evidence based on homology with other species (Meyer and Durbin, 2004;Proux-Wéra et al., 2012) with ab initio machine-learning predictions by detecting canonical sequence motifs (e.g., splice junctions) (Cantarel et al., 2008;Hoff et al., 2016). However, homology and ab initio approaches can be problematic in predicting orphan genes. First, orphan genes cannot be identified by homology to genes of other species, since they have none. Secondly, to the extent that an orphan has not yet evolved canonical motifs, ab initio prediction may be ineffective (Li et al., 2021). For example, compared to the gold-standard annotations in the curated TAIR community database (Berardini et al., 2015), the popular ab initio pipeline MAKER (Cantarel et al., 2008) predicted as few as 11% of the annotated Arabidopsis orphan genes, depending on the RNA-Seq evidence supplied (Li et al., 2021). Enhancing ab initio pipelines by other sequence-based information [e.g., motif/domain information, cellular location predictions, predicted isoelectric point (pI), and genomics context] can improve gene predictions (Grandaubert et al., 2015;González et al., 2016;Werner et al., 2018).
However, because it is not a given that newly evolved genes have canonical features, direct alignment of transcriptomic and/or proteomic data to the genome is critical for annotating orphan genes, as well as non-coding transcripts (lncRNAs, etc.) (Carvunis et al., 2012;Ruiz-Orera et al., 2014González et al., 2016;Lu et al., 2017;Wu and Knudson, 2018;Li et al., 2021;Blevins et al., 2021).
Here, we reuse and re-mine aggregated RNA-Seq data to discover new potential gene candidates. The study comprehensively evaluates transcription and ribosomal binding of all open reading frames (ORFs) in the yeast genome over a wide variety of conditions, in the context of annotated genes. The research extends the results of previous studies, in that it globally represents ORFs in the Saccharomyces cerevisiae genome across thousands of samples. Furthermore, we provide these data and extensive metadata via a biologist-friendly platform, MetaOmGraph (MOG; Singh et al., 2020), 1 which provides interactive, exploratory analysis (Tukey, 1977) and visualization of expression levels, expression conditions, and co-expressed genes for the ORF-containing transcripts. This approach enables experimentalists to prioritize ORFs for functional characterization, and to logically define experimental parameters for these characterizations .

Extracting ORFs and Delineating Orphan-ORFs in Saccharomyces cerevisiae
In order to evaluate potential ORFs in the yeast genome comprehensively, all ORFs over 150 nt were extracted from the yeast genome (version: R64-1-1) by emboss getorf (v6.6.0) using "-minsize 150" and "-find 3, " and translated by emboss transeq (Rice et al., 2000). Then, we removed the ORFs identical to annotated genes in Saccharomyces Genome Database (SGD) based on coordinates by bedtools2 (Quinlan and Hall, 2010). We further removed ORFs within annotated genes in same translation frame. These filtrations yielded 24,912 ORFs. To these FIGURE 1 | Annotated genes and dark transcriptome. (A) Definition of Dark transcriptome. Pervasive transcription of unannotated sequences has been found in many species. Some of these might be protein coding genes that have escaped annotation. Most of these unannotated coding genes are orphan (species-specific) genes, which have no homolog to other species, and are hard to predict using current gene prediction tools. These orphan genes could emerge by rapid divergence from ancient genes or could evolve de novo. Other transcribed but unannotated sequences might be non-coding genes. Although many studies have explored the function and classification of the non-coding transcripts, many transcribed sequences are still unclassified. (B) Classification and numbers of transcripts with transcription or translation evidence for annotated genes and open reading frames (ORFs). Orphan-ORFs, protein is unique to Saccharomyces cerevisiae [phylostrata (PS) = 15]; genus-specific-ORFs, protein is unique to Saccharomyces spp. (PS = 10-14); conserved-ORFs, protein has homologs in older species (PS = 1-9). Q3-transcribed, ORFs with mean expression values across the 3,457 samples in the upper (Q3) quantile of the unannotated transcripts. Low-transcribed ORFs, ORFs with mean expression values across the 3,457 samples in the lower 75% quantile of the unannotated transcripts. Non-transcribed conserved-ORFs, no transcription evidence detected (Supplementary ORFs we added two sets of ORFs < 150 nt identified in other studies: the 1,139 small translated sequences (smORFs) identified by ribosome profiling (Carvunis et al., 2012) and the 3,303 of ORFs identified by TIF-Seq (txCDS; Lu et al., 2017) that were less than 150 nt (thus, not included in the EMBOSS extraction). These 29,354 ORFs, together with the 6,692 protein-coding genes annotated in SGD, were subjected to phylostratigraphic analysis.
We inferred the phylostratum for 29,354 ORFs and 6,692 annotated protein-coding genes via the R package, phylostratr (v0.2.0) (Arendsee et al., 2019b). This analysis compared the proteins predicted from the cORFs to UniProt-annotated proteins of 123 target species distributed across phylostrata: 117 species were identified by the phylostratr algorithm; these were supplemented with six manually selected species in the Saccharomyces genus (S. paradoxus, S. mikatae, S. kudriavzevii, S. arboricola, S. eubayanus, and S. uvarum). To minimize false positives when identifying orphan ORFs and CDS from S. cerevisiae, we took advantage of the customization capabilities of phylostratr to include the predicted translation products from all ORFs (>150 nt) from each of the six Saccharomyces genomes, in addition to all annotated proteins of these species [see Supplementary Figure 1 for workflow, Supplementary Section 12 for full species list, and phylostratr_heatmap.pdf for a gene by gene (and ORF by ORF) inference heatmap]. Each gene was assigned to the most evolutionarily-distant phylostratum that contains an inferred homolog based on the adjusted p-value (0.001 as cutoff) as calculated from e-value of BLASTP (blast-plus v2.11.0). A gene or ORF was inferred to be an orphan if its encoded protein was assigned the phylostratum level S. cerevisiae.

Raw Read Processing and Network Optimization
Our RNA-Seq data analysis pipeline is shown in Supplementary Figure 2. We selected all samples with S. cerevisiae taxon ID 4932, Illumina platform, and paired layout from The National Center for Biotechnology Information-Sequence Read Archive (NCBI-SRA) and then filtered out samples with miRNA-Seq, ncRNA-Seq, or RIP-Seq library strategies. In total, we collected raw reads data (FASTQ format) and metadata from 3,457 RNA-Seq samples (177 studies). A kallisto index was created from a FASTA file combining the cDNAs of annotated genes and unannotated ORFs (Weijers et al., 2012) with default setting (kallisto index -i yeast.allcdna allcdna.fasta), and expression levels of annotated genes and ORFs over the 3,457 RNA-Seq samples were quantified by kallisto (v0.43.1) with the bootstrap option "-b 100" and other default settings [Kallisto with and without bias option give similar accuracy (Bray et al., 2016), we used the default setting without bias correction] (see Supplementary Material S.cerevisiae_RNA-seq.mog for RNA-Seq metadata and normalized cpm data; all data including raw counts is accessible at DataHub) 2 .
Strand-specific libraries provide accurate determination of sense vs. antisense transcription (Zhao et al., 2015), however, most yeast RNA-Seq data was non-stranded. We quantified all of the yeast RNA-Seq samples available on NCBI-SRA using the "unstranded" option. Then, we quantified the 5% (177 samples) of the available RNA-Seq samples are strand-specific using the strandness option. We compared the expression of each annotated and unannotated transcript as quantified by specific strandness ("-fr-stranded" or "-rf-stranded") with the expression of each annotated and unannotated transcript using the default option (no strand-specific) for the 177 samples. Then, we examined the pairwise Pearson correlation according to the expression with and without strand-specific option for each sample, the correlations had a median of 0.85 (range 0.71-0.91; Supplementary Figure 3). These high correlations between gene expression in the unstranded mode and stranded mode are consistent with the estimation for the unstranded RNA-Seq samples having little effect on our downstream Pearson correlation-based clustering analysis.
We normalized raw counts by edgeR (v3.22.3) (Robinson et al., 2010) based on the evaluation (Dillies et al., 2013), and evaluated the performance of normalization by comparing to the raw counts (Supplementary Section 3). After normalization, We further defined the robustly expressed ORFs, as defined by mean expression values in the upper quantile (Q3-transcribed ORFs). This subset of ORFs were used in some of the analyses, in particular the network analysis in which consideration of sparsely expressed and low-expressed transcripts are problematic. In subsequent methods and results, if "Q3-transcribed" is not designated, all transcribed ORFs were used in an analysis.
Three positive Pearson Correlation Coefficient (PCC) cutoffs, 0.6, 0.7 and 0.8, were used to create networks of different densities (Supplementary Section 4). We then applied Markov Cluster (MCL) algorithm to partition each network using our in-house Java Spark implementation (GitHub: 3 designed to optimize efficiency. All data analysis in this work, except for MCL clustering and RNA-Seq expression visualization, were performed in R software (v3.5.0).

Cluster Evaluation by GO Term Enrichment Analysis
Clusters resulting from each of the eight MCL analyses obtained from the different normalization methods and PCCs were evaluated by Gene Ontology (GO) enrichment analysis using clusterProfiler (v3.12.0) (Yu et al., 2012); only clusters with over five genes were considered. The GO term enrichment of each experimental result was compared to that of 100 random sets of clusters, which were obtained by permuting gene IDs. For these permutations, the same number and size of clusters as those from the experimental result were assigned to random sets using the method of Mentzen and Wurtele (2008). The best adjusted p-value (p min , smallest adjusted p-value) was recorded for the enriched GO terms in each cluster. Each random cluster set was assigned a score Si, which is the average p min across all clusters in the set where n indicates the number of clusters. The distribution of S values for GO classes, biological process (BP), cellular component (CC), and molecular function (MF), for random sets were compared to the respective values for the real experimental data. In each ontology, the experimental score was less than any of the random scores, indicating that experimental data have biological significance (permutation test, p-value = 0). Based on these GO enrichment results, we chose positive PCC = 0.6 as cutoff (Supplementary Section 4) for future analyses.

Ribo-Seq Analysis
To investigate the translational activity of unannotated ORFs, we analyzed 302 samples (23 studies) of yeast Ribo-Seq data; this represented about half of the available Ribo-Seq in the SRA database. Raw reads (SRA-formatted) were downloaded, and the SRA toolkit was used to convert the raw reads to a FASTQ format. BBDuk (v38.75) was used to find and remove adapter sequences from the 3 end of reads, and rRNA reads were identified and removed using BBMap (v38.75) with default option (Bushnell, 2014). The cleaned Ribo-Seq reads were aligned to the reference genome by HISAT2 (Kim et al., 2015). The actively translating ORFs were detected and quantified by Ribotricer (v1.3.2), which considers the periodicity of ORF profiles and provides multiple options for customization (we used the recommended phase-score cutoff 0.318 for yeast) (Choudhary et al., 2019). The genes/ORFs with mean counts across 302 Ribo-Seq samples higher than 1.83 (This is the maximum mean counts for non-transcribed ORFs. According to Supplementary Figure 13, the Ribo-Seq expression for nontranscribed ORFs is too low so that we regard those expression lower than 1.83 as sequencing noise.) was consider to have translation evidence.

Visualization and Gene Function Exploration
As proof-of-concept for the utility of these data, we used the MOG platform (v1.8.1)  to explore transcript co-expression and make functional inferences. We first created a MOG project S.cerevisiae_RNA-seq_3457_27.mog. This MOG project combines: (1) the levels of expression of each gene and ORF in the SGD+ORF dataset across 3,457 conditions; (2) gene and ORF metadata; and (3) sample metadata. For each gene and ORF, metadata include: functional annotations (from SGD); MCL cluster memberships with GO enrichment analysis; mean expression levels for RNA-Seq and ribosomal profiling; ribosomal binding evidence; genome location relative to UTRs and CDSs; GC content; length; genomic positional coordinates, orientation; and phylostratal assignment. Sample and study metadata (retrieved from NCBI-SRA)in the MOG project include: study ID, title, summary, reference, design description, library construction protocol, sequencing apparatus; sample title, experimental attributes, number of replicates; replicate name, sequencing depth, base coverage.
To explore the genes regulated by specific conditions, we did differential analysis in MOG, using the Mann-Whitney U test for differential expression analysis, with adjusted p-values by Benjamini and Hochberg. We chose all genes and ORFs with expression in the control samples or specific stresses [UV mutagenesis (Huang et al., 2017); under 37 • C at least 30 min (Andrie et al., 2014;Gupta et al., 2014;Presnyak et al., 2015;Wery et al., 2016;Uwimana et al., 2017)]. We designate genes and ORFs with log fold change > 1 and adjusted p-value < 0.01 as upregulated by the stress, " log fold change<−1 and adjust p-value < 0.01 as downregulated by the stress.

Identifying Potential Cryptic Orphan Genes in Saccharomyces cerevisiae
Saccharomyces cerevisiae has the most extensively sequenced and annotated genome within the Saccharomyces genus, or perhaps across eukaryotes. However, despite the large body of research on S. cerevisiae, this genome expresses many transcripts not annotated as genes (Carvunis et al., 2012;Pelechano et al., 2013;Smith et al., 2014;Lu et al., 2017;Wu and Knudson, 2018;Blevins et al., 2021), some of de novo origin (Carvunis et al., 2012;Vakirlis et al., 2018Vakirlis et al., , 2020Arendsee et al., 2019a;Van Oss and Carvunis, 2019), some supported with translational evidence (Van Oss and Carvunis, 2019;Blevins et al., 2021). Our overall goal was to generate a comprehensive overview of expression of ORFs, and make these data available in a format that can be readily explored. For this study, we classified all unannotated ORFs (>150 nt) and annotated genes in the S. cerevisiae genome according to phylostrata, transcription and translation evidence, and genomic context. We also included yeast ORFs < 150 nt with transcription and/or translation evidence that had been characterized in two previous publications: smORFs (Carvunis et al., 2012) and txORFs (Lu et al., 2017). Figure 1B defines our process and lists the numbers of genes and ORFs identified at each step.
We inferred the oldest phylostratum (PS; Šestak and Domazet-Lošo, 2015) to which each S. cerevisiae protein (or candidate protein) could be traced, using the reproducible and customizable phylostratr package (Arendsee et al., 2019b ; Supplementary Figure 1). Similarity to proteins of cellular organisms (i.e., proteins tracing back to prokaryotes) was designated as PS = 1; no similarity to any protein outside of S. cerevisiae was designated as PS = 15 (see Supplementary Material, S.cerevisiae_RNA-seq_3457_27.mog for PS assignments for each annotated and unannotated transcript). This analysis infers that fewer than 4% of annotated genes are orphans. In contrast, 54% of unannotated ORFs are orphans ("orphan-ORFs"), 40% are genus-specific (PS = 10-14), and only 6% are more highly conserved (PS = 1-9; Figure 1B).
In fungi, plants, and animals, the mean lengths of CDSs of annotated genes increase during evolution, with CDSs of orphan genes being the shortest (Toll-Riera et al., 2009;Arendsee et al., 2014Arendsee et al., , 2019aPalmieri et al., 2014; Van Oss and Carvunis, 2019; Supplementary Figure 8A). The ORFs of yeast also follow a similar trend: average lengths of orphan-ORFs are shorter and average length of ORFs increases with increasing phylostrata (Supplementary Figure 7). Consistent with the finding of Basile (Basile and Elofsson, 2017), the mean GC content for annotated orphan genes in S. cerevisiae is slightly lower (though not statistically significant) than that of more conserved genes. Similar to this GC content difference for annotated orphan genes, the Q3-transcribed orphan-ORFs, have a slightly lower mean GC content than the Q3-transcribed ORFs of other phylostratum levels (Supplementary Figure 8B). Vakirlis et al. (2018) reported a higher mean GC content among those orphan genes that have a confirmed de novo origin.
We anticipated that a characteristic of many of the orphan-ORFs that are actual genes that have escaped annotation would be sparse-expression. We aimed to identify RNA-Seq samples comprising diverse developmental, genetic, and environmental conditions, to help to capture expressed but unannotated transcripts. To gather RNA-Seq data from as diverse conditions as were available, we collected raw sequence reads and metadata of 3,457 RNA-Seq samples from 177 studies in NCBI-SRA (see S.cerevisiae_RNA-seq_3457_27.mog for metadata and counts). The experimental variables across these samples include a variety of mutants, chemical treatments, stresses, and growth stages. We quantified the expression of all 29,354 ORFs and 6,692 annotated genes of S. cerevisiae across the 3,457 RNA-Seq samples.
Using RNA-Seq samples drawn from a wide range of conditions has an additional benefit. Functional inference is a particular challenge for orphan genes, which have no homologs in other species, and rarely have recognizable functional domains (Arendsee et al., 2014). Because genes with similar patterns of expression are likely to encode proteins involved in common processes, using datasets incorporating the diverse conditions under which orphans-ORFs or orphan genes might be expressed provides a powerful approach to determine the conditions that induce their expression, and to infer function based on the co-expressed genes of known function. with ribosomal evidence of translation (Carvunis et al., 2012)], and all transcribed orphan-ORFs (>150 nt) across the 3,457 RNA-Seq samples (see Supplementary Figure 9 for heatmap expression plot of all genes and ORFs). The mean expression across all samples for annotated genes is 38 cpm, whereas the mean expression for the Q3-transcribed ORFs is 18 cpm (Supplementary Table 4). Many SGD-annotated genes are expressed in most of the samples. In contrast, as we anticipated based on the erratic pattern of expression of annotated orphan genes, most of the orphan-ORFs show very low expression in most RNA-Seq samples, but accumulate more highly in a few samples. This sporadic expression contributes to the observed lower mean expression of the orphans across all samples. It also demonstrates how transcribed sequences might be missed if smaller, less diverse datasets are considered.
Ninety-nine percent of the 3,457 RNA-Seq samples have transcription evidence for at least one of the orphan-ORFs ( Figure 3A). Some samples are particularly rich in orphan-ORFs. For example, 1,000 samples have transcription evidence for >9,000 of the orphan ORFs; these samples grown under conditions of nutritional or chemical stress and studies from different mutant. The phylostrata of transcript (orphan, genus-specific and conserved) and transcript status (low-transcribed ORFs, Q3-transcribed ORFs, and annotated) showed significant effect on the number of RNA-Seq sample with expression for each gene/ORF according to the twoway ANOVA test (Table 1, Figure 3B). Younger genes often expressed in less samples, which is consistent with previous studies. Unannotated ORFs also expressed in less RNA-Seq samples than annotated genes regardless of the phylostrata, that's one reason why they were omitted from annotation. We chose two stress with sufficient RNA-Seq samples in our dataset as example to verify whether young genes are regulated by stress. Over 2,000 orphan and genus-specific genes and ORFs are upregulated by the UV mutagenesis ( Figure 3C and Table 2), and about 1,000 orphan and genus-specific genes and ORFs are upregulated by the high temperature (under 37 • C at least 30 min) ( Figure 3D and Table 2).
The conserved SGD-annotated genes have higher mean expression than either the orphan annotated genes, the Q3transcribed orphan-ORFs, or the Q3-transcribed conserved-ORFs (Kolmogorov-Smirnov Test, p-values < 0.001; Figure 4). However, despite their generally sparse distribution, over 600 orphan-ORFs have a higher mean expression than 10% of  Phylostrata, orphan, genus-specific, and conserved; Status, low-transcribed ORFs, Q3-transcribed ORFs and annotated genes; ***p-value < 0.001. Mann-Whitney U test for differential expression, adjusted p-value < 0.01. Log fold change > 1 upregulated by stress, log fold change <−1 downregulated by stress. Based on the GO terms assigned to the gene members of known function, Cluster 112 is enriched in the GO terms shown in Table 3. The results indicate a possible role in stress response related to the cell wall for the ORF members of Cluster 112 (see S. cerevisiae_RNA-seq_3457_27.mog for complete clustering and ontology results).
conserved annotated genes, 289 orphan-ORFs have a mean expression higher than 25% of the conserved annotated genes, and 36 orphan-ORFs have a mean expression higher than 90% of conserved annotated genes (Figure 4 and Supplementary Table 4A).

Translation Evidence of Genes and ORFs
Many RNAs in fungi and animals that have been annotated as "lncRNAs" are associated with ribosomes, and/or have proteomics evidence, indicating some of them may function as protein-coding genes (Wilson and Masel, 2011;Wu et al., 2011;Hangauer et al., 2013;Ruiz-Orera et al., 2015. To examine translation evidence in our study, we globally evaluated translation evidence, mapping raw reads from 302 ribosomal profiling RNA-Seq (Ribo-Seq) samples in SRA to the unannotated ORFs and annotated genes of S. cerevisiae (see Supplementary Material Ribo-Seq_counts.csv and Ribo-Seq_metadata.xlsx for raw counts and metadata). These 302 Ribo-Seq studies include a variety of conditions, but are lacking in representation of many stress conditions. About 52% of Q3-transcribed conserved-ORFs, 27% of genus-specific-ORFs, and 30% of orphan-ORFs have translational evidence among these somewhat limited Ribo-Seq samples ( Figure 1B). This compares to 96% of the conserved annotated genes, 27% of genus-specific annotated genes, and 20% of orphan annotated genes. The mean Ribo-Seq raw counts were significantly different (t-test p-value < 0.001) among classes of transcripts, depending on whether they were orphan, genus-specific, or conserved ( Figure 5A). The mean Ribo-Seq raw counts for the lowtranscribed ORFs are significantly lower than for the Q3transcribed ORFs, and the mean Ribo-Seq raw counts for the ORFs with no transcription evidence are 0 or near 0 (Supplementary Figure 13). The proportions of Q3-transcribed ORFs with translation evidence located within, overlapping, or between annotated CDSs are significantly different among orphan-ORFs, genus-specific-ORFs, and conserved-ORFs (Chi-square test, p-value < 0.001) ( Figure 5B). Notably, 60% of Q3-transcribed orphan-ORFs with translation evidence are located in the intervals between annotated CDSs, compared to only 14% of the genus-specific ORFs and 10% of the conserved ORFs ( Figure 5B).
Since yeast was the first model eukaryotic genome, and has been reannotated over time, it would be expected that most conserved genes are already annotated. However, some genusspecific-genes might have been missed because homology is a major criterion used for genome annotation. Orphan genes, which have no homologs in other species, sparser expression, and likely fewer canonical features (Li et al., 2021), are yet less likely to have been annotated. In total, 1,007 Q3-transcribed genus-specific-ORFs and 1,070 Q3-transcribed orphan-ORFs have ribosomal binding evidence. These transcribed, translated ORFs are candidates for protein-coding genes.
Four hundred and forty-nine of the 858 Q3-transcribed conserved-ORFs also have translation evidence. There are several possible explanations for why a transcript with translation evidence and homologs in other species are not annotated as genes. Some of these conserved-ORFs may be pseudogenes that retain some homology and expression, but have lost functional capacity. Other conserved-ORFs might encode active proteins, by because they are expressed only under limited conditions they might not have been sampled when SGD annotations were made. Still other conserved-ORFs may have been ignored because their ORF codes for a shorter protein than the canonical gene family member. [On average, a Q3-transcribed conserved-ORF is significantly shorter than the homologous annotated gene (t-test, p-value < 0.001)]. However, it not a given that because an ORF encodes a shorter protein it is non-functional. Translation of a short conserved-ORF might play a regulatory in that it limits translation of a nearby active protein (Wu et al., 2019). Also, shorter homologs of proteins with known function may play a biological role in regulating signal transduction, modulating enzyme activity, and/or affecting protein complexes, potentially FIGURE 4 | Density plot of mean expression level of transcripts across 3,457 samples for annotated genes and Q3-transcribed ORFs. X-axis, edgeR-normalized mean expression of genes and ORFs. Y -axis, number of transcripts. The area under the curve of the density function represents the probability of a range of mean cpm. The bimodal curve of all orphan-ORFs is attributable to the low mean expression of the smORFs (see Supplementary Figures 10, 11). About half of the Q3-transcribed orphan-ORFs have higher mean expression than annotated orpahn genes. Over 600 orphan-ORFs have a higher mean expression than 10% of annotated conserved genes; 289 orphan-ORFs (gray hatched area) have a higher mean expression than 25% of annotated conserved genes; and, 36 orphan-ORFs have a mean expression higher than 90% of annotated conserved genes (see also Supplementary Table 4).

Network Inference and Co-expression Analysis
To analyze the expression patterns of the ORFs in the context of annotated genes, we optimized correlation and network parameters for the RNA-Seq expression data (Supplementary Section 4), focusing our subsequent interactive co-expression analysis and visualization on a dataset ("SGD + ORF" dataset) composed of 14,885 transcripts (all annotated genes; the 7,054 Q3-transcribed ORFs; and all 1,139 smORFs) across 3,457 RNA-Seq samples.
We then computed the PCC matrix for the SGD+ORF dataset, and partitioned the resultant network with PCC > 0.6 (only consider positive correlation) by Markov chain graph clustering (MCL; van Dongen, 2000) into 544 clusters (Supplementary Table 1 for overview; genes and ORFs with cluster designations at Supplementary Material S.cerevisiae_RNA-seq_3457_27.mog). Forty-six percent of the 273 annotated orphan genes and 59% of the 3,899 Q3transcribed orphan-ORFs are members of clusters containing more than five genes and include genes of known function, thus providing potential for functional inference.
It was possible that ORF expression might be correlated with that of adjacent or overlapping annotated genes, i.e., that ORFs are expressed due to a physical proximity to transcribed annotated genes. We used two approaches to evaluate the extent to which such "piggybacking" might occur. In the first approach, we focused on the 390 ORFs that are located completely within UTRs of annotated genes (88% are orphan-ORFs). About 80% of these ORFs have a PCC less than 0.6 (0.6 is the correlation cut-off we used for MCL) with the encompassing annotated genes, however, about 2% (eight) ORFs have a correlation higher than 0.9. In the second approach, we calculated how many ORFs are in the same cluster as nearby annotated genes. To do this, we randomly selected 366 ORFs that were members of clusters, and made test clusters of the same sizes, each cluster containing randomly selected annotated genes and the identical ORFs as in the experimental data. Then, we calculated the distance of each ORF to each annotated gene in the FIGURE 5 | Mean expression and numbers of genes and ORFs with translational evidence, partitioned by phylostrata and genomic context. Ribo-Seq data were analyzed for genes and ORFs across 302 samples using ribotricer (Choudhary et al., 2019). (A) Mean raw Ribo-Seq counts per transcript for all genes and ORFs. X-axis, genes and ORFs as classified by phylostrata. Y -axis, mean value of raw read counts. The letters above each bar indicate significance in each group according to a t-test (p-value cutoff is 0.01). Similar to mean RNA-Seq counts, the conserved genes and conserved-ORFs have more total mean Ribo-Seq counts. (B). The 3,857 Q3-transcribed ORFs that had Ribo-Seq translation evidence were divided into groups according to their relationship to annotated, and the numbers of genes and ORFs with translational evidence was determined. The gene/ORF with mean counts across 302 Ribo-Seq samples higher than 1.83 was consider to have translation evidence. X-axis, groups of genes and ORFs, classified by phylostrata. Y -axis, number of ORFs in each group. The proportions of ORFs are significantly different among three phylostratal groups according to a chi-square test (p-value < 0.001). Over 60% the orphan-ORFs with translation evidence are located in the interval between protein coding sequences (CDSs).
randomly created and the experimental clusters. The distances were not statistically different in the experimental versus the random clusters (p-value = 0.16 in a t-test for difference). These analysis indicate that the expression of ORFs is not generally associated with their proximity to, or overlap with an annotated gene. However, there is strong support for such a relationship for specific ORFs [e.g., Figure 6, and as reported in Vakirlis et al. (2018)]. About 65% of the Q3-transcribed ORFs could be assigned to clusters in the co-expression matrix. Regardless of whether these ORFs are protein-coding, they could play a biological role. Those with translational activity provide an evidence-based cadre of candidate protein-coding genes with an inferred function based on the genes in the cluster with known functions that could be experimentally tested.

Alternate Analyses
Here, we used positive pairwise Pearson correlations to infer a network, and MCL to partition this network into clusters; we then determined that these clusters are enriched in genes participating in similar biological processes. However, each combination of network inference and partitioning approaches can supply complementary and different information about potential roles of orphan (and indeed all) genes. Networks could be inferred by, for example, correlation, mutual information (Zhang et al., 2012), or relatedness approaches (Netotea et al., 2014). Pearson correlation is highly sensitive at extracting genes whose expression is linearly correlated across multiple conditions, but misses non-linear co-expression. Weighted correlation approaches may minimize the biological bias due to sample redundancy (Obayashi et al., 2019), but improper cutoff of sample correlation for sample redundancy may lead to lost information in the clustering analysis. Likewise, networks can be partitioned by multiple methods, such as MCL, Modularity (Newman, 2006), and a very promising new approach, Reduced Network Extreme Ensemble Learning (RenEEL; Guo et al., 2019). Each combination of network inference and network partitioning method may provide different strengths and weaknesses in terms of extracting different types of useful biological information. For example, some approaches might be better at identifying signal transduction pathways, others at metabolic pathways, stress-responses, or hub genes and their targets. The large SGD+ORF dataset we provide herein could be analyzed by different approaches. Such analyses would that extract the same relationships would provide support for these relationships; also, comparative analysis would help reveal the strengths and weaknesses of various network inference and partitioning methods for extracting different types of biological information.
We have focused here on unannotated protein-coding transcripts of over 50 aa (except the smORFs, which are smaller); similar investigations could incorporate non-coding RNAs or transcripts encoding very small proteins. The information resulting from such studies could be incorporated into a new MOG project to enable interactive analysis and visualization.

Gene Ontology Enrichment Analysis for Co-expressed Clusters
In order to evaluate the significance of the clustering results, we compared the extent of enrichment of GO terms in the set of clusters obtained from MCL-partitioning experimental data to that of 100 randomly generated sets of clusters. For each randomly generated set, the number of clusters and the number of genes per cluster were held the same as the set of clusters from the experimental data; however, the genes assigned to each cluster were changed by random permutation. The best adjusted p-value for enriched GO terms was recorded for each cluster and averaged across all clusters to obtain a mean best p-value (Mentzen and Wurtele, 2008; Figure 7). Distribution of the p-values for GO FIGURE 7 | Gene Ontology (GO) enrichment analysis of experimental data and random permutation test distribution. A Pearson correlation matrix of the Saccharomyces Genome Database (SGD)+ORF dataset was partitioned into clusters by Markov Cluster (MCL). Best p-values (mean of the lowest adjusted p-values for GO terms) were determined across all clusters of the experimental data and all clusters of random permutations, similar to Mentzen and Wurtele (2008); Supplementary Section 4. Red arrow, experimental data. Black bars, best p-value of 100 randomly obtained permutations with size and number of clusters identical to experimental data. BP, biological process; CC, cellular component; MF, molecular function. The clustering result is significantly better for experimental data than any random permutation. terms in the 100 sets of randomized clusters was compared to that of the experimental data (red arrows in Figure 7). In our study, for each GO ontology category BP, CC, and MF, the best mean p-values for the experimental data are 0.019, 0.023, and 0.027, respectively. These values are significantly better than those of any of the randomly obtained cluster sets, indicating that the MCL gene clusters derived from the experimental data is not random. Most co-expression clusters are composed of genes and ORFs distributed across spatially diverse regions of the genome. Co-expressed genes are implicated as being involved in a similar biological process Spellman et al., 1998). This study is based on over 3,000 RNA-Seq samples further strengthens the likelihood that genes in each cluster might share a related biological process. All genes and ORFs, as partitioned into clusters by MCL, are provided in Supplementary Material S.cerevisiae_RNA-seq_3457_27.mog.
Twelve of the genes are in the seripauperin (PAU) family. PAU-rich co-expressed gene clusters have also been identified in independent microarray studies (Magwene and Kim, 2004;Orellana et al., 2014). The precise molecular function of the PAU genes is not known. However, many PAUs are induced by low temperature and anaerobic conditions, and repressed by heme (Rachidi et al., 2000) and individual PAU proteins confer resistances to biotic and abiotic stresses (Rivero et al., 2015). YER011W and YJR150C, both cell wall mannoproteins are also in Cluster 112, are localized to the same cellular compartments as PAUs and are also induced under anaerobic conditions (Kowalski et al., 1995;Kitagaki et al., 1997;Sertil et al., 1997;Cohen et al., 2001). The other annotated genes in this cluster have no functional description. GO enrichment analysis identified eight GO terms as significantly over-represented in Cluster 112 ( Table 3). Figure 9 represents a case study of an approach to develop a meaningful hypothesis. The example shows the co-expression of the genes and ORFs in Cluster 112 across all 3457 samples of the RNA-Seq SGD+ORF dataset (lower panel). Two studies that evaluate oxygen content as an experimental variable are highlighted. Study SRP067275 compares four growth stages of FIGURE 8 | Network view of genes and ORFs in Cluster 112, a cluster enriched in seripauperins and other stress-responsive genes. A Pearson correlation matrix of the SGD+ORF dataset was partitioned into clusters by MCL. Edge colors, Pearson correlations(0.6-1.0). Visualization by igraph in R (Csárdi and Nepusz, 2006).

Exploring Gene Function: Case Study, smORF247301
Though rare, some transcribed ORFs that are located near or in an existing gene share a similar transcription pattern. An example is smORF247301, one of the most highly expressed smORFs, which is 77 nt upstream of YPL223C (Figure 6). MOG analysis indicates smORF247301 and the nearby annotated gene YPL223C have a PCC of 0.95 across the 3,457 RNA-Seq samples. smORF247301 is located on the "+" strand of chromosome 16, while YPL223C is on the "−" strand of the same chromosome. The CDS of YPL223C is 507 nt, while smORF247301 is 33 nt. YPL223C is more highly expressed than smORF247301. YPL223C, a hydrophilin gene that is essential in surviving desiccation-rehydration, is regulated by the high-osmolarity glycerol (HOG) pathway (Garay-Arroyo et al., 2000), and induced by osmotic, ionic, oxidative, heat shock and heavy metals stresses. Analysis using MOG shows smORF247301 and YPL223C have increased expression in response to osmotic, heat, and desiccation stresses in three independent studies (Figures 6B-D).
It is possible that the transcription and translation of smORF247301 is "noise" (Eling et al., 2019) associated with the expression of the nearby YPL223C. A second possibility is that smORF247301 is a young, not-yet-annotated gene. It might be "piggybacking" on the expression apparatus of YPL223C. However, smORF247301 and YPL223C are transcribed in a convergent orientation ( Figure 5B); thus, the process, described by Vakirlis et al. (2018), whereby two transcripts in divergent orientation ( Figure 5B) are co-expressed via a common bidirectional promoter would not apply in the case of smORF247301 and YPL223C. A different "piggybacking" mechanism might apply: perhaps, due to its location in open chromatin, smORF247301 is provided with a ready-made exposure to transcription factors when gene YPL223C is transcribed. If a transcript (e.g., smORF247301) conferred a survival advantage under the same conditions as did its established neighboring gene (e.g., YPL223C), it could emerge as a new, co-expressed, gene by this mechanism.
Five hundred and thirty-seven orphan-ORFs with transcription and translation evidence are in physical proximity to an annotated gene and are transcribed in a divergent orientation (see Supplementary Material, divergent_pairs.csv). Of these pairs, 12 are co-expressed (PCC > 0.6); these 12 ORFs are potentially co-expressed by a bidirectional promoter [e.g., as described by Vakirlis et al. (2018)] The 525 orphan-ORFs that are not co-expressed, might still be controlled by a bidirectional promoter, because yeast ORFs can be transcribed by a bidirectional promoter, but not be correlated in expression because they are influenced by different transcription factors (Xu et al., 2009).

CONCLUSION
In this study we have globally assessed the accumulation of transcripts representing 36,046 annotated genes and unannotated ORFs of S. cerevisiae across 3,457 public RNA-Seq samples derived from diverse biological conditions. Ninety-five percent of the transcribed ORFs are orphans or genus-specific. Despite a strong tendency to be transcribed only under restricted conditions, 269 orphan-ORFs had mean levels of transcription across all conditions greater than 25% of annotated genes. Over 1,600 transcribed ORFs with translation evidence are members of co-expression clusters, providing additional clues as to a potential function.
The proportion of transcribed and translated ORFs that are functional genes is unknown. The SGD+ORF dataset assembled herein represents expression of annotated genes and unannotated ORFs under multiple conditions; it is delivered in a readily explorable, user-friendly format via the MOG platform. Combining this networkinformed view of aggregate RNA-Seq data with textmining of sample and gene metadata provides a powerful approach to develop novel, experimentally testable hypotheses on the potential functions of as-yet-unannotated transcripts.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

AUTHOR CONTRIBUTIONS
JL and EW conceived of the project and drafted the manuscript. JL carried out the design of the study and performed the statistical analysis. US participated in the visualization on MOG and provided Ribo-Seq analysis code and guidance. ZA contributed to the phylostrata analysis. All authors contributed, read, and approved the final manuscript.

FUNDING
This work was funded in part by the National Science Foundation grant NSF-IOS 1546858, Orphan Genes: An Untapped Genetic Reservoir of Novel Traits.