Systematic Nucleotide Exchange Analysis of ESTs From the Human Cancer Genome Project Report: Origins of 347 Unknown ESTs Indicate Putative Transcription of Non-Coding Genomic Regions

Expressed sequence tags (ESTs) provide an imprint of cellular RNA diversity irrespective of sequence homology with template genomes. NCBI databases include many unknown RNAs from various normal and cancer cells. These are usually ignored as presumed sequencing artefacts or contamination because they lack sequence homology with template DNA. Here, we report genomic origins of 347 ESTs previously assumed to be artefacts/unknown, from the FAPESP/LICR Human Cancer Genome Project. EST template detection uses systematic nucleotide exchange analyses called swinger transformations, which systematically replace particular nucleotides with different nucleotides. Among the 347 unknown ESTs, 51 match mitogenome transcription, and 17 and 2 are from nuclear chromosome non-coding regions and uncharacterized nuclear genes, respectively. Identified ESTs mapped on 205 protein-coding genes; 10 genes had swinger RNAs in several biosamples. Whole cell transcriptome searches for the 17 ESTs mapping on non-coding regions confirmed their transcription. The 10 swinger-transcribed genes identified more than once associate with cancer induction and progression, suggesting swinger transformation occurs mainly in highly transcribed genes. Swinger transformation is a unique method to identify noncanonical RNAs obtained from NGS, which identifies putative ncRNA transcribed regions. Results suggest that swinger transcription occurs in highly active genes in normal and genetically unstable cancer cells.


INTRODUCTION
Systematic nucleotide exchanges, also called swinger transformations, systematically exchange specific nucleotides with other specific nucleotides during DNA replication and/or RNA transcription. DNA or RNA molecules corresponding to systematic nucleotide exchanges are called swinger DNAs or swinger RNAs. Only 23 systematic nucleotide exchanges (Figure 1) are possible with four nucleotide bases (A, T, C and G), i.e. nine symmetric (X ↔ Y, e.g. A ↔ G) (Seligmann, 2013a; Seligmann, 2013b) and 14 asymmetric (X → Y → Z → X, e.g. A → G → T → A) (Seligmann, 2013b; Seligmann, 2013c). For example, in the symmetric exchange A ↔ G, all As are replaced by Gs and all Gs by As. The 14 asymmetric exchanges are directional, e.g. A → G → T → A: all As are replaced by Gs, Gs by Ts and Ts by As. It is unclear whether these systematic exchanges occur during DNA/RNA polymerizations or result from posttranscriptional editing. The relatively long lengths of swinger RNAs (> 100 nucleotides) favor the former. Previous correlation analyses (Seligmann, 2013b; Seligmann, 2013c; Michel and Seligmann, 2014) show that the lengths and abundances of swinger RNAs are approximately proportional to rates calculated on the basis of corresponding single nucleotide misinsertions by the human mitochondrial gamma DNA polymerase (from Lee and Johnson, 2006). This suggests that swinger RNAs result from polymerizations where the polymerase is stabilized in the usually transient, unstable state that causes regular single nucleotide misinsertions.
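The count of 23 exchanges follows from the fact that every systematic exchange is a non-identity permutation of the four bases (4! − 1 = 23). The sketch below, which is only an illustration and not part of the original analysis, enumerates these permutations and sorts them by cycle structure: permutations made only of 2-cycles are counted as symmetric, and those containing a 3- or 4-cycle as asymmetric. The function names are hypothetical.

```python
from itertools import permutations

BASES = "ACGT"

def cycle_lengths(mapping):
    """Return the sorted lengths of the non-trivial cycles of a base permutation."""
    seen, lengths = set(), []
    for start in BASES:
        if start in seen:
            continue
        length, base = 0, start
        while base not in seen:
            seen.add(base)
            base = mapping[base]
            length += 1
        if length > 1:  # fixed bases (cycles of length 1) are ignored
            lengths.append(length)
    return sorted(lengths)

def classify_exchanges():
    """Count symmetric and asymmetric exchanges among all non-identity permutations."""
    counts = {"symmetric": 0, "asymmetric": 0}
    for perm in permutations(BASES):
        lengths = cycle_lengths(dict(zip(BASES, perm)))
        if not lengths:  # identity permutation: no exchange at all
            continue
        kind = "symmetric" if max(lengths) == 2 else "asymmetric"
        counts[kind] += 1
    return counts

if __name__ == "__main__":
    print(classify_exchanges())
```

The nine symmetric exchanges comprise six single swaps (e.g. A ↔ G) and three double swaps (e.g. A ↔ T + C ↔ G); the 14 asymmetric exchanges comprise eight 3-cycles and six 4-cycles.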
Mitogenomic human swinger RNA reads produced by Illumina confirmed corresponding EST data (Seligmann, 2016b). The swinger transcriptome of the amoeba-hosted Mimivirus was also confirmed by two different sequencing techniques (SOLID and 454, Seligmann and Raoult, 2018). Detected peptides matching translation of the swinger-transformed mitogenome tend to map on detected swinger RNAs (Seligmann, 2016a; Seligmann, 2016b; Seligmann, 2016c; Seligmann, 2017a). These findings were further confirmed in purified mitochondrial transcriptomes (Warthi and Seligmann, 2018), rejecting the possibility of cytoplasmic contamination. Chimeric mitochondrial swinger RNAs also exist, partly following swinger polymerization and partly regular polymerization, with abrupt switches between these parts (Seligmann, 2015a; Seligmann, 2015b). This observation is paralleled by chimeric peptides, corresponding to translation of adjacent regular and swinger RNA (Seligmann, 2016d). Swinger RNA coverage associates with secondary structure formation (Seligmann, 2016e). Swinger DNA also occurs (Seligmann, 2014a; Seligmann, 2014b). Molecular functions and associations of swinger polymerizations with healthy or unhealthy cells remain unknown. Recently, swinger RNAs with A ↔ T + C ↔ G transformation were identified in HIV-mediated Non-Hodgkin's Lymphoma. The identified RNAs with A ↔ T + C ↔ G transformation are similar to the transcriptional product of polymerase template switching, known in retroviruses (Kandel and Nudler, 2002) and the human genome (Löytynoja and Goldman, 2017).
FIGURE 1 | Twenty-three systematic nucleotide exchanges. (A) 9 symmetric exchanges in which one nucleotide is exchanged by another during DNA replication or transcription, e.g. A ↔ C, where all As are replaced by Cs and all Cs by As. (B) 14 asymmetric exchanges that are directional in nature, e.g. A → C → T → A: all As are replaced by Cs, Cs by Ts and Ts by As.
The Encyclopedia of DNA Elements (ENCODE) project concluded that the human genome includes approximately 20,000 protein-coding genes and predicts that 80% of noncoding regions regulate gene expression (Dunham et al., 2012). However, the transcriptomes generated by current RNA sequencing technologies do not cover the complete human genome, because analyses of sequencing reads assume only canonical transcription. Analyses assuming systematic nucleotide exchanges could identify non-canonical RNAs and their genomic origin.
Expressed sequence tags (ESTs) are short DNA sequences (100-600 bp) generated from the sequencing of cDNA libraries. The ESTs represent RNAs derived from a particular cell irrespective of sequence similarity or dissimilarity to its template genome, unlike transcriptomes, in which sequencing reads with sequence similarity are considered as true reads while ignoring non-homologous reads. This underlines a major limitation in identifying and studying non-canonical RNAs and their association with various genetic diseases like cancers. Therefore, here we apply swinger transformations to identify unknown expressed sequence tags (ESTs) occurring in the FAPESP/LICR Human Cancer Genome Project (Neto et al., 2000). We expect to identify unknown ESTs reported in cancer cells, identify and report their genomic origins and confirm that swinger-transformed RNAs occur beyond mitochondria.

Identification of Unknown ESTs
The FAPESP/LICR Human Cancer Genome Project corresponds to 891,011 published (Neto et al., 2000) and 55,248 unpublished ESTs in the NCBI database. These ESTs were blasted against the "Human genomic + transcript" database using the "highly similar sequence" algorithm (megablast) (Zhang et al., 2000; Morgulis et al., 2008). A total of 149,500 ESTs with no sequence similarity with the human genome were further blasted against all sequences in the NCBI database to exclude ESTs corresponding to potential contaminations.

Swinger Transformation of Unknown ESTs
The 149,500 unknown ESTs were then swinger-transformed according to the 23 bijective transformations (Figure 1) (Michel and Seligmann, 2014). The swinger-transformed versions of these 149,500 unknown ESTs were blasted again against the "Human genomic + transcript" database using megablast, in order to detect their genomic template.
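The transformation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used in this study: sequences are treated as plain DNA strings over A, C, G and T, any other symbol (e.g. N) is left unchanged, and the function names and the label format (e.g. "A>G,G>A") are hypothetical.

```python
from itertools import permutations

BASES = "ACGT"

def swinger_tables():
    """Build one translation table per non-identity permutation of the four bases."""
    tables = {}
    for perm in permutations(BASES):
        target = "".join(perm)
        if target == BASES:  # skip the identity (untransformed sequence)
            continue
        # Label lists only the bases that change, e.g. "A>G,G>A" for A <-> G
        label = ",".join(f"{a}>{b}" for a, b in zip(BASES, target) if a != b)
        tables[label] = str.maketrans(BASES, target)
    return tables

def swinger_transform_all(est):
    """Return the 23 swinger-transformed versions of one EST, keyed by label."""
    est = est.upper()
    return {label: est.translate(table) for label, table in swinger_tables().items()}

if __name__ == "__main__":
    for label, seq in swinger_transform_all("AAGGCT").items():
        print(f"{label}\t{seq}")
```

Each of the 23 variants of an EST would then serve as a separate query against the reference database. Because the inverse of a non-identity permutation is itself a non-identity permutation, applying all 23 transformations to an EST covers every map that could recover its untransformed template.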

RESULTS AND DISCUSSION
In total, 347 ESTs (0.23%) were identified using swinger transformation analyses (Tables 1-3). This rate is about 4× lower than that estimated for human mitochondria (about 100/10,000 ESTs, 1%; data from Seligmann, 2012; Seligmann, 2013b; Seligmann, 2013c). Swinger-transformed sequences of these 347 identified ESTs are in Supplementary Table 1. The remaining 99.77% of unidentified ESTs could be sequencing artefacts. These could also be ESTs with non-systematic posttranscriptional hyper-editing (Porath et al., 2014) or result from systematic nucleotide deletion transcription. [...] (Figure 2). From here on, these identified ESTs are considered swinger RNAs. Abundances of swinger RNA classes are proportional to their mean length (r = 0.96, two tailed P = 0.00945). Assuming that different swinger classes result from different polymerase states, the positive correlation indicates that the same factor that promotes switching to a given mode of swinger transcription also favors the stability of this mode. The opposite, indicated by a negative association, would mean that frequent types of switches are unstable. We suggest that systematic nucleotide transformations happen during transcription (swinger transcription). This might result from polymerase enzyme fatigue. In carcinogenesis, during malignant transformation, cancer cells produce mutated RNAs and proteins because of genetic instability (Ciocca and Calderwood, 2005; Ruckova et al., 2012). Therefore, swinger RNAs are probably mainly non- or dysfunctional, and produce non- and dysfunctional proteins with probable carcinogenic effects.

Swinger-Transformed Genes in Different Cancer Tissues
These 347 swinger-transformed RNAs matched 205 known and two uncharacterized genes (Tables 1-3). These identified swinger RNAs might be artefacts. However, among these 207 genes, swinger RNAs from 10 genes were detected in multiple cancer types and biosamples (Table 2). These genes are adenylate kinase 1 (AK1), cystatin B (CSTB), DLG associated protein (DLGAP), heat shock protein family A (Hsp70) member 8 (HSPA8), mediator complex subunits (MED), poly(A) binding protein cytoplasmic 1 (PABPC1), MT-16S rRNA, MT-ND1, MT-ND2 and MT-ND4.
Two A ↔ T-transformed swinger RNAs detected in the colon and head-neck cancer lines mapped on the adenylate kinase 1 (AK1) coding gene (Table 2). AK1 plays an important role in tumor suppression and it is often downregulated in cancer cells (Collavin et al., 1999; Janssen et al., 2004; Vasseur et al., 2005; Jan et al., 2019). Three swinger RNAs detected in breast (A → G → C → T → A) and two colon cancer (A ↔ T) lines mapped on cystatin B (CSTB) (Table 2). CSTB plays an important role in expression and epigenetic regulation and is downregulated in lung, gastric and colorectal cancers (Zhang et al., 2016; Ma et al., 2017). CSTB promotes cell proliferation, migration and suppresses apoptosis in gastric cancer cells (Zhang et al., 2016). Downregulation of CSTB also promotes gastric cancer (Zhang et al., 2016). Similarly, swinger RNAs for HSPA8 and PABPC1 were identified. We identified two swinger RNAs from the HSPA8 protein-coding region. The carboxy-terminus of Hsc70 interacting protein (CHIP) plays an important role in cancer initiation and progression (Hatakeyama et al., 2005; Kajiro et al., 2009; Gaude et al., 2012) and has an anti-tumor effect in many cancer types including colon and gastric cancers (Kajiro et al., 2009; Ahmed et al., 2012; Wang et al., 2013; Ying et al., 2013; Wang T et al., 2014; Wang Y et al., 2014).
Thirty-three swinger RNAs were transcribed from the PABPC1 gene in colon cancer and head-neck cancer biosamples (Table 2). PABPC proteins are RNA processing proteins associated with gene expression regulation (Liu et al., 2012) and are upregulated in prostate and colorectal cancers (Eisermann et al., 2015). PABPC also has a tumor suppressor role in head and neck squamous cell carcinoma. Swinger-transformed RNAs produced from these identified genes will produce non-homologous and putatively nonfunctional mRNAs translating dysfunctional nonhomologous proteins, which supports the association between downregulation of these genes and cancer cell types. These swinger RNAs could be early stage factors responsible for cancer induction or result from genetic instability in later stages of malignant cancers (Ruckova et al., 2012; Ciocca and Calderwood, 2005). Results favor the former because the swinger-transformed mRNA of tumor suppressor genes would result in disruption of gene function, inhibiting programmed cell death. Analyses also identify two swinger RNAs mapped on DLG associated protein (DLGAP1 and DLGAP4) coding genes in healthy nervous tissue samples and colon cancer cells. DLGAP is a protein overexpressed in the brain (Fagerberg et al., 2014) and promotes invasiveness in cancer cell lines. Similarly, three swinger RNAs were transcribed from mediator complex subunit genes (MED) in colon and breast cancer tissues (Table 2). Mutations in MED12 are associated with tumorigenesis (Bullerdiek and Rommel, 2018; Xie et al., 2018) and cause benign breast fibroepithelial lesions (Pareja et al., 2019). These two observations suggest that swinger transformation of MED should induce carcinogenesis, whereas swinger RNAs are transcribed from overexpressed DLGAP genes due to enzyme fatigue in cancer cells.
Among the detected swinger RNAs, 51 swinger RNAs match mitochondrial genes (Tables 1 and 2). It is widely known and proven that mitochondrial genes are overexpressed in various cancer types to meet the metabolic requirements of cancer cells and are strongly associated with cancer (Tarantul et al., 2001; Modica-Napolitano and Singh, 2004; Księżakowska-Łakoma et al., 2014; Lin et al., 2018). Among 51 swinger RNAs, 7, 1, 30 and 3 swinger RNAs are transcribed from mitochondrial 16S rRNA, ND1, ND2 and ND4, respectively. These were identified in several biosamples (Table 2). Sixty-six percent (33 ESTs) of the identified swinger RNAs were transcribed from the MT-ND region. A previous study showed bias for swinger transformation of MT-ND genes (Warthi and Seligmann, 2018). Interestingly, detection of swinger-transformed DLGAP1 mRNAs (overexpressed ~15× in neurons compared to other tissue cells; Fagerberg et al., 2014) in normal nervous tissue samples suggests that the swinger transformation is not directly associated with cancer but perhaps associated with highly expressed genes. Carcinogenesis could be the outcome of swinger-transformed dysfunctional mRNAs. Previously identified swinger RNAs from highly active cancer-associated genes also support this finding. Indeed, even within mitochondrial genes, rRNAs are more expressed than other genes, and swinger rRNAs are the most frequently observed mitochondrial swinger RNAs in previous publications (Seligmann, 2013b; Seligmann, 2013c; Seligmann, 2014b; Seligmann, 2015b), matching the pattern of positive association between regular and swinger transcriptions. Hence, the positive association between expression and swinger transformation occurs independently for human nuclear and mitochondrial genes.
Hence, at least at this qualitative level of analysis, observations on these highly expressed nuclear and mitochondrial genes support that swinger transformations associate with highly active genomic regions that result in cancer induction and progression due to nonfunctional transcripts.
In order to test whether swinger transformations are biased towards some regions of genes, genes for which swinger RNAs were detected in more than one biosample (Table 2) were divided into three equal regions, i.e. the 5' region, the mid region and the 3' region, each spanning 1/3 of the gene. ESTs from the mitochondrial genes 16S rRNA (five of seven ESTs) and ND4 (two of three ESTs), and from the nuclear gene cystatin B (three of three ESTs), mapped on the 3' extremity of their respective genes.
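The regional assignment can be illustrated with a short sketch. It assumes, as a hypothetical convention not stated in the text, that each EST is assigned by the midpoint of its alignment, given as a 1-based coordinate counted from the 5' end of the gene; the function name is illustrative only.

```python
def gene_third(position, gene_length):
    """Assign a 1-based position along a gene to its 5', mid or 3' third.

    `position` is assumed to be the alignment midpoint of an EST, counted
    from the 5' end of the gene (an assumption made for this sketch).
    """
    if gene_length < 3:
        raise ValueError("gene_length must be at least 3")
    if not 1 <= position <= gene_length:
        raise ValueError("position must lie within the gene")
    # Integer arithmetic avoids floating-point issues at region boundaries
    index = (position - 1) * 3 // gene_length
    return ("5' region", "mid region", "3' region")[index]

if __name__ == "__main__":
    # A hypothetical 300-nucleotide gene with EST midpoints at 50, 150 and 290
    for midpoint in (50, 150, 290):
        print(midpoint, gene_third(midpoint, 300))
```

Other conventions (e.g. assigning an EST by its 5' or 3' end, or splitting an EST that spans two regions) would shift a few borderline assignments.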
Considering mitochondrial and nuclear genes separately, 38 among 43 (mitochondrial, 88.4%, one tailed sign test P = 0.0011) and 33 among 33 (nuclear, 100%, one tailed sign test P = 1.6 × 10⁻⁶) swinger RNAs map either on the mid or the 3' regions of the gene. These tests assume that the probability of mapping randomly on these regions is 2/3. One tailed sign tests are justified by the working hypothesis that polymerase enzyme fatigue (occurring more downstream from transcription initiation) causes swinger transcription. These observations indicate that the mid and 3' regions of highly expressed genes are more prone to produce swinger-transformed RNAs than the 5' region, for both nuclear and mitochondrial genes. Similarly, swinger-transformed ESTs (from only one biosample) mapping on identified genes preferentially mapped on the same region (Table 1 shows the 5' and 3' positions of such ESTs).
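These one tailed sign tests amount to upper-tail binomial probabilities under a null mapping probability of 2/3. The sketch below assumes an exact binomial tail; the source does not state which implementation was used, and the function name is hypothetical.

```python
from math import comb

def one_tailed_sign_test(successes, trials, p_null=2 / 3):
    """Probability of at least `successes` hits in `trials` draws under `p_null`."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

if __name__ == "__main__":
    # Counts taken from the text: ESTs mapping on the mid or 3' regions
    print("mitochondrial:", one_tailed_sign_test(38, 43))
    print("nuclear:", one_tailed_sign_test(33, 33))
```

Under these assumptions, the tail probabilities come out close to the reported values (about 0.0011 for the mitochondrial genes and about 1.6 × 10⁻⁶ for the nuclear genes). For the nuclear case, where all 33 ESTs map on the mid or 3' regions, the probability reduces to (2/3)³³.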
Among 275 ESTs from nuclear protein coding genes, 196 ESTs mapped on exons of genes and 76 swinger-transformed ESTs mapped on gene introns. Three ESTs mapped partly on gene exons and partly on introns, suggesting that swinger transformations occur before posttranscriptional modification, and supporting our working hypothesis of swinger polymerization during replication/transcription.
The 347 ESTs correspond to 203 protein coding genes and 19 RNAs from non-coding regions. The median and mean sizes of a protein coding gene in the human genome are ~26,288 bp and 66,577 bp, respectively (Piovesan et al., 2016). The mean and median lengths of the 203 identified swinger-transformed protein coding genes (including mitochondrial genes) are 124,427 bp and 55,291 bp, respectively, almost twice the mean and median lengths of human protein coding genes. The genes for which swinger transcripts were detected also included the largest protein coding gene in the human genome, i.e. RNA binding fox-1 homolog 1 (RBFOX1). These observations indicate that swinger transformation, probably due to polymerase fatigue, is not only associated with highly active genes but could also be associated with large genes. This could also explain the biased mapping of swinger RNAs on the mid and 3' regions of genes, vs the first third of genes.

Swinger RNAs From Non-Coding and Uncharacterized Genomic Regions
Seventeen and two identified swinger RNAs were transcribed from 15 non-coding genomic regions and two uncharacterized genomic regions, respectively (Table 3). The 15 non-coding swinger RNAs were not from 5' or 3' UTR or intronic regions of protein-coding genes. To test whether these 15 identified genomic regions on various human chromosomes are transcriptionally active, we searched the transcriptomes of seventy-one samples (SRX768406-SRX768476, Garzon et al., 2014) with the identified swinger RNAs transcribed from these non-coding genomic regions. Good and complete alignments of the reads on the searched sequences confirmed transcription of these 15 non-coding regions (Supplementary Table 2). This result suggests that these 15 regions are transcriptionally active, with potential roles in nerve cells and possible associations with carcinogenesis. However, these specific cancer tissue samples are unavailable, preventing further in vitro and in vivo tests and analyses.
FIGURE 2 | Relationship between the A → T → C → G → A and A → G → C → T → A asymmetric swinger transformations.
The identified swinger RNAs were mapped on human nuclear chromosomes. No compartmentalizations on any chromosomal arm, region or band of a chromosome were detected in relation to swinger expressed regions. Swinger RNAs produced by transformations that are more conservative at the amino acid level are also more abundant than swinger RNAs produced by transformations that cause more amino acid changes (Seligmann, 2018). This effect was also observed when comparing transformations involving the same number of nucleotides in the transformation (2, 3 or 4). For the distribution of swinger RNAs detected in the present study, too few classes of transformations were detected to enable such subdivisions. However, when considering all transformations, conservation increases with swinger RNA abundances (r = 0.4187, one tailed P = 0.0234). C ↔ G is the least conservative among the transformations involving only two nucleotides, and is one among two transformations for which no swinger RNA was detected in the currently examined data.

CONCLUSION
We report genomic origins of 347 previously unidentified ESTs generated by the FAPESP/LICR Human Cancer Genome Project (Neto et al., 2000). These represent 0.23% of the 149,500 unidentified ESTs from that project. Note that other types of systematic transformations apparently produce noncanonical RNAs, such as systematic deletions, which produce so-called delRNAs (Seligmann, 2015c; Seligmann, 2016f; El Houmami and Seligmann, 2017; Seligmann, 2017b) and corresponding peptides (Seligmann, 2015c; Seligmann, 2016a; Seligmann, 2016b; Seligmann, 2016c; Seligmann, 2016d; Seligmann, 2016e), including chimeric peptides. This underlines a general approach to identify unknown sequences generated by various sequencing methods. This study identifies swinger RNAs transcribed from multiple cancer-associated genes (Tables 1 and 2), suggesting that highly active genes produce swinger transcripts, possibly due to enzyme fatigue, and that these transcripts promote cancer progression. The identified sequences might be sequencing artefacts. However, random artefacts generated by sequencing equipment could not produce systematic exchanges. Systematic sequencing errors should produce swinger RNAs for all sequenced ESTs; however, biosamples where swinger RNAs are detected also include regular canonical RNAs. Analyses also identify transcriptionally active non-coding regions on human chromosomes, revealing putative ncRNA-transcribing regions with potentially significant roles in normal and cancer cells. We provide a unique method to study and identify unknown sequencing reads, reducing loss of important genetic information in raw sequence data. Systematic editing of RNA might contribute to solving the dark DNA conundrum.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.