Analysis of Long Non-Coding RNA in Cryptosporidium parvum Reveals Significant Stage-Specific Antisense Transcription

Cryptosporidium is a protist parasite that has been identified as the second leading cause of moderate to severe diarrhea in children younger than two and a significant cause of mortality worldwide. Cryptosporidium has a complex, obligate, intracellular but extra cytoplasmic lifecycle in a single host. How genes are regulated in this parasite remains largely unknown. Long non-coding RNAs (lncRNAs) play critical regulatory roles, including gene expression across a broad range of organisms. Cryptosporidium lncRNAs have been reported to enter the host cell nucleus and affect the host response. However, no systematic study of lncRNAs in Cryptosporidium has been conducted to identify additional lncRNAs. In this study, we analyzed a C. parvum in vitro strand-specific RNA-seq developmental time series covering both asexual and sexual stages to identify lncRNAs associated with parasite development. In total, we identified 396 novel lncRNAs, mostly antisense, with 86% being differentially expressed. Surprisingly, nearly 10% of annotated mRNAs have an antisense transcript. lncRNAs occur most often at the 3′ end of their corresponding sense mRNA. Putative lncRNA regulatory regions were identified and many appear to encode bidirectional promoters. A positive correlation between lncRNA and upstream mRNA expression was observed. Evolutionary conservation and expression of lncRNA candidates was observed between C. parvum, C. hominis and C. baileyi. Ten C. parvum protein-encoding genes with antisense transcripts have P. falciparum orthologs that also have antisense transcripts. Three C. parvum lncRNAs with exceptional properties (e.g., intron splicing) were experimentally validated using RT-PCR and RT-qPCR. This initial characterization of the C. parvum non-coding transcriptome facilitates further investigations into the roles of lncRNAs in parasite development and host-pathogen interactions.


INTRODUCTION
Cryptosporidium is an obligate protist parasite that causes a diarrheal disease called cryptosporidiosis which spreads via an oral-fecal route. Human cryptosporidiosis, mainly caused by Cryptosporidium parvum and Cryptosporidium hominis, is typically self-limited and causes 1~2 weeks of intense watery diarrhea in people with healthy immune systems. However, the illness may be lethal among the immunocompromised including individuals with AIDS, cancer, and those receiving transplant antirejection medications. In recent years, several Cryptosporidium species, predominantly C. hominis, have been identified as the second most prevalent diarrheal pathogen of infants globally after rotavirus (Bouzid et al., 2013;Kotloff et al., 2013;Painter et al., 2015;Platts-Mills et al., 2015;Sow et al., 2016) and a leading cause of waterborne disease among humans in the United States (Centers for Disease Control and Prevention) 1 . Thus far, Nitazoxanide, the only FDA-approved drug is not effective for use in infants or those with HIV-related immunosuppression (Amadi et al., 2009) i.e. the most susceptible populations, and no vaccine is available (Amadi et al., 2002).
Cryptosporidium has a complex lifecycle in a single host. The Cryptosporidium oocyst which is shed in feces is a major extracellular lifecycle stage. It can stay dormant and survive in the environment for months (Drummond et al., 2018). After ingestion of oocysts through contaminated water or food, sporozoites are released which are capable of invading intestinal epithelial cells where both asexual and sexual replication occur. Following invasion, sporozoites develop into trophozoites and undergo asexual replication to generate type I meronts and type II meronts. Type I meronts are thought to be capable of reinvading adjacent cells generating an asexual cycle (Fayer, 2008), while Type II meronts contribute to the formation of microgamonts (male) or macrogamonts (female) to complete the sexual stages (Bouzid et al., 2013). Conventional monolayer cell culture does not permit completion of the life cycle much beyond 48 h post-infection (hpi) ( Figure 1A), for as of yet poorly understood reasons but gametogenesis does occur (Tandel et al., 2019). The lack of in vitro culture has historically impeded the development of new drugs and vaccines for this medically important parasite. Recently, there have been several breakthroughs including genetic manipulation (Vinayak et al., 2015;Sateriale et al., 2020) and lifecycle completion. Several promising approaches have been developed including using a cancer cell line as host (Miller et al., 2018), biphasic and three-dimensional (organoid) culture systems, (Morada et al., 2016;DeCicco RePass et al., 2017;Heo et al., 2018;Cardenas et al., 2020), hollow fiber technology (Yarlett et al., 2020), and air-liquid interface (ALI) cultivation system (Wilke et al., 2019;Wilke et al., 2020). These breakthroughs are enabling better, much needed, studies of the parasite's full life cycle. A better understanding of the conditions and regulatory processes necessary for Cryptosporidium development are essential and will prove beneficial for the identification of drug and vaccine targets.
The first genome sequence of C. parvum was published in 2004 with a genome size of~9 Mb and~3800 protein-encoding genes annotated (Abrahamsen et al., 2004). Since this milestone, our understanding of the parasite and its biology have progressed remarkably. Early in vitro transcriptome analyses using semiquantitative RT-PCR over a 72 h post-infection (pi) time course during in vitro development revealed complex and dynamic gene expression profiles. Adjacent genes are not generally co-regulated, despite the highly compact genome (Mauzy et al., 2012). Under UV irradiation, C. parvum oocysts have shown a vital stress-induced gene expression response according to microarray data (Zhang et al., 2012). mRNA expression related to gametocyte and oocyst formation were studied using RNA sequencing of sorted cells (Tandel et al., 2019). Yet, little is known about the regulation of key developmental transitions. How the parasite regulates gene expression in order to invade hosts, amplify, evade the immune system and interact with their host awaits further discovery.
Most canonical eukaryotic enhancer proteins are not detected in Apicomplexa, the phylum that Cryptosporidium belongs to (Iyer et al., 2008), except for the transcriptional activators Myb and zinc finger proteins C2H2 (Cys2His2) and two additional transcription factor families. Instead, an expanded family of apatela-related transcription factors, the AP2 family of proteins (ApiAP2), appear to be the predominant transcription factors in this phylum, including Cryptosporidium (Oberstaller et al., 2014;Jeninga et al., 2019). AP2 domains in C. parvum are reported to have reduced binding diversity relative to the malaria parasite Plasmodium falciparum and proposed to possess less dominancy in transcriptional regulation in C. parvum (Campbell et al., 2010;Yarlett et al., 2020). It has been proposed that C. parvum is less reliant on ApiAP2 regulators in part because it utilizes E2F/DP1 transcription factors, which are present in Cryptosporidium while absent in other studied apicomplexans Yarlett et al., 2020). Based on the similarity of gene expression profiles, it has been suggested that the number of co-expressed gene clusters in C. parvum is somewhere between 150 and 200, and putative ApiAP2 and E2F/DP1cis-regulatory elements were successfully detected in the upstream region of many coexpressed gene clusters (Oberstaller et al., 2013). Additionally, low levels of DNA methylation in oocysts has been reported in several Cryptosporidium spp. , suggesting the requirement of additional regulatory mechanisms. At the level of post-transcriptional regulation, the RNA interference (RNAi) pathway, which plays a crucial role in gene silencing in most eukaryotes, is considered to be missing in Cryptosporidium due to the lack of identifiable genes encoding the microRNA processing machinery or RNA-induced silencing complex (RISC) components (Keeling, 2004). There is much that remains to be discovered with respect to regulation of gene expression in Cryptosporidium.
Long non-coding RNAs (lncRNAs) are transcripts without significant protein-encoding capacity that are longer than 200 nt. In eukaryotes, lncRNAs play critical regulatory roles in gene regulation at multiple levels, including transcriptional, posttranscriptional, chromatin modification and nuclear architecture alterations (Marchese et al., 2017). In humans, 3300 long intergenic ncRNAs (lincRNAs) were analyzed using chromatin state maps, and~20% of these RNAs are bound to the polycomb repressive complex (PCR2, a complex with histone methyltransferase activity) (Khalil et al., 2009). Most lncRNAs share many characteristics of mRNAs, such as RNA polymerase II-mediated transcription, a 5′ 7-methylguanosine cap and a 3′ poly(A) tail [6]. The expression of lncRNAs is usually more tissue-or time-specific than mRNA expression (Necsulea et al., 2014;Tsoi et al., 2015). lncRNA sequences are not well conserved across species, but their structure could be conserved due to functional constraints (Ulitsky et al., 2011;Diederichs, 2014). By forming hybrid structural complexes such as RNA-DNA hybrid duplexes or RNA-DNA triplexes, lncRNAs can recruit or scaffold protein complexes to facilitate localization of protein machinery to specific genome target sites . lncRNAs play important roles in regulating occurrence and progression of many diseases. After infected by C. baileyi, significant expression changes have been observed in the host (Ren et al., 2018). The mis-regulation of lncRNAs in multi-cellular eukaryotes has been shown to cause tumorigenesis (Chakravarty et al., 2014), cardiovascular diseases (Tang et al., 2019), and neurodegenerative dysfunction (Zhang et al., 2018) and thus can be used as diagnostic biomarkers.
Taking advantage of sequencing technologies, numerous lncRNA candidates have been detected in apicomplexans, some with proven regulatory potential during parasite invasion and proliferation processes. These discoveries have ushered in a new era in parasite transcriptomics research (Liao et al., 2014;Siegel et al., 2014;Broadbent et al., 2015;Ramaprasad et al., 2015;Filarsky et al., 2018). In P. falciparum, lncRNAs are critical regulators of virulence gene expression and associated with chromatin modifications (Vembar et al., 2014). Likewise, an antisense RNA of the gene gdv1 was shown to be involved in regulating sexual conversion in P. falciparum (Filarsky et al., 2018). In C. parvum, putative parasite lncRNAs were found to be delivered into the host nucleus, some of which were experimentally proven to regulate host genes by hijacking the host histone modification system (Ming et al., 2017;Wang et al., 2017a;Wang et al., 2017b). The importance of lncRNA in C. parvum has been demonstrated, but no systematic annotation of lncRNA has been conducted. The systematic identification of lncRNAs will increase the pool of candidate regulatory molecules thus ultimately leading to increased knowledge of the developmental gene regulation in C. parvum and control of parasite interactions with its hosts.
In this study, we developed and applied a computational pipeline to systematically identify new lncRNAs in the C. parvum genome. We used a set of stranded RNA-seq data collected from multiple lifecycle stages that cover both asexual and sexual developmental stages. We conducted a systematic analysis of lncRNA that includes sequence characteristics, conservation, expression profiles and expression relative to neighboring genes. The results provide new insights into C. parvum's non-coding potential and suggest several areas for further research.

RNA-Seq Data Pre-Processing/Cleaning
RNA-Seq data sets were downloaded from the NCBI Sequence Read Archive (SRA) and European Nucleotide Archive database (ENA). Detailed information on SRA accession numbers and Bioprojects are listed in Table 1 and Supplementary Table S1. FastQC-v0.11.8 was used to perform quality control of the RNA-Seq reads. Remaining adapters and low-quality bases were trimmed by Trimmomatic-v-0.36 (Bolger et al., 2014) with parameters: Adapters:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:25 MINLEN:25. All reads were scanned with a four-base sliding window and cut when the average Phred quality dropped below 25. Bases from the start and end were removed when the quality score was below 20. The minimum read length was set at 25 bases. The processed reads are referred to as cleaned reads in this work.   (Kim et al., 2015) with maximum intron length (-max-intronlen) set at 3000, and remaining parameters as default. Uniquely mapped reads were selected for further study using SAMtools-v1.10 (view -q 10) (Li et al., 2009). StringTie-v2.0.6 (Pertea et al., 2015) was used to reconstruct transcripts using the reference annotation guided method (-rf -j 5 -c 10 -g 1). At least five reads were required to define a splice junction. A minimum read coverage of 10 was used for transcript prediction. Only overlapping transcript clusters were merged together. The stranded library types were all "fr-firststrand." Transcripts with FPKM lower than three were removed. The transcriptome assemblies from multiple samples were merged into one master transcript file using TACO-v0.7.3 with default settings (Niknafs et al., 2017).

lncRNA Prediction
Transcripts that overlapped with currently annotated mRNAs in the C. parvum IOWA-ATCC annotation in CryptoDB v46 with coverage >70% on the same strand were removed using BEDTools-v2.29.2 (Quinlan and Hall, 2010). The remaining transcripts were examined for coding potential using the online tool Coding Potential Calculator (CPC) v0.9 (Kong et al., 2007). Transcripts considered as "coding" by CPC were removed. Potential read-through transcripts resulting from transcription of neighboring mRNAs were removed using two criteria: 1) The transcript was <50 bp from the upstream coding region of another gene on the same strand.
2) The transcript was always transcribed together with the upstream mRNA on the same strand. Finally, the remaining transcripts which occurred in >2 RNA-Seq samples were kept as putative lncRNA candidates for further studies.

Transcriptome Data Normalization and Identification of Differentially Expressed Genes
The raw read counts for both mRNA genes and predicted lncRNAs were calculated using HTSeq-v0.12.4 (Anders et al., 2015). All genes were filtered to require > 50 reads in at least three samples. Variance stabilizing transformation (VST) from DESeq2-v1.28.1 (Love et al., 2014) was used to normalize the expression between samples. Principal components analysis (PCA) of the RNA-Seq samples was performed using the resulting VST values. VST values for each of the mRNA and lncRNA candidates were visualized using the R package pheatmap-v1.0.12 (https://github.com/raivokolde/ pheatmap). The K-mean approach in pheatmap was applied to cluster genes based on the expression data. Differentially expressed genes including mRNAs and lncRNAs were analyzed by EdgeR-v3.30.3 (Robinson et al., 2010) using the raw read count from HTSeq-v0.12.4. The expression time-points compared were between oocyst/ sporozoites (oocysts, time point 0 h and sporozoites, 2 h), asexual stage (time point 24 h) and mixed asexual/sexual stage (only select samples from 48 h time point in which gametocyte marker genes are clearly expressed, i.e. batch D). The sex marker gene: cgd6_2090 encodes Cryptosporidium oocyst wall protein-1 (COWP1) produced in female gametes and cgd8_2220 encodes the homolog of hapless2 (HAP2) a marker of male gamonts (Tandel et al., 2019). A generalized linear model (GLM) approach was used for differential expression hypothesis testing. P-values were adjusted by the false discovery rate (FDR). Significant differentially expressed genes were declared at a log2-fold change ≥ 1.5 and an FDR < 0.05.

Expression Correlation
The expression correlation analysis between predicted lncRNAs and mRNAs was conducted with normalized expression data from all 33 samples (Table 1) of C. parvum using the Pearson test. P-values were adjusted by FDR.
Upstream Motif Analysis MEME v5.0.0 (Bailey and Elkan, 1994) was used to discover motifs that may be present upstream of putative lncRNAs. For the promoter motif search, we extracted the 100 bp of (+) strand sequence upstream of the predicted lncRNAs and searched both strands using MEME for motifs with length six to 50 bp. The parameters were: -dna -mod anr -nmotifs 15 -minw 6 -maxw 50 -objfun classic -markov_order 0 -revcomp.

Evolutionary Conservation
RNA-Seq data sets from C. hominis and C. baileyi oocysts were mapped to reference genome sequences for C. hominis 30976 and C. baileyi TAMU-09Q1, respectively, downloaded from CryptoDB v46 (https://cryptodb.org/cryptodb/). Mapped reads were assembled into transcripts using the same methods mentioned above. tblastx from NCBI BLAST v 2.10.0 was used to search for conserved antisense candidates among C. parvum, C. hominis and C. baileyi, with parameter of E-value:1e-5 and best hit retained. To assess conservation among other apicomplexans, P. falciparum, antisense and associated sense mRNA data were retrieved from (Broadbent et al., 2015). P. falciparum orthologs of the sense mRNAs in C. parvum were retrieved from OrthoMCL DB v6.1 (Li et al., 2003). If antisense RNAs were detected between P. falciparum and C. parvum orthologs, the lncRNAs between P. falciparum and C. parvum were considered orthologous.

RT-qPCR Validation
We designed PCR primers using the PrimerQuest Tool from IDT ( h t t p s : / / w w w . i d t d n a .c o m / p a g e s / t o o l s / p r i m e r q u e s t ) (Supplementary Table S2) to use for expression validation and exon structure confirmation of select lncRNA candidates using RT-PCR and qPCR. RNA was extracted from oocysts and provided by Boris Striepen. The cDNA for each sample was reverse transcribed using the iScript ™ cDNA Synthesis Kit (Bio-Rad, Hercules, CA) from 1mg of input RNA. The resulting cDNAs were used as templates for PCR amplification and qPCR detection. Strandspecific primers were designed to amplify antisense RNAs. The 18S rRNA gene of C. parvum was used as positive control and samples without RNA or primer were used as a negative control. Each RT-PCR reaction contained 1ul cDNA, 2 ml primer mix (10 mM), 2 ml water and 5 ml MyTaq ™ HS Mix (Bioline). RT-PCR was performed in the following conditions: 35 cycles of 15 s at 95°C, 30 s at 64°C. Then the RT-PCR products were run on a 2% agarose gel. The cDNA was also subjected to qPCR with All-in-One qPCR Mix (QP001; GeneCopoeia, Rockville, MD, USA) using the Mx3005P qPCR system (Agilent Technologies, Santa Clara, CA, USA). All reactions, including no-template controls, were run in triplicate. Following amplification, the CT values were determined using fixed threshold settings. lncRNA expression was normalized to 18S rRNA expression.

Data Availability
GenBank accession records CP044415-CP044422 have been updated to include annotation of the lncRNAs identified in this study. These data have also been submitted to CryptoDB.org.

Analysis of the Cryptosporidium Stranded RNA-Seq Data Used in This Study
To identify and investigate the expression profile of lncRNAs in C. parvum during parasite development, we searched for stranded RNA-Seq data sets from Cryptosporidium available in public databases. Due to the small volume of Cryptosporidium relative to the host cells, RNA-Seq data usually suffer from high host contamination. In this study, we selected samples that had more than 100k Cryptosporidium reads generated from the Illumina platform mapped to the reference genome sequence to reduce bias mostly arising from sequencing platform and sequencing depth. In total, 38 stranded-RNA-seq data sets which originated from 33 C. parvum samples, four C. hominis samples and one C. baileyi sample were selected for further analysis. The details and mapping statistics of each sample are shown in Table 1 and (Supplementary Table S1). The C. parvum samples represented  Table 1 and Supplementary Table S1). The C. hominis RNA-Seq and one C. baileyi RNA-seq sample were obtained from oocysts. The transcriptomes of the 33 C. parvum RNA-Seq samples were compared by principal component analysis (PCA) based on the normalized mRNA gene expression level in each sample ( Figure  1B). Extracellular stages, including oocyst and sporozoites from 0 h and 2 h, are differentiated from intracellular stages, including 24 h and 48 h. The transcriptomes of intracellular stages were demonstrated to be more heterogeneous, while extracellular samples formed a relatively compact cluster. This observation was consistent with a previous transcriptome study of C. parvum oocysts and intracellular stages (Matos et al., 2019). At time points 24 h and 48 h, different host cells and laboratory procedures were used, which could contribute to the distance observed between samples from the same time point.
To further explore whether sexual commitment was initiated in all 48 h samples, we profiled the transcriptome of marker genes cgd6_2090 and cgd8_2220 (Supplementary Figure 1). cgd6_2090 encodes the Cryptosporidium oocyst wall protein-1 (COWP1) which is produced in female gametes (Tandel et al., 2019); cgd8_2220 encodes the homolog of hapless2 (HAP2) which is a class of membrane fusion protein required for gamete fusion in a range of organisms including Plasmodium falciparum (Liu et al., 2008). HAP2 labeled protein was exclusively found in male gamonts in C. parvum (Tandel et al., 2019). In another study, the C. parvum transcriptome was elucidated over a 72 h in vitro time-course infection with HCT8 cells using semi-quantitative RT-PCR (Mauzy et al., 2012). The 48 h-specific genes from that study were also examined in the 33 RNA-Seq data sets analyzed here (Supplementary Figure 2). Both the sex marker genes and 48 h-specific genes show an expression peak at 48 hr. At 48 h, expression levels from batch D samples are much higher than batch C samples. cgd8_2220 is very low/not expressed at 48 h in batch C samples. These results indicate that both batch C and D samples showed the commitment of sexual development, but commitment was more pronounced in batch D. It is possible that sequencing depth could be influencing this difference as the normalization process (VST) between samples tends to reduce the variation of genes with low read support. Interestingly, batch C and D samples used different host cell types ( Table 1). Batch D C. parvum parasites were cultured in HCT-8 (Human intestine cells) while batch C parasites were cultured in MDBK (Bovine kidney cells). Although sexual stages have been observed in MDBK cells (Villacorta et al., 1996), it is possible that the adaptation of C. parvum to MDBK cells is lower than HCT-8 cells as hosts. Thus, a slower or lower conversion rate was observed.

Identification and Characteristics of lncRNAs
We began assembly of the C. parvum transcriptome using mapped RNA-Seq reads of samples in NCBI BioProject PRJNA530692 with a high sample quality. However, C. parvum has an extremely compact genome sequence. As calculated from the C. parvum IOWA-ATCC annotation, the average intergenic distance between the stop and start codon boundaries of neighboring genes is only 504 bp and this distance must also include promoter and UTR regions. The average length of an annotated C. parvum ATCC mRNA coding sequence, CDS, is 1853 bp. The high gene density and the short distance between genes make it difficult to accurately define UTR boundaries using short-read sequencing data as transcripts overlap and become merged. Transcriptome assembly using RNA-Seq without genome and reference annotation guidance leads to a high chimerism rate. Thus, we used reference annotation to guide the assembly process and set parameters to minimize the number of artificially fused transcripts. We then used the program TACO v0.7.3 (Multisample transcriptome assembly), which employs change point detection to break apart complex loci to lower the number of fused transcripts to obtain a non-redundant master transcriptome from all samples, resulting in 5818 transcripts in total. Of these, 4846 transcripts overlapped with an mRNA on the same strand and thus, were removed. Transcripts which were <200 bp or only detected in a single sample were removed to improve the lncRNA prediction quality. Transcripts that were considered as "coding" by the Coding potential analysis tool CPC were filtered out. To identify and remove potential read-through transcripts, predicted transcripts that were located closer than 50 bp from the coding region of the upstream gene on the same strand and always transcribed together with the upstream mRNA were removed (Figure 2A).
In total, 396 transcripts were selected as lncRNA candidates for further analysis, 363 (91.7%) are antisense to mRNAs and only 33 are lincRNAs ( Figure 2B). This high percentage of antisense RNA transcripts is a noteworthy feature of the compact C. parvum genome given that CDSs cover >70% of the genome sequence. Interestingly, the nucleotide composition varies between antisense RNAs and lincRNAs with many antisense RNAs being CT-rich and most of the lincRNAs being AT-rich (Supplementary Figure 3).
Most of the lncRNAs we detected consist of a single exon however five lncRNAs contain introns. This is consistent with the low-intron rate in Cryptosporidium. Additional introns are expected to emerge with deeper RNA-Seq data. The average length of lncRNAs and annotated mRNAs is 1267 bp and 1853 bp, respectively ( Figure 2C). When compared to mRNAs, one of the most distinguishing features of lncRNAs is the low Open Reading Frame (ORF) coding potential relative to the transcript length.
We further investigated the relative location of the antisense transcripts relative to the annotated sense mRNA coding sequence (CDS), and we found that many of lncRNAs' initiation and termination sites are located close to the sense CDS boundaries. The start site of the antisense lncRNA transcript is highly correlated with the end of the sense strand CDS ( Figure 2D). When looking at the coverage of the antisense transcript along the mRNA, the C. parvum lncRNA antisense expression has a bias towards the 3′ end of the mRNA transcript ( Figure 2E). This property has also been reported in other organisms, including the malaria pathogen Plasmodium falciparum (Lopez- Barragan et al., 2011;Siegel et al., 2014;Broadbent et al., 2015).
To better understand their transcription, we searched for potential promoter motifs within 100 bp of the (+) strand upstream from all 396 lncRNAs. This analysis returned five significant motifs (E-value <0.001) for 190 lncRNAs (Supplementary Figure 4 and Supplementary Table S3). The top two motifs are known Cryptosporidium transcriptional factor binding motifs E2F/DP1 (5′-[C/G]GCGC[G/C]-3′) and ApiAP2_1 (5′-BGCATGCAH-3′) motifs (Oberstaller et al., 2013), both of which are palindromic ( Figure 2F). Of the 190 lncRNAs with significant upstream motifs, 142 lncRNAs had only a single type of motif (93 with E2F/DP1 and 19 with ApiAP2_1). 70 (75.3%) of the 93 lncRNAs with upstream E2F/ DP1 motifs had only one motif, 21(22.6%) had two occurrences. Of the 19 lncRNAs with upstream ApiAP2_1 motifs, 12 (63.2%) had only one occurrence of the motif and 6 (31.6%) had two. Cooccurrence of the E2F/DP1 and ApiAP2_1 motif was only observed upstream of 17 lncRNAs. The spacing of the motifs relative to the transcription start site was assessed and no clear pattern was observed (Supplemental Figure 5). These combined results suggest that lncRNA transcripts have the potential for being regulated independently during parasite development.

lncRNAs Are Developmentally Regulated
We visualized gene expression profiles across the 33 RNA-Seq samples used in this study (Table 1) for both mRNA ( Figure 3A) and lncRNAs ( Figure 3B). To identify genes with a similar expression profile, we used a k-means algorithm to cluster mRNAs and lncRNAs separately (Supplementary Table S4 and Supplementary Table S5). The k value was selected as the smallest value that allowed the separation of genes from different time points while keeping genes from different samples of the same timepoint together despite the batch effects present within each time point. As a result, mRNAs and lncRNAs were clustered into seven and nine broad co-expression groups, respectively. Select highly expressed lncRNAs from clusters 2, 4 and 5 are visualized as examples (Supplementary Figure 6).
The expression of mRNAs in the extracellular stages (oocysts and sporozoites from 0h and 2h) showed a similar expression trend that is quite distinct from the latter two stages. For mRNAs, genes from cluster 1 and cluster 2 were more highly expressed in the extracellular stages but still show expression in the intracellular stages when the vast majority of mRNAs are active. This result is consistent with another transcriptome study of C. parvum, which used semi-quantitative RT-PCR over a 72 h time course during in vitro development (Mauzy et al., 2012). On the contrary, many lncRNAs showed enriched expression in extracellular stages (oocyst, 0 h, 2 h) and some had expression later at the intracellular sexual development stage (48 h) ( Supplementary Table S6) while the asexual stage (24 h) showed the least lncRNA expression. When considering the correlation of antisense transcription to parasite development (according to the peak expression time point), 162 antisense transcripts were classified as extracellular stage enriched (oocyst, 0 h and sporozoite 2 h), 50 as merozoite enriched (24 h) and 151 as sexual stage enriched (48 h). It is noteworthy that both mRNAs and lncRNAs have gene sets that are specifically turned on at 48 h (mRNA cluster 5 and lncRNA cluster 4). The average expression level of lncRNAs suggests they are more abundant or were actively expressed in oocysts, 0 h, 2 h and gamont formation 48 h stages than the 24 h merozoite stage (p value <.0001, t-test) ( Figure 3C). As was observed in the PCA to assess batch effect ( Figure 1B), there is increased variation at the 48 h time point. Compared to mRNAs which show more upregulation when transitioning from oocyst/sporozoite to the asexual stage (1715 genes vs 1270 genes), lncRNAs have many more genes downregulated (218 vs 80) ( Figure 3D). Comparing the asexual stage at 24 h to the sexually activated stage at 48 h, fewer mRNAs showed differential expression, with both having 400 genes upregulated and downregulated. Very few lncRNAs were downregulated in the transition from the asexual to sexually activated stage but 85 lncRNAs were upregulated. The 85 upregulated lncRNAs did not significantly overlap with the lncRNAs that were downregulated between the extracellular to the asexual stage. Here we only used 48 h samples from batch D to analyze differential expression since this batch has clear sexual stage marker gene expression, as discussed above. The developmentally regulated expression pattern of lncRNAs is indicative of their importance in extracellular and sexual stages. It is important to note that overall, the levels of lncRNA expression are lower than the expression levels observed for mRNAs ( Figure 3).

Correlation of lncRNA Expression With Sense and Neighboring mRNA Expression
LncRNA mediated gene regulation can be achieved by various mechanisms (Li et al., 2020). One mechanism is transcriptional interference that usually results in repression of the target gene. LncRNAs can also activate target gene expression through epigenetic mechanisms. Additionally, translational regulation by lncRNA has also been reported. Therefore, to understand the potential roles of lncRNA transcription or transcripts, we studied the correlation between lncRNA and neighboring gene mRNA expression in C. parvum using the 33 RNA-Seq samples (Supplementary Table S5). First, to estimate the background transcriptional correlation between genes, we calculated the expression correlation between 10,000 random gene pairs. 5748 pairs had a positive Pearson's correlation with an average r value of 0.56 while 4250 pairs had a negative Pearson's correlation with an average r value of −0.51. Thus, we consider r > 0.56 as an indicator of positive correlation and r >0.70 as significant correlation and r < −0.51 as an indicator of negative correlation with r < −0.7 as significant negative correlation and r values between −0.51 and 0.56 as indicative of no correlation.
Next, transcriptional correlations between the sense mRNA gene and its antisense as well as between mRNAs located upstream and downstream (either DNA strand) of the antisense transcript were analyzed ( Figure 4A). Most senseantisense transcript pairs do not show a strong expression correlation but those that do, tend to have a positive correlation. We note that the levels of positive correlation vary by developmental stage ( Figure 4B). For the antisense transcripts that have a negative correlation with the sense mRNA, we note that the antisense transcripts tend to have their expression peak in extracellular stages ( Figure 4B). To explore the possibility of a repressive role for these negatively correlated anti-sense transcripts in oocysts, we examined GO enrichment for the corresponding sense mRNAs. No significant GO functions were enriched because most sense mRNAs are uncharacterized protein-encoding genes. However, one sense-antisense pair is noteworthy. Cp_lnc_7-RA is highly expressed in oocyst/ sporozoites and negatively correlated with its sense mRNA CPATCC_0034040 (Guanine nucleotide exchange factor/ Initiation factor 2 subunit family) with Pearson's r of −0.90. This protein facilitates and controls protein synthesis in eukaryotes by mediating guanine nucleotide exchange (Gordiyenko et al., 2014). We note that CPATCC_0034040 is silenced in oocyst/sporozoites and becomes active in later stages. Further experiments are needed to determine if Cp_lnc_7-RA plays any role in this regulation. Compared to expression of random gene pairs, antisense transcripts and their upstream mRNAs have a higher positive correlation of expression level on average ( Figure 4A), despite the fact that potential read-through transcripts have already been removed (309 out of 363 antisense genes are arranged in a headto-head orientation with the corresponding upstream mRNA). 110 antisense genes have high positive expression correlation with their upstream mRNAs (r > 0.7) while only 69 and 89 correlated with the downstream and sense mRNAs, respectively. LincRNAs also show a higher positive expression correlation with upstream mRNAs ( Figure 4C). One possible explanation for the higher positive transcription correlation between lncRNAs and upstream mRNAs in C. parvum is the potential existence of bidirectional promoters. Bidirectional promoters have been reported in many organisms especially species with compact genome sequences. In Giardia lamblia, bidirectional transcription is considered to be an inherent feature of promoters and contributes to an abundance of antisense transcripts (Teodorovic et al., 2007).
When antisense expression is compared to downstream mRNA expression, more negative correlations are observed relative to antisense-sense mRNA and antisense-upstream mRNA pairs. This may suggest the presence of transcriptional interference such as transcriptional machinery collision between antisense and its downstream mRNA (Hobson et al., 2012;Kindgren et al., 2018) as one possible explanation.
Finally, we note that a large proportion of antisense transcripts (102 out of 363) and their corresponding sense, upstream and downstream neighboring mRNAs do not show significant expression correlations, i.e. their r values are between −0.51 and 0.56 ( Figure 4A; Supplementary Table  S7). Seven of these uncorrelated transcripts have been reported to be delivered into the host cell nucleus (Wang et al., 2017b) (Supplementary Table S7). Another reason for a lack of correlation may stem from possible roles in posttranscriptional and translational processes rather than transcriptional regulation.

Many lncRNAs Are Conserved Between
Cryptosporidium parvum, Cryptosporidium hominis, and Cryptosporidium baileyi Evolutionary conservation of a lncRNA can imply functional importance. lncRNAs can be conserved in different dimensions: the sequence, structure, function, and expression from syntenic loci (Diederichs, 2014). lncRNAs are considered to be poorly conserved at the primary sequence level between genera as reported in many higher eukaryotes. Here, we looked for expression conservation of C. parvum lncRNA in two other Cryptosporidium species from syntenic loci (observed conserved antisense expression from orthologous genes) with available stranded RNA-Seq data that include C. hominis, a very close relative of C. parvum and C. baileyi, a distant relative that infects birds (Slapeta, 2013). First, we assembled the oocyst/sporozoite transcriptome (the only stranded samples that exist) of C. hominis 30976 and C. baileyi TAMU-09Q1 by the same methods as used previously except we did not use reference genome annotation guidance (-G). A total of 167 C. parvum antisense lncRNAs were detected in both C. hominis and C. baileyi (Supplementary Table S8). Of these, 10 are putatively conserved in P. falciparum (Supplementary Table S9) based on the presence of antisense lncRNA expression of the orthologus sense gene in P. falciparum (Broadbent et al., 2015). The sense mRNAs were related to functions that included a methyltransferase protein, a palmitoyltransferase and a copper transporter. No significant sequence similarities were detected among the antisense lncRNAs of these orthologs in C. parvum and P. falciparum. This is not surprising given the evolutionary distance and AT bias of P. falciparum.
Since the samples of C. hominis 30976 and C. baileyi TAMU-09Q1 were from oocysts/sporozoites, we focused on 48 of the 167 conserved C. parvum lncRNAs that showed a higher expression level in the extracellular stages ( Figure 5). The corresponding sense mRNAs were involved in various biological processes. Translation related functions, including translational initiation (cgd7_2430, translational initiation factor eIF-5) and protein folding (cgd2_1800, DnaJ domain-containing protein), were seen in the sense mRNAs. A positive correlation of 0.78 and a negative correlation of −0.78 was calculated for cgd7_2430 senseantisense pair and cgd2_1800 sense-antisense pair, respectively. In addition, a few mRNAs that encode putative secreted proteins (cgd5_10 and cgd4_3550) also showed a high positive correlation of expression with the corresponding antisense.

lncRNA Prediction Validation
In the RNA-Seq data, Cp_lnc_51 was expressed in oocysts/ sporozoites while the associated sense mRNA cgd1_380 (Ubiquinone biosynthesis protein COQ4) was seen to be silenced the RNA-Seq data. The expression levels for the sense and antisense transcripts were validated by stranded RT-qPCR ( Figure 6A). To validate lncRNA by strand-specific RT-PCR (Ho et al., 2010), a specific RT primer was designed for each gene to generate the appropriate stranded cDNA (Supplementary Table S2). Cp_lnc_51 contains an intron as does its sense mRNA. Splicing of Cp_lnc_51 was confirmed by RT-PCR and agarose gel electrophoresis to assess the transcript size ( Figure 6B), however, the splicing is observed in all RT-PCR products. The existence of splicing was also observed in the sense transcript ( Figure 6B). We randomly selected five additional lncRNAs that were expressed in RNA-Seq data for oocyst/sporozoites for validation. Two out of the five, Cp_lnc_82 ( Figure 6C) and Cp_lnc_93 ( Figure 6D) were validated by qPCR. The relative expression of sense-antisense is consistent with the RNA-Seq data.

DISCUSSION
In this study, we utilized stranded RNA-Seq data from multiple time points during parasite development to systematically identify and characterize lncRNAs in C. parvum. 396 highconfidence lncRNAs were identified, 363 (91.7%) occur as antisense transcripts to mRNAs and 33 are encoded in intergenic locations. Nearly 10% of predicted mRNAs are covered by an antisense lncRNA. This level of antisense transcription suggests an important function for lncRNA in C. parvum. The lncRNAs were analyzed to determine expression profiles, promoter motifs for coordinately expressed transcripts, transcriptional relationships with upstream and downstream mRNAs and conservation among three Cryptosporidium species with stranded RNA-Seq data available.
To investigate the expression relationship of lncRNAs and their neighboring mRNA encoding genes, we calculated the expression correlation of different types of gene pairs by Pearson coefficient and noticed a higher positive correlation of expression between lncRNAs and their upstream mRNAs (primarily from the opposite strand) compared to random gene pairs. Notably, spurious correlations of gene expression can happen if the biological variation among samples is too large. Due to the challenge of in vitro culture for C. parvum and very low number of the parasite transcripts compared to their host cells, samples from the early intracellular stages are rare and usually contain very low levels of parasite transcripts. In this study, transcriptome data from early (< 24 h) intracellular timepoints are absent. We detected a bimodal shape for the distribution of expression correlation with random gene pairs showing trends at both high positive and negative values, probably due to spurious correlation. However, antisense transcripts showed much less negative correlation of expression with upstream than random gene pairs. One possible explanation for the higher positive correlation between lncRNA and their neighboring mRNAs would be the presence of bidirectional promoters in C. parvum. Another possibility is that lncRNAs function as positive regulators of neighboring mRNA expression. In P. falciparum, ncRNAs derived from GC-rich elements that are interspersed among the internal chromosomal var gene clusters are hypothesized to play a role in var gene activation while the mechanism is unclear (Guizetti et al., 2016;Barcons-Simon et al., 2020). lncRNAs have also been associated with chromatin remodeling to achieve transcriptional regulation in many studies (Li et al., 2020). One example is a lncRNA named HOTTIP which is transcribed from the 5′ tip of the HOXA locus that coordinates the activation of HOXA genes by maintaining active chromatin (Wang et al., 2011).
Antisense transcripts have a strong bias towards covering the 3′ end of the sense mRNA. This property has also been reported in other organisms, including the malaria pathogen P. falciparum Broadbent et al., 2015). As these authors suggest, one possibility is that antisense RNAs arise from promiscuous transcription initiation from nucleosome depleted regions . It is also known that the 3′ UTR of mRNAs can contain elements that are important for transcript cleavage, stability, translation and mRNA localization. The 3′ UTR serves as a binding site for numerous regulatory elements including proteins and microRNAs (Jia et al., 2013;Tushev et al., 2018). In humans, the antisense KAT5 gene has been reported to promoted the usage of distal polyA (pA) site in the sense gene RNASEH2C, which generated a longer 3′ untranslated region (3′ UTR) and produced less protein, accompanied by slowed cell growth (Shen et al., 2018). Whether the 3′ end bias of antisense expression related to its function and translation repressor need further investigation.
The enriched expression of lncRNAs in extracellular and late intracellular stages, the time point when the parasite expresses sexual commitment and produces gametes, suggests potential critical roles that lncRNAs may play during these life cycle stages. We specifically analyzed the transcription correlation between sense-antisense pairs in extracellular stages and late intracellular stages. Both significant positive and negative correlations between mRNA sense-antisense transcript pairs are observed in C. parvum which indicates that diverse RNA regulation strategies exist. Antisense transcripts/transcription as a negative regulator of sense mRNAs has been reported in many organisms including P. falciparum (Filarsky et al., 2018), which could also hold true in C. parvum for the negatively correlated sense-antisense pairs. It has been also reported that lncRNAs can regulate translation by stabilizing mRNAs, and in some cases triggering the translation process by interactions with the associated machinery (Li et al., 2020). For transcriptionally positively correlated sense-antisense transcript pairs, we raise the testable hypothesis that lncRNAs may be involved in stabilizing transcripts that may be preloaded in macrogamonts that will eventually become oocysts. To test this hypothesis, one future direction is to take advantage of single-cell sequencing approaches and look at the transcriptomic details of male and female gametocytes. It will be interesting to see if lncRNAs are specifically expressed in male and female gametocytes and whether or not some lncRNAs are restricted specifically to these gamonts or if they are also detected elsewhere, e.g., in oocysts where they could, perhaps, have a role in transcriptional or post-transcriptional gene regulation, or mRNA stability. The amount of active transcription as opposed to RNA pre-loading in oocysts is not known.
Twenty-two C. parvum lncRNAs have been detected in the host cell nucleus (Wang et al., 2017a). Of these, 18 were detected in this study. Motif analysis was conducted on the exported lncRNA transcript sequences, but no significant similarity or motif was detected relative to the larger pool of lncRNA candidates identified in this study. This raises the question of what signal is responsible for lncRNA export. Further studies are needed. A significant roadblock in lncRNA research is the determination of their function. Genetics studies are particularly tricky with antisense transcripts of the sequences overlap coding sequences closely, as they do in C. parvum because genetic alterations of the sequence affect the sense and antisense transcripts. lncRNAs with similar functions often lack sequence similarity (Kirk et al., 2018). Many known lncRNAs function by interacting with proteins. Proteins often bind RNA through short motifs (three to eight bp in length) (Ray et al., 2013). It was hypothesized that lncRNAs with shared functions should harbor motif composition similarities (Kirk et al., 2018). In this study, the difference in nucleotide composition between antisense RNA and lincRNA transcripts gives rise to the speculation that the machineries interacting with antisense and intergenic lncRNAs may be different.
lncRNA prediction using short reads in organisms with compact genome sequences, including C. parvum is limited due to the difficulty of separating independent lncRNA transcription from neighboring transcriptional read-through noise. In this study, we used a customized pipeline with strict criteria designed to minimize false positives from background noise such as transcriptional read-through. Many antisense transcriptions A B D C FIGURE 6 | lncRNA candidate expression and intron structure validation. (A) The expression levels of lncRNA Cp_lnc_51 and its corresponding sense mRNA CPATCC_0039020 were validated by RT-qPCR of RNA from oocyst/sporozoite the expression was normalized to 18S expression levels. The annotated gene model is shown on the top with RNA-Seq reads mapped to the genomic region. Location of RT primers and PCR primers for each gene are shown with the gene models. Reads are separated by the mapped strand: forward strand (F) and reverse strand (R). The intron structure of Cp_lnc_51 is indicated by the presence of split RNA-Seq reads. (B) Agarose gel electrophoresis of RT-PCR products for the sense and antisense transcripts. Splicing of the Cp_lnc_51 transcript and its sense mRNA CPATCC_0039020 are supported by detection of both spliced and unspliced forms for each of the transcripts of the expected sizes. The expected size of the products with/without their intron is indicated next to the gene name in panel A. 18S is used as positive control with expected size of 239bp. Control 1 is a negative control without the RT primer for CPATCC_0039020. Control 2 is a negative control for CPATCC_0039020 with both RT and PCR primers but no RNA template added. (C) The expression levels of lncRNA Cp_lnc_82 and its corresponding sense mRNA CPATCC_0030480 were validated by RT-qPCR. The expression was normalized to 18S expression levels. (D) The expression levels of lncRNA Cp_lnc_93 and its corresponding sense mRNA CPATCC_0030780 validated by RT-qPCR. The expression was normalized to 18S expression levels. The RNA-Seq coverage in C and D is with range 0-100 CPM (counts per million reads mapped). overlap all or most of the sense mRNA transcript. To further improve the discovery of full-length lncRNA and any isoforms, long-read approaches such as Iso-Seq (Pacific Biosystems) and single molecule pore-sequencing approaches [Oxford Nanopore Technologies (ONT)] are needed. Although obtaining sufficient high-quality RNA from intracellular stages is still challenging, hybrid capture approaches (Gnirke et al., 2009;Amorim-Vaz et al., 2015) can be utilized to obtain Cryptosporidium RNA to be used for direct RNA sequencing on the ONT platform providing additional insights into the RNA biology of Cryptosporidium. Long-read sequencing also enables better annotation of mRNA UTR boundaries (Chappell et al., 2020) which can be used to further investigate the 3′ bias (relative to the sense transcript) of antisense transcription observed in this study.
The RNA-Seq samples from intracellular stages used in this study suffered from host contamination though we set thresholds for sample selection. For extracellular samples (sample ID 1-19 for C. parvum), mapping rate ranges from 76%-97% with the majority >95%. The low mapping rate dropped below 4% in intracellular. Low numbers of parasite-specific reads can lead to an underestimated number of lncRNAs, with lowly expressed lncRNAs possibly being undetectable. On the flip side, a lower false positive prediction rate is expected since the goal is to primarily characterize high-quality lncRNAs with good expression level by setting thresholds to filter out low-quality candidates.
Another limitation of this study results from the fact that the lncRNAs discussed here are exclusively generated from conventional poly(A)-enriched libraries. Thus, only polyadenylated transcripts will be identified. While many lncRNAs in the literature share characteristics of mRNAs, such as RNA polymerase II-mediated transcription, a 5′ 7-methylguanosine cap and a 3′ polyadenylated tail (Sun et al., 2018), many nonpolyadenylated lncRNAs have been reported including enhancerderived lncRNAs  and RNA Polymerase III transcribed lncRNAs (Sun et al., 2018). Thus, determination of the full repertoire of lncRNAs requires an analysis of total RNA. The contributions of polyA+ and polyA-lncRNAs to the complete repertoire of lncRNA remains unclear including in model organism species. This is in part due to the instability of polyA-RNAs  and the caveats of Illumina short read sequencing coupled with transcriptome assembly algorithms that predict transcripts (small RNAs excluded) which suffer from low 3′ and 5′ coverage (Uszczynska-Ratajczak et al., 2018). In C. parvum, the non-poly(A) selected library suffered from high levels of rRNA contamination (data not shown) since no commercial kits specific for apicomplexan or other protist parasite rRNA removal are available. In an rRNA depleted non-poly(A)+ selected total RNA library from C. parvum oocysts/sporozoites generated by our lab (data not shown), we were able to detect 134 (82.7%) out of the 162 antisense transcripts with peak expression in extracellular stages found in this study suggesting that we have not missed a huge population of non-poly(A) transcripts by focusing on the available poly(A)+ libraries (Li et al., in prep). Only~20 new potential nonpolyadenylated lncRNAs were detected in the rRNA depleted library. This finding suggests the possibility that lncRNAs in C. parvum are primarily polyadenylated.
It is important to understand how species evolve and adapt to their environment. Due to the poor conservation of lncRNA reported in higher eukaryotes (Johnsson et al., 2014) and the large phylogenetic distance among Cryptosporidium species (Slapeta, 2013), it is noteworthy that many lncRNAs detected in C. parvum were also seen expressed in C. baileyi. It indicates that RNA regulation could be a common and critical strategy for Cryptosporidium gene regulation or interactions with their hosts. The conserved antisense RNA expression between C. parvum and P. falciparum orthologs did not share detectable sequence similarity. However, the structure of lncRNAs could be under stronger selection than their sequence. Since most lncRNAs are antisense in C. parvum, to separate conservation of lncRNA from the conservation of mRNA sequence could provide further insights into lncRNA evolution. Selection pressures that independently act to maintain sequence and secondary structure features can lead to incongruent conservation of sequence and structure. As a consequence, it is possible that analogous base pairs no longer correspond to homologous sequence positions. Thus, possible selection pressures independently acting on sequence and structure should be taken into account . Despite the increasing acknowledgment that ncRNAs are functional, tests for ncRNAs under either positive or negative selection did not exist until recently (Walter . This type of analysis will assist in identifying candidates to prioritize for further functional lncRNA investigations.

DATA AVAILABILITY STATEMENT
The data sets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

AUTHOR CONTRIBUTIONS
YL, RB, and JK conceptualized the work. BS and AS provided the Cryptosporidium parvum RNA and access to stranded libraries pre-publication. YL and JK wrote the manuscript, and all authors edited the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
This work and publication fees were funded by NIH NIAID R21AI144779-01A1 to JK.