Improved Reference Genome Annotation of Brassica rapa by Pacific Biosciences RNA Sequencing

The species Brassica rapa includes several important vegetable crops. The draft reference genome of B. rapa ssp. pekinensis was completed in 2011, and it has since been updated twice. The pangenome with structural variations of 18 B. rapa accessions was published in 2021. Although extensive genomic analysis has been conducted on B. rapa, a comprehensive genome annotation including gene structure, alternative splicing (AS) events, and non-coding genes is still lacking. Therefore, we used the Pacific Biosciences (PacBio) single-molecular long-read technology to improve gene models and produced the annotated genome version 3.5. In total, we obtained 753,041 full-length non-chimeric (FLNC) reads and collapsed these into 92,810 non-redundant consensus isoforms, capturing 48% of the genes annotated in the B. rapa reference genome annotation v3.1. Based on the isoform data, we identified 830 novel protein-coding genes that were missed in previous genome annotations, defined the untranslated regions (UTRs) of 20,340 annotated genes and corrected 886 wrongly spliced genes. We also identified 28,564 AS events and 1,480 long non-coding RNAs (lncRNAs). We produced a relatively complete and high-quality reference transcriptome for B. rapa that can facilitate further functional genomic research.


INTRODUCTION
Brassica rapa is a member of the genus Brassica, a plant genus, widely used as vegetables (such as turnip, Chinese cabbage, pak choi, caixin), fodders and oilseed crops (e.g., turnip rape and sarson). As one of the diploid species in the genus Brassica, B. rapa was the first species for which complete genome sequencing was performed in 2011 (Wang et al., 2011). The reference genome was updated twice in 2017 and 2018 by using higher depth long reads generated from third-generation sequencing (TGS) and High-through chromosome conformation capture (Hi-C) technology, respectively Zhang et al., 2018). In addition to the reference genome based on Chinese cabbage accession Chiifu-401-42, a number of reference genomes of yellow sarson (Belser et al., 2018) and pak choi (Li et al., 2020;Li et al., 2021) as well as a graphic genome consisting of 18 B. rapa accessions  have been published in recent years, providing valuable resources for genetic research on B. rapa crops. In contrast to the continuous improvement of the reference genomes of B. rapa, annotations for the structural integrity of genes such as the untranslated regions (UTRs) of genes, information concerning alternative splicing (AS), alternative polyadenylation (APA), and long non-coding RNAs (lncRNAs) at the transcriptome level are still relatively scarce.
The eukaryotic transcriptome is highly complex. As two important post-transcriptional regulation mechanisms, AS and APA make significant contributions to the enrichment of the transcriptional diversity. More than 98% of human multi-exon genes occur with AS (Wang et al., 2008), and these genes tend to express many splicing isoforms simultaneously (Djebali et al., 2012). In Arabidopsis, the percentage of AS in genes is as high as 61% (Zhang et al., 2010;Marquez et al., 2012). APA can also generate transcript variants with different 3 ends (Shen et al., 2011;Elkon et al., 2013). More than 70% of genes in Arabidopsis and 48% of expressed genes in rice contain multiple polyadenylation sites Elkon et al., 2013).
Long non-coding RNAs are longer than 200 nucleotides (nt) in length, in contrast to small RNAs. LncRNAs are classified into three major types, the long intergenic non-coding RNAs (lincRNAs), intronic non-coding RNAs (incRNAs) derived from introns, and natural antisense transcripts (NATs) transcribed from the complementary DNA strand of their associated genes (Chekanova, 2015). LncRNAs have been shown to play critical regulatory roles in most eukaryotes. At present, most lncRNAs are studied in the context of protein-coding gene regulation and as a result can be functionally or mechanistically connected to mRNA expression (Wierzbicki et al., 2021). For B. rapa, 2,237 lncRNAs were identified from expressed sequence tag (EST) sequences (Paul et al., 2016), while 2,088 lncRNAs were identified based on the short-read transcriptome sequencing (Shea et al., 2019). Since StringTie assembly of transcripts was involved in these studies, mistakes could inevitably be included in the annotation of lncRNAs.
Isoform sequencing (Iso-Seq) by long reads is a revolutionary technology for plant transcriptome research. At present, single molecule real-time (SMRT) sequencing by Pacific Biosciences (PacBio) and nanopore sequencing by Oxford Nanopore Technologies Inc., (ONT) provide full-length transcripts up to 15∼25 kb for PacBio and >30 kb for ONT from the 5 cap to the polyA tail, thereby avoiding read reconstruction from local information (Oikonomopoulos et al., 2020). The method can retrieve most of the expressed transcripts as full-length sequences, alternative isoforms, and duplicated genes, therefore providing more accurate evidence to guide delimitation of UTRs, introns, exons, and AS junctions through genomic alignment. The continuous sequences guarantee better accuracy of gene annotation compared to ESTs (Adams et al., 1992), RNA-Seq (Grabherr et al., 2011), and homology inference (Jarvis et al., 2017). Because of these advantages, PacBio sequencing has been used to investigate full-length transcripts and to construct reference gene sets in plant species including rice (Zhang et al., 2019), maize (Li et al., 2014b), moso bamboo , and Brassica napus (Yao et al., 2020).
The present study analyzed the full-length transcriptome of the Chinese cabbage accession Chiifu-401-42, the strain used for previous B. rapa reference genome assembly. The genome was derived from five different tissues using PacBio SMRT sequencing. The updated genome annotation provides a reference gene set for future functional genomic studies. Furthermore, improved gene annotations could contribute to better understanding of the complexity of the B. rapa genome and serve as a reference sequence for gene functional studies.

Plant Materials and Growth Conditions
Chinese cabbage accession Chiifu-401-42 (referred as Chiifu), which had been used for B. rapa reference genome assembly, was used in this study. Five different tissues including roots, leaves (NV and V), flowers, petioles, and seeds (Supplementary Table 1) were used for Iso-Seq library construction. Leaves, petioles and roots were collected from plants grown for 21 days in a greenhouse in Haidian District, Beijing. Then, the 21-days seedlings were transferred to a growth chamber for vernalization (V) (12 h light/12 h dark, 4 • C day/night) of 7 days (V1), 14 days (V2), 21days (V3), 28 days (V4), 35 days (V5), and 50 days (V7). Leaves, petioles and roots were collected from vernalized plants. In addition, plants at the stage of heading formation were sampled for leaves and petioles as well as roots. Inflorscences were collected from blooming plants. Seeds were germinated before they were used for RNA isolation.
For RNA-seq library construction, germinated seeds were sown in pots filled with 2:1 mixture of peat and vermiculite and grew in a growth chamber for 28 days under normal condition (12 h light/12 h dark, 21 • C day/18 • C night temperature). Then the plants were vernalized for 7 (V1), 14 (V2), 21 (V3), and 28 (V4) days under 4 • C. Leaves were collected from plants grown under non-vernalized (NV) and vernalization conditions (V).

Isoform Sequencing Library Preparation and Sequencing
The libraries were prepared according to the Iso-Seq protocol using the Clontech SMARTer PCR cDNA Synthesis Kit and the BluePippin Size Selection System protocol as described by PacBio (PN 100-092-800-03). Two libraries, Library I and Library II (Supplementary Table 1), were each constructed with random-size cDNA fragments >4 kb in length. The libraries were subsequently sequenced on the PacBio RS II platform using P6C4 polymerase enzyme.

RNA-Sequencing Library Preparation and Sequencing
Total RNA isolated from leaves was sent to Novogene 1 for transcriptome sequencing. RNA-seq libraries were constructed by a NEBNext R Ultra TM RNA Library Prep Kit (NEB, Beverly, MA, United States). Five qualified libraries (NV, V1, V2, V3, and V4) were sequenced on the Illumina NovaSeq platform and 150 bp paired-end reads were generated.

Data Collected From Public Databases
The reference genomic sequences of B. rapa v3.0 (Zhang et al., 2018), genome annotation (V3.1), 2 as well as genomic sequences and protein sequences of three related species, A. thaliana (TAIR10), B. oleracea (JZS_v2.0), and B. nigra (NI100_v2.0), were downloaded from BRAD 3 . In addition, a total of 31G published Illumina-based RNA-seq datasets of Chiifu were collected form NCBI, 4 including data generated from a variety of tissues (Supplementary Table 2).

Pacific Biosciences Sequencing Data Processing and Error Correction
The PacBio-seq raw reads were filtered into circular consensus sequence (CCS) subreads using SMRTLink. 5 After quality evaluation, CCS subreads were processed to generate the ROIs with a minimum full pass of 1 and minimum accuracy of 95. The pipeline then classified the ROIs in terms of fulllength non-chimeric (FLNC) and non-full-length reads. This was done by identifying the 5 and 3 adapters used in the library preparation as well as the poly (A) tail. Only reads that contained all three in the expected arrangement and did not contain any additional copies of the adapter sequence within the DNA fragment were classified as FLNC reads. Then, the FLNC reads were corrected using LoRDEC (version 0.9) (Salmela and Rivals, 2014) and an RNA-seq dataset consisting of 171,308,229 pairs of reads form six different tissues (callus, root, stem, leaf, flower and silique) (Tong et al., 2013).

Alignment of Pacific Biosciences Reads to the Brassica rapa Reference Genome and Collapse
All FLNC reads were mapped to the B. rapa reference genome using minimap2 (version 2.15) (Li, 2018) software with the options: -ax splice, -uf, -C5, -O6, and 24 -B4. Only the best alignments (the largest alignment with the lowest number of mismatches against the reference genome) were kept. Then the retained FLNC reads were collapsed into full-length nonredundant consensus isoforms using a Python script 6 in the Cupcake program (version 19.0.0) (:-supporting-scripts-for-Iso-Seq-after-clustering-step#collapse) 7 with parameters of coverage >0.90 and identity >0.90.

Expression Level Estimation
We used fastp (Chen et al., 2018) software with default options of quality control of the RNA-seq data. The highquality clean reads were aligned to the B. rapa reference genome using HISAT2 (version 2.1.0) (Pertea et al., 2016).
First, we built a HISAT2 index and aligned the reads for each sample to the reference genome with the default options. Then, we sorted and converted the SAM files to BAM files using SAMtools (version 1.6.3) (Li et al., 2009). Finally, the quantification of isoforms and genes was performed using StringTie (version 2.0.4) (Pertea et al., 2015) based on the GTF/GFF3 annotation file generated by PacBio sequencing, a gene annotation file, and a BAM file containing aligned reads information. The expression levels were measured and normalized as transcripts per kilobase of exon model per million mapped reads (TPM).

Novel Gene Annotation
The isoforms that had no overlap with any annotated regions were defined as candidate novel transcripts. Then, their coding ability was predicted using TransDecoder (version 5.5.0). We selected the longest read as the representative sequence for each isoform to search the following databases. For the tblastx analysis, we built a database consisting of all plant cDNAs of proteincoding genes from Ensembl Plants, release 51 (last accessed on 20210505). 8 For blastx, we used the Swiss-Prot database, the current release (last accessed on 16 Jun 2021) 9 containing 565,254 protein sequences. Finally, we ran tblastx and blastx with the criterion of e-value ≤ 10e-6.

Identification of Long Non-coding RNAs From Pacific Biosciences Data
Putative protein-coding RNAs were filtered using the minlength of sequence and min-length of the ORF thresholds. Non-protein-coding RNA transcripts longer than 200 bp and ORFs with length shorter than 100 bp were sorted as lncRNA candidates using four computational approaches, namely CPAT  (Coding Potential Assessment Tool), CPC (Kong et al., 2007) (Coding Potential Calculator), PLEK (Li et al., 2014a) (predictor of lncRNAs and messenger RNAs based on an improved k-mer scheme), and LGC (Wang et al., 2019a), all of which could distinguish proteincoding and non-coding transcripts. Only reads having consistent results from at least two software programs were considered as lncRNA candidates. To further verify the credibility of lncRNA, the remaining reads were searched against the Swiss-Prot database using blastx (E-value, 10e −6 ).

Analysis of Differential Expression of LncRNAs
The HISAT2 soft was used to align RNA-seq data against the B. rapa reference genome. The output SAM file was then sorted and converted to a BAM file using SAMtools. Next, the htseqcount script of HTSeq (Anders et al., 2015) (version 0.9.1) was used to calculate the number of reads mapping to each lncRNA, with the opitions -f bam and -s no, and the output count file was used for downstream analysis. The DESeq2 package (Love et al., 2014) (version 1.34.0) was used to identify the DELs between NV and V samples with a P-value < 0.05 and |fold change| > 2.

Validation by Reverse Transcription-Polymerase Chain Reaction
Total RNA was extracted using TRIzol reagent (TransGen) as described in the user's manual. For cDNA synthesis, 2 µg of total RNA was treated with gDNA Remove (TransGen) and then used to synthesize the first-strand cDNA using TransScript One-Step gDNA Removal and cDNA Synthesis SuperMix (TransGen). For Reverse Transcription-Polymerase Chain Reaction (RT-PCR) validation of AS events, split genes and the non-coding exons of BrSOC1.1 (BraA05g005300.3.1C) and BrSOC1.2 (BraA04g032590.3.1C), 3 µl of the cDNA was used in a reaction volume of 50 µl using 2× Rapid Taq Master Mix (Vazyme). The specific primers were designed by the Primer-BLAST Tools of NCBI 10 PCR conditions were 3 min at 95 • C followed by 40 cycles at 95 • C for 15 s, 55 • C for 15 s and 72 • C for 15 s, and final extension at 72 • C for 5 min. The PCR amplification was visualized on 2% agarose gel. The primers for the validation are listed in Supplementary Table 3.

Data Availability Statement
Brassica rapa annotated genome V3.5 is publicly available in the Brassicaceae Database (BRAD V3.0), 11 including genome file in fasta format, annotation file in GFF3 format and Gene Ontology file. All the raw Pac-Bio sequencing data have been deposited in the NCBI SRA database under the BioProject number PRJNA783071 (SRX13213456-SRX13213457). And the short read transcriptome data have also been deposited in the NCBI SRA database under the BioProject number PRJNA784003 (SRX13239619-SRX13239633). Codes to analyze the full-length transcript data are available at GitHub. 12 All softwares are available online and other analysis script are available upon request.
In this study, we combined equal amounts of total RNA from five different tissues to build two cDNA libraries of random sized fractionated (see section "Materials and Methods") for acquiring as many full-length transcripts as possible by single-molecule long-read sequencing. In total, 1,323,470 reads of inserts (ROIs) were generated and filtered using the SMRTLink software (see text footnote 5) among which 778,561 were identified as FLNC reads based on the presence of both 5 and 3 signals plus the polyA tails.
As the mapping accuracy is affected by read quality, especially for the correct identification of splicing junctions, we performed read error correction using short-read Illumina data from the SRA database (Supplementary Table 2) with LoRDEC (Long-Read DBG Error Correction, v0.9) (Salmela and Rivals, 2014). The corrected reads had a higher Benchmarking Universal Single-Copy Orthologs (BUSCO) (Seppey et al., 2019) completeness value (97.2%) compared to the uncorrected reads (88.7%) (Supplementary Figure 2).
Next, the corrected FLNC reads were mapped to the B. rapa genome using minimap2 (version 2.15) (Li, 2018). In total, 753,041 corrected reads were uniquely mapped to reference genome, including 17,962 reads being mapped to genomes of chloroplasts and mitochondria that were excluded from further analysis. Following previous studies, we defined collapsed reads as isoforms, and read clusters as loci (Wang et al., 2019b). Through this, the alignments were collapsed into 92,810 fulllength non-redundant consensus isoforms belonging to 26,163 loci. The mean length of isoforms was 2,171 bp, longer than that of annotated genes (1,155 bp).
In addition, we observed that of the 35.5% (32,962) of fulllength non-redundant consensus isoforms with less than 90% coverage of annotated genes, 3,683 were located in intergenic regions. The isoforms that could not be mapped to any annotated genes might be novel genes that had not been identified in previous studies. The low-coverage isoforms may be ncRNAs or different transcripts due to variable splicing, or possibly site mutations. For this reason, the Iso-Seq data in this study better reflected the actual transcription of genes; compared to previous annotations, this analysis provides deeper information concerning RNA processing.

Improving Brassica rapa Genome Annotation by Pacific Biosciences Sequencing
We compared the 92,810 non-redundant isoforms against the gene set from the previous genome annotation. The results showed that 59,848 isoforms (64.49%) could match (coverage >90%) to 22,830 annotated genes, including 58,180 isoforms uniquely overlapped to 21,681 annotated genes (Figure 1A, left). Among these 21,681 genes, 8,980 were covered by one isoform, 5,104 genes were covered by two isoforms, 2,827 genes by three isoforms, 1,596 genes by four isoforms, and 3,174 genes by more than four isoforms (Figure 1A, right). For the 58,180 uniquely mapped isoforms, the coding region sequences and corresponding amino acid sequences were predicted by TransDecoder software. Using the above methods, the UTR regions of 20,340 annotated genes were defined, the remaining 1,341 annotated genes were not defined because their isoforms were considered lacking both initiation codons and termination codons by prediction software.
We then compared the expression levels between the genes covered by multiple isoforms and those covered by a single isoform. A total of 31 GB of transcriptome data was downloaded from the NCBI SRA Database for comparison (Supplementary Table 2). It turned out that the genes covered by multiple isoforms were expressed at higher levels than the genes covered by a single isoform (t-test, P-value < 2e −16 , Figure 1B).
The B. rapa reference genome was annotated at the coding level in previous annotation version; therefore, non-coding exons were neglected. By aligning isoforms to the genome annotation, we detected non-coding exons for 7,436 annotated genes and corrected their gene model in v3.5. For example, there are two transcriptional forms of the SOC1 gene (AT2G45660) in A. thaliana (Araport11, (Cheng et al., 2017; Figure 2A), the main form (AT2G45660.1) is 3,620 bp in length and consists of eight exons, and the other form (AT2G45660.2) is 3,547 bp in length with seven exons caused by AS between the original sixth and seventh exons (Figure 2A). The first exon is non-coding and translation starts from the second exon to the eighth exon of AT2G45660.1. B. rapa contains three copies of SOC1, BrSOC1.1 (BraA05g005300.3.1C), BrSOC1.2 (BraA04g032590.3.1C), and BrSOC1.3 (BraA03g023940.3.1C). Transcripts of two copies, BrSOC1.1 and BrSOC1.2, were captured by Iso-Seq. The alignment results showed the previously annotated BrSOC1.1 containing a non-existing exon 7 while lacking the first exon compared with the full-length transcripts by Iso-Seq ( Figure 2B). The annotated BrSOC1.2 gene also lacked of the first exon ( Figure 2C). ORF prediction showed that the CDS starts from the second exon of both BrSOC1.1 and BrSOC1.2, in line with that of A. thaliana. We then designed primers to amplify fragments across the first and second exon for BrSOC1.1 and BrSOC1.2. The results of reverse transcription (RT)-PCR confirmed the presence of the first exon (Figures 2B,C). In addition to the two fragments in 277 and 409 bp in length in the amplification of BrSOC1.2, a fragment approximately 300 bp in length was also amplified, indicated that there could be an extra transcript model of this gene.
By aligning isoforms to the reference gene annotation, we detected the isoforms covering adjacent annotated genes in previous annotations, suggesting that a single gene was misannotated as multiple genes (Figure 3). In total, 1,540 isoforms simultaneously were found overlapping two to four annotated genes. We then filtered these isoforms according to four criteria, (i) containing intron in length longer than 10 kb; (ii) poor alignment quality (MAPQ <10); (iii) compared with the overlapped annotated genes, the first or last intron of the isoform being longer than the length of any intron of the annotated gene; and (iv) the length of the first or last exon being less than 5% of the total length of the isoform. The isoforms that satisfied any one of these four criteria were removed. After filtering, 1,175 isoforms were defined as candidate sequences. Of these, 1,052 isoforms were predicted to be protein-coding, indicating that these 1,052 isoforms had been mis-annotated in the reference gene annotation. Further transcriptional analysis of RNA-seq data using second-generation sequencing showed that 1,052 isoforms reached a level above 1 TPM, indicating that these isoforms were true transcripts. In order to further validate these gene models, we randomly selected three isoforms to design primers (Supplementary Table 3) that could amplify fragments spanning wrongly split gene pairs and performed RT-PCR using RNA isolated from Chiifu leaves (Figure 3). The successful amplification of fragments further confirmed that these genes were wrongly split. Using the above analysis, we corrected 886 wrongly split genes in the previous gene annotation into 427 genes in v3.5.
Among the 92,810 non-redundant isoforms, 3,683 isoforms were generated from 1,707 novel loci, that were absent from the reference genome annotation. Of these, 1,829 isoforms were predicted to have coding capability by TransDecoder software and were defined as candidate novel protein-coding transcripts. To further confirm these novel loci, we searched the presence of these novel isoforms in other organisms using blast (v2.9.0). Among the 1,829 isoforms, 1,825 isoforms were significantly matched against a collection of plant protein-coding gene cDNAs by tblastx (E-value < 10e −6 ), while 1,468 isoforms were detected in the blastx search against Swiss-Prot proteins (E-value < 10e −6 ). Notably, all 1,468 isoforms detected by blastx completely overlapped with the 1,825 isoforms detected by the tblastx search ( Figure 4A). Given this high ratio of overlap, the 1,825 isoforms from 830 loci detected in the PacBio sequencing data likely represented transcripts from novel protein-coding genes. Through the above analysis, 830 novel protein-coding genes were supplemented to B. rapa genome annotation v3.5.
In addition, we found that the average expression levels of these novel genes (1,825 isoforms, TPM = 7.322) were  lower than those of the previously annotated genes (46,878 gene, TPM = 21.048) based on Illumina short reads (t-test, P-value = 7.60e −67 , Figure 4B). This suggested that a low expression level might be one of the reasons for missing these genes in the previous annotation. In contrast, the average exon number of novel genes (mean = 7.43) was significantly greater than that of the previously annotated genes (mean = 5.13) (t-test, P-value = 1.045e −05 ). This indicated that the number of exons may also affect gene annotation.

Analysis of Alternative Splicing Events and Splice Isoforms
Alternative splicing is prevalent in most plant genomes. One of the most important advantages of PacBio sequencing is its ability to identify AS events by directly comparing different isoforms of the same individual gene. From the 753,041 FLNC reads collapsed to 92,810 non-redundant consensus isoforms, we identified 28,564 AS events using the Astalavista program (Foissac and Sammeth, 2015) without transcriptome assembly, in order to avoid possible artificial results.
The 28,564 AS events were then clustered into four distinct types: alternative 5 -donor (AD, 3,739), alternative 3 -acceptor (AA, 7,098), exon skipping (ES, 1,310), and intron retention (IR,16,417) (Figure 5A). Consistent with previous studies, intron retention occupied the majority of AS events (Campbell et al., 2006). However, since there are no ribosome occupancy data available, we cannot exclude the possibility that some of the retained introns were found in incompletely processed nuclear transcripts. Among the 21,681 annotated genes by Iso-Seq, 8,980 possessed a single transcript, while 12,701 possessed two or more transcripts, producing 59,596 isoforms in total, with an average of 4.69 isoforms per gene. The genes with multiple isoforms had more exons than single isoform covered genes (t-test, P-value < 2e −16 ) ( Figure 5C). Only 28.9% of genes with a single transcript contained more than five exons, however, 53.0% of genes with AS had more than five exons (Supplementary Table 4). There were cases where the same AS type occurred at different sites within a gene locus, generating different isoforms. Multiple AS events most frequently happened for IR ( Figure 5B).
In addition, we investigated the distribution of AS in the three subgenomes of B. rapa. The results showed that there was no significant difference between the frequency of AS on the dominant subgenome LF (29.8%) and that on the two submissive subgenomes MF1 (25.4%) and MF2 (27.6%), especially in cases where the gene number was higher on the LF subgenome. However, when considering only three-copy genes, we found that AS occurred significantly more on LF (228) than on MF1 (162) or MF2 (165) (ANOVA, P-value = 0.026). For the two-copy genes, AS frequency was not significantly different between LF and MFs or between the two MFs (Supplementary Table 5).
We set a sliding window of 1 Mb to investigate the distribution of variable splicing events on chromosomes. The results showed that the variable splicing events mostly occurred in the regions far from the centromere sites ( Figure 5D). On one arm of chromosome A10, the end region showed a very high frequency of AS; in contrast, on the short arm of chromosome A03, AS occurred very rarely. Chromosome A03 is acrocentric, with a heterochromatic upper (short) arm bearing the nucleolar organizer region (NOR) (Mun et al., 2010).
To verify the AS events detected above, we randomly selected four genes and designed primers (Supplementary Table 3) for the AS regions and performed RT-PCR using RNA isolated from Chiifu leaves. The amplified fragments were consistent in length with the splicing isoforms identified from Iso-Seq data ( Figure 5E).

Long Non-coding RNA Identification in Isoform Sequencing Data and Differentially Expressed Long Non-coding RNAs After Vernalization
Long non-coding RNAs, a class of non-coding RNAs are essential regulators of a wide range of biological processes. In this study, we used four computational approaches, namely CPAT , CPC (Kong et al., 2007), PLEK (Li et al., 2014a), and LGC (Wang et al., 2019a), to identify isoforms without coding capability. We eliminated the isoforms with ORFs exceeding 100 amino acids or sequences shorter than 200 bp. To obtain a high-confidence set of lncRNAs, we kept the predicted lncRNA only if it was predicted as noncoding by any two of the above four programs. The analysis generated a total of 11,054 candidate lncRNAs. Then, we screened their homology with Swiss-Prot proteins using blastx (E-value < 10e −6 ) and eliminated 7,655 isoforms homologous to proteins. Among the remaining 3,399 isoforms, 1,919 isoforms could not be defined or classified because their sequences did not satisfy the definition of known lncRNAs. Finally the retaining 1,480 lncRNAs (Supplementary Table 6) were clustered into three main types according to their distribution on chromosomes: 1,148 isoforms mapped to intergenic regions of the genome were defined as long intergenic non-coding RNAs (lincRNAs); 289 isoforms mapped to the anti-sense strand of a known gene locus were defined as NATs; 43 isoforms mapped to the intronic regions of the genome were defined as intronic non-coding RNA (incRNAs) (Figure 6A). The average length of the lncRNAs was 1,402 bp (range, 0.2-7.8 kb), which was longer than the 497 bp from a previous study (Paul et al., 2016). The length of most lncRNAs was between 200 and 2,000 bp (82.5%), and the number of lncRNAs decreased as the length increased (Supplementary Figure 3A). Most of the lncRNAs (1,201) consisted of one or two exons (Supplementary Figure 3B).
Three cold-responsive non-coding RNAs [COOLAIR (Swiezewski et al., 2009), COLDAIR (Heo and Sung, 2011), and COLDWRAP (Kim and Sung, 2017)] were reported within the FLC locus in A. thaliana (Figure 7A). Five NATs (Li et al., 2016) and a single NAT (Shea et al., 2019) were detected at the BrFLC2 locus of B. rapa (Li et al., 2016;Shea et al., 2019; Figure 7B) in two independent studies. We examined the non-coding RNAs for the four BrFLC loci, and only detected one NAT of length of 1,022bp (PB25365.2) on the BrFLC1 (BraA10g027780.3.1C) locus ( Figure 7C); this was named BrFLC1as1022 based on its length. We used the same transcriptome data to analyze the expression level of BrFLC1as1022. It turned out that the expression of BrFLC1as1022 was highest in NV and that the expression level continually declined during vernalization ( Figure 7D).

Update of Brassica rapa Reference Genome Annotation
In summary, in this study, we relied on the B. rapa genome to upgrade genome annotation based on the v3.1 annotation version, including refining the gene structure of 20,340 previously annotated genes, identifying gene UTR regions, and adding unannotated non-coding exons. We corrected 886 wrongly split genes into 427 genes. A set of 830 novel genes missed in the previous annotation was eventually combined with 25,652 annotated genes that were not captured by full-length sequencing, resulting in a total of 47,249 coding genes in the new v3.5 annotation and released in BRAD. 13

DISCUSSION
Continuous refinement of annotation is a prerequisite for correctly interpreting the functional elements of the genome. Taking advantage of long read lengths from the PacBio sequencing technology, we generated a comprehensive full-length transcript dataset of B. rapa by pooling RNA from five tissues to construct two cDNA libraries. The new v3.5 annotation consists of 47,249 protein coding genes (830 novel genes), 13,550 TE containing loci, and 1,480 lncRNAs. The annotated loci span 118.6 Mb, an increase of 17.1 Mb from the previous B. rapa reference genome annotation v3.1.
The previous B. rapa annotations included only a single transcript for each gene, with no information concerning AS isoforms. In this study, we identified 28,564 AS events for 11,870 protein coding genes from Iso-Seq data without the aid of assembly. These genes accounted 25.1% of the total number of annotated genes in v3.5, providing valuable information for functional analysis and better explaining the richness of gene expression. The ratio of genes with AS isoforms was lower than that reported in Araport11 of A. thaliana (Cheng et al., 2017). The estimated frequency of AS forms increases along with the sampling depth, i.e., with the average number of transcripts sequenced for a gene (Zavolan et al., 2003). We believe that annotation of AS isoforms in this study is not saturated, since we only used quite limited types of tissue from plants grown only under normal or vernalization conditions for PacBio sequencing.
In this study, we found that the isoform number was positively correlated with both exon number and gene expression level. This differed from previous reports. There was a positive correlation between the number of exons and isoforms, but the mean expression levels of AS genes were slightly lower than those of non-AS genes in strawberry (Yuan et al., 2019). In maize, the number of isoforms in each gene was positively correlated with the number of exons but had no relationship with the gene expression level (Wang et al., 2016). However, we found that the number of isoforms per gene was positively correlated with exon number (Pearson, r = 0.295, P-value < 0.001) as well as the expression level (Pearson, r = 0.124, P-value < 0.001) in B. rapa. We deduced that along with the increase of exon numbers, the number of splicing sites increases, leading to the formation of more isoforms. From the existing data, we found a total of 256 splicing motifs, among which the main type was the typical GT-AG pattern, accounting for 41.5%, while most of the other splice junctions accounted for less than 1% each. Moreover, such atypical splicing junctions generally occurred in genes with multiple exons (number >5).
In previous studies, 2,237 (Paul et al., 2016) and 2,088 (Shea et al., 2019) lncRNAs were identified in B. rapa using cDNA sequences and Illumina short reads, respectively. In this study, we identified 1,480 lncRNAs in total, fewer than in previous reports. Since the sequences of the lncRNAs reported by Paul et al could not be accessed, we only compared 1,480 lncRNAs with the 2,088 lncRNAs reported by Shea et al. (2019). The blast results showed similarities (E-value < 10e −6 ) in a total of 636 lncRNAs, including 559 lincRNAs, 66 NATs and 11 incRNAs. Shea et al. (2019) compared expression level of lncRNAs in Chinese cabbage leaves with and without vernalization and detected 410 of the 1,444 lincRNAs, 120 of the 551 NATs, and 19 of the 93 incRNAs that was differentially expressed. In our study, we detected many fewer differentially expressed lncRNAs between vernalized and non-vernalized plants. However, the detection of the 209 lncRNAs showed a unique differential expression pattern in response to cold stress, suggesting that these lncRNAs are closely involved in governing the cold-stress responses of B. rapa.
COOLAIR has been suggested to play an important role in the regulation of FLC during vernalization in A. thaliana (Hawkes et al., 2016). Li et al. (2016) isolated five NATs for BrFLC2 by RACE-PCR and grouped them into two classes according to polyadenylation sites in intron 6 or within the promoter of BrFLC2. The NAT BrFLC1as1022 we detected at the BrFLC1 locus by Iso-Seq was polyadenylated within intron 6; therefore, it should belong to Class I. In the study (Shea et al., 2019), a COOLAIR-like transcript at the BrFLC2 locus was detected and named as BrFLC2as. The expression of BrFLC2as was up-regulated upon short-term vernalization. We evaluated the expression of BrFLC2as using our RNA-seq data; the results showed that BrFLC2as expression decreased along with the extension of vernalization period similar to that of BrFLC1as1022 detected by Iso-Seq ( Figure 7D). Hawkes et al. (2016) reported the structural conservation of COOLAIR across Brassicaceae species, and proposed a functional role for COOLAIR transcripts rather than, or in addition to, antisense transcription. To our knowledge, BrFLC1as1022 is the first reported NAT for BrFLC1. This finding refutes the hypothesis proposed by Li et al. (2016), that BrFLC2 NATs contribute to the repression of all of the BrFLC genes in B. rapa.
Taken together, this work greatly improves existing gene models in B. rapa. Despite the substantial improvement in genome annotation v3.5, the work should be ongoing. With the aid of advances in sequencing technology and bioinformatics analysis, and additional samples from developmentally specific stages and different environmental stimuli, discovery of novel features will be included to update B. rapa genome annotation in the future.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are publicly available. This data can be found here: B. rapa annotated genome v3.5 has been published in the Brassicaceae Database (BRAD V3.0; http://brassicadb.cn/). The corresponding NCBI Bioproject encapsulating the Pac-Bio sequencing data and short read transcriptome data are PRJNA783071 and PRJNA784003, respectively.

AUTHOR CONTRIBUTIONS
XW and JW conceived the study. ZZ and XC performed data collection and bioinformatics analysis. JG and XX participated in Chiifu culture processing, RNA extraction and sequencing. YL conducted RT-PCR experiments. JW and ZZ wrote the manuscript. JW, XW, RL, JL, and ZZ revized the manuscript. All authors read and approved the final manuscript.