DASES: a database of alternative splicing for esophageal squamous cell carcinoma

Esophageal carcinoma ranks as the sixth leading cause of cancer-related mortality globally, with esophageal squamous cell carcinoma (ESCC) being particularly prevalent among Asian populations. Alternative splicing (AS) plays a pivotal role in ESCC development and progression by generating diverse transcript isoforms. However, the current landscape lacks a specialized database focusing on alternative splicing events (ASEs) derived from a large number of ESCC cases. Additionally, most existing AS databases overlook the contribution of long non-coding RNAs (lncRNAs) in ESCC molecular mechanisms, predominantly focusing on mRNA-based ASE identification. To address these limitations, we deployed DASES (http://www.hxdsjzx.cn/DASES). Employing a combination of publicly available and in-house ESCC RNA-seq datasets, our extensive analysis of 346 samples, with 93% being paired tumor and adjacent non-tumor tissues, led to the identification of 257 novel lncRNAs in esophageal squamous cell carcinoma. Leveraging a paired comparison of tumor and adjacent normal tissues, DASES identified 59,094 ASEs that may be associated with ESCC. DASES fills a critical gap by providing comprehensive insights into ASEs in ESCC, encompassing lncRNAs and mRNA, thus facilitating a deeper understanding of ESCC molecular mechanisms and serving as a valuable resource for ESCC research communities.


Introduction
Esophageal carcinoma (EC), a type of malignant tumor affecting the esophagus, is a major global health concern with an estimated annual incidence of over 600,000 and mortality of over 500,000, making it the seventh most common malignant tumor and the sixth leading cause of cancer-related death globally (Sung et al., 2021).There are significant regional differences in the incidence of EC, which can be divided into esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma (EAC), according to the pathological type, with nearly 79% of ESCC occurring in Asian countries (Morgan et al., 2022).Although the incidence of ESCC has shown a decreasing trend in certain countries (Liang et al., 2017), ESCC continues to be a pressing public health issue on account of its increased fatality rate (Abnet et al., 2018).Majority of ESCC patients present at an advanced stage during medical consultation, and conventional surgical interventions often exhibit suboptimal effectiveness or even fail to achieve a radical resection in some cases (He et al., 2022;Pape et al., 2022), with a 5-year survival rate of less than 30% (Allemani et al., 2018).As tumor molecular biology and immune escape mechanisms are more thoroughly studied, a growing number of targeted and immune drugs are being investigated as potential treatments to prolong the survival time of patients with ESCC (Kojima et al., 2020;Costoya and Arce, 2023).
Alternative splicing (AS) is a post-transcriptional regulatory process that generates various RNA isoforms by employing diverse splicing patterns, thereby playing a pivotal role in regulating protein production, especially during developmental and differentiation processes (Yang et al., 2016;Bonnal et al., 2020).When AS is not properly regulated, it can result in the production of oncogenic isoforms, which can contribute to the growth and progression of tumors (Zhang et al., 2021).ESCC patients exhibit a high frequency of alternative splicing events (ASEs), which are associated with tumor initiation, progression, invasion, and immune evasion (Dlamini et al., 2021;Wu et al., 2021).Meanwhile, AS has potential importance in the treatment of ESCC, and several studies suggest that intervention in AS can enhance the sensitivity of ESCC cells to chemotherapy drugs (Siegfried and Karni, 2018;Sciarrillo et al., 2020).Additionally, AS has been shown to impact the efficacy of immunotherapy for ESCC by influencing the expression and presentation of tumor antigens, ultimately affecting the recognition and attack of tumor cells by immune cells (Duan et al., 2021;Wu et al., 2021).Thus, AS has important implications and value for a deeper understanding of the molecular mechanisms of ESCC and the development of therapeutic and immunotherapeutic strategies.
Currently, several databases are available that encompass ASEs, including some that cover ESCC, such as TCGASpliceSeq (Deng et al., 2021) and OncoSplicing (Zhang et al., 2022), developed based on data from The Cancer Genome Atlas (TCGA) (Tomczak et al., 2015).However, despite the inclusion of multiple cancer types, the number of ESCC cases in these databases is limited, with only 96 cases available (Yang et al., 2023).Furthermore, most of these databases primarily rely on oligo dT and poly A sequencing techniques, focusing on AS identification in protein-coding genes, with limited attention given to AS events involving long non-coding RNAs (lncRNAs).In contrast, although ESCC-specific databases, such as ESCCdb (Yang et al., 2023) and CCGD-ESCC (Peng et al., 2018), encompass a larger number of cases, they lack the specific annotation of ASEs.Considering the significant relationship between lncRNA expression and ESCC development and progression (Li et al., 2019b;Razavi and Ghorbian, 2019;Sadeghpour and Ghorbian, 2019;Aalijahan and Ghorbian, 2020;Liu et al., 2020;Ghasemzadeh and Ghorbian, 2023) and the absence of specialized AS-related databases for ESCC, we developed the Database of Alternative Splicing for Esophageal Squamous cell carcinoma (DASES) (http://www.hxdsjzx.cn/DASES),which utilizes two main sets of data.The first set consists of our inhouse total transcriptome sequencing data, derived from ESCC patients at the West China Hospital of Sichuan University.The second set is total transcriptome sequencing data from 11 published projects related to ESCC.Through the integration of known transcripts, the identification of novel lncRNAs, and the paired comparison of isoforms between tumor and adjacent normal tissues, DASES provides a comprehensive and precise catalog of ASEs in ESCC, filling a critical gap in the field and offering a valuable resource for ESCC research communities.

Data collection
DASES contains raw data from two sources.The first source includes total RNA sequencing data on both tumor and adjacent normal tissues from 63 ESCC patients in West China Hospital of Sichuan University.The second source includes publicly available total RNA sequencing data on ESCC patients from the European Bioinformatics Institute (EBI).To ensure high-quality data, we employed strict search criteria to select suitable samples from EBI (Figure 1): 1) the samples were obtained from human ESCC tissues; 2) the data included RNA sequencing; and 3) the data had sufficient information available.We excluded the cell line RNA-seq data, RNA-seq data from esophageal adenocarcinoma or other parts of the esophagus, and any data without sufficient information.It is essential to emphasize that the included datasets were not specifically targeted or enriched for circular RNA (circRNA) or small RNA during the sequencing and library preparation processes.

Data quality control and lncRNA identification
In this study, we used a series of bioinformatics tools to identify potential lncRNAs and mRNAs associated with ESCC (Figure 2).First, the raw reads obtained from RNA-seq were subjected to quality control using Trim Galore software (version 0.6.4;https:// www.bioinformatics.babraham.ac.uk/projects/trim_galore) to obtain clean reads.Then, we used STAR (version 2.7.3a) (Dobin et al., 2013) and HISAT2 software programs (version 2.2.1) (Kim et al., 2019) for sequence alignment of the clean data, resulting in the generation of BAM and SAM files for each sample, respectively.Next, we used Cufflinks (version 2.2.0) (Trapnell et al., 2010) and StringTie software programs (version 2.1.4)(Pertea et al., 2015) to assemble the BAM and SAM files, respectively, generating GTF files for each sample.We then used StringTie software to merge all the assembled GTF files, obtaining a preliminary merged GTF file.Subsequently, we utilized GffCompare software (version 0.12.2) (Pertea and Pertea, 2020) and reference transcripts to identify potential lncRNAs and mRNAs.We selected transcripts with class code "i" or "u" as potential lncRNA candidates and those with class code "=," "c," or "j" as mRNA candidates.Finally, we predicted the lncRNA candidates using CPAT (version 3.0.2) (Wang et al., 2013) and PLEK software programs (version 1.2) (Li et al., 2014), and selected those predicted as non-coding RNAs by both tools as lncRNA candidates.We merged the lncRNA and mRNA candidates to generate a comprehensive GTF file containing all potential lncRNA and mRNA candidates associated with ESCC.All the analyses were conducted using the human genome hg38 (release 84) reference provided by Ensembl (https://ensembl.org/Homo_sapiens/Info/Index).

Alternative splicing event identification
ASEs can manifest in different ways, including skipping an exon (SE), including or excluding a mutually exclusive exon (MXE), using alternative 5′ or 3′ splice sites (A5SS or A3SS), or retaining an intron (RI).To determine the occurrence of ASEs, we compared two different transcript isoforms derived from the same gene.Specifically, we performed paired comparisons between tumor and adjacent normal groups.In these comparisons, we assigned the term "included isoform" to the isoform containing exons when comparing two transcripts.Conversely, the isoform lacking exons was referred to as the "excluded isoform."The designation of the "included isoform" was based on having a shorter intron length, whereas the "excluded isoform" had a longer intron length (Figure 3).By comparing the splice junctions and exon-intron boundaries between these two isoforms, we identified and quantified the specific ASEs present in the transcriptome.
To comprehensively identify ASEs associated with ESCC, we employed rMATS software (version 3.1.0)(Shen et al., 2014) with a stringent splicing difference cutoff of 0.0001.Given the publicly available data literature reports, which indicated that all the wholetranscriptome data utilized dUTP-based library construction techniques, we considered the fr-firststrand library type during the analysis of aligned reads in BAM format.By comparing exon inclusion levels between tumor and adjacent normal groups, we detected differential ESCC-related ASEs.We only retained ASEs with a percent spliced in (PSI) (Katz et al., 2010) value greater than 0 and that were present in at least two samples.The results of splicing with only reads that span splicing junction based on GTF files were selected as the ESCC-related ASEs.Furthermore, to establish the coordinates of ASEs, we considered that each ASE comprises two transcript isoforms, with each isoform potentially containing 0-2 introns.In order to define the boundaries of ASE, we determined the minimum coordinate of the intron within the event as the starting coordinate and the maximum coordinate of the intron as the ending coordinate.

Expression quantitative analysis
For quantitative analysis of the RNA-seq data, we employed the merged GTF file as the reference annotation.The BAM files, generated from the alignment step, were subjected to subsequent analysis using Cuffnorm (version 2.2.0), which is a part of Cufflinks software.This tool allowed us to estimate the expression levels of individual isoforms, providing fragments per kilobase of transcript per million mapped reads (FPKM) values.

Potential affected the protein domain by alternative splicing
To assess the potential overlap between ASEs and protein domains, we adopted a conservative approach.We focused only on the known protein domains that intersected with ASEs, disregarding the predictions from various tools.Initially, we retrieved protein domain information from the Ensembl database, specifically focusing on the hg38 version of protein domain annotations (release 109), which includes InterPro coordinates, associated transcripts, and corresponding genes.We then mapped the InterPro coordinates onto genomic coordinates using appropriate alignment algorithms as follows: where Start genomic and End genomic represent the start and end sites of the protein domain on the genome, respectively, Start CDS refers to the first CDS start site of the corresponding transcript, and Start InterPro and End InterPro indicate the start and end sites of the protein domain on InterPro coordinates, respectively.Subsequently, we scrutinized whether there was any intersection or overlap between the genomic coordinates of protein domains and genomic coordinates of ASEs.In the cases where a protein domain exhibited any intersection or overlap with an ASE, we deemed it as having a significant overlap with the respective ASE. 3 Results

Data and database overview
Following the application of quality control measures, a total of 14 patients, corresponding to 28 samples, were eliminated from the dataset.Currently, DASES encompasses data from the in-house study and 11 publicly available studies (Table 1), comprising a total of 346 samples, with 185 distinct ESCC patients represented.We identified 257 novel lncRNAs (Figure 4A) and a total of 59,094 ASEs by using a tumor versus adjacent normal strategy, with 31,777 belonging to SE, 7,546 utilizing A5SS, 9,279 utilizing A3SS, 2,653 involving MXE, and 7,839 RI (Figure 4B).To facilitate easy access and utilization of the database, we designed a user-friendly web interface featuring various modules.The Home page provides users with a concise overview of DASES, accompanied by illustrative diagrams showcasing the five major types of ASEs (Figure 5A).The Search page offers four different search options, namely, gene, transcript, ASE ID, and genomic region, facilitating easy and efficient data retrieval (Figure 5B).The Browse page provides a comprehensive list of all ASE IDs, allowing users to narrow down their queries by applying filters based on the ASE type or study name (Figure 5C).The Genome Browser page enables users to visualize the genomic regions associated with ASEs (Figure 5D).The Download page offers convenient access to essential files, including processed GTF file and ASE-related data, which can be downloaded for further analysis (Figure 5E).Lastly, the About page serves as a valuable resource, providing a detailed pipeline overview of the entire database, along with comprehensive explanations of important interface features, including headers and abbreviations for primary tables, enabling users to fully comprehend and navigate the database with ease.

Diversified search strategies
In DASES, we present a comprehensive search system comprising four dimensions (Figure 5B).The first dimension allows users to conduct searches based on the gene ID or gene name, thereby retrieving pertinent gene information alongside details concerning gene-associated ASEs.By employing the second dimension, users can search using the transcript ID, obtaining transcript-specific information, expression levels across samples, and insights into transcript-associated ASEs.The third dimension facilitates searches based on the ASE ID, yielding ASErelated details, including the exon junction count (EJC), intron junction count (IJC), and PSI values.Finally, the fourth dimension empowers users to search by genomic coordinates, resulting in the retrieval of ASE information specific to designated genomic loci.

Genome Browser visualization
The Genome Browser in DASES comprises two distinct sections.The first section facilitates the visualization of all ASEs (Figure 5D).Within the Genome Browser page, users can utilize diverse tracks to filter ASEs based on specific criteria, including ASE types and chromosome numbers.They also have the option to display or conceal tracks associated with ASEs, transcripts, genes, and protein domains.The second section is accessible via the Search page (Figure 5F).When users conduct searches for genes, transcripts, or specific ASEs, the pertinent information is presented visually on the Genome Browser.This seamless integration of search results with the Genome Browser offers users a contextual perspective on the genomic location of these elements.

Significant association between ESCC TNM staging and ASE frequency
ASEs have been closely linked to tumorigenesis and cancer progression.To investigate whether the frequency of ASEs exhibits an association with TNM staging in ESCC, we conducted a comprehensive analysis using data from DASES.As shown in Supplementary Figures S1A, D, both the frequency of genes undergoing alternative splicing (AS-gene frequency) and the frequency of ASEs in genes exhibiting AS (ASE frequency) exhibited a substantial increase within ESCC tissues when compared to adjacent normal tissues.Furthermore, our analysis unveiled a significant trend in the correlation between ESCC TNM staging and AS-gene frequency (Supplementary Figures S1B, C), as well as ASE frequency (Supplementary Figures S1E, F).These findings underscore a compelling association between ESCC TNM staging and the frequency of ASEs, suggesting their potential relevance in the context of ESCC progression.The source of DASES facilitates the exploration of these intricate relationships, providing a valuable platform for future research in this field.

Consistency with literature findings for COL6A3 in DASES
The expression of COL6A3 in both bulk esophageal tissue and single esophageal tissue samples exhibited a relatively high level, as evidenced by data obtained from the GTEx website (https:// gtexportal.org/home/).Utilizing the search interface of DASES, we specifically queried COL6A3 (Figures 6A, B), leading to the identification of four ASEs, i.e., three SE events and one RI event (Figure 6C).Notably, our findings closely align with the observations reported by Ding, who identified three SE-type ASEs of COL6A3 from 11 samples in their study of ESCC tissues (Ding et al., 2021).Intriguingly, in addition to the ASEs reported by Ding, we discovered an additional RI-type ASE, "DASRI00000001151" (chr2: 237342162-237344349), which was not addressed by Ding.This discrepancy could be attributed to our larger sample size, which enabled us to identify more COL6A3related ASEs.Importantly, we observed statistically significant differences in the PSI values of these four ASEs between the tumor and adjacent normal tissue groups, further highlighting their potential significance in ESCC (Figure 6D).This robust consistency between our findings and those of Ding provides substantial evidence for the reliability of ESCC-related ASEs documented within DASES, thereby reinforcing their validity through corroboration with findings from other literature reports.

Discussion
In this study, we successfully constructed DASES.By integrating publicly available RNA-seq from ESCC patient tissues, DASES DASES employed a tumor versus adjacent normal strategy to identify ESCC-associated ASEs, presenting several notable advantages.First, by comparing samples within the same patient, DASES effectively highlights splicing events that are highly likely to be functionally relevant to the development and progression of ESCC, which could reduce the confounding effects of individual genetic variations or splicing differences that are unrelated to ESCC.This strategy has been demonstrated to be effective in previous studies (Xiong et al., 2015;Kahles et al., 2018).Moreover, in line with other research conclusions, the approach enables the identification of ESCC-specific ASEs that may serve as potential biomarkers or therapeutic targets as they reflect the unique molecular characteristics of ESCC (Kalsotra and Cooper, 2011;Sebestyén et al., 2016).Furthermore, by comparing splicing patterns within the same patient, DASES minimizes inter-individual variations and provides a more robust assessment of the splicing changes specifically related to ESCC, enhancing the reliability of the identified ASEs in DASES.In a word, this approach allows for a more effective, accurate, and reliable characterization of ESCCrelated AS.
DASES serves as a comprehensive resource that includes wholetranscriptome data to investigate both known ASEs linked to ESCC and novel lncRNAs, along with their associated ASEs that could potentially be implicated in ESCC.The identification of novel lncRNAs and their associated ASEs in ESCC holds great promise for advancing our understanding of the disease.Several studies have highlighted the importance of lncRNAs in cancer development and progression, including ESCC (Chen et al., 2018).These long noncoding RNAs regulate gene expression, modulate signaling pathways, and contribute to the hallmarks of cancer (Huarte, 2015).Therefore, the incorporation of lncRNA-associated ASEs in DASES provides valuable insights into the regulatory complexity underlying ESCC.Moreover, we offer a comprehensive GTF file that incorporates both the known transcriptome information and the newly discovered lncRNAs in DASES.This resource enables the in-depth exploration of the expression patterns, functional implications, and potential interactions of these newly identified lncRNAs in the context of ESCC.
We recognize that ESCC is a multifactorial disease with various pathogenic genes, including but not limited to TP53, NOTCH1, CDKN2A, and COL6A3 (Gao et al., 2014;Li et al., 2019b;Liu et al., 2022;Ko et al., 2023).Our Gene Ontology (GO) enrichment analysis of differentially expressed genes highlighted "cell adhesion" as one of the top-ranked GO terms closely linked to cancer, with COL6A3 being among the genes significantly associated with this GO term.Given these factors, we selected COL6A3 as a representative gene for demonstrating the utility of DASES.Our analysis of ASEs within COL6A3 revealed intriguing findings.Although consistent with the study conducted by Ding et al. (2021) on paired ESCC tissues for the most part, our dataset uncovered an additional RI-type ASE within COL6A3 not reported by Ding et al. (2021).This discrepancy could be attributed to our larger sample size, enabling us to capture more ASEs.This novel finding highlights the value of our database in complementing existing knowledge and uncovering potentially clinically relevant splicing events.It also underscores the significance of leveraging a comprehensive resource like DASES to complement existing studies and expand our knowledge of this complex disease.
It is important to acknowledge that DASES also has certain limitations and areas that can be further improved.First, DASES currently focuses exclusively on ESCC patient tissue data and lacks representation from other species.However, human patient tissue data remain valuable, and future versions will include data from diverse organizations to broaden its scope.Second, DASES primarily relies on whole-transcriptome data, neglecting other sequencing data types.Integrating multiple omics data types can enhance our understanding of ESCC mechanisms.Moreover, in evaluating the impact of ASEs on proteins, we only considered instances where ASEs occur directly within protein domains.However, there are other ways in which proteins can be affected, such as a frameshift occurring before a protein domain or ASEs occurring in scaffold regions, which can influence their three-dimensional structure.

FIGURE 1
FIGURE 1Overview of the screening and inclusion criteria for ESCC patient transcriptome data in the study.

FIGURE 2
FIGURE 2Workflow for the identification of lncRNAs.The left panel represents each step of the identification process, while the right panel includes the tools used and the types of input and output files for each step.

FIGURE 3
FIGURE 3Classification of alternative splicing events.

FIGURE 4
FIGURE 4 Composition of the gene type and alternative splicing event type in DASES.(A) Distribution of gene biotypes, where "Others" comprise a combination of transcribed unitary pseudogene, transcribed processed pseudogene, miRNA, TEC, unprocessed pseudogene, and IG C gene.(B) Distribution of alternative splicing event types.

FIGURE 5
FIGURE 5 Overview of DASES module interfaces.(A) Home interface, (B) Search interface, (C) Browse interface, (D) Genome Browser, (E) Download interface, and (F) Genome Browser after performing a search.

FIGURE 6
FIGURE 6Consistency between DASES results and literature findings for COL6A3.This figure showcases the validation process of DASES by employing an example search for the highly expressed COL6A3 in esophageal tissue.(A) Process of searching for COL6A3 using the search interface.The search results, including relevant descriptions of COL6A3 in (B) and detailed information on the identified alternative splicing events that are related to COL6A3 in (C).Discovery of associated alternative splicing events that are consistent with literature reports(Ding et al., 2021).(D) Disparity in percent spliced in values for these four alternative splicing events between the tumor and adjacent normal tissue groups, as determined by the Mann-Whitney test.*p-value < 0.05.

TABLE 1
Information on whole-transcriptome data in ESCC patients from one in-house study and 11 publicly available studies.
a The number of samples on tumor tissues versus the number of samples on adjacent normal tissues for ESCC patients in each study.