OSluca: An Interactive Web Server to Evaluate Prognostic Biomarkers for Lung Cancer

Lung cancer is the leading cause of cancer-related incidence and mortality worldwide. Various studies have identified potential prognostic biomarkers for cancer patients based on gene expression profiles. However, most of these reported biomarkers lack independent validation in multiple cohorts. Herein, we collected 35 datasets with long-term follow-up clinical information from TCGA (2 cohorts), GEO (32 cohorts), and the Roepman study (1 cohort), and developed a web server named OSluca (Online consensus Survival for Lung Cancer) to assess the prognostic value of genes in lung cancer. The input of OSluca is an official gene symbol, and the output web page displays a survival analysis summary with a forest plot and a survival table from Cox proportional hazards regression in each cohort and in the combined cohorts. To test the performance of OSluca, 104 previously reported prognostic biomarkers in lung carcinoma were evaluated in OSluca. In conclusion, OSluca is a valuable and interactive prognostic web server for lung cancer. It can be accessed at http://bioinfo.henu.edu.cn/LUCA/LUCAList.jsp.


INTRODUCTION
Lung cancer (LUCA) is an aggressive disease with leading mortality and incidence in the world. Based on histology, there are two main types of LUCA: non-small cell lung cancer (NSCLC), which accounts for 80% of LUCA, and small cell lung cancer (SCLC), which accounts for approximately 20% of LUCA (Raponi et al., 2006; Bray et al., 2018). NSCLC can be further subdivided into four subtypes: adenocarcinoma, squamous cell carcinoma, large cell carcinoma, and bronchioloalveolar carcinoma (Ramalingam et al., 2011). Classical histological subtypes play a dominant role in the treatment and prognosis of lung cancer. More recently, reclassification of lung cancer based on tumor biomarkers has improved lung cancer therapy (Beer et al., 2002; Hoadley et al., 2018).
Many studies have demonstrated that clinically associated prognostic biomarkers can assist the characterization of cancer subtypes and provide new insights into cancer recurrence and patient response to more precise therapies (Meyerson and Carbone, 2005; Bild et al., 2006; Raponi et al., 2006). Numerous single or multiple prognostic biomarkers have been identified using high-throughput profiling methods (Raponi et al., 2006). By mining the large volume of profiling data deposited in public databases, meta-analyses have uncovered potential prognostic genes, such as KRT8 (Xie et al., 2019a). However, for biologists and clinicians, it is technically difficult to analyze these massive public datasets to screen and develop prognostic biomarkers. Previously, we built several web servers for prognostic biomarker analysis in breast cancer, esophageal carcinoma, and other cancers (Wang et al., 2019a,b,c, 2020; Xie et al., 2019b,c; Yan et al., 2019; Zhang et al., 2019, 2020; Dong et al., 2020). In the current study, we integrated a large collection of RNA expression profiles of lung cancer with clinical survival information, mainly from the TCGA (The Cancer Genome Atlas) and GEO (Gene Expression Omnibus) databases, and built a prognostic analysis web server named OSluca (Online consensus Survival for Lung Cancer) to analyze and evaluate the prognostic potency of genes in 35 independent lung cancer cohorts.

MATERIALS AND METHODS
Collection of Lung Cancer Datasets
The lung cancer cohorts for OSluca with expression profiling and clinical follow-up data were collected from PubMed, TCGA¹, and GEO² by searching the keywords "lung" AND "cancer" AND "survival" (Table 1). A dataset was included in OSluca if it met the following criteria: (1) RNA sequencing or gene microarray data were available; (2) complete follow-up data, such as overall survival and status, were available (Liu et al., 2018); (3) all data were specific to lung cancer, not to secondary or metastatic lung tumors originating from other tumor types; (4) the cohort contained no fewer than 30 cases. The primary clinicopathological characteristics of the lung cancer patients are listed in Table 1.
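The four inclusion criteria above can be read as a simple filter over candidate cohorts. The following Python sketch is only an illustration of that reading; the `Cohort` fields, the cohort names, and the example values are hypothetical and do not come from the OSluca data.

```python
from dataclasses import dataclass

@dataclass
class Cohort:
    name: str              # hypothetical identifier, not a real accession
    has_expression: bool   # criterion 1: RNA-seq or microarray data available
    has_followup: bool     # criterion 2: survival time and status available
    primary_lung: bool     # criterion 3: lung cancer, not metastases from elsewhere
    n_cases: int           # criterion 4: cohort size

MIN_CASES = 30  # "no fewer than 30 cases"

def meets_criteria(c: Cohort) -> bool:
    """Return True only if the cohort satisfies all four criteria."""
    return (c.has_expression and c.has_followup
            and c.primary_lung and c.n_cases >= MIN_CASES)

# Made-up candidates, each failing at most one criterion
candidates = [
    Cohort("cohort_A", True, True, True, 120),
    Cohort("cohort_B", True, False, True, 85),   # no follow-up data
    Cohort("cohort_C", True, True, True, 29),    # too small
    Cohort("cohort_D", True, True, False, 60),   # not primary lung cancer
    Cohort("cohort_E", True, True, True, 30),    # exactly at the size limit
]
included = [c.name for c in candidates if meets_criteria(c)]
print(included)
```

With these made-up candidates, only `cohort_A` and `cohort_E` pass, since a cohort of exactly 30 cases satisfies the size criterion.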

Construction of OSluca Web Server
OSluca was built on a Tomcat server as previously described, with minor modifications (Wang et al., 2019b,c; Xie et al., 2019b,c; Yan et al., 2019; Zhang et al., 2019). Briefly, a front-end application was used for entering queries and displaying the results. Java and R were used to process requests and output the results. Expression profiles and clinical information were stored in a SQL Server database. The prognostic significance of the queried gene is determined by analyzing the association between gene expression and survival time using the R package "survival." In addition, a genome-wide precalculation of Cox proportional hazards regression was performed for all human genes, so that the home page of OSluca can display the survival analysis summary, with a forest plot and a table of Cox regression results for the queried gene in all cohorts, giving the P-value and HR [95% confidence interval (CI)] at the built-in upper 25% cutoff. The R package "forestplot" was used to produce the forest plot for the queried gene in the OSluca web server.
¹ https://cancergenome.nih.gov/
² www.ncbi.nlm.nih.gov/geo/
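To illustrate the kind of analysis described in this section, the sketch below dichotomizes expression at an upper 25% cutoff and fits a one-covariate Cox model in plain Python. It is not the OSluca implementation, which relies on the R package "survival"; the function names, the nearest-rank quantile rule, the Breslow handling of ties, the Wald-type confidence interval, and the toy data are all assumptions made for this example.

```python
import math

def upper_quartile_split(expr):
    """Label a sample 1 (high) if its expression exceeds the 75th percentile."""
    ordered = sorted(expr)
    # Nearest-rank percentile; the server's exact quantile rule is not stated.
    cutoff = ordered[math.ceil(0.75 * len(ordered)) - 1]
    return [1 if v > cutoff else 0 for v in expr], cutoff

def cox_binary(times, events, groups, iters=25):
    """One-covariate Cox model (Breslow ties) fitted by Newton-Raphson.
    Returns (HR, lower 95% bound, upper 95% bound) for high vs. low."""
    beta = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for _ in range(iters):
        score, info = 0.0, 0.0
        for t in event_times:
            # Risk set: subjects still under observation at time t
            n0 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 0)
            n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)
            d = sum(1 for ti, e in zip(times, events) if ti == t and e)
            d1 = sum(1 for ti, e, g in zip(times, events, groups)
                     if ti == t and e and g == 1)
            w1 = n1 * math.exp(beta)
            p = w1 / (n0 + w1)        # expected share of events in the high group
            score += d1 - d * p       # first derivative of the partial log-likelihood
            info += d * p * (1 - p)   # negative second derivative
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    se = 1 / math.sqrt(info)
    return math.exp(beta), math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)

# Toy data (made up): expression, follow-up in months, event flag (1 = death)
expr   = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
times  = [30, 8, 45, 50, 22, 60, 35, 55, 40, 5, 12, 26]
events = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]

groups, cutoff = upper_quartile_split(expr)
hr, lo, hi = cox_binary(times, events, groups)
print(f"cutoff={cutoff}, HR={hr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

In this toy example the three samples above the cutoff die early, so the fitted HR is greater than 1. A production analysis would normally use an established package rather than a hand-written optimizer, which does not guard against cases where no finite estimate exists.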

Validation of Previously Reported Prognostic Biomarkers of Lung Cancer in OSluca
The keywords "lung cancer," "survival," "biomarker," and "prognosis" were used to search for lung cancer biomarkers in NCBI PubMed. We obtained 104 prognostic biomarkers using the following criteria (Table 2): (1) the biomarker was detected by immunohistochemistry (IHC) or qRT-PCR (qPCR) in primary cancer tissue; (2) there was a significant association between the biomarker and survival; (3) the sample size was above 50 cases; (4) the study was published in English with full-text access.

Statistical Analysis
The association between lung cancer clinical factors and survival outcomes was analyzed with GraphPad Prism 8.0. The Cox proportional hazards regression and Kaplan-Meier plot functions from the R package "survival" were used in OSluca to determine the association between gene expression and survival. P ≤ 0.05 was considered statistically significant.
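For readers unfamiliar with the Kaplan-Meier estimate mentioned above, the following minimal Python sketch shows the product-limit calculation on made-up data. It is an illustration only, not the code used in OSluca, and the function name and toy values are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)   # still under observation
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        surv *= 1 - deaths / at_risk                  # product-limit update
        curve.append((t, surv))
    return curve

# Toy follow-up times (months) and event flags (1 = death, 0 = censored)
times = [6, 6, 10, 15, 20, 24]
events = [1, 0, 1, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored patients (event flag 0) leave the risk set without lowering the curve, which is why the estimate drops only at the four observed event times.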

RESULTS
Clinical Characteristics of Lung Cancer Patients in OSluca
To develop an online survival web server for lung cancer, we collected 35 published high-throughput profiling datasets of lung cancer with long-term follow-up information (2 TCGA datasets, 32 GEO datasets, and 1 Roepman dataset). TCGA comprises 513 lung adenocarcinoma cases and 499 squamous cell carcinoma cases (Tables 1, 2). The GEO cohorts and the Roepman cohort contained more than 4,000 samples and 172 samples, respectively, as shown in Table 2. Of the patients, 4,901 have OS (overall survival) data, 2,176 have DSS (disease-specific survival) data, 2,075 have PFI (progression-free interval or recurrence-free survival) data, and 608 have DFI (disease-free interval) data. Patients with lung adenocarcinoma survived significantly longer than those with other histological types of lung cancer, and small cell lung cancer was associated with the worst prognosis (Figure 1A). Other clinical characteristics also markedly affected prognosis, such as gender (P < 0.0001), stage (P < 0.0001), p-TNM stage (P < 0.0001), and smoking status (P < 0.0001) (Figures 1B-E). These risk factors also influenced other survival endpoints, such as PFI (data not shown). These results are in accordance with previous studies (Mao et al., 2016; Bray et al., 2018).

Construction and Usage of Prognostic Web Server OSluca
OSluca includes a set of optional clinicopathological factors, such as age, sex, histological type, grade, and smoking status. Four survival endpoints can be selected based on the original patient outcomes: OS, DSS, DFI, and PFI (Liu et al., 2018). To let the user see the prognostic effect of a gene of interest at a glance, a meta-analysis summarizes the prognostic value of each gene on the home page of OSluca. Briefly, after the user types the official gene symbol into the input box on the home page, OSluca displays the survival analysis summary with a forest plot and a table from Cox proportional hazards regression in each cohort and in the combined cohorts (all datasets combined). Taking the tumor suppressor gene TP53 (tumor protein p53) as an example, the user types "TP53" into the gene symbol box and clicks "Survival analysis" (Figure 2A, left). The meta-analysis results, with a forest plot and a survival table for TP53, display the P-value and HR with 95% CI for each cohort and for the combined cohorts (Figure 2A, right). The user can then obtain KM plots of individual cohorts, such as the GSE30219 dataset, by clicking the "Go" button in the survival table (Figure 2B). It is also possible to analyze a subgroup of a given cohort to obtain specific prognostic information using selectable factors, such as cutoff value, histological type, and grade. In short, OSluca outputs a forest plot and a survival table, together with KM plots and P-values, to measure the association between the investigated gene and survival.

Validation of Previously Reported Lung Cancer Prognostic Biomarkers in OSluca
A search for lung cancer biomarkers was performed in NCBI PubMed using the keywords "lung cancer," "survival," "biomarker," and "prognosis." In total, we collected 104 published lung cancer prognostic biomarkers verified by IHC or qPCR (Supplementary Table S1) to evaluate the performance of OSluca. For example, Hsu et al. reported that ERO1L (ERO1-like protein alpha, also named ERO1A) is significantly overexpressed in tumor tissue and could serve as a biomarker of poor prognosis in lung adenocarcinoma (Hsu et al., 2016). The prognostic analysis of ERO1L in OSluca showed that high expression of ERO1L is significantly associated with poor outcome in eight of nine cohorts (the Top 9 cohorts, each with a sample size above 150 cases) (Figures 3A-H), the exception being the Roepman dataset (Figure 3I). Next, each published biomarker was investigated in the Top 9 cohorts in OSluca, and approximately 66% of the biomarkers (69/104) were consistent with the original published findings (Supplementary Table S1). Meanwhile, the outcome meta-analysis in OSluca showed that 14% (14/104) of the published prognostic genes have prognostic values similar to those reported in the literature in one or more OSluca cohorts but show the opposite outcome in other OSluca cohorts (Supplementary Table S1). These genes, such as DDIT3, need further investigation (Supplementary Figure S2 and Supplementary Table S1). In contrast, some prognostic biomarkers showed different outcomes in OSluca than in previous findings: 9% of the published prognostic genes (9/104) showed opposite results between OSluca and the literature (Supplementary Table S1), suggesting that these genes need further validation.
For example, the transcription factor KLF15 (Krüppel-like factor 15) was reported to be expressed at higher levels in tumor tissue than in adjacent non-tumor tissue and to play an important role in promoting proliferation and carcinoma diversification in lung adenocarcinoma, associated with poor prognostic outcome. Unexpectedly, in OSluca, patients with high expression of KLF15 had better survival than those with low expression (Supplementary Table S1 and Supplementary Figure S1). The OSluca result for KLF15 was consistent with other prognostic analysis tools (Győrffy et al., 2013; Anaya, 2016), such as the KM plotter [P < 0.001, HR (95% CI) = 0.4 (0.28-0.58)]. In addition, the remaining 12 of the 104 previously published prognostic biomarkers (11%) were not significant in the Top 9 cohorts in OSluca, but 8 of them (8/12) were significant in one or more datasets other than the Top 9 cohorts (data not shown). All in all, OSluca is an interactive and free web server for researchers to develop potential prognostic biomarkers for lung cancer.
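The KM plotter figures quoted above for KLF15 can be checked for internal consistency with a few lines of arithmetic. The sketch below assumes the reported interval is a Wald-type 95% CI that is symmetric on the log scale, which the text does not state, and the variable names are chosen for this example only.

```python
import math

# Values quoted in the text for KLF15 in the KM plotter
hr, lo, hi = 0.4, 0.28, 0.58

# Assumption: Wald-type interval, log(HR) +/- 1.96 * SE
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)        # SE of log(HR)
center = math.exp((math.log(lo) + math.log(hi)) / 2)   # HR implied by the CI midpoint
z = math.log(hr) / se                                  # Wald z statistic
p_two_sided = math.erfc(abs(z) / math.sqrt(2))         # two-sided normal P-value

print(round(center, 2), p_two_sided < 0.001)
```

Under that assumption, the midpoint of the interval on the log scale rounds to the reported HR of 0.4 and the implied two-sided P-value falls below 0.001, so the quoted HR, CI, and P-value agree with one another.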

DISCUSSION
Owing to tumor molecular heterogeneity, the prognosis of lung cancer patients is variable and difficult to predict. The prognosis of patients with lung cancer has been shown to depend strongly on clinical factors, such as histological type and smoking status. Nevertheless, there is a pressing need to discover novel prognostic biomarkers for determining the risk of cancerous lesions and predicting lung cancer patient outcomes by all available means, especially high-throughput sequencing technologies. One major challenge for non-bioinformatics researchers is how to integrate high-dimensional profiling datasets of lung cancer and discover new biomarkers to potentially guide prognostic stratification. Previous studies have shown that online prognostic web servers for cancer (Elfilali et al., 2006; Mizuno et al., 2009; Goswami and Nakshatri, 2013; Győrffy et al., 2013; Tang et al., 2017) can substantially help researchers discover potential biomarkers (Zheng et al., 2020). Herein, we developed a free web server, OSluca, to assess the prognostic value of a gene of interest in multiple lung cancer cohorts. In OSluca, all lung cancer cases originated in the lung and are not secondary cancers from other cancers or organs. As a result, the prognostic specificity applies only to lung cancer; the prognostic significance of a gene in other cancer types remains to be determined. To assess the reproducibility of previously reported prognostic biomarkers in OSluca, we collected 104 previously published prognostic biomarkers of lung cancer identified by qPCR or IHC and tested their prognostic significance in OSluca. Most of the biomarkers were verified in OSluca, confirming the published findings. Nevertheless, some genes showed prognostic outcomes that differed from the previous literature. The advantage of OSluca over other online prognostic web servers is that its lung cancer sample size is large and dozens of independent cohorts are available, which is extremely valuable for identifying and validating cancer prognostic biomarkers, since the most important step in biomarker development is independent validation across different datasets and cohorts. A limitation of the current study is that OSluca can only test a single gene at a time for outcome analysis. In summary, OSluca is a free web server for non-bioinformatics researchers to study potential lung cancer prognostic biomarkers, accessible at http://bioinfo.henu.edu.cn/LUCA/LUCAList.jsp.

[Figure legend fragment] The histological type of all the above cohorts is lung adenocarcinoma. ERO1L, ERO1-like protein alpha (also named ERO1A).
Frontiers in Genetics | www.frontiersin.org

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in TCGA, NCBI GEO, and the Roepman dataset.

AUTHOR CONTRIBUTIONS
XG: research design. QW and XG: establishment of the OSluca web server. ZY, ZL, and XS: processing of RNA sequencing and clinical data of lung cancer. ZY, LX, XS, LZ, YL, and XG: drafting of the manuscript. YD, XS, LZ, PS, YL, TX, and JM: collection of previously reported biomarkers of lung cancer. ZY, LX, LZ, WZ, YZ, and XG: critical revision of the manuscript.

FUNDING
This study was supported by the following funding: The Kaifeng Science and Technology Major Project (18ZD008), the National Natural Science Foundation of China (Nos. 81602362 and 81801569), the Program for Science and Technology Development in Henan Province (Nos. 162102310391, 172102210187, and 192102310302), the Program for Young Key Teacher of Henan Province (2016GGJS-214), the supporting grants of Henan University (Nos. 2015YBZR048 and B2015151), and the Yellow River Scholar Program (No. H2016012).