Identification of a 3-Gene Prognostic Index for Papillary Thyroid Carcinoma

The accurate determination of the risk of cancer recurrence is a critical unmet need in managing thyroid cancer (TC). Although numerous studies have successfully demonstrated the use of high throughput molecular diagnostics in TC prediction, it has not been successfully applied in routine clinical use, particularly in Chinese patients. In our study, we objective to screen for characteristic genes specific to PTC and establish an accurate model for diagnosis and prognostic evaluation of PTC. We screen the differentially expressed genes by Python 3.6 in The Cancer Genome Atlas (TCGA) database. We discovered a three-gene signature Gap junction protein beta 4 (GJB4), Ripply transcriptional repressor 3 (RIPPLY3), and Adrenoceptor alpha 1B (ADRA1B) that had a statistically significant difference. Then we used Gene Expression Omnibus (GEO) database to establish a diagnostic and prognostic model to verify the three-gene signature. For experimental validation, immunohistochemistry in tissue microarrays showed that thyroid samples’ proteins expressed by this three-gene are differentially expressed. Our protocol discovered a robust three-gene signature that can distinguish prognosis, which will have daily clinical application.

molecular analysis using a panel of genetic alterations would overcome the limitation of FNA diagnosis (Muzza et al., 2020). Molecular markers have become a potential tool for TC management to distinguish benign from malignant lesions, predict aggressive biology, prognosis, recurrence, and identify novel therapeutic targets (Nylén et al., 2020).
Recently, with the development of genome sequencing technologies, more and more accumulating evidence has revealed that tumour biomarkers, including protein-coding genes, non-coding RNAs and immune genes, are informative for cancer detection and prognosis classification (Zhong et al., 2020;Gan et al., 2021). mRNAs have a great potential in physiological and pathological processes and predict the prognosis of various types of tumour patients (Feng et al., 2019;Tschirdewahn et al., 2019). Therefore, the dysregulated expression or mutation of RNA may be a promising predictor of poor prognosis in PTC. Thus, mRNAs' dysregulated expression or mutation may be a promising predictor of poor prognosis in PTC.
The accurate determination of cancer diagnosis and treatment risk is a significant unmet need in PTC management. Patients and physicians must weigh the benefits of currently available therapies against the potential morbidity of these treatments. Herein we screen for characteristic genes specific to PTC and establish and validate an accurate model for PTC diagnosis and prognostic evaluation.

Patients and Tissue Samples
Tissue microarrays (TMA) of human TC (IWLT-N-58T53 TC-1503) involved in this experiment and research were purchased from Wuhan Aiwei Biological Technology Co. LTD., along with the detailed clinical information. It included 29 cases of PTC and 29 cases of para-cancer tissue. Of the 29 patients, 21 were female (aged 24-66 years), and eight were male (aged 27-60 years).

Gene-Expression Data Sets
The gene expression and clinical data used for modelling were derived from TCGA (http://www.cbioportal.org/datasets), which contained gene expression data from 568 samples and clinical information from 516 samples. From The Cancer Genome Atlas (TCGA) database, clinical information was screened via the Cancer Type Detailed PTC parameter, of which a total of 399 samples were found. In these 399 PTC patients, 395 cases had RNA-seq data, of which 52 had para-cancer tissue data creating a total number of 447 RNA-seq data points. The gene expression data used to validate our model came from GSE27155 (Giordano et al., 2005;Giordano et al., 2006) of the GEO database (https:// www.ncbi.nlm.nih.gov/geo/). The differentially expressed (DE) mRNAs between normal and PTC samples were assessed using the R Studio. software program (RStudio version 1.1.463; http:// www.rproject.org), and the R package, Limma. log2FC (fold change) > 2 and p-value < 0.05 were considered for subsequent analyses (Gong et al., 2019). The project/collection had a total of 99 samples: four from normal patients, 10 cases of follicular adenomas, 13 cases of follicular thyroid carcinomas, 7 cases of eosinophilic thyroid adenomas, 8 cases of thyroid carcinomas, 51 cases of PTC, 4 cases of anaplastic thyroid carcinomas, and 2 cases of medullary thyroid carcinomas. In this study, we selected the 4 cases of normal patients and 51 cases of PTC to analyze.

Feature Selection Methods
Python 3.6 was utilized to screen TCGA expression data for DE genes. The processing steps were as follows: Delete genes with an average expression value of less than 10 reads, which are considered to be genes of no research value in survival differences. Judge whether or not there is a significant difference at p < 0.05 between the two comparison groups using the SciPy package (https://www.scipy.org/) to perform t-test on the different study groups. Calculate the fold change value difference between groups by taking the mean value of different groupings.
To find genes for use in modelling, we screened for characteristic genes that significantly affected PTC survival. The R 3.6 software was used to perform a univariate cox regression analysis between DE genes and clinical data (time, status) in 395 patients (Gill, 1992). Genes with a hazard ratio (HR) greater than 1, or less than 1, and a Wald test p-value of less than 0.05 were genes that significantly affected PTC survival. Therefore, selected these genes as characteristic genes for use in establishing a diagnostic model. We summarize the selection process in Figure 1.

Establishing the Diagnosis and Prognosis Model
This study used the sklearn package (http://scikit-learn.org) provided in Python 3.6 to establish a Support vector machine (SVM) model to differentiate between cancer and non-cancer. Use the SVM classifier model to explore the optimal three-gene signature prognosis model. Based on the univariate Cox regression analysis of the selected characteristic genes, we established a prognostic model to calculate a patient's prognosis by calculating their 'RiskScore' (Xiong et al., 2017). According to a set threshold (HR > 1 or HR < 1, p < 0.05), threegene ( Table 2) were found to be significantly associated with overall survival.
To test the diagnostic predictive power of the three-gene signature that we selected, we randomized TCGA PTC patients into a training set (312 samples, 70%) and a test set (135 samples, 30%). The training set was used for 10% crossvalidation. The optimal parameters of the final model were ("C": 1, "gamma": 1,000, "kernel": "rbf"), with the final average accuracy being 0.9263 (Standard Deviation: ± 0.0117). The average accuracy of our best model using the training set was 0.9679. To verify the effectiveness of this model, we used the best model predictions that gave an average accuracy of 0.9259. In addition, to verify the diagnostic predictive power of our threegene signature, we also used the three-gene in GSE27155 and established an SVM model in the same way. Due to the small number of negative samples, we chose to use the 3-fold cross-validation and pre-determined optimal parameters of the model ("C": 15, "gamma": 1, "kernel": "rbf"). This gave an accuracy of 0.9464 (SD: ± 0.0430).

Microarray Preprocessing
Briefly, after deparaffinization in xylene and rehydration with graded concentrations of alcohol to distilled water, the TMA slides were washed in Tris-buffered saline with 0.1% Tween 20 (TBST), the slides were incubated with the primary antibody against GJB4 (1:10, Abcam, A9888), RIPPLY3 (1:75, Sigma-Aldrich, HPA055541), ADRA1B (1:50, Abcam, ab84405) at 4°C overnight. After washing three times in TBST, the specifically bound secondary antibody was detected with the DAKO EnVision detection System (Dako Diagnostics, Switzerland). Immunostaining scores were independently performed by two experienced pathologists who did not know the patient's clinical pathology data and the immediate clinical outcome. The staining intensity was scored as negative (1), weak (2), moderate (3) or strong (4). The staining extent was scored as 1 (≤10%), 2 (11-50%), 3 (51-75%) or 4 (>75%). A total expression score was calculated by multiplying the staining intensity score with that of the staining extent. ≤ 8 points were considered as low expression. Otherwise, it is considered as a high expression. Histological classification of the samples, stained with hematoxylin and eosin, was performed by two independent clinical pathologists.

Functional Enrichment Gene Ontology Analysis
GO functional annotation pathway enrichments were performed in R using the "clusterProfiler" package, and P adjusted (FDR) < 0.05 was statistically significant.

Construction of the Protein-Protein Interaction Network
The DE mRNA were imported into the STRING database (https://string-db.org/) (UniProt Consortium, 2010) to construct a PPI network. The network analysis plug-in in Cytoscape software was used to analyze network topological features to screen the hub nodes in the PPI network (Saito et al., 2012). Degree centrality denotes several direct connections of a node to all other nodes in the network.

Data Analysis
Statistical analysis was performed using R 3.4.0 (https://www.rproject.org/), Python 3.6 (https://www.python.org/) and Graphpad Prism version 7.0 (GraphPad Software). A twotailed Student's t-test was used for comparisons between two independent groups. This study used the sklearn package (http://scikit-learn.org) provided in Python 3.6 to establish an SVM model to differentiate between cancer and non-cancer. All statistical analyses were two-sided. p < 0.05 was defined as indicating statistical significance (Ge et al., 2019).

Bioinformatics Analysis Was Used to Screen Differentially Expressed Genes
A total of 20,531 genes were screened from TCGA. After deleting genes with a mean expression of less than 10, 15,370 genes remained. The number of DE genes between tumour tissue and adjacent tissues was 762, of which 545 genes were upregulated, and 217 genes were down- regulated. The expression values of the DE genes were converted by log10 and displayed by heatmap. As seen from the heatmap (Figure 2), there is a significant difference between the tumour and the normal tissue, indicating that the results identified in this study are credible. The top 10 significantly upregulated and the top 10 significantly downregulated DE mRNA are displayed in Table 1.

Cox Regression Analysis Was Used to Screen Characteristic Genes
Through univariate Cox regression analysis, a hazard ratio was calculated for each gene according to the set threshold (HR > 1 or HR < 1, p < 0.05), To screen-specific biomarkers with accurate diagnostic ability. There were three genes found to be significantly related to overall survival ( Table 2). These genes were GJB4, RIPPLY3, and ADRA1B. These three genes will be used as feature genes for subsequent modelling, and the specific information of genes is shown in Table 4.
FIGURE 2 | Heatmap of significantly differentially expressed genes. Each row represents a separate gene, each column represents a separate sample, a gradient from green to red indicates a low to high level of expression, and the samples are clustered from two types of tissue: normal tissue (green) and cancer tissue (red). To test the prognostic prediction ability of the screened threegene signature, we calculated the Risk-Score of each patient in the TCGA training set by using the established prognostic model. The risk assessment score formula was as follows: risk score= (-0.057×expression value of GJB4) + (−0.021×expression value of ADRA1B) + (-0.110×expression value of RIPPLY3). Table 3 Then the patients were ranked according to risk-score, and the risk-score median (−0.7766) was taken as the threshold. The 312 patients were divided into two groups, 156 patients with low risk and 156 patients with high risk. Of these 312 persons, 280 had survival information (OS), of which 156 were low risk, and 124 were high risk. Kaplan-Meier survival curves and ROC curves were used to examine the predictive power of three-gene biomarkers. Kaplan-Meier survival curves showed that the survival rate in the high-risk group was significantly lower than that in the low-risk group (log-rank p value = 0.038), Figure 3. The ROC curve was shown in Figure 4, and the AUC value was 0.7513, indicating that the three-gene signatures had a certain prognostic predictive ability. To further verify the prognostic capability of the three-gene signatures, we calculated the risk score of the patients in the test set and the PTC patient data set of the whole TCGA respectively and divided the patients into high-risk and lowrisk patients' groups according to the same threshold (−0.7766), Table 4. The Kaplan-Meier survival curves for both data sets showed significantly lower survival rates in the high-risk group than in the low-risk group (log-rank p value = 0.017 and 0.0022), with AUC values of 0.9023 and 0.7910, respectively. The training set and the results of two confirmations showed that the threegene biomarkers screened had a strong prognostic capability.

Immunohistochemical Verification Results of Characteristic Gene Tissue Microarray
To further verify the protein expression level of three-gene signatures in PTC tissue and analyze its relationship with clinicopathological features, immunohistochemistry was used to detect the protein expression level of tri-factor in 29 PTC tissue chips. The results of this study showed that the protein   Frontiers in Molecular Biosciences | www.frontiersin.org March 2022 | Volume 9 | Article 807931 5 expression levels of GJB4 and ADRA1B in cancer cells were significantly higher than those in paired para-cancer cells (p < 0.05) (Figure 5), and the protein expression level of RIPPLY3 was not significant difference between the cancer cells and the paracancer cells.

GO Enrichment Analyses and Construction of the PPI Network
Functional enrichment GO analyses were performed to investigate the underlying mechanisms of the DE mRNA genes' prognostic effects. Our results demonstrate that the DE mRNA genes are linked with activating pathways, such as tumour development and progression regulation, based on GO analysis of three cohorts ( Figure 6). PPI network reveal that the potential connection between key mRNA genes ( Figure 7).

DISCUSSION
In our study, the results above clearly demonstrate that the threegene signatures we screened for and selected have a strong ability to distinguish between cancerous and non-cancerous samples. The genes signature include GJB4, RIPPLY3, and ADRA1B. It was reported that GJB4 and ADRA1B genes play an essential role in developing many malignant tumours. It has been shown that GJB4 is involved in tumorigenesis and may act as a tumour promoter, Wang et al. (Wang et al., 2019) indicated that miR-492 promoted cancer progression by targeting GJB4 and was a novel FIGURE 5 | The immunohistochemical results of GJB4, RIPPLY3, and ADRA1B characteristic gene tissue chips indicated that the expression in tumor tissues was significantly higher than that in adjacent tissues.
Frontiers in Molecular Biosciences | www.frontiersin.org March 2022 | Volume 9 | Article 807931 6 biomarker for bladder cancer. Liu et al. (Liu et al., 2019) showed that GJB4 was highly expressed in gastric cancer tissues and cells, the high expression of GJB4 was significantly correlated with the overall survival of gastric cancer patients, and the cell proliferation and migration of gastric cancer cells were significantly inhibited by knockout GJB4. At the same time, targeting GJB4 may be exploited as a modality for improving lung cancer therapy had been proved (Lin et al., 2019). The ADRA1B gene is a member of the adrenergic receptor alpha 1 (ADRA1) subfamily, which also includes ADRA1A and ADRA1D, and has been shown to promote the development of cancer in the epinephrine cell pathway. Adrenergic receptor  antagonists have also been shown to be useful in the treatment of various types of cancer, including prostate and breast cancer (Freudenberger et al., 2006;Harris et al., 2007). In present, no studies have been reported on the relationship between GJB4 and ADRA1B gene in TC. However, our study demonstrated that the GJB4 and ADRA1B genes may not be able to promote the thyroid cancer progression and development, and it may even be a protective gene.
Although no direct studies have proved that RIPPLY3 (also known as DSCR6) is closely related to PTC or other malignant tumours, current studies have found that the RIPPLY3 gene plays a role in developing the pharynx and its derivatives in vertebrates (Tsuchiya et al., 2018). Li et al. (Li et al., 2013) showed that RIPPLY3 is closely associated with Down syndrome (DS). Studies have found that people with DS have an increased risk of thyroid disease (mainly autoimmune), with a lifetime prevalence of between 13 and 63% (AlAaraj et al., 2019). These results suggest that RIPPLY3 may affect the development of the thyroid gland, and its abnormal expression may lead to the occurrence and development of PTC. The GO enrichment analysis revealed that the chief pathways regulated the cellmolecular function and the enzyme activity. Previous studies have demonstrated the gene effect on thyroid cell function and cell morphology (Yu et al., 2017;Rudzińska et al., 2019).
At present, the preoperative diagnosis of PTC is still mainly FNA. However, according to the Bethesda grading standard, the proportion of FNA diagnosis results is suspicious or uncertain is 3-18% (Misiakos et al., 2016). In the era of precision therapy, we need an accurate diagnosis, which requires an accurate prognosis. To find prognostic molecular markers of PTC, this study obtained the gene expression characteristic of tumor prognosis through TCGA to screen characteristic genes and carry out an effective risk assessment of tumour prognosis.
However, our study has some limitations, the most important one is the limited number of patients in our database group confined to the limitation to TCGA、GEO. There is still a lack of large sample data sets and clinical samples to verify the accuracy of the three-signature prognosis model. Also, we should further investigate the correlation between the three gene expression levels and the clinicopathological features. Fortunately, gene sequencing technology is gradually maturing and becoming faster and less expensive. We will continue to collect cases of TC tissue to verify our signature further. Furthermore, although we believe that the three-gene signature is promising in selecting patients who will benefit from the three-gene prognostic model, its significant value still needs to be verified in prospective studies.

CONCLUSION
This study screened for DE genes (GJB4, RIPPLY3, ADRA1B) that were significantly related to the diagnosis and prognosis of PTC. The three-gene diagnostic model could accurately predict the occurrence of PTC and guide prognosis.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary materials, further inquiries can be directed to the corresponding authors.