Identification of Key Functional Gene Signatures Indicative of Dedifferentiation in Papillary Thyroid Cancer

Background: Differentiated thyroid cancer (DTC) is the most common type of thyroid cancer. Many of them can relapse to dedifferentiated thyroid cancer (DDTC) and exhibit different gene expression profiles. The underlying mechanism of dedifferentiation and the involved genes or pathways remained to be investigated. Methods: A discovery cohort obtained from patients who received surgical resection in the Fudan University Shanghai Cancer Center (FUSCC) and two validation cohorts derived from Gene Expression Omnibus (GEO) database were used to screen out differentially expressed genes in the dedifferentiation process. Weighted gene co-expression network analysis (WGCNA) was constructed to identify modules highly related to differentiation. Gene Set Enrichment Analysis (GSEA) was used to identify pathways related to differentiation, and all differentially expressed genes were grouped by function based on the GSEA and literature reviewing data. Least absolute shrinkage and selection operator (LASSO) regression analysis was used to control the number of variables in each group. Next, we used logistic regression to build a gene signature in each group to indicate differentiation status, and we computed receiver operating characteristic (ROC) curve to evaluate the indicative performance of each signature. Results: A total of 307 upregulated and 313 downregulated genes in poorly differentiated thyroid cancer (PDTC) compared with papillary thyroid cancer (PTC) and normal thyroid (NT) were screened out in FUSCC cohort and validated in two GEO cohorts. WGCNA of 620 differential genes yielded the seven core genes with the highest correlation with thyroid differentiation score (TDS). Furthermore, 395 genes significantly correlated with TDS in univariate logistic regression analysis were divided into 11 groups. The areas under the ROC curve (AUCs) of the gene signature of group transcription and epigenetic modification, signal and substance transport, extracellular matrix (ECM), and metabolism in the training set [The Cancer Genome Atlas (TCGA) cohort] and validation set (combined GEO cohort) were both >0.75. The gene signature based on group transcription and epigenetic modification, cilia formation and movement, and proliferation can reflect the patient's disease recurrence state. Conclusion: The dedifferentiation of DTC is affected by a variety of mechanisms including many genes. The gene signature of group transcription and epigenetic modification, signal and substance transport, ECM, and metabolism can be used as biomarkers for DDTC.

Background: Differentiated thyroid cancer (DTC) is the most common type of thyroid cancer. Many of them can relapse to dedifferentiated thyroid cancer (DDTC) and exhibit different gene expression profiles. The underlying mechanism of dedifferentiation and the involved genes or pathways remained to be investigated.
Methods: A discovery cohort obtained from patients who received surgical resection in the Fudan University Shanghai Cancer Center (FUSCC) and two validation cohorts derived from Gene Expression Omnibus (GEO) database were used to screen out differentially expressed genes in the dedifferentiation process. Weighted gene co-expression network analysis (WGCNA) was constructed to identify modules highly related to differentiation. Gene Set Enrichment Analysis (GSEA) was used to identify pathways related to differentiation, and all differentially expressed genes were grouped by function based on the GSEA and literature reviewing data. Least absolute shrinkage and selection operator (LASSO) regression analysis was used to control the number of variables in each group. Next, we used logistic regression to build a gene signature in each group to indicate differentiation status, and we computed receiver operating characteristic (ROC) curve to evaluate the indicative performance of each signature.
Results: A total of 307 upregulated and 313 downregulated genes in poorly differentiated thyroid cancer (PDTC) compared with papillary thyroid cancer (PTC) and normal thyroid (NT) were screened out in FUSCC cohort and validated in two GEO cohorts. WGCNA of 620 differential genes yielded the seven core genes with the highest correlation with thyroid differentiation score (TDS). Furthermore, 395 genes significantly correlated with TDS in univariate logistic regression analysis were divided into 11 groups. The areas under the ROC curve (AUCs) of the gene signature of group transcription and epigenetic modification, signal and substance transport, extracellular matrix (ECM), and metabolism in the training set [The Cancer Genome Atlas (TCGA) cohort] and validation set (combined GEO cohort) were both >0.75. The gene signature based on group

INTRODUCTION
Papillary thyroid cancer (PTC) is the most common type of thyroid cancer, and the majority of PTCs exhibit a relatively good prognosis (1,2). However, it has been observed recently that some PTCs may dedifferentiate in some situations. When PTC appears to be dedifferentiated, its prognosis becomes very poor, and conventional surgical treatment cannot achieve good therapeutic effects. Such patients often experience relapse or metastasis in a short period of time (3)(4)(5). At present, the treatment methods for poorly differentiated thyroid cancer (PDTC) and anaplastic thyroid cancer (ATC) are limited, and their progression mechanism is still unclear.
It has been shown that many ATC and PDTC result from dedifferentiation of DTC (6). In addition, some genetic abnormalities such as TERT and TP53 mutation may play an important role (7,8). A considerable number of studies have shown that occurrence and development of PDTC and ATC are closely related to immune microenvironment and epigenetic changes (9)(10)(11)(12). Our previous study also revealed that some genes may have a significant impact on the initiation and progression of dedifferentiated thyroid cancer (DDTC) through metabolism-related pathways (13). However, considering that dedifferentiation of DTC is accompanied by a great increase in the degree of malignancy, it is likely that dedifferentiation must involve more than one mechanism. The mechanisms of how genes affect DTC dedifferentiation remain to be studied.
This study was oriented toward mining out differentially expressed genes among PDTC, PTC, and normal thyroid (NT) at the level of transcriptome and then classifying them into different groups based on their biological functions to explore possible dedifferentiation-related processes. We expect that our findings could provide a plausible basis for further study of PDTC, thereby helping to indicate prognosis and development of PTC and exploring the possibility of reversing dedifferentiation or re-differentiation.

Sample Collection
Six NT, five PTC, and five PDTC specimens were obtained from eight patients who underwent surgical management in the Fudan University Shanghai Cancer Center (FUSCC) (Supplementary Table 1). The information of the eight patients and 16 samples was described in our previous study (13). These 16 samples were included in a discovery cohort and used for highthroughput RNA sequencing (RNA-seq) to identify differentially expressed genes. Written informed consent was obtained from each patient before his/her specimens were used in this study, and the study was approved by the Medical Ethics Committee of the FUSCC. All procedures performed in this study were in accordance with the Declaration of Helsinki.

RNA-Seq Analysis
Total RNA was extracted from all samples using TRIzol reagent (Life Technologies, Carlsbad, CA). We used RiboMinus eukaryote kit (Qiagen, Valencia, CA) to remove ribosomal RNA of total RNA (∼3 mg) before RNA-seq libraries construction. Strand-specific RNA-Seq libraries were prepared using the Illumina workflow (New England BioLabs, Beverly, MA). Next, the samples were fragmented, reverse-transcribed, and ligated to Illumina adaptors. We purified the ligated cDNA products to remove second-strand cDNA. After 13-15 cycles of amplification, libraries were controlled for quality and quantified using with an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA) and sequenced by a HiSeq 2000 sequencing system (Illumina, San Diego, CA) on a 100-bp paired-end run. Clustal Omega was used for sequence alignment. Human genome version GRCh38.100 was used throughout. RNA-seq data were normalized by fragments per kilobase per million (FPKM). Significant differences were determined by Limma package (version 3.11; https://bioconductor.org/packages/limma/).

The Cancer Genome Atlas and Gene Expression Omnibus Cohorts
The Cancer Genome Atlas (TCGA) mRNA expression data (FPKM) was sourced from the University of California Santa Cruz (UCSC) Cancer Browser (14). The raw data of Gene Expression Omnibus (GEO) combined GEO cohort including the GSE29265 cohort (20 NTs, 20 PTCs, and 9 ATCs) (15), GSE33630 (16), GSE53157 (17), GSE65144 (18), and GSE76039 (19) were obtained from the GEO database (20,21) and were background adjusted and normalized. The ComBat method was performed to remove batch effects by the R package "sva." According to the annotation file, probes were matched with gene symbol, and probes that were not matched to gene symbol were deleted. When more than one probe matched the same gene symbol, average value was calculated as the final expression value. We performed a correlation analysis on expression matrix of PDTC and ATC samples in the combined GEO cohort, and we found that PDTC and ATC are highly correlated (Supplementary Figure 1). Therefore, we used the GSE29265 cohort and GSE33630 cohort as verification cohorts to verify the effectiveness of gene signatures. All patients were staged by the 8th edition of the TNM staging system published by the American Joint Committee on Cancer. Only samples for which all clinical data and thyroid differentiation score (TDS) could be obtained were included in the analysis.

Gene Function Annotation and Interaction Analysis
Gene Set Enrichment Analysis (GSEA) was performed using GSEA software. C5 collection [Gene Ontology (GO)] was utilized to identify GO terms that were differentially regulated in different comparisons. GO and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis were performed by DAVID online function annotation tool (22,23) to classify differentially expressed genes into functional categories. The enriched group was ranked by p-value (p < 0.05) and false discovery rate (FDR < 0.25). The GeneMANIA prediction server was used for interaction analysis between genes (24).

Weighted Gene Co-expression Network Analysis
Weighted gene co-expression network analysis (WGCNA) was performed by the R WGCNA package (25)   metabolism and function, which can be an index to determine the degree of thyroid-specific differentiation. We have eliminated these genes in subsequent calculations to avoid bias (26). According to median of TDS, TCGA dataset was divided into two groups: highly differentiated and poorly differentiated. Least absolute shrinkage and selection operator (LASSO) regression was performed on each function group to reduce the number of variables. Multivariate logistic regression was used to determine genes to be finally included in the signature; expression levels of the differential genes as signatures in each patient were integrated into a risk score fitted by logistic regression ( Table 1). Non-parametric receiver operating characteristic (ROC) analysis was performed for each signature, and we calculated area under the ROC curve (AUC) to test its indicative power for dedifferentiation. The Kaplan-Meier method was used to construct the disease-free survival (DFS) curve. Two-side p < 0.05 was considered statistically significant. Statistical analysis and visualization were carried out through SPSS (v25.0; IBM Corporation, Armonk, NY) and R (v3.6.3; R Foundation for Statistical Computing, Vienna, Austria) software.

Identification of Genes Related to Differentiation Status in Dedifferentiated Thyroid Cancer
To identify genes that may relate to dedifferentiation of DTC, we first analyzed changes in high-throughput transcriptome expression profile of five PDTCs, five PTCs, and six NTs from the FUSCC cohort. Then, 1,465 deregulated genes (690 upregulated and 775 downregulated, fold change ≥ 2, p < 0.05) were found among three groups ( Figure 1A). Then we found out the corresponding expression values of these genes in two GEO datasets (GSE29265 and GSE33630) and confirmed their differential expression (307 upregulated genes and 313 downregulated genes, fold change ≥ 1, p < 0.05), as shown in Venn diagrams and heat maps, which finally validated our target gene set for an in-depth study (Figures 1B-F).

Weighted Gene Co-expression Network Analysis Obtained a Co-expression Module Containing Seven Hub Genes
The WGCNA of 620 valid genes was carried out. Seven gene co-expression modules were detected. WGCNA assigned colors to name each module, and the yellow module showed the highest correlation with TDS (Figures 2A-D). Then we screened genes by direct correlation between genes and specified traits, module identity, and weighted correlation; METTL7A, KCNQ1, ALDH9A1, C16orf46, PLAUR, BCL2, and TPMT were defined as the hub genes. The expression levels of these seven hub genes in each patient were fitted to a dedifferentiation risk score through logistic regression. The AUC value of this risk score in TCGA cohort reached 0.91( Figure 2F) and 0.73 in GEO combined cohort (Supplementary Figure 2). However, after the patients were divided into two groups based on the risk score, the difference in DFS between two groups was not significant (Figure 2E).

Functional Classification of Differentially Expressed Genes
The seven hub genes obtained by WGCNA showed no good results in functional enrichment and thus cannot fully reflect the impact of abnormally expressed genes on differentiation. Therefore, we screened 620 genes through univariate logistic regression analysis and found 396 genes that significantly correlated with TDS. Next, we tried to group these genes into functional groups based on the results of GSEA. We  found that the results of GSEA were mostly concentrated in proliferation, immunity, metabolism, and other related pathways (Supplementary Tables 2, 3). The most significant phenotype of tumor cells after dedifferentiation was the proliferation ability (27,28); as a result, some phenotypes of interest may have been masked by proliferation during GSEA. Therefore, we combined the results of GSEA and data from the literature to divide all deregulated genes into 11 groups including cell differentiation and rhythm, substance and signal transduction, transcription and epigenetic modification, protein modification, cilia formation and movement, extracellular matrix (ECM), apoptosis, proliferation, invasion, metabolism, and immunity ( Table 2). In addition, there were 44 genes that cannot be clearly functionally classified ( Table 1;  Supplementary Table 4).

Gene Signatures of Signal and Substance Transport, Transcription and Epigenetic Modification, Extracellular Matrix, and Metabolism Group Can Indicate Dedifferentiation of Papillary Thyroid Cancer
Among the 396 genes obtained in multivariate regression analysis, many genes have been extensively studied. For example, EZH2, a methyltransferase, is closely related to tumor metastasis and proliferation (29,30). TNC, an extracellular matrix protein, plays an important role in tumor cell invasion and metastasis (31,32). PRDM16, which also plays an important role in protein modification, is also closely related to fat metabolism and tumor growth (33,34). In addition, there are many important tumorrelated genes. We have further screened them in each group to construct a functionally related gene signature.
We first used LASSO regression to screen each group of genes (Supplementary Figure 3) and then performed multivariate logistic regression analysis adjusted by T stage, lymph node metastasis (LNM), and BRAF V600E on each group of genes selected to determine genes to be incorporated into the signature, and we evaluated the indicative ability of each signature by calculating the AUC value in the training set (TCGA cohort) and validation set (GEO combined cohort). We found that the AUC value in the group signal and substance transport, transcription and epigenetic modification, ECM, and metabolism was >0.75 in both the training set and validation set (Figure 3). It was worth noting that in two datasets, the AUC values of the invasion group reached 0.919297 and 0.746622, and the cilia formation and movement group also reached 0.805777 and 0.709767 (Supplementary Figure 4).

Correlations of the Gene Signatures With Clinical Features of Papillary Thyroid Cancer
In further analyses, we investigated the clinicopathological significance of the gene signatures associated with dedifferentiation to reveal their further research priority and translational potential.
Based on the gene signature of each group, we divided all TCGA patients into high-risk and low-risk groups, and we analyzed their survival differences. We found that although some risk score of gene signatures can reflect degree of differentiation well, they cannot accurately reflect the patient's tumor recurrence status. Among all gene signatures, there were significant differences in DFS between the high-risk and low-risk groups including transcription and epigenetic modification, cilia formation and movement, and proliferation (Figure 4; Supplementary Figure 5). Next, we further analyzed the relationships between high or low risk and clinical parameters in TCGA cohort ( Table 3), and we found significant differences in BRAF V600E mutation, TNM stage, T stage, and TDS. In  addition, in the cilia formation group, the risk score and LNM also had a significant correlation (Figure 5).

DISCUSSION
Undifferentiated thyroid cancer is relatively rare, but it has a very high degree of malignancy, which brings great difficulties in exploring its pathogenesis. In recent years, many studies have explored undifferentiated thyroid cancer at pretranscription, transcription, and translation levels, revealing that a considerable number of undifferentiated cancers develop from dedifferentiation of DTC. Studies have shown that dedifferentiation of colorectal cancer is closely related to TGF-β, Wnt, and Hedgehog signaling pathways (35), while dedifferentiation of glioblastoma is closely related to hypoxia (36). Our previous studies have also shown that metabolic pathways play an important role in dedifferentiation of thyroid cancer (13). Most of these studies focused on a certain pathway or mechanism; however, according to previous experience, the occurrence and development of malignant tumors, including dedifferentiation, involve multiple mechanisms and a large number of abnormal gene expressions. Moreover, research on the mechanism of dedifferentiation of DTC is still not comprehensive. In order to explore the potential mechanism of dedifferentiation of DTC from a broader perspective, we retrospectively obtained 16 samples of eight patients with DDTC who had undergone surgical treatment in our institution (FUSCC cohort). We used intermittent sampling to obtain tissue samples with a gradient of differentiation. Through the joint analysis of 16 samples of high-throughput sequencing data, TCGA cohort, and GEO chip data, we obtained 620 differentially expressed genes related to dedifferentiation of PTC, and all of them were subjected to further analysis. First, we performed WGCNA on all 620 genes and divided them into seven modules according to their expression. Among them, the yellow module demonstrated the strongest correlation with TDS. After screening genes by direct correlation between genes and specified traits, module identity, and weighted correlation, seven genes including METTL7A, KCNQ1, ALDH9A1, C16orf46, PLAUR, BCL2, and TMPT have been certified as hub genes. Through analysis of GeneMANIA prediction server, we found that these seven hub genes and their interacting genes directly or indirectly interact and co-localize, but they were not functionally enriched together (Supplementary Figure 6). Thus, we believe that the results of WGCNA cannot fully explain the dedifferentiation process of DTC.
In order to reflect the differentiation status more comprehensively from multiple angles, we first performed GSEA on all genes and found that the pathways related to TDS were mostly enriched on proliferation and apoptosis, which is consistent with the biological performance of PDTC. Many studies also focused on pathways related to cell proliferation, invasion, and immunity (26,(37)(38)(39); but also many new tumor-related phenotypes, such as ECM, cilia formation and movement, and epigenetic modification, have attracted our interest (40)(41)(42)(43). However, these phenotypes have been rarely studied in thyroid cancer, so there is a certain value for further research (9,44,45).
We evaluated the gene signature of each group by calculating AUC, and we verified it in the combined GEO cohort. Finally, we identified gene signatures in group signal and substance transport, transcription and epigenetic modification, ECM, invasion, and metabolism groups, which showed very good indicative performance. In the gene signature of signal and substance transport group, PEBP1, FZD5, KCNQ1, and TOM1L2 play a role in protein kinase binding together (Supplementary Figure 7A), while PEBP1 is also associated with autophagy and ferroptotic death (46). In the transcription and epigenetic modification group, SMAD6 and PRDM16 showed a high degree of consistency in function (Supplementary Figure 7B), as functions of these two genes are concentrated on TGF-β/SMAD (47,48), a pathway that plays an important role in cell differentiation. SMAD6 is a component of this pathway; and its importance is self-evident, as PRDM16, an important methyltransferase and transcription factor, has also been reported as a repressor of the TGF-β/SMAD pathway (49). In the ECM group, TNC was in a central position, and its main interaction target was Integrin Alpha V (ITGAV), a protein expressed in thyroid tissues higher than in other tissues in the human body, suggesting that it specifically influences the differentiation of thyroid cancer through the phenotype of extracellular matrix (Supplementary Figure 7C). As a traditional malignant tumor phenotype, the gene signature of the invasion group consisted of NHSL1, OCLN, EPB41L4B, PLAUR, PIK3CB, EPB41L5, ID4, TPM4, and MARVELD2. TPM4 showed the opposite effects in colon cancer and lung cancer. In our analysis, its high expression corresponded to better differentiation, which is consistent with the research results in colon cancer (50,51). In the metabolism group, CHPT1, PDE8B, HSD17B8, ADAMTSL1, ST3GAL1, SLC7A5, ACOT7, and PYGL were combined to make up a signature. PDE8B, HSD17B8, ST3GAL1, and ACOT7 also appeared in a signature constructed in our previous study (13).
Through the STRING database, we found that some genes have an interaction relationship, including intra-group and inter-group interactions (52) (Supplementary Figure 8). In the transcription and epigenetic modification group, there are gene fusion and co-expression between TBX3 and SMAD6 (53). OCLN and MARVELD2 are both invasion-related genes, and they share a protein homology (54). MARVELD2 also has a co-expression relationship with the signal and substance transport gene ILDR2 (54). KLHDC1 in the signal and substance transport group, and IQGAP3 in the proliferation group also has co-expression (55). These possible interactions indicate that the joint effect of these genes may have a more important impact on differentiation of DTC. Since they have not been studied in thyroid cancer, it is worthy of further investigation.
At the same time, we also paid due attention to the relationship between risk scores and other clinical parameters. We found that the cilia formation group could better reflect the condition of LNM than other groups. Other studies have shown that cilia formation is closely related to autophagy (56); therefore, high expression of cilia formation-related proteins (DNAH7, TAPT1, and BBS1) in the low-risk group may be due to exuberant autophagy activity (Supplementary Figure 7D). Additionally, the formation and movement of cilia are also very important to the movement of cells, which in turn affect the tumor cell migration and invasion (57). In recent years, studies have found that the sensitivity of ATC to chemotherapy drugs is closely related to autophagy (58)(59)(60)(61); therefore, we believe that genes involved in cilia formation and movement are promising as targets for drug therapy and should be further researched.
Our research also had certain limitations. First, we selected traditional malignant tumor-related phenotypes and some phenotypes that we were interested in as groupings, so this method is not very objective. There may be some important phenotypes that we did not include in the groups. This has also led to 44 genes that could not be included in specific group. At the same time, it should be noted that a more detailed and comprehensive grouping may divide the genes too finely, resulting in some interactions that cannot be reflected in intergroup. Second, limited by the number of cases and difficulty of obtaining material, the distinction between PDTC and PTC in the sequenced samples is not fine enough. We plan to collect more samples and use more precise microdissection methods for the next step of research.
In conclusion, we analyzed the genes that may affect the differentiation of thyroid cancer from 11 different perspectives and constructed gene signatures, revealing the possibility of multiple mechanisms that lead to dedifferentiation of DTC.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The name of the repository and accession number can be found below: National Center for Biotechnology Information (NCBI) BioProject, https://www.ncbi.nlm.nih.gov/ bioproject/, PRJNA702648.

ETHICS STATEMENT
This study was approved by the Medical Ethics Committee of the FUSCC. All procedures performed in our study were in accordance with the ethical standards of our institutional research committee and the Declaration of Helsinki.

AUTHOR CONTRIBUTIONS
WX, CL, BM, and WW contributed to conception and design of the study. HJ, XW, and YY organized the data. WX, CL, and YucW performed the statistical analysis. All authors contributed to writing and review of the manuscript and approved the final submitted version.

FUNDING
This work was supported by the National Natural Science Foundation of China (Grants 81572622 and 81772854 to QJ and 82072951 to YuW) and the Natural Science Foundation of Shanghai (Grant 19ZR1410900 to WW).

ACKNOWLEDGMENTS
The UCSC Xena Platform for cancer genomics datasets, the GEO platform of the National Center for Biotechnology Information, and the Science and Technology Commission of Shanghai Municipality are gratefully acknowledged. We were also very grateful to Haihan Chen for his help in the current research.