Cancer CRC: A Comprehensive Cancer Core Transcriptional Regulatory Circuit Resource and Analysis Platform

A core transcriptional regulatory circuit (CRC) is a group of interconnected auto-regulating transcription factors (TFs) that form loops and can be identified by super-enhancers (SEs). Studies have indicated that CRCs play an important role in defining cellular identity and determining cellular fate. Additionally, core TFs in CRCs are regulators of cell-type-specific transcriptional regulation. However, a global view of CRC properties across various cancer types has not been generated. Thus, we integrated paired cancer ATAC-seq and H3K27ac ChIP-seq data for specific cell lines to develop the Cancer CRC (http://bio.liclab.net/Cancer_crc/index.html). This platform documented 94,108 cancer CRCs, including 325 core TFs. The cancer CRC also provided the “SE active core TFs analysis” and “TF enrichment analysis” tools to identify potentially key TFs in cancer. In addition, we performed a comprehensive analysis of core TFs in various cancer types to reveal conserved and cancer-specific TFs.


INTRODUCTION
Transcription factors (TFs) regulate gene expression by binding to specific DNA sequences (1). A small number of TFs that are expressed in each cell type control the gene expression programs in specific cells, and dysregulation of these genes can cause various diseases (2)(3)(4)(5)(6). These TF clusters play crucial roles in the pathology and development of cancers. For example, in MYCN-amplified neuroblastoma, the TFs MYCN, HAND2, ISL1, PHOX2B, GATA3, and TBX2 are essential for maintaining cell state and survival (7). Ran et al. reported that in gastrointestinal stromal tumors (GIST), the forkhead family member FOXF1 directly controls the transcription of two master regulators, KIT and ETV1, and these TFs are required for precursor-interstitial cell lineage specification and the tumorigenesis of GIST (8). Furthermore, groundbreaking studies on embryonic stem cells (ESCs) revealed that OCT4, SOX2, and NANOG played prominent roles in early cell development. These TFs are essential for the proliferation of undifferentiated ESCs and contribute to the pluripotency and self-renewal of stem cells (9)(10)(11)(12). Together, these TFs cooperate to regulate cellular gene expression by binding regulatory elements that impact expressed genes. Additionally, they not only bind to their own loci, but also regulate mutually. They collaborate to form regulatory circuitry and their expression is driven individually and in cooperation with other members of the circuit. Researchers have defined these interconnected autoregulatory loops as core transcriptional regulatory circuits (CRCs), and TFs within these CRCs are known as core TFs (9,13). An increasing number of studies have investigated CRCs and core TFs in different cancer types. These studies have shown that core TFs promoted the growth of cancer cells (14). For example, in esophageal squamous cell carcinoma, the TFs TP63, SOX2, and KLF5 form a CRC to control gene expression programs and transcription patterns (15). Additionally, Decaesteker et al. showed that TBX2 is a member of a CRC in neuroblastoma. Furthermore, TBX2 drives proliferation through the activation of FOXM1 target genes (16). Above all, CRCs are critical for maintaining transcriptional patterns. Identification of CRCs in various cancers is essential for a better understanding of cell/tissue characteristics and for addressing fundamental molecular and cellular biological questions.
Studies have shown that genes encoding TFs that form CRCs are driven by SEs (17). SEs are densely bound by an array of TFs (18,19). Additionally, SEs are widely thought to drive the expression of cell-type-specific or disease-associated genes, including core TFs (5,17,(20)(21)(22). Recently, a CRC database, dbCoRC, was developed by Hung et al. (23). This database provides a valuable resource to catalog the fast-expanding information on CRCs. In dbCoRC, CRCs are computationally inferred from the mapping of SEs and the prediction of TF binding sites using H3K27ac ChIP-seq data. However, chromatin accessible regions can now be identified by the "assay for transposase-accessible chromatin using sequencing" (ATAC-seq) (24) technology. For H3k27ac ChIP-seq and ATACseq data, both can be used independently to identify regulatory element regions. Uniquely, the combination of these two kinds of data can be used to identify precise TF-binding hypersensitive elements within large enhancer regions, which will be helpful for identifying SE-defined TF networks in cancer. Furthermore, the python package "Coltron" (https://pypi.python.org/pypi/coltron) developed by Saint-Andre et al. can be used to integrate H3K27ac ChIP-seq and ATAC-seq data to identify CRCs (17,25). For instance, Ott et al. used "Coltron" to generate regulatory landscapes of chronic lymphocytic leukemia samples and representative cell lines. They showed that CRC discovery in primary cancer cells uncovered tumor-specific hallmarks and active TF regulatory pathways (26). That study also demonstrated the success of CRC reconstruction using "Coltron".
Here, we present the Cancer CRC (http://bio.liclab.net/ Cancer_crc/index.html), the first comprehensive and interactive resource of cancer CRC models. To highlight the advantages and innovations of Cancer CRC, we compared it with existing CRC databases dbCoRC in both data and functionality in Table S1. Cancer CRC focuses on building a comprehensive cancer CRCs and analysis platform, and has more obvious cancer specificity and functional advantages (Table S1). Cancer CRC contains information on core TFs, including topological properties (in-degree and out-degree), expression values, frequencies, associated SEs, mutational information, disease information, associated pathways, Gene Ontology (GO) functional annotations and survival analyses. In particular, Cancer CRC supports a TF enrichment analysis tool to enable the discovery of important cancer-related TFs. Here, we provide a case study using differentially expressed TFs in breast cancer to demonstrate the value of this analytical tool. Moreover, we describe a global analysis of all core TFs and reveal the universality and specificity of TFs across cancer types. The Cancer CRC platform is a valuable resource to query, browse and visualize information about cancer CRCs, and it can also be used to investigate the function of TFs in cancer types.

Super Enhancers
First, we collected raw H3K27ac ChIP-seq data from NCBI GEO/ SRA, ENCODE, Roadmap and the Genomics of Gene Regulation Project. In the process of data collection and collation, it's required that H3K27ac and the corresponding control group (Input) ChIP-seq data must be both present in a cell or tissue sample to reduce false positives. Second, we used Bowtie (27) to align the raw sequencing reads to the human (hg19) reference genome. Third, the ChIP-seq peak calling analysis was performed using MACS software (28). Last, we used rank ordering of super enhancers (22) software to annotate SE regions. Furthermore, detailed SE information and analysis can be viewed in the SEdb (29) database and SE analysis (30) web server, which were developed by our group.

Accessible Chromatin
ATAC-seq provides a systematic approach to understand the genome in cancer to advance diagnosis and therapy. To obtain cancer-specific open chromatin regions, we collected genomic regions of 23 kinds of primary human cancers from TCGA (https://www.cancer.gov/). These genomic regions have gone through quality control and screening, which are high-quality and cancer type-specific peak set. For each sample, the MACS2 callpeak command with the parameters "-shift -75 -extsize 150nomodel -call-summits -nolambda -keep-dup all -p 0.01" were used for peak calling. Then, the peak summits were extended by 250 bp on either side to a final width of 501 bp, filtered by the ENCODE hg38 blacklist (https://www.encodeproject.org/ annotations/ENCSR636HFF/), and further filtered to remove peaks that extend beyond the ends of chromosomes (31). Then, we used the UCSC liftOver tool (32) to convert the genomic locations of these regions onto the hg19 reference genome to match to the SE data.

CRC Construction
Based on the available datasets, we only retained paired samples with both H3K27ac ChIP-seq data and cancer ATAC-seq data, which covered 13 cancer tissue types. The detailed information on each H3k27ac sample (source, tissue type, sample type and cancer type) can be found in Table S2. To integrate the two types of data for each cancer type, the Coltron pipeline was used with the command: -o [OUTPUTFOLDER] -n [NAME]", along with the option "-s SUBPEAKS, -subpeaks=SUBPEAKS". We also provide the detailed technical pipeline ( Figure S1).
By calculating the regions overlapping between the H3k27ac ChIP-seq and ATAC-seq data, we obtained the trans-active SEs regions that TFs typically bind to. At present, there are many ways to recognize TF binding motifs (33,34). Here, the underlying sequences from these regions were extracted and FIMO v4.91 was used to search for TF binding sites in these regions (35). TF position-weight matrices were taken from TRANSFAC and JASPAR (36,37). When a motif of TF A was identified in an SE of TF B, an edge was linked between TF A and TF B. We merged all edges to obtain a TF interaction network. A clique was defined as a subnetwork of the TF network with a size of at least two nodes (TFs) where all TFs were connected to themselves and other TFs within that clique. For each sample, the set of all cliques of size 2 or greater were defined. Additionally, CRCs were defined as completely auto-regulatory TF cliques within each sample.
Through our pipeline, we obtained the CRC file of each sample. Taking the esophageal cancer KYSE510 cell line as an example (Table S3), the results file consisted of three columns (a list of all CRCs in the sample, the score of each CRC and the number of TFs of each CRC). Ultimately, we identified more than 94,000 CRCs, and the distribution of all CRCs core TFs across each cancer can be found in Table 1. We count the number of core TFs in each cancer to facilitate users to browse and search for core TFs according to each cancer.

TF Degree
The total in-degree of a TF was defined as the sum of the TFs binding to their regulatory elements, while the out-degree of a TF was defined as the sum of the TFs that were regulated by it in the network. The total degree was calculated by summing the total number of edges adjacent to the given node (Table S4).

TF Frequency
In a sample, the frequency of a TF was defined as the number of all CRCs in the sample divided by the number of CRCs that contain that TF.

CRCs Score
"Clique score" was defined as the average TF frequency across all cliques of a sample.

TF Expression
Cancer CRC provides TF expression information in different samples. We downloaded mRNA expression profiles with FPKM values from TCGA (https://tcga-data.nci.nih.gov/tcga), GTEx, CCLE and ENCODE databases. The expression panel provides users with the differential expression of TFs in different samples.

Mutation Information
Several studies have shown that disease-associated variants are significantly enriched in TFs. Somatic mutations within TFs may affect the regulation of target genes and the expression of downstream genes (38)(39)(40). Thus, we collected somatic mutations in TFs from OncoBase (41). Furthermore, genetic mutations located in regulatory regions may provide clues that the mutation results in the dysregulation of gene expression (42,43). We also annotated mutation information in TF regulatory regions, such as common SNPs, risk-associated SNPs, eQTLs, SAS, EAS, EUR, AMR and AFR.

GO Term Information
The GO database was used to predict the potential functions of TFs (44). We annotated biological terms to TFs and provided associated information for core TFs including GO_term_type and GO_term_name acquired from the GO database.

Disease Information
The Comparative Toxicogenomics Database (45) and PsyGeNE (46) pipelines were integrated into the Cancer CRC to evaluate the correlations between TFs and various diseases.

Pathway Information
Pathways associated with TFs were acquired from ComPAT (http://bio.licpathway.net:8018/msg/ComPAT/index.do). For a core TF, we provided related pathways as well as the pathway name, pathway source, gene number and edge number of the current pathway. When selecting the pathway ID, detailed information including comprehensive annotations and interactive visualizations of the current pathway are shown.

Survival Analysis
The survival analysis module allows users to select a TF or multiple TFs on the detail page for survival analysis, which is implemented by the built-in "gepia2" python package (47). Furthermore, there are several clinical factors, such as age, sex, stage, etc. to help users observe the impact of these factors on prognosis.

Construction and Functionalities of the Cancer CRC
The construction of the Cancer CRC platform was based on the comprehensive identification and analysis of CRCs from various cancer types. The content and construction pipeline of the platform are presented in Figure 1.

Online Analysis Tools
Cancer CRC provides two tools to perform related analyses on core TFs. In the first tool, "Super enhancer active core TF analysis", users can obtain relevant TF regulatory information by inputting a genomic region. Then, SE regions overlapping with the input region will be returned, as well as core TFs regulated by these SE regions and linked sample information.
The core TFs related to CRC information can also be obtained by clicking on a specific TF. Furthermore, for the SE regions, we provide detailed annotation information including common single nucleotide polymorphisms (SNPs), risk-associated SNPs and expression quantitative trait loci (eQTLs), which may affect the binding of TFs to SEs. Second, in the "TF enrichment analysis" tool, Cancer CRC annotates TFs submitted by users with reference cancer core TF sets. We calculated the statistical significance of the enrichment analysis using the hypergeometric test (48). The corresponding enrichment results are shown in the form of a bubble diagram, bar graphs and a Venn diagram. Furthermore, Cancer CRC provides further information on these annotated TFs.

Personalized Genome Browser and Data Visualization
The query on the search page is limited to TFs. To better display multidimensional TF information across the entire genome, we developed a personalized genome browser using JBrowse (49) to view detailed transcriptional regulatory information with several tracks. Users can select multiple tracks to interactively display peaks at a specific genomic location and see the proximity of specific regions to nearby genes, genomic segments, SNPs, common SNPs, risk-associated SNPs, enhancers and conserved TF binding sites. For each genomic region in our platform, there is a hyperlink to the UCSC Genome Browser webserver and the built-in JBrowse viewer.

An Application of Cancer CRC to Explore the CRCs in Esophageal Cancer
From the "Browse" page, users can apply three strategies to search for samples of interest. To find esophageal cancer CRCs, users can select the "Cancer type" as "ESCA" (option 1); enter "ESCA" in the "Search" tab (option 2); or adjust the number of entries per page and look up their cancer of interest in the alphanumerical table (option  3). From the results, we can see that there are two samples of esophageal cancer (Figure 2A). After clicking on "Sample_02_228" and the KYSE510 cell line, the page will be redirected to display the sample details. For the current sample, all CRCs in this sample will be shown ( Figure 2B). This of the core TFs present in the CRC of that sample ( Figure 2C). We found that KLF5, SOX2 and TP63 all have a high frequency in this sample's CRCs. Additionally, all of the core TFs have a wellestablished function in esophageal cancer (50)(51)(52). Another significant piece of information on the core TFs is the degree ( Figure 2D). Notably, KLF5, SOX2 and TP63 were also highly connected in the KYSE510 cell line. We found that TFs with higher frequencies also tended to have higher degree, and these two properties of core TFs could help select TFs that are important in cancers. In addition, information on the sample details also included: (i) a sample overview (sample basic information, number of CRCs in this sample and all core TFs of this sample); (ii) the most representative CRC of this sample; (iii) the "Expression" panel shows the differential expression of core TFs across a large panel of human cell lines, normal tissues and cancers.
Taking the most representative CRC of esophageal cancer Sample_02_228 as an example, the CRC details page will be redirected to display a visual and interactive CRC model along with the exportable tabular information on individual core TFs. Cancer CRC provides regulatory information on TFs in one CRC. The genetic and epigenetic annotations for binding regions of TFs, including common SNPs, risk-associated SNPs, eQTLs, and so on, are provided ( Figure 2E). Furthermore, conjoint survival analysis and co-expression of TFs in each CRC are provided on the CRC detail pages. When clicking on a TF (e.g., BCL6), the mutation, disease and expression information of that TF are displayed ( Figure 2F). Moreover, information on individual core TFs, their corresponding SE regions, the frequencies of core TFs in all of the CRCs of the sample, the distribution of core TFs among different CRCs (the distribution of a TF in samples most representative CRC and the distribution of a TF in all CRCs), survival analysis results, TF associated pathway information and GO term information are provided on the detail page. It is worth mentioning that if users want to search for esophageal cancer related core TFs through the search page, our platform can rank all the core TFs in esophageal cancer (section "TF in CRCs of BRCA") on the return page to help users find potentially important TFs in esophageal cancer.

A Case Study Using Differential Expressed Breast Cancer TFs
Numerous studies have focused on enrichment analysis of differentially expressed TFs to explore their potential functions. Therefore, by analyzing differentially expressed TFs in breast cancer, we aimed to find key TFs affecting its occurrence and development. First, we used the "limma" R package (53) to calculate differentially expressed genes in breast cancer samples from TCGA. Then, we obtained the differentially expressed TFs (log 2 FC > 1, P < 0.05) ( Table S5) as input for Cancer CRC. Next, we set the parameters of the hypergeometric test at P = 0.01 and clicked the "RUN" button to start the enrichment analysis ( Figure 3A).
The analytical results showed that these TFs were only significantly enriched in three invasive breast cancer samples ( Figure 3B). Users can click on the hyperlink to view the sample details and TFs annotated to the sample. In total, seven TFs (HOXC11, FOXA1, SPDEF, SOX12, HOXC10, GATA3 and TFAP2A) were annotated to the breast cancer samples. Indeed, most of these TFs have previously been associated with breast cancer. For example, HOXC11 was identified to be an indicator of poor response to hormonal therapy in breast cancer cases (54). FOXA1, GATA3 and SPDEF are key drivers of estrogen receptorpositive (ER+) breast cancer risk as well as cancer progression (55,56). In breast cancer cells, overexpression of SOX12 partially improved the inhibitory effect on cell proliferation, migration and invasion. Additionally, a study demonstrated that cases with high HOXC10 expression showed shorter recurrence-free and overall survival in patients with breast cancer undergoing chemotherapy (57,58). However, one of the TFs annotated in our study, TFAP2A, has not been extensively studied in breast cancer, and may be a new critical TF associated with the disease. The bubble diagram, bar graphs and Venn diagram of the enrichment analysis result are also provided ( Figure 3C). Users can download these results by clicking on the top button.
Although our analyses showed that these differentially expressed TFs are highly related to breast cancer, further exploration of the biological mechanisms leading to cancer remains essential. Therefore, we examined the regulatory relationship of the enriched TFs and found that all the identified TFs regulated each other ( Figure 3D). Furthermore, customized survival analysis could guide research on prognosis for these TFs ( Figure 3E).
Taken together, we provided a specific case study of enrichment analysis for differentially expressed TFs in breast cancer using the Cancer CRC resource. Most of the annotated TFs had previously been shown to be directly related to breast cancer. These results showed the value and significance of our platform for research of TFs in cancer.

Landscape of Core TFs Across Cancer Types
Investigating the expression patterns and distribution of all core TFs across cancer types can reveal conserved and cancer-specific TFs, which will help reveal their key roles in the genesis and progression of tumors. By analyzing mRNA expression data, we found that the expression of core TFs across cancers was significantly higher than non-TF mRNA and other TFs ( Figure 4A), suggesting that core TFs were closely associated with cancer developmental processes (17). Next, we examined the frequency of the core TFs in each cancer type, and we found a lot of universal and specific core TFs in various cancers ( Figure 4B). On one hand, one group of TFs was specific for certain cancers. Cancer-specific TFs play a key role in the occurrence and development of different cancers. For example, PAX9 and TP73 appeared specifically in esophagus cancer ( Figure 4B). Studies have shown that PAX9 regulated squamous cell differentiation in the epithelium, and progressive loss of PAX9 expression correlated with increasing malignancy of the esophagus (59,60). However, the specific TF MITF of skin cutaneous melanoma (SKCM), which is a marker of melanoma cell differentiation and regulates a variety of melanocyte differentiation genes. It has been reported that the down-regulation of MITF is related to the acquisition of melanoma stem cells, and MITF is involved in the metastatic growth of melanoma after diffusion.
Besides, the SKCM specific TFs GLI3, MEF2C, and MITF are closely related to melanocyte differentiation, pigment cell differentiation and pigmentation pathway (61)(62)(63). For esophageal cancer, as shown in Figure 4C, PAX9 was expressed at a low level in esophageal cancer. For TP73, a study showed that a SNP (rs2273953) in TP73 in esophageal cancer patients impacted disease-free survival (64). Overall, these TFs play important roles in the occurrence and development of esophageal cancer. On the other hand, some TFs were prevalent across digestive tract cancers, such as SOX2, CEBPA and ZNF217. Among them, SOX2 expression may serve as a novel prognostic factor for patients with digestive tract cancers. Overexpression of SOX2 is correlated with vascular invasion, differentiation and poor prognosis of patients with digestive tract cancers (65). Additionally, the overexpression of SOX2 in digestive tract cancer is evident in Figure 4C. Many studies have confirmed that CEPBA was a master regulator of hepatic function, and CEPBA expression was suppressed in many liver diseases including hepatocellular carcinoma (66). Additionally, CEBPA has been identified as a gene with high frequency mutations in hepatocellular adenocarcinoma of the stomach (a type of gastric cancer) (67). Finally, ZNF217 is overexpressed in colorectal carcinoma tissues and is associated with tumor malignant clinicopathological features, which may promote colorectal carcinoma progression by inducing cell migration and invasion (68,69). In addition, ZNF217 expression may be a novel prognostic biomarker in gastric cancer (70). The expression of these TFs in different cancers are provided in Figure 4C, and the expression levels of different TFs in corresponding cancers were consistent with previous studies (59,60,65,66,68,69). Next, in Figure S2A, we demonstrate the distribution of cancer-specific core TFs (core TFs that are present only in one cancer type) in cancers. Together with Figure 4B, we describe the distribution of all core TFs in different cancers and discovered the specificity of core TFs across different cancer types. Furthermore, we found that the expression of these specific core TFs also showed specificity in different cancer types ( Figure S2).

DISCUSSION
The general mechanisms underlying gene transcription are well understood (71,72), but the regulatory patterns of gene expression programs controlled by a small number of TFs are unknown in most cells. Identification of CRCs in tumors can reveal clues relating to cellular origin and gene regulatory drivers of the oncogenic state, which may offer novel strategies to fight against malignancy. To accurately identify CRCs in different cancer types, the current release of the Cancer CRC platform integrates publicly available cancer ChIP-seq and ATAC-seq datasets.
The Cancer CRC platform provides a user-friendly interface to search, browse, analyze and visualize information on cancer CRCs. Cancer CRC has abundant annotations, visual charts and useful analysis tools. Cancer CRC provides: (i) comprehensive information on core TFs, including the topological properties of TFs (in-degree and out-degree), expression values of TFs, frequencies of core TFs in the CRC from each sample, and the SEs, mutations, disease information, pathways, GO functions and survival information related to these TFs, all of which have user-friendly displays with interactive tables; (ii) the ability to search related CRCs based on cancer name, H3k27ac Sample ID or TF names; (iii) the ability to browse each sample to obtain detailed information; (iv) online analysis tools for "Super enhancer active core TFs analysis" and "TF enrichment analysis"; (v) a customized genome browser for user-friendly visualization of genomic context information; and (vi) curated data for each sample along with visual charts that can be downloaded from the platform.
In our pipeline, in order to improve the accuracy of cancer CRCs, we integrate tissues/cell lines H3K27ac ChIP-seq data and matched cancer ATAC-seq data. However, TCGA project only published ATAC-seq data for 23 cancers and researchers usually selectively implemented H3K27ac ChIP-seq experiments in some cancer cell lines, which led to the small sample size of our study. Fortunately, with the rapid development of highthroughput sequencing technology, more cancer accessible chromatin data will be available, and cell line/tissie-specific H3K27ac data will be greatly increased, which enable us to obtain more CRCs and core TFs for various cancers and expand our platform resources in future. On this basis, we will also develop a comprehensive algorithm that integrate multiple properties of core TFs to help researchers better identify potentially TFs in cancers. Because our platform provides cell line/tissue-specific TFs, it can be used to explore the target genes regulated by core TFs in a CRC. For a specific sample of a cancer, we can find all the target genes regulated by all the core TFs in CRCs from SEdb (29), so as to obtain the common target genes regulated by core TFs. Next, for a CRC, the list of target genes regulated by all core TFs of this CRC can be get. Then the number of target genes regulated by each core TF can be calculated, and we can find out the genes more jointly regulated by a CRC, which are more likely to be tissue-specific target genes. In future updated version of Cancer CRC, we will also consider adding such data for easy use by researchers. Furthermore, we will integrate TF ChIP-seq data to predict target genes and add them to our platform in the future.

CONCLUSIONS
Our motivation to establish this platform was prompted by the need to understand both cell/tissue type-specific TFs and CRCs in different cancer types. Systematic identification of active regulatory elements in cells has led to several key observations (26). By constructing specific TF regulatory networks using specific cancer enhancer profiles and chromatin accessibility regions, we identified critical TF nodes and CRCs. In addition, we systematically analyzed the core TFs across cancer types, providing a valuable reference for cancer research. We anticipate that these analyses will provide a framework for describing the state of primary tumor and will be useful in identifying TFs as potential therapeutic targets.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.