Signatures of Discriminative Copy Number Aberrations in 31 Cancer Subtypes

Copy number aberrations (CNA) are one of the most important classes of genomic mutations related to oncogenetic effects. In the past three decades, a vast amount of CNA data has been generated by molecular-cytogenetic and genome sequencing based methods. While this data has been instrumental in the identification of cancer-related genes and promoted research into the relation between CNA and histo-pathologically defined cancer types, the heterogeneity of source data and derived CNV profiles pose great challenges for data integration and comparative analysis. Furthermore, a majority of existing studies have been focused on the association of CNA to pre-selected “driver” genes with limited application to rare drivers and other genomic elements. In this study, we developed a bioinformatics pipeline to integrate a collection of 44,988 high-quality CNA profiles of high diversity. Using a hybrid model of neural networks and attention algorithm, we generated the CNA signatures of 31 cancer subtypes, depicting the uniqueness of their respective CNA landscapes. Finally, we constructed a multi-label classifier to identify the cancer type and the organ of origin from copy number profiling data. The investigation of the signatures suggested common patterns, not only of physiologically related cancer types but also of clinico-pathologically distant cancer types such as different cancers originating from the neural crest. Further experiments of classification models confirmed the effectiveness of the signatures in distinguishing different cancer types and demonstrated their potential in tumor classification.


INTRODUCTION
Copy number variations (CNV) are a class of structural genomic variants, in which the regional ploidy differs from the normal state of the corresponding chromosome. Germline copy number variations constitute a major part of genomic variability within and between populations and are an important contributor to genetic and inherited diseases (Zhang et al., 2009). In most cancer types, an extensive number of somatic CNV, usually referred to as sCNV or CNA (copy number aberrations), accumulate during the progression of the disease (Baudis, 2007;Beroukhim et al., 2010). CNAs have been shown to be directly associated with the expression of driver genes (Upender et al., 2004;Cox et al., 2005) where expression of oncogenes can be increased by copy number amplifications and tumor suppressor genes can be suppressed through heterozygous or homozygous deletions. On a genomic level, recurrent patterns of CNVs are observed in a number of cancer types and have been associated with cancer prognosis and development (Zack et al., 2013).
While traditional karyotyping and various DNA hybridization techniques had provided insights into specific CNV events, the systematic, genome-wide screening for CNV emerged in the 1990s as a reverse in-situ hybridization technology termed "comparative genomic hybridization" (CGH; Kallioniemi et al., 1992;Joos et al., 1993). Chromosomal CGH allowed the semiquantitative profiling of copy number changes over complete (tumor-)genomes. However, the technology was limited through its chromosomal banding based low resolution (Bentz et al., 1998) and only indirect association of CNV events with putative target genes. The hybridization of tumor (or germline) DNA to genome-spanning matrices was refined through the use of substrates ("arrays") containing thousands to more than 2 millions of mapped DNA sequence elements (Solinas-Toldo et al., 1997;Pinkel et al., 1998), now allowing the direct association of the experimental read-out to specific genome features. A variation of this principle, SNP (single nucleotide polymorphisms) based arrays (Wang et al., 1998), originally developed for the detection of allelic variations within populations, was rapidly adopted for CNV detection and allelic decomposition analysis in cancer (Zhao et al., 2004). Nowadays, Next-Generation Sequencing (NGS) techniques are increasingly adopted to detect copy number variations, Zare et al. (2017), Li et al. (2018), and Zhang et al. (2019) although technologies with coverage below shallow whole genome sequencing (Macintyre et al., 2016) show reduced utility for the analysis of CNV events when compared to high-density arrays. Regardless of their technical heterogeneity, a large number of CNV data has been generated in the past three decades, which represents an invaluable asset for genomics studies.
Spurred by an increasing interest in genomic heterogeneity as well as mutational patterns shared across tumor types, the exponential growth of available cancer CNV profiling data stimulated the study of patterns in the context of meta-analyses and large consortia studies, across multiple groups of tumors (Beroukhim et al., 2010;Grasso et al., 2012;Stephens et al., 2012;Ciriello et al., 2013;Wang et al., 2013;Zack et al., 2013;Zhao et al., 2013;Juhlin et al., 2015). The majority of the studies focused on finding either associations to cancer driver genes or the impact of focal regions in specific tumor types. As a result, the CNA patterns were often characterized by the coverage of driver genes in contrast to comparative analyses of the whole genome. While this approach can provide direct connections to the established theories, it also bears two drawbacks. First, the distribution of cancer driver genes is extremely skewed: a few hallmark drivers are responsible for a large percentage of tumorigenesis, while a long-tail of rare or putative drivers are reckoned to be the cause of the rest (Hou and Ma, 2014). Second, besides the association with somatic mutations, researches have also discovered different facets of CNA in their relations to cellular regulations and genome dynamics (Conrad et al., 2010;Völker et al., 2010;Chen et al., 2014;Mishra and Whetstine, 2016). Therefore, CNA patterns that solely rely on driver genes often lack the capacity to embrace the full spectrum of the aberrations. To generate CNA patterns, which capture their multitude of uniqueness, it would be more comprehensive to abstract by the aberrations' characteristics, rather than using focal regions that overlap with driver genes.
In translational research, an imminent goal of studying the CNA patterns of different cancers is to gain insight in designing new therapeutic protocols. Recently, the discovery of circulating cell-free DNA (cfDNA), which presents a potential for earlystage and non-invasive cancer detection, attracts the attention of both the academia and the industry (Pathak et al., 2006;Li et al., 2017;Phallen et al., 2017;Panagopoulou et al., 2019). While many cfDNA methods in cancer detection are focusing on somatic mutations, it is believed that the ultimate solution would come from an ensemble of different genomic aberrations . One of such aberrations is CNA in cancers that bear high burdens of copy number mutation. Large-scale cancer genome studies have illustrated the recurrent CNA patterns across different types of cancer. Several studies have employed CNA of cfDNA as biomarkers and demonstrated their potential in identifying cancer types and tissues of origin (Leary et al., 2012;Dawson et al., 2013;Heitzer et al., 2013). As an emerging field, the accurate identification of genomic abnormalities and classifications of the cfDNA yet remains challenging, and the characterization of CNA patterns across cancer types would provide a valuable piece of the solution to the puzzle.
Due to the technical heterogeneity and underlying biological variability, the meta-analysis of multi-platform cancer CNV data, which is necessary for comprehensive "pan cancer" studies, also poses great challenges in data integration and normalization (Gao and Baudis, 2020). In this study, we have assembled a collection of 56,077 CNA profiles and created the CNA signatures of 31 cancer subtypes, where each signature represented the CNA landscape of a specific cancer group. The signatures were generated from a computational pipeline (Figure 1), which unifies heterogeneous data and is powered by a hybrid model of neural networks and attention algorithm. Using the signatures, we also constructed a multi-label classifier of cancer types and tissues of origin from copy number segmentation data. The result illustrates the genetic uniqueness of CNA in different cancer types and demonstrates the potential of CNA signatures in tumor identification.

Data Integration
In this study, we integrated CNA profiles of tumor samples from three prominent resources. Progenetix (Cai et al., 2013) is a curated resource targeting copy number profiling data in human cancer. It features a large collection of data from different tissues and platforms. The Cancer Genome Atlas (TCGA) (Hutter and Zenklusen, 2018) represents a comprehensive cancer genomics initiative that generated standardized molecular profiling data for a wide range of cancer types. The Pan-Cancer Analysis of Whole Genomes (PCAWG) represents an initiative from the International Cancer Genome Consortium (ICGC) for the uniform analysis of a representative set of cancer samples using Whole Genome Sequencing (WGS) (Campbell et al., 2020). All FIGURE 1 | The workflow of the study was composed of three parts. The Features part consisted of methods of data integration and feature generation. The Signature part focused on creating CNA signatures for cancer subtypes and the categorization of subtypes. The Classification part recruited machine learning techniques to predict the organ and the subtype from a given copy number profile. three repositories provide publicly-though with the exception of Progenetix-partially access-controlled genomic data together with clinical information and diagnostic classifications. The initial collection consisted of 56,077 samples, representing an accumulation of CNA data from a wide range of tumor types and technological platforms.
When integrating data of high diversity for analyses, two major technical challenges can be encountered. First, genomics studies performed over the last decade will have applied different reference genome editions in their analysis pipeline, leading to shifted coordinates in results such as CNV segments. It is crucial to convert all data to the same coordinate system when analyzing data from multiple studies through a remapping procedure. However, unlike with SNP or other types of mutations involving short DNA sequences, applying standard remapping tools such as liftover (Kuhn et al., 2013) would result in a considerable information loss when converting copy number data, due to the occurrence of disruptive remapping in a relevant proportion of CNA segments. We previously addressed this problem with the development of a generic tool named segmentLiftover, to convert CNA data with high efficiency and minimized data loss . Second, regardless of the underlying technology, genomic copy number data is derived from the relative assessment and integration of multiple signals, with the data generation process being prone to contamination from several sources. Estimated copy number values have no absolute or strictly linear correlation to their corresponding DNA levels, and the extent of deviation differs between sample profiles, which poses a great challenge for data integration and comparison in large scale genome analysis. To tackle this problem, we designed a method called Mecan4CNA (Minimum Error Calibration and Normalization for Copy Numbers Analysis) to perform a uniform normalization (Gao and Baudis, 2020).
In the integration pipeline, samples were first converted to GRCh38(Hg38) using segmentLiftover. Then, the values in the sample were aligned to the corresponding true copy number levels of the main tumor clones using Mecan4CNA. For each step, a quality control protocol was set up to filter out samples carrying a low number of alternations, causing ambiguity in interpretation, or such having incomplete

Feature Representation
All copy number data can be represented in a platformindependent format of genomic segments, where each segment represents a continuous region on a chromosome. As the segments vary dramatically in platform-dependent resolution, reported size, and biologically supported thresholds (Krijgsman et al., 2014;Hastings et al., 2016), a pragmatic approach for crossstudy normalization lies in the creation of equally sized genomic bins and mapping of segment data into those. This method is elastic on the number of generated bins (features); therefore, it cooperates well with follow-up computational models that are sensitive to the feature numbers. However, the binning method disregards the uneven distribution of genes and does not capture any functional or structural information of the genome, thereby limiting biological correlations between features and the final model. Another straightforward approach is to use the coverage of protein coding genes to represent each segment. This method provides inherent connections to genetic functions; however, direct mapping to protein coding genes overlooks the structural impact of CNA and also results in a large number of features, which are often difficult for machine learning methods to perform in optimal. In this study, we designed a two-phase approach to generate a refined gene panel as the feature representation, so that the noise and redundancy of the feature space could be reduced with consideration to functional significance. In the first phase, CNA segment data was mapped to cytobands, which represent "genomic phenotypes" correlated to gene density and chromosome structure and provided a good balance between feature number and genomic resolution. The cytobands features were then added to a hybrid model of Autoencoder and Layer-wise Relevance Propagation (LRP) to perform a feature extraction, which generated a panel of cytobands with high separation power. In the second phase, the selected cytoband features were denormalized to the protein coding genes from each band. Finally, another iteration of the feature selection process was performed to arrive at a selection of genes that contribute to the distinctiveness of individual sample profiles.

Feature Extraction
Autoencoder An autoencoder is an unsupervised neural network that can derive abstracted representations from data. It usually consists of two parts, an encoder and a decoder, and both are also neural networks on their own. The input is first transformed to an encoding, then restored to its original by the decoder. The aim of an autoencoder is to reconstruct the input as exact as possible, and in the process, it learns a representation (encoding) of the input. It's typically used for dimensionality reduction and to remove noise from signals (Feng et al., 2014;Gondara, 2016;Pei et al., 2020).
An autoencoder possesses several features that are suitable for our research. Its ability to remove noise is well-suited to limit the amount of "noise" from true passenger mutations, which may represent a considerable part of the copy number variations in tumors but do not have a functional impact. In the process of restoration, the encoding is able to capture the traits and distinguishing details, which is ideal for sample characterization in terms of their individual uniqueness. Finally, as a neural network, it can benefit from the extensive collection of heterogeneous data through its generalization capability. In this study, four different autoencoders (basic, denoising, sparse, and contractive autoencoders) were compared on their abilities of input reconstruction. The denoising and contractive autoencoders showed the best performance in the restoration of CNV data, and the denoising autoencoder was chosen for its efficiency in implementation. The details of the comparison are elaborated in Supplementary Implementations.

Layer-Wise Relevance Propagation
Layer-wise Relevance Propagation (LRP) is a technique to scale the importance and contribution of input features in a deep neural network. It utilizes the network weights and activation functions to propagate the output backward until the input. The basic propagation rule is illustrated as the following: Here, j and k are two neurons from two adjacent layers. a is the activation of a neuron, and w is the weight between two neurons. The initial R is known from the output, and this formula is iterated to compute R for each neuron in the previous layer until reaching the input layer. LRP provides a solution to the "black-box" problem of deep neural networks, and is especially useful in tracing the network attentions and understanding the network behavior. In our hybrid model, we exploited LRP's ability to quantify the importance of input features in distinguishing different samples. When an autoencoder can reconstruct a large number of inputs with high accuracy, it provides an encoding that carries the information to recognize the differences between samples. If we apply LRP to this encoding, we would be able to measure the importance of each feature in the encoding.

The Hybrid Model
A primary goal of the research was to investigate the similarity and differences of CNA between cancers, through CNA feature dimension reduction toward an optimal representation of the uniqueness of each sample. In our hybrid model, we combined autoencoder's ability to recognize differences and the LRP's ability to scale feature relevance. First, the transformed feature matrix was used as the input data for the training of a denoising autoencoder. Then, the encoding generated by the autoencoder was used as the input of LRP to compute the weightings of each feature in the initial feature matrix. Next, the weightings of each feature were summed and normalized for all samples. Finally, by applying a threshold, we obtained a panel of high weighting features representing the uniqueness of each input sample.
In the two-phased feature extraction, the hybrid model first returned a panel of cytobands with high weighted CNA coverage; then, the hybrid model was utilized on the new feature matrix (CNA coverage on protein coding genes of the high-weighting cytobands), to calculate a panel of high-weighting genes as a representation of each sample's feature space. Table 2 shows the resulting feature numbers for each step. The details of the procedures are provided in Supplementary Implementations.

A Dataset of Major Cancer Types
To derive the genomic characteristics of different cancer types, it is important to utilize accurate, systematic mappings between samples and disease classifications. Following the standardized protocol set up for the Progenetix and arrayMap resources (Cai et al., 2013(Cai et al., , 2015, all samples were labeled using morphology and topography defined by International Classification of Diseases for Oncology (ICD-O 3) (World, 2013). The complete data collection included 211 unique morphology (ICDOM) and 271 unique topography (ICDOT) terms. While the combination of ICDOM and ICDOT terms in principle can result in detailed disease classifications, the granularity of available sample descriptors varied dramatically between different studies. For example, while some samples were annotated as "epidermoid carcinoma, keratinizing" at "lower lobe, bronchus, " for other pulmonary carcinomas a more general "squamous cell carcinoma" of the "lung" was available in the input data. Also, the number of samples mapped to individual classification terms was extremely imbalanced. For example, while "infiltrating duct carcinoma of breast" was available with more than 5,000 samples, "mucinous adenocarcinoma of lung" only was represented with 14 samples. For the sake of statistical validity and to minimize small-batch effects, we decided to focus on a data subset of 11 major organ systems with abundant samples and reliable copy number mutation content.

Signatures of Cancer Subtypes
After the two-phase feature extraction, we obtained a panel of genes, which drastically reduced the feature space to the signals of 927 genes. For any given sample, a small subset of the genes from this panel reflects its "signature" mutations and is able to represent the uniqueness of the sample against all other samples. When aggregating the samples of a disease type, 2 | The number of features before and after each extraction phase.

Phase
Step Representation Features Phase we should be able to identify a subset of feature genes that are significantly altered in this disease (implementation details in Supplementary Implementations). Thus, we could use this subset of feature genes and their intensities as the CNA signature of the group. However, as mentioned in earlier sections, the samples were not labeled with the same granularity. For the clarity of analyses, samples of similar signatures were grouped together with a consistent label. First, samples were grouped into subtypes by the combination of their topography and morphology. Subtypes with <50 samples were removed. Then, the signature of each subtype was generated and the Pearson's correlation coefficient of signatures was computed for subtypes of the same organ. Next, for each organ, subtypes of the same morphology level were merged into a more general morphology term if their correlation is >0.9. Similarly, a subtype with a more detailed morphology is merged into another subtype of a more general morphology if their correlation is >0.9. Finally, a dataset with refined disease labels was produced, which consisted of 22,671 samples covering 11 topography (organs) and 31 cancer subtypes. Table 3 shows the number of selected samples and subtypes in each organ. The complete table of subtypes and the merging are presented in Supplementary Subtypes.

Classifiers of Subtypes and Organ Involvement
The compact feature panel and the curated labels provided us an opportunity to apply machine learning methods of classification performance. We constructed a multi-label classifier using the signature genes and their normalized signal intensities as features, and the subtype of each sample as labels. The primary challenge of building a multi-label classifier came from the extremely imbalanced number of samples in different classes, because the classifier would be significantly biased by the classes with a high number of samples. To create a robust classifier, we first applied undersampling on classes with a high number of samples and oversampling on classes with a low number of samples, so that the number of samples in each class became relatively balanced. Then, we recruited a random forest to learn and predict the subtypes. In addition to the prediction of subtypes, we were also able to evaluate the classifier's performance on predicting samples' organ of origin by mapping the subtype labels from predictions to the corresponding organ labels. Comparing with the approach that directly used the organ of each sample as labels in training, the subtype mapping method showed an improved performance. The implementation details of the model and comparisons are presented in Supplementary Implementations.

CNA Signatures
By using the data processing and modeling procedures described in the Method section, we generated a panel of feature genes from the collected CNA samples. Then, with the feature genes, we were able to create an abstract representation for each copy number profile, where only alternations that contributed to the distinctiveness of the sample were preserved. Figure 2 compares the original CNA patterns with the derived signature features where the frequent and extensive regional alterations in the original data have been replaced by a small number of feature genes, which visibly compare to subsets of characteristic changes in the original CNA data and represent the most discriminative alternations. With our methodology, ubiquitous alterations such as deletions on the short arm of chromosome 8 (Figure 2) are considered "non-typical" and therefore are not represented in the abstracted signatures. While the complete panel of feature genes provides the feature space for the whole of analyzed samples, in each individual sample only a small subset of these features reflects the sample's own mutations. By aggregating samples from individual cancer subtypes, we were able to deduct the set of subtype related, significantly altered feature genes and could generate the CNA signatures of 31 cancer subtypes. Every signature consists of a subset of the general feature genes and their relative signal intensity comparing to other signatures. Figure 5 illustrates parts of the signature of medulloblastoma, glioma, and melanoma. The complete list of signatures in all subtypes are included in Supplementary Signatures.
In the majority of copy number studies, analyses of tumor samples are focused on identifying the driver genes or the focal regions. However, in this study, the frequent drivers such as TP53 and PTEN, and the common aberrations such as CDKN2A/B and MYC are removed due to their prevalence. The genes in the signatures rather reflect the uniqueness of each sample or each cancer subtype. It is importance to note that the feature genes are not intended as the only nor the optimal representation. The fundamental objective of the study is to explore the potential driver mutations that are infrequent but relevant to specific cancer types. Although the signatures do not imply pathogenic causation, we can instead reveal their potential implications and correlations by investigating the signatures and the feature genes (Figure 6). In general, spatial and annotation analysis suggest that some feature genes reflect structural and functional alternations in samples; the signatures of different subtypes show high preference in several genomic regions that suffer frequent CNAs in many cancer types; and some subtypes, which have distinct disease codes, exhibit high similarity in their signatures.

Focal and Regional Feature Groups
The vast number of CNAs in cancer distribute in a wide spectrum of sizes, ranging from several kilobases to entire chromosomes. Recent research showed they are neither randomly nor uniformly distributed (Beroukhim et al., 2010;Mermel et al., 2011). Regarding their individual extent, CNAs can be categorized into two primary groups: they are either focal, affecting a region with a limited set of potential target genes, or very large and covering an extensive fraction of a chromosome. It has been found that-in general-these two groups of CNAs are caused by different biological mechanisms and can play different roles in tumorigenesis (Pihan et al., 1998;van Gent et al., 2001;Hastings et al., 2009). Focal CNAs usually arise from errors during DNA repairs and can suppress or promote the affected genes. On the other hand, chromosome level CNAs usually arise from errors during mitosis and can lead to dramatic structural aberrations, also involving large numbers of genes. Therefore, it seems important to explore the distinctions of focal and larger CNAs among the signatures' features.
In studies of focal and large CNAs, no strict consensus on exact size the limits exists though general practice is to use an upper limit of 1-3 Mb for "focal" events (Bignell et al., 2010;Krijgsman et al., 2014) while others used a proportional cut-off such as 90% of a chromosome arm for "chromosomal" CNA (Beroukhim et al., 2010;Mermel et al., 2011). Because the signature feature genes are the simplification of the original CNA landscape, all chromosome arm level variations have been reduced to much smaller representations. Instead of studying arm-level CNAs, we merged adjacent feature genes within the distance of 5 MB to construct regional features, which indicates the high frequency and significant discriminative weightings of the covered region and the nearby regions. Feature genes, which are away from others for more than 5 MB on both ends, were considered focal features. After merging, the 927 signature genes were separated into 15 focal features and 43 regional feature groups. Figures 3A,B show the distributions of the different feature groups on the entire genome by their types and sizes. As expected, there are more regional feature groups than focal features in the merged feature space in both amplifications and deletions. This is because high-frequency focal CNAs were removed in the process of feature selection for their weak discriminating abilities. Among the regional groups, there are 50% more small regions (<5 MB) than large regions, especially in the deletions. It suggests that the small regions may represent the difference of the original arm level alternations between different cancer subtypes. Although at a lower frequency, the size of large regional groups can extend up to 200 MB, which strongly suggests that they represent distinctive arm-level events at these loci. Spatially, the large regions often occur exclusively and do not overlap with the same or the other type of CNVs. As illustrated in Figure 3C, the large regional groups are mostly of low amplitude (the average of normalized CNV log-ratios across all samples); and no significant correlations with the amplitude are observed between different groups or types. The analysis of regional and focal features' patterns across a large number of samples and cancer types confirmed and agreed with observations in several previous studies (Baudis, 2007;Beroukhim et al., 2010;Zack et al., 2013;Mishra and Whetstine, 2016). The size distribution of merged feature groups also provided important implications to the underlying representation of different features.

Functional Implications of the Feature Genes
A cardinal concept in this study was to describe cancer subtypes using CNAs based on their uniqueness and discriminating ability, instead of their frequencies. The selected feature genes usually exhibited different patterns in either their presents or variation amplitude between different subtypes. As a result, the final features mainly included non-focal and less prevalent genes that do not play known central roles in tumorigenesis. However, it is still interesting to investigate the feature genes by their annotations and discuss their potential functional meanings.
The HGNC (HUGO Gene Nomenclature Committee) gene families are defined based on evolutional and functional homologies (Daugherty et al., 2012). Table 4 lists the significantly over and under expressed HGNC families, where the family has <0.05 p-value in Binomial test and includes at least five feature genes: (1) The immunoglobulin heavy chains related genes showed the highest significance. Although the mutations of these genes are known to play causal roles in B-cell related malignancies (Nishida et al., 1997;Othman et al., 2016), surprisingly, accumulating evidence has also shown their frequent overexpression in epithelial cancers (Babbage et al., 2006;Hu et al., 2008). Though their functions and mechanisms in oncogenesis are still not well understood, several studies have demonstrated their tumor-promoting impacts and observed their prognostic implications in breast cancer and ovarian cancer (Lee and Ge, 2009;Hu et al., 2011;Larsson et al., 2020).
(2) Olfactory receptors are one of the largest gene families in the human genome. Similar to immunoglobulin genes, their overexpression has also been observed in various cancers and is often neglected in genomic cancer studies because of their specific role (Weber et al., 2018). However, recent studies have begun to reveal their functional roles and prognostic potentials in prostate cancer and breast cancer (Ranzani et al., 2017;Li M. et al., 2019;Masjedi et al., 2019). (3) The tripartite-motif (TRIM) family proteins are involved in a variety of biological processes including transcriptional regulation, cell growth and apoptosis, and their dysregulation has been extensively linked to cancer risk and prognosis (Hatakeyama, 2011;Jaworska et al., 2020;Mandell et al., 2020). (4) The HOXL subclass homeoboxes are overexpressed in amplifications and underexpressed in deletions. The family consists mainly of HOX genes, which are master regulatory transcription factors in embryogenesis, cellular development, and issue homeostasis . Recent researches have shown that depending on the type of the tumor, the expression of HOX genes may be increased or decreased and play roles in oncogenesis or tumor suppression (Shah and Sukumar, 2010;Bhatlekar et al., 2014;Brotto et al., 2020). (5) Keratin-associated proteins have no confirmed role in cancer. (6) The combined actions of glycosidases and glycosyltransferases constitute the primary catalytic machinery for the synthesis and breakage of glycosidic bonds. The aberrant glycosylation patterns are often considered a hallmark of cancer, which promotes tumor proliferation, invasion, and metastasis (Bernacki et al., 1985;Sandrine and Juillerat-Jeanneret, 2006;Andergassen et al., 2015;Wu et al., 2019). (7) Intermediate filaments proteins are often used as diagnostic markers in cancer, because changes in their expression patterns are often associated with tumor progression and cancerous cells usually partially retain their original structural signatures (Strouhalova et al., 2020). Several recent studies have reported their active roles in tumor invasion and metastasis (Holle et al., 2017;Sharma et al., 2019). (8) Zinc fingers C2H2-type family is the largest group of transcription FIGURE 3 | (A) The distribution of focal and regional feature groups on the genome. The groups were generated from the 927 feature genes, where genes within 5 million-bases were merged into a group. The position of the mark was determined using the center of the group or gene. The size of the mark was determined by the length of the group or gene, in the unit of million-bases and a minimum size of 1 MB. (B) The length distribution of feature groups. (C) The amplitude of feature groups in 31 studied cancer subtypes. The values were computed using the average of the normalized CNV log ratios of all samples in each subtype. On the x-axis, F denotes focal, R denotes regional, the number denotes chromosome name.
factors in humans. A number of studies have revealed that aberrant expression of C2H2 proteins promotes tumorigenesis with various roles in several cancer types (Jen and Wang, 2016;Munro et al., 2018).
The functional background of the significantly expressed HCNC families reveals several exciting aspects of the feature genes. First, the features consist of biomarker genes, which are frequently and abnormally expressed in many cancer types. Although their exact mechanisms in tumorigenesis are not fully understood, many studies have confirmed their functional roles and prognostic correlations in cancer. Second, the features describe functional aberrations by including several families that are related to transcriptional regulations. Previous studies have shown the high correlation of CNAs and differential gene expression in cancers (Shao et al., 2019;Bhattacharya et al., 2020), and partially elucidated their involvement in tumor development and progression. Finally, the features include hallmark signatures in invasion and metastasis, which are crucial indications of latestage cancer progression. They are more likely to represent the consequence of other mutational activities rather than the direct impact of CNAs.
In summary, the analysis of feature genes agreed with a variety of previous research in their cancer-promoting or related roles and suggested the potential rationales behind the features. It also confirmed that malignant cell transformation is often an orchestrated process of many interplaying parts. While important driver mutations can signify the milestones in tumor development, the minor mutations during the development can also provide valuable characterization to specifics between different cancer types.

Discriminative Genomic Loci
Since feature genes represented the uniqueness of individual samples, it is expected that they would distribute exclusively among different subtypes in general. Interestingly, while most of the features are sparsely distributed, there are also a few "hot zones, " which are frequent CNV regions in general, that are commonly included in many signatures with high significance.  chr7:91,692,008-93,361,123, chr7:105,014,190-117,715,971, chr7:129,611,720-130,734,176, and chr7:148,590,766-152,855,378 on the q arm of chromosome 7; chr8: 47,260,878-50,796,692, chr8:51,319,577-54,871,720 and chr8:112,222,928-138,497,261 on the q arm of chromosome 8. As shown in Figure 2, chromosome 7 and 8 contain the most frequently duplicated regions among all studied cancer types. Despite their common existence, analysis of individual signatures and samples reveals three patterns that may explain their prominence in features: first, on the two chromosomes, samples often have large duplications.
The aggregated signals in individual subtypes usually show high alternation frequency of the whole region and pose a plausible hint of frequent chromosome or arm level events. Second, the spans of duplications differ among subtypes. For example, the signature of melanoma consists of feature genes covering the entire chromosome 7 and 8; the signature of lung adenocarcinoma consists of feature genes only on the p arm of chromosome 7, and the signature of ovary carcinoma consists of feature genes only on the q arm of chromosome 8. Third, the alternation amplitude of the feature genes also shows distinctive patterns among subtypes. The subtle difference in the scale and amplitude of CNV on chromosome 7 and 8 suggest that aberrations of these hot regions may play differentiating roles in different cancers. Similar to duplications, the most common deletion features distribute on three regions on chromosome 18: chr18:2,916,994-7,117,797 on the p arm; chr18: 58,481,247-60,372,775 and chr18:69,400,888-70,330,199 on the q arm. Different from duplications on chromosome 7 and 8, the deletions on chromosome 18 show a similar pattern of high amplitude deletions on the entire chromosome 18, where feature genes on the p arm suggest a one-copy deletion on average, and feature genes on the q arm indicate frequent homozygous deletions. Interestingly, the deletion features on chromosome 18 are usually mutually exclusive to the deletion features on chromosome 10, which are under high pressure of chromosome level deletions in several subtypes. Previous studies have observed strong correlations of deletions on chromosome 18 and chromosome 10 to the progression of several cancer types. Other analyses have investigated their functional impacts and suggested the presence of several tumor suppressor genes (Bax et al., 2010;Xie et al., 2012;Kwong and Chin, 2014;Shao et al., 2019). Another hot zone of deletion features locates on the q arm of chromosome 22 (chr22:48,489,460-48,850,912). High-level deletions are frequently observed and selected as features in many subtypes. Different studies have shown similar observations of frequent deletions on chromosome 22, however, the mechanism of the deletion is still obscure and the functional impact remains putative (Castells et al., 2000;Morikawa et al., 2018).
Besides high prevalence, the feature "hot zones" also exhibit a higher copy number alternation level than average features. As shown in Figure 4, the average copy number alternation levels are far greater in these regions than in aggregated features: 56% higher in amplifications, and 70% higher in deletions. Furthermore, several hot zones and the associated cancer types, including the frequent amplifications near the centromere of chromosome 7, on the q arm of chromosome 8, and deletions closed to the telomere of chromosome 18, are in coherence with the general CNA patterns identified in previous pan-cancer studies (Beroukhim et al., 2010;Zack et al., 2013;McNulty et al., 2019), which suggests these regions are related to high CNA frequency. More specifically, some hot zones show correlations with nearby driver genes, which are frequently observed with high CNAs in cancers having high alternations in these hot zones. First, the high alternations level on chromosome 7 hot zones are in accordance with the overexpression of nearby EGFR, MET, and BRAF genes in glioma, lung cancers, and melanoma. EGFR is a well-studied oncogene in lung cancers and gliomas, MET is also a frequent driver in lung cancers, and BRAF is a signature driver gene in melanoma. A variety of studies have reported the CNA amplifications at or near the hot zones, and investigated their driver roles in these cancers (Taoudi Benchekroun et al., 2010;Janjigian et al., 2011;Trombetta et al., 2011;Kang, 2013;McNulty et al., 2019). Next, the frequent but relatively low-amplitude hot zone on the q arm of chromosome 8 is correlated with MYC, which is frequently amplified in numerous cancers. The extensively amplified hot zones on the p of chromosome 8 may be associated with FGFR1 and IKBKB, which are often deregulated by amplification in lung, colon, and bladder cancers. Studies of copy number patterns in these cancers have shown the significant correlations of increased gene expressions and amplification at the hot zone regions (El Gammal et al., 2010;Moelans et al., 2010;Weiss et al., 2010). Finally, the supporting evidence on deletion hot zones are mostly prognostic instead of functional confirmed, except on chromosome 10, where two frequent and high-alternation hot zones are closed to PTEN. Although it is one of the most common inactivated tumor suppressor genes in cancers, the signal amplitude of these zones is in high consistency with cancer types (gliomas, lung cancers, melanoma, and breast), where the loss of PTEN is especially frequent and crucial (Milella et al., 2015;Fusco et al., 2020).

Similarities of Neural Crest Originated Subtypes
The signature based clustering of cancer subtypes (Figure 5) shows that diseases in the same cluster usually also share close ontologies. However, there is an interesting outlier, which consists of medulloblastoma, melanoma, and glioma (astrocytoma is a subtype of glioma). The three clinically and topographically distant cancer types have signatures of high similarity in both the selection of features and their alternation frequencies. Figure 6 shows the comparison of the original CNV data, the features and the known drivers of the three cancers on chromosomes harboring similar signatures. Their signatures exhibit high similarities in the duplication of chromosome 7 and the deletion of chromosome 10. They also share pairwise similarities on the duplication of chromosome 1 and 20, and the deletion of 9 and 14. For the three cancers, the most frequently amplified chromosome 7 harbors several of the key oncogenes. For example, EGFR, CDK6, and MET in glioma; KMT2C and PMS2 in medulloblastoma; BRAF, RAC1 and TRRAP in melanoma. The most frequently deleted chromosome 9 and 10 harbor several important suppressor genes. For example, CDKN2A and PTEN in glioma; XPA, PPP6c, and CDKNA in melanoma; PTCH1 and SUFU in medulloblastoma. Noticeably, the CDKN2A/B deletion is the most frequent copy number aberrations across all cancer types. Although the distribution of driver genes shows a close correlation with the frequency and amplitude of CNV, most drivers do not demonstrate signal peaks nor overlap with the feature genes. Frequent duplications on chromosome 1 and deletions on chromosome 14 do not show direct correlations to common driver genes; however, a number of studies have shown the connection of CNAs on these regions to the progress and prognosis in these cancer types (Cross et al., 2005;Parsons et al., 2011;Mathieu et al., 2012;Boots-Sprenger et al., 2013;Cohen et al., 2015;Park et al., 2019).
In the 1990s, epidemic studies (Azizi et al., 1995;Desai and Grossman, 2008;Scarbrough et al., 2014) first revealed the connection between melanoma and tumors of the nervous system: not only a familial association, which was confirmed by germline mutations; but also a significantly increased risk of one disease in people having a history of the other one. Although there was evidence showing their potential common pathophysiologic pathways and responsiveness to the same drugs, the genetic connection of the two disease groups was still largely unclear (Middleton et al., 2000). In spite of their ontological difference, medulloblastoma, melanoma, and glioma are all derived from the lineages of neural crest cells. Recent studies of neural crest cells and the cancers from their lineage cells suggest that malignant cells mimic many of the behavioral, molecular, and morphologic aspects of neural crest development Powell et al., 2014;Maguire et al., 2015). Aberrations in tumor cells may lead to the reactivation of their embryonic developmental programs and promote tumorigenesis and metastasis. For example, the WNT family members, which play import roles during the epithelial-to-mesenchymal transition of neural crest cell, are reactivated during invasive transformation in melanoma (Kovacs et al., 2016;Sinnberg et al., 2018). In glioblastoma, experimental data suggests that the dysregulation of the WNT signaling pathway supports the onset of cancer stem cells, which assure the enlargement of the tumoral mass and eventually the spread of metastases (Zuccarini et al., 2018). In medulloblastoma, the WNT subgroup signifies one of the four molecular subtypes of the disease (Doussouki et al., 2019). In regions represented by the similar signatures, several WNT genes exhibit abnormal amplification frequencies, which are a potential reflection of the overexpression of WNT signaling. Specifically, WNT2B and WNT4 are covered by amplifications of moderate frequencies, and WNT2, WNT3A, WNT9A, and WNT16 are encompassed by amplifications of high frequencies. WNT2 and WNT16 are signature genes in all three subtypes implicating their prevalence among individual samples.
Another interesting result from the similar signatures is the connection between melanoma and medulloblastoma, which has dramatic differences in many biological aspects. For example, medulloblastoma is considered primarily originated from embryonal cells in early development. It mostly occurs in children and usually has good prognosis outcomes. On the other hand, melanoma is primarily caused by ultraviolet light exposure. Its risk has a positive correlation with age and the prognosis is usually very poor once passing the early stage of the disease. However, besides their differences, all three cancer types are notorious for their fast progression, high invasiveness and wide metastasis. It's possible that their common CNA signatures are the imprint of their acquired cancer hallmarks instead of their mutational characteristics. The combined evidence suggests that different types of tumor may achieve their hallmark abilities through different evolution paths (Gallik et al., 2017). The acquired hallmarks exhibit a number of common genotypes such as copy number aberrations, which are potentially the

Classification of Subtypes and Organs
In the previous sections, we explored the implications of the feature genes and cancer signatures. It is also interesting to assess the robustness and usefulness of the feature genes in sample classifications, especially when most of the primary driver genes are excluded from the features. The identification of the origin of a tumor sample is a challenging task that attracted significant attention from both academia and industry. A great number of computational methods have been proposed in recent years to facilitate the early diagnosis of cancers from liquid biopsies (Heitzer et al., 2013;Volik et al., 2016;Phallen et al., 2017;Wood-Bouwens et al., 2017;Chin et al., 2019). Among the various explored genomic aberrations, several studies have illustrated the predictive effectiveness of copy number aberrations (Leary et al., 2012;Dawson et al., 2013;Heitzer et al., 2013). In the meantime, machine learning researchers have also proposed a number of classification methods to predict cancer types using copy number data (Zhang et al., 2016;Karim et al., 2020;Liang et al., 2020). However, these methods usually utilized a uniform dataset such as TCGA to investigate the feasibility of the method and the performance of the model. They often lack the generalization experiments to apply the method on data that is not from the training data pipeline. Also, the goal of machine learning researches usually focused on constructing a classifier of high accuracy, whereas in this study, we focused on exploring the unique and typical mutations in different cancer subtypes. Therefore, we would like to use the classification results of copy number signatures to demonstrate their potential in the predictions of cancer subtypes and organs of origin from liquid biopsies, and performed no direct comparisons with other predictive studies.
In the prediction of subtypes, the model showed a wide range of performance spectrum on different subtypes. Figure 7A shows the confusion matrix of individual subtypes, where the x-axis represents the true labels and the y-axis represents the predicted labels. Specifically, the classifier achieved good performance in predicting: breast intraductal carcinoma, colon adenocarcinoma, brain glioma, cerebellum medulloblastoma, cerebellum medulloblastoma, and kidney clear cell adenocarcinoma, where the F1-score of individual subtypes ranged from 0.72 to 0.67. On the other hand, the prediction performance was deficient in several subtypes, such as brain astrocytoma, stomach carcinoma diffuse type, and colon mucinous adenocarcinoma. The complete metrics of each subtype are included in Supplementary Performance. Investigation on false negatives and false positives showed that most of the false predictions fell into the subtypes that were from the same organ and had significantly more samples. In general, the performance of individual subtypes was positively correlated with the number of samples ( Figure 7C). The difficulty in predicting some subtypes suggests that some clinicopathologically closed cancer subtypes share similar copy number mutation patterns. While the copy number signatures showed  promising predictive potential for some cancer subtypes, data from complementary genomic aberrations are needed for a comprehensive model to classify similar subtypes.
In the prediction of organs, the model showed a more consistent performance than in subtypes. Figure 7B shows the confusion matrix of individual organs. Specifically, the classifier achieved good results in predicting: brain, lung, colon, cerebellum, ovary, and kidney, where the F1-sore of individual organs ranged from 0.75 to 0.67. Although some organs showed relatively low prediction accuracy, they all had dramatic improvements compared with the prediction results of the divided subtypes. The complete metrics of each subtype are included in Supplementary Performance. In general, the performance was also positively correlated with the total number of samples in each organ (Figure 7D). In contrast, the prediction results of organs showed a significant overall improvement over the subtypes. For example, none of the lung subtypes had a good performance in the subtype classification. However, when combined, the lung achieved the second-best performance (0.72 F1-score) in organ prediction. A similar improvement was also observed for the stomach while the prediction performance of breast had a drastic decline compared with divided subtypes; the latter is further discussed in the subsequent section.
In summary, our classification experiments indicated that the copy number signatures have good potential in predicting several cancer subtypes and organs of origin. However, not all investigated subtypes and organs achieved satisfying performance with solely CNA features, which confirmed the need for ensemble strategies to identify unknown tumor cells' type and origin. Moreover, the improved performance in organ classifications suggested the correlations of CNA and organ's functional influence in tumorigenesis. In future research, it would be interesting to further investigate the difference between organspecific CNAs and subtype-specific CNAs.

Heterogeneity Within Subtypes
The results of the classifier showed significant differences in performance among different subtypes and organs. While some of the under-performance may be attributed to the insufficient number of samples, the false predictions of several classes suggested the heterogeneity within these subtypes. As shown in Figure 7A, for breast infiltrating duct carcinoma and cerebellum medulloblastoma, the false predictions were widely distributed over several unrelated classes. Figures 7C,D highlighted breast infiltrating duct carcinoma, lung non-small cell carcinoma, and lung adenocarcinoma as outliers of the correlation, where the high number of samples did not improve but deteriorate their performance.
Large-scale cancer genome studies have shown that tumors from the cerebellum, lung, and breast usually carry high burdens of copy number mutations (Beroukhim et al., 2010;Zack et al., 2013). With the extensive number of CNA in these subtypes, it is expected to be more challenging to identify key signals. Genomic studies in the past two decades have revealed mutation characteristics in these diseases, and established classifications of their molecular subtypes. For example, medulloblastomas are now commonly categorized into four subgroups: WNT, SHH, G3, and G4. Similarly, breast carcinomas can be categorized into four subgroups: Luminal A, Luminal B, Her2-enriched, and basal-like. Studies of their mutational patterns showed distinct markers and different mutation landscapes of each subtype, and multi-omics studies further revealed the intrinsic complexity within subtypes and suggested classifications of more subgroups (Banerji et al., 2012;Curtis et al., 2012;Tyanova et al., 2016). However, the majority of the samples in our dataset were labeled by their morphology, with the molecular subtype classification missing in the sample annotations. As a result, large diagnostic groups such as "breast infiltrating duct carcinoma" could not be further separated a priory, resulting in a common label for breast carcinomas with the diagnostic subtype consisting of 5,657 samples, over three times more than the secondlargest subtype. While we addressed these skewed overall subtype sizes through the application of undersampling techniques, the implicit mixture of different molecular subgroups provided an inevitable impact on the association of diagnostic labels and CNA based classifiers.

DISCUSSION
For this study, we assembled a large collection of cancer CNA profiles with the aim to identify genomic aberration signatures with specificity for individual cancer types. In order to integrate technically diverse data, we developed innovative tools for genome data conversion and signal normalization. Targeting the identification of unique components in diagnosis mapped CNA profiles, we were able to derive the CNA signatures of 31 cancer subtypes, where each signature was characterized by a minimal representation of genes with high discrimination capacity. The signatures were further evaluated to derive classifiers identifying tumor types and organs of origin. The comparative analyses of signature genes and their represented regions showed that duplications on chromosome 7 and 8, and deletions on chromosome 22 were the most common aberrations among studied cancer types. However, these regions also harbored features of high differentiating power, which may indicate the functional significance of cancer related genes with specificity for cancer-type related pathway involvement.
While our feature genes exhibited certain correlations with known driver genes, this itself does not provide direct evidence of their functional significance. Feature genes could also be the result of the downstream effects of mutational activities or be co-opted as a representation of an aberrant region where the pathogenetic activity is provided by different genomic elements (Szalai and Saez-Rodriguez, 2020). In the analysis, we were prudent to limit propositions of the impact of individual feature genes.
Our analyses showed that three clinico-pathologically distant cancer types-medulloblastoma, melanoma, and glioma-shared CNA signatures of high similarity. Developmentally, the three tumor types can be traced back to common lineages of neural crest cells. Research of neural crest cells and epidemiologic studies of glioma and melanoma have shown sporadic evidence of their connections in cancer development, but the genetic foundation of their association with respect to oncogenetic processes is still elusive. Here, comparative analysis of shared mutations together with improvements of developmental processes in the corresponding normal tissues may provide insights into shared pathologies and potential therapeutic targets.
The studies on genomic classifiers demonstrated the capacity and potential of CNA signatures in tumor identification. However, while the multi-label classification approach showed promising performance in many cancers, in a few subtypes the performance was deteriorated by the intrinsic heterogeneity of their CNA profiles. Here we expect that future follow-up studies with extended input data and employing different ontology-based grouping strategies may lead to improved performance and the potential emergence of better group aggregations.
An essential challenge in the study was the integration of data from varying genomic profiling platforms. During normalization and quality control, we removed samples that were ambiguous in interpretation or had an abnormal distribution of signals. However, as a trade-off, many genomically heterogeneous samples (e.g., aneuploid or containing CTLP) were removed in the process. For future studies, complementary strategies should be developed to include these samples without detrimental effects on data normalization and quality. With respect to sample annotation for histologic and diagnostic classifications, extensive efforts went into the curation of the data with the aim to use a unified set of classifications. Here, future analyses may make use of emerging cancer specific ontologies for a better integration of varying annotation granularities using hierarchical terms and concepts. Currently, our research group is finishing a parallel study on ontology mappings of cancer samples, and we are expecting a considerable improvement in the disease annotation of samples in future studies.
In summary, this study presents a systematic pipeline for integrative and comparative analyses of a large amount of copy number data. The resulting CNA signatures offer new perspectives on the understanding of common foundations in cancers and show promising potential in applications of tumor classification.

AUTHOR CONTRIBUTIONS
BG developed the method and implemented the analysis. MB conceived the original concept and supervised the project. All authors contributed to the writing and revisions of the manuscript.

FUNDING
BG was a recipient of a grant from the China Scholarship Council. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

ACKNOWLEDGMENTS
We thank Paula Carrio Cordo and Qingyao Huang for support with the data ontologies, and all current and previous members of the Baudis