New Recurrent Structural Aberrations in the Genome of Chronic Lymphocytic Leukemia Based on Exome-Sequencing Data

Chronic lymphocytic leukemia (CLL) is the most frequent lymphoproliferative syndrome in Western countries, and it is characterized by recurrent large genomic rearrangements. During the last decades, array techniques have expanded our knowledge about CLL’s karyotypic aberrations. The advent of large sequencing databases expanded our knowledge cancer genomics to an unprecedented resolution and enabled the detection of small-scale structural aberrations in the cancer genome. In this study, we have performed exome-sequencing-based copy number aberration (CNA) and loss of heterozygosity (LOH) analysis in order to detect new recurrent structural aberrations. We describe 54 recurrent focal CNAs enriched in cancer-related pathways, and their association with gene expression and clinical evolution. Furthermore, we discovered recurrent large copy number neutral LOH events affecting key driver genes, and we recapitulate most of the large CNAs that characterize the CLL genome. These results provide “proof-of-concept” evidence supporting the existence of new genes involved in the pathogenesis of CLL.


INTRODUCTION
Chronic lymphocytic leukemia (CLL) is the most frequent lymphoproliferative disease in Western populations, and it is characterized by its clinical and genetic heterogeneity. Döhner et al. (2000) described the widely used cytogenetic classification of CLL based on the most prevalent chromosomal aberrations in the CLL genome (Döhner et al., 2000), that is, trisomy 12 and deletions in 13q14. 2-14.3, 11q22.3, and 17p13.1. Since that moment, new genome-wide studies have revealed new recurrent genomic aberrations, such as trisomy 19, amplifications at 2p and 8q, and deletions at 8p, 6q21, 18p, and 20p (Pfeifer et al., 2007;Landau et al., 2015). Similarly, a wealth of genomic and epigenomic modulators of CLL's clinical aggressivity have been discovered (Nadeu et al., 2018), such as point mutations in NOTCH1, SF3B1, ATM, TP53, and POT1 and the absence of somatic hypermutation at the IGHV locus. It has been observed that copy number aberrations (CNAs) in CLL genomes tend to be acquired early in disease evolution and usually remain stable, whereas the mutational heterogeneity can increase (Nadeu et al., 2018). Indeed, mounting evidence indicates that the accumulation of these cytogenomic events modulates CLL proliferation and clinical aggressivity to a great extent (Raponi et al., 2018;Gruber et al., 2019), acting as drivers of genomic complexity and clonal evolution (Edelmann et al., 2017;Yu et al., 2017;Hernández-Sánchez et al., 2019) and accumulating in relapsed cases (Ljungström et al., 2016;Leeksma et al., 2019).
CNAs and copy number neutral loss of heterozygosity (CNN-LOH) are oncogenic mechanisms that induce gene-dosage effects, disrupt coding sequences, cause structural rearrangements, or potentiate epigenetic effects. Oncogenes are frequently affected by copy number gains, while tumor suppressor genes tend to be deleted. Massive array-based techniques such as array comparative genomic hybridization and single-nucleotide polymorphism (SNP) arrays have enabled the analysis of structural aberrations on cancer genomes to an unprecedented resolution of 10-100 kb. With the development of massive sequencing technologies, large databases of cancer sequence data have been published. This motivated the development of a variety of CNA detection algorithms from exome-sequencing data, which have the additional benefit of detecting smaller CNA events at the expense of increased false discoveries and reduced sensitivity and specificity (Nam et al., 2016). These methods are specifically designed to face particular issues, particularly those inherent to the sequencing protocol (such as biases induced by hybridization, GC content, and read mappability), due to cancer biology (ploidy estimation and subclonality) (Zare et al., 2017) and due to the presence or absence of matched controls .
In this analysis, we used previously published exome-seq data in order to detect small recurrent structural events involved in the pathogenesis of CLL. Our results not only reproduce the known cytogenetic aberrations in CLL but also support the existence of multiple recurrent focal CNAs and CNN-LOH affecting key oncogenic pathways, some of which are clearly associated with higher proliferative capacity, shorter survival, and altered gene expression. We conclude that focal CNAs may be more relevant than previously expected in the pathogenesis of CLL, and they merit further consideration for prognostic stratification.

Data Source
The International Cancer Genome Consortium (ICGC) Data Access Committee granted us access to the CLL sequencing data (Ramsay et al., 2013)

Data Preprocessing
Samples were processed by Puente et al. (2015) as described in their original paper (Puente et al., 2015). Briefly, 3 μg of genomic DNA was used for paired-end sequencing library construction, followed by enrichment in exomic sequences using the SureSelect Human All Exon 50Mb kit (Agilent Technologies). Next, DNA was pulled down using magnetic beads with streptavidin, followed by 18 cycles of amplification. Sequencing was performed on an Illumina GAIIx or on a HiSeq2000 sequencer (2 × 76 bp). Exomeseq data were aligned to the reference genome (GRCh37.75) using bwa (Li and Durbin, 2009). Duplicate read removal, sorting, and indexing were done using samtools (Li et al., 2009). Base quality score recalibration was made with BamUtil (Breese and Liu, 2013) using a logistic regression model.

CNA and CNN-LOH Detection
We analyzed paired tumor-normal exome-sequencing data with Control-FREEC version 11.3 in order to identify somatic CNA and CNN-LOH regions (Boeva et al., 2012). Control-FREEC uses aligned reads to construct and normalize a copy number profile and a B-allele frequency (BAF) profile. Then, it performs profile segmentation and infers genotype status for each segment using both copy number and allelic frequency information. Finally, genomic aberrations are identified and annotated.

LOH and CNA Selection and Filtering
CNN-LOH were called with p-values < 0.05 (Kolmogorov-Smirnov test) and an uncertainty upper threshold of 5%. Regions with lowmappability according to UCSC 75-bp mappability tracks (score below 0.5) were filtered out. As expected, we observed regions that seemed prone to CNA erroneous detection. Thus, we decided to apply a hard filter and discard those CNAs significantly enriched in both amplifications and deletions, as well as those located near a telomeric or centromeric region. CNA events were detected using GISTIC2.0 (Beroukhim et al., 2007). GITIC2.0 was run with default parameters plus the arm peel-off filter. Focal recurrent CNAs were defined as those spanning less than 50% of a chromosome arm with a residual q-value < 0.05 and a wide peak size below 10 megabases. A 1 − log2 tumor/ normal ratio above 0.3 was used to define amplifications, and a ratio below −0.3 was used to define deletions.

Survival Analysis
Variables associated with time to treatment (TTT) and overall survival (OS) were analyzed using Cox regression as implemented in the survival R package (Therneau and Grambsch, 2000;R Development Core Team, 2011;Therneau, 2015). Assessment of the proportional hazards assumption was performed using the cox.zph function. We created two different models: a univariate model that only includes CNA status for each gene and an adjusted model that included variables associated with clinical outcome at a p-value < 0.2 in a univariate model. The combined model (CM) for TTT analysis included as covariates donor sex, stage at diagnosis (monoclonal B-cell lymphocytosis (MBL), Binet A, Binet B, and Binet C), and IGHV mutational status, while the CM for OS analysis included stage at diagnosis, IGHV mutation status, and age at diagnosis. The Benjamini-Hochberg (BH) method was used to adjust for multiple testing.

RNAseq Analysis and Correlation With CNA Status
Two hundred twenty patients had matched RNAseq data of CLL-purified cells (accession IDs EGAD00001001443 and EGAD00001000258). Illumina adapters were removed using cutadapt (Martin, 2011), and alignment to the human reference genome (GRCh37) was performed using Hisat2  with default specifications. We used the Hisat2provided Hierarchical Graph FM index for GRCh37 with SNP and Ensembl transcript information. Bam files were sorted and indexed using samtools (Li et al., 2009). Bam files were processed in R (R Development Core Team, 2011) according to the RNAseq gene expression protocol developed by Love et al. (2015). Briefly, bam files were read using Rsamtools (Morgan et al., 2017), followed by gene-level expression estimation using the SummarizeOverlaps function from the GenomicAlignments package (Lawrence et al., 2013). Gene models in GTF format were downloaded from Ensembl (GRCh37.75 version) (Yates et al., 2016). A log2-transformation on normalized frames per kilobase counts was performed. Focal CNAs were classified according to Gistic into low-range events (tumor/normal log2 ratio > 0.3 and <0.9 for amplifications and less than −0.9 and more than −0.3 for deletions) and higher-range events (tumor/ normal log2 ratio > 0.9 for amplifications and less than −0.9 for deletions). Correlation between CNA status and gene expression was performed using Spearman's correlation. A minimum of 5 CNA events with matched transcriptomic data was set for analysis. Furthermore, immunoglobulin and T-cell receptor gene rearrangements were not included. p-values were adjusted for multiple testing using the BH method.
Three focal deletions were associated with TTT (BH q-value < 0.05): 11q22.3 (ATM locus), 14q32.33 (IGH locus), and 7q21.2 (CDK6 locus). 11q22.3 loss was associated with shorter survival too. No event was associated with TTT or OS after adjusting for IGHV status, sex, and disease stage (BH q-value < 0.05). No recurrent gain was associated with treatment-free survival or OS. The association results can be consulted in Supplementary Table 1.
Furthermore, SETD2 deletions (nine cases) and IRF4 gains (five cases) were nearly significant (GISTIC residual q-values of 0.08 and 0.06, respectively) and associated with clinical evolution. SETD2 deletion was associated with shorter treatment free survival (p-value 1.3 × 10 −8 ) independent of IGHV status, sex, and disease stage (p-value 2.3 × 10 −3 ); and it was also associated with shorter survival (p-value 9.9 × 10 −3 ) but not independently of IGHV status (p-value 0.24). Amplifications in IRF4 were associated with shorter survival (p-value 6.4 × 10 −5 ) in a partially IGHV-independent manner (p-value 0.025). Nevertheless, this finding must be interpreted with caution due to the position of IRF4 near the telomeric end of the short arm of chromosome 6.

Correlation of Focal CNAs With Gene Expression
We detected significant correlations between some recurrent CNAs and their correspondingly encoded genes (Supplementary Table 2). As expected, deletions in 11q22.3 and 13q14.2 and the expression of their respective genes (ATM, FDX1, and RDX in the first case, and DLEU1, DLEU2, and KCNRG in the second case). Deletions in 6q21 were correlated with lower expression of CDK19, and so did those in 14q21.1 with the expression of MIA2. Furthermore, we detected significant correlations between amplifications in 3q25.1 and 3q29 and expression of their target genes PFN2 and ACAP2, respectively. Surprisingly, an inverse correlation was observed between 11q14.3 deletions and FOLH1B expression.
Nonetheless, this study is probably underpowered to detect CNA-transcript correlations due to the fact that only 50% of the exome-seq samples had matched RNAseq data of purified CLL cells.
Amplifications in chromosomes 12, 2p, and 8q, as well as deletion of 17p, were significantly associated with shorter TTT (Supplementary Tables 4 and 5). Furthermore, deletion of 17p was significantly associated with TTT independently of IGHV status, sex, and Binet staging. Amplifications in chromosomes 12 and 8q were associated with shorter OS, but no event was significant after adjusting for IGHV status, age, and Binet staging. Nonetheless, we detected a significant difference in OS and TTT between IGHV-mutated cases with and with and without trisomy 12 (Figures 3A, B respectively).

Regions of CNN-LOH
Control-FREEC identified 63 regions of CNN-LOH with a genotype uncertainty below 5% (Supplementary Table 6). Ten events were located at 1p, six of which affected ARID1A. By comparing with mutation data published by Puente et al. (2015), none of these samples bore concurrent non-synonymous mutations in ARID1A. Other three events were detected at 1q, with a minimally affected region on 1q21.3, which holds likely driver genes such as PI4KB and IL6R/CD126. Four CNN-LOH events with a minimally involved region in 9q34.13-q34.3 involved the NOTCH1 gene. One of these cases also had a frameshift deletion in NOTCH1. Three events at 11p15.5-15.4 affected the imprinted locus of IGF2 and CDKN1C. Eight CNN-LOH events affected the 11p11.2-q13.2 region. Two different samples had events at 11q involving the ATM locus, both of which also had non-synonymous ATM mutations. Three events were located in 16p13.3-p13.11, which involve the CREBBP gene. None of these had mutations in CREBBP. Ten CNN-LOH events affected chromosome 17, three of which included the TP53 locus. Among the latter, two had non-synonymous mutations in TP53. Four samples had CNN-LOH events at chromosome 20, all of which affect the ASXL1 gene but without any concurrent mutation in it. It is interesting to mention that only two of the CNN-LOH events overlapped with those reported by Puente et al. (2015) using SNP array technology.

DISCUSSION
The detection of new cytogenetic aberrations using targeted sequencing takes advantage of the increased sensitivity of this technique in order to detect small events that would be otherwise difficult to identify using array-based techniques. CLL is characterized by large-scale cytogenetic alterations (Döhner et al., 2000), but focal rearrangements have been studied to a lower extent. Using sequencing data originally produced by Puente et al. (2015), here, we report the existence of 54 putatively recurrent focal CNAs in the CLL genome. Recurrent focal amplifications were shorter than deletions, mostly involving one or two genes. The most significantly enriched focal gains affected the loci of SUZ12, WT1, HFM1, RFWD2, FLT4, and TTNI3K. On the contrary, focal deletions were wider and tended to span more than two genes. As expected, among the most significant deletions were those in 13q14.2 and in 11q22.3. Nevertheless, other highly significant regions involved the loci of genes such as NOX4, a component of the NADPH oxidase complex (Guo and Chen, 2015), and MIA2, a tumor suppressor gene (Hellerbrand et al., 2008). Deletions in SETD2, 11q22.3, and 14q32.33 were associated with shorter time to first treatment, and gains of IRF4 were independently associated with short survival. Furthermore, we could detect significant positive correlations between six recurrent CNAs and the expression of genes encoded in their respective loci, as well as correlations between 19 CNAs and the expression 926 proteincoding genes genome wide.
Focal recurrent gains and losses tended to target genes that participate in oncogenic pathways. For example, five amplified genes (HFM1, ANAPC10, TAF5, COBRA1, and SYCE3) and one deleted gene (FAM175A/ABRAXAS) are involved in DNA transcription, replication, and repair mechanisms. Both COBRA1 and FAM175A physically interact with the tumor suppressor BRCA1 (Castillo et al., 2014;Yun et al., 2018), whereas ANAPC10 belongs to the anaphase-promoting complex/cyclosome family of proteins that control sister chromatid segregation and cytokinesis (Chang et al., 2014). The amplified genes PHF21A, PCGF6, and SUZ12 encode epigenetic regulators with repressor activity (Iwase et al., 2006;Vizán et al., 2015;Zhao et al., 2017). Other genes targeted by recurrent events participate in important pathways. This is the case of the amplified oncogenes AKT1 (Hyman et al., 2017) and WT1 (Bergmann et al., 1997), and of the receptor tyrosine kinase gene FLT4, which regulates lymphangiogenesis and tumor metastasization to lymphatic vessels . Likewise, the deleted genes AMD1, TP53INP1, ULK1, and NFATC1 are also important in tumorigenesis. AMD1 and TP53INP1 participate in metabolic pathways, and both have tumor suppressor activity (Scuoppo et al., 2012;Saadi et al., 2015), whereas ULK1 plays a decisive role in autophagy initiation (Zachari and Ganley, 2017) and NFATC1 maintains an anergic phenotype in CLL cells (Märklin et al., 2017). Finally, two cyclin-dependent kinase genes were recurrently deleted: CDK6 and CDK19. Deletions in CDK6 were associated with shorter time to first treatment, whereas those in CDK19 were correlated with a reduced expression of the gene. CDK19 is a component of the mediator kinase module, which associates with the mediator complex in order to regulate diverse cellular functions (Dannappel et al., 2019), and CDK6 is a promoter of cell-cycle progression (Kollmann and Sexl, 2013). The role of both genes in the pathogenesis of CLL needs further clarification.
Finally, recurrent broad cytogenetic aberrations characteristic of CLL were identified at the expected frequency, as in the case of trisomy 12, 17p deletion, amplification of 8q, and loss of 8p (Blanco et al., 2016). Interestingly, we detected a significant adverse time to event and OS effect of trisomy 12 among IGHVmutated cases. The importance of IGHV mutation as a predictor of disease evolution within the group of patients with trisomy 12 was previously reported by others (Bulian et al., 2017;Roos-Weil et al., 2018), but to our knowledge, this is the first report about the prognostic importance of trisomy 12 among IGHV-mutated cases. Furthermore, new CNN-LOH events affecting CLL drivers were also detected, most of which had not been described in previous array-based analysis of the CLL genome (Novak et al., 2002;Pfeifer et al., 2007). These CNN-LOH events affected the ATM, NOTCH1, TP53, ARID1A, ASXL1, CREBBP, and PI4KB/IL6R loci, as well as the telomeric region of 11p. Only part of these mutations had concurrent mutations in their corresponding driver genes, suggesting the existence of other mechanisms of pathogenicity.
Detection of copy number changes based on exomesequencing has been proven to be prone to false positives by some studies (Rieber et al., 2017). In this analysis, we have included patient matched control samples, we applied stringent thresholds in order to minimize false detections, and we recapitulated most of the cytogenetic findings of CLL in the expected frequency. Furthermore, we could correlate the presence of this structural aberrations with changes in gene expression in a subgroup of patients, although we believe that this study may be underpowered to detect such associations.
In conclusion, our study presents proof-of-concept evidence for the existence of new focal recurrent CNAs and CNN-LOH in the genome of CLL, some of which influence clinical outcome. Furthermore, we observed that some of these novel events have significant correlations with gene expression changes. The results are concordant with the possible involvement of a set of oncogenes and tumor suppressors in the development of CLL. These results should be considered a "proof of concept," and their existence and functionality should be validated in the future.

FUNDING
This research has been performed without funding. The publication costs associated with this manuscript have been partially paid by Roche Pharmaceuticals. The funder played no role in the study design, data collection, analysis, results interpretation, writing or in the decision to submit this paper for publication.

ACKNOWLEDGMENTS
The authors gratefully thank CESGA (Supercomputing Center of Galicia) for providing the necessary resources for the development of this project, as well as the International Cancer Research Consortium and the European Bioinformatics Institute for supplying data access.
The content of this paper is part of the doctoral thesis of Adrián Mosquera Orgueira to obtain a PhD at the Department of Medicine, University of Santiago de Compostela.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00854/ full#supplementary-material SUPPLEMENTAL FIGURE 1 | Heatmap representation of genomic profiles made using segmented copy number data of the CLL genomes.
SUPPLEMENTARY TABLE 1 | Association of focal amplifications and deletions with TTT and OS. Only those events affecting at least 5 patients of this cohort were selected for the analysis.
SUPPLEMENTARY TABLE 2 | Significant correlations detected between CNA events and expression of genes encoded in their respective loci.