Time to Treatment Prediction in Chronic Lymphocytic Leukemia Based on New Transcriptional Patterns

Chronic lymphocytic leukemia (CLL) is the most frequent lymphoproliferative syndrome in western countries. CLL evolution is frequently indolent, and treatment is mostly reserved for those patients with signs or symptoms of disease progression. In this work, we used RNA sequencing data from the International Cancer Genome Consortium CLL cohort to determine new gene expression patterns that correlate with clinical evolution.We determined that a 290-gene expression signature, in addition to immunoglobulin heavy chain variable region (IGHV) mutation status, stratifies patients into four groups with notably different time to first treatment. This finding was confirmed in an independent cohort. Similarly, we present a machine learning algorithm that predicts the need for treatment within the first 5 years following diagnosis using expression data from 2,198 genes. This predictor achieved 90% precision and 89% accuracy when classifying independent CLL cases. Our findings indicate that CLL progression risk largely correlates with particular transcriptomic patterns and paves the way for the identification of high-risk patients who might benefit from prompt therapy following diagnosis.


INTRODUCTION
Chronic lymphocytic leukemia (CLL) is a low-grade B-cell lymphoproliferative disease with an estimated yearly incidence in western countries of about 6.9 cases per 100,000 people (1) and remarkable variation between races. The incidence of CLL is higher in men than in women and it increases progressively from the age of 35 until the last decades of life (2). Currently, CLL treatment is delayed until disease progression (bone marrow failure, organomegaly, general symptoms, or high-grade lymphoma transformation) and in the case of refractory autoimmune phenomena (3,4). Nevertheless, with the advent of new targeted treatments such as ibrutinib (5), idelalisib (6), and venetoclax (7), it is tempting to speculate that some individuals could benefit from early intervention immediately following diagnosis, when the tumoral mass is smaller and patients have a better physical condition. Thus, improved risk stratification for patients with CLL is needed.
Recent advances in CLL genomics have discovered new drivers of disease, many of which are associated with a different clinical evolution. Deletions (6p21, 6q15, 11q, 14q24, 15q15, 17p, 18p, and 20p; gains in 2p16, 5q24, and 8q24), trisomy 12, and gene mutations (TP53, ATM, NOTCH1, SF3B1, BIRC3,  BRAF, POT1, ZNF292, NFKB2, MGA, IRF4, DDX3X, ZMYM3, and FUBP1)have been repeatedly observed in the CLL genome and are linked to rapid disease progression (8) Nevertheless, immunoglobulin heavy chain variable region (IGHV) mutation status, which is an indirect measure of the tumor lymphocytes' maturation stage (9), is among the most important single predictive factor known to date (10). IGHV unmutated patients show remarkably worse prognosis than IGHV mutated patients (10,11) and only a few other genomic factors have proven to be associated with clinical evolution independent of this variable. Lymphocyte maturation is such an important indicator that DNA methylation status has been used to classify CLL into three different groups that resemble different B cell maturation stages (naive B cell, intermediate, and memory B cell). This classification was shown to outperform IGHV status at predicting time to first treatment (TTT) (12).
Mutations, genomic aberrations, and DNA methylation patterns induce transcriptomic changes that can be measured using RNA sequencing (RNAseq), a technique that offers an opportunity to identify new biomarkers for disease progression and drug response prediction (13)(14)(15). In fact, previous efforts to improve CLL risk stratification based on RNAseq data have demonstrated impressive results (16), but the clinical application is difficult due to the expense of extensive technical and bioinformatics efforts. Therefore, there is a need for smaller transcriptomics patterns correlated with disease evolution for medical use.
In this study, we performed machine-learning based Gaussian mixture model clustering on a subgroup of genes significantly associated with TTT in order to identify transcriptional clusters with clinical implications. We studied TTT due to the lack of treatment uniformity in the International Cancer Genome Consortium (ICGC) CLL cohort and because it is a variable associated with overall survival (17). We tested our results on a 196 patient cohort and validated its clinical significance in an independent 79 patient cohort. The overall results delineated two IGHV-independent transcriptional clusters that stratify patients according to their risk of treatment initiation. Furthermore, we demonstrated that machine learning algorithms using gene expression data can predict patient need for treatment in the first 5 years following diagnosis. We anticipate that our findings will improve the identification of high-risk CLL patients following diagnosis.

Data Sources and Patient Characteristics
We applied for access to the ICGC's CLL sequencing data (18) deposited in the European Genome-Phenome Database (EGA) (19). The Data Access Committee approved access to this data under DACO-1040945. Two CLL RNA-seq cohorts A summary of the patient characteristics of both cohorts can be consulted in Table 1.

Data Preprocessing and Alignment
RNAseq paired-end data were obtained from Illumina pairedend sequencing performed by the ICGC CLL consortium as described by Ferreira et al. (16) Illumina adapters were removed using cutadapt (20) and alignment to the human reference genome (GRCh37) was performed using Hisat2 (21) with default specifications. We used the Hisat2-provided Hierarchical Graph FM index for GRCh37 with SNP and Ensembl transcript information. Bam files were sorted and indexed using samtools (22).

Gene Expression Estimation
RNAseq bam files were processed in R (23) according to the RNAseq gene expression protocol developed by Love et al. (24) Briefly, bam files were read using Rsamtools, (25) followed by gene-level expression estimation using the SummarizeOverlaps function from the GenomicAlignments package. (26) Gene models in GTF format were downloaded from Ensembl (GRCh37.75 version) (27). Genes with a median read count below one were discarded.

Statistical Analysis
We analyzed gene expression association with CLL's TTT using cox regression implemented in the survival package (28,29). In this model we included the covariates donor sex and CLL stage (MBL, Binet Stage A, Binet Stage B, and Binet Stage C). Time to Treatment was calculated as the period between CLL diagnosis and the initiation of the first treatment for CLL. The day of last follow-up was used for right censoring the data of patients with incomplete follow-up.
Clustering was performed using the Mclust package (30)  Those genes with significant association with TTT in the study cohort (cox regression false discovery rate [FDR] below 5%) were selected as our initial list of genes. Variable selection was performed by adding one new gene in p-value ascending order to the model (starting with the first two most significant genes until reaching the top at 2,198 genes [FDR<5%]) and computing the most likely clusters. For the sake of simplicity, we discarded the 25% least variable genes, the 50% least expressed genes and those with a high (>0.9) Spearman's rank correlation with any other gene in the input data. In the case of a highly correlated pair of genes, the one with the lowest p-value was discarded. In each iteration we forced Mclust to calculate the two most likely groups of samples in our data, and to select the best model according to the maximal Bayesian Information Criterion (BIC). Association with TTT calculated using cox regression (survival package), including IGHV mutation status as covariate in each iteration. P-value adjustment was performed with the Bonferroni method.

Machine Learning Ensembl Construction
For IGHV status and need of treatment at 5 years prediction we ran boosted trees analysis using BigML applications (31) with a 2,000 tree node threshold. We chose 5 years due to the following reasons: (1) it is important to differ which patients will have progression in the first years since diagnosis; and (2) the number of cases progressing in earlier years was too small in order to train a good classificator. Varying percentages of learning rates were tested. The best model was selected based on receiver operating characteristic (ROC) curves, Precision-Recall curves, and Kolmogorov-Smirnov statistics.

Genes Associated With Time to Treatment and Clusterization
A cox regression model was constructed with gene expression, donor sex and CLL stage at diagnosis as independent variables. 2,198 genes were found to be significantly associated with TTT (FDR < 5%) in the study cohort.
Patient clusterization based on gene expression data using a GMM-EM algorithm retrieved 19 sets of genes that clustered samples into two groups with significant associations with TTT when adjusted for IGHV status (Bonferroni-adjusted p-value < 0.01) (Supplemental Table 1). The most significant cluster (cluster 2) contained 290 transcripts (Figure 1, Supplemental Table 2) and achieved an association p-value of 6.4 × 10 −7 (Bonferroni p-value 1.4 × 10 −3 ) with the TTT variable adjusted for IGHV mutation status (Figure 2). A significant association was confirmed in the validation cohort (IGHV adjusted p-value 3.05 × 10 −3 ) (Figures 2, 3).
According to the selected classificator, patients in cluster two (C2) had a more favorable prognosis than patients in cluster one (C1) (Hazard Ratios (HR) of−1.70 and−1.41 in the test and validation cohorts, respectively), independently of IGHV mutation status. Among the study cohort, roughly 36.7% of patients belonged to C2, while 34.1% of patients in the validation cohort clustered within C2. C2 involved 51.5% of IGHV-mutated patients and 6.4% of IGHV-unmutated patients in the study cohort, as well as 55.5% of IGHVmutated patients and 5.8% of IGHV-unmutated patients in the validation cohort.

Machine Learning for Treatment Free Survival Prediction
We were interested in a machine learning (ML) classifier that could predict which patients would require CLL therapy in the first years following diagnosis. We constructed model ensembles with all genes associated with TTT in Cox regression at a FDR of 5%. We also tested different learning rates (0.5, 1, 2.5, 5, and 10%). 222 patients had a follow-up period >5years or had been treated in the first 5 years following diagnosis, and we divided them into a training set (80% of patients, composed of 146 patients from the study cohort and 31 patients from the validation cohort) and a test set (20% of patients, composed of 45 patients from the validation cohort). ROC AUC and Precision-Recall AUC plots were evaluated to select the best results.
The best model used a 2.5% learning rate and 2,000 tree nodes. It achieved 90% precision at identifying patients that needed treatment in 5 years with 69.23% recall, and 88.57% precision at identifying those patients that did not require treatment in 5 years with 96.88% recall. We only detected 1 false positive case (3.1% False Positive Rate) and 4 false negatives (30% False Negative Rate). Average precision was 89.29%, accuracy was 88.89% and ROC area under the curve (AUC) was 0.911 (Figure 4 and Table 2). Precision-Recall AUC was 0.860 and 0.959 for predicting which patients would or would not need treatment within this period, respectively. In each case, the results mostly overlapped with the area under the convex hull (AUCH).

DISCUSSION
The main aim of this study was to identify new transcriptomic patterns in order to improve CLL patient risk stratification. We used the GMM-EM algorithm to stratify patients in two clusters with remarkably different clinical behavior based on the expression of 290 genes, and we observed that this pattern was independent of IGHV mutation status. Interestingly, we identified a group of CLL patients with mutated IGHV and We also describe a novel artificial intelligence algorithm that can predict a CLL patient's need for therapy during the first 5 years following diagnosis with high precision and accuracy. This is in line with other ML applications to oncologic malignancies that are starting to change paradigms in patient risk stratification and drug response prediction. For example, Aziz et al. (34) recently reported the identification of a ML model that integrates clinical and genomic data from patients with myelodysplastic syndrome (MDS). This model outperformed all commonly used prediction models in the field of MDS.
Similarly, Yousefi et al. (35) used bayesian-optimized deep learning for survival prediction in pan-cancer analysis, showing not only better performance than other state-of-the-art methods, but also improved predictability of cancer survival through transfer learning in different types of cancer genomic data. Thus, it is likely that ML-driven algorithms applied to genomic and transcriptomic data will be used in the near future for the identification of "smoldering" CLLs that may benefit from early intervention.
RNAseq is a powerful technique that can sequence the whole transcriptome at an increasingly lower cost. Targeted RNAseq is being developed for clinical application, with the additional possibility of testing for gene mutations and fusion genes in the same technique. Therefore, defining reproducible gene expression patterns with clinical implications is a strategy that can close the gap between research and the clinical practice. Here we present patterns of gene expression that can improve CLL patient risk stratification with a relatively small set of the transcriptome. These results may pave the way for the design of new treatment strategies involving early CLL treatment in high-risk patients before disease progression.

AUTHOR CONTRIBUTIONS
AMO designed the study performed research. AMO, BAR, JDA, and JBL analyzed the data. AMO wrote the paper. NAV, ABL, NDV, MGP, and MPE reviewed the paper.

FUNDING
The publication costs associated with this manuscript have been paid by Roche Pharmaceuticals. The funder played no role in the study design, the collection, analysis or interpretation of data, the writing of this paper or the decision to submit it for publication.