Genes selection using deep learning and explainable artificial intelligence for chronic lymphocytic leukemia predicting the need and time to therapy

Analyzing gene expression profiles (GEP) through artificial intelligence provides meaningful insight into cancer disease. This study introduces DeepSHAP Autoencoder Filter for Genes Selection (DSAF-GS), a novel deep learning and explainable artificial intelligence-based approach for feature selection in genomics-scale data. DSAF-GS exploits the autoencoder’s reconstruction capabilities without changing the original feature space, enhancing the interpretation of the results. Explainable artificial intelligence is then used to select the informative genes for chronic lymphocytic leukemia prognosis of 217 cases from a GEP database comprising roughly 20,000 genes. The model for prognosis prediction achieved an accuracy of 86.4%, a sensitivity of 85.0%, and a specificity of 87.5%. According to the proposed approach, predictions were strongly influenced by CEACAM19 and PIGP, moderately influenced by MKL1 and GNE, and poorly influenced by other genes. The 10 most influential genes were selected for further analysis. Among them, FADD, FIBP, FIBP, GNE, IGF1R, MKL1, PIGP, and SLC39A6 were identified in the Reactome pathway database as involved in signal transduction, transcription, protein metabolism, immune system, cell cycle, and apoptosis. Moreover, according to the network model of the 3D protein-protein interaction (PPI) explored using the NetworkAnalyst tool, FADD, FIBP, IGF1R, QTRT1, GNE, SLC39A6, and MKL1 appear coupled into a complex network. Finally, all 10 selected genes showed a predictive power on time to first treatment (TTFT) in univariate analyses on a basic prognostic model including IGHV mutational status, del(11q) and del(17p), NOTCH1 mutations, β2-microglobulin, Rai stage, and B-lymphocytosis known to predict TTFT in CLL. However, only IGF1R [hazard ratio (HR) 1.41, 95% CI 1.08-1.84, P=0.013), COL28A1 (HR 0.32, 95% CI 0.10-0.97, P=0.045), and QTRT1 (HR 7.73, 95% CI 2.48-24.04, P<0.001) genes were significantly associated with TTFT in multivariable analyses when combined with the prognostic factors of the basic model, ultimately increasing the Harrell’s c-index and the explained variation to 78.6% (versus 76.5% of the basic prognostic model) and 52.6% (versus 42.2% of the basic prognostic model), respectively. Also, the goodness of model fit was enhanced (χ2 = 20.1, P=0.002), indicating its improved performance above the basic prognostic model. In conclusion, DSAF-GS identified a group of significant genes for CLL prognosis, suggesting future directions for bio-molecular research.


Introduction
A precise prognostic methodology in chronic lymphocytic leukemia (CLL) patients is critical from the clinical standpoint since progression to a more advanced disease stage requires therapy and often implies an adverse prognosis.At first presentation/ diagnosis, over three-quarters of CLL patients are classified as early/asymptomatic disease phase and not requiring immediate therapy (1).Although most patients have a low-risk profile as indicated by the high frequency of the immunoglobulin heavy chain variable (IGHV) gene mutated (IGHVmut) status (2) and the low del(17p) occurrence involving the TP53 locus (3), the time to first treatment (TTFT) is rather heterogeneous, and it can be partially predicted using combinations of risk-associated markers, which include staging systems and b2-microglobulin (b2-M) (2,(4)(5)(6)(7)(8).
Despite the proven prognostic power of this approach, the clinical course of a number of patients does not follow the pattern predicted, possibly indicating the requirement for additional prognosticators.In this respect, gene expression profiles (GEP), that is, the measurement of the activity (the expression) of all genes of interest to depict a synthetic picture of cellular function, is exploited to increase the ability to predict the prognosis of CLL patients (9)(10)(11).
Although GEP datasets represent a valuable source of information in healthcare, being currently used for diagnosis, prognosis, and precision medicine of hematological malignancies (12), their analysis results are challenging for three main reasons.The first one is the course of dimensionality: genomic-scale datasets typically consist of a very large number of features (genes) and a relatively small number of samples (patients).The second problem concerns imbalanced classes: genomics data are often collected from multiple sources and stratified based on pathologies.In most cases, there is a significant difference between the number of instances in each class.Finally, sequencing data are typically collected from multiple sources, different laboratories, and sequencing tools.This results in noisy datasets which are difficult to analyze (13).
A bioinformatic analysis is necessary to fully realize the potential of these large-scale sequencing data for prognosis in hematological malignancies (14)(15)(16) and solid tumors (17).Machine learning (ML) approaches have been widely used to enhance the performance of diagnostic and predictive models for different diseases and CLL as well (14)(15)(16)(17)(18)(19)(20).Resources and guidelines for using ML in CLL have been made available (21,22).
However, most ML prognostic models for CLL fail to consider numerous variables and do not account for non-linear interactions between them (22).This limits the accuracy of the models and the ability to make informed predictions about the disease progression.Therefore, promising tools such as deep learning (DL) methods, a subset of ML methods based on artificial neural networks (NNs), may be used to overcome the aforementioned ML limitations.DL approaches recognize hidden patterns in large-scale datasets that are typically difficult to detect with traditional statistical and ML models.Recent studies propose and evaluate new feature selection (FS) approaches on genomic-scale datasets for cancer diagnosis and prognosis (23,24).Such FS methodologies mainly aim at selecting the most informative genes, which can characterize classes and identify groups of patients.
Although very powerful, DL models are in general not immediately interpretable, meaning that it is difficult to understand the causal relationship between the inputs and their outcomes.This is an even more severe problem in the bioinformatics domain, where it is crucial to understand, for example, in the case of genomics, how the expression of a gene can affect the progression of oncological patients.In this context, the adoption of explainable artificial intelligence (XAI) methods has started to gain momentum for interpretability purposes as well as to enhance FS (25)(26)(27).
On the other hand, a widely used approach to overcome the course of dimensionality problem is to perform dimensionality reduction using autoencoders (AE) (24).While this has been proven effective, the encoding is typically a non-linear projection of the variables into a lower-dimensional space, making it difficult to provide the interpretations of the proper results.
This study introduces DeepSHAP Autoencoder Filter for Genes Selection (DSAF-GS), a novel DL and XAI-based FS method for genomics-scale data analysis.Such a method uses AEs for selecting the most informative genes without any change into the original features space, hence enhancing the explainability of the results and still exploiting the representation abilities of AEs.Such selection of genes is used to design and train a prediction model for diagnosis or prognosis.Eventually, the Shapely Additive ex-Planation (SHAP) (28) XAI method is applied to interpret the model results and select the most meaningful genes for the disease.
In the present paper, the proposed XAI method has been used to identify those genes whose expression levels are relevant for predicting the need of therapy in CLL patients from a prospective cohort of newly diagnosed Binet stage A CLL (O-CLL protocol) (29,30) who are being monitored under a watch-and-wait strategy.This innovative approach enabled meaningful insights into CLL prognosis from genomic data by locating a group of significant genes to boost the prognostic power of a basic prognostic model.We point out that while our contribution is fully positioned within the research in oncology, our XAI method has broader applicability; in fact, from the bioinformatics and computational genomics point of view (18)(19)(20), an interesting avenue of further research is to assess its efficacy as a general feature-selection method, for instance, by considering datasets for which we already have some a priori semantic information on the most relevant features and by using classical comparison metrics for predictive models.

Patients
A total of 224 of 523 newly diagnosed Binet A CLL cases belonging to the observational O-CLL1 study (clinicaltrials.govidentifier NCT00917540) were prospectively enrolled from 40 Italian institutions (29,30) and studied for GEP.All participants provided written informed consent, and the relevant institutional review boards approved the study.The inclusion and exclusion criteria have been previously detailed (29).In particular, cases could be recruited only within 12 months of diagnosis and if they were aged <70 years and were Binet stage A. The biologic review committee confirmed the diagnosis using flow cytometry analysis and GEP analyses were centralized at Prof Ferrarini's (Istituto Studio Tumori, Genoa, Italy) and Prof Neri's (Fondazione Ca' Granda, IRCCS, Ospedale Maggiore, Policlinico, Milan) labs, respectively.Recruitment began in January 2007.According to the guidelines, treatment was decided uniformly for all participating centers based on documented progressive and symptomatic disease.

Assessment of biological markers
Cytogenetic abnormalities involving deletions at chromosomes 11q23 and 17p13 were evaluated by FISH in a purified CD19 + population as previously described (31).IGHV gene mutational status was assessed on cDNA specimens (32).Sequences were aligned to the IMGT directory and analyzed using IMGT/ VQUEST software.NOTCH1 mutation hotspot was set by nextgeneration deep sequencing as previously described (29).

GEP analysis
GEP experiments were performed as previously described (29,30).Briefly, total RNA fraction was obtained from CD19 + -enriched B-cell samples (EasySep-Human B cell enrichment kit without CD43 depletion, Stem Cell Technologies, Voden Medical Instruments S.p.A, Milan, Italy) using the fully automated protocol of immunomagnetic cell separation with RoboSepTM (Stem Cell Technologies).Purified B-cells (CD19+) exceeded 95% were employed as total RNA sources for GEP analysis.
Preparation of DNA single-stranded sense target, hybridization to GeneChip ® Gene 1.0 ST Array (Affymetrix, Santa Clara, CA), and scanning of the chips (7G Scanner, Affymetrix) were carried out according to manufacturer's protocols.RNA fraction was obtained from samples using Trizol reagent (Life Technologies, Monza, Italy).RNA quality was assessed using the Agilent 2100 Bioanalyzer (Agilent Technologies).The raw intensity expression values were processed by robust multi-array average (RMA) procedure 19 with the reannotated Chip definition files (CDF) from BrainArray libraries version 15.0.0 20 available at http:// brainarray.mbni.med.umich.edu,as previously described (22).The gene and miRNA expression data have been deposited at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus repository (http://www.ncbi.nlm.nih.gov/geo/) and are accessible through GEO Series accession number GSE40570.

O-CLL dataset
For each patient, 19,367 genes profiles were provided.Patients are labeled according to the occurrence of an event (or not).The considered outcome was the need for therapy starting or death [dichotomous, not (event=0) vs yes (event=1)].From the 217 patients in the final dataset, 120 were labeled as event=0 and 97 as event=1.

Feature selection
GEP studies generate large, high-dimensional, and unbalanced datasets, where each sample can have up to thousands of variables.This results in high computational costs and the possibility of overfitting.Such overfitting may mistake small changes in the data as significant differences, leading to misclassification errors.This study addresses these risks by applying FS techniques to reduce the dimensionality problem by selecting the most relevant features and removing noise and redundancy.FS techniques can be filterbased, wrapper, or embedded (33)(34)(35).The integration of multiple FS techniques is denoted as hybrid FS.
The proposed DSAF-GS approach is a hybrid FS method that combines filter-based and wrapper techniques to achieve a representative and meaningful subset of genes.DSAF-GS uses autoencoders (AE) as wrappers along with statistical filters to remove redundant genes.An NN is trained on the remaining genes as an event predictor.Finally, the SHAP XAI method is used to evaluate the contribution of genes to NN decisions.The genes with the strongest contribution are selected.

Neural networks
NNs are ML computational models inspired by the structure and function of the human brain.They consist of consecutive layers of interconnected artificial neurons, which process and transmit information through weighted connections.Training an NN amounts to providing a dataset of input-output pairs and identifying, via proper optimization methods, the NN parameters that minimize some given loss function, usually meant to measure the distance between the output at hand and the result of the NN computation on the given input.In a research context, NNs can be trained on large datasets of genetic data to identify patterns and predict the effects of genetic mutations on traits of interest (36)(37)(38).These predictions can then be used to understand further the genetic basis of diseases (38) and other phenotypic traits (39,40) and inform the development of personalized medical treatments (37, 41).In addition, by using NNs and other computational and experimental methods (e.g.clustering and statistical analysis, such as F-test or XAI) researchers can gain deeper insights into the complex interactions between genetics and biology.
Autoencoders (AE) are particular architectures of NNs that uncover the underlying structure of the data and generate a latent code for further analysis (42).An AE maps an input to a lower dimensional representation (latent code).Such code is expected to have uncorrelated features, being able to reconstruct the original input data.Therefore, AEs can be used for dimensionality reduction, denoising, and data generation.

SHapley additive exPlanation
The black-box nature of NNs often limits the interpretability of their results.Advances in XAI provide various methods for interpreting black-box models, offering a clearer understanding of their predictions.For example, Shapely Additive ex-Planation (SHAP) is a game theory-based approach for interpreting blackbox models.SHAP determines the importance of a feature by observing the variations in predictions when the feature is included or excluded from the model.It assigns an importance value, called a SHAP value, to each feature, based on its contribution to the predictions (28).SHAP values are quantitative estimates, indicating 'how' and 'how much' every single gene contributes to the model decisions, providing insight into the gene's role in the event prediction.The SHAP method provides a way to understand the underlying workings of NNs predictions, leading to improved insights and better decision-making.
An alternative XAI approach, named LIME (Local Interpretable Model-Agnostic Explanations) (43), approximates the behavior of complex models via local interpretable explanations.Such explanations are obtained by fitting simpler and interpretable surrogate models with perturbed input data and observing the resulting changes in the model's predictions.
While LIME approximates the behavior of complex models with simpler ones, SHAP provides a more direct and explicit connection between feature importance and predictions.This transparency, along with a solid theoretical foundation rooted in game theory, enhances a deeper understanding of the underlying mechanisms driving the model's decisions.

Proposed algorithm
DSAF-GS consists of the following steps (Figure 1): 1.The pairwise correlation (Pearson) was computed over the whole set of genes.2. The resulting correlation matrix was clustered using hierarchical clustering such that similarly correlated genes belong to the same cluster.3.For each cluster, all patient data were retrieved from the original dataset.An AE is then trained for each cluster using patients as features and genes as samples.The AEs selected the most representative gene of the respective cluster.The most representative gene was the one associated with the lowest reconstruction error.This step eliminates redundant genes hence reducing the dimensionality, still working at the level of the original feature.4. Genes selected in the previous step were then ranked with the F-test.The F-value was computed by considering, for each gene, the ratio of the variance between and within the groups (event=0 and event=1).A subset of genes with the highest F-value was then selected.The final subset size was empirically selected by considering different subset sizes. 5.A NN was trained to perform binary classification on the event class, using the previously selected set of genes as input variables and by considering the standard binary cross-entropy loss function.A model selection phase identified the most appropriate architecture of the NN.A grid-search approach is applied over a hyperparameters space defined by the number of layers and neurons per layer.In particular, we considered 2, 3, and 4 layers and 8, 16, 32, 64, 128, and 256 neurons for the first layer.In every successive layer, the number of neurons was determined as half of the number of neurons in the preceding layer.For each given configuration (layers/neurons), we built the corresponding NN whose performances were evaluated using a cross-validation (cv) algorithm, which assesses the model's ability to generalize by repeatedly training and testing on different subsets of the data for multiple iterations.The hyperparameters configuration is chosen as the one with the highest average performances (according to the average binary accuracy) over the cv iterations; the best model is finally selected as the one with the best performances among the cv iterations of the chosen configuration.6. SHAP XAI method was used to explain the chosen NN classifier for the CLL event.SHAP evaluates the importance of each gene on the predictions by providing information on how such genes affect the prognosis.

Implementation
The DSAF-GS algorithm has been implemented using the Python (v3.8.11) programming language.NNs have been implemented using the Tensorflow (v2.6.0)framework and the Keras library.XAI analysis was performed using the SHAP (28) library.
For GEP analysis, 500 clusters were identified (step 2).Each AE architecture (step 3) consisted of five layers with the following number of neurons: 217, 43, 21, 43, and 217, respectively; relu was used as activation function and Adam as optimizer with a learning rate of 0.01; Mean Absolute Error was used as reconstruction loss, and each AE was trained for 1000 epochs.Out of the 500 genes selected by the AEs (one for each cluster), different subset sizes were used to train the NN event predictor.The F-test (step 4) was used to select a subset of genes of size 5, 10, 50, 100, and 300 (Table 1).Different NNs were trained (step 5) for each gene subset.The best model for GEP takes in 50 genes as input and consists of 4 layers of 46, 22, 12, and 1 neuron, respectively.The pipeline proposed for selecting a subset of genes relevant to predict CLL events.The input data is used to compute the genes pairwise correlation matrix (step 1), and the correlation matrix is clustered (step 2) to group similarly correlated genes.The clusters are then mapped to the original input data and transposed.AEs are trained for each cluster to select the most representative gene, reducing dimensionality (step 3).The genes are ranked with F-test, selecting a subset with the highest F-value (step 4).A neural network is trained with a selected set of genes to perform binary classification of the CLL patients (event=0 and event=1) (step 5).The best NNs architecture is determined through model selection, and the SHAP XAI method explains each gene's importance in the predictions (step 6).
During grid-search (step 5), Adam was used as an optimizer with a learning rate of 0.001, and relu was used as an activation function.A total of 10 cv iterations have been performed for each configuration by randomly splitting the data into a training set and test set (90% and 10% of the whole dataset, respectively).While the training set has been enriched with synthetic samples (by using SMOTE) (44) to guarantee a balanced training of the NN, the test set only comprised real data samples.While the computation of SHAP values was found to be inefficient for NN models, the authors (36) demonstrated that Shapley values could be calculated through a weighted linear least square regression with a shapely kernel.Such a method was adopted for computing SHAP values (step 6) using a subsample of 100 patients.Note training and test sets are different parts of the same O-CLL dataset.Indeed, we are not aware of any further public dataset having the clinical and genomic information required by our method and that can be used as a validation set, by fitting our needs and the prospective nature of our study.

Statistical analysis
TTFT was calculated during the watch and wait period from the date of the diagnosis to the date of therapy start or last follow-up.The prognostic impact of predictors was investigated by univariable and multiple Cox regression analysis.Data were expressed as hazard ratio (HR) and 95% confidence intervals (CIs).The predictive accuracy of the prognostic models was quantified by calculating Harrell's c-index (HC-index), ranging from 0.5 to 1.0, and the explained variation on the outcome (i.e., an index combining calibration and discrimination) (45).The improvement of model fitting due to the inclusion of specific genes was assessed by the log-likelihood ratio statistics.The receiver operator curves (ROC) Figure 2B illustrate the model performance by plotting the actual positive rate (sensitivity) versus the false positive rate (1 -specificity).A value of P <.05 was considered significant for all statistical calculations.Data analysis was performed by SPSS for Windows v.21, IBM, Chicago, Illinois, USA, and by Stata 16, StataCorp, Texas, USA.

Pathway and gene network analysis
Pathway analysis was performed using the Reactome Pathway Analysis tool (reactome.org)to group genes into specific pathways.
Reactome analysis with statistical hypergeometric distribution test determines whether certain pathways are over-represented in the submitted data.This test produces a probability score, which is corrected for false discovery rate using the Benjamani-Hochberg method (46,47).

Gene selection by explainable artificial intelligence
Of the 224 enrolled patients, 217 were used for the analysis.The remaining 7 were removed for defective gene profiles.The performance of the best model on the test set is reported in Table 1.The results are shown according to the number of genes used as independent variables (first column).The best model accuracy is 86.36%, achieved using 50 genes as predictors, with a sensitivity and specificity of 85% and 87.5%, respectively.The second column reports the 95% CIs computed using the model accuracy's mean and standard deviation over the cv iterations.
The model sensitivity and specificity are reported in Figures 2A,  B, in terms of a confusion matrix and ROC curve, respectively.Despite the three false positives and three false negatives, the model is capable of detecting the underlying patterns in the data, as shown by the overall performance.Figure 2B shows that the models have an area under the curve (AUC) of 0.91.
Figure 3A shows a waterfall plot of absolute mean SHAP values, reporting the average importance of each gene in the model, evaluated using SHAP.Genes are reported in order of importance.For example, the model was strongly influenced by CEACAM19 and PIGP (+0.09 and +0.08), moderately influenced by MKL1 (alias of MRTFA gene) and GNE (+0.04), and poorly influenced by others (<0.03).
Moreover, as shown in Figure 3B, higher values of the gene CEACAM19 are associated with positive SHAP values, meaning that they will increase the prediction towards the occurrence of the event.Moreover, lower values of the variable are associated with negative SHAP values, meaning that they will decrease the prediction towards the absence of an event.Conversely, for the PIGP gene lower values are associated with positive SHAP values, increasing the prediction towards the event's occurrence.PIGP higher values are associated with negative SHAP values, meaning they will decrease the forecast towards no event occurrence.

Pathways and networks overview based on Reactome database of the top 10 genes
The description and chromosome localization of the top ten genes are described in Table 2. Eight of the top ten genes (i.e., FADD, FIBP, FIBP, GNE, IGF1R, MKL1, PIGP, SLC39A6) selected by the NN algorithm were found in the Reactome pathway database, showing involvement in various pathways such as signal transduction, gene expression (transcription), protein metabolism, immune system, cell cycle and apoptosis (Figure 4).
Interestingly, seven (i.e., FADD, FIBP, IGF1R, QTRT1, GNE, SLC39A6, MKL1) of the top 10 genes appear to be connected into a complex network as shown in Figure 5 by the network model of the 3D protein-protein interaction (PPI) explored using the NetworkAnalyst tool (48).

Multivariate analysis of the top 10 genes
The top 10 genes selected by the NN models were chosen to estimate their prognostic influence on noticeable clinical and biomolecular variables (named basic prognostic model) consisting of IGHV mutational status, del(11q) and del(17p), NOTCH1 mutation, b2-M, Rai stage, and B-lymphocytosis.Table 3 shows their prediction power on TTFT in univariable analysis.
As expected, all ten top genes showed a predictive power on TTFT in univariable analyses (Figure 6).Specifically, for COL28A1 (HR 0.32, 95% CI 0.12-0.82,P=0.018), FADD (HR 0.21, 95% CI 0.07-0.62,P=0.005), and PIGP (HR 0.39, 95% CI 0.15-0.98,P=0.047) high expression was associated with a reduced risk to be treated, while the  Pathways overview based on the Reactome database of the top 10 genes identified by the (SHAP) XAI method.A genome-wide overview of the results of pathway analysis is shown.Reactome pathways are arranged in a hierarchy.The center of each of the circular "bursts" is the root of one top-level pathway, for example, Cell Cycle.Each step away from the center represents the next lower level in the pathway hierarchy.The color code denotes the over-representation of that pathway in the input dataset.The closer the color is to yellow, the more significant the over-represented pathway is; light grey indicates pathways that are not significantly over-represented.4).When these three significant genes were evaluated in a final multivariable model, including the basic prognostic variables, COL28A1, along with B-lymphocytosis, lost its significance, while IGF1R and QTRT1 maintained their prognostic independence (Table 4).
The basic prognostic model provided an HC-index of 76.5% and an explained variation to predict the TTFT of 42.2%.When the three significant genes (i.e., IGF1R, COL28A1, and QTRT1) were jointly considered in the final multivariable model, the HC-index and the explained variation significantly increased to 78.6% and 52.6%, respectively, along with an improvement of the goodness of model fit (c2 = 20.1,P=0.002).In a more parsimonious model, only including IGF1R and QTRT1 (i.e., the two genes that remained significantly associated with the TTFT in the final model) and excluding CLO28A1, the HC-index (78.2%) and the explained variation (52.4%) retained a better performance as compared with the basic prognostic model, with a concomitant rise in the goodness of the model fit (c2 = 18.8, P=0.001).

Discussion
The considerable innovations in genomics engendering a vast and miscellaneous bulk of information from sizable cohorts of patients and the concurrent computer science knowledge improvements have guided the growing use of AI and, more specifically, of ML approaches that acquire knowledge from available data, devising variable selections without pre-setting programming (50).Well-defined examples of the ML approach in the analysis of hematological malignancies are the association of BCL6 and PDL1/2 rearrangements in primary testicular diffuse large B-cell lymphoma (DLBCL) with central nervous system relapse (51); the involvement of six prognosis-related long noncoding genes in acute myeloid leukemia (AML) patients (52); or the relevance of tumor mutation burden for the DLBCL overall survival prognostication (53) are.In CLL, the ML algorithm identified six hub genes as possible biomarkers to improve the diagnosis (14).Moreover, baseline clinical data added to the international prognostic index for CLL (CLL-IPI) variables demonstrated improved predictive performance over CLL-IPI, using a range of ML boosting algorithms to identify the individual risk of death, treatment, infection, and a combination of them (16).In contrast, no additional improvement was observed when comprising recurrent genetic mutation information (16).Moreover, an ML algorithm called CLL Treatment Infection Model (CLL-TIM) was b2-M, b2-microglobulina; IGHV, immunoglobulin heavy chain gene rearrangement; LL, lower limit; UL, upper limit; Cis, confidence intervals.The PPI networks created by FADD, FIBP, IGF1R, QTRT1, GNE, SLC39A6, and MRTFA genes.Node size and color correspond to the number of connected edges; gene name is displayed only for nodes with ≥ 4 edges, and the closer the color is to red, the bigger the node size is.
applied to recognize patients at high risk of infection and/or treatment based on CLL-IPI variables and routine clinical data (17).However, differently from our prospective study, the CLL-IPI score system only included 32% of Binet stage A patients and, more importantly, 4% of IGHV mutated cases, thereby rendering it less representative of the real-world setting and may lower the TTFT's predictive efficacy.In contrast, the Brno-Barcelona cohort (54) had a significantly higher proportion of early-stage/low-risk Binet A (83%) and IGHV mutated cases (43%), as did the German CLL study group which also developed a predictive model for newly diagnosed Binet stage A patients (55), with roughly 71% of the population having an IGHV mutation status.
Herein, we selected the top 10 genes (CEACAM19, PIGP, FADD, FIBP, FIBP, GNE, IGF1R, MKL1, PIGP, and SLC39A6) from a GEP dataset of 217 CLL cases comprising roughly 20,000 genes using a novel deep ML-based approach to estimate how much every single gene had a role in predicting the therapy need occurrence.The GEP model was strongly influenced by CEACAM19 and PIGP (SHAP value +0.09 and +0.08) in making decisions, moderately influenced by MKL1 and GNE (SHAP value +0.04), and poorly influenced by the others (SHAP value <0.03).IGF1R, COL28A1, and QTRT1 moderately influenced quite the GEP model (SHAP value +0.05).
Some variables, namely Rai stage, IGHV mutational status, b2-M, and 17(p) and 11(q) deletions previously validated in the CLL-IPI score system (5, 56) and by our group (57,58), were used as a basic risk model in predicting TTFT.We found that IGF1R, COL28A1, and QTRT1 genes maintained their own independent prognostic value in Forest plot of Cox univariable analysis for time to TTFT according to the top 10 genes selected by the NN algorithm.
predicting the time-to-event when tested in a multivariable model, including the variables of the basic prognostic model.However, in a final multivariable model, in which the three genes (IGF1R, COL28A1, and QTRT1) were tested all together with the prognostic variables of the basic model IGF1R and QTRT1, but not COL28A, maintained their predictive independence on TTFT.Notably, the presence of these genes in the model significantly increased the prognostic accuracy of a basic risk model.In this regard, the HCindex and the explained variation significantly increased from 76.5% in the basic model to more than 78% and from 42.2% to roughly 52% in the IGF1R/QTRT1-gene model, respectively.These data indicate that the IGF1R/QTRT1-gene model retained a better performance than the basic prognostic model.IGF1R encoding the insulin-like growth factor 1 receptor (IGFR1) is not only implicated in numerous cellular biofunctional processes, i.e., growth, proliferation, differentiation, and apoptosis (59), but also it plays a critical role in cancer development, progression, and metastasis (60).Moreover, IGF1R is involved in CLL (61)(62)(63) and overexpressed in various CLL cell subsets.Its inhibition induced apoptosis and efficiently reduced CLL growth in an Em-TCL1 transgenic murine model (62).Moreover, IGF1R seems to be a direct target of sorafenib since the latter decreased its expression and phosphorylation by offsetting the insulin-like growth factor-1 binding to CLL cells and ultimately dropping the in vitro IGF1R kinase activity (62).Finally, we previously demonstrated the IGF1R gene expression as an independent prognostic factor related to TTFT in our O-CLL prospective cohort after a shorter follow-up (63).
Unlike the IGF1R gene, QTRT1 encoding the queuine tRNAribosyltransferase 1, a key enzyme involved in the posttranscriptional modification of tRNAs (64), has never been implicated in the pathogenesis or prognosis of CLL.Conversely, a significant increase in QTRT1 expression and a striking downregulation in its methylation was also found in lung cancer (65).Furthermore, it was discovered to be a risk factor for the disease onset and progression and adversely associated with survival outcomes (65).
Among various CLL prognostic models involving genes (66-69), Herold et al. (11) provided evidence of the association between overall survival and TTFT and the expression of 8 genes in CLL cells (PS.8 score).For TTFT, PS.8 showed an improved prognostic effect than the single parameters and even to a combined FISH and IGVH status model, which, in turn, failed to i n c r e a s e t h e p e r f o r m a n c e o f t h e P S .8 s c o r e i n a multivariable analysis.
Huang et coll (70).showed that high NRIP1, BCL11B, and SIRT1 expressions were associated with more prolonged survival, while high expression of CDKN2A and SREBF2 with a poor prognosis (70).However, a substantial fraction of patients in the dataset chosen by the authors was not analyzed at the diagnosis/first presentation but at the time of progressive disease or relapse (70).Conversely, patients of our O-CLL cohort were prospectively followed-up, and all the biomolecular analyses were performed at the disease onset.Moreover, both Herold's (11) and Huang's (70) studies did not consider, unlike our study, unavailable risk factors included in the CLL-IPI score, somewhat misinterpreting the final results.In models 1-3, the basic prognostic model variables were included in the multiple models with every single gene.The results of the analyses in which the specific gene was independently associated with TTFT are reported.In the final model, the three significant genes were integrated into the multivariable analysis with the markers of the basic prognostic model.b2-M, b2-microglobulin; LL, lower limit, UL, upper limit; CIs, confidence intervals.
Two recently published articles represent interesting innovations in CLL's gene-oriented prognosis (71, 72).Liang X et al., following the super-enhancer (SE) new hypothesis, generated a prognostic score to predict the time-to-therapy-need in CLL by the expression levels of nine SE-associated genes (71).Yet, since several data suggest the high dependency of CLL cells on microenvironment support, Abrisqueta and coll (72).described the prediction power of a signature for predicting progression based on the analysis of two hundred genes linked to microenvironment signaling by the NanoString approach.This novel approach established a 15 genes-based signature that predicted disease outcome independently of the IGHV mutational status, the CLL-IPI, and the International Prognostic Score for Earlystage (IPS-E) CLL score (72).Notably, the nanoString platform, overcoming GEP methodological drawbacks and reproducibility, could represent the future, facilitating its use in clinical settings.
Notably, several pathways involved in cancer and hematopoietic malignancies development were identified by Reactome analysis of the top ten genes analyzed in this study, including Interferon alpha/ beta signaling (73-75), caspases and Rho GTPase activity (76), GHR signaling pathway (77-79), Integrin signaling (80), non-receptor Tyrosine Kinases activity (81), and FGF/FGFR pathways (82).Moreover, among the top ten genes, FIBP was found to be overexpressed in a specific group of CLL patients affected by a large loss at the 13q14 locus (83); as previously noted also, IGF1R was identified as overexpressed in various CLL subsets, suggesting a contribution to CLL pathology (63,81,84,85).Finally, seven of the top 10 genes (which appeared to be connected in a complex PPI network) are, in turn, interconnected through the UBC gene, encoding for Polyubiquitin C, which represents one of the sources of ubiquitin in human cells.Polyubiquitin C plays a crucial role in maintaining cellular ubiquitin levels, especially during the stress response.The process of ubiquitination has been associated with protein degradation, DNA repair, cell cycle regulation, kinase modification, endocytosis, and regulation of other cell signaling pathways (86).Interestingly, Zhang et al. (87), through bioinformatic analysis of gene expression profiles in CLL cells, identified the UBC gene as the key node in the PPI network of genes up-regulated in B cells co-stimulated with immobilized anti-IgM with respect to untreated cells, revealing the proteasome pathway as the most significant in this network.
Finally, it should be emphasized that this preliminary study lacks a validation cohort.To the best of our knowledge, we are not aware of any further public dataset fitting the prospective nature of our study as well as the clinical and genomic information required to answer our aims.Specifically, the GEO dataset (GSE39671) (88) does not have the characteristics required to run the method presented in this article but the characteristics are instead collected in the ICGC CLL dataset (89,90).However, in the latter, sampling was completed within a year, as in our study, in approximately 26% of Binet A untreated CLL cases.In contrast, the median sampling time for the remaining cases of the ICGC CLL cohort was approximately 5 years (IQR 2.6-9.1), a bias that might invalidate the analysis' conclusions.Moreover, information on the Rai stage system is lacking and the Binet stage information at sampling is not available.

Conclusions
A novel deep ML-based approach was proposed in the current analysis, exploiting the reconstruction capabilities of AEs and XAI to select the most informative genes for predicting the therapy need event.This study's strengths lie in the use of an original ML method and the prospective nature of our study.The results, although preliminary, evidenced the effectiveness of this approach in identifying genes with independent predictive power, suggesting a set of meaningful genes for further investigation.Finally, it should be emphasized that this pilot study requires external validation using a different prospective cohort of patients with similar characteristics.Finally, it should be emphasized that this pilot study requires external validation using a different prospective cohort of patients with similar characteristics.

3 SHAP
FIGURE 3 SHAP values were computed for the best model.(A) Waterfall plot of absolute mean SHAP values (average absolute importance of each gene in the model), (B) Beeswarm plot of SHAP values (shows how and how much each gene influences the predictions).
FIGURE 2 (A) Confusion matrix of model performance on the test set in predicting the event or non-event of new patients.Black squares refer to wrongly classified patients (false positives and false negatives), while colored squares refer to well-classified patients (true positives and true negatives).(B) ROC curves for the model.The graph plot sensitivity against specificity at various threshold settings.The classifier performs better as the curve approaches the upper left corner.An AUC value of 0.91 for GEP predictions indicates the solid overall performance of the model.

TABLE 1
Models' accuracy in the binary classification of CLL event (i.e., therapy need or death).

TABLE 2
Description and localization of the top ten genes derived from SHAP analysis.

TABLE 3 Univariable
Cox analyses for time to first treatment of several well-known clinical and biomolecular variables belonging to the basic prognostic model.

TABLE 4
Cox multivariable analyses for time to first treatment (TTFT).