A Predictive Based Regression Algorithm for Gene Network Selection

Gene selection has become a common task in most gene expression studies. The objective of such research is often to identify the smallest possible set of genes that can still achieve good predictive performance. To do so, many recently proposed classification methods require some form of dimension reduction of the problem, ultimately provide a single model as output and, in most cases, rely on the likelihood function to achieve variable selection. We propose a new prediction-based objective function that can be tailored to the requirements of practitioners and can be used to assess and interpret a given problem. Based on cross-validation techniques and the idea of importance sampling, our proposal scans low-dimensional models under the assumption of sparsity and, for each of them, estimates the objective function to assess its predictive power and select the best-performing models. Two applications on cancer data sets and a simulation study show that the proposal compares favorably with competing alternatives such as Elastic Net and Support Vector Machine. Indeed, the proposed method not only selects smaller models with better, or at least comparable, classification errors but also provides a set of selected models instead of a single one, making it possible to construct a network of possible models for a target prediction accuracy level.

1. We augment our initial variable set $M_0$ with one variable in order to construct the set $I^*_1$.
   (i) Construct the p possible models obtained by augmenting $M_0$ with each of the p available variables.
   (ii) Compute D(·, ·) for every model obtained in Step (i).
   (iii) From Steps (i) and (ii), construct the set $I^*_1$ using (3). Go to Step A.2 and let d = 2.
2. We augment our initial variable set $M_0$ by d variables in order to construct the set $I^*_d$.
   (i) Construct the $\binom{p}{d}$ possible models of d variables and augment $M_0$ with the variables of each constructed model.
   (ii) Compute D(·, ·) for every model obtained in Step (i).
   (iii) From Steps (i) and (ii), construct the set $I^*_d$ using (3) and let d = d + 1. Go to Step A.2 (if d < d ) or Step B.1 (if d ≥ d ), with model dimension starting value d.

Table 3 reports the main biomarker hubs and related biomarker networks for the leukemia data set analysed in Section 4.1. Table 4 reports the performances of our implementation of the competing methods as described in Section 5. Unlike in Table 1, the proposed method here uses the classical tenfold-CV for D(·, ·) (K = 1). The other hyperparameters are kept the same (i.e. α = 0.01, B = 20 000 and π = 0.5).

Table 3: Biomarker network organisation - leukemia data set - Lymphoblastic / Myeloblastic leukemia. TF = transcription/translation factor activity, DNA repair and catabolism; AA = apoptotic activity; IR = immunity, inflammatory response (blood coagulation, antigen presentation and complement activation); IPT = intracellular protein trafficking, transmembrane transport; ACC = actin activity, cytoskeleton organisation; APC = protein catabolism; ST = intracellular signal transduction; CG = cell growth, proliferation and division. Source: www.ensembl.org; www.uniprot.org

Table 4: Performances of our implementation of the competing methods on the leukemia data-set. For the Panning Algorithm, models "a" to "c" are three examples out of the 81 models. All the 81 models have a tenfold-CV error of 0 except one. The best test error is 1 and the worst is 21. Model averaging gives an equal weight to all the 274 models and aggregates their predictions. Trimmed-mean model averaging is model averaging restricted to the best 25% of models based on their in-sample deviances.
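The exhaustive Steps A.1 and A.2 above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `exhaustive_step`, the toy score standing in for D(·, ·), and in particular the selection rule are assumptions. Equation (3) is not reproduced in this appendix, so the sketch assumes that it keeps the models whose score is at or below the empirical α-quantile of all scores.

```python
from itertools import combinations


def exhaustive_step(base, candidates, d, score, alpha):
    """Augment `base` with every d-subset of `candidates` and score each model.

    `score` plays the role of D(., .): lower is better.
    Returns a dict {model: score} for the retained models.
    """
    pool = [v for v in candidates if v not in base]
    scored = {}
    for extra in combinations(pool, d):
        model = base | frozenset(extra)
        scored[model] = score(model)
    if not scored:
        return {}
    # Assumed stand-in for equation (3): keep the models at or below the
    # empirical alpha-quantile of the scores (always at least one model).
    ranked = sorted(scored.values())
    keep = max(1, int(alpha * len(ranked)))
    threshold = ranked[keep - 1]
    return {m: s for m, s in scored.items() if s <= threshold}


def toy_score(model):
    """Hypothetical score: penalise missing 'useful' variables and size."""
    useful = {2, 5}
    return len(useful - model) + 0.01 * len(model)


# Scan all 2-variable models built from 8 hypothetical variables.
selected = exhaustive_step(frozenset(), range(8), 2, toy_score, alpha=0.05)
print(selected)
```

In this toy run only the model containing variables 2 and 5 is retained. In practice `score` would be the cross-validated prediction error, and ties at the threshold are all kept so that equally good models are not discarded arbitrarily.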

C Breast Cancer
The second data-set we analyzed is the breast cancer data presented in Chin et al. (2006). The main goal of this analysis is to identify the estrogen receptor expression on tumor cells, which is a crucial step for the correct management of breast cancer. As with Table 4 in Appendix B, Table 5 reports the performances of our implementation of the competing methods and of the proposed approach on the breast cancer data. For the sake of this comparison, the data-set was randomly split into training (60) and test (58) sets. The hyper-parameters of the proposed method are α = 0.01, B = 30 000, π = 0.5 and D(·, ·) is the repeated tenfold-CV (K = 10).

Table 5: Performances of our implementation of the methods on the breast cancer data-set. For the proposed method, models "a" to "c" are three examples out of 274 models. The tenfold-CV error varies between 0 and 3. The best test error is 9 and the worst is 31. Model averaging gives an equal weight to all the 274 models and aggregates their predictions.

Figure 3 shows the paradigmatic network identified by our method for the breast cancer data, for which the selected model dimension is three (i.e. only three biomarkers are needed in a model to classify the breast cancer samples well). We used the hyper-parameters α = 0.01, B = 22 215 and π = 0.05, and D(·, ·) was the tenfold-CV repeated K = 10 times. Table 6 provides the details of the networks based on the three main hubs and is to be interpreted as described in Section 4.1.
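The criterion D(·, ·) used here, a tenfold-CV repeated K times, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function `repeated_cv_error`, the toy classifier `fit_threshold` and the synthetic data are hypothetical, and the sketch assumes that D is the number of misclassified observations averaged over the K repetitions.

```python
import random


def repeated_cv_error(X, y, fit, folds=10, repeats=10, seed=0):
    """Average misclassification count over `repeats` random partitions.

    `fit(X_train, y_train)` must return a function mapping one
    observation to a predicted label.
    """
    rng = random.Random(seed)
    n = len(y)
    total_errors = 0
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)  # a fresh random partition per repetition
        for k in range(folds):
            test_idx = order[k::folds]  # every observation is held out once
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            predict = fit([X[i] for i in train_idx],
                          [y[i] for i in train_idx])
            total_errors += sum(predict(X[i]) != y[i] for i in test_idx)
    return total_errors / repeats


def fit_threshold(X_train, y_train):
    """Toy one-variable rule: cut at the midpoint of the two class means."""
    mean0 = sum(x[0] for x, c in zip(X_train, y_train) if c == 0) / y_train.count(0)
    mean1 = sum(x[0] for x, c in zip(X_train, y_train) if c == 1) / y_train.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: int((x[0] > cut) == (mean1 > mean0))


# Synthetic, perfectly separated data: 20 observations per class.
X = [[float(i)] for i in range(20)] + [[100.0 + i] for i in range(20)]
y = [0] * 20 + [1] * 20
print(repeated_cv_error(X, y, fit_threshold, folds=10, repeats=10))
```

Setting `repeats=1` gives the classical tenfold-CV (K = 1). Repeating the partition reduces the dependence of the estimate on one particular random split, at K times the computational cost.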
This figure is a clear example of the advantages of the proposed method since it not only selects a set of low-dimensional models with high predictive power, but also provides the basis for a more general biological interpretation that takes into account interactions between different biomarkers, as opposed to one single model. The three main hubs identified through the proposed algorithm are:
1. GATA3.
2. IL6ST.
3. TBC1 domain family, member 9 (TBC1D9): a GTPase-activating protein for Rab family proteins involved in the expression of the ER in breast tumors.
GATA3 is known to regulate the differentiation of epithelial cells in mammary glands (see Kouros-Mehr et al., 2006) and is required for luminal epithelial cell differentiation. Its expression is progressively lost during luminal breast cancer progression as cancer cells acquire a stem cell-like phenotype (see Chou et al., 2010). IL6ST has been linked to breast cancer epithelial-mesenchymal transition and cancer stem cell traits (see Chung et al., 2014), cancer-promoting microenvironment (see Bohrer et al., 2014) and resistance (see Christer et al., 2013). Moreover, this result supports the assertion by Taniguchi and Karin (2014) that IL6ST and related cytokines are the critical lynchpins between inflammation and cancer. Finally, concerning the third biomarker, a recent publication by Andres and Wittliff (2012) has shown that the expression of the ER on the surface of breast tumor cells is highly correlated with the coordinate expression of different genes, among which are TBC1D9 and GATA3. These two genes are not only considered relevant according to the proposed method but are actual hubs of the "best" models which define the structure of the identified network. Instead of selecting a single model with many biomarkers whose interactions may be difficult to interpret, the proposed method selects a set of models with few biomarkers, so that each model is easy to interpret individually without losing the possibility of interpreting it within the larger network. This is what this paper means by the expression "paradigmatic network": by taking this approach it is possible to identify a set of biomarker families within which each biomarker is interchangeable with the others.

Table 6: Biomarker network organisation - breast cancer data set - Estrogen Receptor - Breast Cancer.
TF = transcription/translation factor activity, DNA/RNA repair and catabolism; ER = estrogen receptor activity; APC = autophagy, protein catabolism; IR = immunity, inflammatory response (blood coagulation, antigen presentation and complement activation); CC = cell/cell communication; ST = intracellular signal transduction, protein glycosylation; CG = cell growth and division; IPT = intracellular protein trafficking, transmembrane amino-acid transporter; ACC = actin activity, cytoskeleton organisation, cell projection; STM = sugar transport and metabolism; ITT = ion transmembrane transport, transmembrane signaling systems; PUP = pseudogene, uncharacterized protein; FAM = fatty acid metabolism. Source: www.uniprot.org; www.ncbi.nlm.nih.gov/gene
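One simple way to see how a set of selected low-dimensional models gives rise to hubs and a network is sketched below. This is only an illustration: the function `biomarker_network`, the placeholder names `g1` to `g6` and the four example models are hypothetical and are not results from the paper. The sketch assumes that hubs are the biomarkers appearing in the most selected models and that two biomarkers are linked whenever they appear in the same model.

```python
from collections import Counter
from itertools import combinations


def biomarker_network(models, n_hubs=3):
    """Return (hubs, edges) for a collection of selected models.

    hubs  : the `n_hubs` biomarkers present in the most models.
    edges : Counter mapping a sorted biomarker pair to the number of
            selected models in which both appear.
    """
    counts = Counter(gene for model in models for gene in set(model))
    hubs = [gene for gene, _ in counts.most_common(n_hubs)]
    edges = Counter(
        pair
        for model in models
        for pair in combinations(sorted(set(model)), 2)
    )
    return hubs, edges


# Hypothetical selected models of dimension three (placeholder names).
models = [
    ("g1", "g2", "g3"),
    ("g1", "g2", "g4"),
    ("g1", "g5", "g6"),
    ("g2", "g5", "g4"),
]
hubs, edges = biomarker_network(models, n_hubs=2)
print(hubs)
print(edges.most_common(3))
```

Biomarkers that co-occur with the same hub but never with each other are natural candidates for the interchangeable families described above, although identifying such families in practice relies on the biological annotation reported in the table.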