REVIEW article

Front. Mol. Biosci., 07 November 2022

Sec. Molecular Diagnostics and Therapeutics

Volume 9 - 2022 | https://doi.org/10.3389/fmolb.2022.907150

A comprehensive survey on computational learning methods for analysis of gene expression data

  • 1. Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India

  • 2. Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India

  • 3. Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India

  • 4. Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, India


Abstract

Computational analysis methods, including machine learning, have a significant impact on the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analyses, such as classification of sample observations or discovery of feature genes, require sophisticated computational approaches. In this review, we compile various statistical and computational tools used in the analysis of expression microarray data. Even though the methods are discussed in the context of expression microarrays, they can also be applied to the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values and the methods and approaches usually employed in their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery, along with their evaluation parameters, are described in detail. We believe that this detailed review will help users select appropriate methods for preprocessing and analysis of their data based on the expected outcome.

1 Introduction

A genome is the complete set of genes in an organism, and genomics is the study of the information structure and function programmed in the genome. Genomics has applications in multiple fields, including medicine (Chen et al., 2018; Lai et al., 2020; Huang et al., 2021), agriculture (Abberton et al., 2016; Parihar et al., 2022), industrial biotechnology (Alloul et al., 2022), and synthetic biology (Baltes and Voytas, 2015). Researchers working in these domains create and use a variety of data, such as DNA, RNA, and protein sequences, gene expression, gene ontology, and protein-protein interactions (PPI).

Genomics data can be broadly classified into sequence data and numeric data (e.g., the gene expression matrix). DNA sequence information can be determined by first generation (Sanger, Nicklen and Coulson, 1977), second generation (Margulies et al., 2005; Shendure et al., 2005; Bentley et al., 2008; Valouev et al., 2008), or third generation sequencing (Harris et al., 2008; Eid et al., 2009; Eisenstein, 2012; Rhoads and Au, 2015) methods. Second and third generation sequencing are together referred to as Next Generation Sequencing (NGS). Applications of DNA sequence analysis include prediction of protein sequence and structure, molecular phylogeny, and identification of intrinsic features and sequence variations. Common implementations of these applications include splice site detection (Nguyen et al., 2016; Fernandez-Castillo et al., 2022), promoter prediction (Umarov and Solovyev, 2017; Bhandari et al., 2021), classification of disease-related genes (Peng, Guan and Shang, 2019; Park, Ha and Park, 2020), identification of protein binding sites (Pan and Yan, 2017; Uhl et al., 2021), and biomarker discovery (Arbitrio et al., 2021; Frommlet et al., 2022). The numeric data often generated from functional genomics studies include gene expression, single nucleotide polymorphism (SNP), and DNA methylation data. Microarray and NGS technologies are the tools of choice for functional genomics studies. The branch of functional genomics that deals with the high-throughput study of gene expression is referred to as transcriptomics.

Gene expression data, irrespective of the platform used (e.g., microarray or NGS), contain the expression levels of thousands of genes experimentally evaluated under various conditions. Gene expression analysis helps us understand gene networks and molecular pathways. Gene expression information can be utilized for basic as well as clinical research (Behzadi, Behzadi and Ranjbar, 2014; Chen et al., 2016; Karthik and Sudha, 2018; Kia et al., 2021). In disease biology, gene expression analysis provides an excellent tool to study the molecular basis of disease as well as to identify markers for diagnosis, prognosis, and drug discovery. Therefore, in this review, we focus on computational methods for the analysis of gene expression data.

The data produced by microarray as well as NGS-based RNA sequencing go through multiple phases of quality checks before analysis. The data are then transformed into a numerical matrix (Figure 1) whose rows and columns represent genes and samples, respectively. The numeric value in each cell of the matrix gives the expression level of a specific feature gene in a particular sample. The expression matrix is generally a "flat" dataset, as the number of features is very high compared to the number of samples. Some of the standard DNA microarray platforms available are Affymetrix (Pease et al., 1994), Agilent (Blanchard, Kaiser and Hood, 1996), etc. Some of the standard commercial NGS platforms are Illumina (Bentley et al., 2008), Ion Torrent (Rothberg et al., 2011), etc. The massive amount of data generated from publicly funded research is available through open access repositories such as Gene Expression Omnibus (GEO), ArrayExpress, Genomic Expression Archive (GEA), etc. (Table 1).
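As a minimal sketch of this data structure, an expression matrix can be represented as a genes-by-samples table; the gene and sample names and values below are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Rows = genes (features), columns = samples; names are placeholders.
genes = [f"gene_{i}" for i in range(1000)]
samples = [f"sample_{j}" for j in range(6)]
expr = pd.DataFrame(rng.lognormal(mean=2.0, sigma=1.0, size=(1000, 6)),
                    index=genes, columns=samples)

# Each cell holds the expression level of one gene in one sample.
# Note the "flat" shape: far more features (rows) than samples (columns).
print(expr.shape)  # (1000, 6)
```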

FIGURE 1

Identification of differentially expressed genes is the most common application of gene expression analysis. This type of class comparison can be achieved using basic statistical techniques, for example, the chi-squared test, t-test, ANOVA, etc. (Segundo-Val and Sanz-Lozano, 2016). Commonly used packages for microarray-based gene expression analysis include limma (Smyth, 2005), affy (Gautier et al., 2004), lumi (Du, Kibbe and Lin, 2008), and oligo (Carvalho and Irizarry, 2010), whereas those for RNA sequencing analysis include edgeR (Robinson, McCarthy and Smyth, 2009) and DESeq2 (Love, Huber and Anders, 2014). Classification and regression problems, on the other hand, traditionally rely on classical linear and logistic regression analysis. However, the data typically generated by transcriptomic technologies create a need for penalized or otherwise modified approaches to address the problems of high dimensionality and overfitting (Turgut, Dagtekin and Ensari, 2018; Morais-Rodrigues et al., 2020; Tabares-Soto et al., 2020; Abapihi et al., 2021). The development of high-end computational algorithms, such as machine learning techniques, has created a new dimension for gene expression analysis.
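A per-gene class-comparison test of this kind can be sketched as follows; the expression values are illustrative, and in practice the p-values of thousands of genes would be corrected for multiple testing:

```python
import numpy as np
from scipy import stats

# Illustrative log2 expression values of one gene in two sample groups.
control = np.array([7.1, 6.8, 7.3, 7.0])
disease = np.array([9.2, 8.9, 9.4, 9.0])

# Two-sample t-test, the basic class-comparison statistic.
t_stat, p_value = stats.ttest_ind(control, disease)
print(p_value < 0.05)  # True: this gene looks differentially expressed
```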

Machine learning (ML) is an artificial intelligence-based approach that emphasizes building systems that learn automatically from data and improve their performance without being explicitly programmed. ML models are trained on a significant amount of data to find the hidden patterns required to make decisions (Winston, 1992; Dick, 2019; Micheuz, 2020). Artificial Neural Networks (ANN), Classification and Regression Trees (CART), Support Vector Machines (SVM), and vector quantization are some of the architectures used in ML. A recent advancement in the ML domain is deep learning (DL), which is based on artificial neural networks (Deng and Yu, 2014; LeCun, Bengio and Hinton, 2015). ANN architectures comprise input, hidden, and output layers of neurons; when more than one hidden layer is used, the ANN method is referred to as a DL method. Basic ML and DL models can work on lower-end machines with less computing power; however, DL models require more powerful hardware to process vast and complex data.

ML techniques are broadly categorized into supervised and unsupervised learning methods (Jenike and Albert, 1984; Dayan, 1996; Kang and Jameson, 2018; Yuxi, 2018). Supervised learning, which makes use of well-labelled data, is applied for classification and regression analysis. A labelled dataset is used for the training process, which produces an inferred function to make predictions about unknown instances. Classification techniques train the model to separate the input into different categories or labels (Kotsiantis, 2007), whereas regression techniques train the model to predict a continuous numerical output based on input variables (Fernández-Delgado et al., 2019). Unsupervised techniques, on the other hand, let the model discover information or unknown patterns from the data. Unsupervised learning can be roughly divided into clustering and association rule mining. Clustering, used for class discovery, is the task of grouping a set of instances such that samples in the same group or cluster are more similar in their properties than samples in other groups or clusters. Association rule mining discovers links between data instances within large databases (Kotsiantis and Kanellopoulos, 2006).

Supervised ML techniques have been used for binary classification (e.g., identification of cases in clinical studies) as well as multiclass classification (e.g., grading and staging of disease). ML techniques have been extensively used to analyze gene expression patterns in various complex diseases, such as cancer (Sharma and Rani, 2021), Parkinson’s disease (Peng, Guan and Shang, 2019), Alzheimer’s disease (Kong, Mou and Hu, 2011; Park, Ha and Park, 2020), diabetes (Li, Luo and Wang, 2019), and arthritis (Liu et al., 2009; Zhang et al., 2020). Classification algorithms have also contributed to biomarker identification (Jagga and Gupta, 2015), precision treatment (Toro-Domínguez et al., 2019), drug toxicity evaluation (Vo et al., 2020), etc. Unsupervised clustering techniques are routinely used in transcriptomics, for example, to study expression relationships between genes (Liu, Cheng and Tseng, 2011), extract biologically relevant expression features (Kong et al., 2008), and discover frequent determinant patterns (Prasanna, Seetha and Kumar, 2014).

In both supervised and unsupervised learning, the data are first subjected to preprocessing, e.g., missing value imputation and normalization (Figure 2). In supervised learning for classification analysis, the entire dataset is divided into two subsets, viz. training and testing/validation. The training dataset, which typically comprises 70–80% of the samples, is used for the construction of a model. The training data can first be subjected to missing value imputation and feature scaling. The preprocessed data are then subjected to feature selection/extraction and model development. The model is then applied to the test/validation dataset, which is preprocessed in a similar fashion. The preprocessing and feature selection steps are applied to the training dataset after the train-test split to avoid “data leakage”. Unsupervised learning, which is based on unlabeled data, may include preprocessing steps and data-driven techniques for feature reduction.
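The split-then-preprocess order described above can be sketched with scikit-learn on synthetic data; the particular pipeline steps and parameters here are illustrative choices, not a recommended recipe:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))          # 60 samples x 100 "genes" (synthetic)
y = rng.integers(0, 2, size=60)         # binary class labels
X[rng.random(X.shape) < 0.05] = np.nan  # sprinkle in missing values

# Split FIRST, so that imputation, scaling, and feature selection are
# fitted on the training fold only -- this is what avoids "data leakage".
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", SVC()),
])
pipe.fit(X_tr, y_tr)        # all statistics come from the training set
preds = pipe.predict(X_te)  # the test set reuses those statistics
```

Wrapping the steps in a `Pipeline` guarantees that the test fold never influences the imputation means, scaling factors, or selected features.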

FIGURE 2

Though missing value imputation, normalization, feature selection, and modelling are all important steps in classification analysis, there is very limited literature reviewing them together. Most reviews focus either on missing value imputation, feature selection, or learning/modelling (Quackenbush, 2001; Dudoit and Fridlyand, 2005; Chen et al., 2007; Liew, Law and Yan, 2011; Sahu, Swarnkar and Das, 2011; Yip, Amin and Li, 2011; Khatri, Sirota and Butte, 2012; Tyagi and Mishra, 2013; Bolón-Canedo et al., 2014; Li et al., 2015; Manikandan and Abirami, 2018; Hambali, Oladele and Adewole, 2020; Zhang, Jonassen and Goksøyr, 2021). This creates gaps in the understanding of the complete analysis pipeline for researchers from different domains, and the objective of this review is to bridge these gaps. Here we discuss various ways to analyze gene expression data and the computational methods used at each step. Through this comprehensive review, we also discuss the need for interpretability to provide insights and bring trust to the predictions made. The review is organized into six sections. Section 2 broadly covers different missing value imputation approaches along with their advantages and limitations. Section 3 discusses feature scaling techniques applied to gene expression data. In Section 4, broad categories of feature selection and dimensionality reduction techniques are discussed. Section 5 covers the different types of gene expression analyses, including class comparison, classification (class prediction), and class discovery. In Section 6, we discuss conclusions and future directions.

2 Missing value imputation

Gene expression matrices are often riddled with missing values arising for various reasons. In this section, we discuss sources of missing values and the computational techniques utilized to impute them. Missing data are typically grouped into three categories: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR) (Rubin, 1976; Schafer and Graham, 2002; Aydilek and Arslan, 2013; Mack, Su and Westreich, 2018) (Figure 3). In MCAR, the missingness is independent of both the unobserved values and the observed data; that is, the data are missing completely at random, independent of the nature of the investigation. MAR is a more general class than MCAR in which conditional dependencies are accounted for: the missingness may depend on the observed data but not on the unobserved values themselves. In transcriptomics, it can often be assumed that MAR values are also MCAR (Lazar et al., 2016); for example, a channel signal obscured accidentally by a dust particle. However, in meta-analysis, missing values can be attributable to a specific dataset due to its architecture; in this case, the missing values are MAR but not MCAR. In MNAR, the missingness depends on the unobserved values themselves. In microarray analysis, values missing because of their low signal intensities are an example of MNAR data.
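The difference between these mechanisms can be made concrete with a small simulation; the intensities are synthetic, and the 10% missing rate and detection-limit cutoff are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
intensity = rng.lognormal(mean=2.0, sigma=1.0, size=1000)  # synthetic signals

# MCAR: each value has the same chance of being missing, regardless of
# its magnitude or of anything else in the data.
mcar = intensity.copy()
mcar[rng.random(1000) < 0.1] = np.nan

# MNAR: missingness depends on the unobserved value itself, e.g., signals
# below the detection limit (here, the lowest decile) are lost.
mnar = intensity.copy()
mnar[intensity < np.quantile(intensity, 0.1)] = np.nan

# Both vectors are ~10% missing, but the MNAR mechanism systematically
# removes low values, biasing the mean of the observed data upwards.
print(np.nanmean(mnar) > intensity.mean())  # True
```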

FIGURE 3

Missing values can be imputed using two broad approaches. MCAR/MAR values can simply be replaced with a fixed value or with the mean, median, or mode; however, this method produces many identical values when the proportion of missing data is high. Alternatively, MCAR/MAR and MNAR values can be imputed using advanced computational techniques. The choice of imputation method depends on the accuracy of the results obtained from the downstream analysis. Computational techniques for estimating missing values can be categorized into four approaches: global, local, hybrid, and knowledge-assisted (García-Laencina et al., 2008; Moorthy et al., 2019; Farswan et al., 2020) (Table 2).
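A minimal sketch of the fixed-value strategy, using the column mean:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One gene across 8 samples; two values are missing.
x = np.array([[2.0], [4.0], [np.nan], [6.0], [np.nan], [8.0], [2.0], [8.0]])

# Simple fixed-value strategy: fill every gap with the column mean.
filled = SimpleImputer(strategy="mean").fit_transform(x)
print(filled.ravel())  # both gaps become 5.0, the mean of the observed values
```

Note the drawback mentioned above: every gap receives the same value, so when many values are missing this strategy distorts the variance of the gene.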

TABLE 2

| Approach | Advantages | Limitations | Methods | References |
|---|---|---|---|---|
| Global | Optimal performance when data is homogeneous | Poor performance when data is heterogeneous | BPCA | Jörnsten et al. (2005), Oba et al. (2003), Souto et al. (2015) |
| | | | SVD | Troyanskaya et al. (2001) |
| | | | ANNImpute | García-Laencina et al. (2008) |
| | | | RNNImpute | Bengio and Gingras (1995) |
| Local | Optimal performance when data is heterogeneous | Poor performance when data is homogeneous | KNNImpute | Dubey and Rasool (2021), McNicholas and Murphy (2010), Pan et al. (2011), Ryan et al. (2010) |
| | | | LSImpute | Bo et al. (2004) |
| | | | SVRimpute | Wang et al. (2006) |
| | | | GMCImpute | Ouyang et al. (2004) |
| Hybrid | Optimal performance regardless of local or global correlation | Sub-optimal performance when data is noisy and has high missing rates | LinCmb | Jörnsten et al. (2005) |
| | | | EMDI | Pan et al. (2011) |
| | | | RMI | Li et al. (2015) |
| | | | VAE, DAPL | Qiu et al. (2020), Qiu et al. (2018) |
| Knowledge-assisted | Optimal performance in presence of noisy data | Sub-optimal performance when data has high missing rates | iMISS | Hu et al. (2006) |
| | | | GOImpute | Tuikkala et al. (2006) |
| | | | POCSimpute | Gan et al. (2006) |
| | | | HAIimpute | Xiang et al. (2008) |

Various approaches of missing value imputation.

2.1 Global approaches

Global approaches assume homogeneity of data and use global correlation information extracted from the entire data matrix to estimate missing values. The Bayesian framework for Principal Component Analysis (BPCA) is based on a probabilistic model that can handle large variations in the expression matrix (Oba et al., 2003; Jörnsten et al., 2005; Souto, Jaskowiak and Costa, 2015). In BPCA, the missing value is replaced with a set of random values that are estimated using the Bayesian principle to obtain the relevant principal axes for regression. Singular Value Decomposition (SVD) is another global approach for missing value imputation. SVD is a matrix decomposition method for reducing a matrix to its three constituent parts (Figure 4A). A new matrix that is similar to the original matrix is reconstructed using these constituents in order to reduce noise and impute missing values (Troyanskaya et al., 2001).
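The low-rank reconstruction idea behind SVD-based imputation can be sketched as follows. This is a simplified illustration of the principle, not the exact algorithm of Troyanskaya et al. (2001); the rank and iteration count are arbitrary choices, and the missing cells are seeded with column means before the iteration:

```python
import numpy as np

def svd_impute(X, rank=1, n_iter=200):
    """Fill missing entries by iterating a rank-k SVD reconstruction."""
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)  # seed with column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # low-rank reconstruction
        filled[mask] = approx[mask]                    # update missing cells only
    return filled

# Synthetic rank-1 "expression matrix" with one value removed.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(20, 1)) @ rng.normal(size=(1, 8))
X_obs = X_true.copy()
X_obs[3, 4] = np.nan

X_rec = svd_impute(X_obs, rank=1)  # recovers the deleted entry
```

Because the synthetic matrix is exactly rank 1, the iteration converges to the original value of the deleted cell while leaving the observed entries untouched.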

FIGURE 4

Besides the techniques mentioned above, ANN-based techniques are also utilized for the imputation of missing gene expression values. ANN-based methods for imputation include ANNimpute (García-Laencina et al., 2008) and RNNimpute (Bengio and Gingras, 1995). ANNimpute utilizes a Multi-Layered Perceptron (MLP) based architecture that is trained with complete observed data (Saha et al., 2017) (Figure 4D); the final weight matrix generated through this process is then used for missing value imputation. RNNimpute is based on a Recurrent Neural Network (RNN) architecture (Bengio and Gingras, 1995) (Figure 4E). Since an RNN has feedback connections between its neurons, it can preserve long-term correlations between parameters.

2.2 Local approaches

Local approaches utilize a potential local similarity structure to estimate missing values. For heterogeneous data, the local approach is considered very effective. Many local imputation methods have been proposed since 2001; these techniques use a subset of the entire data by estimating the underlying heterogeneity. K-Nearest Neighbor (KNN) imputation is a standard ML-based missing-value imputation strategy (McNicholas and Murphy, 2010; Ryan et al., 2010; Pan et al., 2011; Dubey and Rasool, 2021) (Figure 4B). A missing value is imputed by finding the samples closest to the sample from which the gene expression value is missing. It should be noted that a lower number of neighboring points (K) may lead to overfitting of the data (Batista and Monard, 2002), whereas a higher K may result in underfitting. The Least Squares (LS) imputation technique selects a number of the most correlated genes using the L2-norm and/or Pearson’s correlation (Bo, Dysvik and Jonassen, 2004; Liew, Law and Yan, 2011; Dubey and Rasool, 2021). The Support Vector Regression (SVR) method is a non-linear generalization of the linear model used for the imputation of missing gene expression values (Wang et al., 2006; Oladejo, Oladele and Saheed, 2018) (Figure 4C). A significant advantage of the SVR model is that it requires less computational time than the other techniques mentioned above (Wang et al., 2006); however, changes in the missing data patterns and a high fraction of missing data limit its effectiveness. Gaussian Mixture Clustering (GMC) is another imputation technique, which performs well when a large fraction of the data is observed (Ouyang, Welsh and Georgopoulos, 2004).
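A minimal KNN-imputation sketch with illustrative numbers; scikit-learn's `KNNImputer` fills the gap with the mean of that feature across the K most similar samples:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Samples x genes matrix with one missing value; numbers are made up.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],   # the value to impute
    [0.9, 1.9, 2.9],
    [8.0, 9.0, 10.0],     # a distant sample that should not contribute
])

# With K=2, the two nearest rows (the first and third) donate their values
# for the missing feature, and the gap becomes their mean.
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X)
print(round(filled[1, 2], 2))  # mean of 3.0 and 2.9, i.e., 2.95
```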

Some studies have compared the performance of the global and local approaches. SVD and KNN require re-computation of the matrix for every missing value, which results in prolonged evaluation time (Aghdam et al., 2017). SVR, BPCA, and LS try to mine the hidden patterns in the data and appear to perform better than SVD and KNN (Sahu, Swarnkar and Das, 2011; Tuikkala et al., 2008; Subashini and Krishnaveni, 2011; Qiu, Zheng and Gevaert, 2020).

2.3 Hybrid approaches

The internal correlation among genes affects the homogeneity and heterogeneity of the data and, therefore, the performance of global and local imputation approaches (Liew, Law and Yan, 2011). In order to cover both homogeneous and heterogeneous data, a hybrid approach can be very effective. LinCmb (Jörnsten et al., 2005) is one such hybrid approach: it puts more weight on local imputation if the data are heterogeneous with fewer missing values, and more weight on global methods if the data are homogeneous with more missing values. LinCmb is an ensemble of the row mean, KNN, SVD, BPCA, and GMC methods, and in evaluations its performance was found to be better than that of each technique it ensembles. The Ensemble Missing Data Imputation method (EMDI) is another hybrid approach, composed of BPCA, matrix completion, and two types of LS and KNN estimators (Pan et al., 2011); it utilizes the high-level diversity of the data for imputation. Recursive Mutual Imputation (RMI) is also a hybrid approach, combining BPCA and LS to exploit the global and local structures in the dataset, respectively (Li et al., 2015). ANN-based autoencoders (AE), such as the denoising autoencoder with partial loss (DAPL) (Qiu, Zheng and Gevaert, 2018) and variational autoencoders (VAE) (Qiu, Zheng and Gevaert, 2020), consist of encoder and decoder layers. The encoder converts the input into a hidden representation, and the decoder tries to reconstruct the input from that representation; hence, an AE aims to produce output close to its input (García-Laencina et al., 2008).

2.4 Knowledge-assisted approaches

Knowledge-assisted approaches incorporate domain knowledge or external information into the imputation process. These approaches are applied when there is a high missing rate, noisy data, or a small sample size. The solution obtained through this approach depends not on the global or local correlation structure in the data but on the domain knowledge. Commonly used domain knowledge includes sample information, such as experimental conditions and clinical information, and gene information, such as gene ontology and epigenetic profiles. Integrative MISSing value estimation (iMISS) (Hu et al., 2006) is one such knowledge-assisted imputation technique. iMISS incorporates knowledge from multiple related microarray datasets: for every gene with missing data, it obtains a coherent set of neighboring genes by considering reference datasets. GOImpute (Tuikkala et al., 2006) is another knowledge-assisted imputation technique, which uses the Gene Ontology (GO) database for knowledge assistance. This method integrates semantic similarity in the GO with the expression similarity estimated by the KNN imputation algorithm. Projection Onto Convex Sets imputation (POCSimpute) (Gan, Liew and Yan, 2006) formulates every piece of prior knowledge into a corresponding convex set to capture gene-wise correlation, array-wise correlation, and known biological constraints; a convergence-guaranteed iterative procedure is then used to obtain a solution in the intersection of all these sets. HAIimpute (Xiang et al., 2008) utilizes epigenetic information, e.g., histone acetylation knowledge, for the imputation of missing values. It first uses the mean expression values of each gene from each cluster to form an expression pattern, then obtains missing values by applying linear regression as a primary imputation and KNN or LS as a secondary imputation. Since knowledge-based methods strongly rely on domain-specific knowledge, they may fail to estimate missing values in under-explored cases where little knowledge is available (Wang et al., 2019).

Although a large number of missing value imputation methods are available, there are still quite a few challenges in applying them to real data. Firstly, there is only limited knowledge of the performance of different imputation methods on different types of missing data; performance may vary significantly depending on the experimental settings. It is therefore important to systematically evaluate the existing methods across different platforms and experimental settings (Aittokallio, 2009). Secondly, despite many recent advances, better imputation algorithms that can adapt to both the global and local characteristics of the data are still needed. Thirdly, knowledge-based approaches can be hybridized with local and/or global approaches to data imputation; more sophisticated algorithms that handle this combined information may work better on datasets with a high rate of missing values and can be expected to outperform methods working on transcriptomics data alone (Liew, Law and Yan, 2011).

3 Data normalization

Once the missing values are imputed, the datasets can be subjected to downstream analysis. The efficacy of some classification methods, e.g., tree-based techniques, linear discriminant analysis, and naïve Bayes, is not affected by variability in the data. However, the performance of class comparison, class discovery, and classification methods such as KNN and SVM may be affected by technical variation in the gene expression signals. The signals may vary from sample to sample for technical reasons such as labeling efficiency, the amount of RNA, and the platform used to generate the data. It is important to reduce the variability due to technical reasons while preserving the variability due to biological reasons. This can be achieved using data normalization or scaling techniques (Brown et al., 1999) (Table 3).

TABLE 3

| Type | Advantages | Limitations | Technique | References |
|---|---|---|---|---|
| Normalization | Identifies and removes systematic variability. Increases the learning speed. | Less effective if a high number of outliers exist in the data. | Quantile | Larsen et al. (2014), Smyth and Speed (2003), Schmidt et al. (2004) |
| | | | Loess | Franks et al. (2018), Karthik and Sudha (2021), Larsen et al. (2014), Huang et al. (2018), Bolstad et al. (2003), Doran et al. (2007) |
| Data transformation | Reduces the variance and the skewness of the distribution of data points. | Data do not always approximate the log-normal distribution. | Log transformation | Pirooznia et al. (2008), Pan et al. (2002), Doran et al. (2007) |
| Standardization | Ensures feature distributions have mean = 0. Applicable to datasets with many outliers. | Less effective when the data distribution is not Gaussian, or the standard deviation is very small. | z-score | Peterson and Coleman (2008), Cheadle et al. (2003), De Guia et al. (2019), Chandrasekhar et al. (2011), Pan et al. (2002) |

List of data transformation and feature scaling techniques prior to dimensionality reduction.

Quantile normalization (Bolstad et al., 2003; Hansen, Irizarry and Wu, 2012) is a global mean or median technique utilized for the normalization of single-channel expression array data. It sorts the expression values within each sample, averages across samples at each rank, substitutes each probe intensity with the corresponding average, and then restores the original order. Low computational cost is an advantage of quantile normalization. Robust Multi-chip Average (RMA) is a commonly used technique to generate an expression matrix from Affymetrix (Gautier et al., 2004) or other oligonucleotide microarray data (Carvalho and Irizarry, 2010); RMA produces background-corrected, quantile-normalized gene expression values (Irizarry et al., 2003). Robust Spline Normalization (RSN), used for Illumina data, also makes use of quantile normalization (Du, Kibbe and Lin, 2008), as does the processing of single-color Agilent data (Smyth, 2005). Loess is a local polynomial regression-based approach that can be utilized to adjust intensity levels between two channels (Yang et al., 2002; Smyth and Speed, 2003; Bullard et al., 2010; Baans et al., 2017). Loess normalization performs local regression for each pair of arrays on the difference and average of the log-transformed intensities derived from the two channels; two-color Agilent data are normalized this way (Smyth, 2005; Du, Kibbe and Lin, 2008). Log transformation is the simplest and most common data normalization technique applied to gene expression data (Pochet et al., 2004; Li, Suh and Zhang, 2006; Aziz et al., 2017). This method does not shuffle the relative order of expression values and therefore does not affect rank-based test results. Log transformation is often applied to data already normalized by other methods such as quantile or loess normalization.
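The sort-average-restore steps of quantile normalization can be sketched directly; this is a simplified version that breaks ties by index, whereas full implementations average tied ranks:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a genes x samples matrix (simplified; ties
    are broken by index rather than by averaging tied ranks)."""
    order = np.argsort(X, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each value per sample
    reference = np.sort(X, axis=0).mean(axis=1)  # average distribution across samples
    return reference[ranks]                      # restore each sample's original order

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
```

After normalization every sample shares the same empirical distribution (the per-rank averages), which is what makes intensities comparable across arrays.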

Standardization is a normalization technique that does not bind values to a specific range. Standardization is commonly applied by subtracting the mean value from each expression value. Z-score is one of the most frequently used methods of standardization. The Z-score transformation modifies expression values such that the expression value of each gene is denoted as a unit of standard deviation from the normalized mean of zero (Cheadle et al., 2003). The standardization can also be used with the median instead of the mean (Pan, Lin and Le, 2002). The use of the median is more robust against outliers. Standardization techniques are often used for data visualization.
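A minimal sketch of z-score standardization, including the median-centred variant mentioned above (the expression values are made up):

```python
import numpy as np

# Expression values of one gene across six samples (note the outlier).
x = np.array([4.0, 6.0, 8.0, 10.0, 12.0, 40.0])

# Z-score: each value is re-expressed as its distance from the mean
# in units of standard deviation, giving mean 0 and sd 1.
z = (x - x.mean()) / x.std()

# More outlier-robust variant: centre on the median instead of the mean.
z_med = (x - np.median(x)) / x.std()
```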

Feature normalization can have positive as well as negative effects on the results of expression array analysis: it lowers the bias but also decreases the sensitivity of the analysis (Freyhult et al., 2010). Existing normalization methods for microarray gene expression data generally assume a similar global expression pattern among the samples being studied. However, global shifts in gene expression are dominant in datasets of complex diseases such as cancer, which makes this assumption invalid. It should therefore be kept in mind that normalization techniques such as RMA or loess may arbitrarily flatten the differences between sample groups, which may lead to biased gene expression estimates.

4 Feature selection and feature extraction

High-dimensional data often result in sparsity of information, which is less reliable for prediction analysis. As a result, feature selection or feature extraction techniques are typically used to find informative genes and resolve the curse of dimensionality. Dimensionality reduction not only speeds up the training process but also helps in data visualization. It is achieved either by selecting a subset of the original features (feature selection) or by transforming them into a new, smaller set of features (feature extraction), and it serves as an important step in classification and class discovery analysis. For classification, the dataset is split into training and testing sets, and feature selection/extraction is carried out only on the training set to avoid data leakage. Feature selection and extraction techniques are broadly divided into four categories: filter, wrapper, embedded, and hybrid methods (Tyagi and Mishra, 2013; Dhote, Agrawal and Deen, 2015; Almugren and Alshamlan, 2019) (Figure 5) (Table 4).

FIGURE 5

TABLE 4

| Approach | Advantages | Limitations | Feature selection techniques | References |
|---|---|---|---|---|
| Filter | Datasets are easily scalable. Simple and fast computation. Independent of the prediction outcome. Only one-time feature selection. | Ignores the interaction with the classifier. Every feature is considered separately, ignoring feature dependencies. Poorer classification performance compared to other feature selection techniques. | t-statistics (t-test) | Pan et al. (2002), Önskog et al. (2011) |
| | | | Chi-square | Dittman et al. (2010) |
| | | | ANOVA | Kumar et al. (2015) |
| | | | CFS | Al-Batah et al. (2019) |
| | | | FCFS | Yu and Liu (2003) |
| | | | WGCNA | Langfelder and Horvath (2008) |
| | | | PCA | Pochet et al. (2004) |
| | | | ICA | Zheng et al. (2006) |
| | | | LDA | Sharma et al. (2014) |
| Wrapper | Interaction between selected features and the learning model is taken into account. Considers feature dependencies. | Higher risk of overfitting compared to the filter approach. Computationally intensive. | SFS | Park et al. (2007) |
| | | | SBE | Dhote et al. (2015) |
| | | | RFE | Guyon et al. (2002) |
| | | | GA | Ram and Kuila (2019) |
| | | | ABC | Li et al. (2016) |
| | | | ACO | Alshamlan et al. (2016) |
| | | | PSO | Sahu and Mishra (2012) |
| Embedded | Requires less computation than wrapper methods. | Very specific to the learning technique. | k-means clustering | Aydadenta and Adiwijaya (2018) |
| | | | LASSO | Tibshirani (1996) |
| | | | GLASSO | Meier et al. (2008) |
| | | | SGLASSO | Ma et al. (2007) |
| | | | AE | Danaee et al. (2017) |
| | | | RF | Díaz-Uriarte and Alvarez de Andrés (2006) |
| Hybrid | Combines filter and wrapper methods. Reduces the risk of overfitting. Lower error rate. | Computationally expensive. Can be less accurate, as the filter and the wrapper are used in different steps. | SVM-RFE | Guyon et al. (2002) |
| | | | MIMAGA-Selection | Lu et al. (2017) |
| | | | Co-ABC | Alshamlan (2018) |

List of different feature selection and feature extraction techniques.

4.1 Filter approaches

The filter methods are independent of the performance of the learning algorithm. Statistical methods such as ANOVA, chi-square, t-test, etc. (Pan, Lin and Le, 2002; Saeys, Inza and Larrañaga, 2007; Land et al., 2011; Önskog et al., 2011; Kumar et al., 2015), which are often used for class comparison, are also used for feature selection in prediction analysis. The fold change or p-value is often used as a cutoff parameter for the selection of features. Correlation-based unsupervised learning algorithms are also used for the feature selection process (Figure 6A). In correlation-based feature selection (CFS), Pearson’s coefficient is utilized to compute the correlation among feature genes (Al-Batah et al., 2019). As a next step, the network of genes that has a moderate to high positive correlation with the output variable is retained. Statistical approaches have also been coupled with correlation analysis for feature selection based on the Maximum Relevance Minimum Redundancy (MRMR) principle (Radovic et al., 2017). MRMR is a filter approach that helps to achieve both high accuracy and fast speed (Ding and Peng, 2005; Abdi, Hosseini and Rezghi, 2012). The method selects genes that correlate with the condition but are dissimilar to each other. Another commonly used tool is Weighted Gene Co-expression Network Analysis (WGCNA) (Langfelder and Horvath, 2008). This approach finds correlation patterns in gene expression across samples as an absolute value of Pearson’s correlation (Langfelder and Horvath, 2008). WGCNA groups genes into clusters or modules depending on their co-expression patterns (Agrahari et al., 2018). The eigenvectors generated through clustering can be thought of as weighted average expression profiles, also called eigengenes. These eigengenes can be used to study the relationship between modules and external sample traits.
WGCNA is used more often in class comparison analysis for the identification of “hub” genes associated with a trait of interest. Another correlation-based technique, Fast Correlation Feature Selection (FCFS), utilizes a predominant correlation to identify relevant features and the redundancy among them without pairwise correlation analysis (Yu and Liu, 2003) (Figure 6B).
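The simple correlation filter described above can be sketched in a few lines. The following is a minimal illustration on synthetic data (the expression matrix, the single informative gene, and the cutoff k are arbitrary choices for illustration, not a prescribed protocol): each gene is scored by its Pearson correlation with a binary phenotype, and the genes with the largest absolute correlation are retained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 40 samples x 100 genes, binary phenotype y.
# Gene 0 is made informative by shifting its mean in the case group.
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 0] += 2.0

def gene_outcome_correlation(X, y):
    # Pearson correlation of every gene (column) with the outcome;
    # with a binary outcome this is the point-biserial correlation.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

r = gene_outcome_correlation(X, y)

# Filter step: keep the k genes most correlated (in absolute value) with y.
k = 10
selected = np.argsort(-np.abs(r))[:k]
```

In practice the cutoff would be set by a fold-change or p-value threshold rather than a fixed k, as described above.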

FIGURE 6

Entropy-based methods are supervised learning methods used for feature selection. They favor features whose values substantially reduce the uncertainty (entropy) of the class distribution across external traits. Information Gain (IG) is a commonly used entropy-based method for feature selection applied to expression array data (Nikumbh, Ghosh and Jayaraman, 2012; Bolón-Canedo et al., 2014; Ayyad, Saleh and Labib, 2019). IG first calculates the entropy of the class labels for the entire dataset. The conditional entropy of the class labels given the expression of each gene is then calculated, and the information gain of a feature is the difference between the two. All features are ranked by their information gain, and a threshold is used to select the feature genes. The information gain is provided to the modeling algorithm as heuristic knowledge.
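A hedged sketch of the information-gain calculation is shown below (synthetic data; the median split is one common but not the only way to discretize continuous expression values): a gene that tracks the class labels receives a much higher IG than a noise gene.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(expr, y):
    # IG of one gene after discretizing its expression at the median:
    # H(y) minus the expression-conditional entropy of y.
    high = expr > np.median(expr)
    h_cond = 0.0
    for part in (high, ~high):
        if part.any():
            h_cond += part.mean() * entropy(y[part])
    return entropy(y) - h_cond

rng = np.random.default_rng(1)
y = np.array([0] * 25 + [1] * 25)
noise_gene = rng.normal(size=50)
marker_gene = y + rng.normal(scale=0.2, size=50)  # tracks the class labels

ig_noise = information_gain(noise_gene, y)
ig_marker = information_gain(marker_gene, y)
```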

Feature extraction methods are multivariate in nature and are capable of extracting information from multiple feature genes. Classical Principal Component Analysis (PCA), an unsupervised linear transformation technique, has been used for dimensionality reduction (Jolliffe, 1986; Pochet et al., 2004; Ringnér, 2008; Adiwijaya et al., 2018) (Figure 6C). PCA builds a new set of variables called principal components (PCs) from the original features. To obtain principal components, PCA finds linear projections of gene expression levels with maximal variance over a training set. The PCs with the highest eigenvalues, which explain the most variance in the data, are usually selected for further analysis. Independent component analysis (ICA), another unsupervised transformation method, generates a new set of features from the original ones by assuming them to be linear mixtures of latent variables (Lee and Batzoglou, 2003; Zheng, Huang and Shang, 2006). All features generated using ICA are considered to be statistically independent and hence equally important. As a result, unlike PCA, all components from ICA are used for further analysis; however, ICA is slower than PCA (Hyvärinen, 2013). Linear Discriminant Analysis (LDA), on the other hand, is a supervised linear transformation method that takes class labels into account and maximizes the separation between classes (Guo and Tibshirani, 2007; Sharma et al., 2014) (Figure 6C). Projection vectors are generated from the original features, and those corresponding to the highest eigenvalues are used for downstream analysis. Similar to PCA, LDA uses second-order statistics. However, as compared to PCA and ICA, LDA offers faster speed and scalability.

All filter approaches (both simple filters and feature extraction methods) ignore the interaction with the classifier, which can result in poor classification performance. This limitation can be overcome by wrapper and embedded approaches.

4.2 Wrapper approaches

The wrapper approach is a feature selection approach that wraps around a specific machine learning technique used to fit the data (Figure 7). It overcomes the limitation of the filter approach by selecting a subset of features and evaluating them based on the performance of the learning algorithm. The process of feature selection repeats until the best set of features is found.

FIGURE 7

Sequential Forward Selection (SFS) is an iterative method of feature selection (Figure 7A). It calculates the performance of each feature and starts with the best performing feature. It then adds one feature with each iteration and checks the performance of the model. The set of features that produces the highest improvement is retained, and the others are discarded (Park, Yoo and Cho, 2007; Fan, Poh and Zhou, 2009). Sequential Backward Elimination (SBE), on the other hand, initiates the feature selection process by including all the features in the first iteration and removing one feature with each iteration (Figure 7B). The effect of eliminating each feature is evaluated based on the prediction performance (Guyon et al., 2002; Dhote, Agrawal and Deen, 2015). Selection or elimination of features in SFS and SBE is based on a scoring function, e.g., the p-value, r-squared, or residual sum of squares of the model, to maximize performance. A Genetic Algorithm (GA) is a stochastic and heuristic search technique used to optimize a function based on the concept of evolution in biology (Pan, Zhu and Han, 2003) (Figure 7C). Evolution works through mutation and selection processes. In GA, an Information Index Classification (IIC) value is calculated for each gene feature; the IIC value represents its prediction power. As a first step, the top gene features with high IIC values are selected for further processing. The selected feature genes are randomly assigned a binary form (0 or 1) to represent a ‘chromosome’. A set of chromosomes of the selected genes with randomly assigned 0s and 1s creates a ‘chromosome population’. The fitness of each chromosome is calculated by considering only the genes assigned a value of 1. ‘Fit’ chromosomes are selected using techniques such as roulette-wheel selection, rank selection, tournament selection, etc.
The selected set of chromosomes is subjected to crossover or mutation to generate offspring. Upon crossover and mutation, the chromosomes exchange or alter their information content. The offspring chromosomes are used for further downstream analysis (Aboudi and Benhlima, 2016; Sayed et al., 2019). There are quite a few variants of GAs to handle the feature selection problem (Liu, 2008, 2009; Ram and Kuila, 2019; Sayed et al., 2019). Other stochastic and heuristic methods include the Artificial Bee Colony (ABC) (Li, Li and Yin, 2016), Ant Colony Optimization (ACO) (Alshamlan, Badr and Alohali, 2016), and Particle Swarm Optimization (PSO) (Sahu and Mishra, 2012).
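Sequential selection is available off the shelf; the sketch below (synthetic data with two planted marker genes; the estimator, fold count, and target subset size are arbitrary illustrative choices) runs SFS with scikit-learn's `SequentialFeatureSelector`.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 30))
y = np.array([0] * 25 + [1] * 25)
X[y == 1, 0] += 2.5  # planted marker genes 0 and 1
X[y == 1, 1] += 2.5

clf = LogisticRegression(max_iter=1000)

# Sequential Forward Selection: grow the feature set one gene at a
# time, keeping the addition that most improves cross-validated score.
sfs = SequentialFeatureSelector(clf, n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
forward_genes = set(np.flatnonzero(sfs.get_support()))
```

Setting `direction="backward"` gives Sequential Backward Elimination with the same interface.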

Though the wrapper methods provide optimized prediction results as compared to the filter methods, they are computationally expensive. This limitation of wrapper methods is addressed by the embedded methods.

4.3 Embedded approaches

The embedded approaches perform feature selection as a part of the learning process and are typically specific to the learning algorithm. They combine the strengths of both wrapper and filter methods by including feature interactions at a low computational cost. The embedded approach extracts the most contributing features over iterations of training. Commonly used embedded techniques for feature selection are LASSO (Least Absolute Shrinkage and Selection Operator) and Ridge regression (Figure 8A). Both techniques are regularized versions of multiple linear regression and can be utilized for feature selection (Tibshirani, 1996). They perform feature selection by shrinking the weights of the least important features, with LASSO able to drive such weights to exactly zero (Hoffmann, 2007; Ma, Song and Huang, 2007; Meier, Van De Geer and Bühlmann, 2008; Algamal and Lee, 2015). Other than LASSO and Ridge regression, K-means clustering, Random Forest, and ANN-based techniques are also used.
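A small sketch of LASSO-based selection follows (synthetic data; the three planted informative genes and the regularization strength `alpha = 0.1` are arbitrary illustrative values): the L1 penalty drives the coefficients of uninformative genes to exactly zero, so the surviving genes are the selected features.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 80, 500
X = rng.normal(size=(n, p))
# Continuous outcome driven by only three genes.
beta = np.zeros(p)
beta[[0, 1, 2]] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=n)

# L1 regularization zeroes out most coefficients; the genes with
# nonzero coefficients form the selected feature set.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
selected = np.flatnonzero(lasso.coef_)
```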

FIGURE 8

The K-means clustering technique is an unsupervised method that is utilized to eliminate redundancy in high-dimensional gene expression data (Aydadenta and Adiwijaya, 2018) (Figure 8B). In K-means clustering, an arbitrary number K of points from the data are selected as centroids, and all the genes are allocated to the nearest centroid (MacQueen, 1967; Kanungo et al., 2002). After clustering, a scoring algorithm such as Relief (Kira and Rendell, 1992) is utilized, and high-scoring gene features of each cluster are selected for further analysis. The computational complexity of K-means is linear with respect to the number of instances, clusters, and dimensions. Though it is one of the fastest clustering techniques, it may lead to an incorrect result due to convergence to a local minimum. The Random Forest (RF) is a supervised approach applied to obtain very small sets of non-redundant genes while preserving predictive accuracy (Díaz-Uriarte and Alvarez de Andrés, 2006; Moorthy and Mohamad, 2012) (Figure 8C). RF is an ensemble of decision trees constructed by randomly selecting data samples from the original data (Breiman, 2001). The final classification is obtained by combining the results from the decision trees through voting. The bagging strategy of RF can effectively decrease the risk of overfitting when applied to high-dimensional data. RF can also capture interactions among predictor features. The prediction performance of RF is highly competitive when compared with SVM and KNN. An important limitation of RF is that a large number of trees can make the model slow and impractical for real-time predictions.
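The embedded use of Random Forest can be sketched as follows (synthetic data with one planted marker gene; impurity-based importances are used here, though permutation importances are a common alternative): genes are ranked by how much they reduce node impurity across the forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))
y = np.array([0] * 50 + [1] * 50)
X[y == 1, 0] += 2.0  # single planted informative gene

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# Rank genes by impurity-based importance; the informative gene
# dominates and the rest would be discarded in an embedded scheme.
ranked = np.argsort(-rf.feature_importances_)
top_gene = ranked[0]
```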

ANN-based autoencoders (AE) (Kramer, 1991) are an unsupervised encoder-decoder technique (Figure 8D). An autoencoder tries to make the output layer neuron values as close as possible to the input layer neurons while passing the data through lower-dimensional layers in between. AE can capture both linear and nonlinear relationships in the input information. Variants such as the Denoising Autoencoder (DAE) (Vincent and Larochelle, 2008) and Stacked Denoising Autoencoder (SDAE) (Vincent et al., 2010; Danaee, Ghaeini and Hendrix, 2017) are utilized to extract functional features from expression arrays and are capable of learning from dense networks. The Convolutional Neural Network (CNN) is another ANN-based architecture utilized for feature extraction in order to improve classification accuracy (Zeebaree, Haron and Abdulazeez, 2018; Almugren and Alshamlan, 2019) (Figure 8E). CNN can extract local features from the data (LeCun et al., 1998; O’Shea and Nash, 2015). The convolutional layer of a CNN extracts high-level features from the input values, and the pooling layer is utilized to reduce the dimensionality of the feature maps from the convolutional layer.

4.4 Hybrid approaches

A hybrid approach is a combination of two or more filter and wrapper methods. It can reduce the error rate and the risk of overfitting. A well-known hybrid feature selection approach is Recursive Feature Elimination with a linear SVM (SVM-RFE) (Guyon et al., 2002). SVM-RFE utilizes the classification capability of SVMs and recursively deletes the least significant features from the ranked list. This method has served as a benchmark feature selection method owing to its performance. However, its main disadvantages are that it ignores the correlation hidden between the features and requires high computational time (Li, Xie and Liu, 2018). A combination of mutual information maximization (MIM) and the adaptive genetic algorithm (AGA) has also been proposed for feature selection (Lu et al., 2017). MIM is able to select an advanced feature subset, and AGA speeds up the search in the identification of substantial feature subsets. This combination of methods is more efficient and robust than its individual components (Lu et al., 2017), and it streamlines the feature selection procedure without compromising classification accuracy on the reduced dataset. The MIMAGA-Selection technique can reduce datasets with up to 20,000 genes to below 300 genes with high classification accuracy. It also removes redundancy from the data and results in a lower error rate (Bolón-Canedo et al., 2014). Since it is an iterative feature reduction technique, its computational time increases with the size of the microarray dataset. Co-ABC is a hybrid approach for feature selection based on the correlation-based Artificial Bee Colony (ABC) algorithm (Alshamlan, 2018). The first step utilizes correlation-based feature selection to filter noisy and redundant genes from high-dimensionality domains, and the second step utilizes the ABC technique to select the most significant genes.
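A hedged sketch of SVM-RFE with scikit-learn is given below (synthetic data with three planted signal genes; the elimination step size and the number of retained genes are arbitrary): a linear SVM is fit, the lowest-weight genes are dropped, and the process repeats until the requested number of genes remains.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 100))
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :3] += 2.0  # genes 0-2 carry the signal

# SVM-RFE: recursively eliminate the 10% of genes with the smallest
# absolute SVM weights until 5 genes remain.
svm = LinearSVC(C=1.0, max_iter=10000)
rfe = RFE(estimator=svm, n_features_to_select=5, step=0.1)
rfe.fit(X, y)
kept = set(np.flatnonzero(rfe.support_))
```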

The feature selection or feature extraction process can generate high-quality data for classification and prediction analysis. It should be noted that for classification analysis, feature selection is carried out only on the training dataset. For clinical applications, model interpretation is important, and feature extraction techniques may make model interpretation more challenging than feature selection techniques.

5 Modeling/learning and analysis

The final step of analysis of microarray gene expression data is statistical analysis and model learning through computational techniques. The methods used for normalization, gene selection, and analysis exhibit a synergistic relationship (Önskog et al., 2011). Class comparison is one of the most common types of gene expression data analysis for the identification of differentially expressed genes (O’Connell, 2003). To solve class comparison problems, most researchers use standard statistical techniques, e.g., t-test, ANOVA, etc. (Storey and Tibshirani, 2003). Scoring enrichment techniques such as the z-score or odds ratio are hit-counting methods utilized to describe either the pathway or the functional enrichment of a gene list (Curtis, Orešič and Vidal-Puig, 2005). A higher number of hits yields a higher score and represents greater enrichment.

5.1 Classification (class prediction)

Classification is the process of sorting microarray data into categories, i.e., a systematic arrangement of sample observations into different classes, e.g., cases and controls. For classification analysis, the entire dataset is divided into two subsets, viz. training and testing. The training dataset, which typically comprises 70–80% of the samples, is used for the construction of a model. To improve the efficiency of classification, it is essential to assess the performance of models. A common way to improve the performance of a model during training is to include an additional validation subset (Refaeilzadeh, Tang and Liu, 2009). The validation dataset comprises 10–15% of the total sample observations and is used for parameter optimization; the remaining samples are used as a testing dataset (Refaeilzadeh, Tang and Liu, 2009). However, to assess the generalization ability and prevent model overfitting, k-fold cross-validation can be an effective alternative to setting aside a single validation set. Various ML algorithms have been used for classification analysis.
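The splitting scheme above can be sketched as follows (synthetic data; the 80/20 split, stratification, and 5 folds are illustrative choices): a held-out test set gives the final estimate, while k-fold cross-validation on the training set estimates generalization without sacrificing a fixed validation subset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))
y = np.array([0] * 50 + [1] * 50)
X[y == 1, 0] += 2.5  # one informative gene

# Hold out ~20% of samples for final testing; stratify to keep the
# case/control ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5-fold cross-validation on the training set only.
clf = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

clf.fit(X_train, y_train)
test_acc = clf.score(X_test, y_test)
```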

K-Nearest Neighbor (KNN) is one of the techniques that can be utilized for the classification of expression array data (Kumar et al., 2015; Ayyad, Saleh and Labib, 2019). A sample is classified by measuring its distance from all training samples using a distance metric (e.g., Euclidean distance). The performance of KNN depends on the threshold of the feature selection method and on the distance function (Deegalla and Bostr, 2007). An increase in sample size has been shown to increase the computational and time complexity of KNN (Begum, Chakraborty and Sarkar, 2015). Another classification technique for expression array data is the Nearest Shrunken Centroid (NSC) (Tibshirani et al., 2003; Dallora et al., 2017). It calculates the centroid for each class and shrinks each class centroid toward the global centroid by a threshold. A sample is classified into the class whose centroid is nearest to it based on the distance metric. This method can reduce the effects of noisy genes; however, the arbitrary choice of the shrinkage threshold is a limitation of NSC.
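Both classifiers are available in scikit-learn; a minimal sketch on synthetic data follows (k = 5 and the shrinkage threshold 0.5 are arbitrary illustrative values, and the query sample is invented to resemble a class-1 profile).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

rng = np.random.default_rng(8)
X = rng.normal(size=(80, 30))
y = np.array([0] * 40 + [1] * 40)
X[y == 1, :2] += 2.5  # two marker genes separate the classes

# KNN: label a sample by majority vote of its k closest training samples.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Shrunken centroids: per-class centroids pulled toward the overall
# centroid; shrink_threshold damps the contribution of noisy genes.
nsc = NearestCentroid(shrink_threshold=0.5).fit(X, y)

new_sample = np.zeros((1, 30))
new_sample[0, :2] = 2.5  # looks like a class-1 profile
```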

A Decision Tree (DT) (Safavian and Landgrebe, 1991) approach can also be utilized for the classification of gene expression data (Peng, Li and Liu, 2006; Krętowski and Grześ, 2007; Chen et al., 2014). The decision tree is a versatile ML technique that can perform classification as well as regression (Safavian and Landgrebe, 1991). DT requires little effort for data preparation during preprocessing. However, a slight variation in the input information can result in a significant variation in the optimal decision tree structure, and overfitting is a known limitation of DT models. Random Forest (RF) (Breiman, 2001) is another algorithm used for the classification and regression analysis of gene expression data. RF is an ensemble of decision trees (Statnikov, Wang and Aliferis, 2008; Aydadenta and Adiwijaya, 2018). While Random Forest has a lower chance of overfitting and provides more accurate results, it is computationally expensive and more difficult to interpret than a single DT.

Another technique utilized for classification analysis of expression arrays is the SVM (Brown et al., 2000; Furey et al., 2000; Ben-Hur, 2001; Abdi, Hosseini and Rezghi, 2012; Adiwijaya et al., 2018; Turgut, Dagtekin and Ensari, 2018). For complex non-linear data, higher-degree polynomial terms can be added to the cost function of the SVM. This increases the number of feature combinations; however, it reduces computation speed. To overcome this, the ‘kernel trick’ is used, which can handle complex non-linear data without explicitly adding polynomial features. Various kernel types can be used with SVM, such as linear, polynomial, radial, etc. In some studies, SVMs performed better than DT and ANN-based techniques (Önskog et al., 2011), whereas in others the performance of SVM was poor (Motieghader et al., 2017; Tabares-Soto et al., 2020).
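The effect of the kernel trick can be illustrated on a deliberately non-linear toy problem (synthetic data; the radius cutoff is arbitrary): a radial (RBF) kernel separates classes defined by distance from the origin, where a linear kernel cannot.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(9)
# A non-linear problem: class depends on the radius, not a linear cut.
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.2).astype(int)

# Compare cross-validated accuracy of linear vs RBF kernels.
linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
```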

The multilayered CNN, a deep learning algorithm typically applied where the data can be visualized as an image (Neubauer, 1998; Collobert and Weston, 2008), has also been proposed for the analysis of microarray data (Zeebaree, Haron and Abdulazeez, 2018). Each neuron scans the input matrix, and for every input, the CNN calculates a locally weighted sum and produces an output value. CNN can deal with insufficient data, involves much less preprocessing, and can produce far better results than other supervised techniques.

The performance of classification techniques can be evaluated using error-rate or accuracy parameters. Root Mean Squared Error (RMSE) and Root Relative Squared Error (RRSE) are examples of error-rate-based evaluation. The accuracy metric is the most common performance evaluation parameter utilized for classification. However, accuracy alone is not enough for performance evaluation (McNee, Riedl and Konstan, 2006; Sturm, 2013), and therefore a confusion matrix is computed. A set of predictions is compared with the actual targets to compute the confusion matrix, which tabulates true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TP, TN, FP, and FN are utilized to calculate more informative metrics such as precision, recall (sensitivity), specificity, the Matthews correlation coefficient (MCC), etc. The ROC (Receiver Operating Characteristic) curve and the Precision-Recall curve are other standard performance measures for binary classifiers. ROC and MCC are more robust measures than accuracy, since accuracy is affected by class imbalance (Chicco and Jurman, 2020).

The problem of classification of expression data is both biologically important and computationally challenging. From a computational perspective, one of the major challenges in analyzing microarray gene expression data is the small sample size. Error estimation is greatly affected by small sample sizes, and the possibility of overfitting is very high (Hambali, Oladele and Adewole, 2020). Another important issue in gene expression array data analysis is class imbalance in classification tasks. In clinical research on rare diseases, the number of case samples is generally much smaller than the number of healthy controls, which may lead to biased results. With the decreasing costs of microarray profiling and high-throughput sequencing, this challenge can be expected to be resolved in the near future.

5.2 Class discovery

The third type of microarray analysis is class discovery, which involves the analysis of a set of gene expression profiles for the discovery of novel gene regulatory networks or sample types. Hierarchical Clustering Analysis (HCA) is a simple process of sorting instances into groups with similar features and is very commonly used for the analysis of expression array data (Eisen et al., 1998). Hierarchical clustering produces a dendrogram, a binary tree structure that represents the distance relationships between clusters. HCA is a highly structured approach and the most widely used technique for expression analysis (Bouguettaya et al., 2015). However, the graphical representation of the hierarchy in HCA is complex, and a lack of robustness and inversion problems complicate its interpretation. HCA is also sensitive to small data variations. The Self-Organizing Map (SOM) is another clustering technique used for the identification of prevalent gene expression patterns and simple visualization of specific genes or pathways (Tamayo et al., 1999). SOM performs non-linear mapping of the data onto a two-dimensional map grid. Unlike HCA, SOM is less sensitive to small data variations (Nikkila et al., 2002).
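A minimal hierarchical-clustering sketch with SciPy follows (synthetic data with two well-separated sample groups; average linkage and Euclidean distance are illustrative choices): the linkage matrix encodes the dendrogram that is usually plotted next to the expression heat map.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(10)
# Two well-separated sample groups in a 20-gene expression space.
X = np.vstack([rng.normal(0, 1, size=(15, 20)),
               rng.normal(4, 1, size=(15, 20))])

# Agglomerative clustering with average linkage; Z encodes the dendrogram.
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```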

K-means is an iterative technique that minimizes the overall within-cluster dispersion. The K-means algorithm has been utilized to discover transcriptional regulatory sub-networks of yeast without any prior assumptions about their structure (Tavazoie et al., 1999). The advantage of K-means over other clustering techniques is that it can deal with entirely unstructured input data (Gentleman and Carey, 2008). However, the K-means technique easily gets trapped in a local optimum if the initial center points are selected randomly. Therefore, various modified versions of K-means are applied to converge to the global optimum (Lu et al., 2004; Nidheesh, Abdul Nazeer and Ameer, 2017; Jothi, Mohanty and Ojha, 2019).
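One standard safeguard against the local-optimum problem is restarting from several random initializations and keeping the run with the lowest within-cluster dispersion, as sketched below (synthetic data with three well-separated groups; `n_init = 10` is an arbitrary choice).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
# Three well-separated sample groups in a 10-gene space.
X = np.vstack([rng.normal(0, 1, size=(20, 10)),
               rng.normal(5, 1, size=(20, 10)),
               rng.normal(-5, 1, size=(20, 10))])

# n_init restarts K-means from several random centroid sets and keeps
# the solution with the lowest within-cluster dispersion (inertia).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
```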

Another technique for class discovery analysis is the Bayesian probabilistic framework, which uses Bayes’ theorem (Friedman et al., 2000; Baldi and Long, 2001). This technique is a good fit for the small sample sizes of microarray studies; however, it is computationally expensive for datasets with a very large number of samples and features. Nonnegative Matrix Factorization (NMF) is also a clustering technique utilized for pattern analysis of gene expression data (Kim and Tidor, 2003; Brunet et al., 2004). NMF factorizes the data into matrices with nonnegative entries and recognizes the similarity between sub-portions of the data corresponding to localized features in expression space (Kim and Park, 2007; Devarajan and Ebrahimi, 2008).

Evaluation measures for the clustering algorithms used in class discovery are of three types, viz. internal validation indices, relative validation indices, and external validation indices (Dalton, Ballarin and Brun, 2009). Internal validation indices evaluate the resulting clusters based on internal properties such as compactness, separation, and roundness; Dunn’s Index and the Silhouette Index are examples. Relative validation indices compare clusters generated by algorithms with different parameters or on subsets of the data; they can measure the stability of the technique against variations in the data, or the consistency of the results in the case of redundancy. The Figure of Merit index and the instability index are examples of relative validation indices. External validation indices compare the groups generated by the clustering technique to the actual classes of the data. Generally, external methods are considered to be better correlated with the actual error than internal and relative indices. Hubert’s correlation, the Rand statistic, the Jaccard coefficient, and the Fowlkes and Mallows index are a few examples of external evaluation parameters. Table 5 describes all the evaluation parameters discussed above.
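An internal index (silhouette) and an external index (adjusted Rand) can be computed as follows (synthetic data with known group labels, purely for illustration; in a real class discovery setting the true labels needed for external indices are usually unavailable).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(12)
# Two well-separated groups with known ground-truth labels.
X = np.vstack([rng.normal(0, 1, size=(25, 5)),
               rng.normal(6, 1, size=(25, 5))])
true_labels = np.array([0] * 25 + [1] * 25)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal index: silhouette compares within-cluster cohesion to
# nearest-cluster separation (higher is better, range [-1, 1]).
sil = silhouette_score(X, pred)

# External index: adjusted Rand compares the clustering to known
# labels, corrected for chance agreement (1.0 = perfect recovery).
ari = adjusted_rand_score(true_labels, pred)
```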

TABLE 5

| Evaluation metric | Specifics | References |
| --- | --- | --- |
| **Prediction performance evaluation parameters** | | |
| Root Mean Squared Error (RMSE) | Square root of the mean of the squared differences between predicted and actual values for each sample | Vihinen (2012); Parikh et al. (2008a); Parikh et al. (2008b); Goffinet and Wallach (1989) |
| Root Relative Squared Error (RRSE) | Normalized RMSE that enables comparison between datasets or models with different scales; the standard deviation is used for normalization | |
| Accuracy | Ability of a test to differentiate the cases and controls correctly | |
| Precision/Positive Predictive Value | Ability of a test to determine which predicted cases are true cases | |
| Sensitivity/Recall/True Positive Rate | Ability of a test to identify the cases (positive for disease) correctly | |
| Specificity/True Negative Rate | Ability of a test to identify the healthy controls correctly | |
| F1-score | Harmonic mean of precision and recall | |
| MCC | Correlation coefficient between the true and predicted values | Chicco and Jurman (2020); Matthews (1975) |
| ROC curve | Graph in which each point represents a sensitivity/specificity pair at a particular decision threshold; the area under the ROC curve measures how well a parameter distinguishes cases from controls; ROC curves should be used when the classes have roughly equal numbers of instances | Fawcett (2006); Davis and Goadrich (2006) |
| Precision-Recall curve | Graph in which each point represents a precision/sensitivity pair at a particular threshold; PR curves should be used when there is moderate to high class imbalance | Buckland and Gey (1994) |
| **Clustering performance evaluation parameters** | | |
| Dunn’s Index | Ratio between the minimum distance between two clusters and the size of the largest cluster; the larger the index, the better the clustering | Dunn (1974); Dalton, Ballarin and Brun (2009) |
| Silhouette Index | Average silhouette width of the points in a cluster; the silhouette width of a point measures its proximity to its own cluster relative to its proximity to other clusters | Rousseeuw (1987); Dalton, Ballarin and Brun (2009) |
| Figure of Merit (FOM) Index | The FOM of a feature gene is computed by clustering the samples after removing that feature and measuring the average distance between all samples and their cluster centroids; the FOM of a clustering technique is the sum of the FOM over each feature gene | Smith and Snyder (1979); Dalton, Ballarin and Brun (2009) |
| Instability Index | Disagreement between cluster labels assigned to data points, averaged over repeated random partitions of the data; the clustering method is applied to part of the dataset, and the labels obtained on that part are used to train a classifier that partitions the whole space | Guruprasad, Reddy and Pandit (1990); Dalton, Ballarin and Brun (2009) |
| Hubert’s correlation, Rand statistic, Jaccard coefficient, Fowlkes and Mallows index | These measures analyse the relationship between pairs of points using the co-occurrence matrices for the expected partition and the one generated by the clustering algorithm | Dalton, Ballarin and Brun (2009); Brun et al. (2007) |

Evaluation Parameters for analysis of microarray gene expression data.

When dealing with the very large number of gene features in expression arrays, multiple feature selection techniques are available to address the dimensionality problem. However, an elaborate study is required to identify which downstream analysis methods combine optimally with specific dimensionality reduction techniques.

6 Conclusion and future directions

In this paper, we have attempted to describe the complete pipeline for the analysis of expression arrays. Conventional ML methods for missing value imputation, dimensionality reduction, and classification analysis have achieved success. However, with an increase in data complexity, deep learning techniques may find increasing usage. Current applications of genomics in clinical research may benefit from data coming from different modalities. For gene expression analysis of complex diseases, data sparsity and class imbalance are real concerns. These issues can be addressed with recent data augmentation technology, for example, Generative Adversarial Networks (GANs) (Chaudhari, Agrawal and Kotecha, 2020). The aim of any class prediction algorithm for diagnostic applications in clinical research is not only to predict but also to disclose the reasons behind the predictions made. This understanding of the underlying mechanism, supported by evidence, makes a model interpretable. Therefore, it is important to develop interpretable models that help to understand the problem and the situations where the model may fail (Holzinger et al., 2017). Interpretation approaches such as perturbation-based, derivative-based, and local and global surrogate-based methods should receive attention to solve these problems (Ribeiro, Singh and Guestrin, 2016; Zou et al., 2019).

Statements

Author contributions

NB and SK wrote the manuscript. SK, RW, and KK outlined the manuscript. RW and KK reviewed the manuscript and inspired the overall work.

Funding

This work has been supported by the Scheme for Promotion of Academic and Research Collaboration (SPARC) 2018–19, MHRD (project no. P104). NB was supported by the Junior Research Fellowship Award 2018 by Symbiosis International Deemed University, India. Satyajeet Khare is also a beneficiary of a DST SERB SRG grant (SRG/2020/001414).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1

Abapihi, B., Mukhsar, Adhi Wibawa, G. N., Baharuddin, Lumbanraja, F. R., Faisal, M. R., et al (2021). Parameter estimation for high dimensional classification model on colon cancer microarray dataset. J. Phys. Conf. Ser. 1899 (1), 012113. 10.1088/1742-6596/1899/1/012113

  • 2

    AbbertonM.BatleyJ.BentleyA.BryantJ.CaiH.CockramJ.et al (2016). Global agricultural intensification during climate change: A role for genomics. Plant Biotechnol. J.14 (4), 10951098. 10.1111/pbi.12467

  • 3

    AbdiM. J.HosseiniS. M.RezghiM. (2012). A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification. Comput. Math. Methods Med., 320698. 10.1155/2012/320698

  • 4

    AboudiN. ElBenhlimaL. (2016). “Review on wrapper feature selection approaches,” in Proceedings - 2016 International Conference on Engineering and MIS, ICEMIS 2016 (IEEE). 10.1109/ICEMIS.2016.7745366

  • 5

    AdiwijayaA.WisestyU.KusumoD.AditsaniaA. (2018). Dimensionality reduction using Principal Component Analysis for cancer detection based on microarray data classification. J. Comput. Sci.14 (11), 15211530. 10.3844/jcssp.2018.1521.1530

  • 6

    AghdamR.BaghfalakiT.KhosraviP.Saberi AnsariE. (2017). The ability of different imputation methods to preserve the significant genes and pathways in cancer. Genomics Proteomics Bioinforma.15 (6), 396404. 10.1016/j.gpb.2017.08.003

  • 7

Agrahari, R., Foroushani, A., Docking, T. R., Chang, L., Duns, G., Hudoba, M., et al (2018). Applications of Bayesian network models in predicting types of hematological malignancies. Sci. Rep. 8 (1), 1–12. 10.1038/s41598-018-24758-5

  • 8

    AittokallioT. (2009). Dealing with missing values in large-scale studies: Microarray data imputation and beyond. Brief. Bioinform.11 (2), 253264. 10.1093/bib/bbp059

  • 9

    Al-BatahM.ZaqaibehB. M.AlomariS. A.AlzboonM. S. (2019). Gene Microarray Cancer classification using correlation based feature selection algorithm and rules classifiers. Int. J. Onl. Eng.15 (8), 6273. 10.3991/ijoe.v15i08.10617

  • 10

    AlgamalZ. Y.LeeM. H. (2015). Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst. Appl.42 (23), 93269332. 10.1016/j.eswa.2015.08.016

  • 11

    AlloulA.SpanogheJ.MachadoD.VlaeminckS. E. (2022). Unlocking the genomic potential of aerobes and phototrophs for the production of nutritious and palatable microbial food without arable land or fossil fuels. Microb. Biotechnol.15 (1), 612. 10.1111/1751-7915.13747

  • 12

    AlmugrenN.AlshamlanH. (2019). A survey on hybrid feature selection methods in microarray gene expression data for cancer classification’. IEEE Access7, 7853378548. 10.1109/ACCESS.2019.2922987

  • 13

    AlshamlanH. M.BadrG. H.AlohaliY. A. (2016). ABC-SVM: Artificial bee colony and SVM method for microarray gene selection and Multi class cancer classification. Int. J. Mach. Learn. Comput.6 (3), 184190. 10.18178/ijmlc.2016.6.3.596

  • 14

    AlshamlanH. M. (2018). Co-ABC: Correlation artificial bee colony algorithm for biomarker gene discovery using gene expression profile. Saudi J. Biol. Sci.25 (5), 895903. 10.1016/j.sjbs.2017.12.012

  • 15

    ArbitrioM.SciontiF.Di MartinoM. T.CaraccioloD.PensabeneL.TassoneP.et al (2021). Pharmacogenomics biomarker discovery and validation for translation in clinical practice. Clin. Transl. Sci.14 (1), 113119. 10.1111/cts.12869

  • 16

    AydadentaH.Adiwijaya (2018). A clustering approach for feature selection in microarray data classification using random forest. J. Inf. Process. Syst.14 (5), 11671175. 10.3745/JIPS.04.0087

  • 17

    AydilekI. B.ArslanA. (2013). A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm. Inf. Sci.233, 2535. 10.1016/j.ins.2013.01.021

  • 18

    AyyadS. M.SalehA. I.LabibL. M. (2019). Gene expression cancer classification using modified K-Nearest Neighbors technique. Biosystems.176 (12), 4151. 10.1016/j.biosystems.2018.12.009

  • 19

    AzizR.VermaC.JhaM.SrivastavaN. (2017). Artificial neural network classification of microarray data using new hybrid gene selection method. Int. J. Data Min. Bioinform.17 (1), 42. 10.1504/ijdmb.2017.084026

  • 20

    BaansO. S.HashimU.YusofN. (2017). Performance comparison of image normalisation method for DNA microarray data. Pertanika J. Sci. Technol.25 (S), 5968.

  • 21

    BaldiP.LongA. D. (2001). A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics17 (6), 509519. 10.1093/bioinformatics/17.6.509

  • 22

    BaltesN. J.VoytasD. F. (2015). Enabling plant synthetic biology through genome engineering. Trends Biotechnol.33 (2), 120131. 10.1016/j.tibtech.2014.11.008

  • 23

    BarrettT.WilhiteS. E.LedouxP.EvangelistaC.KimI. F.TomashevskyM.et al (2013). NCBI GEO: Archive for functional genomics data sets - Update. Nucleic Acids Res.41 (1), 991995. 10.1093/nar/gks1193

  • 24

    BatistaG. E.MonardM. C. (2002). A study of k-nearest neighbour as an imputation method, 112.

  • 25

    BegumS.ChakrabortyD.SarkarR. (2015). “Data classification using feature selection and kNN machine learning approach,” in 2015 International Conference on Computational Intelligence and Communication Networks (CICN) (IEEE), 69. 10.1109/CICN.2015.165

  • 26

    BehzadiP.BehzadiE.RanjbarR. (2014). The application of microarray in medicine. ORL24, 3638.

  • 27

    Ben HurA. (2001). Support vector clustering. J. Mach. Learn. Res.2, 125137.

  • 28

    BengioY.GingrasF. (1995). Recurrent neural networks for missing or asynchronous data. Adv. neural Inf. Process. Syst.8.

  • 29

    BentleyD. R.BalasubramanianS.SwerdlowH. P.SmithG. P.MiltonJ.BrownC. G.et al (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature456, 5359. 10.1038/nature07517

  • 30

    BhandariN.KhareS.WalambeR.KotechaK. (2021). Comparison of machine learning and deep learning techniques in promoter prediction across diverse species. PeerJ. Comput. Sci.7, 3655e417. 10.7717/peerj-cs.365

  • 31

    BlanchardA. P.KaiserR. J.HoodL. E. (1996). High-density oligonucleotide arrays. Biosens. Bioelectron.11 (6/7), 687690. 10.1016/0956-5663(96)83302-1

  • 32

    BoT. H.DysvikB.JonassenI. (2004). LSimpute: Accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res.32 (3), e34e38. 10.1093/nar/gnh026

  • 33

    Bolón-CanedoV.Sanchez-MaronoN.Alonso-BetanzosA.BenitezJ.HerreraF. (2014). A review of microarray datasets and applied feature selection methods. Inf. Sci.282, 111135. 10.1016/j.ins.2014.05.042

  • 34

    BolstadB. M.IrizarryR. A.AstrandM.SpeedT. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics19 (2), 185193. 10.1093/bioinformatics/19.2.185

  • 35

    BouguettayaA.YuQ.LiuX.ZhouX.SongA. (2015). Efficient agglomerative hierarchical clustering. Expert Syst. Appl.42 (5), 27852797. 10.1016/j.eswa.2014.09.054

  • 36

    BrazmaA.ParkinsonH.SarkansU.ShojatalabM.ViloJ.AbeygunawardenaN.et al (2003). ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Res.31 (1), 6871. 10.1093/nar/gkg091

  • 37

    BreimanL.SooK. (2001). Random forests. Mach. Learn.45 (1), 117127. 10.1007/978-3-662-56776-0_10

  • 38

    BrownM. P. S.GrundyW. N.LinD.CristiaNiNiN.SugnetC. W.FureyT. S.et al (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. U. S. A.97 (1), 262267. 10.1073/pnas.97.1.262

  • 39

    BrownM. P. S.SlonimD.ZhuQ. (1999). Support vector machine classification of microarray gene expression data. Santa Cruz: University of California, 2528. Technical Report UCSC-CRL-99-09.

  • 40

    BrunM.SimaC.HuaJ.LoweyJ.CarrollB.SuhE.et al (2007). Model-based evaluation of clustering validation measures. Pattern Recognit. DAGM.40, 807824. 10.1016/j.patcog.2006.06.026

  • 41

    BrunetJ. P.TamayoP.GolubT. R.MesirovJ. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. U. S. A.101 (12), 41644169. 10.1073/pnas.0308531101

  • 42

    BucklandM.GeyF. (1994). The relationship between recall and precision. J. Am. Soc. Inf. Sci.45 (1), 1219. 10.1002/(sici)1097-4571(199401)45:1<12:aid-asi2>3.0.co;2-l

  • 43

    BullardJ. H.PurdomE.DudoitS. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments James. BMC Bioinforma.11 (94), 113. 10.1186/1471-2105-11-94

  • 44

Carvalho, B. S., Irizarry, R. A. (2010). A framework for oligonucleotide microarray preprocessing. Bioinformatics 26 (19), 2363–2367. 10.1093/bioinformatics/btq431

  • 45

    ChandrasekharT.ThangaveK.SathishkumarE. N. (2013). “Unsupervised gene expression data using enhanced clustering method,” in 2013 IEEE International Conference on Emerging Trends in Computing, Communication and Nanotechnology, ICE-CCN 2013 (IEEE), 518522. 10.1109/ICE-CCN.2013.6528554

  • 46

    ChandrasekharT.ThangavelK.ElayarajaE. (2011). Effective clustering algorithms for gene expression data. Int. J. Comput. Appl.32 (4), 2529.

  • 47

    ChaudhariP.AgrawalH.KotechaK. (2020). Data augmentation using MG-GAN for improved cancer classification on gene expression data. Soft Comput.24 (15), 1138111391. 10.1007/s00500-019-04602-2

  • 48

    CheadleC.VawterM. P.FreedW. J.BeckerK. G. (2003). Analysis of microarray data using Z score transformation. J. Mol. Diagn.5 (2), 7381. 10.1016/S1525-1578(10)60455-2

  • 49

    ChenJ. J.WangS. J.TsaiC. A.LinC. J. (2007). Selection of differentially expressed genes in microarray data analysis. Pharmacogenomics J.7, 212220. 10.1038/sj.tpj.6500412

  • 50

    ChenK. H.WangK. J.TsaiM. L.WangK. M.AdrianA. M.ChengW. C.et al (2014). Gene selection for cancer identification: A decision tree model empowered by particle swarm optimization algorithm. BMC Bioinforma.15 (1), 499. 10.1186/1471-2105-15-49

  • 51

    ChenY.LiY.NarayanR.SubramanianA.XieX. (2016). Gene expression inference with deep learning. Bioinformatics32 (12), 18321839. 10.1093/bioinformatics/btw074

  • 52

Chen, Z., Dodig-Crnkovic, T., Schwenk, J. M., Tao, S. C. (2018). Current applications of antibody microarrays. Clin. Proteomics 15 (1), 7–15. 10.1186/s12014-018-9184-2

  • 53

    ChiccoD.JurmanG. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics21 (1), 613. 10.1186/s12864-019-6413-7

  • 54

    CollobertR.WestonJ. (2008) ‘A unified architecture for natural language processing: Deep neural networks with multitask learning’, in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 160167.

  • 55

    CurtisR. K.OrešičM.Vidal-PuigA. (2005). Pathways to the analysis of microarray data. Trends Biotechnol.23 (8), 429435. 10.1016/j.tibtech.2005.05.011

  • 56

    DalloraA. L.EivazzadehS.MendesE.BerglundJ.AnderbergP. (2017). Machine learning and microsimulation techniques on the prognosis of dementia: A systematic literature review. PLoS ONE12 (6), e0179804e0179823. 10.1371/journal.pone.0179804

  • 57

    DaltonL.BallarinV.BrunM. (2009). Clustering algorithms: On learning, validation, performance, and applications to genomics. Curr. Genomics10 (6), 430445. 10.2174/138920209789177601

  • 58

    DanaeeP.GhaeiniR.HendrixD. A. (2017). “A deep learning approach for cancer detection and relevant gene identification,” in Pacific Symposium on Biocomputing 2017 Biocomputing, 219229. 10.1142/9789813207813_0022

  • 59

    DavisJ.GoadrichM. (2006) ‘The relationship between precision-recall and ROC curves’, In Proceedings of the 23rd international conference on Machine learning, 233240.

  • 60

    DayanP. (1996). Unsupervised learning. The MIT Encyclopedia of the Cognitive Sciences.

  • 61

    De GuiaJ. M.DevarajM.VeaL. A. (2019). “Cancer classification of gene expression data using machine learning models,” in 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control (Environment and Management, HNICEM 2018. IEEE). 10.1109/HNICEM.2018.8666435

  • 62

    DeegallaS.BostrH. (2007). “Classification of microarrays with kNN : Comparison of dimensionality reduction,” in International Conference on Intelligent Data Engineering and Automated Learning (Springer-Verlag), 800809.

  • 63

    DengL.YuD. (2014). “Deep learning: Methods and applications,” in Foundations and Trends® in signal processing, 198349.

  • 64

    DevarajanK.EbrahimiN. (2008). Class discovery via nonnegative matrix factorization. Am. J. Math. Manag. Sci.28 (3–4), 457467. 10.1080/01966324.2008.10737738

  • 65

    DhoteY.AgrawalS.DeenA. J. (2015). “A survey on feature selection techniques for internet traffic classification,” in Proceedings - 2015 International Conference on Computational Intelligence and Communication Networks (CICN 2015. IEEE), 13751380. 10.1109/CICN.2015.267

  • 66

    Díaz-UriarteR.Alvarez de AndrésS. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinforma.7 (3), 313. 10.1186/1471-2105-7-3

  • 67

    DickS. (2019). Artificial intelligence. Harv. Data Sci. Rev.1 (1), 17. 10.4324/9780203772294-10

  • 68

    DingC.PengH. (2005). Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol.3 (2), 185205. 10.1142/s0219720005001004

  • 69

    DittmanD. J.WaldR.HulseJ. (2010). “Comparative analysis of DNA microarray data through the use of feature selection techniques,” in Proceedings - 9th International Conference on Machine Learning and Applications (ICMLA 2010. IEEE), 147152. 10.1109/ICMLA.2010.29

  • 70

    DoranM.RaicuD. S.FurstJ. D.SettimiR.SchipMaM.ChandlerD. P. (2007). Oligonucleotide microarray identification of Bacillus anthracis strains using support vector machines. Bioinformatics23 (4), 487492. 10.1093/bioinformatics/btl626

  • 71

    DuP.KibbeW. A.LinS. M. (2008). lumi: A pipeline for processing Illumina microarray. Bioinformatics24 (13), 15471548. 10.1093/bioinformatics/btn224

  • 72

Dubey, A., Rasool, A. (2021). Efficient technique of microarray missing data imputation using clustering and weighted nearest neighbour. Sci. Rep. 11 (1), 24297–24312. 10.1038/s41598-021-03438-x

  • 73

    DudoitS.FridlyannndJ. (2005). “Classification in microarray experiments,” in A practical approach to microarray data analysis, 132149. 10.1007/0-306-47815-3_7

  • 74

    DunnJ. C. (1974). Well-separated clusters and optimal fuzzy partitions. J. Cybern.4 (1), 95104. 10.1080/01969727408546059

  • 75

    EidJ.FehrA.GrayJ.LuongK.LyleJ.OttoG.et al (2009). Real-time DNA sequencing from single polymerase molecules. Science323, 133138. 10.1126/science.1162986

  • 76

    EisenM. B.SpellmanP. T.BrownP. O.BotsteinD. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U. S. A.95, 1486314868. 10.1073/pnas.95.25.14863

  • 77

    EisensteinM. (2012). Oxford Nanopore announcement sets sequencing sector abuzz’. Nat. Biotechnol.30 (4), 295296. 10.1038/nbt0412-295

  • 78

    FanL.PohK. L.ZhouP. (2009). ‘A sequential feature extraction approach for naïve bayes classification of microarray data’. Expert Syst. Appl.36, 99199923. 10.1016/j.eswa.2009.01.075

  • 79

    FarswanA.GuptaA.GuptaR.KaurG. (2020). Imputation of gene expression data in blood cancer and its significance in inferring biological pathways. Front. Oncol.9, 14421514. 10.3389/fonc.2019.01442

  • 80

    FawcettT. (2006). An introduction to ROC analysis. Pattern Recognit. Lett.27, 861874. 10.1016/j.patrec.2005.10.010

  • 81

    Fernandez-CastilloE.Barbosa-SantillanL. I.Falcon-MoralesL.Sanchez-EscobarJ. J. (2022). Deep splicer: A CNN model for splice site prediction in genetic sequences. Genes13 (5), 907. 10.3390/genes13050907

  • 82

    Fernández-DelgadoM.SirsatM. S.CernadasE.AlawadiS.BarroS.Febrero-BandeM. (2019). An extensive experimental survey of regression methods. Neural Netw.111, 1134. 10.1016/j.neunet.2018.12.010

  • 83

    FranksJ. M.CaiG.WhitfieldM. L. (2018). Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data. Bioinformatics34 (11), 18681874. 10.1093/bioinformatics/bty026

  • 84

    FreyhultE.LandforsM.OnskogJ.HvidstenT. R.RydenP. (2010). Challenges in microarray class discovery: A comprehensive examination of normalization, gene selection and clustering. BMC Bioinforma.11 (1), 503514. 10.1186/1471-2105-11-503

  • 85

    FriedmanN.LinialM.NachmanI.Pe'erD. (2000). Using Bayesian networks to analyze expression data. J. Comput. Biol.7 (3–4), 601620. 10.1089/106652700750050961

  • 86

    FrommletF.SzulcP.KonigF.BogdanM. (2022). Selecting predictive biomarkers from genomic data. Plos One17 (6), e0269369. 10.1371/journal.pone.0269369

  • 87

    FureyT. S.CristiaNiNiN.DuffyN.BednarskiD. W.SchuMMerM.HausslerD. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics16 (10), 906914. 10.1093/bioinformatics/16.10.906

  • 88

    GanX.LiewA. W. C.YanH. (2006). Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Res.34 (5), 16081619. 10.1093/nar/gkl047

  • 89

    García-LaencinaP. J.Sancho-GómezJ. L.Figueiras-VidalA. R. (2008). “Machine learning techniques for solving classification problems with missing input data,” in Proceedings of the 12th World Multi-Conference on Systems, Cybernetics and Informatics, 16.

  • 90

    GautierL.CopeL.BolstadB. M.IrizarryR. A. (2004). Affy - analysis of Affymetrix GeneChip data at the probe level. Bioinformatics20 (3), 307315. 10.1093/bioinformatics/btg405

  • 91

    GentlemanR.CareyV. J. (2008). “Unsupervised machine learning”, in Bioconductor case studies (New York: Springer), 137157. 10.1007/978-0-387-77240-0_7

  • 92

    GoffinetB.WallachD. (1989). Mean squared error of prediction as a criterion for evaluating and comparing system models. Ecol. Model.44, 299306. 10.1016/0304-3800(89)90035-5

  • 93

    GuoY.TibshiraniR. (2007). Regularized linear discriminant analysis and its application in microarrays. Biostatistics8 (1), 86100. 10.1093/biostatistics/kxj035

  • 94

    GuruprasadK.ReddyB. V. B.PanditM. W. (1990). Correlation between stability of a protein and its dipeptide composition: A novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng.4 (2), 155161. 10.1093/protein/4.2.155

  • 95

    GuyonI.MatinN.VapnikV. (1996). Discovering informative patterns and data cleaning, 145156.

  • 96

    GuyonI.WestonJ.VapnikV. (2002). Gene selection for cancer classification using support vector machines. Mach. Learn. (46), 6272. 10.1007/978-3-540-88192-6-8

  • 97

    HambaliM. A.OladeleT. O.AdewoleK. S. (2020). Microarray cancer feature selection: Review, challenges and research directions. Int. J. Cognitive Comput. Eng.1 (11), 7897. 10.1016/j.ijcce.2020.11.001

  • 98

    HansenK. D.IrizarryR. A.WuZ. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics13 (2), 204216. 10.1093/biostatistics/kxr054

  • 99

    HarrisT. D.BuzbyP. R.BabcockH.BeerE.BowersJ.BraslavskyI.et al (2008). Single-molecule DNA sequencing of a viral genome. Science320 (5872), 106109. 10.1126/science.1150427

  • 100

    HijikataA.KitamuraH.KimuraY.YokoyamaR.AibaY.BaoY.et al (2007). Construction of an open-access database that integrates cross-reference information from the transcriptome and proteome of immune cells. Bioinformatics23 (21), 29342941. 10.1093/bioinformatics/btm430

  • 101

    HoffmannR. (2007). Text mining in genomics and proteomics. Fundam. Data Min. Genomics Proteomics9780387475, 251274. 10.1007/978-0-387-47509-7_12

  • 102

Holzinger, A., Biemann, C., Kell, D. (2017). What do we need to build explainable AI systems for the medical domain? arXiv preprint arXiv:1712.09923, 1–28.

  • 103

    HuJ.LiH.WatermanM. S.ZhouX. J. (2006). Integrative missing value estimation for microarray data. BMC Bioinforma.7, 449514. 10.1186/1471-2105-7-449

  • 104

    HuangC.ClaytonE. A.MatyuninaL. V.McDonaldL. D.BenignoB. B.VannbergF.et al (2018). Machine learning predicts individual cancer patient responses to therapeutic drugs with high accuracy. Sci. Rep.8 (1), 1644416449. 10.1038/s41598-018-34753-5

  • 105

    HuangH. J.CampanaR.AkinfenwaO.CurinM.SarzsinszkyE.KarsonovaA.et al (2021). Microarray-based allergy diagnosis: Quo vadis?Front. Immunol.11, 594978595015. 10.3389/fimmu.2020.594978

  • 106

    HyvärinenA. (2013). Independent component analysis: Recent advances. Philos. Trans. A Math. Phys. Eng. Sci.371, 20110534. 10.1098/rsta.2011.0534

  • 107

    IrizarryR. A.HobbsB.CollinF.Beazer-BarclayY. D.AntonellisK. J.ScherfU.et al (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics4, 249264. 10.1093/biostatistics/4.2.249

  • 108

    JaggaZ.GuptaD. (2015). Machine learning for biomarker identification in cancer research - developments toward its clinical application. Per. Med.12 (6), 371387. 10.2217/pme.15.5

  • 109

    JenikeM. A.AlbertM. S. (1984). The dexamethasone suppression test in patients with presenile and senile dementia of the Alzheimer’s type. J. Am. Geriatr. Soc.32 (6), 441444. 10.1111/j.1532-5415.1984.tb02220.x

  • 110

    JolliffeI. T. (1986). Principal component analysis. New York: Springer.

  • 111

    JörnstenR.WangH. Y.WelshW. J.OuyangM. (2005). DNA microarray data imputation and significance analysis of differential expression. Bioinformatics21 (22), 41554161. 10.1093/bioinformatics/bti638

  • 112

    JothiR.MohantyS. K.OjhaA. (2019). DK-Means: A deterministic K-means clustering algorithm for gene expression analysis. Pattern Anal. Appl.22 (2), 649667. 10.1007/s10044-017-0673-0

  • 113

    KangM.JamesonN. J. (2018). ‘Machine learning: Fundamentals’. Prognostics Health Manag. Electron., 85109. 10.1002/9781119515326.ch4

  • 114

    KanungoT.MountD.NetanyahuN.PiatkoC.SilvermanR.WuA. (2002). An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell.24 (7), 881892. 10.1109/tpami.2002.1017616

  • 115

    KarthikS.SudhaM. (2018). A survey on machine learning approaches in gene expression classification in modelling computational diagnostic system for complex diseases. Int. J. Eng. Adv. Technol.8 (2), 182191.

  • 116

    KarthikS.SudhaM. (2021). Predicting bipolar disorder and schizophrenia based on non-overlapping genetic phenotypes using deep neural network. Evol. Intell.14 (2), 619634. 10.1007/s12065-019-00346-y

  • 117

    KhatriP.SirotaM.ButteA. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Comput. Biol.8 (2), e1002375. 10.1371/journal.pcbi.1002375

  • 118

    KiaD. A.ZhangD.GuelfiS.ManzoniC.HubbardL.ReynoldsR. H.et al (2021). Identification of candidate Parkinson disease genes by integrating genome-wide association study, expression, and epigenetic data sets. JAMA Neurol.78 (4), 464472. 10.1001/jamaneurol.2020.5257

  • 119

    KimH.ParkH. (2007). Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics23 (12), 14951502. 10.1093/bioinformatics/btm134

  • 120

    KimP.TidorB. (2003). Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res.13 (7), 17061718. 10.1101/gr.903503

  • 121

    KiraK.RendellL. A. (1992). “A practical approach to feature selection, machine learning,” in Proceedings of the Ninth International Workshop (ML92) (Burlington, Massachusetts: Morgan Kaufmann Publishers, Inc). 10.1016/B978-1-55860-247-2.50037-1

  • 122

    KodamaY.MashimaJ.KosugeT.OgasawaraO. (2019). DDBJ update: The Genomic Expression Archive (GEA) for functional genomics data. Nucleic Acids Res.47 (1), D69D73. 10.1093/nar/gky1002

  • 123

    KongW.MouX.HuX. (2011). Exploring matrix factorization techniques for significant genes identification of Alzheimer’s disease microarray gene expression data. BMC Bioinforma.12 (5), 79. 10.1186/1471-2105-12-S5-S7

  • 124

    KongW.VanderburgC. R.GunshinH.RogersJ. T.HuangX. (2008). A review of independent component analysis application to microarray gene expression data. BioTechniques45 (5), 501520. 10.2144/000112950

  • 125

    KotsiantisS.KanellopoulosD. (2006). Association rules mining: A recent overview. Science32 (1), 7182.

  • 126

    KotsiantisS. (2007). Supervised machine learning: A review of classification techniques. Informatica31, 249268. 10.1007/s10751-016-1232-6

  • 127

    KramerM. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE J.37 (2), 233243. 10.1002/aic.690370209

  • 128

    KrętowskiM.GrześM. (2007). Decision tree approach to microarray data analysis. Biocybern. Biomed. Eng.27 (3), 2942.

  • 129

    KumarM.RathN. K.SwainA.RathS. K. (2015). Feature selection and classification of microarray data using MapReduce based ANOVA and K-nearest neighbor. Procedia Comput. Sci.54, 301310. 10.1016/j.procs.2015.06.035

  • 130

    LaiY. H.ChenW. N.HsuT. C.LinC.TsaoY.WuS. (2020). Overall survival prediction of non-small cell lung cancer by integrating microarray and clinical data with deep learning. Sci. Rep.10 (1), 46794711. 10.1038/s41598-020-61588-w

  • 131

    LakiotakiK.VorniotakisN.TsagrisM.GeorgakopoulosG.TsamardinosI. (2018). BioDataome: A collection of uniformly preprocessed and automatically annotated datasets for data-driven biology. Database (Oxford).2018, 114. 10.1093/database/bay011

  • 132

    LandW. H.QiaoX.MargolisD. E.FordW. S.PaquetteC. T.Perez-RogersJ. F.et al (2011). Kernelized partial least squares for feature reduction and classification of gene microarray data. BMC Syst. Biol.5, S13. 10.1186/1752-0509-5-S3-S13

  • 133

    LangfelderP.HorvathS. (2008). Wgcna: An R package for weighted correlation network analysis. BMC Bioinforma.9, 559. 10.1186/1471-2105-9-559

  • 134

    LarsenM. J.ThomassenM.TanQ.SorensenK. P.KruseT. A. (2014). Microarray-based RNA profiling of breast cancer: Batch effect removal improves cross-platform consistency. Biomed. Res. Int.2014, 651751. 10.1155/2014/651751

  • 135

    LazarC.GattoL.FerroM.BruleyC.BurgerT. (2016). Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies. J. Proteome Res.15 (4), 11161125. 10.1021/acs.jproteome.5b00981

  • 136

    LeCunY.BengioY.HintonG. (2015). Deep learning. Nature13 (1), 436444. 10.1038/nature14539

  • 137

    LeCunY.BottouL.BengioY.HaffnerP. (1998). Gradient-based learning applied to document recognition. Proc. IEEE86 (11), 22782324. 10.1109/5.726791

  • 138

    LeeS.BatzoglouS. (2003). Application of independent component analysis to microarrays. Genome Biol.4 (11), R76R21. 10.1186/gb-2003-4-11-r76

  • 139

Li, E., Luo, T., Wang, Y. (2019). Identification of diagnostic biomarkers in patients with gestational diabetes mellitus based on transcriptome gene expression and methylation correlation analysis. Reprod. Biol. Endocrinol. 17 (1), 112. 10.1186/s12958-019-0556-x

  • 140

    LiH.ZhaoC.ShaoF.LiG. Z.WangX. (2015). A hybrid imputation approach for microarray missing value estimation. BMC Genomics16 (9), 111. 10.1186/1471-2164-16-S9-S1

  • 141

    LiW.SuhY. J.ZhangJ. (2006). “Does logarithm transformation of microarray data affect ranking order of differentially expressed genes?,” in Conf. Proc. IEEE Eng. Med. Biol. Soc., 65936596. 10.1109/IEMBS.2006.260896

  • 142

    LiX.LiM.YinM. (2016). Multiobjective ranking binary artificial bee colony for gene selection problems using microarray datasets. IEEE/CAA J. Autom. Sin., 116. 10.1109/JAS.2016.7510034

  • 143

    LiZ.XieW.LiuT. (2018). Efficient feature selection and classification for microarray data. PLoS ONE13 (8), 02021677e202221. 10.1371/journal.pone.0202167

  • 144

    LiewA. W. C.LawN. F.YanH. (2011). Missing value imputation for gene expression data: Computational techniques to recover missing data from available information. Brief. Bioinform.12 (5), 498513. 10.1093/bib/bbq080

  • 145

    LiuY.-C.ChengC.-P.TsengV. S. (2011). Discovering relational-based association rules with multiple minimum supports on microarray datasets. Bioinformatics27 (22), 31423148. 10.1093/bioinformatics/btr526

  • 146

    LiuY. (2008). Detect key gene information in classification of microarray data. EURASIP J. Adv. Signal Process., 612397. 10.1155/2008/612397

  • 147

    LiuY. (2009). Prominent feature selection of microarray data. Prog. Nat. Sci.19 (10), 13651371. 10.1016/j.pnsc.2009.01.014

  • 148

    LiuZ.SokkaT.MaasK.OlsenN. J.AuneT. M. (2009). Prediction of disease severity in patients with early rheumatoid arthritis by gene expression profiling. Hum. Genomics Proteomics.1 (1), 484351. 10.4061/2009/484351

  • 149

    LoveM. I.HuberW.AndersS. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol.15 (12), 550621. 10.1186/s13059-014-0550-8

  • 150

    LuH.XieR. D.LinR.ZhangC.XiaoX. J.LiL. J.et al (2017). Vitamin D-deficiency induces eosinophil spontaneous activation. Cell. Immunol.256, 5663. 10.1016/j.cellimm.2017.10.003

  • 151

    LuY.LuS.DengY. (2004). Fgka: A fast genetic K-means clustering algorithm. Proc. ACM Symposium Appl. Comput.1, 622623. 10.1145/967900.968029

  • 152

    MaS.SongX.HuangJ. (2007). Supervised group Lasso with applications to microarray data analysis. BMC Bioinforma.8, 6017. 10.1186/1471-2105-8-60

  • 153

Mack, C., Su, Z., Westreich, D. (2018). Managing missing data in patient registries: Addendum to registries for evaluating patient outcomes: A user’s guide.

  • 154

    MacQueenJ. (1967). “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 281297. 10.1007/s11665-016-2173-6

  • 155

    ManikandanG.AbiramiS. (2018). “A survey on feature selection and extraction techniques for high-dimensional microarray datasets,” in Knowledge computing and its applications (Springer Singapore), 311333.

  • 156

    MarguliesM.EgholmM.AltmanW. E.AttiyaS.BaderJ. S.BembenL. A.et al (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature437 (7057), 376380. 10.1038/nature03959

  • 157

    MatthewsB. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta405 (2), 442451. 10.1016/0005-2795(75)90109-9

  • 158

    McNee, S. M., Riedl, J., Konstan, J. A. (2006). "Being accurate is not enough: How accuracy metrics have hurt recommender systems," in Conference on Human Factors in Computing Systems - Proceedings, 1097–1101. 10.1145/1125451.1125659

  • 159

    McNicholas, P. D., Murphy, T. B. (2010). Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26 (21), 2705–2712. 10.1093/bioinformatics/btq498

  • 160

    Meier, L., Van De Geer, S., Bühlmann, P. (2008). The group lasso for logistic regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (1), 53–71. 10.1111/j.1467-9868.2007.00627.x

  • 161

    Micheuz, P. (2020). "Approaches to artificial intelligence as a subject in school education," in Open Conference on Computers in Education (Cham: Springer), 3–13.

  • 162

    Moorthy, K., Jaber, A. N., Ismail, M. A., Ernawan, F., Mohamad, M. S., Deris, S. (2019). Missing-values imputation algorithms for microarray gene expression data. Methods Mol. Biol., 255–266. 10.1007/978-1-4939-9442-7_12

  • 163

    Moorthy, K., Mohamad, M. S. (2012). Random forest for gene selection and microarray data classification. Bioinformation 7 (3), 142–146. 10.6026/97320630007142

  • 164

    Morais-Rodrigues, F., Silvério-Machado, R., Kato, R. B., Rodrigues, D. L. N., Valdez-Baez, J., Fonseca, V., et al. (2020). Analysis of the microarray gene expression for breast cancer progression after the application modified logistic regression. Gene 726, 144168. 10.1016/j.gene.2019.144168

  • 165

    Motieghader, H., Najafi, A., Sadeghi, B., Masoudi-Nejad, A. (2017). A hybrid gene selection algorithm for microarray cancer classification using genetic algorithm and learning automata. Inf. Med. Unlocked 9, 246–254. 10.1016/j.imu.2017.10.004

  • 166

    Neubauer, C. (1998). Evaluation of convolutional neural networks for visual recognition. IEEE Trans. Neural Netw. 9 (4), 685–696. 10.1109/72.701181

  • 167

    Nguyen, N. G., Tran, V. A., Ngo, D. L., Phan, D., Lumbanraja, F. R., Faisal, M. R., et al. (2016). DNA sequence classification by convolutional neural network. J. Biomed. Sci. Eng. 9 (5), 280–286. 10.4236/jbise.2016.95021

  • 168

    Nidheesh, N., Abdul Nazeer, K. A., Ameer, P. M. (2017). An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data. Comput. Biol. Med. 91, 213–221. 10.1016/j.compbiomed.2017.10.014

  • 169

    Nikkila, J., Toronen, P., Kaski, S., Venna, J., Castren, E., Wong, G. (2002). Analysis and visualization of gene expression data using Self-Organizing Maps. Neural Netw. 15, 953–966. 10.1016/s0893-6080(02)00070-9

  • 170

    Nikumbh, S., Ghosh, S., Jayaraman, V. K. (2012). "Biogeography-based informative gene selection and cancer classification using SVM and Random Forests," in 2012 IEEE Congress on Evolutionary Computation (Brisbane, QLD: CEC 2012), 1–6. 10.1109/CEC.2012.6256127

  • 171

    Oba, S., Sato, M. A., Takemasa, I., Monden, M., Matsubara, K.-i., Ishii, S. (2003). A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19 (16), 2088–2096. 10.1093/bioinformatics/btg287

  • 172

    O’Connell, M. (2003). Differential expression, class discovery and class prediction using S-PLUS and S+ArrayAnalyzer. SIGKDD Explor. Newsl. 5 (2), 38–47. 10.1145/980972.980979

  • 173

    Oladejo, A. K., Oladele, T. O., Saheed, Y. K. (2018). Comparative evaluation of linear support vector machine and K-nearest neighbour algorithm using microarray data on leukemia cancer dataset. Afr. J. Comput. ICT 11 (2), 1–10.

  • 174

    Önskog, J., Freyhult, E., Landfors, M., Ryden, P., Hvidsten, T. R. (2011). Classification of microarrays; synergistic effects between normalization, gene selection and machine learning. BMC Bioinforma. 12, 390. 10.1186/1471-2105-12-390

  • 175

    O’Shea, K., Nash, R. (2015). An introduction to convolutional neural networks, 1–11. arXiv preprint, arXiv:1511.

  • 176

    Ouyang, M., Welsh, W. J., Georgopoulos, P. (2004). Gaussian mixture clustering and imputation of microarray data. Bioinformatics 20 (6), 917–923. 10.1093/bioinformatics/bth007

  • 177

    Pan, H., Zhu, J., Han, D. (2003). Genetic algorithms applied to multi-class clustering for gene expression data. Genomics Proteomics Bioinforma. 1 (4), 279–287. 10.1016/S1672-0229(03)01033-7

  • 178

    Pan, W., Lin, J., Le, C. T. (2002). Model-based cluster analysis of microarray gene-expression data. Genome Biol. 3 (2), research0009. 10.1186/gb-2002-3-2-research0009

  • 179

    Pan, X., Tian, Y., Huang, Y., Shen, H. B. (2011). Towards better accuracy for missing value estimation of epistatic miniarray profiling data by a novel ensemble approach. Genomics 97 (5), 257–264. 10.1016/j.ygeno.2011.03.001

  • 180

    Pan, X., Yan, J. (2017). Attention based convolutional neural network for predicting RNA-protein binding sites. arXiv preprint, arXiv:1712, 8–11.

  • 181

    Parihar, A., Mondal, S., Singh, R. (2022). "Introduction, scope, and applications of biotechnology and genomics for sustainable agricultural production," in Plant genomics for sustainable agriculture. Editor Lakhan, R. (Springer), 1–14. 10.1007/978-981-16-6974-3

  • 182

    Parikh, R., Andjelković Apostolović, M., Stojanović, D. (2008a). Understanding and using sensitivity, specificity and predictive values. Indian J. Ophthalmol. 56 (1), 341–350. 10.4103/0301-4738.41424

  • 183

    Parikh, R., Mathai, A., Parikh, S., Chandra Sekhar, G., Thomas, R. (2008b). Understanding and using sensitivity, specificity and predictive values. Indian J. Ophthalmol. 56 (1), 45–50. 10.4103/0301-4738.37595

  • 184

    Park, C., Ha, J., Park, S. (2020). Prediction of Alzheimer’s disease based on deep neural network by integrating gene expression and DNA methylation dataset. Expert Syst. Appl. 140, 112873. 10.1016/j.eswa.2019.112873

  • 185

    Park, H.-S., Yoo, S.-H., Cho, S.-B. (2007). Forward selection method with regression analysis for optimal gene selection in cancer classification. Int. J. Comput. Math. 84 (5), 653–667. 10.1080/00207160701294384

  • 186

    Pease, A. C., Solas, D., Sullivan, E. J. (1994). "Light-generated oligonucleotide arrays for rapid DNA sequence analysis," in Proceedings of the National Academy of Sciences of the United States of America, 5022–5026. 10.1073/pnas.91.11.5022

  • 187

    Peng, J., Guan, J., Shang, X. (2019). Predicting Parkinson’s disease genes based on node2vec and autoencoder. Front. Genet. 10, 226. 10.3389/fgene.2019.00226

  • 188

    Peng, Y., Li, W., Liu, Y. (2006). A hybrid approach for biomarker discovery from microarray gene expression data for cancer classification. Cancer Inf. 2. 10.1177/117693510600200024

  • 189

    Peterson, L. E., Coleman, M. A. (2008). Machine learning-based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research. Int. J. Approx. Reason. 47 (1), 17–36. 10.1016/j.ijar.2007.03.006

  • 190

    Pirooznia, M., Yang, J. Y., Yang, M. Q., Deng, Y. (2008). A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics 9 (1), S13. 10.1186/1471-2164-9-S1-S13

  • 191

    Pochet, N., De Smet, F., Suykens, J. A. K., De Moor, B. L. R. (2004). Systematic benchmarking of microarray data classification: Assessing the role of non-linearity and dimensionality reduction. Bioinformatics 20 (17), 3185–3195. 10.1093/bioinformatics/bth383

  • 192

    Prasanna, K., Seetha, M., Kumar, A. P. S. (2014). "CApriori: Conviction based Apriori algorithm for discovering frequent determinant patterns from high dimensional datasets," in 2014 International Conference on Science Engineering and Management Research, ICSEMR 2014 (IEEE). 10.1109/ICSEMR.2014.7043622

  • 193

    Qiu, Y. L., Zheng, H., Gevaert, O. (2018). A deep learning framework for imputing missing values in genomic data. bioRxiv, 406066.

  • 194

    Qiu, Y. L., Zheng, H., Gevaert, O. (2020). Genomic data imputation with variational auto-encoders. GigaScience 9, giaa082. 10.1093/gigascience/giaa082

  • 195

    Quackenbush, J. (2001). Computational analysis of microarray data. Nat. Rev. Genet. 2, 418–427. 10.1038/35076576

  • 196

    Radovic, M., Ghalwash, M., Filipovic, N., Obradovic, Z. (2017). Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinforma. 18 (1), 9. 10.1186/s12859-016-1423-9

  • 197

    Ram, P. K., Kuila, P. (2019). Feature selection from microarray data: Genetic algorithm based approach. J. Inf. Optim. Sci. 40 (8), 1599–1610. 10.1080/02522667.2019.1703260

  • 198

    Refaeilzadeh, P., Tang, L., Liu, H. (2009). Cross-validation. Encycl. Database Syst. 5, 532–538. 10.1007/978-0-387-39940-9_565

  • 199

    Rhoads, A., Au, K. F. (2015). PacBio sequencing and its applications. Genomics Proteomics Bioinforma. 13 (5), 278–289. 10.1016/j.gpb.2015.08.002

  • 200

    Ribeiro, M. T., Singh, S., Guestrin, C. (2016). "“Why should I trust you?”: Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, 1135–1144. 10.1145/2939672.2939778

  • 201

    Ringnér, M. (2008). What is principal component analysis? Nat. Biotechnol. 26 (3), 303–304. 10.1038/nbt0308-303

  • 202

    Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., et al. (2015). Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43 (7), e47. 10.1093/nar/gkv007

  • 203

    Robinson, M. D., McCarthy, D. J., Smyth, G. K. (2009). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26 (1), 139–140. 10.1093/bioinformatics/btp616

  • 204

    Rothberg, J. M., Hinz, W., Rearick, T. M., Schultz, J., Mileski, W., Davey, M., et al. (2011). An integrated semiconductor device enabling non-optical genome sequencing. Nature 475 (7356), 348–352. 10.1038/nature10242

  • 205

    Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65. 10.1016/0377-0427(87)90125-7

  • 206

    Rubin, D. B. (1976). Inference and missing data. Biometrika 63 (3), 581–592. 10.1093/biomet/63.3.581

  • 207

    Ryan, C., Greene, D., Cagney, G., Cunningham, P. (2010). Missing value imputation for epistatic MAPs. BMC Bioinforma. 11, 197. 10.1186/1471-2105-11-197

  • 208

    Saeys, Y., Inza, I., Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics 23 (19), 2507–2517. 10.1093/bioinformatics/btm344

  • 209

    Safavian, S. R., Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Trans. Syst. Man. Cybern. 21 (3), 660–674. 10.1109/21.97458

  • 210

    Saha, S., Ghosh, A., Dey, K. (2017). "An ensemble based missing value estimation in DNA microarray using artificial neural network," in Proceedings - 2016 2nd IEEE International Conference on Research in Computational Intelligence and Communication Networks (Kolkata, India: ICRCICN 2016), 279–284. 10.1109/ICRCICN.2016.7813671

  • 211

    Sahu, B., Mishra, D. (2012). A novel feature selection algorithm using particle swarm optimization for cancer microarray data. Procedia Eng. 38, 27–31. 10.1016/j.proeng.2012.06.005

  • 212

    Sahu, M. A., Swarnkar, M. T., Das, M. K. (2011). Estimation methods for microarray data with missing values: A review. Int. J. Comput. Sci. Inf. Technol. 2 (2), 614–620.

  • 213

    Sanger, F., Nicklen, S., Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U. S. A. 74 (12), 5463–5467. 10.1073/pnas.74.12.5463

  • 214

    Sayed, S., Nassef, M., Badr, A., Farag, I. (2019). A Nested Genetic Algorithm for feature selection in high-dimensional cancer Microarray datasets. Expert Syst. Appl. 121, 233–243. 10.1016/j.eswa.2018.12.022

  • 215

    Schafer, J. L., Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychol. Methods 7 (2), 147–177. 10.1037/1082-989X.7.2.147

  • 216

    Schmidt, L. J., Murillo, H., Tindall, D. J. (2004). Gene expression in prostate cancer cells treated with the dual 5 alpha-reductase inhibitor dutasteride. J. Androl. 25 (6), 944–953. 10.1002/j.1939-4640.2004.tb03166.x

  • 217

    Segundo-Val, I. S., Sanz-Lozano, C. S. (2016). Introduction to the gene expression analysis. Methods Mol. Biol. 1434, 29–43. 10.1007/978-1-4939-3652-6_3

  • 218

    Sharma, A., Paliwal, K. K., Imoto, S., Miyano, S. (2014). A feature selection method using improved regularized linear discriminant analysis. Mach. Vis. Appl. 25, 775–786. 10.1007/s00138-013-0577-y

  • 219

    Sharma, A., Rani, R. (2021). A systematic review of applications of machine learning in cancer prediction and diagnosis. Arch. Comput. Methods Eng. 28, 4875–4896. 10.1007/s11831-021-09556-z

  • 220

    Shendure, J., Porreca, G. J., Reppas, N. B., Lin, X., McCutcheon, J. P., Rosenbaum, A. M., et al. (2005). Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728–1732. 10.1126/science.1117389

  • 221

    Smith, G. S., Snyder, R. L. (1979). FN: A criterion for rating powder diffraction patterns and evaluating the reliability of powder-pattern indexing. J. Appl. Crystallogr. 12, 60–65. 10.1107/s002188987901178x

  • 222

    Smyth, G. K., Speed, T. (2003). Normalization of cDNA microarray data. Methods 31 (4), 265–273. 10.1016/s1046-2023(03)00155-5

  • 223

    Smyth, G. K. (2005). limma: Linear models for microarray data. Bioinforma. Comput. Biol. Solutions Using R Bioconductor, 397–420. 10.1007/0-387-29362-0_23

  • 224

    Souto, M. C. P. D., Jaskowiak, P. A., Costa, I. G. (2015). Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinforma. 16, 64. 10.1186/s12859-015-0494-3

  • 225

    Statnikov, A., Wang, L., Aliferis, C. F. (2008). A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinforma. 9, 319. 10.1186/1471-2105-9-319

  • 226

    Storey, J., Tibshirani, R. (2003). "Statistical methods for identifying differentially expressed genes in DNA microarrays," in Methods in molecular biology (Totowa, NJ: Humana Press), 149–157.

  • 227

    Sturm, B. L. (2013). Classification accuracy is not enough: On the evaluation of music genre recognition systems. J. Intell. Inf. Syst. 41, 371–406. 10.1007/s10844-013-0250-y

  • 228

    Subashini, P., Krishnaveni, M. (2011). "Imputation of missing data using Bayesian principal component analysis on TEC ionospheric satellite dataset," in Canadian Conference on Electrical and Computer Engineering (IEEE), 001540–001543. 10.1109/CCECE.2011.6030724

  • 229

    Tabares-Soto, R., Orozco-Arias, S., Romero-Cano, V., Segovia Bucheli, V., Rodriguez-Sotelo, J. L., Jimenez-Varon, C. F. (2020). A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data. PeerJ Comput. Sci. 6, e270. 10.7717/peerj-cs.270

  • 230

    Tamayo, P., Slonim, D., Zhu, Q. (1999). "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation," in Proceedings of the National Academy of Sciences of the United States of America, 2907–2912. 10.1073/pnas.96.6.2907

  • 231

    Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., Church, G. M. (1999). Systematic determination of genetic network architecture. Nat. Genet. 22 (3), 281–285. 10.1038/10343

  • 232

    Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. (2003). Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat. Sci. 18 (1), 104–117. 10.1214/ss/1056397488

  • 233

    Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58 (1), 267–288. 10.1111/j.2517-6161.1996.tb02080.x

  • 234

    Tomczak, K., Czerwińska, P., Wiznerowicz, M. (2015). The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. 1, A68–A77. 10.5114/wo.2014.47136

  • 235

    Toro-Domínguez, D., Lopez-Dominguez, R., Garcia Moreno, A., Villatoro-Garcia, J. A., Martorell-Marugan, J., Goldman, D., et al. (2019). Differential treatments based on drug-induced gene expression signatures and longitudinal systemic lupus erythematosus stratification. Sci. Rep. 9 (1), 15502. 10.1038/s41598-019-51616-9

  • 236

    Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17 (6), 520–525. 10.1093/bioinformatics/17.6.520

  • 237

    Tuikkala, J., Elo, L. L., Nevalainen, O. S., Aittokallio, T. (2008). Missing value imputation improves clustering and interpretation of gene expression microarray data. BMC Bioinforma. 9, 202. 10.1186/1471-2105-9-202

  • 238

    Tuikkala, J., Elo, L., Nevalainen, O. S., Aittokallio, T. (2006). Improving missing value estimation in microarray data with gene ontology. Bioinformatics 22 (5), 566–572. 10.1093/bioinformatics/btk019

  • 239

    Turgut, S., Dagtekin, M., Ensari, T. (2018). "Microarray breast cancer data classification using machine learning methods," in 2018 Electric Electronics, Computer Science, Biomedical Engineerings’ Meeting, EBBT 2018 (IEEE), 1–3. 10.1109/EBBT.2018.8391468

  • 240

    Tyagi, V., Mishra, A. (2013). A survey on different feature selection methods for microarray data analysis. Int. J. Comput. Appl. 67 (16), 36–40. 10.5120/11482-7181

  • 241

    Uhl, M., Tran, V. D., Heyl, F., Backofen, R. (2021). RNAProt: An efficient and feature-rich RNA binding protein binding site predictor. GigaScience 10, giab054. 10.1093/gigascience/giab054

  • 242

    Umarov, R. K., Solovyev, V. V. (2017). Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS ONE 12 (2), e0171410. 10.1371/journal.pone.0171410

  • 243

    Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., et al. (2008). A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res. 18 (7), 1051–1063. 10.1101/gr.076463.108

  • 244

    Vihinen, M. (2012). How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics 13, S2. 10.1186/1471-2164-13-S4-S2

  • 245

    Vincent, P., Larochelle, H., Lajoie, I. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408.

  • 246

    Vincent, P., Larochelle, H. (2008). "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning, 1096–1103.

  • 247

    Vo, A. H., Van Vleet, T. R., Gupta, R. R., Liguori, M. J., Rao, M. S. (2020). An overview of machine learning and big data for drug toxicity evaluation. Chem. Res. Toxicol. 33 (1), 20–37. 10.1021/acs.chemrestox.9b00227

  • 248

    Wang, A., Chen, Y., An, N., Yang, J., Li, L., Jiang, L. (2019). Microarray missing value imputation: A regularized local learning method. IEEE/ACM Trans. Comput. Biol. Bioinform. 16 (3), 980–993. 10.1109/TCBB.2018.2810205

  • 249

    Wang, X., Li, A., Jiang, Z., Feng, H. (2006). Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme. BMC Bioinforma. 7, 32. 10.1186/1471-2105-7-32

  • 250

    Winston, P. H. (1992). Artificial intelligence. Addison-Wesley Longman Publishing Co., Inc.

  • 251

    Xiang, Q., Dai, X., Deng, Y., He, C., Wang, J., Feng, J., et al. (2008). Missing value imputation for microarray gene expression data using histone acetylation information. BMC Bioinforma. 9, 252. 10.1186/1471-2105-9-252

  • 252

    Yang, Y., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., et al. (2002). Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30 (4), e15. 10.1093/nar/30.4.e15

  • 253

    Yip, W., Amin, S. B., Li, C. (2011). "A survey of classification techniques for microarray data analysis," in Handbook of statistical bioinformatics (Berlin, Heidelberg: Springer), 193–223. 10.1007/978-3-642-16345-6_10

  • 254

    Yu, L., Liu, H. (2003). "Feature selection for high-dimensional data: A fast correlation-based filter solution," in Proceedings, Twentieth International Conference on Machine Learning, 856–863.

  • 255

    Yuxi, L., Schukat, M., Howley, E. (2018). Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 426–440. 10.1007/978-3-319-56991-8_32

  • 256

    Zeebaree, D. Q., Haron, H., Abdulazeez, A. M. (2018). "Gene selection and classification of microarray data using convolutional neural network," in International Conference on Advanced Science and Engineering (ICOASE) (IEEE), 145–150. 10.1109/ICOASE.2018.8548836

  • 257

    Zhang, X., Jonassen, I., Goksøyr, A. (2021). Machine learning approaches for biomarker discovery using gene expression data. Bioinformatics, 53–64.

  • 258

    Zhang, Y., Yang, Y., Wang, C., Wan, S., Yao, Z., Zhang, Y. (2020). Identification of diagnostic biomarkers of osteoarthritis based on multi-chip integrated analysis and machine learning. DNA Cell Biol. 39, 2245–2256. 10.1089/dna.2020.5552

  • 259

    Zheng, C. H., Huang, D. S., Shang, L. (2006). Feature selection in independent component subspace for microarray data classification. Neurocomputing 69, 2407–2410. 10.1016/j.neucom.2006.02.006

  • 260

    Zou, J., Huss, M., Abid, A., Mohammadi, P., Torkamani, A., Telenti, A. (2019). A primer on deep learning in genomics. Nat. Genet. 51 (1), 12–18. 10.1038/s41588-018-0295-5

Keywords

gene expression, microarray, machine learning, deep learning, missing value imputation, feature selection, interpretation, explainable techniques

Citation

Bhandari N, Walambe R, Kotecha K and Khare SP (2022) A comprehensive survey on computational learning methods for analysis of gene expression data. Front. Mol. Biosci. 9:907150. doi: 10.3389/fmolb.2022.907150

Received

29 March 2022

Accepted

28 September 2022

Published

07 November 2022

Volume

9 - 2022

Edited by

Deepak Kumar Jain, Chongqing University of Posts and Telecommunications, China

Reviewed by

Sameet Mehta, Yale University, United States

Stephen R. Piccolo, Brigham Young University, United States

Copyright

*Correspondence: Rahee Walambe, ; Ketan Kotecha, ; Satyajeet P. Khare,

This article was submitted to Molecular Diagnostics and Therapeutics, a section of the journal Frontiers in Molecular Biosciences

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
