Unified Transcriptomic Signature of Arbuscular Mycorrhiza Colonization in Roots of Medicago truncatula by Integration of Machine Learning, Promoter Analysis, and Direct Merging Meta-Analysis

Plant root symbiosis with Arbuscular mycorrhizal (AM) fungi improves uptake of water and mineral nutrients, improving plant development under stressful conditions. Unraveling the unified transcriptomic signature of a successful colonization provides a better understanding of symbiosis. We developed a framework for finding the transcriptomic signature of Arbuscular mycorrhiza colonization and its regulating transcription factors in roots of Medicago truncatula. Expression profiles of roots in response to AM species were collected from four separate studies and were combined by direct merging meta-analysis. Batch effect, the major concern in expression meta-analysis, was reduced by three normalization steps: Robust Multi-array Average algorithm, Z-standardization, and quartiling normalization. Then, expression profile of 33685 genes in 18 root samples of Medicago as numerical features, as well as study ID and Arbuscular mycorrhiza type as categorical features, were mined by seven models: RELIEF, UNCERTAINTY, GINI INDEX, Chi Squared, RULE, INFO GAIN, and INFO GAIN RATIO. In total, 73 genes selected by machine learning models were up-regulated in response to AM (Z-value difference > 0.5). Feature weighting models also documented that this signature is independent from study (batch) effect. The AM inoculation signature obtained was able to differentiate efficiently between AM inoculated and non-inoculated samples. The AP2 domain class transcription factor, GRAS family transcription factors, and cyclin-dependent kinase were among the highly expressed meta-genes identified in the signature. We found high correspondence between the AM colonization signature obtained in this study and independent RNA-seq experiments on AM colonization, validating the repeatability of the colonization signature. Promoter analysis of upregulated genes in the transcriptomic signature led to the key regulators of AM colonization, including the essential transcription factors for endosymbiosis establishment and development such as NF-YA factors. The approach developed in this study offers three distinct novel features: (I) it improves direct merging meta-analysis by integrating supervised machine learning models and normalization steps to reduce study-specific batch effects; (II) seven attribute weighting models assessed the suitability of each gene for the transcriptomic signature which contributes to robustness of the signature (III) the approach is justifiable, easy to apply, and useful in practice. Our integrative framework of meta-analysis, promoter analysis, and machine learning provides a foundation to reveal the transcriptomic signature and regulatory circuits governing Arbuscular mycorrhizal symbiosis and is transferable to the other biological settings.


INTRODUCTION
Arbuscular mycorrhiza (AM) fungal symbiosis expands the surface area of plant root, allowing for better absorption of substances such as phosphorus, ammonium, and zinc from soil. This symbiosis supports plant development, particularly under nutrient deficiency and other stressful conditions. Specific genetic programs activated by AM inoculation lead to successful microsymbiont colonization and functional symbiosis. Most studies in AM symbiosis are limited to the investigation of a single gene or a cluster of similar genes. Genes such as DMI1, DMI2, NFP, NSP1 (Oláh et al., 2005), MtBcp1 (Hohnjec et al., 2005), ENOD11 (Genre et al., 2005), MIG1 (Heck et al., 2016), RAM1 (Rich et al., 2017), nfr1, nfr5, lys11 (Rasmussen et al., 2016), and NIN (Guillotin et al., 2016) are reported to play roles in the formation of mycorrhizal symbiosis.
The regulatory mechanisms underpinning AM symbiosis in plants are poorly understood. The GRAS transcription factor family contains the best known regulators of AM symbiosis. The function of ATA/RAM1, a member of this family, in reprogramming AM symbiosis has been established (Rich et al., 2017). It has been suggested that RAM1 controls the expression of many essential AM-related genes such as STR, STR2, RAM2, and PT4 (Rich et al., 2017). Another member of the GRAS transcription factor family, MIG1, interacts with DELLA1 and the root GA signaling pathway to regulate cortical cell expansion in developing AM symbiosis (Heck et al., 2016). The role of small RNAs, such as miR171 in establishment of AM symbiosis has also been investigated recently (Couzigou et al., 2017).
Successful AM colonization is vital to establish symbiosis and improve phosphorous and water uptake. The AM type, as well as many, environmental and genetic factors affect the intensity, timing, and the success of AM colonization. Cross-comparison of successful colonization between different AM types in a range of experiments by meta-analysis provides the opportunity to move toward understanding the genetic basis of endosymbiosis (Tromas et al., 2012), the conserved transcriptomic program that can reflect successful AM colonization and establishment. Those genes can unravel the functional groups that may play key roles in the establishment and functioning of the three AM symbioses. The transcriptomic signature of AM colonization can be further employed for: (1) increasing AM efficiency by application of chemical and environmental treatments, (2) monitoring successful/unsuccessful AM colonization, and (3) finding the upstream regulatory mechanisms and regulators such as transcription factors and microRNAs that control AM colonization and symbiosis.
However, no attempt has been made to identify the unified transcriptomic signature of AM symbiosis. The term of "Unified transcriptomic signature" or "biosignature" refers to robust transcript responses that can monitor the successful AM colonization. Overlaps observed in transcriptional profiles of Medicago truncatula roots inoculated with two different Glomus fungi (Hohnjec et al., 2005) support the possibility of achieving a unified transcriptomic signature of AM colonization to provide an insight into the genetic program activated during AM.
The emerging field of meta-analysis may solve the issue of merging different experiments to identify a unique biosignature of Medicago root response to AM inoculation. Cross-species meta-analysis of transcriptomic data has received increased attention in recent years due to the advances in pattern discovery and meta-analysis models (Tromas et al., 2012;Farhadian et al., 2018b). Meta-analysis enables the combination of expression datasets and is highly advantageous in increasing statistical power to detect biological phenomena from studies with a restricted sample size (Johnson et al., 2007).
The biosignature of AM inoculation obtained may be utilized to further computational systems biology analysis, such as promoter analysis, common regulator discovery, and common target discovery, in order to lead us to the key regulators and targets of the AM symbiosis pathway.
Different statistical methods have been developed for metaanalysis of expression data such as combining effect sizes, combining ranks, combining p-values, vote counting, and direct merging (DM) (Borenstein et al., 2009(Borenstein et al., , 2010Campain and Yang, 2010;Chang et al., 2013;Sharifi et al., 2018). Within meta-analysis approaches, DM analysis of expression data or genomic variant data of different studies is an attractive meta-analysis method to increase statistical power and lead to a robust transcriptomic or genomic signature (Tseng et al., 2012). DM, as a metaanalysis approach, has been used in web-tools such as INMEX (Xia et al., 2013(Xia et al., , 2015, A-MADMAN (Bisognin et al., 2009), WGAS (Dai et al., 2007), and GEOSS (Bisognin et al., 2009) for integrative meta-analysis of expression data. DM-based metaanalysis provides the possibility of data collection from different experiments, even when a treatment or a control is missing in one or more experiments. This contributes to a higher statistical power of meta-analysis.
The major concern about the DM approach is heterogenicity across studies. The success of the DM approach depends on normalization across studies to reduce non-biological experimental variation as well as biological variations unrelated to treatment (also called batch effects or study effects) (Johnson et al., 2007;Tseng et al., 2012). Collection of arrays from similar platforms across all studies (mainly Affymetrix) and preprocessing of the CEL expression files by model-based robust multi-array (RMA) normalization  have been suggested to decrease heterogenicity across all studies (Lee et al., 2008;Sims et al., 2008;Tseng et al., 2012). However, it has been debated that RMA is not strong enough to remove batch effects (Guerra and Goldstein, 2009). To sufficiently reduce batch effects for accurate DM, additional normalization techniques such as empirical Bayes methods (Johnson et al., 2007), crossplatform normalization (Shabalin et al., 2008), weighted distance weighted discrimination (Qiao et al., 2010), enrichment-based meta-analysis, and Ratio adjustment and calibration scheme (Cheng et al., 2009) have been used.
Recent advances in application of supervised machine learning models in transcriptomic studies have opened a new venue to engage data mining models in decreasing batch effects and integration of different studies (Pashaiasl et al., 2016a,b). Supervised machine learning has brought new possibilities to predictive studies (Bakhtiarizadeh et al., 2014a;Ebrahimi et al., 2014;Zinati et al., 2014;Kargarfard et al., 2015;Pashaiasl et al., 2016a,b). The capability to simultaneously analyse both categorical and numerical features, power to analyse large data, and various predictive algorithms with diverse statistical backgrounds are distinguished features of supervised machine learning models (Shekoofa et al., 2014;Ebrahimi et al., 2015;Jamali et al., 2016). The possibility to include the categorical variables in predictive models can outstandingly decrease the heterogenicity across studies as the batch effects (Shekoofa et al., 2014). For example, in this study, the different experiments or types of AM can be added as variables and analyzed in the predictive model of the AM transcriptomic signature. This possibility is highly limited in traditional multivariate or regression models.
Due to the central role of colonization in establishing a microsymbiont, we developed a framework for finding the transcriptomic signature of successful AM colonization on roots of Medicago truncatula by integration of meta-analysis and machine learning (attribute weighting) models. Special attention was paid to reducing the batch effects by utilizing normalization methods and finding reliable gene candidates by machine learning models. The genes discovered in the transcriptomic signature were further used as the input of promoter analysis to identify the transcription factors which regulate the signature.

METHODS
A flowchart of the integrative computational systems biological approach employed in this study is presented in Figure 1.

Data Collection for Meta-Analysis
Studies on the AM transcriptome were identified in repositories of high-throughput expression data such as NCBI GEO (https:// www.ncbi.nlm.nih.gov/geo/) and ArrayExpress (https://www. ebi.ac.uk/arrayexpress/). Supplementary Table 1 presents the list of the studies mined and their platforms. The Microarray studies belonged to Medicago truncatula A17.
The microarray experiment (Floss et al., 2017) was originally designed to compare gene expression in roots of Medicago truncatula A17 and Medicago truncatula mutant mtpt4-1 colonized with Gigaspora gigantea. We only used data from three independent biological replicate samples of wild type plants colonized with Gigaspora gigantea for meta-analysis. Samples were harvested 18-day post planting and 11 days post contact with the spores.
In the original experiment (Truong et al., 2015), the impact of P limitation and both P and N limitation on Medicago truncatula A17 root transcriptome in response to Rhizophagus irregularis (previously known as Glomus intraradices) were investigated. In the original experiment, the root transcriptome of both wild type plants and a hypermycorrhizal mutant (B9) grown on limiting or non-limiting phosphate were analyzed to determine which processes were in the hypermycorrhizal mutant. Plants were harvested 4 weeks after inoculation. From this experiment, only data of mycorrhizied wild type plants colonized with Rhizophagus irregularis and grown under P limitation were used for our meta-analysis study.
In the experiment of Hogekamp et al. (2011), gene expression profiles of roots of Medicago truncatula A17 in response to colonization by two different arbuscular mycorrhizal fungi (Rhizophagus irregularis and Glomus mosseae) as well as P treatment with phosphate were studied. From this experiment, data of two groups of samples were used for meta-analysis; data of inoculated plants and non-inoculated plants under P limitation. Non-inoculated plants were used as control.
CEL (expression intensity) files of these studies were downloaded from NCBI GEO databank and their corresponding library (CDF) and annotation (CSV) files from the Affymetrix FTP repository by Affymetrix Expression Console Software (version: 1.3.1.187, https://www.affymetrix.com/).  quartiling/scaling normalization for reducing batch effects before combining samples for supervised machine learning.

RMA Normalization of Samples in Each Study (Within Study Normalization)
CEL files of Affymetrix arrays in each study were normalized by an RMA algorithm  using Affymetrix Expression Console Software (version: 1.3.1.187).

Z-Value Standardization
Z-standardization has been extensively used in meta-analysis (Lipsey and Wilson, 2001;Kinoshita and Obayashi, 2009). RMA normalized expression values of each samples were converted to Z-value by subtracting the mean and dividing by the standard deviation using Minitab 17 (www.minitab.com/).

Between Sample Normalization by Scaling and Quartiling
To unify the RMA-normalized and Z-standardized values, we used an additional normalization step. Here, we evaluated the efficiency of the scaling and quartiling approach  in reducing the batch effects, using CLC Genomics Workbench (QIAGEN, https://www.qiagenbioinformatics.com/ products/clc-genomics-workbench/). Supplementary Figure 1 shows the pseudo code for scaling and quartiling approach. In the Scaling approach (Supplementary Figure 1A), the sets of the expression values for the samples were multiplied by a constant so that the sets of normalized values for the samples have the same 'target' value. The target (normalization) value was defined as Median mean/Median median of all samples. The Mean and Median are the types of normalization value of the samples to ensure that they are equal for the normalized expression values. In quartiling approach (Supplementary Figure 1B), the empirical distributions of the sets of expression values for the samples were used to calculate a common target distribution, which was used to calculate normalized sets of expression values for the samples. Here, the term of empirical distribution refers to real (empirical) statistical characteristics of samples to be used for calculation of normalization values.

Application of Seven Supervised Machine Learning Models to Find the Medicago Response Genes Distinguishing AM Colonized From Non-colonized Symbiosis
At first, a cleaning step was performed and the probsets with no gene annotation, or the ones which matched to multiple genes were removed. Then, the expressions of 33685 probsets/genes in AM colonization and non-colonization conditions, as numerical features, were mined by seven attribute weighting (feature selection) models. Also, study number and type of AM (Gigaspora gigantean, Rhizophagus irregularis, Glomus mosseae, or none) were added to the dataset as categorical features. Consequently, a dataset of 33687 (33685 gene probes + type of AM + Study ID) and 18 records (samples), belonging to two categories of AM-inoculated and non-inoculated (label variable), were used for machine learning. The selected feature selection models were able to analyse both categorical and numerical features simultaneously. This provided the opportunity to assess batch effects.
Feature selection models identify the most important genes whose AM expression differs between colonized and noncolonized symbioses. The resulting weights of each feature selection model were normalized into the interval between 0 and 1 to provide the similar significance across various feature selection models. Weights closer to 1 show a higher relevance (importance) of a particular gene in distinguishing AM inoculated from non-inoculated roots, according the employed feature selection model. The genes determined to be important by most of the feature selection models (intersection of weighting methods with various statistical backgrounds) with cut-off ≥ 0.95 were assumed to be the key distinguishing genes to form the biosignature. The employed feature selection models were: RELIEF, UNCERTAINTY, GINI INDEX, CHI SQUARED, RULE, INFO GAIN RATIO, and INFO GAIN.
RELIEF is a classification attribute weighting model, independent from Heuristic search and is considered to be one of the most successful models for evaluating the quality of features because of its simplicity and efficiency. RELIEF is a robust noise-tolerant model able to feature interactions where it employs the random selection of instances for weight estimation (Kira and Rendell, 1992;Rosario and Thangadurai, 2015). RELIEF estimates the relevance of attributes (genes + study number + AM type) according to how well their values discriminate between the instances of the same and different classes of label (AM colonization/non-colonization) that are near each other .
UNCERTAINTY measures the weight of attributes (genes + study number + AM type) against the label attribute (AM colonization/non-colonization) by estimating the symmetrical uncertainty with respect to the class (Liang, 2011).
GINI INDEX attribute weighting algorithm evaluates the weight of attributes (genes + study number + AM type) by computing the Gini index of the class distribution (AM colonization/non-colonization) and is a measure of data impurity (Lerman and Yitzhaki, 1984;. CHI SQUARED attribute weighting model evaluates the importance of attributes (genes + study number + AM type) with respect to the label attribute (AM colonization/noncolonization) based on chi squared statistic .
INFO GAIN model calculates the relevance of attributes (genes + study number + AM type) by measuring the Information Gain in class distribution (AM colonization/noncolonization) (Guyon and Elisseeff, 2003). INFO GAIN is suitable for datasets such as the expression of genes where attributes cannot take a large number of distinct values.
INFO GAIN RATIO uses information Gain Ratio for feature selection. This model is a modified version of INFO GAIN that biases against considering attributes with a large number of distinct values (Zinati et al., 2014). RULE attribute weighting model estimates the weights of attributes (genes + study number + AM type) with respect to the label attribute (AM colonization/non-colonization) by constructing a single rule for each attribute and calculating the error (Liu and Motoda, 2012).

Multivariate Analysis of the Developed AM Transcriptomic Signature
After developing the AM transcriptomic signature by integration of meta-analysis and machine learning, clustering based on the Average Linkage method and Euclidean distance measure, as well as cross validation based on Discriminant (modeling) analysis, was used to evaluate the power of the emergent AM transcriptomic signature for discrimination of AM-inoculated from non-inoculated samples. For clustering, the expression values of genes which formed the transcriptome biosignatures were standardized. The multivariate analyses mentioned were performed using Minitab 17 (www.minitab.com/).

Gene Ontology (GO) Analysis
For a better understanding of the biological importance of the identified AM transcriptomic signature, we used a Gene Ontology (GO) approach that classifies genes and proteins based on a controlled functional vocabulary in terms of their Molecular Function, Biological Process, and Cellular Component (Ashburner et al., 2000;Fruzangohar et al., 2013Fruzangohar et al., , 2017. Unregulated genes (73 in total) in the AM inoculation transcriptomic signature with a Z-value difference of >0.5, were announced important by most feature selection models that were used as input of Ensembl Biomart and agriGO web applications (Kinsella et al., 2011;Tian et al., 2017). Agrigo employs the Fisher test and FDR correction for identifying the significance of GO terms of input genes compared to whole genome GO distribution (as a control/background).

Upstream Regulatory (Common TFs) Analysis of AM Colonization Signature Through Promoter Analysis of Highly Expressed Meta-Genes in Response to AM Inoculation
The developed transcriptomic signature of successful AM colonization of roots of Medicago truncatula, obtained by integration of meta-analysis and machine learning (attribute weighting) models, was used for upstream regulatory analysis through common regulator discovery, as previously described (Deihimi et al., 2012;Babgohari et al., 2014;Bakhtiarizadeh et al., 2014b;Shamloo-Dashtpagerdi et al., 2015). To this end, the top 20 highly upregulated meta-genes, which responded to AM inoculation, (Supplementary Table 5) were selected for promoter analysis and common transcription factor discovery. In other words, the genes revealed in the transcriptomic signature were further used as the input of promoter analysis to find the transcription factor matrix families which regulate the transcriptome signature. Matrix families are groups of weight matrices for the same or functionally similar transcription factors .
To mine the binding of transcription factor families to the promoter regions, we used the MatInspector webtool (Quandt et al., 1995;Cartharius, 2005;Cartharius et al., 2005;Hosseinpour et al., 2013) to calculate the following scores: Core similarity, Matrix similarity, Model similarity, Free energy, Match rate, and p-value for the common TFs. Core similarity describes the similarity between core sequence of transcription factor matrix family and the input sequence. Core sequence of a transcription factor matrix family is the consecutive highest conserved positions of the matrix. The maximum core similarity of 1.0 is only reached when the highest conserved bases of a matrix match exactly with the input sequence. Matrix similarity is more important than the core similarity that takes into account all bases over the whole matrix length. Matrix similarity of 1.0 reaches only if the candidate sequence corresponds to the most conserved nucleotide at each position of the matrix. The free energy (in kcal/mol) is a thermodynamic parameter for the stability of secondary structures (hairpins) of matrix family with input sequence. The higher the free energy, the more stable the hairpin is. The match rate is the number of matching base pairs in percent of the total element length. The p-value for the common TFs is the probability to obtain an equal or greater number of sequences with a match in a randomly drawn sample of the same size as the input sequence set using Fisher's exact test. The lower this probability, the higher the importance of the observed common transcription factors.
Based on p < 0.01 of the common TFs and Matrix similarity >0.95, the enriched transcription factors on promoter regions of AM colonization signature were recorded as "common regulators." In other words, transcription factors with the highest number of possible interactions with the upregulated genes after AM inoculation were assumed as the key regulators, called common regulators of AM inoculation.

Independent Validation of Meta-Genes Based Biosignature of AM Inoculation by RNA-Seq
For independent validation of the AM colonization signature, derived by integration of meta-analysis and supervised attribute weighting models, independent samples of AM-inoculated and non-inoculated from RNA-seq experiment with GEO accession of GSE94266 (Garcia et al., 2017) were selected. The original experiment was designed to determine the effect of K + on colonization of Medicago truncatula plants (Garcia et al., 2017). Plants were co-cultured with the AM fungus Rhizophagus irregularis under normal and low K+ regimes. We used 3 AM-inoculated samples (GEO accessions: GSM2471944, GSM2471945, and GSM2471946) and 3 AM non-inoculated samples (GSM2471950, GSM2471951, and GSM2471951) of this experiment under normal K + regime to investigate the transcriptome response of Medicago truncatula to AM inoculation. Raw SRA files of the above-mentioned samples (100 bp, single end, Illumina sequencing technology) were retrieved by the DRAsearch tool (http://trace.ddbj.nig.ac.jp/DRASearch/) of the Research Organization of Information and System National Institute of Genetics (NIG), Japan. SRA files were transformed to fastq files using SRA Toolkit software (NCBI). The Medicago truncatula reference genome (Mt4. 0v2 Assembly), including fasta (genome sequence) and GFF3 (genome annotation) files, were downloaded from the Medicago truncatula Genome Database (Young et al., 2011;Krishnakumar et al., 2014). Quality control of reads was analyzed using FastQC package (https://www.bioinformatics. babraham.ac.uk/projects/fastqc/). Low quality reads and adaptor sequences were trimmed by the CLC Genomics Workbench 11.0.1 (QIAGEN). Mapping of short reads to the reference genome was performed using the CLC Genomics Workbench using the following criteria: mismatch cost = 2, insertion cost = 3, deletion cost = 3, length fraction = 80%, and similarity fraction = 80%). Total counts of mapped reads and RPKM (Reads Per Kilobase of transcript per Million mapped reads) were recorded as expression measurements for each gene. The differences related to the depth of sequencing were corrected by per-sample library size normalization using TMM (trimmed mean of M values) normalization method via calculating and adjusting the effective libraries sizes (Robinson and Oshlack, 2010).
To find the differentially expressed genes during AMcolonization vs. non-colonization, Generalized Linear Model (GLM) based on Negative Binomial distribution (Anders and Huber, 2010) was used to fit curves to expression values without assuming that the error on the values is normally distributed. GLM-based p-values for differentially expressed genes were calculated and corrected using FDR statistics and the CLC Genomics Workbench tool. Also, fold changes were calculated from the GLM to correct for differences in library size between the samples and the effects of confounding factors.
To visualize the differentially-expressed genes by heatmap, the following steps were performed: (1) log CPM (Counts per Million) values were calculated for each gene. The CPM calculation uses the effective library size, calculated by the TMM normalization.
(2) log CPM values were standardized across samples for each gene by transforming to Z-values.

Selected Samples From Different Studies for De Meta-Analysis
As the transcriptomic signature of AM may differ in different tissues, we selected the transcriptome files of root samples of Medicago truncatula A17. To reduce the variation between experiments, the Medicago Genome Array of Affymetrix platform with 61278 probset IDs was selected. Some samples in four studies had these criteria (Table 1, Figure 1) (Hogekamp et al., 2011;Bonneau et al., 2013;Truong et al., 2015;Floss et al., 2017). These samples were roots colonized with Gigaspora gigantean, Rhizophagus irregularis, or Glomus mosseaeas well as non-inoculated ones.

Reducing Heterogenicity Between Samples of Studies: A Framework
Integrating Within Study RMA-Normalization, Z-Value Transformation, and Within Samples Scaling/Quartiling Normalization Reducing heterogenicity across studies (batch effects) is an essential step for direct combination of expression data in DE meta-analysis. To reduce the batch effects for DM meta-analysis, we developed an integrative approach of within-study RMAnormalization, Z-value transformation, and within samples scaling/quartiling normalization.

Meta-Genes Based Biosignature of AM Inoculation Derived by Supervised Attribute Weighting Models
To achieve the transcriptomic signature of AM inoculation, the expression of 33685 genes in inoculated and non-inoculated Medicago roots was mined by 7 feature selection models. Also, the effects of Study (batch effect) and AM type were considered by adding 2 additional polynomial attributes to the expression dataset. The weights of genes as well as Study ID and AM type are presented in Supplementary Table 2. The resulting weights of each feature selection model were normalized into the interval between 0 and 1. The genes announced important by most of the feature selection models with the cut-off > 0.95 were assumed as the key distinguishing genes to form the AM inoculation biosignature.
In total, 681 genes received weight equal or higher than 0.95 by most feature selection algorithms (5 out of 6 models), including UNCERTAINTY, GINI INDEX, CHI SQUARED, RULE, INFO GAIN RATIO, and INFO GAIN. RELIEF was not efficient in gene selection and only gave a high weight to 2 genes out of 33685. Within 681 genes, 180 genes had absolute Z-value difference of 0.5 (> 0.5 or <−0.5) between AM inoculated and noninoculated (Supplementary Table 3). As presented in Table 2, 73 genes selected by feature selection models were up-regulated with a Z-value difference of > 0.5.
Within the previously-reported AM markers (Hogekamp et al., 2011), MtGIP1 received high weights (values) in 6 out of 7 of the employed attribute weighting models in order to distinguish AM-inoculated from non-inoculated samples. Figure 3 visualizes the high weights assigned to MtGIP1 by UNCERTAINTY, GINI INDEX, CHI SQUARED, RULE, INFO GAIN RATIO, and INFO GAIN models where weighting closer to 1 shows a higher relevance (importance) of gene according to the respective model. Also, Figure 3 presents the normalized expression value of MtGIP1 in AM-inoculated and non-inoculated samples. MtGIP1 can be assumed to be a reliable AM colonization marker as its predictive powers is confirmed by previous individual studies as well as the combined meta-analysis performed here.

Supervised Machine Learning Models Showed That the Batch Effect (Heterogenicity Between Experiments) Is Remarkably Reduced
The results of attribute weighting (feature selection) models presented in Table 2 show that the effect of Study ID (batch effect) is not significant in deriving the signature of AM inoculation. Interestingly, while 180 genes were selected by most of attribute weighting models to discriminate AM-inoculated from non-inoculated roots (Supplementary Table 3, Table 2), none of the models selected Study ID.
In line with this finding, clustering analysis (Figure 4) showed that the developed AM inoculation signature is able to discriminate between AM-inoculated and non-inoculated samples. As presented in Figure 4, while the AM-inoculated samples had more than 50% similarity to transcriptomic signature genes, this similarity decreased to 22% with AM non-inoculated genes.
The Transcriptomic Signature of Am Inoculation Identifies the Involvement of Hydrolase Activity, Phosphorylation, Cell Wall Organization, and Transport, Based on Computational Systems Biology Analysis Functional annotation of the transcriptomic signature based on GO analysis showed that a majority of genes in the transcriptomic signature encode membrane   Major intrinsic protein (MIP) family transporter MTR_8g087710

Glutathione S-transferase Glutathione S-transferase MTR_1g115195
Oxygen enhancer protein Oxygen-evolving enhancer protein MTR_8g005175 The 73 up-regulated genes responding to AM inoculation were derived by meta-analysis and supervised machine learning analysis.
Frontiers in Plant Science | www.frontiersin.org proteins and are involved in cell wall organization, membrane transport, proteolysis, and oxidoreductase activities (Supplementary Table 4). GO distribution of the transcriptomic signature is presented in Figure 5.
The isoprenoid biosynthetic/metabolic process and the lipid biosynthetic/metabolic process were statistically significant (enriched) biological processess that can be activated by upregulated genes of transcriptomic signature (p-value FDR < 0.05) (Figure 5A). Response to stimulus was another interesting aspect enriched in the GO. In the cellular component GO category, genes involved in response to AM colonization, including cell wall and external encapsulating structure, showed high enrichment (Figure 5B). In terms of Molecular Function, transferase and hydrolase activities were significantly enriched ( Figure 5C) (Supplementary Table 4). In line with this finding, analysis of transcriptome response of Medicago truncatula to Glomus mosseae and Rhizophagus irregularis by Hohnjec et al. (2005), showed that 201 plant genes were significantly co-induced at least 2-fold. These genes were related to functions such as nitrate, ion, and sugar transporter, and enzymes involved in secondary metabolism, proteases, and Kunitz-type protease inhibitors.

Overrepresented Transcription Factor Binding Sites on Promoter Regions of Upregulated AM Colonization Transcriptomic Signature Enabled Discovery of Potential Master Regulators of AM Colonization
Transcription factors with enriched binding sites on promoter regions of the transcriptomic signature are candidates for "common master regulators" (Hosseinpour et al., 2013;Mahdi et al., 2014;Alanazi and Ebrahimie, 2016;Alanazi et al., 2018). Transcription factor matrix families with statistically enriched (p < 0.01) binding sites in the highly upregulated meta-gene transcriptomic signature (top 20 genes) of successful AM inoculation is presented in Table 4 and Supplementary Table 6. The common enriched TFs were: P$FLO2, P$SEF3, P$TERE, P$ASRC, P$CARM, P$TOEF, P$SEF4, P$LREM, P$MYBL, P$CAAT, P$GTBX, and P$WOXF (Table 4 and  Supplementary Table 6).
Promoter analysis of upregulated genes in the AM colonization signature identified the P$FLO2 matrix family as one of the master regulators due to the highest number FIGURE 4 | Evaluation of the developed transcriptomic signature of Arbuscular mycorrhiza (AM) inoculation in discrimination of inoculated from non-inoculated samples. Clustering was performed based on Average Linkage method and Euclidean distance. Clustering based on the identified transcriptomic signature was able to efficiently distinguish between AM inoculated from non-inoculated samples. of binding sites (71) within promoter regions of all the 20 highly upregulated genes (0.0000204538). The P$FLO2 family contains transcription factors with AP2 domains and ethyleneresponsive element (ERE) binding. (Table 4, Figure 6A, and Supplementary Table 6).
The P$CAAT matrix family that includes CCAAT binding transcription factors, such as NF-YA, NF-YB, and LEC1, was selected as another common TF (master regulator) with a low p-value (p = 0.00316799) in common TF analysis. P$CAAT matrix had binding sites on promoter regions in 19 out of the 20 upregulated genes in the AM inoculation signature with the total number of 92 binding sites ( Table 4, Supplementary Table 6, and Figure 6B). P$SEF3 (soybean embryo factor 3) and P$SEF4 (soybean embryo factor 4), that contain SEF3 and SEF4 transcription factors, had a significantly (p-value <0.01) high number of interactions with the top 20 upregulated genes in the AM colonization signature and these were tentatively identified as potential key regulators of AM colonization ( Table 4,  Supplementary Table 6). The P$TERE matrix family that confers transcription factor-specific expression was also enriched in promoter regions of upregulated genes after AM colonization.
Another enriched transcription factor matrix family was P$TOEF that contains the AP2 domain in its structure and is involved in early activation/response ( Table 4,  Supplementary Table 6). RAP2.7 and TOE2 are wellknown members of this matrix family. GO analysis showed that this matrix family is involved in organ morphogenesis.
A matrix family involved in response to fungal colonization, P$ASRC, had 103 binding sites on promoter regions of the top 20 upregulated genes in

The AM Colonization Meta-Signature Showed High Repeatability in an Independent RNA-Seq Experiment of AM Colonization
In an independent RNA-seq experiment, we observed high correspondence between RNA-seq data of AM colonization and the identified AM colonization signature in this study, derived from integration of meta-analysis with supervised attribute weighting models (Figure 7, Supplementary Table 7). Fifty-one of 73 (70%) of the upregulated genes in the developed transcriptomic biosignature of AM colonization were also upregulated in the independent RNA-seq data of AM colonization with FDR-corrected p < 0.01 ( Figure 7A).
Noticeably, the identified AM colonization meta-signature was able to discriminate accurately between AM-inoculated samples and non-inoculated ones (Figure 7B). High correspondence between the expression of some of the important genes of the AM-colonization signature in the original microarray experiments (based on standardized Z-value of expression) and the expression of those genes in the RNA-seq experiment [based on RPKM (Reads Per Kilobase of transcript per Million mapped reads) are visualized in Figure 7C, including the AP2 domain class transcription factor (MTR_6g029180), members of GRAS family of transcription factor (MTR_1g069725, MTR_2g089100), Cyclin-dependent kinase (MTR_1g098300), MIP family transporter (MTR_8g087710), ABC transporter B family-like protein (MTR_8g022270), Legume lectin beta domain protein (MTR_5g031160), Sigma factor sigb regulation rsbqlike protein (MTR_3g045440), and Serine carboxypeptidase-like protein (MTR_3g079620).
In short, after quality control and trimming, 18772504, 19678186, 22009349, 18982223, and 19364133 remained in 3 AM-inoculated and non-inoculated samples. High efficiency of mapping to genes (more than 95%) was observed in all samples. Supplementary Table 8 presents RNA-seq based differential expression analysis of Medicago truncatula response to AM inoculation compared to non-AM inoculation.

DISCUSSION
Finding a biosignature/predictors based on a single transcriptomic experiment is a major challenge due to a large prediction error caused by a large number of independent predictors (genes) and a restricted number of observations (replications) (Baseri et al., 2011). Also of concern is the repeatability of a selected subset of a gene derived from a single experiment/condition. Inter-species analysis of a range of experiments by meta-analysis and machine learning techniques is able to deal with theseshortcomings, leading to the generation of a robust and repeatable biosignature (Farhadian et al., 2018b). Meta-analysis has received increased attention in recent years because of its remarkable potential to increase the statistical power and generalizability of single study analysis (Farhadian et al., 2018a;Sharifi et al., 2018). Meta-analysis not only reinforces the findings of the individual studies, but is also may identify new undetected outcomes/patterns in single studies as meta-analysis considers the direction/trend of variables in each experiment to find the consistent, robust and repeatable patterns in all experiments (Sharifi et al., 2018). The inter-species DE-based meta-analysis employed in this study had more samples and stronger statistical power and was successful in achieving a statistically-reliable transcriptomic biosignature of successful AM inoculation, independent from the study. In addition, the biosignature was repeatable and discriminative when a new and independent RNA-seq experiment was used for its validation. Due to the availability of Medicago truncatula transcriptomic data (as a model plant), the meta-analysis was solely performed on this plant resulting in the identification of a robust and high performance transcriptomic signature of AM colonization. However, in non-model plants with the subsequent generation of new transcriptomic data, it will be necessary the identified Medicago truncatula-derived transcriptomic signature of AM colonization will need further examination.
In DE-based meta-analysis, it is crucial to adjust for batch effects before combining expression datasets. Heterogenicity (batch effects) is the major concern in meta-analysis of expression data (Leek and Storey, 2007;Ramasamy et al., 2008). In this study, we developed a new approach for reducing batch effects and direct merging meta-analysis by combination of meta-analysis, multi-step normalization, and supervised attribute weighting models. We observed that quartiling outperforms the scaling approach in reducing the batch effect. Heterogenicity-reducing based on the quartiling approach has been used extensively for knowledge discovery and pattern recognition in large data analysis, particularly in integrated classification and associationrule mining (CBA) algorithm (Kargarfard et al., 2015(Kargarfard et al., , 2016. As an example, CBA analysis of quartiled protein and DNA measurements was able to find a biosignature for increased host range and the emergence of an outbreak in influenza (Kargarfard et al., 2016). Supervised machine learning has brought new possibilities to predictive studies (Bakhtiarizadeh et al., 2014a;Ebrahimi et al., 2014;Zinati et al., 2014;Ebrahimie et al., 2018b). Supervised attribute weighting (feature selection) algorithms are techniques for reducing the variables and identifying a subset of highly relevant ones in order to improve the efficiency of classification algorithms (Rosario and Thangadurai, 2015). The capability to simultaneously analyse both categorical and numerical features, power to analyse large data, and the ability to produce various predictive algorithms with diverse statistical backgrounds are distinguished features of supervised machine learning models Shekoofa et al., 2014). The possibility to include the categorical variables in predictive models can remarkably decrease the heterogenicity across studies as the batch effects and other non-biological experimental variation were incorporated in the models (Shekoofa et al., 2014). In this study, different experiments or types of AM were added as variables and analyse in the predictive model that resulted in remarkable control of batch effect. This possibility is very limited in traditional multivariate or regression models.
The identified meta-genes of successful AM colonization, derived by integration of meta-analysis with supervised attribute weighting models, was able to discriminate efficiently between AM-inoculated and non-inoculated samples. As a validation analysis, the developed signature showed high performance in distinguishing AM-colonized roots from non-inoculated ones in an independent RNA-seq experiment. Recently, integration of supervised machine learning algorithms with meta-analysis has been used to identify a mastitis bio-signature and early prediction of its occurrence Sharifi et al., 2018). The developed integrative approach in this study, comprising multi-step normalization, direct-merging meta-analysis, and supervised attribute weighting models, is platform-independent approach. By subsequent generation of more RNA-seq data, the developed pipeline may be employed for biosignature discovery in RNA-seq transcriptomic data, integration of microarray and transcriptomic data as is possible using some other NGS platforms, such as ChIP-Seq and SNP to perform meta-analysis on significant peaks in ChIP-Seq experiments and frequency of SNPs in genome-wide experiments.
The core 73 upregulated genes in the developed transcriptomic biosignature contain novel regulators of AM colonization including two transcription factors from the GRAS family (MTR_1g069725, MTR_2g089100), one transcription factor from AP2 domain class (MTR_6g029180), and one Zinc finger protein. It has been documented that the GRAS-type transcription factors, such as NSP1 (Nodulation Signaling Pathway1) and NSP2, play essential signaling functions in promoting both Rhizobium nodulation and mycorrhizal colonization (Kaló et al., 2005;Smit et al., 2005;Liu et al., 2011;Gobbato et al., 2012). Another transcription discovered factor, MTR_6g029180, has an AP2 domain in this structure. Interestingly, it has been reported that ERF transcription factors with a highly conserved AP2 DNA-binding domain are necessary for nodulation and symbiosis (Middleton et al., 2007). Cyclin-dependent kinase (MTR_1g098300) was another highly upregulated gene in the signature of successful AM colonization in this study. Mycorrhizal colonization is classified as postembryonic development of plant organs that need a constant interplay between the cell cycle and developmental programs (Kondorosi and Kondorosi, 2004). Cyclin-dependent kinase controls cell cycle and plays the key role in endoreduplication and activation of the anaphase-promoting complex during symbiotic cell development (Kondorosi and Kondorosi, 2004). The discovery of the essential transcription factors of successful mycorrhizal colonization and symbiosis in the developed biosignature highlights the robustness and applicability of meta-analysis in the AM colonization signature discovery and the importance of the developed transcriptomic signature. The biosignature obtained here provides a platform for increasing the efficiency of AM inoculation in future by finding accelerator AM colonization agents, such as small molecules/chemicals, and manipulating the expression of key genes in the biosignature.
The reasons that some previously-reported AM-associated genes were not identified in the AM meta-signature might be: (1) there are other genes with higher and more repeatable expression in response to AM induction and colonization which are, as a result, selected. These new candidates have higher preference over some of the previously-known biomarkers of AM symbiosis, (2) some AM markers might interact with the type of AM and consequently these will not appear in cross-species metaanalysis, and (3) some AM markers may interact with a specific condition or timing of AM symbiosis. As example, mycorrhizaspecific phosphate transporter seems to be more closely related to P homeostasis rather than colonization as the phosphate transporter mediates early root responses to phosphate status in non-mycorrhizal roots .
Reinforcing the importance of the existence of AP2 transcription factors in the upregulated transcriptomic signature of AM colonization, promoter analysis demonstrated that the P$FLO2 transcription factor matrix family, with the AP2 domain structure and ethylene-responsive element-binding, had the highest number of promoter binding sites of all 20 highly upregulated genes in the AM inoculation signature. Floral homeotic protein APETALA 2, a member of P$FLO2 matrix family, has a documented role in the control of flower and seed development (Jofuku et al., 1994). Strong induction of APETALA 2 in developing nodules of Medicago truncatula has been observed and suggested as a potential regulator of the symbiotic program (El Yahyaoui et al., 2004). Another enriched transcription factor matrix family was the P$TOEF matrix family that contains the AP2 domain in its structure and is involved in early activation/response ( Table 4, Supplementary Table 6). GO analysis showed that these are involved in organ morphogenesis. P$CAAT was another potential master regulator of the identified AM colonization signature that contains CCAATbinding family transcription factors. It has been documented that CCAAT-binding family transcription factors are essential for endosymbiosis establishment and development (Diédhiou and Diouf, 2018). Laser microdissection has documented the expression of CAAT-Box transcription factor in AM, correlated with fungal contact and spread (Hogekamp et al., 2011). Two members of this CCAAT-binding family, NF-YA1a and NF-YA1b, are positive regulators of AM colonization in soybean (Schaarschmidt et al., 2013). Before the present study, most of the known CCAAT-binding family transcription factors had been reported to be involved in nodulation (Marsh et al., 2007;Soyano et al., 2013). Functional genetic studies of symbiotic genes in Medicago truncatula indicate a role for a CCAAT-box transcription factor in rhizobial infection (Cousins, 2016). Analytical approaches based on literature mining have suggested association between a number of potential microRNAs (particularly microRNA169 and microRNA156) and microRNAregulated transcription factors, which may be involved in the coordinated regulation of nitrogen and phosphorous starvation responses in soybean and NF-YA3 and NF-YA8 are targets of microRNA169 (Dehcheshmeh, 2013;Chiasson et al., 2014).
A MYB transcription factor belonging to P$MYBL matrix family was also enriched on promoter region of the identified signature of AM colonization. It has been demonstrated that a transcriptional program for arbuscule degeneration during AM symbiosis is regulated by MYB1 (Floss et al., 2017).
At the regulatory level, promoter analysis of co-expressed genes has demonstrated high potential in identifying key enriched transcription factors, finding undiscovered roles of genes (Deihimi et al., 2012), developing the functional genomics catalog of activated transcription factors during a phenomenon Zinati et al., 2014), and discovery of transcriptional regulatory networks (Bakhtiarizadeh et al., , 2014b. It has been also shown that number and diversity of differential cis-regulatory elements on promoter regions are strong predictors of gene function and level of expression under different conditions (Babgohari et al., 2014;Shamloo-Dashtpagerdi et al., 2015). This has resulted in developing new indicators of gene importance not based on the gene sequence but on the promoter region. In our previous study, we developed a novel pairwise comparison method for in silico discovery of statistically significant cis-regulatory elements in eukaryotic promoter regions (Shamloo-Dashtpagerdi et al., 2015).
Transcription factors have interactions with DNA to regulate gene expression in cells (Pomerantz et al., 2015). In future studies, genome-wide mapping of binding sites of the identified transcription factors [GRAS family transcription factor (MTR_1g069725, MTR_2g089100), AP2 domain transcription factor (MTR_6g029180), and CCAAT-binding transcription factors] by CHIP-seq techniques may unravel the cistrome of successful AM colonization in symbiosis establishment.

CONCLUSION
In this study, we developed a new approach for reducing heterogenicity between experiments (batch effect) and direct merging meta-analysis by combining meta-analysis, multistep normalization, and supervised attribute weighting models. We employed this approach to obtain a unified transcriptomic signature of successful AM colonization in roots of Medicago truncatula. The genes of identified in the signature, derived by integration of meta-analysis with supervised attribute weighting models, were strongly upregulated in all AM symbioses and probably correspond to the end targets of the symbiotic programme. The identified meta-genes of successful AM colonization discriminated efficiently between AM inoculated and non-inoculated samples. Furthermore, the developed signature showed high performance in distinguishing AM-colonized roots from non-inoculated ones in an independent RNA-seq experiment. Important protein classes such as the AP2 domain class transcription factor (MTR_6g029180), GRAS family transcription factors (MTR_1g069725 and MTR_2g089100), and cyclin-dependent kinase (MTR_1g098300) were highly upregulated during AM successful colonization. The developed direct mergingbased meta-analysis, by combining meta-analysis, multi-step normalization, and supervised attribute weighting models, provides the possibility of data collection from different experiments even when a treatment or a control is missing in one or more of the experiments.
We suggest that the promoters of meta-genes identified in the transcriptomic signature of AM colonization may have the power to unravel key transcription factors as master regulators of AM symbiosis. Analysis of promoter regions of the top upregulated meta-genes in the AM-successful colonization signature in this study identified enriched transcription factor binding sites and led us to possible master regulators that form the transcriptome expression pattern. These included AP2 domain class transcription factors, CCAAT-binding family transcription factors, SEF transcription factors, and response to fungus ASRC transcription factors. Further functional characterization of these transcription factors is needed to understand their precise role in AM symbioses.
This study provides a framework for an improved understanding of the dynamics of successful AM colonization in establishing microsymbionts. It offers a new approach for related investigations into the other symbiosis systems.

AUTHOR CONTRIBUTIONS
MM-D, EE, and AN designed the research. ME, MT, ZN, RE, MM-D and EE collected and analyzed the data. ME developed the required software. MM-D and EE wrote the paper. All authors have read and approved the manuscript.