An Interaction-Based Method for Refining Results From Gene Set Enrichment Analysis

Purpose: To demonstrate an interaction-based method for the refinement of Gene Set Enrichment Analysis (GSEA) results. Method: Intravitreal injection of miR-124-3p antagomir was used to knockdown the expression of miR-124-3p in mouse retina at postnatal day 3 (P3). Whole retinal RNA was extracted for mRNA transcriptome sequencing at P9. After preprocessing the dataset, GSEA was performed, and the leading-edge subsets were obtained. The Apriori algorithm was used to identify the frequent genes or gene sets from the union of the leading-edge subsets. A new statistic d was introduced to evaluate the frequent genes or gene sets. Reverse transcription quantitative PCR (RT-qPCR) was performed to validate the expression trend of candidate genes after the knockdown of miR-124-3p. Results: A total of 115,140 assembled transcript sequences were obtained from the clean data. With GSEA, the NOD-like receptor signaling pathway, C-type-like lectin receptor signaling pathway, phagosome, necroptosis, JAK-STAT signaling pathway, Toll-like receptor signaling pathway, leukocyte transendothelial migration, chemokine signaling pathway, NF-kappa B signaling pathway and RIG-I-like signaling pathway were identified as the top 10 enriched pathways, and their leading-edge subsets were obtained. After being refined by the Apriori algorithm and sorted by the value of the modulus of d , Prkcd, Irf9, Stat3, Cxcl12, Stat1, Stat2, Isg15, Eif2ak2, Il6st, Pdgfra, Socs4 and Csf2ra had the significant number of interactions and the greatest value of d to downstream genes among all frequent transactions. Results of RT-qPCR validation for the expression of candidate genes after the knockdown of miR-124-3p showed a similar trend to the RNA-Seq results. Conclusion: This study indicated that using the Apriori algorithm and defining the statistic d was a novel way to refine the GSEA results. We hope to convey the intricacies from the computational results to the low-throughput experiments, and to plan experimental investigations specifically.


INTRODUCTION
MicroRNAs (miRNAs) are a class of noncoding RNAs that play key roles in regulating gene expression and are involved in a variety of biological processes during retinal development (Damiani et al., 2008). In most cases, miRNAs inhibit the expression or promote the degradation of messenger RNA (mRNA) by interacting with the specific sequences located in the 3 UTR of their target mRNA (Ha and Kim, 2014). According to this feature, each miRNA can target hundreds of mRNAs. The development of transcriptomics technologies such as RNA sequencing (RNA-Seq) can provide a broad account of RNA transcripts and consequently insight into changes in the mRNA expression of downstream genes after experimental intervention on the miRNA .
Due to the large amount of sequencing data generated by RNA-Seq, appropriate bioinformatics methods are needed to handle the data. After preprocessing the data, traditional quantitative analysis for mRNA expression analysis focuses on identifying differentially expressed upregulated and downregulated genes between two individual groups. Traditional strategies usually set an arbitrary cutoff in terms of expression fold-change (e.g., fold-change ≥ 1.5 considered significant) to filter the critical genes. Conservative and relaxed cutoff values may cause false negative and false positive results, respectively, making the results less objective and reproducible. To overcome the analytical challenges of focusing on a single gene, gene set enrichment analysis (GSEA) helps to gain further insight into the distribution of genes preannotated in biological categories by incorporating the entirety of gene expression. However, both single and global gene analysis often generate a large number of candidate genes (Ackermann and Strimmer, 2009). The lack of golden standard datasets also makes the assessment of gene set analysis methods rudimentary (Maleki et al., 2020). Hence, instead of extracting more genes from datasets, the goal of this study was to attempt reducing the dimensionality of analysis results and refining the intricacies from the high-throughput results to the low-throughput experiments.
Several studies have characterized miRNA expression in the developing mammalian retina, and miR-124-3p has been shown to be one of the most abundantly expressed miRNAs (Karali et al., 2007;Hackler et al., 2010;Karali et al., 2010;Karali et al., 2016). Although the miR-124a-3p knockdown mouse exhibited neuronal dysfunction and dysmaturation by disinhibition of Lhx2, its downstream responses in mouse retinal development after birth were still relatively deficient at the transcriptome level. In our previous study, miR-124-3p exhibited a significantly increasing trend after birth, and its related pathways predicted by bioinformatics analysis were associated with biological processes that may play crucial roles in mouse retinal development (Wang et al., 2020). To gain insight into its downstream processes, RNA-Seq was used to obtain the expression profile of mRNA transcripts after the knockdown of miR-124-3p.
In GSEA, given a set of genes sorted depending on given conditions (e.g., mRNA expression level of downstream genes) and a biological category, a running sum statistic is computed iteratively from top to bottom of the sorted set to evaluate whether it is enriched in the given biological category. When processed, the running sum will increase whenever a gene belonging to the given biological category is found and otherwise decrease. Therefore, the running sum will be relatively high if the gene set falls at either the top (overexpressed) or bottom (underexpressed) and is likely to be subsequently related to the given biological category.
Based on the properties mentioned above, GSEA is an appropriate tool to obtain a precise description of the downstream effects of miRNAs. Since a miRNA and its target mRNAs demonstrate negative correlations because of degradation (Ritchie et al., 2009;Wang and Li, 2009), when the expression value of the miRNA is manipulated to decrease, the upregulated and downregulated mRNA expression can be assumed to be its direct effects and indirect effects, which are likely to be enriched at the top and bottom in GSEA, respectively, and vice versa.
Through the analysis of a downstream mRNA dataset from RNA-Seq after miR-124-3p knockdown, we demonstrate how the original GSEA method was extended and the results from GSEA were refined, which could be used as a comprehensive protocol for downstream analysis in the loss-and gainintervention to a specific miRNA. Some parameters, such as the leading-edge subset, were modified to better describe the characteristics of the bottom (underexpressed) mRNAs based on the classical GSEA approach by Subramanian et al. (2005). The union of the leading-edge subsets in the enriched KEGG pathways was selected. For traditional GSEA, the number of generated candidate genes is usually still too high. One or several key genes or pathways need to be identified for further functional experiments. Apriori algorithm and a new statistic d were introduced to correct this issue. It is hypothesized that genes play critical roles in the entire leading-edge subsets if they have the most interactions with other genes. Apriori algorithm could use prior Boolean association rules (gene-gene interaction) to mine these pivotal genes or gene sets. Afterward, the expression vector of candidate genes or gene sets were modified with their relationship as a new statistic d. Finally, the candidate genes were refined, and key genes were identified.

Knockdown of miR-124-3p and RNA Extraction
C57BL/6J mice were used to study the transcriptome of miR-124-3p during retinal development. Mice at postnatal day 3 (P3) were given 1 μl of 0.6 nmol/μl miR-124-3p antagomir in the left eye as the anti-miR-124 group and 0.6 nmol/μl antagomir negative control in the right eye as the negative control (NC) group by intravitreal injection. Retinas from mice at P9 were harvested, and total RNA was isolated by TRIzol (Invitrogen; Thermo Fisher Scientific, Inc, Waltham, MA, United States) according to the manufacturer's instructions. Both groups consisted of 12-15 mixed retina tissues, and one biological replicate was conducted. In addition, to increase the heterogeneity of the sample, the samples in each group were from at least two different litters.
2.2. RT-qPCR Validation for the Knockdown of miR-124-3p and RNA-Seq Reverse transcription quantitative PCR (RT-qPCR) was performed to validate the knockdown rate of the anti-miR-124 group compared with the NC group. Total RNA from both groups was reverse transcribed using a PrimeScript RT reagent kit (Takara Bio, Inc, Otsu, Japan). Real-time PCR was subsequently performed on the resulting cDNA template with a TB Green ™ Premix Ex Taq ™ II kit (Takara Bio, Inc, Otsu, Japan) on a StepOnePlus ™ Real-Time PCR System (Applied Biosystems; Thermo Fisher Scientific, Inc, Waltham, MA, United States). The 2 −ΔΔCt method was used to quantify miRNA expression levels with the u6 gene as an internal reference (Livak and Schmittgen, 2001). After significant knockdown of the expression level of miR-124-3p was confirmed by RT-qPCR, RNA-Seq was carried out to detect the expression levels of mRNAs in both groups. The RNA-Seq data in the present study are deposited in the Gene Expression Omnibus (GEO) repository, accession number GSE200915 (https://www.ncbi.nlm.nih.gov/geo/query/acc. cgi?acc=GSE200915).

Data Preprocessing
To normalize the length of the mRNA sequences and the sequencing depth of a sample, transcripts per million (TPM) were used to assess the expression level of mRNAs. Read counts for all the mRNA transcripts were demonstrated in Data Sheet 1. The method for calculating TPM was described in detail elsewhere (Wagner et al., 2012;Dillies et al., 2013). Based on previous research, the frequency distribution of genes with different expression levels has a mode at TPM nearly equal to 0 and a long tail toward higher TPM values (Wagner et al., 2013;Bush et al., 2018;Monaco et al., 2019). Therefore, to determine an appropriate interval to filter genes with very low expression, a base 10 logarithmic scale was used to evaluate the frequency distribution of TPM for the genes in all groups.

Calculating the Expression Difference
In each group set, the expression difference for a gene between the two groups was defined as the log2-transformed foldchange value of TPM. To reduce the false positive rate of the results, only genes that had the same expression trend between the two biological replicates were considered reliable.
Based on these methods, a list of candidate genes and their corresponding gene expression differences was obtained.

Obtaining the Enrichment Score and Its Related Parameters in Kyoto Encyclopedia of Genes and Genomes Pathways
The candidate genes were ranked from highest to lowest expression level, and the enrichment score (ES) for each Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway was calculated based on the GSEA approach (Subramanian et al., 2005). KEGG gene annotation was derived from the KEGG database (https://www.genome.jp/keggbin/download_htext?htext=ko00001.keg&format=json&filedir=). After excluding three types of annotations ("09160 Human Diseases", "09180 Brite Hierarchies", "09,190 Not Included in Pathway or Brite") in KEGG pathways that were not related to retinal development, GSEA was performed on 354 pathways whose expression patterns might have changed in the retina. Afterward, the p value for each KEGG pathway was estimated by comparing the absolute value of its maximum ES with randomly generated sets of absolute values of maximum ES. A p value <0.05 was considered significant or enriched. To balance the accuracy of the estimation and the required computing power, the number of permutations used to generate the comparisons was set to 1,000. For an enriched pathway, a leading-edge subset is a set of genes that contributes to the maximum ES in the ranked list of genes. The subset will be found at the top if the maximum ES is positive (upregulated gene subset) or at the bottom if the maximum ES is negative (downregulated gene subset). Afterward, the union of the leading-edge subsets in the enriched KEGG pathways was selected.

Calculating the Maximum ES
Since a total of n candidate genes were ranked from highest to lowest expression level, we denoted δ (δ 1 , δ 2 . . . δ n ) as the vector of differences, where each element δ i in δ represented the expression difference of a single gene g i between two groups. Accordingly, a corresponding gene vector g (g 1 , g 2 . . . g n ) was defined, where each element g i was the gene name using NCBI Gene Symbol. To calculate the ES in a KEGG pathway, it is necessary to determine whether the gene g i belongs to this pathway. We established the following definition: if g i belonged to the annotations of this pathway, r(g i ) 1; if g i did not belong to the annotations of this pathway, r(g i ) 0, and a new vector r (r 1 , r 2 , . . . r n ) was generated, where each element r i consisted of 0 and 1. To facilitate the iteration in the computer program, the calculation of ES was rewritten into the following form according to the method described by Subramanian et al (Subramanian et al., 2005).
where N H r i , N R r i |δ i | p , es i represents the value of ES and p is a weighing factor with a range of [0, 1]. A value of p 1 was set as a constant in the study. By the method of mathematical induction, it was easy to prove that es n 0. In particular, N R equal to 0 indicated that none of the genes in g belonged to the annotations in the KEGG pathway. To prevent a division by zero error (when N R 0), we set any value of es i equal to 0 under this condition. Thus, the maximum ES, es max sign(es k )max(|es 1 |, |es 2 | . . . |es n |), was obtained, where |es k | was the maximum element in the max(·) function, and sign(·) represented the signum function to indicate the positive and negative values of es max .

Estimating the Significance of the Maximum ES
To estimate the significance of the es max generated from g in the KEGG pathway, a nominal p value was calculated using an empirical phenotype-based permutation method derived from Subramanian et al. (2005). Given the null hypothesis that the order of the elements in g was random, the way to reject the null hypothesis was to calculate the probability of es max in its randomly generated distribution, which was derived from a set of randomly assigned g. When the order of elements in g was disrupted and the other parameters remained unchanged, the new corresponding value of es max was related only to the random rearrangement of elements in g. After repeating the above random rearrangement process v times, a series of values of es ′ max with a total number of v were obtained based on the null hypothesis, and the absolute value was taken: (|es ′ max1 |, |es ′ max2 |, . . . |es ′ maxv |). The empirical cumulative distribution function (CDF)F v for es ′ max (derived from the null hypothesis) can be described as follows: where I(·) is the indicator function of an event and v corresponds to the number of permutations in Subramanian et al (Subramanian et al., 2005). According to the empirical CDF, the p value for an es max in a pathway was calculated as follows:

Obtaining the Leading-Edge Subsets
In an enriched pathway, the leading-edge subset in g appears prior to the peak score for a positive es max and appears subsequent to the peak score for a negative es max . The leading-edge subset L can be described as follows: where i represents the position number of es max .

Refining the Results From Leading-Edge Subsets
Some genes or gene sets with interactions appear more frequently than others, indicating their pivotal roles in the enriched KEGG pathways. For instance, the interacting gene set Raf1-Map2k1-Erk plays an important role in a series of pathways, such as the ErbB signaling, FoxO signaling, and Ras signaling pathways. The Apriori algorithm designed for finding frequent item sets was introduced to identify these frequently appearing genes or interacting gene sets (Agrawal and Srikant, 1994). The goal of the algorithm is to identify genes (including a single gene or multiple genes with interactions) whose frequency of occurrence in the gene interactions is greater than a specified threshold. Prior knowledge of gene-gene interactions was determined by the Reactome database (Croft et al., 2011) (https:// reactome.org/download/tools/ReatomeFIs/FIsInGene_122220_ with_annotations.txt). After the frequent gene sets were calculated by the Apriori algorithm, a new statistic d was defined to comprehensively evaluate the real weights of a frequent gene set. Finally, by sorting the modulus of the statistic d, the significant genes were obtained for further experiments.
The number of iterations was set to 4, which meant that gene interactions were generated with a maximum item size of 4. As Apriori uses a "bottom up" approach, genes in the union of the leading-edge subsets were used as the first-level candidates. Afterward, the number of genes in a candidate was increased by one, and the candidates that had an infrequent pattern (defined by the threshold) or were not consistent with the gene interaction knowledge were pruned. According to the above method, frequent candidates were extended one gene at a time. The threshold was set to 3, indicating that a candidate would be removed if its frequency was less than 3 among all transactions. The algorithm terminated when no frequent candidates could be generated, and frequent gene sets were identified by screening. Since the frequent gene sets were calculated based on prior annotation, a new statistic d was defined to comprehensively evaluate the real weights of a frequent gene set. The statistic d of a frequent gene set is equal to the multiplication between the column vector of the expression level and its correlation matrix, and the modulus of the statistic d represents comprehensive expression based on prior annotation and the actual expression level. Finally, by sorting the modulus of the statistic d, the significant genes were obtained for further experiments.

Generating the Correlation Matrixes From a Gene Set
According to prior knowledge in the Reactome database, there are three categories of gene effects on another: upregulation, downregulation, or no effect, where the symbols '→' and '⇢' represent upregulation and downregulation, respectively. For a vector containing n genes, its correlation matrix M was a square matrix of order n, whose elements were the interaction information among these genes. We defined its element a ij as follows: 1 if g i → g j and i ≠ j −1 if g i / > g j and i ≠ j 0 if g i had no effect on g j and i ≠ j 1 if i j where i and j represent the index numbers of g i and g j in the tuple, respectively. the correlation matrix was described as follows:

Generating Interactions From the Leading-Edge Subsets
Genes from the union of the leading-edge subsets were used to iteratively generate gene transactions. After obtaining the union of the leading-edge subsets with m genes, the vector l consisting of these genes and the correlation matrix M reflecting their interactions were created. Elements in l were considered basic gene interactions in the first iteration. For an iteration vector, a gene would be appended to a gene transaction if it had an interaction (upregulation or downregulation) with the gene at the end of the transaction, and all possible new interactions were used as the basic gene interactions for the next iteration. In the study, gene interactions with a maximum of 4 items were generated. For example, let the 4-dimensional vector l (A, B, C, D) be the union of the leading-edge subsets, and let the interactions be as described above in (1). For 4 iterations, the sequences of gene interactions I (I 1 , I 2 , I 3 , I 4 ) were generated as follows:

Mining Frequent Gene Sets by the Apriori Algorithm
To detect frequent gene sets from the gene transactions, the value support was used to evaluate the frequency of a gene set. For the vector a of a gene set, the support of a was defined as the total count of a in all transactions. For example, the support of (A, C) in interactions I in (3) was 2 ((A, C) in I 2 and (A, C, D) in I 3 ).
Given a threshold min for support (a threshold min 3 was set in this study), gene interactions I (I 1 , I 2 . . . I t ) generated from t iterations and setting both the frequent gene set and the infrequent gene set, F and F, to ∅ initially, the algorithm in the study was described in two steps below: Step 1. Start with i 1. Traverse all interactions in I. Append the interactions whose support ≥ min to F and the interactions whose support < min to F, respectively. Based on the Apriori principle, if a transaction is found to be infrequent, then all its super interactions are also infrequent. Remove the transaction in I i+1 (if it exists) when a transaction in F belongs to the transaction in I i+1 .
Step 2. Repeat Step 1 until no further successful extensions are found. F is the returned result.
For example, for the gene interactions I described above in (3) and a given threshold min 3, the process of the Apriori algorithm was as follows: In i 1, no interactions in I 2 were removed.
In i 2, (A, C, D) and (B, C, D) were super interactions in F and were removed in I 3 .
In i 3, (A, B, C, D) was a super transaction in F and was removed in I 4 . Iteration was terminated, and the frequent gene set F was obtained.

Calculating the Weights of the Frequent Gene Set
After the frequent gene set was obtained by the Apriori algorithm, a new statistic d was introduced to evaluate the real 'weights' of the frequent gene set consisting of expression levels obtained from RNA-Seq. Given an element vector g from the frequent gene set, d was defined as the multiplication of its expression vector δ on its correlation matrix M: Since gene B was upregulated by gene A according to the prior knowledge, theoretically, if the expression value a of gene A was increased (a > 0), its expression value b should also be increased (b > 0). However, under the same condition where the expression value a was increased (a > 0), if the expression value b was decreased (b < 0), or if the increase in the expression value b was not as large as in the first case, d would be smaller in the second case than in the first case, indicating that the weights in the second case were smaller than those in the first case. Specifically, for a single gene S, d is the absolute value of its expression level: Any frequent gene set could be described by d no matter how complicated the interactions among genes, and its d represented the comprehensive weights related to their interactions and expression levels.
For instance, given the 4-dimensional vector of gene set l (A, B, C, D), its expression vector δ (a, b, c, d) and its correlation matrix M given above, its comprehensive weights were described as follows: According to this method, the weights associated with any frequent gene set will be referred to as its value of d , and the frequent gene set with the highest weights could then be identified by ranking the value of d . Since the value of d will increase with the increase in the number of items, the value of d must be compared among frequent gene sets with the same number of genes.

RT-qPCR Validation for the Candidate Genes
RT-qPCR was performed to validate the candidate genes. Total RNA from retinas was isolated and procedures for reverse transcription and Real-Time PCR were described previously earlier. Gene expression levels were quantified using the 2 −ΔΔCt method and normalized to GAPDH levels (Livak and Schmittgen, 2001). Graphical representation of the results was performed using GraphPad Prism v8.3.0 (GraphPad Software, Inc.).

Data Preparation and Normalization
A total of 115,140 assembled transcript sequences were obtained from the clean data, among which 75,959 were distinct mRNA sequences. After TPM normalization, the expression levels of genes were observed on a 10-base logarithmic scale. The overall TPM values showed a one-tailed distribution, which was consistent with previous literature reports (Wagner et al., Figure 1 shows that several genes had TPM values of 10 −4 , an amount of which was so small that it could not be clearly observed on the histogram. In addition, the TPM distribution on the order of 10 −7 was irregular and inconsistent with the one-tailed distribution of the previous orders of magnitude. TPM data below 10 −6 were excluded from further analysis. The expression differences of filtered genes were determined based on the log2-transformed fold-change TPM values. After miR-124-3p was knocked down in the mouse retina, most genes in the anti-miR-124 group were upregulated compared to the NC group, while a small portion were downregulated.

Identified Enriched KEGG Pathways and Their Leading-Edge Subsets
Based on GSEA of 354 pathways, the ranked gene list was enriched at the top in most of the pathways and enriched at the bottom in a small number of pathways. After p values were calculated based on 1,000 permutations, pathways related to retinal development with p values <0.05 were selected (Figure 2). The results indicated that only pathways with a positive maximum ES showed significant enrichment in upregulated genes, consistent with the negative regulation of mRNA by miRNA. The top 10 enriched pathways and the expression levels of their leading-edge subsets are shown in Figure 3: the NOD-like receptor signaling pathway, C-type-like lectin receptor signaling pathway, phagosome, necroptosis, JAK-STAT signaling pathway, Toll-like receptor signaling pathway, leukocyte transendothelial migration, chemokine signaling pathway, NF-kappa B signaling pathway and RIG-I-like signaling pathway.

The Weights of Frequent Gene Sets From the Leading-Edge Subsets
The Apriori algorithm was performed on the union of the leading-edge subsets to identify the frequent gene set. After the first iteration, 60 single genes were identified as frequent. Based on this, frequent gene sets with 2 and 3 genes were further identified, and no frequent gene sets were found with more than 4 genes. Values of d for frequent gene sets with 2 and 3 genes were calculated. In Figure 4, each colored line represents an interaction. The wider the height of the color line is, the greater the value of d . The background color of the gene represents the interactions between the gene and its downstream genes. The darker the background color of the gene is, the higher the number of interactions. Upstream of the frequent interactions with two genes, Prkcd, Igf1, Irf9, Cxcl12, Trim25, Stat2, Stat3 and Stat1 had a considerable d value and interactions with the downstream genes. For frequent interactions with 3 genes in the first set, genes with a considerable d value and interactions were Isg15, Prkcd, Eif2ak2, Cxcl12, Il6st, Pdgfra, Socs4, Stat2, Stat3, Csf2ra, Irf9 and Stat1. In summary, Prkcd, Irf9, Stat3, Cxcl12, Stat1 and Stat2 had the greatest interactions and the greatest value of d to downstream genes among all frequent transactions, indicating that they played a pivotal role in the downstream genes of miR-124-3p. The frequent gene sets and their weights are shown in Data Sheet 2. RT-qPCR was performed to validate the expression of candidate genes. These genes exhibited a similar trend to the RNA-Seq results ( Figure 5). Primer sequences were described in Data Sheet 3.

DISCUSSION
After gain and loss intervention of the upstream miRNA in a functional experiment, transcriptomics technologies such as RNA-Seq make it possible to obtain a large amount of downstream gene expression data in a single experiment.
FIGURE 5 | Results of RT-qPCR showed the expression of candidate genes bewteen NC and anti-124 group, which exhibited a similar trend to the RNA-Seq results. Each RT-qPCR experiment was repeated three times with independently isolated RNA samples. Results were presented as mean ± standard deviation (anti-miR-124 group, miR-124-3p knockdown group; NC group, negative control group; *p < 0.05, **p < 0.01) Frontiers in Genetics | www.frontiersin.org May 2022 | Volume 13 | Article 890672