Identification of Molecular Subtypes and Key Genes of Atherosclerosis Through Gene Expression Profiles

Atherosclerotic cardiovascular disease (ASCVD) caused by atherosclerosis (AS) is one of the highest causes of mortality worldwide. Although there have been many studies on AS, its etiology remains unclear. In order to carry out molecular characterization of different types of AS, we retrieved two datasets composed of 151 AS samples and 32 normal samples from the Gene Expression Omnibus database. Using the non-negative matrix factorization (NMF) algorithm, we successfully divided the 151 AS samples into two subgroups. We then compared the molecular characteristics between the two groups using weighted gene co-expression analysis (WGCNA) and identified six key modules associated with the two subgroups. Kyoto Encyclopedia of Genes and Genomes (KEGG) and gene ontology (GO) enrichment analysis were used to identify the potential functions and pathways associated with the modules. In addition, we used the cytoscape software to construct and visualize protein–protein networks so as to identify key genes in the modules of interest. Three hub genes including PTGER3, GNAI1, and IGFBP5 were further screened using the least absolute shrinkage and selection operator (LASSO) and support vector machine-recursive feature elimination (SVM-RFE) algorithms. Since the modules were associated with immune pathways, we performed immune cell infiltration analysis. We discovered a significant difference in the level of immune cell infiltration by naïve B cells, CD8 T cells, T regulatory cells (Tregs), resting NK cells, Monocytes, Macrophages M0, Macrophages M1, and Macrophages M2 between the two subgroups. In addition, we observed the three hub genes were positively correlated with Tregs but negatively correlated with Macrophages M0. We also found that the three key genes are differentially expressed between normal and diseased tissue, as well as in the different subgroups. Receiver operating characteristic (ROC) results showed a good performance in the validation dataset. These results may provide novel insight into cellular and molecular characteristics of AS and potential markers for diagnosis and targeted therapy.


INTRODUCTION
Atherosclerosis (AS) is a chronic inflammatory disease in which atherosclerotic plaques are deposited on the walls of blood vessels and induce arterial stenosis. In general, atherosclerosis is usually asymptomatic in the early stages. However, increased severity of the disease causes various diseases, such as coronary artery disease, peripheral artery disease, and cerebrovascular disease (Kong et al., 2014;Tern et al., 2018). It has been reported that there was sharp increase in the incidence of AS in 2017, with the AS-related mortality rising to 31% (Stefanadis et al., 2017). In addition, the AS is characterized by vascular inflammation, FIGURE 1 | A flowchart for the analysis procedure to identify potential molecular subtypes and key genes in this study.
Frontiers in Molecular Biosciences | www.frontiersin.org 2 April 2021 | Volume 8 | Article 628546 endothelial dysfunction, plaque formation, and diminished oxygen supply to target organs (Jeong, 2010). Currently, the underlying molecular mechanism of AS is still unclear, and an understanding of the molecular pathogenesis of atherosclerosis process could contribute to develop individualized treatment strategies for AS patients.
Until now, most studies on AS were mainly focused on the cardiovascular and cerebrovascular diseases, except for lipid deposition theory and inflammatory response involving immune cell, such as monocytes/macrophages, neutrophils, and T cells involved in AS (Niccoli et al., 2018). Presently, the conventional approaches for treating AS involve regulating some risk factors including smoking, alcohol, genetic factors, hypertension, and hyperlipidemia to prevent the development of AS. After the development of AS, the treatment strategy switches to use of anti-platelet aggregation and hypolipidemic drugs. In addition, local vascular stents or bypass measures can be used for the management of AS. However, these therapeutic strategies have not been effective in the management of AS as seen by the high morbidity and mortality associated with AS (Kelly et al., 2018). Thus, there is urgent need to delineate the molecular pathogenesis of AS, identify potential therapeutic markers, and develop effective drugs. Moreover, the C-reactive protein (CRP) has been involved in multiple processes of AS and is the optimal inflammatory biomarker to the prognosis of atherosclerotic events (Kampoli et al., 2009). The fibrinogen, apolipoproteins, and interleukins are tightly correlated with the progression of AS and have a strong relevance in risk prediction (Montagnana et al., 2008). However, the clinical predictive value of these markers is poor (Revkin et al., 2007). Therefore, a reliable and specific biomarker for promoting the clinical diagnosis of atherosclerosis is urgently required.
The present study aimed to identify potential subgroups of AS through integrating multiple datasets based on the gene expression profile. The protein-protein interaction (PPI) network and hub genes were constructed and screened through the weighted gene co-expression analysis (WGCNA) algorithm and machine learning methods based on the two subgroups. Moreover, the immune cells infiltration level for each AS sample were estimated, and the relationship between immune cells, subgroups, and hub genes was further explored. These findings might improve our understanding of the molecular pathogenesis of AS and identify potential markers for the diagnosis of AS.

Data Collection
Three expression profile data sets including GSE20129, GSE43292, and GSE57691 were downloaded from the Gene Expression Omnibus database using the GEOquery R package (Davis and Meltzer, 2007). The GSE20129 dataset consisted of two independent sets of data generated using two different General Public License (GPL) platforms. The dataset with the largest number of samples (N = 119) generated using the GPL6104 platform (Illumina humanRef-8 v2.0 expression beadchip) was selected for further analysis. In addition, the GSE20129 samples were collected from the peripheral blood cell. The GSE43292 samples were originated from carotid atheroma plaque, and the GPL platform for GSE43292 was GPL6244 ([HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version]), which includes 32 AS samples and 32 normal samples. The GSE57691 were derived from atherosclerotic vascular tissues, and the GPL platform Frontiers in Molecular Biosciences | www.frontiersin.org for GSE57691 was GPL10558 (Illumina HumanHT-12 V4.0 expression beadchip), which contained nine AS samples and 10 normal samples (Supplementary Table 1).

Batch Effect Removal
The sva R package was used to filter out any batch effect resulting from the combination of the two datasets (Leek et al., 2012). The expression values of the datasets (GSE20129 and GSE43292) were transformed using log2 before cross-platform normalization was carried out. We then used the comBat tool to eliminate the batch effect between the two data sets. Principal component analysis was applied to evaluate the performance of the comBat tools.

Clustering of AS Samples
The non-negative matrix factorization (NMF) algorithm was used to carry out clustering analysis to identify potential clustering groups of the AS samples (Possemato et al., 2011). The cophenetic correlation coefficients were used to determine the optimal cluster number.

Gene Set Variation Analysis
Gene set variation analysis (GSVA) is a non-parametric and unsupervised gene set enrichment method that evaluates the association between biological pathways and gene signatures based on expression profile data  (Hänzelmann et al., 2013). The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis data were retrieved from the c2.cp.kegg.v7.1.symbols.gmt file. Using GSEA analysis, the score for each AS samples on KEGG pathway was calculated. The pathways in which the gene signatures were enriched were identified using the "limma" R package with the cutoff fdr <0.05 and | log2FC| ≥ 0.2 (Ritchie et al., 2015;Yang et al., 2020).

Weighted Gene Co-expression Network Analysis
Weighted gene co-expression analysis was conducted to identify potential modules that characterize the pathways or function of the subgroups on the basis of gene expression profile (Langfelder and Horvath, 2008). Since there was no significant difference in the expression levels of some genes, only genes with the highest 25% variance were used for WGCNA analysis. Pearson correlation analysis was carried out between gene pairs and the results used to construct the corresponding Pearson correlation matrix. The absolute values of the correlation co-efficient between the gene pairs were considered to be the co-expression similarity matrix. The power function formula, amn = | cmn| β (cmn represent the Pearson correlation between gene m and gene n; amn represent the adjacency between gene m and gene n), was used to build a weighted adjacency matrix. Moreover, we chose a β value to enhance the similarity matrix and achieve a scale free co-expression network (scale free R 2 = 0.9). The adjacency matrix was further converted into a topological overlap matrix (TOM) and module dendrograms were constructed using average linkage hierarchical clustering methods. The gene in the min modules was at least 30. After merging the gene modules on the basis of similarity, we identified six valuable modules. The Pearson correlation coefficient was calculated between modules and subgroups using the cor function in the WGCNA R package.

Enrichment of the Modules and Protein-Protein Network Construction
The "clusterProfiler" R packages were used to perform gene ontology (GO) and KEGG enrichment analysis for the genes contained in the modules (Van den Berg et al., 2009). Functions or pathways that were significantly enriched were identified based on the criterion: adjusted P < 0.05. The STRING database was used to construct the PPI network (Yang et al., 2020), while the cytoscape software was used to visualize the network. The key modules and genes were identified using the MCODE function within cytoscape with the criterion: score >2 (Szklarczyk et al., 2018).

Identification of Key Genes
After selecting genes from MCODE module, we performed least absolute shrinkage and selection operator (LASSO) analysis with a turning/penalty parameter conducted using 10-fold cross validation to find the most relevant genes. LASSO is a dimension-reduction algorithm that has superiority in analyzing high-dimensional data when compared to regression analysis (Tibshirani, 1996;Bader and Hogue, 2003). Moreover, the   support vector machine-recursive feature elimination (SVM-RFE) was also used to identify the key genes. SVM-RFE is a machine learning algorithm to identify the best variables by deleting SVM-generated eigenvectors (Huang et al., 2014). Finally, we incorporated the genes from LASSO analysis and SVM-RFE analysis to serve as the target genes.

Evaluation of the Immune Cell Infiltration
The gene expression profiles of the AS samples were uploaded to CIBERSORTx online tools and the LM22 signature was used as the reference for 1,000 permutations (Ma and Li, 2018). The proportion of each immune cell was calculated through bulk-mode batch correction and absolute mode algorithm and an immune cells matrix was generated. Only samples with a P value <0.05 were selected for further analysis. The correlation coefficient between immune cells was calculated using the "corrplot" R package, while the Wilcoxon test was used to estimate the difference between subgroups. Moreover, the receiver operating characteristic (ROC) curve was applied to evaluate the accuracy of the key genes.

Data Preprocessing
A total of 151 AS samples and 32 normal samples from GSE20129 and GSE43292 were used as the training dataset, and GSE57691 (N = 19) data set served as the external validation data set. The detailed flowchart was presented in Figure 1.
We then used the combat method to filter batch effects after combining gene expression data from the two training datasets and remained with data for 13,460 genes. We then conducted principle component analysis (PCA) on the two data sets to establish the relationship between the two training datasets.
As shown in Figure 2A, samples from the two independent data sets formed different clusters before batch effect removal, but clustered together after batch effect removal ( Figure 2B). This indicated that cross-platform normalization had successfully removed the batch effect.

NMF Clustering of AS Samples and GSVA Enrichment Analysis
After batch effect was filtered, the gene expression profiles of 151 AS samples were used to carry out NMF cluster analysis.
To identify the potential AS subgroups, we selected the top 1,000 variance genes for the clustering analysis. The cophenetic correlation coefficients was used to determine the optimal k number, with the results showing that k = 2 was the optimal subgroup number (Figures 3A,B). Results of PCA analysis showed that there was significant difference between the two subgroups ( Figure 3C). Consequently, we divided the AS samples into two subgroups, subgroup 1 (N = 93) and subgroup 2 (N = 58). We further used the GSVA enrichment analysis to compare the pathways enriched in the two groups. As shown in Figure 4, the pathways associated with subgroup 1 include p53 signaling pathway, apoptosis, T cell receptor signaling pathway, B cell receptor signaling pathway, NOD_LIKE receptor signaling pathway, and TOLL_LIKE receptor signaling pathway. On the other hand, the pathways enriched in subgroup 2 include fatty acid metabolism, adherens junction, and tyrosine metabolism.

Gene Co-expression Network Construction and Module Enrichment
Weighted gene co-expression analysis analysis was performed to construct a co-expression network to identify potential modules most relevant to the AS subgroups. First, in order to obtain good quality results, genes with the highest 25% of variance in their gene expression profiles were used to carry out WGCNA analysis. As a result, clustering analysis was carried out on 3,365 genes with 151 AS samples using the average linkage method and Pearson's correlation method (Supplementary Figure 1). In the study, the soft threshold power value of β = 5 soft power was selected to obtain a scale-free co-expression network (scale-free R 2 = 0.85). From this analysis six modules were used for further analysis (Figures 5A-C).
In order to investigate the association between modules and the AS subgroups, the module significance (MS) was employed as the gene expression level of the corresponding module to calculate the Pearson correlation coefficient between the module and the subgroups. As shown in Figure 5D, the turquoise module had the highest correlation with subgroup 1 (cor = -0.85) and subgroup 2 (cor = 0.85) and was therefore selected for subsequent analysis. In addition, to systemically investigate the function and pathway of each module, we further performed GO and KEGG pathway enrichment analysis. As shown in Figure 6A, the GO enrichment terms were different among the modules suggesting that these modules play different roles in the subgroups. However, the results of KEGG pathway analysis revealed that similar pathways were enriched in the blue and turquoise module, indicating that the two modules were similar (Figure 6B).

Protein-Protein Network Construction and Key Modules Selecting
After selecting the turquoise module from the WGCNA analysis, we further investigated the role of the genes in the module. We then used the genes to build the PPI network based on STRING database and used the cytoscape software to visualize the network, in which a total of 23,023 edges and 1,835 nodes were incorporated. The network was processed using the MCODE module to identify possible key modules and the top two important modules acquired (Figure 7). Results of GO and KEGG analysis showed that the genes in the key module 1 and key module 2 were significantly enriched in the immune-related pathways and function (Figure 8).

Screening and Verification of Potential Diagnostic Markers
We used the LASSO and SVM-RFE algorithms to identify robust genes from the key modules. The algorithm identified nine genes (Figures 9A,B), while the SVM-RFE algorithm identified 4 genes (Supplementary Figure 2). Three of the genes identified by the two algorithms overlapped including PTGER3, GNAI1, and IGFBP5, suggesting that these were the robust genes for AS ( Figure 9C). The diagnostic efficacy of PTGER3, GNAI1, and IGFBP5 was determined using a validation dataset. When PTGER3, GNAI1, and IGFBP5 were fitted into one variable, the diagnostic efficiency reached a higher level in the validation set (AUC = 1) (Figure 9D), indicating that PTGER3, GNAI1, and IGFBP5 had high diagnostic value for the AS.

Immune Cell Infiltration Analysis
Since results of GO and KEGG enrichment analysis showed that the genes were enriched in immune related pathways, we used the CIBERSORTx online tool to determine the level of immune cell infiltration. Samples were considered to have immune cells infiltration if the P value < 0.05. The distribution of immune cells was showed in a barplot (Figure 10A), and the correlation between immune cells was calculated using the corrplot R package (Figure 10B). Moreover, we further explored the association between immune cells and the AS subgroups. The results of this analysis showed that there was a significant difference in the infiltration level of naïve B cells, CD8 T cells, Tregs, resting NK cells, Monocytes, Macrophages M0, Macrophages M1, and Macrophages M2 between the two subgroups ( Figure 10C). In addition, we found that the three key genes were positively correlated with Tregs and negatively correlated with Macrophages M0 (Figure 11A-C). Interestingly, the expression levels of the three genes were significantly high in the normal tissue compared to AS tissue ( Figure 11D).

DISCUSSION
Recently, AS has been associated with high morbidity and mortality (Ma and Li, 2018). AS is a heterogeneous disease, and its pathogenesis is complex and poorly understood. Therefore, there is a compelling need to identify biomarkers that might be useful for diagnosing AS. Liu et al., using the GSE20129 dataset, revealed five genes including APH1B, JAM3, FBLN2, CSAD, and PSTPIP2 play an essential role in the progression of AS and may be potential biomarkers for diagnosis of AS (Liu et al., 2016). Huang et al. discovered three differentially expressed genes (KDELR3, CD55, and DYNC2H1) can serve as diagnostic and therapeutic targets in AS macrophages through microarray analysis . Wang et al. (2019) constructed a miRNA-TFgene network through microarray data analysis in the AS macrophages. However, these studies only concentrated on the characterization of potential biomarkers, and the samples size of the dataset is small, which leads to low accuracy. A previous review reported that a small sample size could exert an impact on biomarker identification due to the heterogeneity of AS, especially in the mRNA level (Chen and Stewart, 2016). Considering that, we incorporated multiple data sets to identify the robust subgroups of AS to better understand the underlying molecular pathogenesis of AS. Using the NMF algorithm, we successfully divided the AS samples into two subgroups. Results of PCA analysis showed that our transcriptome classification was robust. Moreover, GSVA enrichment analysis showed that subgroup 1 and subgroup 2 were associated with different pathways. Subgroup 1 was enriched in the immune-related pathways, while subgroup 2 was enriched in the metabolism pathways. Taken together, the transcriptome classification of AS was closely associated with specific pathways, which may have important clinical implications for the treatment of AS.
Weighted gene coexpression network analysis is a systematic biological method used to identify key gene modules associated with phenotypic traits (Botía et al., 2017). Compared to thousands of genes focus on differential expression, WGCNA can exploit thousands of most variable genes or all genes to explore the relationship between gene module and clinical traits (Wan et al., 2017;Tang et al., 2019). It can provide a specific measure for clinical prediction of AS diagnosis. Here, the association of gene modules and subgroups was investigated using WGCNA analysis. We identified the turquoise module is an interest module that correlated with the molecular subgroups. Further analysis using the LASSO and SVM-RFE algorithms identified three robust key genes including PTGER3, GNAI1, and IGFBP5 from the module. Among these genes, PTGER3 is usually involved in TLR4/PTGS2 signaling and has been identified as an potential biomarker in atherosclerotic plaque (Ferronato et al., 2018). IGFBP5 (insulin like growth factor binding protein 5) is epigenetically silenced by H3K27me3 in advanced atherosclerotic plaques, suggesting that targeting the H3K27me3/IGFBP5 pathway may provide novel therapeutics for atherosclerosis (Xu et al., 2018). GNAI1 (G Protein Subunit Alpha I1) has been identified as a potential biomarker in multiple tumors, such as ovarian cancer, colorectal cancer, and renal cell carcinoma, but its role in AS has not yet been reported (Liang et al., 2016;Zhan et al., 2018). These results indicated that the three genes play a crucial role in the development of AS. In addition, the three genes PTGER3, GNAI1, and IGFBP5 showed a high performance in the validation data set, indicating that they have potential as diagnostic markers in AS. However, in vitro experiments or in vivo experiments are required to validate these findings. Since the subgroups were enriched in immune related pathways, we further investigated the association between immune cells and the AS subgroups. We discovered that the naïve B cells, T CD8 cells, Tregs, resting NK cells, and Monocytes are highly expressed in subgroup 2 compared to group 1, while Macrophages M0, Macrophages M1, and Macrophages M2 had low expression levels in subgroup 2. Previously studies have reported that macrophages play a central role in the development of atherosclerosis. In each vascular bed, macrophages contribute to the maintenance of the local inflammatory response, propagate plaque development, and promote thrombosis (Moore and Tabas, 2011;Moore et al., 2013). Thus, we speculate that subgroup 1 is more prone to developing advanced AS compared to subgroup 2, and further studies should be implemented to validate the result.
Despite the molecular subtypes and key genes associated with AS that have been identified in this study, several limitations should be elucidated. First, AS may have distinct pathologies in gender and age, which exert an impact on our result. Second, due to the different GPL version platforms, tissue source, and experimental treatment of these datasets, the batch of effect could not be completely removed. Third, the sample size for the current study is relative small and may affect the accuracy of the result. Finally, there was no experimental validation in this study, and further validation should be carried out to validate these findings and speculations.
In summary, we identified two novel subgroups of AS, which provide novel insight into cellular and molecular characteristics of AS. Moreover, we identified three key genes including PTGER3, GNAI1, and IGFBP5 as potential markers for diagnosis and targeted therapy of AS.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. The datasets generated for this study can be found in the https://www. ncbi.nlm.nih.gov/geo/.

AUTHOR CONTRIBUTIONS
YY designed the study and collected the dataset. YC and YZ analyzed the dataset. YY and YC wrote the manuscript. XY and ZX revised the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
This study was supported by grants from the National Natural Science Foundation of China (No. 81801184 to YY and No. 81970383 to YC) and Chongqing Basic Research and Frontier Exploration Project (cstc2018jcyjAX0664).

ACKNOWLEDGMENTS
We appreciated Gene Expression Omnibus (GEO) database for providing the original study data. We thank to the reviewers for their careful work and thoughtful suggestions that have helped improve this paper substantially.