Identification of 67 Pleiotropic Genes Associated With Seven Autoimmune/Autoinflammatory Diseases Using Multivariate Statistical Analysis

Although genome-wide association studies (GWAS) have a dramatic impact on susceptibility locus discovery, this univariate approach has limitations in detecting complex genotype-phenotype correlations. Multivariate analysis is essential to identify shared genetic risk factors acting through common biological mechanisms of autoimmune/autoinflammatory diseases. In this study, GWAS summary statistics, including 41,274 single nucleotide polymorphisms (SNPs) located in 11,516 gene regions, were analyzed to identify shared variants of seven autoimmune/autoinflammatory diseases using the metaCCA method. Gene-based association analysis was used to refine the pleiotropic genes. In addition, GO term enrichment analysis and protein-protein interaction network analysis were applied to explore the potential biological functions of the identified genes. A total of 4,962 SNPs (P < 1.21 × 10−6) and 1,044 pleotropic genes (P < 4.34 × 10−6) were identified by metaCCA analysis. By screening the results of gene-based P-values, we identified the existence of 27 confirmed pleiotropic genes and highlighted 40 novel pleiotropic genes that achieved statistical significance in the metaCCA analysis and were also associated with at least one autoimmune/autoinflammatory in the VEGAS2 analysis. Using the metaCCA method, we identified novel variants associated with complex diseases incorporating different GWAS datasets. Our analysis may provide insights for the development of common therapeutic approaches for autoimmune/autoinflammatory diseases based on the pleiotropic genes and common mechanisms identified.


INTRODUCTION
Autoimmune/autoinflammatory diseases are chronic conditions initiated by loss of immunological tolerance to self-antigens (1). In Europe and North America, these conditions occur with an estimated incidence of ∼90 cases per 100,000 personyears and a prevalence of between 7.6 and 9.4% (2, 3). The chronic nature of such diseases has a significant impact in terms of the utilization of medical care, direct and indirect economic costs and quality of life. In addition, extensive clinical and epidemiologic observations have shown that autoimmune/autoinflammatory diseases are characterized by familial clustering of multiple diseases, epidemiological co-occurrence, overlapped autoantibody level, and the efficacy of therapies across diseases. These observations provide evidence that different autoimmune/autoinflammatory diseases share a substantial portion of their pathobiology and genetic predisposition underlies disease susceptibility (4)(5)(6).
The genetic effect of a single nucleotide polymorphism (SNP) or gene on two or more phenotypic traits can be described as pleiotropic and the outcome is genetically correlated. In general, this concept concerns across-trait architecture (7). To date, 186 statistically significant susceptibility loci have been identified in genome-wide association studies (GWAS) of autoimmune/autoinflammatory diseases (2, 8). These were confirmed in subsequent studies, with more than half of the susceptibility genes found to be shared by at least two distinct autoimmune/autoinflammatory diseases (9)(10)(11)(12). For example, PTPN22 c.1858C>T (rs2476601) was identified as a susceptibility gene in independent GWAS across multiple autoimmune/autoinflammatory diseases (13). In addition, there is evidence that loci associated with predisposition to one disease can have effects on the risk of a second disease, although the risk alleles for the two diseases may not be the same (9). However, the evidence for specific shared risk variants is modest, and consequently the genetic mechanisms underlying the patterns of disease aggregation remain unclear. It is, therefore, important to identify shared genetic risk factors acting through common biological mechanisms and to assess the overlapping pathophysiological relationships among autoimmune/autoinflammatory diseases using effective analytical approaches.
GWAS is a standard univariate approach used to investigate and identify potentially causal or risk-conferring genetic variants associated with common human diseases at the level of individual measurements (14,15). GWAS, especially those with large sample size, and meta-analysis of multiple studies, have a dramatic impact on susceptibility locus discovery and in addition, highlight and extend the previously observed commonality among autoimmune/autoinflammatory diseases. However, with the identification of millions of SNPs and a growing number of phenotypes, this univariate approach has been used to detect complex genotype-phenotype correlations with limited success. Furthermore, studies of statistical methods have confirmed that multivariate analysis provides higher statistical power for the detection of unexplained heritability due to the consideration of correlations not only among multiple SNPs, but also among different traits or diseases (16). In previous studies, bivariate analysis has been used to investigate genetic risk factors associated with complex traits, and multivariate analysis of this aspect of complex diseases is rare (17,18). Therefore, multivariate analysis of the publicly available GWAS summary statistics in particular is highly essential and relevant to the identification of pleiotropic genes.
Cichonska et al. (19) recently employed meta-analysis using canonical correlation analysis (metaCCA) to allow multivariate representation of both genotypic and phenotypic variables based on the published univariate GWAS summary statistics. This new approach has been applied to identify potential pleiotropic genes associated with lipid-related measures, psychiatric disorders, and chronic diseases (19)(20)(21)(22). In this study, the genetic pleiotropyinformed metaCCA method was used to identify shared variants and pleiotropic effects in seven autoimmune/autoinflammatory diseases: celiac disease (CEL), inflammatory bowel disease (IBD), which includes Crohn's disease (CRO) and ulcerative colitis (UC), multiple sclerosis (MS), primary biliary cirrhosis (PBC), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE) and type 1 diabetes (T1D). In addition, gene-based association analysis was used to refine pleiotropic genes. GO term enrichment analysis and protein-protein interaction network analysis were applied to explore the potential biological function of the identified genes.

GWAS Datasets
All the GWAS summary statistics of seven autoimmune/autoinflammatory diseases investigated in the present study were downloaded from ImmunoBase (website: https://www.immunobase.org/), which is a web-based resource focused on the genetics and genomics of immunologicallyrelated human diseases. The CEL data were obtained from a second-generation GWAS of 4,533 cases and 10,750 control subjects including 523,402 SNPs (23). The association summary statistics of IBD, including 9,735,446 imputed SNPs, were obtained from a meta-analysis with a total sample size of 59,957 subjects (24). The MS dataset consisted of 464,357 genotyped or imputed SNPs from a collaborative GWAS involving 9,772 cases and 17,376 controls of European descent collected by 23 research teams from 15 different countries (25). The PBC dataset including 1,134,141 SNPs were obtained from a metaanalysis (2,764 cases and 10,475 controls) and an independent cohort study (3,716 cases and 4,261 controls) (26). The RA dataset was also obtained from a GWAS meta-analysis of 5,539 autoantibody-positive RA cases and 20,169 controls of European descent, followed by replication in an independent set of 6,768 RA cases and 8,806 controls, which included a total of 8,254,863 SNPs (27). The SLE dataset comprised 7,219 cases and 15,991 controls of European ancestry, yielding a total of 7,915,250 SNPs from a new GWAS, meta-analysis of published GWAS and a replication study (28). The T1D dataset consisting of 8,781,607 SNPs was extracted from a Mendelian randomization analysis with 5,913 T1D cases and 8,828 reference samples (29). All the samples in the present study came from Northern and Western European ancestry (CEU) population. The summary statistics have been undertaken genomic control in the individual study or meta-analysis. Furthermore, the ImmunoBase was searched using a global cutoff for minor allele frequency (MAF) <99% for all datasets. Further details of the inclusion criteria and quality control methods used in the different GWAS studies are described in the original publications (23-29). It should be noted that the summary statistics of GWAS or meta-analysis were required to include P-values, regression coefficients and standard error.

Data Preparation
When dealing with the various datasets, we first combined the seven summary statistics for the common SNPs studied in all datasets. The result of 324,031 overlapping SNPs was then selected to for multivariate analysis. Second, gene annotation was completed according to the 1,000 Genome datasets (website:https://www.cog-genomics.org/static/ bin/plink/glist-hg19) using PLINK1.9. Third, the linkage disequilibrium (LD)-based SNP pruning method was used to remove SNPs with large pairwise correlations. This SNP pruning method was performed using 50 SNPs as a window where the LD was calculated between each pair of SNPs. The MAF was also used as a criterion in the SNP pruning method to remove the SNPs with smaller MAFs for pairs with R 2 > 0.2. Following this initial removal of SNPs in high LD, each sliding window of five SNPs forward and the process was repeated until there were no pairs of SNPs with MAFs > 0.2 (30). All datasets were pruned using the HapMap 3 CEU genotypes as a reference panel (website: http://www.sanger.ac.uk/resources/downloads/human/ hapmap3.html). Following this pruning procedure, 41,274 SNPs located in 11,516 gene regions remained and were included in the metaCCA analysis. Finally, the regression coefficient beta was normalized before conducting the metaCCA analysis in instance when the individual-level dataset genotype and phenotype matrices were not standardized. Standardization was achieved subsequently according to the following equation: where SE gp is the standard error of β gp , as given by the original GWAS result, g is the number of genotypic variables, p is the number of phenotypic variables, and n is the sample number for each disease.

MetaCCA Analysis
MetaCCA is an extension of the CCA method, which requires a full covariance matrix ( ), consisting of four covariance matrices, which can be obtained using the following formula: where XY is a cross-covariance matrix between all genotypic and phenotypic variables, ∧ XX is a genotypic correlation structure between SNPs, ∧ YY is a phenotypic correlation structure between traits, and T XY is the transposition of XY (19).
XY is constructed as the normalized regression coefficient βgp: In metaCCA, ∧ XX is calculated using a reference database representing the study population, such as the 1,000 Genomes database, or other genotypic data available for the target population. More accurate results will be obtained if ∧ XX is estimated from the target population or a population of the same ethnicity instead of interracial populations (19). In our study, We next determined whether the full covariance matrix was positive semidefinite (PSD); if not, an iterative procedure was used to shrink the full covariance matrix until became PSD. In the next analysis, the PSD of the full covariance matrix was entered into the CCA framework to determine the final genotype-phenotype association (19), where the correlation between genotype and phenotype is defined as the canonical correlation, r (31).
In this study, two types of multivariate analysis were evaluated and the conservative Bonferroni correction method was used as the threshold for nominal significance. If the P-value of the canonical correlation r of any SNP was smaller than 1.21 × 10 −6 (=0.05/41,274), it was deemed significantly associated with the seven diseases. Second, multivariate SNP-multivariate phenotype association analysis at the gene level was performed to identify any potential pleiotropic gene. Similarly, genes with a P-value of canonical correlation smaller than 4.34 × 10 −6 (=0.05/11,516) were identified as potential pleiotropic genes associated with multiple autoimmune/autoinflammatory diseases.

Gene-Based Association Analysis
VEGAS2 (Versatile Gene-based Association Study−2) is a method of gene-based association analysis used to calculate FIGURE 1 | Manhattan plot of -log 10 (metaCCA) values for univariate SNP-seven autoimmune/autoinflammatory diseases analysis. The red line marks the -log 10 (metaCCA) value of 5.92 corresponding to P < 1.21 × 10 −6 . If the -log 10 (metaCCA) value of a certain SNP was >5.92, this SNP was identified as a pleiotropic SNP for seven diseases.
the correlation of multiple SNPs in a gene region with one phenotype using original GWAS summary statistics (32). This method has been widely applied in the genetics studies and has been shown to offer higher sensitivity and lower false positive rates compared to other genebased approaches (33). In the present study, the VEGAS2 method was combined with the metaCCA to refine the genes identified using the metaCCA by computing the gene-based P-value for each specific disease. This analysis was performed using https://vegas2.qimrberghofer.edu.au/, with a threshold of 1.00E-06.

GO Term Enrichment Analysis
In clarifying polygenic associations, it is useful to determine whether or not the implicated genetic variants occur in genes involved in a biological pathway (34). GO term enrichment analysis classifies gene functions based on three main categories: molecular function, cellular component and biological process. We conducted GO term enrichment analysis to determine which GO term were over-or underrepresented). All significant genes re-identified by VEGAS2 in our study were annotated and enriched (website: http:// amp.pharm.mssm.edu/Enrichr/). An adjusted P < 0.05 in the enrichment analysis was considered to indicate nominal significance (35).

Protein-Protein Interaction Network
Protein-protein interactions (PPIs) are crucial for all biological processes and the networks provide many new insights into protein function (36,37). In order to detect interactions and associations of all putative pleiotropic genes, PPIs analysis were conducted by searching the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database (website: http://string-db.org/), which comprises known and predicted associations from curated databases or high-throughput studies, in addition to other associations derived from text mining, co-expression, and protein homology (38).

Potential Pleiotropic SNPs and Genes Identified by metaCCA Analysis
After gene annotation and SNP pruning, 41,274 SNPs located in 11,516 gene regions were available for the metaCCA analysis. The size of SNP representation among the genes ranged from 1 to 213 SNPs, and the median number of SNPs in each gene was 3.72. For the univariate SNP-multivariate phenotype analysis, 4,962 SNPs reached the Bonferroni corrected threshold (P < 1.21 × 10 −6 ), and the canonical correlation r between each SNP and phenotype ranged from 0.0372 to 0.6586. The results are presented in a Manhattan plot in Figure 1. For the multivariate SNP-multivariate phenotype analysis, 1,044 genes with a significance threshold of P < 4.34 × 10 −6 were identified as potential pleiotropic genes. The canonical correlation r between genotype and phenotype ranged from 0.0322 to 0.5899.

Refining the Pleiotropic Genes by Gene-Based Association Analysis
The VEGAS2 algorithm based on original GWAS summary statistics was used to identify association of one gene with a specific disease. In the gene-based association analysis, 19 genes were identified for CEL, 111 for IBD, 16 for MS, 20 for PBC, 19 for RA, 20 for SLE, and 33 significant genes for T1D (P = 1.00E-06).
By screening the results of gene-based analysis P-values, we identified 67 putative pleiotropic genes yielding statistical significance in the metaCCA analysis and found to be associated with at least one disease in the VEGAS2 analysis. In particular, 17 genes were found to be associated with more than one disease in the original GWAS. The results of the metaCCA and VEGAS2 analysis are summarized in Table 1.
Specifically, 27 of these 67 putative pleiotropic genes had previously been reported to be associated with more than one of these seven diseases. Of these 27 confirmed pleiotropic genes, six genes (ADAD1, CIITA, CLEC16A, IL23R, MAGI3, and PTPN2) were associated with more than one disease in the VEGAS2 analysis of the original GWAS summary statistics. Of the 40 novel putative pleiotropic genes detected,  16 were previously reported to be associated with only one autoimmune/autoinflammatory disease. EFR3B and RBM17 were reported to be associated with T1D only in published studies but were shown to be associated with multiple diseases in the VEGAS2 analysis. The remaining 24 significant genes were implicated as candidate novel pleiotropic genes for these seven diseases. More significantly, nine genes (C1orf141, CALU, CCDC136, FGF2, LOC101927051, MAP4K4, MPZL3, PAPOLG, and SH3BP1) were associated with more than one disease in the VEGAS2 analysis, although they had never been reported to be associated with any autoimmune/autoinflammatory disease. The detailed features of 67 significant pleiotropic genes are shown in Table 2.

GO Term Enrichment Analysis
GO enrichment analysis revealed that the biological functions of these pleiotropic genes were involved mainly in the metabolism of lipids. When the 67 pleiotropic genes associated with autoimmune/autoinflammatory diseases were used as the gene sets for the GO term enrichment analysis, several functional terms were identified as being enriched. For the GO biological process and molecular function, there were 63 and 5 terms were identified to be significantly enriched in the development of autoimmune/autoinflammatory diseases, respectively. Details of the top five significant GO terms are shown in Table 3.

Protein-Protein Interaction Network Analysis in String
The 67 putative pleiotropic genes identified were retrieved from the STRING database. Of these, 63 genes were clearly enriched in two confirmed clusters: JAK2 and MAP3K7 clusters (Figure 2). Three of the novel putative pleiotropic genes detected, FGF2, IL22RA2, and ITGAM, are involved in the JAK2 cluster, which participates in various processes such as cell growth, development, differentiation or histone modifications, and mediates essential signaling events in both innate and adaptive immunity. Three other novel genes, MAP4K4, PRKAA1, and TNIK, were involved in the MAP3K7 cluster, which acts as an essential component of the MAP kinase signal transduction pathway and plays a role in the response to environmental stress and cytokines such as TNF-alpha.

DISCUSSION
In the present study, a novel analytical approach-metaCCA-was used to explore the common genetic variants associated with autoimmune/autoinflammatory diseases by combining seven available independent GWAS or meta-analysis summary statistics. A total of 67 putative pleiotropic genes were successfully identified after verification using gene-based analysis. In particular, 27 confirmed genes were identified as pleiotropic in previous different types of studies and were validated in the present study, 16 novel pleiotropic genes were previously reported to be associated with one autoimmune/autoinflammatory disease, and 24 candidate novel pleiotropic genes that had never been reported to be associated with any autoimmune/autoinflammatory disease. The improved detection of pleiotropic genes and the associated biological pathways may provide novel insights into the shared genetic factors involved in development of autoimmune/autoinflammatory diseases.  Among the 27 confirmed pleiotropic genes, 6 genes (ADAD1, CIITA, CLEC16A, IL23R, MAGI3, and PTPN2), which play an important role on the pathomechanism of autoimmune/autoinflammatory diseases, were shown to be associated with more than one autoimmune/autoinflammatory disease not only in the literature review, but also in the VEGAS2 analysis using original GWAS summary statistics. For example, common genetic variants of CLEC16A, also known as C-type lectin-like domain family 16A, had been reported to be associated with CEL, IBD, MS, PBC, and T1D (10). As the non-HLA genome-wide significant risk gene, CLEC16A is essential for autophagosomal growth and autophagy processes, which are of major importance for proper immune regulation, including regulation of inflammasome activation (39,40). Moreover, recent data from murine studies and our PPIs analysis indicated that CLEC16A plays a key role in beta cells functions by regulating mitophagy/autophagy and mitochondrial health (41). PTPN2 is another important and confirmed pleiotropic gene associated with several autoimmune/autoinflammatory diseases we studied (10). The GO term enrichment analysis results suggested that PTPN2 encodes T-cell protein tyrosine phosphatase, acting as a negative regulator of the JAK/STAT signaling pathways downstream of cytokines and playing a prominent role in T-cell activation, signaling and/or effector function, which may represent potential targets for the pharmacotherapy of autoimmune diseases. In addition, Mei et al. (42) also showed that, in the Northeastern Chinese population, PTPN2 polymorphisms are associated with psoriasis, which is another chronic immune-mediated disease with a complex etiology. Sixteen novel putative pleiotropic genes detected in this study had previously been confirmed to be associated with one form of autoimmune/autoinflammatory disease. Interestingly, EFR3B and RBM17 had been reported to be associated only with T1Din published studies, but were confirmed to be associated with other diseases in the VEGAS2 analysis (13,(43)(44)(45). EFR3B is an associated gene located in 2p23, which probably acts as the membrane-anchoring component and is involved in responsiveness to G-protein-coupled receptors. Although Bradfield et al. (13) have confirmed that EFR3B is an associated loci and protein-protein interaction network analysis also provided some protein information, it is still unclear whether this role is direct or indirect. RBM17, which is involved in the regulation of alternative splicing and the utilization of cryptic splice sites, is essential for survival and cell maintenance. Fortunately, genetic and serologic data suggest that the inherited altered genetic constitution located between IL2RA and RBM17 may predispose to a less destructive course of RA (46,47). Although the 14 remaining genes were identified associated with one form of autoimmune/autoinflammatory disease in the literature review and VEGAS2 analysis, further experimental studies are required to confirm their role as pleiotropic genes associated with autoimmune/autoinflammatory diseases. CUL2, which is associated with response to the hypoxic environment and activation of tumor immunity, has been identified in association with CRO in nine independent case-control series (48). Zhang et al. (49) suggested that human immunodeficiency virus type 1 and simian immunodeficiency virus viral infectivity factor form a CRL5 E3 ubiquitin ligase complex that suppresses virus restriction by host APOBEC3 proteins, and that CUL2 eventually suppresses this pathway and increases the risk of autoimmune/autoinflammatory diseases (50).
Significantly, nine genes (C1orf141, CALU, CCDC136, FGF2, LOC101927051, MAP4K4, MPZL3, PAPOLG, and SH3BP1) were found to be associated with more than one disease in the VEGAS2 analysis, although these genes had never been reported to be associated with any autoimmune/autoinflammatory disease in previous GWAS. MAP4K4 has been enriched in several GO terms including MAP kinase c kinase activity (GO:0008349), which is an important contributor to the risk of developing type 2 diabetes mellitus in a Chinese Han population (51). In addition, Aouadi et al. (52) demonstrated that orally delivered small interfering RNA targeting macrophage MAP4K4 suppresses systemic inflammation, thus implicating this technology as a new strategy to attenuate inflammatory responses in human disease. C1orf141 is another significant candidate novel pleiotropic gene found to be associated with IBD and PBC in our study. It has been recently shown that C1orf141 is a susceptibility variant in psoriasis, a chronic inflammatory hyperproliferative cutaneous disease (53). Further studies are required to confirm our findings and provide a detailed description of each candidate novel gene and the associated pathomechanism.
Systematic and comprehensive searches for pleiotropic genes and their effects are essential for an understanding of the mechanisms underlying the development of autoimmune/autoinflammatory diseases (4). Compared to the univariate GWAS analysis based on a cross-sectional population, our study was cost-effective and reliable, not only due to the increased sample size achieved by integrating the summary statistics of seven large GWAS. This approach also increased the statistical power of the study and provided a wealth of information by the simultaneous analysis of multiple related autoimmune/autoinflammatory diseases. However, because of a lack of detailed original individual measures, we were unable to determine whether the effects of pleiotropic genes on risk of these diseases are direct or indirect. Alternative approaches and experimental studies are required to validate these novel genes identified in this study.
In summary, we have provided convincing evidence of the existence of 27 confirmed pleiotropic genes and highlighted 40 novel pleiotropic genes associated with autoimmune/autoinflammatory diseases by performing a systematic multivariate analysis of the open GWAS data using metaCCA. Furthermore, we have illustrated potential biological functions of these pleiotropic genes and our results contribute to a better understanding of common genetic mechanisms, and eventually the development of improved diagnosis, prognosis and targeted therapies.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article.

AUTHOR CONTRIBUTIONS
XJ, YY, and XS conceived and designed the study, analyzed the data, and wrote the manuscript. YY and FX performed the methodology and data curation. XJ, NS, YF, JT, and YL analyzed the data. XS, WW, CS, and HD contributed to the critical revision of the manuscript.