Unsupervised Clustering Reveals Distinct Subtypes of Biliary Atresia Based on Immune Cell Types and Gene Expression

Background Biliary atresia (BA) is a severe cholangiopathy of early infancy that destroys cholangiocytes, obstructs ductular pathways and if left untreated, culminates to liver cirrhosis. Mechanisms underlying the etiological heterogeneity remain elusive and few studies have attempted phenotyping BA. We applied machine learning to identify distinct subtypes of BA which correlate with the underlying pathogenesis. Methods The BA microarray dataset GSE46995 was downloaded from the Gene Expression Omnibus (GEO) database. Unsupervised hierarchical cluster analysis was performed to identify BA subtypes. Then, functional enrichment analysis was applied and hub genes identified to explore molecular mechanisms associated with each subtype. An independent dataset GSE15235 was used for validation process. Results Based on unsupervised cluster analysis, BA patients can be classified into three distinct subtypes: Autoimmune, Viral and Embryonic subtypes. Functional analysis of Subtype 1 correlated with Fc Gamma Receptor (FCGR) activation and hub gene FCGR2A, suggesting an autoimmune response targeting bile ducts. Subtype 2 was associated with immune receptor activity, cytokine receptor, signaling by interleukins, viral protein interaction, suggesting BA is associated with viral infection. Subtype 3 was associated with signaling and regulation of expression of Robo receptors and hub gene ITGB2, corresponding to embryonic BA. Moreover, Reactome pathway analysis showed Neutrophil degranulation pathway enrichment in all subtypes, suggesting it may result from an early insult that leads to biliary stasis. Conclusions The classification of BA into different subtypes improves our current understanding of the underlying pathogenesis of BA and provides new insights for future studies.


INTRODUCTION
Biliary atresia (BA) is a severe neonatal disease that often ends with cholestasis and progressive hepatic failure. Although BA is a rare disease, with variable incidence ranging from 1 in 5,000 to 1 in 19,000 live births per year, it is the leading cause of end-stage liver disease in children and the leading indication for pediatric liver transplantation worldwide (1). The development of the Kasai hepatic portoenterostomy (HPE) and liver transplantation have substantially improved patient outcomes. However, survival rates have been associated with the time-to-surgery and early diagnosis; even after successful HPE, progression to liver injury and fibrosis is still observed in 40%-50% of patients (1). After liver transplantation, the management of long-term complications and immune suppression is also a major clinical challenge (2,3). Elucidating the pathogenesis underlying the extent and the source of the disease phenotype variance is key to improve current BA management and develop novel therapeutic strategies.
Pathological jaundice and acholic stools in the first few months of life are the clinical hallmarks of BA. Depending on time of disease onset, BA patients were initially divided into "embryonic/developmental" BA (<20%) and "perinatal/ acquired" BA (>80%) (4). The presence of associated congenital anomalies such as polysplenia suggests a causative insult early in gestation and diagnosis of Embryonic BA is made. The absence of associated anomalies suggests that perinatal BA may be caused by a perinatal insult such as a viral infection. As of 2012, variants such as cyst-associated and cytomegalovirus (CMV)-associated BA have been added to the initial classification, based on presence of nonhepatic malformations, time of disease onset and morphological or molecular analysis of the hepatobiliary system (1,4). Studies on the epidemiology, pathology and clinical forms of BA implicate numerous factors in disease pathogenesis, including: defective embryogenesis, abnormal fetal or prenatal circulation, genetic abnormalities, environmental toxins, viral infection, abnormal inflammation and autoimmunity, and susceptibility factors (1,2). These seemingly different factors can be grouped into the broadly defined categories of abnormal morphogenesis, environmental factors and inflammatory dysregulation, which could interplay to produce a particular phenotype of BA. In this regard, analysis of gene expression patterns in BA can improve understanding of clinical heterogeneity and pathogenesis, determine disease prognosis, assist in clinical decision making, and may be the cornerstone of the design of personalized trials to block disease progression.
A growing body of evidence suggests the presence of activated immune cells in liver tissue of BA patients at the time of diagnosis, and an association with autoimmunity, amplification of epithelial injury, and bile duct obstruction (4)(5)(6)(7)(8). With substantial developments made in microarray technology and bioinformatics analysis, databases including Gene Expression Omnibus (GEO) have been increasingly used for enrichment analysis and identification of differential expressed genes (DEGs). Herein, we collected the BA microarray datasets from GEO, analyzed the immune-cell composition according to gene expression and tried to identify subtypes of BA at genetic and immune cell level. In this study, we applied a machine learning approach: unsupervised clustering on immune cells and gene expression to reveal the heterogeneity among BA patients and identified a de novo classification for BA. Then, validation was carried out using an independent BA dataset. Finally, we explored the mechanisms associated with each BA subtype, by identifying related immune cells and the hub genes.

Data Collection
Two datasets [GSE46995 (7) and GSE15235 (9)] were downloaded from the GEO database. GSE46995 used as a training dataset included 64 BA liver tissue samples, 14 diseased control samples (intrahepatic cholestasis) and 7 normal control samples, while GSE15235 consisting of 47 BA liver tissue samples was used for validation.

Cluster Analysis
R package "CancerSubtypes" (10) was used for identifying, validating and visualizing molecular cancer subtypes from multi-omics data. Feature selection was based on the top 5000 most variance features. The ensemble Similarity Network Fusion and Consensus clustering algorithm (SNF-CC) was used to observe gene expression patterns across BA patients and cluster identification. Silhouette width index was calculated to assess accuracy and fitness of the clustering assignment. Silhouette width values range between -1 and 1. Values close to 1 correspond to well-defined clustering results. The xCell (11) function of "Immunedeconv", another R package, was used for analyze infiltrating immune cells. xCell can show the relative enrichment of predetermined gene profiles combinations and perform cell type enrichment analysis from gene expression data for 64 immune and stroma cell types. The "Complexheatmap" package was then used for draw heatmaps.
Differential Expression Analysis of Genes and Differences Among Different Subtypes significant. The representative DEGs were identified by a Venn plot method. For example, to identify the DEGs of subtype 1, a Venn plot of (subtype 1 + subtype 2) v/s (subtype 1 +subtype 3) was drawn to obtain the overlap of DEGs. p<0.05 represented statistically significant difference. Screening of immune and stroma cells for each subtype was performed by "tableone" and "t-test" and p<0.05 was used as the selection criteria. The screening method used to identify each BA subtype's representative immune and stroma cells was the same as screening DEGs.

Functional Enrichment Analysis and Protein-Protein Interaction Network Analysis
We performed enrichment analysis and Reactome pathway analysis using R packages "clusterProfiler" and "ReactomePA" to provide a functional interpretation of the representative genes. Then, the protein-protein interaction (PPI) network analysis was constructed by STRING (https://string-db.org/). The hub genes were analyzed, and the PPI network was visualized by cytoscape plug-in Molecular Complex Detection (MCODE) and CytoHubba of Cytoscape software version 3.7.2. MCODE was used to screen the modules of the PPI network identified by degree cutoff =2, node score cutoff =0.2, K core =2 and a maximum depth=100. Then, the importance of these hub genes was calculated by cytoscape plug-in CytoHubba, using the topological algorithm "degree". In the PPI network, genes shown in red indicated more frequent interactions with other proteins.

Subtype Identification of Biliary Atresia
A flowchart of our study is illustrated in Figure 1. First, we downloaded GEO dataset GSE46995. Then we used xCell to assess the relative enrichment of gene profiles resulting in different immune and stroma cell types for each group (Figure 2A). We found significant differences in cell types among the three groups (Supplementary Table 1). In addition, the top 5000 most variance genes were selected by the most variance method. The optimal number of clusters (K) was generated according to "wss" (for total within sum of square) and visualized by factoextra package (K=3, Figures 2B, C). By applying an ensemble method of SNF-CC, BA datasets were clustered into three distinct subtypes (Figures 2D, E) with a silhouette width value of 0.91, suggesting a high degree of cohesion and separation among each subtype in the silhouette plot ( Figure 2D). Validation was carried out using dataset GSE15235: 47 samples of BA were similarly clustered into three distinct subtypes based on gene expression and immune cells (Supplementary Figure 1A), with a silhouette width value of 0.89 (Supplementary Figure 1B). Gene Set Variation Analysis (GSVA) (12) was then applied to detect subtle pathway activity changes within the GSE46995 dataset, which similarly classified the samples into 3 subtypes (Supplementary Figure 2).   Figure 3A). The expression of these representative genes in three subtypes was then illustrated on a heatmap ( Figure 3B). In addition, subtypes 1, 2, and 3 consisted of 10, 14 and 24 representative cells, respectively ( Figure 3C), with the representative cells illustrated on a heatmap ( Figure 3D). The

Function and Pathway Enrichment Analyses
To explore possible pathological mechanisms in each BA subtype, we extracted the representative genes and performed Molecular Function (MF) enrichment and Reactome pathway analysis. MF analysis of genes from Subtype 1 showed correlation with GDP binding and fatty acid binding, while reactome analysis showed enrichment in neutrophil degranulation, rRNA processing and Fc Gamma Receptor (FCGR or FcgR) A B D C  activation pathways ( Figure 4A). MF analysis of the genes from Subtype 2 showed correlation with cell adhesion molecule and cytokine receptor activity, while Reactome analysis showed enrichment in neutrophil degranulation, signaling by interleukin such as IL-4, IL-13, IL-10, and response to metal ions pathways ( Figure 4B). MF analysis of genes from Subtype 3 showed correlation with cell adhesion molecule binding and structural constituent of ribosome cytokine binding. Reactome analysis showed enrichment in neutrophil degranulation, signaling and regulation of expression of Roundabout (Robo) receptors pathways (see Figure 4C).

Hub Genes of Different Subtypes
We further identified hub genes from representative genes of each subtype and PPI networks were constructed (Supplementary Table 2  ITGB2, CYBB, CDK1, CCNE2 and MAD2L1. The relationship between these pathways and representative genes was also visualized in Supplementary Figure 3. In Subtype1, RPS27A and UBE2D1 were connection points between the peroxisomal protein import pathway and rRNA processing pathway (Supplementary Figure 3A). FCGR2A was a connection between the FCGR activation pathway and the neutrophil degranulation pathway (Supplementary Figure 3A). In Subtype 2, hub genes IL-10, IL-1B and ANXA1 were enriched in pathways including signaling by interleukin such as IL-4, IL-13 and IL-10 (Supplementary Figure 3B). In Subtype3, hub genes RPS29, RPLP0 and RPS2 were at the junction of multiple pathways such as signaling and regulation of expression of Robo receptors, SRP-dependent cotranslational protein targeting membrane translation initiation complex formation, etc. (Supplementary Figure 3C). Since the "neutrophil degranulation" pathway was present in all 3 BA subtypes by Reactome Analysis, the genetic information associated with neutrophil degranulation was downloaded from the website: https://reactome.org/. The intersection of the above genes and representative genes in the three subtypes resulted in 35 overlapping genes. The expression of these 35 genes was significantly different in all BA subtypes (Supplementary Table 3). The differential expression of these 35 genes in BA and non-BA group was also analyzed (Supplementary Figure 4). 5 hub genes TYROBP, FCGR2A, CD68, C3AR1 and ITGB2 were then included. The differential expression of these five hub genes in the healthy subject group and the three BA subtypes was analyzed using R package "ggpubr". As shown in Supplementary Figure 5, except for Subtype 3, the expression of hub genes CD68, TYROBP, FCGR2A, C3AR1 and ITGB2 was significantly different between the healthy group and Subtypes 1 and 2.

DISCUSSION
Herein, we provide new insights into the mechanisms underlying etiological heterogeneity and increase our understanding on the role of immune system in BA. By applying gene expression microarray and immune cell composition analysis, BA can be classified into three subtypes: Autoimmune, Virus-related and Embryonic subtypes.
Subtype 1 was associated with tissue cells including HSC, B cell, cDC, macrophages, platelets and hepatocytes. The transformation and proliferation of HSCs from fat-storing cells to myofibroblasts in BA have been reported to play an important role in hepatic fibrogenesis and BA pathogenesis, leading to increased extracellular matrix and progressive scarring (13). Immune tolerance in the liver is mediated by specialized antigen-presenting cells (APCs), including both HSC and DC, amongst others (14). Antigen-specific B cells responsible for initiation the humoral response play a major role in BA immunopathogenesis (4). B cells can act as APCs that on their own, activate T cells and the Th-1 pathway (15). Amy G. Feldman et al. found that B cells played a major role in the development of biliary obstruction in a Rhesus rotavirus (RRV)induced mouse model of BA (15). cDCs are also essential for T cell activation and can present native antigens to B cells (16). In addition, reactome analysis showed enrichment in FcgR activation and hub gene Fc Fragment Of IgG Receptor IIa (FCGR2A) pathways. FCGR2A, formerly known as CD32, plays an important role in the clearance of immune complexes by phagocytic cells such as macrophages and neutrophils. Prior studies have demonstrated strong evidence that FCGR2A is not only involved in immune system activities but also with autoimmunity (17,18). Based on the above findings, we can infer subtype 1 may be related to an autoimmune response targeting bile ducts which result into BA.
Interestingly, functional analysis of the representative genes of subtype 2 showed correlation between the DEGs and immune receptor activity, cytokine receptor activity and cell adhesion molecule binding by MF analysis. KEGG pathway analysis revealed significant viral protein interaction with cytokines and cytokine receptor (Supplementary Figure 6). Based on the above findings, we can infer that Subtype 2 may be related to infection and cytokine activity. Furthermore, the distinct immune cells in Subtype 2 included innate immune cells such as macrophages M1, neutrophils and monocytes. Intriguingly, CMV employs M1-associated markers and chemokines to promote the proinflammatory activation of infected monocytes/macrophages (19). CMV-induced pathologies are known to be mediated predominately by infected monocytes, which serve as a permissive system and are a long-term reservoir for the virus (20). In our study, CD68 was identified as the hub gene for Subtype 2. De facto, CD68 is the most extensively used marker, expressed on all macrophages (21). H Kobayashi et al. found macrophage-associated antigens (CD68) were associated with poor prognosis in BA patients (22). While much emphasis has been placed on CMV, other viruses such as the Epstein-Barr virus, reovirus and rotavirus have previously been documented in BA development. Since no clinical data such as CMV titer was available, we could only infer that Subtype 2 was related to perinatal viral infection. Moreover, the strong correlation with signaling by interleukin including IL-4, IL-13 and IL-10 was also found. This raises the possibility that the modulation of interleukin signaling may potentially represent a novel therapeutic approach to block progression of the disease in this particular subtype.
Reactome analysis of Subtype 3 showed enrichment within neutrophil degranulation, signaling and regulation of expression of Robo receptors pathways. The Robo family encodes transmembrane receptors that regulate axonal guidance and cell migration. Robo4 is an endothelial-specific receptor that participates in endothelial cell migration, proliferation, and angiogenesis as well as the maintenance of vasculature homeostasis (23). Hub gene Cyclin dependent kinase 1 (CDK1) has been primarily identified as a key cell cycle regulator in both mitosis and meiosis (24), and CCNE2 (cyclin E2) negatively regulates CCNE1 activity and S-phase progression during liver regeneration (25). A correlation between ITGB2 (CD18 protein) polymorphism and BA pathogenesis was substantiated by a prior study, suggesting the role of genetic predisposition in this particular subtype (26). Therefore, it can be inferred that subtype 3 may be related to embryonic BA as developmental defects of the liver and biliary tract are caused by genetic influences (1).
Interestingly, Reactome pathway analysis showed enrichment within the neutrophil degranulation pathway in all BA subtypes. Nowadays, much controversy surrounds the role of neutrophils in BA. The wide distribution of neutrophils in BA was previously documented by S. Changho (27). Intriguingly, further studies documented release of neutrophils following bile duct ligation (28,29), suggesting that bile duct obstruction, whether due to edema induced by a viral infection or to other mechanisms, is the initial triggering mechanism that causes chemotaxis of neutrophils. Interestingly, the expression of hub genes (C3AR1 and ITGB2) in subtype 3 was not significantly different from the healthy group (Supplementary Figure 5). This may be due to other representative genes that may play a more important role in neutrophil degranulation in this particular subgroup. A histological approach was initially adopted by Moyer et al. (9) to classify 47 BA patients into three groups: inflammation, fibrosis and unclassified groups. However, after undergoing molecular profiling, 4 cases were still left unclassified. Another focus of the study was to correlate each group with prognosis, for example the fibrosis group was associated with a significantly lower transplantfree survival than the inflammation group. However, the authors conceded that temporal differences in age at diagnosis for the molecular groups raise the possibility that the two gene expression signatures reflect two distinct but interrelated stages of the disease. Indeed, elements of inflammation and fibrosis are often visible simultaneously at BA diagnosis. Inflammatory signatures may be prominent in early stage of disease and can transition to fibrosis at more advanced disease stages. Herein, a different methodology was adopted since we aimed to explore etiological heterogeneity in BA by analyzing genetic signatures and characteristics of the microenvironment in BA.
One significant limitation of our study is our inability to correlate each BA subtype with prognosis. No survival data was available from the downloaded dataset. In our study, a high stroma score was obtained for Subtype 2 (viral-related). Clinical studies have consistently shown poorer prognosis of CMV positive patients after HPE (30,31). Interestingly, worse survival rates have also been associated with tumors with high stroma scores (32).
Nonetheless, hospitals performing more than five cases per year had better surgical outcomes compared with hospitals performing a smaller number of HPE (33). At four years, significant differences in overall survival were found when HPE was done at a center with a higher caseload (34). Such reports suggest the difficulty in obtaining accurate prognosis data, especially when it comes to rare diseases. In addition, the use of xCell to characterize the tumor microenvironment has several limitations. Correlations with direct measurements are not perfect, and inferences being enrichment scores cannot be interpreted as proportions. Thus, xCell results should be interpreted with caution and require validation via other methods. Accordingly, our attempt to classify BA by integrating multi-omics data requires further validation via in vivo and in vitro studies.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
ZC, JG, and XL designed this study and took responsibility for the integrity of the data and the accuracy of the data analysis. XP and JC analyzed the data and wrote the article, SC, ZG, GL and YC contributed to revise the article. All authors contributed to data interpretation, and reviewed and approved the final version.