Exome Sequencing Identifies Susceptibility Loci for Sarcoidosis Prognosis

Many sarcoidosis-associating immunological genes have been shown to be shared between other immune-mediated diseases. In Finnish sarcoidosis patients, good prognosis subjects more commonly have HLA-DRB1*03:01 and/or HLA-DRB1*04:01-DPB1*04:01 haplotype, but no marker for persistent disease have been found. The objective was to further pinpoint genetic differences between prognosis subgroups in relation to the HLA markers. Whole-exome sequencing was conducted for 72 patients selected based on disease activity (resolved disease, n = 36; persistent disease, n = 36). Both groups were further divided by the HLA markers (one/both markers, n = 18; neither of the markers, n = 18). The Finnish exome data from the Genome Aggregation Database was used as a control population in the WES sample. Statistical analyses included single-variant analysis for common variants and gene level analysis for rare variants. We attempted to replicate associated variants in 181 Finnish sarcoidosis patients and 150 controls. An association was found in chromosome 1p36.21 (AADACL3 and C1orf158), which has recently been associated with sarcoidosis in another WES study. In our study, variations in these genes were associated with resolved disease (AADACL3, p = 0.0001 and p = 0.0003; C1orf158, p = 7.03E-05). Another interesting chromosomal region also peaked, Leucocyte Receptor Complex in 19q13.42, but the association diminished in the replication sample. In conclusion, this WES study supports the previously found association in the region 1p36.21. Furthermore, a novel to sarcoidosis region was found, but additional studies are warranted to verify this association.


INTRODUCTION
Sarcoidosis is granulomatous disease characterized by the presence of non-caseating granulomas in affected organs. Sarcoidosis can affect any organ in the body and the disease course can be selflimited or chronic (1). These different outcomes have led to the classification of sarcoidosis based on prognosis: resolved disease, i.e., duration of disease <2 years and persistent disease with a longer duration (≥2 years) (2).
Exact etiology of sarcoidosis is still unclear. The heterogeneity of sarcoidosis in clinical course and organ involvement has led to the hypothesis that sarcoidosis might not be one disease, but consists of several disease entities, each with distinct genetic associations (3). This may explain why no single sarcoidosisassociated variant has been found, but a wide range of genes with relatively small effects. With a current knowledge it seems, that the genetic susceptibility to sarcoidosis is mainly affected by the genetic variation in the genes and pathways related to granuloma formation and immune response. The MHC region in chromosome 6p21.3 contains multiple genes essential for the immune system. The association for sarcoidosis in the MHC have been found in all classes through I to III (4,5). In our previous studies in Finnish patients, an association between class II HLA-DRB1 * 03:01 and the haplotype consisting of HLA-DRB1 * 04:01-DPB1 * 04:01 and good prognosis compared to poor prognosis was found (44.9 vs. 22.7%; p = 0.001; OR = 2.78; 95% CI = 1.45-5.24) (6,7).
Besides the MHC, other susceptible sarcoidosis risk/protective chromosome regions and genes have been found throughout the genome. Beside classical candidate-gene approaches, genomewide association analyses (GWAS) have become method of choice nowadays. However, as sarcoidosis is a rare disease, its prevalence in Finland being 28 per 100,000 (8), collections of large case-control materials for GWAS are demanding. A possible new method for finding causality in genetics behind sarcoidosis is whole-exome sequencing (WES). The exome sequences encompass only about 2% of the human genome but harbors about 85% of all described disease-causing variants (9), making smaller sample sizes sufficient for identification of novel genes.
Aim of this study was to further characterize genetic differences between Finnish sarcoidosis prognosis utilizing whole-exome sequencing method in the subset of 72 patients and to replicate the findings in a bigger data set of 188 Finnish patients. In Finnish patients, subjects with good prognosis are more likely to have above mentioned class II HLA markers, but no genetic markers have been found for persistent disease. The objective was to further pinpoint genetic variety of sarcoidosis prognosis in relation to the HLA markers.

Study Subjects and Selection Criteria
Study subjects and their characteristics have been previously described (7). In summary, a total of 188 Finnish patients with verified pulmonary sarcoidosis had been followed-up for 5-15 years (Supplementary Table 1). After follow-up the patients were clinically divided into those with disease resolved within 2 years (n = 90) and to those with persisting activity after 2 years (n = 98). Disease activity was evaluated using the generally accepted WASOG (World Association of Sarcoidosis and Other Granulomatous diseases) criteria (10). The clinical examinations included a chest radiograph, a lung function test (spirometry, diffusion capacity), electrocardiography (ECG), liver enzymes, serum calcium, creatinine, serum lysozyme, and serum ACE.
For the WES study, a subset of 72 patients were chosen (Figure 1). Patients were selected based on disease activity (resolved disease, n = 36; persistent disease, n = 36). These subsets were further divided by the HLA markers previously known to influence disease prognosis in Finnish patients (HLA-DRB1 * 03:01 and HLA-DRB1 * 04:01-DPB1 * 04:01) yielding in 18 patients with one or both markers (HLA+) and 18 patients with neither of the markers (HLA-). Same selection was used in both resolved and persistent patient groups. The Genome Aggregation Database (gnomAD) (11) was used as control population in the WES study. GnomAD database consists of exome data from 11 150 Finnish subjects, with no known diseases.
For the replication of the SNPs found in the WES study, the rest of the original above mentioned 188 Finnish sarcoidosis patients were genotyped. A control population of 150 healthy subjects representing the Finnish population (12) was included for the replication phase.
The study protocol was approved by the Ethics Committee of the Department of Internal Medicine, Hospital District of Helsinki and Uusimaa, Helsinki, Finland. All subjects provided written informed consent for their participation in genetic association studies.

Whole Exome Sequencing
The DNA extraction have been previously explained (7). The WES was performed in Institute for Molecular Medicine Finland (FIMM). 150 ng of gDNA was fragmented with Covaris E220 evolution instrument (Covaris, Woburn, MA, USA). Sample libraries and MedExome Enrichment were processed according to SeqCapEZ HyperCap Workflow User's Guide (Roche Nimblegen, Madison, WI, USA). Enrichment was performed in 4 samples Multiplexed DNA Sample Library Pools using 1 µg of each library. The amplified library was purified with 1.8x Agencourt AMPure XP beads and eluted to 200 µl EB-buffer. Library was quantified for sequencing using 2100 Bioanalyzer High sensitivity kit. Sequencing was performed with Illumina HiSeq2500 system in HiSeq High Output mode using v4 kits (Illumina, San Diego, CA, USA). Read length for the paired-end run was 2 × 101 bp. 91% of all samples had the target coverage of 20x.

Variant Calling
Sequencing data were pre-processed using Trimmomatic (leading:3, trailing:3, sliding window:4:15, illuminaclip:2:30:10, minlen:36). Paired-end reads passing the filtering were mapped to the reference genome (GRCh38) using the BWA-MEM algorithm. Reads were sorted by coordinate and duplicates were marked with the Picard tools (http://broadinstitute.github. io/picard/). Base qualities were recalibrated and local indel realignment performed around indels at the BAM-level using Genome Analysis Toolkit (GATK) followed by merging of data by sample and remarking of duplicates with the Picard tools. Joint genotyping was performed on the whole cohort of 72 samples using GATK HaplotypeCaller and variants were recalibrated using GATK VariantRecalibrator. GATK tools were applied as recommended by GATK guidelines and using GATK resource files that had been converted from GRCh37 to GRCh38 using CrossMap and chain files from EnsEMBL. Quality control analysis of sequencing data was performed using the FastQC (http://www.bioinformatics.babraham.ac.uk/ projects/fastqc) with default settings. Additionally, frameshift indels were excluded due to recognized difficulties calling from WES data (13).
Annotation and filtering of variants was performed using the Annovar tool against the RefGene database. Filtering quality control was as follows: coverage >10, variant quality value >40, frequency <1%, and assumed to impair protein function. The mean coverage of exomic regions for each variant was 43X with individual variant reads varying between 17 and 153X. Reads were manually checked using Integrative Genomics Viewer (IGV). Variants with unknown frequency in Finnish population in the gnomAD database were excluded. Version information and references of tools used in variant calling are given in Supplementary Table 2.

Testing for Genetic Association
We analyzed associations for single variants and variant groups within genes. The associations were evaluated between all the dichotomous groups (persistent vs. resolved disease, HLA+ persistent vs. resolved disease, and HLA-persistent vs. resolved disease). For single-variant analysis, the common variants [minor allele frequency (MAF) > 0.01] were included. The hypergeometric distribution method in EPACTS software pipeline (https://genome.sph.umich.edu/wiki/EPACTS) was used. The gene-based tests were assessed to find whether variants in each gene are jointly associated with the phenotype. Genebased association improves the power to detect rare variants (MAF < 0.05) in limited sample size. Two methods were used to test the associations: hypergeometric distribution and SKAT-O (14) methods implemented in the EPACTS software. SKAT-O test was used, because it assumes that each variant can have either positive or negative effect to the phenotype. The p were corrected for multiple comparisons (number of studied variants) by using False Discovery Rate (FDR) method and Bonferroni correction. A value of p < 0.05 was considered statistically significant. Age, gender, and extrapulmonary manifestations were used as covariates in all statistical tests. To assess the functionality of the found variations in protein level, we used Sift (15) and PolyPhen (16) databases, which both predict possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. To further investigate the possibly functional effects of the significant SNPs, we used the Genotype-Tissue Expression (GTEx) Portal (17) to study the expression quantitative trait loci (eQTL).
For the replication, the SNPs that reached the significant association level in single-variant and gene-based analysis were included (Supplementary Table 3). The SNP genotyping was performed with the Agena Bioscience (Sequenom) MassARRAY System (Agena Biosciences, San Diego, California) at the Institute for Molecular Medicine Finland (FIMM), Helsinki, Finland, with standard protocols. Genotypes were called using Sequenom's MassARRAY Typer software. The allele frequencies between different groups were compared by a case-control association analysis (Chi-square χ 2 test, PLINK software) (18). We applied the following quality control filters: minimum call rate per sample of 90%, SNP minor allele frequency (MAF) > 0.01 and Hardy Weinberg equilibrium (HWE) > 0.001. Total success rate for accepted SNP arrays was 95% in the replication samples.  showing the highest associating peak in the chromosome 19, although not reaching the exome-significance level. Supplementary Figure 1 shows the quantile-quantile (Q-Q) plot for the single-variant tests, which demonstrates the similarity between observed and expected significance values. Table 1 shows the single variants that yielded the strongest associations in analyses between resolved and persistent, and between HLA+ and HLA-prognosis groups. In analysis between resolved and persistent patients altogether seven variants associated with the prognosis. Interestingly, five of these variants were located in two chromosomal regions: three in chromosome 1 (1p36.21) and two in chromosome 19 (19q13.42). In HLA+ group, the associations in these two chromosomal locations remained. In HLA-patients the chromosomal locations 1p36.21 and 19q13.42 did not associate with the prognosis. None of these associations remained significant after correcting for multiple testing with FDR or Bonferroni correction.

Gene-Based Analysis
In gene-based tests, all variants found in the same gene are grouped to assess the association with the phenotype. Also, in gene-based analysis the locations 1p36.21 and 19q13.42 showed associations while not remained significant after correcting with FDR or Bonferroni correction ( Table 2). In location 19q13.42, variants in gene LAIR1 were also significantly associated with the prognosis, in addition to previously seen LILRB4 and KIR3DL1/KIRSDS1. Manhattan plot from the gene-based analysis between HLA+ persistent and resolved patients is seen in Supplementary Figure 2. No association in locations 1p36.21 and 19q13.42 was seen in HLA-patients. Another chromosomal location was also seen: in 2p22.3, two genes were shown to associate with disease prognosis; RASGRP3 with persistent disease and LTBP1 in resolved disease. However, these genes did not show significant association in any of the other subgroup analysis.

Replication Analysis
For the replication analysis, the rest of the Finnish sample set of 188 samples and 150 Finnish control material were included. After quality control, total of 150 controls and 181 cases (72 from the WES study and 109 from the replication study) were included in the analysis. In the replication analysis between persistent and resolved sarcoidosis patients, the SNPs in the chromosomal location 1p36.21 were statistically significant as in the WES study ( Table 3). The association in these SNPs was also found in the analysis between HLA + persistent and resolved patients, and in contrast to the WES study, these SNPs were also significantly associated in the HLA-disease subgroup. When combining the data from the WES and replication studies, the associations became even stronger in all subgroup analysis. In chromosomal location 19q13.42, KIR3DL1/KIR3DS1 SNPs did not replicate and LILRB4 SNPs associated significantly in the HLA + resolved patients in the combined data set ( Table 3). None of the associations remained statistically significant after correction with SNP count (data not shown). However, the MAFs in the control population corresponded to the ones in the gnomAD Finnish control population except for the KIR3DL1/KIR3DS1 SNP rs643861, which lacked allele frequency data in the gnomAD database (Table 3). However, in the NCBI dbSNP database (https://www.ncbi.nlm.nih.gov/snp/) MAF from TopMed database is A = 0.185, which resembles MAF found in our study. TOPMed uses hg38 as a reference genome, in gnomAD the hg19 is used as a reference genome which might explain the differences.
The WES study associations were found in single-variation and gene levels in the chromosomal locations 1p36. 21 and 19q13.42. In the replication sample, the associations in the location 1p36.21 were found only in the SNPs associating in the single-variation analysis (Supplementary Table 3). In chromosomal location 19q13.42, none of the other SNPs in the KIR3DL1/KIR3DS1 associated with the disease (data   Table 3). In chromosomal location 2p22.3, two genes were shown to associate with the disease prognosis in the WES study; RASGRP3 with persistent disease and LTBP1 in resolved disease. In the replication study, only one SNP in the RASGRP3 and LTBP1 passed the quality control (Supplementary Table 3) and none of these SNPs associated with the trait (data not shown).

DISCUSSION
The heterogeneity of sarcoidosis in clinical course and development of organ involvement suggests that manifestation of the disease phenotype is a result of multiple genetic variations.
There is a special interest in finding genetic differences between sarcoidosis patients with different prognosis, because at the moment prediction of the disease course at the time of diagnosis is difficult. There are also significant ethnic differences in  Frontiers in Immunology | www.frontiersin.org disease development and prognosis in different populations (19). These differences represent a compelling need for ethnic-selective biomarkers to assess disease progression in diverse populations.
Here we present results from WES study in Finnish sarcoidosis patients. The strategy was to sequence groups of sarcoidosis patients with known differences in prognosis to identify variants that could alter functional properties of proteins. In Finnish sarcoidosis patients, certain HLA markers (HLA-DRB1 * 03:01, HLA-DRB1 * 04:01-DQB1 * 04:01) have been shown to be more common in patients with sarcoidosis disease resolving within 2 years, i.e., good prognosis disease. The aim of this study was to further pinpoint genetic variety in different sarcoidosis prognosis groups in relation to these HLA markers.
The most strongly associated and replicated chromosomal location was found in 1p36.21, containing two genes (AADACL3 and C1orf158). This region has also been found to associate with familial sarcoidosis in a recent WES study in German patients (20). However, in French WES study of familial sarcoidosis, no association with this region was found (21). In our study, variations in these genes were associated with disease resolution independent of the HLA markers DRB1 * 03:01 and DRB1 * 04:01. The found variations in AADACL3 are predicted to be benign, but the variation in C1orf158 is thought to alter the function of the protein. However, the functionality of these variants in sarcoidosis predisposition is unclear at the moment. The variants rs3010877 in AADACL3 and rs1132185 in C1orf158 are known to act as eQTLs in not sun exposed skin, but no clear in sarcoidosis pathogenesis is evident. Due to LD, it is also possible that association derives from elsewhere in the chromosome. However, no functionally plausible genes reside in near proximity of these genes, while the further region in 1p36 has been associated with resolution of chest findings in sarcoidosis patients (22). Therefore, further studies are needed to evaluate the possible causality of these genes and sarcoidosis susceptibility.
In the replication part of the study, only association with LILRB4 remained statistically significant when WES and replication data were combined, the other associations were weakened or could not been replicated. It should be noted, that the size of the replication material is still relatively small, and possible true causality cannot be ruled out.
The 19q13.42 region, called the Leukocyte Receptor Complex (LRC), comprises a large set of genes encoding immunoglobulin superfamily receptors (23). The leukocyte immunoglobulinlike receptors (LILR) can interact with the HLA class I (24) and are expressed by a range of immunologically active cells, including natural killer (NK) and antigen-presenting cells (APCs) (25). LILRs have both activating and inhibitory effects (26). Previously no associations have been found between LILRB4 and sarcoidosis, but variations in this gene have shown to increase risk of another immune-mediated disease, systemic lupus erythematosus (SLE) (27). In our study, variations in this gene associated with good prognosis patients having the good prognosis HLA Class II markers. LILRB4 belongs to inhibitory receptors expressed on dendritic cells and has been shown to interact with the HLA-G. (28). LILRB4-HLA-G interaction has been demonstrated to limit the activation of dendritic cells (29), which are important APCs in sarcoidosis inflammation. In sarcoidosis, the association with HLA-II is thought to be more important, but it should be noted that limited information exists in relation to HLA-G in sarcoidosis. In one study, the HLA-G variation was seen in sarcoidosis patients (30). Also, there is a strong LD in the MHC region, and it is possible that patients with the HLA class II markers have haplotypes containing the HLA class I as well. The combination of these MHC variations, with the variants in the LILRB4, could alter the immune reaction in sarcoidosis and lead to selfresolving disease.
In addition to LILRB4, in the WES study, association was found with variants in LAIR1 and KIR3DL1/KIR3DS1. Leukocyte-associated immunoglobulin-like receptor-1 (LAIR1) has been shown to inhibit T and NK cell activation (31), but there are no known HLA ligands for this receptor. In our study, association with variants in the LAIR1 was seen when persistent and resolved patients were compared. No association was evident with the HLA markers, which is in concordance with the notion that there are no known HLA ligands for this protein. No association have been found between this gene and sarcoidosis, but also this immune-modulator gene has been shown to have effect in SLE (32). However, this association was seen only in gene-based analysis and with single method. Due to technical issues, the variation could not be replicated, so the effect should be considered cautiously. The other association seen only in the WES study encode KIRs, receptors expressed mainly on the surface of natural killer (NK) cells. Most diverse of the KIR genes is locus KIR3DL1/KIR3DS1 which associated with persistent sarcoidosis in our study. The gene encodes both inhibitory (3DL1) and activating (3DS1) receptors (33), so either or both of these functions could be affected, shifting the course of the disease toward chronic. In a recent WES study of three families sharing pediatric sarcoidosis, the KIR3DL1/KIR3DS1 in gene level was found to be shared in these families (34). To our knowledge, no case-control study has found association in KIR genes and sarcoidosis.
There are some limitations in this study. Sarcoidosis being a relatively rare disease, the collection of large data sets is demanding in a small country as Finland. Therefore, due to relatively small sample size, the power might have not been sufficient to detect all associations, however, the genebased analysis in the WES study and the replication sample increases the power. Because of the patient selection based on the HLA markers, the analysis between all persistent and resolved patients has some statistical limitations. However, in the replication study, the whole Finnish sarcoidosis sample is involved. The other confounding factors: gender, age and extrapulmonary manifestations, were conditioned in all analyses.
In summary, we used WES to further characterize genetic differences between persistent and resolved sarcoidosis patients and to test the hypothesis whether different variations are associated in relation to the class II markers (HLA-DRB1 * 03:01/HLA-DRB1 * 04:01-DPB1 * 04:01). An association with resolved disease raised in the chromosomal region 1p36.21, a region which has recently been associated with sarcoidosis in another WES study. Another interesting chromosomal region peaked, 19q13.42, but in the replication study the association weakened. However, due to relatively small sample size, the true causality of these variants should be evaluated in a larger data set.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation, to any qualified researcher.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committee of the Department of Internal Medicine, in the Hospital District of Helsinki and Uusimaa, Helsinki, Finland (Approval Dnro 362/E5/05). The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
EL, MK, JS, OS, and M-LL: study design and data analysis and manuscript writing. The manuscript has been read and approved by all named authors.

FUNDING
This study was supported by the Nummela Foundation for Medical Research.