Exome-Wide Rare Variant Analysis From the DiscovEHR Study Identifies Novel Candidate Predisposition Genes for Endometrial Cancer

Endometrial cancer is the fourth most commonly diagnosed cancer in women. Family history is a known risk factor for endometrial cancer: having an affected first-degree relative raises the relative risk to between 1.3 and 2.8. It is unclear to what extent novel germline variants contribute to endometrial cancer. We address this question by using whole exome sequencing to identify novel rare variant associations between exonic regions and endometrial cancer. The MyCode community health initiative is an excellent resource for this study, with germline whole exome data available for 60,000 patients in the first phase and a further 30,000 patients independently sequenced in the second phase as part of the DiscovEHR study. We conducted an exome-wide rare variant association analysis using 472 cases and 4,110 controls from the 60,000 patients (discovery cohort) and 261 cases and 1,531 controls from the 30,000 patients (replication cohort). After binning rare germline variants into genes, case-control association tests were performed using the optimal unified approach for rare-variant association, SKAT-O. Seven genes, RBM12, NDUFB6, ATP6V1A, RECK, SLC35E1, RFX3 (Bonferroni-corrected P < 0.05), and ATP8A1 (suggestive P < 10⁻⁵), and one long non-coding RNA, DLGAP4-AS1 (Bonferroni-corrected P < 0.05), were associated with endometrial cancer. Notably, RECK and ATP8A1 were replicated in the replication cohort (suggestive threshold P < 0.05). Additionally, a pathway-based rare variant analysis using pathogenic and likely pathogenic variants identified two significant pathways, pyrimidine metabolism and protein processing in the endoplasmic reticulum (Bonferroni-corrected P < 0.05).
In conclusion, our results, based on single-source electronic health records (EHR) linked to genomic data, highlight candidate genes and pathways associated with endometrial cancer and indicate that rare variants are involved in endometrial cancer predisposition, which could help in personalized prognosis and further our understanding of its genetic etiology.

INTRODUCTION
The MyCode Community Health initiative is a precision medicine initiative by Geisinger Health System (1). As part of the initiative, blood and other samples are collected from patients who have consented to participate. The samples are stored in a system-wide biobank and are sequenced at the Regeneron Genetics Center as part of the DiscovEHR project. The high-throughput sequencing data, coupled with longitudinal electronic health records (EHR), have been used for genetic research. The genetic data from DiscovEHR have been successfully used to detect various disease-causing variants, which confer increased risk of developing one or more of 21 conditions, including hereditary breast and ovarian cancer, familial hypercholesterolemia, cardiomyopathy, Marfan syndrome, and Lynch syndrome (2). The clinically actionable pathogenic variants identified are delivered to the patients through the return of results program at Geisinger Health System, and their EHR is updated with the information, which can be readily accessed by their provider (1). As of February 2019, results had been delivered through the return of results program to 1,048 MyCode participants who have one or more clinically actionable conditions (2). One of the primary conditions diagnosed as part of the MyCode initiative is Lynch syndrome: 99 of the 1,048 patients were diagnosed with Lynch syndrome. Lynch syndrome is caused by germline mutations in one of several DNA mismatch repair genes. Among Lynch syndrome patients, endometrial cancer is the second most diagnosed condition (3). Women with Lynch syndrome have about a 50% chance of developing endometrial cancer (3, 4). The risk of endometrial cancer increases ∼2-fold for women with a first-degree female relative with endometrial cancer (5).
However, the germline abnormalities associated with Lynch syndrome only explain a small fraction of heritability of endometrial cancer, which suggests there are other rare germline variants that can help explain the heritability of the disease.
Germline variants have been studied on a genome-wide scale to find associations with various diseases. Many successful genome-wide association studies (GWAS) have identified common variants associated with multiple complex diseases, including cancer. The NHGRI-EBI catalog has almost 60,000 unique SNP-trait associations (6). Many GWAS have also been conducted on endometrial cancer. One recent meta-analysis of 7,737 endometrial cancer cases and 37,144 controls identified seven loci associated with endometrial cancer (7). However, to date, GWAS using common variants explain <6% of the familial relative risk for endometrial cancer (7-10). Since the common variants found to be associated with endometrial cancer have only modest effect sizes, part of the missing heritability could be explained by rare variants.
Rare variants are known to play an essential role in human diseases. Due to the evolutionary purifying selection, deleterious alleles are likely to be rare (11). Moreover, many rare variant analysis studies have found variants to be associated with various cancer types, such as prostate cancer (12), melanoma (13), colorectal cancer (14), urinary tract cancer (15), etc. Notably, rare pathogenic germline variants in various cancers have been linked to functional consequences (16). To the best of our knowledge, rare variants have not been widely studied in endometrial cancer. An exome-wide rare variant association study was conducted in 2014, but they could not find any variants significantly associated with endometrial cancer (17). Therefore, there is a need to further investigate rare variants that are responsible for the endometrial cancer susceptibility. Identifying cancer predisposition genes based on pathogenic rare variants could give novel insight into the genetic basis of endometrial cancer and could be valuable for preventive markers and precision medicine.
In summary, we set out to discover rare variants associated with endometrial cancer using whole exome data from 472 endometrial cancer patients and 4,110 non-cancer patients drawn from a single hospital system. Their phenotypes and demographics were derived from their EHRs stored in the system-wide Epic database. The rare variants were binned into genes and pathways, and statistical tests were then run to test the association of these biologically informed units with endometrial cancer. Moreover, the significant results from the discovery phase were validated using an independent DiscovEHR replication cohort with 261 endometrial cancer patients and 1,531 non-cancer patients, also drawn from the same hospital system.

Study Design and Quality Control
Whole exome sequencing was performed on 60,000 samples as part of the DiscovEHR study. From these 60,000 samples, all patients with endometrial cancer (N = 481) and matched controls (N = 4,403) were selected. Further, an independent cohort of 30,000 samples was also sequenced as part of the DiscovEHR study, and endometrial cancer cases (N = 263) and controls (N = 1,586) were pulled as a replication dataset. The controls consisted of women with no diagnosis of cancer and were retrieved by matching on age and body mass index (BMI) with all cancer patients in the cancer registry. Age and BMI were calculated as described in the Methods section. For the case population, all patients diagnosed with any type of cancer were retrieved from the cancer registry, and then the subset of patients with International Classification of Diseases for Oncology (ICD-O) site codes relevant to endometrial cancer, specifically C54.0 (isthmus uteri), C54.1 (endometrium), C54.3 (fundus uteri), and C54.9 (corpus uteri), was selected.
After retrieving the patient IDs for the case and control populations, their whole exome sequencing data were retrieved from DiscovEHR. Quality filters were then applied to the data before performing the rare variant analysis. Any loci with <90% call rate in the exome data were removed, and all patients were confirmed to have a genotype call rate >90%. All relatives up to the third degree (identity by descent, IBD ≥ 0.125) were removed. Additionally, principal components were calculated using EIGENSOFT (18) to adjust for population stratification. There were 299 related patients with IBD ≥ 0.125 (9 cases and 290 controls) in the discovery dataset and 55 related patients (two cases and 53 controls) in the replication dataset that were removed. Further, 10 controls from the discovery and replication datasets were removed because of missing BMI in their EHR. After the marker quality control (QC) step, the whole exome sequence data had 2,431,845 rare variants with minor allele frequency (MAF) < 0.05, and the rare variants were binned into gene bins determined by Entrez Gene annotations. The schematic overview of the study is shown in Figure 1.
FIGURE 1 | Schematic overview of the association study. The blood samples were collected and sequenced as part of the MyCode and DiscovEHR projects. The phenotype information was pulled from the cancer registry and EHR.
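The call-rate filters described above can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the genotype matrix layout, the function names, and the order of the two filters (variants first, then samples) are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: 0/1/2 = alternate allele count, None = missing call.
Genotype = Optional[int]


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of non-missing genotype calls."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def filter_by_call_rate(
    matrix: List[List[Genotype]], min_rate: float = 0.90
) -> Tuple[List[int], List[int]]:
    """matrix[s][v] is the genotype of sample s at variant v.

    Drops variants with call rate below min_rate, then drops samples whose
    call rate over the remaining variants is below min_rate.
    Returns (kept_sample_indices, kept_variant_indices).
    """
    n_variants = len(matrix[0]) if matrix else 0
    kept_variants = [
        v for v in range(n_variants)
        if call_rate([row[v] for row in matrix]) >= min_rate
    ]
    kept_samples = [
        s for s, row in enumerate(matrix)
        if call_rate([row[v] for v in kept_variants]) >= min_rate
    ]
    return kept_samples, kept_variants
```

In practice this step is usually delegated to a dedicated genetics toolkit; the sketch only makes the thresholding logic explicit.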
The population characteristics for the case and control populations after the QC steps are summarized in Table 1. The average age of endometrial cancer patients was 60 years with a standard deviation of 11.69 years, and the control population had an average age of 58 years with a standard deviation of 14.45 years. In the case population, 64 patients were deceased and 408 were still alive as of May 2017, when the data were retrieved from the cancer registry. The control population had 222 deceased and 3,888 living patients. The table also provides the American Joint Committee on Cancer (AJCC) stage information and the number of patients with a family history of any cancer. The two-tailed t-test statistics for age and BMI between the case and control groups in the discovery and replication datasets are listed in Supplementary Table 1.

Gene-Based Rare Variant Analysis
BioBin (19-21), a rare variant analysis tool, was used to bin rare variants into genes, and SKAT-O (22) was used to identify associations between genes and the phenotype of interest. Any variant with MAF below 5% was considered rare. In the discovery analysis, 2,431,845 rare variant loci were binned into 22,126 genes. Genes with <20 minor allele counts (MAC) were then filtered out, leaving 20,385 genes. Both burden and dispersion tests can be applied to test for association. However, SKAT-O is a unified approach that optimally combines the burden test and the non-burden sequence kernel association test (SKAT) and increases statistical power to detect associations (22). After running SKAT-O on the 20,385 genes, the association p-values were adjusted for multiple testing using Bonferroni correction, and any adjusted p-value below 0.05 was considered significant. We identified six genes and one long non-coding RNA that reached global significance: RBM12, NDUFB6, DLGAP4-AS1, ATP6V1A, RECK, SLC35E1, and RFX3, as shown in the Manhattan plot in Figure 2.
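As a worked illustration of the multiple-testing step, the sketch below applies a Bonferroni correction in plain Python. The function names are hypothetical; only the number of tests (20,385 genes) and the 0.05 cutoff come from the text.

```python
from typing import List


def bonferroni_adjust(p_values: List[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Raw p-value a single test must beat to stay below alpha after correction."""
    return alpha / n_tests


if __name__ == "__main__":
    # With 20,385 gene bins, an adjusted p-value below 0.05 corresponds
    # to a raw SKAT-O p-value below roughly 2.45e-6.
    print(bonferroni_threshold(0.05, 20385))
```

Comparing adjusted p-values to 0.05 and comparing raw p-values to 0.05 divided by the number of tests are equivalent formulations of the same rule.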
The RECK gene was found to be associated with endometrial cancer and was replicated (p-value < 0.05). ATP8A1 showed a suggestive association (p-value < 10⁻⁵) in the discovery analysis and was also replicated (p-value < 0.05). Table 2 shows the number of loci, allele counts, SKAT-O p-value, and Bonferroni-corrected p-value for the six protein-coding genes, the long non-coding RNA, and the additional suggestive gene, ATP8A1, from the discovery and replication analyses. Table 2 also summarizes the distribution of MAC between the case and control populations.
To further characterize the contribution of each locus binned in the significant genes, the association tests were rerun after removing one locus at a time from the gene bin, and p-values (P_rm) were generated. A significant increase in P_rm (less significant) would indicate a larger contribution of that locus to the gene's association, and no change in P_rm would indicate that the locus does not contribute. After running the tests, P_rm was lower for 245 loci out of the 1,136 total loci binned. All the loci with positive P_rm in RECK that were annotated as moderate or high impact by Variant Effect Predictor (VEP) are listed in Table 3, with the corresponding plot in Figure 3. A complete list of loci and their P_rm for all other significant genes can be found in Supplementary Tables 2-9.
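The leave-one-locus-out procedure can be sketched as below. This is an illustrative outline only: `gene_test` stands in for whatever gene-level test is used (SKAT-O in the study) and is a hypothetical callable supplied by the caller, as are the function names and the locus identifiers.

```python
from typing import Callable, Dict, Sequence


def leave_one_locus_out(
    loci: Sequence[str],
    gene_test: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """Rerun the gene-level test once per locus, each time omitting that locus.

    Returns {locus: p-value of the gene bin without that locus}.
    """
    results = {}
    for i, locus in enumerate(loci):
        remaining = list(loci[:i]) + list(loci[i + 1:])
        results[locus] = gene_test(remaining)
    return results


def p_value_shift(p_rm: Dict[str, float], p_full: float) -> Dict[str, float]:
    """Difference between each leave-one-out p-value and the full-bin p-value.

    A positive value means the association became weaker when the locus
    was removed, i.e. the locus was contributing to the signal.
    """
    return {locus: p - p_full for locus, p in p_rm.items()}
```

A caller would pass the list of loci in a gene bin together with a function that recomputes the gene-level p-value from a subset of loci, then rank loci by the resulting shift.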

Variant Annotation
Variants in significant genes were annotated using ClinVar (2018-07) and Variant Effect Predictor (VEP v92) to determine their clinical significance, their effect on the protein, and their implications in human inherited diseases. ClinVar is a public archive that connects human variation to phenotypes, clinical significance, relationship to human health, and other supporting data obtained through submissions by various groups and aggregated to reflect both consensus and conflicting assertions (23). VEP provides information about the variants' location, the gene/transcript affected by the variants, the type of mutation (e.g., stop gained, missense, stop lost, and frameshift), and protein change scores, which indicate possible partial or complete loss of function of the protein due to amino acid substitution (24). None of the identified variants were found in the ClinVar database. However, all variants were successfully annotated using VEP. The distribution of variant types across significant genes as annotated by VEP is shown in Figure 4. The majority of the variants across all significant genes are intron variants (62.0%), followed by missense (12.8%), synonymous (10.6%), and other variants (14.6%).
Identified variants were also annotated with the COSMIC database, which contains a manually curated list of somatically [...] Table 4 with tissue type where the somatic mutations were observed. Two of the somatic mutations, COSM1177811 in RECK and COSM1036559 in ATP6V1A, were observed in endometrium.

Pathway-Based Rare Variant Analysis
Rare variants were also binned into pathways using KEGG pathway annotations. As pathway bins would include a large number of loci relative to the limited sample size, only loci categorized as pathogenic or likely pathogenic with at least one star in ClinVar and loci categorized as high impact by VEP were binned. The association analysis based on pathway bins was also performed using SKAT-O. Out of 317 pathways tested, six pathways were significant with p-value < 0.1 after FDR correction (Table 5), and two pathways, pyrimidine metabolism and protein processing in the endoplasmic reticulum, were significant with p-value < 0.05 after Bonferroni correction. However, none of the pathways observed to be associated with endometrial cancer were replicated.
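The text reports an FDR correction without naming the procedure. As a hedged illustration, the sketch below implements the Benjamini-Hochberg step-up adjustment, which is a common choice; that the study used this particular procedure is an assumption, and the function name is hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Unlike Bonferroni, which controls the chance of any false positive, this adjustment controls the expected proportion of false positives among the reported pathways, which is why a looser cutoff such as 0.1 is often paired with it.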

Survival Analysis
The significant genes were further analyzed to determine their association with the survival outcome of endometrial cancer patients. The survival analysis was performed using Cox regression, adjusting for age and BMI. NDUFB6 was observed to be significantly associated with survival, with a Cox regression p-value < 0.05 (Table 6). Out of 472 endometrial cancer patients, 74 had rare variants in NDUFB6, and they had a lower survival rate in comparison to endometrial cancer patients with no rare variants in NDUFB6 (Figure 5).
FIGURE 3 | Plot of all variants with lower P_rm in RECK which were classified as moderate or high impact by VEP. The y-axis represents negative log scaled P_rm − P_val, where P_val is the original SKAT-O p-value listed in Table 2, and the x-axis is the relative genomic coordinate in the gene.

DISCUSSION
The results from this study illustrate that a population from a single hospital system can be used to identify rare germline variants associated with endometrial cancer diagnosis. Genome-wide rare variant analysis using the DiscovEHR cohort identified seven genes and one long non-coding RNA to be associated with endometrial cancer, of which two genes, RECK and ATP8A1, were replicated (suggestive threshold P < 0.05). Additionally, the significance of the variants was evaluated by the backward elimination approach. Variants were also annotated using ClinVar and VEP to examine the variant consequence and any known associations. Further, survival analysis was run to assess how the presence of rare variants in a gene influenced the survival of the patient. In summary, many rare variants were discovered that positively contributed to the association of their gene, and some of them were also found to be known somatic mutations in cancer, including endometrial cancer. One of the genes found to be associated with endometrial cancer, NDUFB6, was also found to be significantly associated with survival. Several genes identified in this study have been previously implicated in endometrial cancer or other cancers. The second most significant gene found to be associated with endometrial cancer in this study, NDUFB6, is a nuclear-encoded subunit of NADH-ubiquinone oxidoreductase, also known as Complex I (CI), which is the largest complex of the electron transport chain in mitochondria. CI is known to play a role in tumorigenesis, resistance to cell death, and metastasis (25). Moreover, various oncocytic cancer cells have an excessive number of mitochondria. Mutations in NDUFB6 have been observed in oncocytic thyroid tumor (26), and downregulation of NDUFB6 due to the loss in 9p24.1-p13.3 is known to be responsible for metastasis in renal cell carcinoma (27).
In addition, in this study NDUFB6 carried 13 missense variants, 7 of which were predicted to be damaging by PolyPhen score and could disrupt the function of NDUFB6. Moreover, the other 99 variants in intron or UTR regions could modify gene expression. They could also result in alternative splicing of NDUFB6, as there are three distinct isoforms encoded by the transcript variants (RefSeq, Jan 2011). Further, one endometrial cancer study investigating mitochondrial DNA mutations conducted an immunohistochemistry analysis of various types of type 1 endometrial cancer samples by staining for nuclear-coded NDUFB6 and mitochondria-coded MTND6. On a 0-4 intensity score for staining, samples with oncocytic-like foci showed about 23% (3/13) complete loss of staining, 23% (3/13) partial loss of staining with an intensity score of 2, and 23% (3/13) partial loss of staining with an intensity score of 3 for NDUFB6 (28). However, among endometrioid samples with no specific differentiation aspects, 92% (12/13) showed complete staining (intensity 4). Thus, the evidence suggests a role of NDUFB6 in oncocytic endometrial carcinoma, and further studies would be required to elucidate the precise mechanisms. The survival analysis performed in this study also showed that endometrial cancer patients with rare variants in NDUFB6 have a significantly lower survival rate than endometrial cancer patients with no rare variants, reemphasizing its possible role in endometrial cancer.
Another gene found to be associated with endometrial cancer in this study, SLC35E1, is known to be upregulated in late-stage endometrial endometrioid carcinoma (29). It is also known to have differential membrane proteome expression between normal and inflammatory breast cancer cells; inflammatory breast cancer is a rare and very aggressive form of breast cancer (29,30). SLC35E1 is not well studied, and its mechanism of action in endometrial cancer is not well understood. The disruption of tumor suppressor genes and activation of oncogenes are common in cancers. The gene RECK, which was replicated in this study, is known to be a tumor suppressor (31). RECK negatively regulates some matrix metalloproteinases, which are known to facilitate tumor invasion and metastasis (31). The epigenetic downregulation of RECK is known to stimulate invasion and migration in colon cancer (32), breast cancer (33), prostate cancer (34), lung cancer (35), and gastric cancer (36). Moreover, RECK is also part of the KEGG pathway "MicroRNAs in cancer." RECK has already been suggested as a promising prognostic marker and therapeutic agent in the cancers mentioned above (37), and this could potentially apply to endometrial cancer. Altogether, 189 variants were discovered in RECK, of which the stop gained (N = 1), frameshift (N = 1), and missense variants (N = 44) could lead to an abnormal protein product and disrupt the tumor suppressor function of RECK. In particular, RECK is known to produce two proteins that have opposing effects; the shorter isoform of RECK leads to faster cell migration (38). Other variants in intron, splicing, and UTR regions could also promote cancer by downregulating the canonical RECK isoform or by alternative splicing of RECK, producing the shorter isoform. Other candidate genes found to be associated with endometrial cancer, ATP6V1A and ATP8A1, act as oncogenes. ATP6V1A is known to drive proliferation and invasion in gastric cancer (39,40) and ATP8A1 in non-small cell lung cancer (41).
The most significantly associated gene in this study, RBM12 is associated with colorectal cancer (42) and tumorigenesis of Meibomian cell carcinoma (42,43).
One of the pathways discovered by binning rare variants into KEGG pathways, Protein processing in the endoplasmic reticulum is associated with endometrial cancer (44,45). Disruption in protein processing in endoplasmic reticulum could cause endoplasmic stress and activation of the unfolded protein response (UPR) and GRP78 which facilitates growth and invasion of endometrial cancer (44,45). Another pathway, Pyrimidine metabolism was also significantly associated with endometrial cancer. Pyrimidines include cytosine, thymine, and uracil which are the basic building blocks of DNA and RNA. Disorders of purine and pyrimidine metabolism are known to increase cancer risk and even act as tumor suppressor depending upon the type and site of alterations (46).
Although we found significant associations, further studies are required to elucidate the functional and molecular mechanisms of the variants and genes in endometrial cancer. A potential limitation of our study is that the replication cohort was not sufficiently powered to replicate results with genome-wide significance. Even though we found genome-wide significant genes associated with endometrial cancer in the discovery dataset, the results need to be further confirmed by other independent studies, which is common practice for GWAS. Moreover, bigger sample sizes would increase statistical power and help us detect more rare variants and genes with modest effects. The DiscovEHR study is ongoing, and participants are still being enrolled and sequenced. Future studies could be conducted to replicate the results when more sequence data are available. Another limitation is that our data consist predominantly of patients with European ancestry due to the inherent ethnic distribution of Geisinger patients. Thus, we can only discover variants associated with endometrial cancer in individuals of European ancestry. That being said, association analyses in a homogeneous population can be more powerful because the pool of cases and controls is not divided across populations, and this can result in more robust associations. This limitation may be addressed in the future, as DiscovEHR has started recruiting MyCode participants from geographical areas with an ethnically diverse population.
In conclusion, we have identified seven genes and one long non-coding RNA that are associated with endometrial cancer. At least two of the genes found have some known role in endometrial cancer. Additionally, many genes are associated with other cancers. We suggest that the genes and variants we identified in this study could help explain a fraction of the endometrial cancer heritability, facilitate personalized prognosis, and also aid in increasing our understanding of endometrial cancer etiology.

Study Population
Geisinger Health System is a health care provider in south-central and northeastern Pennsylvania and southern New Jersey. All patients who use Geisinger health services are eligible to participate in the MyCode community initiative, and the study population consisted of these participants. As part of MyCode, all patients who enroll in the program are sequenced regardless of the medical conditions they have. The blood samples were collected at Geisinger and sequenced by the Regeneron Genetics Center as part of the DiscovEHR study. To date, whole exome sequencing has been performed on approximately 90,000 samples, which are linked to the participants' EHR under a protocol approved by the Geisinger Institutional Review Board. Additionally, Geisinger maintains a separate cancer registry, which contains information on all patients who have been diagnosed with cancer. The Geisinger cancer registry also contributes patient data to the National Cancer Database. Of the 90,000 sequenced patients, 8,791 had been diagnosed with any type of cancer according to the cancer registry, and 481 endometrial cancer cases were identified among them using ICD-O site codes C54.3, C54.9, C54.1, and C54.0. The control population consisted of a subset of age- and BMI-matched patients out of the 90,000 sequenced samples with no history of cancer diagnosis, based on the absence of any ICD-9/ICD-10 code related to cancer in a problem-list entry of the diagnosis code, an inpatient hospitalization-discharge diagnosis code, or an encounter diagnosis code. The age for cancer patients was taken as the age at initial diagnosis of cancer; for controls, it was the current age or the age at death, depending on whether they were alive or deceased, respectively. BMI for cases was calculated as the median BMI for a year from initial diagnosis. BMI for controls was calculated as the median BMI for a year from the current date for controls still alive and the median BMI for a year from the date of death for deceased controls.

Sample Preparation, Sequencing, and Quality Control
The sample preparation is described in detail in Dewey et al. (47). The DNA samples were transferred using 2D matrix tubes (Thermo Scientific), logged in LIMS (Sapio Sciences), and stored in an automated biobank at −80 °C (LiCONiC Tubestore). The sample quality was tested by running 100 ng of sample on a 2% pre-cast agarose gel (Life Technologies). Additionally, the quantity of sample was determined by fluorescence. The exome capture was prepared through a fully automated approach developed at Regeneron using a custom reagent kit from Kapa Biosystems. The captured DNA was PCR amplified and quantified by qRT-PCR (Kapa Biosystems).
Exome sequencing was performed at the Regeneron Genetics Center. The 60,000 samples from phase 1 were sequenced using NimbleGen probe target-capture (SeqCap VCRome), and the 30,000 samples from phase 2 were sequenced using a slightly modified version of xGen capture (Integrated DNA Technologies), which had supplemental probes added to capture regions of the genome well covered by the VCRome capture reagent but poorly covered by xGen, followed by sequencing on the Illumina HiSeq 2500 platform using the same protocol previously described in detail (48,49). In summary, the sequencing coverage depth was sufficient to provide >20x haploid depth of over 85% of targeted bases in 96% of samples, with ∼80x mean haploid read depth of targeted bases. The reads generated for all samples (FASTQ files) were aligned to the genome reference (GRCh38) using BWA-mem (50). Duplicate reads were identified and flagged using the Picard MarkDuplicates tool for exclusion in later analysis (51). The variants were called using the Genome Analysis Toolkit (GATK) (52,53). The INDEL-realigned and duplicate-marked reads were processed using GATK HaplotypeCaller to identify variations from the genome reference, generating genomic VCF files (gVCF). Further, both single-nucleotide variants (SNVs) and indels were identified using GATK's GenotypeGVCFs after genotyping each sample and a training set consisting of 50 randomly selected samples, resulting in single-sample VCF files. The joint calling was done in batches of 200 single-sample gVCFs to create pVCF files, and all pVCF files generated from joint calling were merged. This process was repeated for both datasets. Further, quality control steps were applied by filtering variant SNP sites for QualityByDepth (QD) score < 3 and depth < 7, and indels for QD < 5 and depth < 10. SNP sites and indel sites that do not carry an alternate Allele Balance (AB) ≥ 15% and AB ≥ 20%, respectively, in at least one sample were filtered out. Further, markers with a call rate <90% and samples with a call rate <90% were removed. Related samples up to the third degree (IBD ≥ 0.125) were removed before running the association.
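The site-level quality filters above can be expressed as a small predicate. This is a sketch under stated assumptions: the `Site` record and its field names are hypothetical, and a site is assumed to fail when either the QD or the depth threshold is missed, since the text does not spell out how the two conditions are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    """Hypothetical per-site summary used only for this illustration."""
    is_indel: bool
    qd: float                         # QualityByDepth score
    depth: int                        # read depth
    alt_allele_balances: List[float]  # allele balance per alt-carrying sample


# (minimum QD, minimum depth, minimum alternate allele balance)
# keyed by whether the site is an indel; values taken from the text.
THRESHOLDS = {
    False: (3.0, 7, 0.15),   # SNP sites
    True: (5.0, 10, 0.20),   # indel sites
}


def passes_site_filters(site: Site) -> bool:
    """True if the site survives the QD, depth, and allele-balance filters."""
    min_qd, min_depth, min_ab = THRESHOLDS[site.is_indel]
    # Assumption: failing either threshold removes the site.
    if site.qd < min_qd or site.depth < min_depth:
        return False
    # At least one sample must carry the alternate allele at sufficient balance.
    return any(ab >= min_ab for ab in site.alt_allele_balances)
```

Keeping the thresholds in one table makes it easy to see that indels are held to stricter limits than SNP sites on all three criteria.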

Rare Variant Gene-Based Association Test
All variants below 5% MAF were considered rare, and they were binned into genes using BioBin v2.3.0 (19) (https://ritchielab.org/software/biobin-download). BioBin is software that can collapse variants into biologically informed bins, such as genes or pathways, and perform rare variant burden tests. BioBin uses a database called the Library of Knowledge Integration (LOKI), which integrates knowledge from various disparate data sources about the genomic locations of SNPs and genes, as well as known relationships among genes and proteins such as interaction pairs, pathways, and ontological categories (19). The gene annotations in LOKI are derived from Entrez Gene (54). BioBin uses these annotations from LOKI as bin regions to bin the variants. After creating the bins, the variants were weighted using Madsen and Browning weights (55). All variants with MAF <5% were binned, and genes with <20 variants were filtered out. SKAT-O was run using the R package (22). Additionally, age, BMI, and the first four principal components were used as covariates. The QQ-plot using gene-based SKAT-O p-values is provided in Supplementary Figure 1. The p-values from the association tests were adjusted for multiple testing using Bonferroni correction, and any gene with an adjusted p-value < 0.05 was considered significant.
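To make the weighting step concrete, the sketch below computes weights in the form given by Madsen and Browning, where the allele frequency of each variant is estimated in unaffected individuals with a pseudocount and rarer variants contribute more to a person's score. The function names and argument layout are hypothetical, and BioBin's internal implementation may differ in scaling or detail.

```python
import math
from typing import Sequence


def madsen_browning_weight(
    alt_alleles_in_controls: int, n_controls: int, n_total: int
) -> float:
    """Weight w = sqrt(n * q * (1 - q)) for one variant.

    q = (m_U + 1) / (2 * n_U + 2), where m_U is the alternate allele count
    among the n_U unaffected individuals and n is the total sample size.
    """
    q = (alt_alleles_in_controls + 1) / (2 * n_controls + 2)
    return math.sqrt(n_total * q * (1 - q))


def weighted_burden(genotypes: Sequence[int], weights: Sequence[float]) -> float:
    """Per-individual score: alternate allele counts divided by the weights.

    Rarer variants have smaller weights, so each allele counts for more.
    """
    return sum(g / w for g, w in zip(genotypes, weights))
```

The pseudocount in the frequency estimate keeps the weight finite for variants that are never observed in controls.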

Rare Variant Pathway-Based Association Test
Rare variants were binned into gene pathways to test the association of pathways with endometrial cancer. LOKI is integrated with KEGG pathway information (56). The version of LOKI used in this study integrated the latest data on 15 April 2017 using the KEGG API. The rare variants were binned into 302 pathways and weighted using Madsen and Browning weights (55). Associations were tested using SKAT-O, adjusting for age, BMI, and the first four principal components. The QQ-plot using pathway-based SKAT-O p-values is provided in Supplementary Figure 2. The SKAT-O p-values were adjusted for multiple testing using Bonferroni correction, and any adjusted p-value < 0.05 was considered significant.

Survival Analysis
Survival analysis was run using Cox regression adjusting for age and BMI, using the following model:
Survival (months, alive) ∼ x + age + BMI
where "x" is the input from the BioBin phe-bins output file, which contains Madsen-Browning weighted rare variant counts for each patient and gene. The "months" were measured as the number of months from birth to death for deceased patients and the number of months from birth to the last follow-up recorded in the EHR for living patients.
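For readers less familiar with the notation, the model above corresponds to the standard Cox proportional hazards form written out below. The coefficient symbols are introduced here for illustration and do not appear in the original text.

```latex
% Hazard at time t for a patient with weighted rare variant count x,
% age, and BMI; h_0(t) is the unspecified baseline hazard.
h(t \mid x, \mathrm{age}, \mathrm{BMI})
  = h_0(t)\,\exp\!\left(\beta_x x + \beta_{\mathrm{age}}\,\mathrm{age}
      + \beta_{\mathrm{BMI}}\,\mathrm{BMI}\right)

% Hazard ratio per unit increase in x, holding age and BMI fixed.
\mathrm{HR}_x = \exp(\beta_x)
```

Under this form, a hazard ratio above 1 for x would correspond to poorer survival among patients carrying more weighted rare variants in the gene, and the reported Cox p-value tests whether the coefficient of x differs from zero.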

DATA AVAILABILITY
The raw data supporting the conclusions of this manuscript will be made available by the authors to any qualified researcher subject to a data use agreement.

ETHICS STATEMENT
The DiscovEHR study cohort is derived from individuals who consented to participate in Geisinger's MyCode Community Health Initiative as described previously (1,57). Additionally, IRB approval was obtained for this work (IRB-2016-0119).