The Cancer-Associated Genetic Variant Rs3903072 Modulates Immune Cells in the Tumor Microenvironment

Genome-wide association studies (GWAS) have hitherto identified several germline variants associated with cancer susceptibility, but the molecular functions of these risk modulators remain largely uncharacterized. Recent studies have begun to uncover the regulatory potential of noncoding GWAS SNPs using epigenetic information in corresponding cancer cell types and matched normal tissues. However, this approach does not explore the potential effect of risk germline variants on other important cell types that constitute the microenvironment of tumor or its precursor. This paper presents evidence that the breast-cancer-associated variant rs3903072 may regulate the expression of CTSW in tumor-infiltrating lymphocytes. CTSW is a candidate tumor-suppressor gene, with expression highly specific to immune cells and also positively correlated with breast cancer patient survival. Integrative analyses suggest a putative causative variant in a GWAS-linked enhancer in lymphocytes that loops to the 3’ end of CTSW through three-dimensional chromatin interaction. Our work thus poses the possibility that a cancer-associated genetic variant could regulate a gene not only in the cell of cancer origin but also in immune cells in the microenvironment, thereby modulating the immune surveillance by T lymphocytes and natural killer cells and affecting the clearing of early cancer initiating cells.


INTRODUCTION
Genome-wide association studies (GWAS) have been effective in identifying common genetic risk factors for several diseases including cancer. The cancer-associated genetic variants discovered by GWAS, however, are not necessarily causative themselves but may be in linkage disequilibrium (LD) with other functional variants. Since most GWAS variants are located in noncoding regions, previous functional characterization studies have focused on the gene regulatory function of these linked variants in cancer cells themselves and in matched normal counterparts (Cowper-Sal lari et al., 2012;Ghoussaini et al., 2014;Claussnitzer et al., 2015;Zhang et al., 2018b). For example, usage of breast cancer epigenome facilitated the discovery of a GWAS-linked functional variant that disrupts a binding site of FOXA1, which is a critical pioneer factor in estrogen receptor-positive (ER+) breast cancers (Cowper-Sal lari et al., 2012); similarly, another study identified a functional diabetes-associated variant using the epigenomic information in adipose-derived mesenchymal stem cells (Claussnitzer et al., 2015). Although new insights have resulted from these investigations, a provocative question that has not yet been examined is whether select cancer-associated germline variants could also be functional in cell types other than the cell of cancer origin, such as endothelial cells and immune cells, within the heterogeneous tumor microenvironment (Liu and Mardis, 2017). For example, in tumor-infiltrating lymphocytes (TIL), genetic variants regulating cytotoxicity-controlling genes may impact TIL's ability to eliminate cancerous cells, thereby functioning as cryptic modulators of cancer susceptibility that have escaped our attention to date. Since cancer initiation not only involves the acquisition of mutations in normal cells but also depends on the efficiency of immune surveillance against abnormal cells, it is important to identify cancerassociated germline variants that may contribute to cancer susceptibility through modulating immune cells (Lim et al., 2018;Schmiedel et al., 2018).
We have previously introduced a systematic computational framework for studying regulatory functions of noncoding GWAS variants associated with ER+ breast cancer by employing epigenomic information from breast cancer cell lines and normal mammary epithelial cells (Zhang et al., 2018b). By incorporating additional data in immune cells, we here apply this approach to present evidence for the possibility that a breast cancer GWAS variant may influence immune cells in the microenvironment of tumor or its precursor. We demonstrate that the breast-cancer-associated single nucleotide polymorphisms (SNP), rs3903072, targets the gene CTSW uniquely in TILs. CTSW encodes a cysteine proteinase highly specific to natural killer (NK) cells and T cells and is potentially involved in regulating their cytotoxity; consistently, CTSW expression negatively correlates with both the risk allele at rs3903072 and the survival probability in breast cancer. We propose an intergenic regulatory variant, in high LD with rs3903072, as a predicted functional SNP, which falls in a putative regulatory element (PRE) physically interacting with the 3' of CTSW. Our work renews the interest of CTSW in tumor surveillance and showcase a situation that shall be considered in functional characterization of GWAS variants.

GWAS Variants
A list of ER+ breast-cancer-associated variants were first obtained from Michailidou et al. (2013) and the NHGRI-EBI GWAS catalog (MacArthur et al., 2017). GWAS variants associated with immunoinflammatory traits were identified using the disease category information from the Experimental Factor Ontology (EFO) database (Malone et al., 2010). We ranked the ER+ breast cancer GWAS SNPs based on the number of proximal immunoinflammatory GWAS SNPs and found rs3903072 to be the top SNP (Supplementary Methods). A supplementary table was also obtained from Michailidou et al. (2017), which provides all SNPs associated with breast cancer with p < 10 -5 .

The Cancer Genome Atlas (TCGA) Cancer Data
The germline genotypes at tag SNPs of breast cancer (Breast Invasive Carcinoma, BRCA) patients in the TCGA dataset were downloaded from the TCGA Data Portal. The tumor copy number segmentation data in hg19 from the NCI Genomic Data Commons (GDC) Legacy Archive (Grossman et al., 2016) were used to compute gene copy number (CN). The processed gene expression data in fragments per kilobase million (FPKM) measured by RNA-seq were downloaded from the TCGA GDC data portal (Grossman et al., 2016). Germline genotypes from normal tissues and CN/RNAseq data from tumor tissues were matched using TCGA barcodes representing patients. For three other TCGA cancer datasets-uterine corpus endometrial carcinoma (UCEC), head-neck squamous cell carcinoma (HNSC), and low-grade glioma (LGG)-only the germline genotypes and processed gene expression levels were used. Genotype imputation was then performed for BRCA, UCEC, HNSC, and LGG datasets using the Michigan Imputation Server (Das et al., 2016) (Supplementary Methods).

The Genotype-Tissue Expression (GTEx) Project Data
The GTEx gene expression levels in reads per kilobase million (RPKM) and tissue type annotations were obtained from the GTEx portal (GTEx Consortium, 2013;Carithers et al., 2015). We analyzed the GTEx data with two aims: comparing the expression levels of a certain gene across different tissues, and analyzing the correlation between the genotype Abbreviations: GWAS, genome-wide association studies; LD, linkage disequilibrium; ER+, estrogen receptor-positive; TIL, tumor-infiltrating lymphocytes; SNP, single nucleotide polymorphisms; NK, natural killer; PRE, putative regulatory element; EFO, Experimental Factor Ontology; TCGA, The Cancer Genome Atlas; BRCA, breast invasive carcinoma; GDC, Genomic Data Commons; CN, copy number; FPKM, fragments per kilobase million; UCEC, uterine corpus endometrial carcinoma; HNSC, head-neck squamous cell carcinoma; LGG, low-grade glioma; GTEx, genotype-tissue expression; RPKM, reads per kilobase million; eQTL, expression quantitative trait loci; THPA, the human protein atlas; CCLE, cancer cell line encyclopedia; FANTOM, functional annotation of the mammalian genome; ENCODE, Encyclopedia of DNA Elements; GEO, Gene Expression Omnibus; ChIA-PET, chromatin interaction analysis by paired-end tag sequencing; GEO, Gene Expression Omnibus; 3D, three-dimension; DHS, DNase I hypersensitive sites; PWM, position-specific weight matrices; mRNA, messenger RNA; RNA-seq, RNA sequencing; MAF, minor allele frequency; KIRC, kidney renal clear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; KICH, kidney chromophobe; DNase-seq, DNase I hypersensitive sites sequencing; CAGE, cap analysis of gene expression; SMC1, structural maintenance of chromosomes protein 1; IL-2, interleukin-2. at a GWAS SNP and candidate target gene expression level. For the first purpose, we used the mean expression level of CTSW measured in the GTEx tissues (Supplementary Methods). For the second aim, we used the fully processed, filtered, and normalized gene expression levels in the breast mammary tissue and the whole blood from GTEx Analysis V7 (dbGaP Accession phs000424.v7.p2); the imputed genotypes were extracted from the controlled-access dbGaP Accession phg000520.v2 (GTEx V2) dataset.
Expression Quantitative Trait Loci (eQTL) Analysis for TCGA-BRCA In this paper, a pilot eQTL analysis was first performed among ER+ breast cancer patients. For this ER+ breast cancer analysis, we constructed a multivariate linear model for each gene within the 3-Mb region centered at rs3903072, regressing the gene expression levels against the genotypes at the GWAS SNP rs3903072 as well as the gene CN (Zhang et al., 2018b)

TCGA Survival Analysis
Survival analysis in TCGA ER+ breast cancer patients was performed using the clinical data obtained from TCGA-GDC. The differences in survival rate between the two breast cancer patient groups separated by CTSW median expression level were tested using log-rank test (Supplementary Methods). Survival analysis results were also obtained in endometrial cancer (UCEC), head and neck cancer (HNSC), and renal cancer, all from the human protein atlas (THPA) (Uhlen et al., 2017d), choosing the median expression level as the cutoff threshold for grouping patients (Supplementary Methods).

Tissue Specificity of CTSW in Expression and Promoter Accessibility
CTSW gene expression in a variety of tissues and cell lines was obtained from BioGPS GeneAtlas (Wu et al., 2016), cancer cell line encyclopedia (CCLE) (Barretina et al., 2012), and functional annotation of the mammalian genome (FANTOM) (The Fantom Consortium and the Riken PMI and CLST (DGT) et al., 2014) resources. DNase-seq chromatin accessibility measurements in various tissues were obtained from the Encyclopedia of DNA Elements (ENCODE) (The Encode Project Consortium et al., 2012) and the Roadmap Epigenomics project (Bernstein et al., 2010) (Supplementary Methods).

Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) Data Analysis
We searched the ENCODE and Gene Expression Omnibus (GEO) (Barrett et al., 2013) databases for available threedimension (3D) chromatin interaction data in lymphocyte cell lines expressing CTSW and found two ChIA-PET datasets in the Jurkat cell line for the proteins SMC1 (GSE68978) (Hnisz et al., 2016) and RAD21 (ENCODE Accession ENCSR361AYD). For the SMC1 ChIA-PET data, we used the significant interactions processed and merged by the authors from GEO. For the RAD21 ChIA-PET data, we collected all the raw sequences and generated the chromatin interactions using ChIA-PET 2 (Li et al., 2017) with default parameters.

Prioritization of Functional SNPs Linked to GWAS SNPs
We first selected all common (minor allele frequency, MAF ≥ 0.05) SNPs from 1000 Genomes Project Phase 3 (The Genomes Project Consortium et al., 2015) in high LD (r 2 ≥ 0.8, EUR population) with rs3903072. To prioritize SNPs located in PREs, we collected DNase I hypersensitive sites (DHS) from ENCODE and the Roadmap Epigenomics Project in lymphocyte-related cells, such as T cells, NK cells, B cells, T helper cells, and common myeloid progenitor cells. The LD SNPs overlapping any of the DHS peaks were prioritized for further investigation. In addition to DHS, we also used H3K4me1 modification (processed wiggle track in Jurkat cells from GSE119439) (Leong et al., 2017) to prioritize SNPs within putative regulatory elements.

The rs3903072 Risk Locus Has Rich Immunoinflammatory Signals in Proximity
We hypothesized that the proximity of a genetic variant associated with cancer to those associated with immunoinflammatory traits might indicate pleiotropy of nearby genes or regulatory variants. In this regard, we examined the NHGRI-EBI GWAS Catalog ( Table 1). The SNP rs3903072 has been found to be associated with ER+ breast cancer in multiple GWAS studies (Michailidou et al., 2013;MacArthur et al., 2017;Michailidou et al., 2017), and lies in close physical distance, but with weak genetic linkage (Supplementary Table 2), to multiple variants associated with immunoinflammatory diseases-such as rs118086960 with psoriasis (an autoimmune disease) (Tsoi et al., 2017), rs77779142 with rosacea symptom (an inflammatory skin condition) (Aponte et al., 2018), rs2231884 with inflammatory bowel disease (Jostins et al., 2012), and rs568617 with psoriasis and Crohn's disease (an inflammatory bowel disease) (Ellinghaus et al., 2016) (Figure 1A;  Table 1). The SNP rs3903072 shows a stronger association with breast cancer (p = 2 × 10 -12 ) than the variant rs617791 (p = 7 × 10 -6 ). (B) The eQTL results for rs3903072. A full list of eQTL genes is in Supplementary Table 3. Left three bar plots show the significance of the contribution from rs3903072 genotype to the expression level of MUS81, CTSW, FIBP and EIF1AD. Positive values represent higher expression as the number of risk allele increases, and negative values represent the opposite trend. CTSW is the only eQTL gene with a significant suppression in the risk genotype group. This negative correlation between CTSW expression level and the genotype status is confirmed in the three independent datasets shown. The right bar plot shows the significance of the contribution from gene copy number to the expression level of each gene. Filled bar plots represent tumor, while transparent plots represent normal tissues; in this paper, we use the cyan color to indicate breast-related cancer or normal cells and the magenta color to indicate blood-related cells. The dotted lines in gray mark the significance threshold of p = 0.05. Table 1). A direct link between this noncoding SNP rs3903072 and its regulatory function in mammary epithelial cells is currently unknown; similarly, it remains uncharacterized how and why the aforementioned SNPs in the region affect diverse immunoinflammatory traits. Discovering the target genes of rs3903072 thus represents a major step towards identifying a potential regulatory mechanism common to both breast cancer susceptibility and immunoinflammatory traits. eQTL and Survival Analysis Demonstrate the Tumor-Suppressive Role of CTSW To identify candidate target genes, we applied the approach of eQTL, quantifying the correlation of messenger RNA (mRNA) levels of nearby genes with the genotype status at rs3903072 (Material and Methods). Using ER+ breast tumor RNA sequencing (RNA-seq) and genotyping data from the BRCA dataset of TCGA, we identified several significant eQTL genes in cis for rs3903072, including CTSW, FIBP, MUS81, and EIF1AD (genotype p-values of a linear model adjusting for gene copy number: p= 3.52 × 10 -5 , p = 3.22 × 10 -5 , p = 1.24 × 10 -4 , p = 3.28 × 10 -3 , respectively ( Figure  1B); a complete list of eQTL genes in Supplementary Table 3), confirming the results previously reported (Michailidou et al., 2013). Notably, CTSW was among the most significant eQTL genes; the negative correlation between CTSW expression and the number of risk alleles indicated a tumor-suppressive role of this gene (Figure 2A). In line with the eQTL result, survival analysis of BRCA patients showed that higher CTSW expression was associated with significantly better survival probability [log-rank p-value with median expression cutoff; p = 0.026, for ER+ breast cancer patients analyzed in this study ( Figure 2B); p = 7.3 × 10 -4 for all BRCA patients in the analysis performed by THPA (Uhlen et al., 2017d), image: (Uhlen et al., 2017a)]. By contrast, according to the TCGA analysis presented in THPA, other eQTL genes were not significantly associated with breast cancer patient survival.

Supplementary
Consistent with our finding, a similar drop in survival probability with lower CTSW expression was also observed in other cancer types including endometrial cancer (UCEC) and head and neck cancer (HNSC), according to THPA (Uhlen et al., 2017d) [log-rank p-values with median expression cutoff; UCEC: p = 4.1 × 10 -4 , image: (Uhlen et al., 2017c); HNSC: p = 1.9 × 10 -2 image: (Uhlen et al., 2017b); Supplementary Methods]. In THPA, although the renal cancer group showed an opposite survival trend when kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), and kidney chromophobe (KICH) were combined (THPA p = 1.4 × 10 -4 ; log-rank p-value with median expression cutoff), the trend was significant only in KIRP when each group was checked separately (THPA; Supplementary Methods). Further eQTL analysis confirmed a similar negative correlation between CTSW expression level and the rs3903072 risk genotype in UCEC and HNSC, as well as in LGG, a cancer type not shown in the THPA survival analysis webpage (linear model between expression and rs3903072 genotype; UCEC: p = 1.52 × 10 -3 ; HNSC: p = 5.45 × 10 -3 ; LGG: p = 7.09 × 10 -3 ; Supplementary Figure 2). Together, these results demonstrate that CTSW likely has an important biological function in cancer and that the breast cancer risk allele rs3903072-G is significantly associated with decreased expression of CTSW.

The GWAS-CTSW Association Arises From Tumor-Infiltrating Lymphocytes
Interestingly, the following pieces of evidence show that unlike other eQTL genes, CTSW is specifically expressed and functions in blood cells, particularly in NK cells and T cells. First, CTSW encodes the protein cathepsin W, also named lymphopain, which is a cysteine protease reported to be involved in the cytolytic activity of NK cells and cytotoxic T cells (Wex et al., 2001;O'Leary et al., 2016). Second, the CTSW promoter region is not accessible in normal mammary cells or breast cancer cells (HMEC, MCF-7, T-47D) but is open in CD8+ T cells, CD56+ NK cells, CD34+ common myeloid progenitor cells, and the acute T cell leukemia cell line Jurkat (Figure 2C), according to the DNase I hypersensitive sites sequencing (DNase-seq) data from ENCODE and the Roadmap Epigenomics Project. Third, CTSW is predominantly expressed in blood cell lines but not detectable in human mammary cell lines, according to the gene expression measurements in BioGPS (microarray; Figure 2D) and CCLE (RNA-seq; Supplementary Figure 3). Fourth, the CTSW promoter is actively transcribed in several lymphocytes but not in breast cells, according to the cap analysis of gene expression (CAGE) sequencing data from FANTOM5 (Supplementary Figure 4). Fifth, among different normal tissue types, CTSW expression level measured by GTEx (GTEx Consortium, 2013;Carithers et al., 2015) is highest in whole blood, moderate in lung and spleen, and low or undetectable in other tissues, whereas other eQTL genes such as FIBP, MUS81, and EIF1AD are relatively ubiquitously expressed across different tissue types including the mammary gland (Supplementary Figure 5). Finally, it has been recently shown that an elevated level of CTSW expression is observed in CD8+ T cells with enhanced immunity against bacterial infection and cancer (Oghumu et al., 2015), as well as in renal cancer with high lymphocyte infiltration (Ghatalia et al., 2018). Together, these findings demonstrate the high specificity of CTSW expression to immune cells, indicating that the CTSW mRNA in the TCGA breast tumor bulk RNA-seq has likely arisen from TILs in the heterogeneous tumor microenvironment (Liu and Mardis, 2017). In fact, the expression patterns of immune signature genes in TCGA RNA-seq data have been used to infer the abundance of different immune cells in tumor and quantify immune infiltration levels .
We thus hypothesized that the breast-cancer-associated GWAS variant rs3903072 may regulate CTSW in immune cells within the tumor microenvironment, independent of the other eQTL genes that could potentially be regulated separately in breast cancer cells. Several observations supported this idea. First, CTSW was the only TCGA-BRCA eQTL gene that remained correlated with the GWAS genotype status in the GTEx normal mammary tissue (p = 8.64 × 10 -4 ) and whole blood (p = 0.016; Figure 1B  events abundant in cancer cells. Third, CTSW was the only eQTL gene showing negative correlation with the number of risk alleles at the GWAS SNP, whereas other eQTL genes had the opposite trend, indicating that CTSW may play a tumor-suppressive role in TILs, while others may be involved in promoting cancer progression (linear regression coefficient in BRCA ER + eQTL: CTSW, r = -0.22; FIBP, r = 0.09; MUS81, r = 0.08; EIF1AD, r = 0.05; Figure 1B). Lastly, CTSW was the only gene of known function related to immune cells across the region shown in Figure 1A (Supplementary Table 4), where multiple GWAS associations point to immunoinflammatory traits. Even though we do not exclude the possibility that other eQTL genes may also have important functions in TILs or breast cancer cells, the tissue specificity and the correlation structure of CTSW expression strongly suggest its significant modulation in tumor-infiltrating immune cells by the GWAS SNP rs3903072 itself or a linked genetic variant.

A Putative Regulatory SNP in CTSW Promoter Does Not Solely Explain the Breast Cancer Association
As the GWAS SNP rs3903072 itself did not reside in an open chromatin region in immune cells (Figure 3A), we next searched for putative regulatory variants that could directly control CTSW expression. We first found the SNP rs658524 to be located at the center of a DHS peak in CTSW promoter among several lymphocyte cell lines (Supplementary Figure 6). On the one hand, the SNP rs658524 was simultaneously linked to two of the immunoinflammatory GWAS variants. Namely, the GWAS SNP rs77779142, associated with Rosacea symptoms, was in tight LD with the CTSW promoter SNP rs658524 (r 2 = 0.78 with rs658524; r 2 = 0.17 with rs3903072; 1000 Genomes Phase 3 EUR population), despite being closer to the breast cancer GWAS SNP rs3903072 than to rs658524 in genomic distance (16.6 kb to rs3903072 vs. 47.6 kb to rs658524). Another GWAS SNP rs568617, associated with Psoriasis and Crohn's disease, resided in intron of the gene FIBP next to CTSW and was in high LD with the CTSW promoter SNP rs658524 (r 2 = 0.99 to rs658524; r 2 = 0.19 to rs3903072; Figure 1A). On the other hand, the promoter SNP rs658524 was strongly correlated with CTSW expression, according to our eQTL analysis in TCGA (linear model between expression and rs658524 genotype; BRCA: p = 1.02 × 10 -17 ; UCEC: p = 1.50 × 10 -11 ; HNSC: p = 1.32 × 10 -12 ; LGG: p = 1.43 × 10 -6 ; Supplementary Figure  7A) and GTEx (mammary tissue: p = 2.19 × 10 -11 ; whole blood: p = 8.48 × 10 -5 ; Supplementary Figure 7B), consistent with eQTL results from other immune cell studies (Raj et al., 2014).
We here note that rs658524-A also showed partial association with breast cancer risk, since the haplotypes carrying the rs658524-A allele were found to be largely biased towards the GWAS risk allele rs3903072-G compared to the alternative allele rs3903072-T, despite the balanced MAF of rs3903072 (rs3903072 MAF = 0.46; 188 haplotypes with rs658524-A-rs3903072-G and 3 haplotypes with rs658524-A-rs3903072-T among the 1,006 haplotypes from the 1000 Genomes Project Phase 3 EUR population; Supplementary Figure 6B). In fact, rs658524 was in weak LD with rs3903072 (low r 2 = 0.186, but high D' = 0.966), with the GWAS risk SNP having a much higher allele frequency than the risk promoter SNP (rs3903072-G frequency: 0.54; rs658524-A frequency: 0.19; 1000 Genomes Project Phase 3, EUR). However, the CTSW promoter SNP rs658524 itself did not entirely explain either the GWAS association or the CTSW regulation in this region. A recent study reporting GWAS SNPs with a p < 10 -5 for breast cancer (Michailidou et al., 2017) included the CTSW locus (Manhattan plot in Figure 3A). The top SNP linked to rs658524 was rs12225345 (r 2 = 0.84), which was only moderately associated with breast cancer (p = 1.13 × 10 -6 ), separated from the top GWAS signal cluster represented by rs3903072 (p = 2.25 × 10 -12 ) (Michailidou et al., 2017). Furthermore, a conditional eQTL analysis showed that, within the group of TCGA patients carrying the homozygous genotype rs658524-G/G, the rs3903072 risk allele still displayed a residual negative effect on CTSW expression (Welch t-test, two-sided, p = 6.0 × 10 -4 ; GTEx whole blood data; Supplementary Figure 8). Thus, although the CTSW promoter SNP was in high D′ with the breast cancer GWAS SNP rs3903072, it did not solely explain the breast cancer risk in 11q13.1, and other functional SNPs likely influenced the expression of CTSW.

An Active Distal Enhancer of CTSW Harbors a Candidate Functional Variant Linked to rs3903072
Given that the GWAS SNP rs3903072 was located 64 kb away from CTSW promoter, we tested whether some putative functional SNPs tightly linked to rs3903072 could affect distal enhancer activities modulating CTSW expression. We thus examined all common (MAF ≥ 0.05) SNPs from 1000 Genomes Project Phase 3 EUR population in high LD (r 2 ≥ 0.8) with the GWAS SNP rs3903072 and prioritized the potential functional ones using epigenetic information. In detail, by overlapping the 30 high LD SNPs with DHS of lymphocyte cell lines (Supplementary Table 5), we identified three SNP-containing PREs): PRE1 located 3 kb away from rs3903072, PRE2 at SNX32 promoter, and PRE3 at EFEMP2 promoter (Supplementary Figure 9). Further investigation of the available 3D chromatin interaction data in immune-related cells highlighted PRE1 as the top regulatory element physically interacting with the 3' end of CTSW, as assessed by the structural maintenance of chromosomes protein 1 [SMC1; GSE68978 (Hnisz et al., 2016)] ChIA-PET data ( Figure 3A). Another Jurkat ChIA-PET data for RAD21, a cohesin complex component, showed an indirect interaction linking PRE1 and CTSW mediated through an anchoring element near EIF1AD, suggesting multiway interactions between the several anchoring elements or enhancers (Materials and Methods; Supplementary Figure  9). To contrast the chromatin interaction of PRE1 with CTSW between NK/T cells and mammary cells, we predicted highresolution (5 kb) Hi-C interactions in NK cells, CD8+ αβ T cells, and benign variant human mammary epithelial cells (vHMEC). Using random forest-based regression models trained separately on high-resolution Hi-C data in five different cell lines (Rao et al., 2014) (Supplementary Methods), we predicted the contact counts in the three cell types of interest within 1 Mb from rs3903072. Consistent with the ChIA-PET data, NK cells were found to have the highest predicted contact count for the pair of rs3903072-PRE1 and CTSW in all five models (Supplementary Figure 10), in contrast to vHMEC, which had the lowest predicted contact counts. These findings together suggested that PRE1 linked to the GWAS SNP could function as a distal regulatory element controlling CTSW expression selectively in NK and T cells.
Among the prioritized SNPs residing in the three PREs, we identified rs11227311 in PRE1 as a putative functional SNP (r 2 = 0.89 with rs3903072, 1000 Genomes Phase 3, EUR). More precisely, it not only overlapped a DHS in NK cells, B cells, and type 2 T helper cells (ENCODE accession number: ENCFF933OXV, ENCFF772OPR, ENCFF001WTS, and ENCFF001WTQ) but also FIGURE 3 | Potential regulatory mechanisms for CTSW. (A) The genomic region ranging from the GWAS SNP rs3903072 to CTSW is shown, including the gene track, the GWAS LD SNPs, epigenetics information, and 3D chromatin interaction. The chromatin accessibility data are shown for breast-related cells and blood cells from ENCODE and Roadmap Epigenomics Project. The H3K4me1 modification track and the SMC1 ChIA-PET significant interactions in the Jurkat cell line are from GSE119439 and GSE68978. The bottom Manhattan plot shows the SNPs associated with breast cancer with p < 10 -5 . Three SNPs are marked: the GWAS SNP rs3903072 and the GWAS-linked putative enhancer SNP in blue and the CTSW promoter SNP in magenta. (B, C) Zoomed-in views of the two potential regulatory SNPs. For the GWAS-linked putative enhancer SNP, TCF3 ChIP-seq data in Kasumi1 and TBX21 ChIP-seq data in GM12878 are shown. Two candidate TF motifs predicted to be affected by the SNP are also shown, where the protective allele is marked blue and the risk allele red. For the CTSW promoter SNP, KLF1 ChIP-seq profile in erythroid cells is plotted, along with a KLF family motif (rc, reverse complement).
H3K4me1 modification and the ChIA-PET region interacting with CTSW 3' end in Jurkat (Figure 3A; Supplementary Figure 9). Furthermore, the Roadmap Epigenomics Project annotates the PRE1 region as TSS and weak enhancer in multiple types of primary blood cells (Supplementary Figure 11). To identify candidate TFs in PRE1 potentially affected by rs11227311, we scanned the short sequences around the SNP for TF motifs, using the program FIMO (version 4.12.0) and position weight matrices (PWM) collected from multiple motif databases (Materials and Methods). Using our previously described method for measuring the significance of motif disruption by a SNP, based on simulating null mutations on the PWMs (Zhang et al., 2018b), we identified a list of candidate TF FIGURE 4 | An illustration of our hypothesis. The putative molecular gene-regulation process is shown in the top boxed panels, and the tumor initiation process is illustrated in the two rows below. The left panel shows the process of cancer immune surveillance in people carrying the protective alleles at the GWAS SNP and the predicted functional enhancer SNP. In the top left box, the chromosome carrying the protective alleles produces an abundant CTSW mRNA level; as shown here, CTSW transcription could be elevated through a transcription activator binding the PRE1 SNP. Note that another scenario, not shown in the illustration, is also possible, where the protective allele could disrupt the binding motif of a transcription repressor. Going back to the cell view, the high level of CTSW expression in NK cells or T cells may enhance their cytotoxicity and facilitate their ability to detect and eliminate abnormal cells, such as cancerous mammary epithelial cells that just acquired some oncogenic mutations. This high efficiency of immune surveillance would thus reduce the risk of developing breast cancer. By contrast, the right panel shows the NK/T cells with suppressed CTSW expression associated with the risk alleles, resulting in reduced cytotoxic activities and suppressed immune surveillance efficiency, thereby increasing the risk of developing breast cancer. motifs disrupted by rs11227311, including the TEAD family, TCF family, NR3C1, POU2F1, and ETV5 ( Figure 3B; Supplementary  Figure 9; p-values from neutral mutation simulation: p = 0.0074, p = 0.0086, p = 0.002, p = 0.013, p = 0.045, respectively; Methods). GREAT (McLean et al., 2010) analysis of available ChIP-seq data suggested that some of our candidate TFs might regulate genes closely related to the immune system. For example, gene ontology terms related to interferon-gamma, an important immunoregulatory molecule, and pathways related to T-cell signaling were enriched for TEAD2 (K562 cell line; Supplementary Methods; Supplementary Table 6). It is also known that TCF1, one of the four TCF family members, plays an important role in normal development of natural killer cells (Jeevan-Raj et al., 2017). Although it was difficult to validate which TF can directly bind the PRE1 SNP due to insufficient ChIP-seq data in T/NK cells, we found the PRE1 candidate SNP rs11227311 to be located within a weak TCF3 ChIP-seq peak in Kasumi1 acute myeloid leukemia cell line [GEO GSE43834 (Sun et al., 2013)] (Figure 3B; Materials and Methods). Examination of other ChIPseq data in ENCODE for TFs in lymphocytes also showed that the SNP rs11227311 is at the center of a strong TBX21 ChIP-seq peak in GM12878 (ENCSR739IHN; Figure 3B). TBX21 is a T-box transcription factor controlling important genes in NK cells and type 1 T helper cells (O'Leary et al., 2016), and its binding supports the potential involvement of PRE1 in gene regulation in lymphocytes. In addition, we performed a motif analysis for the CTSW promoter SNP and found that it might disrupt the binding site of the KLF family TFs (p = 0.0086; Materials and Methods), the actual binding of which in this region was supported by a KLF1 ChIP-seq dataset in erythroid cells [GSE43625 (Su et al., 2013); Figure 3C].

DISCUSSION
In this paper, we performed functional characterization of breast cancer-associated GWAS variants and proposed the idea that a noncoding cancer GWAS SNP may regulate gene expression in immune cells within the tumor microenvironment. Figure 4 summarizes our hypothesis that the GWAS-linked SNP rs11227311 may directly affect TF binding affinity at the distal enhancer and regulate CTSW expression in cytotoxic lymphocytes, thereby affecting their ability to eliminate abnormal cells. As a member in the cathepsin family, CTSW is specifically expressed in NK and T cells with a potential role in their cytotoxicity; it can also be strongly induced in NK cells by interleukin-2 (IL-2) (Wex et al., 2001), which is a cytokine controlling T cell growth and NK cell cytotoxicity. Although the function of cathepsin W and its precise relation to lymphocyte cytotoxicity remain under debate (Dalton et al., 2013), the described association between CTSW and breast cancer susceptibility renews the interest in this gene as a component of immune surveillance against cancer. Recent studies have demonstrated that tumor impurity is an important factor to consider in eQTL analysis (Geeleher et al., 2018;Lim et al., 2018). Along this line, our work further highlights the need to examine the effect of GWAS SNPs on gene regulation not only in the cell type of disease but also in surrounding cells that may modulate the progression of pathology.

CONCLUSION
In summary, we have examined effects of cancer-associated risk alleles on tumor-infiltrating lymphocytes in the tumor microenvironment, which is usually neglected in recent functional interpretation studies. We presented evidence that a breast-cancer-associated variant may regulate the expression level of an NK/T cell-specific gene, not in breast cancer cells but in immune cells infiltrating the tumor microenvironment. Our study emphasizes the need to consider effects of cancerassociated germline variants in context of the tumor immune microenvironment, as well as the need to further study the role of CTSW in the interaction between tumor and the immune system.

ETHICS STATEMENT
The usage of NIH controlled-access datasets was approved by the NCBI dbGaP.

AUTHOR CONTRIBUTIONS
JS designed and supervised the study and was a major contributor in editing the manuscript. YZ analyzed and interpreted the data and was a major contributor in writing the manuscript. MM and JY performed analysis and contributed to writing the manuscript. BB, SZ, and SR performed the chromatin structure analysis and contributed to writing the manuscript. All authors read and approved the final manuscript.

ACKNOWLEDGMENTS
This manuscript has been released as a preprint at https://www. biorxiv.org/content/10.1101/493171v1 (Zhang et al., 2018a). The results appearing here are in part based upon the data generated by the TCGA Research Network (http://cancergenome.nih.gov/, dbGaP accession number phs000424.v6.p1 on 05/06/2016) and the GTEx Project, supported by the Common Fund of the Office of the Director of the National Institutes of Health and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. We acknowledge the ENCODE consortium that generated the datasets used in the manuscript.