SNPs in lncRNA Regions and Breast Cancer Risk

Long non-coding RNAs (lncRNAs) play crucial roles in human physiology, and have been found to be associated with various cancers. Transcribed ultraconserved regions (T-UCRs) are a subgroup of lncRNAs conserved in several species, and are often located in cancer-related regions. Breast cancer is the most common cancer in women worldwide and the leading cause of female cancer deaths. We investigated the association of genetic variants in lncRNA and T-UCR regions with breast cancer risk to uncover candidate loci for further analysis. Our focus was on low-penetrance variants that can be discovered in a large dataset. We selected 565 regions of lncRNAs and T-UCRs that are expressed in breast or breast cancer tissue, or show expression correlation to major breast cancer associated genes. We studied the association of single nucleotide polymorphisms (SNPs) in these regions with breast cancer risk in the 122970 case samples and 105974 controls of the Breast Cancer Association Consortium’s genome-wide data, and also by in silico functional analyses using Integrated Expression Quantitative trait and in silico prediction of GWAS targets (INQUISIT) and expression quantitative trait loci (eQTL) analysis. The eQTL analysis was carried out using the METABRIC dataset and analyses from GTEx and ncRNA eQTL databases. We found putative breast cancer risk variants (p < 1 × 10–5) targeting the lncRNA GABPB1-AS1 in INQUISIT and eQTL analysis. In addition, putative breast cancer risk associated SNPs (p < 1 × 10–5) in the region of two T-UCRs, uc.184 and uc.313, located in protein coding genes CPEB4 and TIAL1, respectively, targeted these genes in INQUISIT and in eQTL analysis. Other non-coding regions containing SNPs with the defined p-value and highly significant false discovery rate (FDR) for breast cancer risk association were discovered that may warrant further studies. These results suggest candidate lncRNA loci for further research on breast cancer risk and the molecular mechanisms.


INTRODUCTION
About 70-90% of the human genome is transcribed (Guttman et al., 2009;Mercer et al., 2009). The protein coding RNAs account for only a small fraction of all the transcripts, while non-coding RNAs (ncRNA) cover 95% (Dermitzakis et al., 2005;Kapranov et al., 2007;Mattick, 2009). These include long non-coding RNAs (lncRNAs), defined as ncRNAs with over 200 nucleotides. They participate in various biological processes, including differentiation, immune response and metabolism (Kretz et al., 2013;Hung et al., 2014;Wang et al., 2014) as well as in pathogenic processes, such as the development and progression of cancer (Gupta et al., 2010;Yang et al., 2013;Xing et al., 2014). Their expression exhibits cell type specificity and responds to various stimuli, suggesting a rigorous transcriptional regulation (Wang and Chang, 2011).
A curious subclass of lncRNAs are the ultraconserved regions (UCRs). These are stretches of DNA expanding over 200 nucleotides that are absolutely conserved between orthologous regions in human, mouse and rat (Bejerano et al., 2004). There exist 481 such regions spread across the human genome, and 93% of the UCRs are transcribed in at least one normal human tissue (Calin et al., 2007). However, the study of T-UCR expression is complicated: based on annotation compiled by Mestdagh et al. (2010), 38.7% of the 481 T-UCRs are intergenic and 57.4% of the 481 T-UCRs are located in protein coding genes (42.6% intronic, 4.2% exonic, 5% partly exonic, and 5.6% exon containing), and 3.9% of T-UCRs lack an explicit gene-related annotation, because of the host gene splice variants. For these intragenic T-UCRs, it is difficult to define if the expression signal/measurement comes from the T-UCR or from the host gene. Mestdagh et al. (2010) studied this question in neuroblastoma tissue and found 237 T-UCRs to be independently expressed while the expression of the remaining 244 T-UCRs was inseparable from the host gene expression, either because the T-UCR was expressed as a part of the host gene transcript, or because the T-UCR and host gene expressions correlate for some other reasons. Interestingly, many of the T-UCRs are located in cancer-related regions and fragile sites, and their expression is frequently altered in human cancer (Amos et al., 2017;Fabris and Calin, 2017;Terracciano et al., 2017).
Breast cancer is the most common cancer in women worldwide and the leading cause of female cancer deaths (Torre et al., 2015). Breast cancer risk has a strong hereditary aspect, especially genes encoding tumor suppressors, which play a role in DNA damage response and repair pathways, are mutated in hereditary breast cancer (Goldgar et al., 1994;Lichtenstein et al., 2000;Collaborative Group on Hormonal Factors in Breast Cancer, 2001;Nielsen et al., 2016). BRCA1 and BRCA2 genes carry pathogenic variants of high-penetrance that cover approximately 20% of the familial relative risk (Mavaddat et al., 2015). Other variants, the majority of them with moderate or low penetrance, have been found to cover little over 20%, putting the altogether familial relative risk coverage to approximately 44% . Up to the present, nearly 200 lowpenetrance susceptibility loci have been identified. While highand moderate-penetrance variants are often in protein coding regions, low-penetrance variants are typically located in noncoding regions Michailidou et al., 2017;Milne et al., 2017).
Recently, several studies have shown a link between genetic variants in lncRNA genes and breast cancer risk. Cui et al. (2018) found a SNP 2 kb upstream of H19 transcription start site that was associated with breast cancer risk in estrogen receptor (ER)-positive patients in the Chinese population. Wu et al. (2018) studied risk associations among 22977 cases and 105974 controls of European ancestry and found several novel risk-loci that harbored lncRNA genes. Three of these lncRNAs, and four altogether (ANRIL, H19, PVT1, and IGF2-AS), were reported to have disease association based on SNP-association either with breast cancer or prostate cancer risk or survival (Eeles et al., 2009;Turnbull et al., 2010;Meyer et al., 2011;Riaz et al., 2012). In addition, several lncRNAs have been found to be differentially expressed in various breast cancer subtypes (Mathias et al., 2019). While the precise functionality of lncRNAs in breast cancer remains to be elucidated, they play a role in the regulation of intracellular and intercellular signaling (Klinge, 2018).
The Breast Cancer Association Consortium (BCAC) is an international multidisciplinary consortium with a focus on inherited risk of breast cancer 1 . Their aim is to combine data from many studies to identify genes related to breast cancer risk and, with the world's largest collection of breast cancer case samples and controls, provide a powerful assessment of risk associated with the studied genes. BCAC has the largest genomic breast cancer dataset worldwide. Several papers describe in detail BCAC and genotyping projects using the BCAC dataset (Michailidou et al., 2013. In this study, we look into the breast cancer risk association of SNPs on lncRNAs expressed in mammary tissue or associated with known breast cancer risk genes, as well as SNPs located at the T-UCRs. We carried this out by analyzing the Breast Cancer Association Consortium's (BCAC) GWAS, OncoArray, and iCOGs SNP array summary statistics to find SNPs in or near lncRNAs or T-UCRs that associate with breast cancer risk. The loci with GWAS-significant results have been published recently Milne et al., 2017), and in this study we concentrate on the lncRNA and T-UCR related SNPs with p < 10 −5 to uncover other candidate lncRNA loci for further analysis. The functionality of the SNPs of interest was studied with integrated expression quantitative trait and in silico prediction of GWAS targets (INQUISIT; Michailidou et al., 2017) and eQTL analysis. We found putative breast cancer risk variants associated with the expression of lncRNA GAbinding protein transcription factor beta subunit 1 antisense RNA 1 (GABPB1-AS1), cytoplasmic polyadenylation element binging protein 4 (CPEB4) associated with uc.184, and TIA 1 cytotoxic granule associated RNA binding protein like 1 (TIAL1) associated with uc.313.

MATERIALS AND METHODS
The work flow of the study is presented in Figure 1.

Study Population
The analyses were based on summary results of the Breast Cancer Association Consortium (BCAC). The collaborative dataset of the BCAC contained 122970 female breast cancer case samples and 105974 controls of European ethnicity. Of these, 61282 cases and 45494 controls were genotyped using OncoArray (Amos et al., 2017), and 46785 cases and 42892 controls using iCOGs (Michailidou et al., 2013), while 14910 cases and 17588 controls came from 11 other breast cancer GWAS experiments . All participating studies were approved by their appropriate institutional ethics review board and all subjects provided informed consent. All research was performed in accordance with the relevant guidelines and regulations.

Selection of lncRNA Regions for the Study
We selected 565 regions of lncRNAs and T-UCRs. Following a comprehensive search for relevant lncRNAs we selected altogether 84 lncRNA regions with reported polymorphisms based on multiple criteria including tissue specific expression, positive expression correlation with high and moderate penetrance genes, and known disease associations (Supplementary Table S1). 46 lncRNAs had expression above five tags per million (Gibb et al., 2011) in breast tumor tissue. Ten of these, and 25 other lncRNAs, showed positive expression correlation with high and moderate penetrance genes (ten with BRCA1, three with BRCA2, two with ATM, one with CDH1, three with CHEK2, two with PALB2, thirteen with RAD51C, and one with TP53). Several lncRNAs showed positive correlation with multiple of these genes, but here only the strongest correlations are listed (Supplementary Table S1). For the correlation analysis, we used expression data from GENCODE. The expression data as normalized RPKM (reads per kilobase per million mapped reads) values was retrieved from GENCODE database v7 (Derrien et al., 2012). Twenty-two lncRNAs had a reported disease association defined either by higher expression in a tumor tissue compared to a normal tissue or by chromosomal aberrations in lncRNA regions in samples from breast, ovarian or prostate cancer (data retrieved from Long Non-coding RNA Database (Amaral et al., 2011), LncRNADisease database (Chen et al., 2013) and literature in March 2013) (Supplementary Table S2). Three of these lncRNAs, and four altogether (ANRIL, H19, PVT1, and IGF2-AS) were reported to have disease association based on SNP association either with breast cancer or prostate cancer risk or survival (Eeles et al., 2009;Turnbull et al., 2010;Meyer et al., 2011;Riaz et al., 2012). For these 84 lncRNAs we included SNPs located in exons and 50 kb flanking regions, 5 UTRs, and 150 nucleotides upstream from a transcription starting site. The SNPs in the 84 lncRNA regions were genotyped on the OncoArray genotyping chip (Amos et al., 2017). In addition, we selected 44 T-UCR regions that were either highly expressed in normal breast tissue and/or had a known enhancer activity and/or were located at cancer-associated genomic regions (Calin et al., 2007;Scaruffi, 2011) (Supplementary Table S3). SNPs in these T-UCR loci, including 50 bp extended region on both sides, with 1000 genomes European MAF ≥ 0.0013 were selected for genotyping on the OncoArray.
Here, we have included in the analysis all the genotyped SNPs in the 84 lncRNA regions and 44 T-UCR regions, and extended our study to also include the remaining T-UCR regions resulting in an extensive explorative study of all the 481 T-UCR regions in the genome (Bejerano et al., 2004). While Bejerano et al. (2004) reported no evidence that 256 of these 481 ultraconserved regions were transcribed, Calin et al. (2007) found that 93% of these regions were transcribed in at least one normal human tissue. Thus we decided to include all ultraconserved regions in this study alongside the other lncRNAs, as well as to refer to them as T-UCRs.
The regions of interest that were used to gather SNPs from the BCAC results database were defined as the above mentioned 565 lncRNA or T-UCR of interest, and 50 kb flanking it in both directions.

Genotyping
OncoArray contains approximately 533000 markers, while iCOGS holds 211000 (18,19). Their genotyping and the genotyping of the eleven GWAS in the BCAC has been previously described in detail (Michailidou et al., 2013. All samples were imputed using the version 3 (October 2014) release of the 1000 Genomes Project dataset as the reference panel. For iCOGS, OncoArray, and nine of the eleven GWAS, the imputation was carried out with a 2-stage approach using SHAPEIT2 for phasing and IMPUTE v2 for imputation; the two remaining GWAS were imputed separately using MaCH and Minimac (Howie et al., 2009(Howie et al., , 2012Li et al., 2010;O'Connell et al., 2014). The details of the imputation process have been described previously . Summary statistics used in the study were obtained through BCAC. In this study, we looked at associations in 565 specific regions, and used a p-value of p < 10 −5 as the limit of interrogation.

Target Gene Prediction
The functionality of the putative breast cancer risk variants was assessed by annotating each variant with publicly available genomic data from breast cells and by using a heuristic scoring system (Integrated Expression Quantitative trait and in silico prediction of GWAS targets, INQUISIT) that combines genomic data from multiple sources, including chromatin interactions, computational enhancer-promoter correlations, transcription factor binding chromatin immunoprecipitation followed by sequencing, gene expression and topologically associated domain boundaries, and which is described in detail by Michailidou et al. (2017). For this study, the target gene predictions were made from annotation in MCF7 and HMEC cells, and the prediction methods were chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), integrated methods for predicting enhancer targets (IM-PET) and analysis of super-enhancers as defined by Hnisz et al. (2013).

Expression Quantitative Trait Loci (eQTL) Analysis
The Genotype-Tissue Expression (GTEx) project's breast tissue eQTL results (version 7) were used to detect SNP associations with gene expression. The dataset included 251 normal breast tissue samples. The data used for the analyses in this study were downloaded from the GTEx Portal 2 on February 13th, 2018 (version 7).
In addition, an eQTL analysis of the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC, Curtis et al., 2012) dataset was carried out. The raw genotype data (Affymetrix SNP 6.0 platform) and normalized mRNA expression data (Illumina HT-12 v3 platform) extracted from matched DNA and RNA specimens of tumorous breast tissue were downloaded from the European Genome-phenome Archive 3 . The genotype data was processed with Affymetrix Genotyping Console Software following the practices of the Affymetrix SNP 6.0 analysis workflow. The workflow including a quality control step has been previously described (Jamshidi et al., 2015;Khan et al., 2015). After the quality control, the analysis contained 1328 samples with both genotype and expression data. The analysis was carried out using R-package Matrix eQTL with linear regression model (Shabalin, 2012).

Statistical Analysis
The BCAC summary results included a meta-analysis of OncoArray, iCOGS and 11 GWAS analyses, as well as effect size and standard error and p-value for these analyses for all variants. The meta-analysis has been described in detail in Michailidou et al. (2017) and the summary results are available at 5 . An FDR cut-off of 0.5 was used to evaluate the importance of the findings. FDR was calculated with the Benjamini-Hochberg procedure for all SNPs in the regions of interest using the R 3.5.2 environment (R Core Team, 2013) 6 .
Statistical analysis of INQUISIT is described in detail in Michailidou et al. (2017).
For the eQTL analysis results, a cut-off of nominal p = 0.05 was used. The eQTL data available at the GTEx Portal included p-values, normalized size effect (NES) and standard error to NES. The R-package Matrix eQTL used to carry out METABRIC eQTL analysis also provided FDR p-values. Those were viewed as additional information for the discussion. ncRNA eQTL statistical information included beta, r, and p-values.

RESULTS
In this study, we looked into the breast cancer risk association of SNPs in the regions of breast cancer-relevant lncRNAs and of T-UCRs around the genome in a large cohort of European breast cancer patients. We selected altogether 565 lncRNA regions that included 84 lncRNAs with reported polymorphisms based on multiple criteria, including tissue specific expression, co-expression with high and moderate penetrance genes, and known disease associations, and 481 ultraconserved regions. 153 ultraconserved regions including the 44 T-UCRs selected for the OncoArray were either highly expressed in normal breast tissue and/or had a known enhancer activity and/or were located at cancer-associated genomic regions while no such information was available for the rest of the ultraconserved regions. The regions of interest were defined as the transcribed lncRNA/T-UCR and 50 kb up-and downstream genomic sequence. For the sake of brevity and clarity, we numbered the lncRNA regions and refer to those numbers in this article instead of the subject lncRNA of the region. The nomenclature for the ultraconserved regions came from the article by Bejerano et al. (2004). All regions, genes and the rationales for selecting them for this study can be found in Supplementary Table  S1 and Supplementary Table S2. Positional annotations follow Human GRCh37/Hg19.

SNPs in Seven lncRNAs and Eight T-UCRs Associated With Breast Cancer Risk
We used BCAC summary statistics on risk results from metaanalysis of OncoArray, iCOGS and 11 separate genome-wide association studies (GWAS). The regions of interest included 5401 genotyped and 349112 imputed SNPs. Results with genomewide significance level (p < 5 × 10 −8 ) for five of the lncRNA regions and 18 of the T-UCR regions have previously been published by the BCAC and are listed in Supplementary  Table S4. These regions are undergoing further fine mapping studies by the BCAC. Here, a p-value of <10 −5 and MAF <0.45 was used as the limit of interrogation, resulting in seven lncRNA regions and eight T-UCRs containing three genotyped and 248 imputed SNPs not previously reported by the BCAC (Tables 1, 2). FDR was calculated for all the SNPs in the regions of interest to evaluate the importance of the findings (Supplementary Table S5). None of the SNPs in the T-UCR regions were directly in the T-UCRs themselves, but in the regions flanking them. This is expected due to the nature of ultraconservation, but makes it difficult to analyze the relationship between the SNP and the T-UCR.
In addition to the results from the meta-analysis of breast cancer overall, we interrogated the meta-analysis results from ERnegative and ER-positive patient subgroups separately ( Table 3). Fourteen SNPs (all imputed) were shared between the overall and ER-negative analyses, all located in the uc.147 region, and 5 SNPs had p < 10 −5 in the ER-negative analysis only (all imputed). Nine SNPs were shared between the overall and ER-positive analyses (all imputed) and no SNP gave a p-value under the threshold in ER-positive analysis only. None of the SNPs were shared by all three subgroup analyses.

Integrated Expression Quantitative Trait and in silico Prediction of GWAS Targets (INQUISIT) Predicts Target Genes for 60 SNPs in Two lncRNAs and Four T-UCRs
A heuristic scoring system, INQUISIT , was used to calculate the potential target genes for the 251 SNPs that were associated with breast cancer risk in BCAC analysis (Supplementary Table S6).
For 60 of the 251 SNPs, INQUISIT predicted one or more target genes (Supplementary Table S7). There were 12 genes predicted as targets altogether and each gene had 1-17 SNPs predicting it. The SNPs resided on two lncRNAs regions and on four T-UCRs; the number of SNPs per region ranged from 1-22. The predominant method of prediction was chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) in MCF7. All these SNPs had association (p < 10 −5 ) with overall breast cancer, with FDR <0.005. Four SNPs in uc.313 were also associated with ER-positive breast cancer. It is to be noted that MCF7 is a breast cancer cell line which may cause alteration in its cellular processes and may affect these results.
Overall, the INQUISIT-predicted target genes of the SNPs were not the lncRNA or T-UCR of the SNPs region, but mostly protein-coding genes. The only exception to this was GABPB1-AS1, the subject of the lncRNA-75 region and targeted by three lncRNA-75 SNPs in INQUISIT predictions. Three T-UCRs were located within the gene that the SNPs in their regions targeted by INQUISIT: uc.147 in the intron of LRBA, uc.184 in the 3 UTR of CPEB4 and uc.313 in the intron of TIAL1.
We searched for genes that, in addition to being INQUISIT target genes, show eQTL associations as well (see below). There were three such genes in our data: GABPB1-AS1 (GTEx eQTL analysis), CPEB4 and TIAL1 (METABRIC eQTL analysis) ( Table 4). Only in a few cases, the SNP targeting a gene in INQUISIT predictions was the same SNP that associates to the gene in eQTL (Table 5). However, the majority of these SNPs are linked.
Three SNPs targeted GABPB1-AS1 of lncRNA-75 region (rs1806845, rs71124350, and rs28489579) ( Table 6). All three clustered together approximately 31 kb downstream of the lncRNAs. These SNPs also have additional predicted targets, rs1806845 and rs71124350 target also SLC27A2 and GABPB1 and rs28489579 targets GABPB1 as well. However, none of these other predicted targets show association with p < 0.05 in the eQTL analyses.
CPEB4 was the predicted target of seventeen SNPs in the uc.184 region ( Table 7). The majority of the SNPs as well as uc.184 itself are located in the 3 UTR of the CPEB4. None of the SNPS overlap with uc.184. Two SNPs targeting CPEB4 also had other predicted targets, C5orf47 (rs17695092) and NGS2 (rs55946741).

Two SNPs Targeting GABPB1-AS1 in INQUISIT Also Associate With It in eQTL Analysis of GTEx Dataset
GTEx eQTL association in normal mammary tissue with the limit of p < 0.05 was found for 171 of the 251 SNPs in this study (Supplementary Table S8). All in all, the SNPs had 318 associations with 22 genes. The SNPs were found on seven lncRNA and five T-UCR regions (Supplementary Table S9). Each SNP showed association to 1-4 genes and each gene to 1-48 SNPs. Only one gene, GABPB1-AS1, which was targeted in INQUISIT predictions, also had SNP association in GTEx analysis. GABPB1-AS1 was also the only SNP associated gene that was also the target of a region of interest, lncRNA-75.

CPEB4 and TIAL1 Associate With SNPs Targeting Them in INQUISIT in eQTL Analysis of METABRIC Dataset
Of the 251 SNPs in this study, 20 had eQTL associations with the limit p < 0.05 in METABRIC (Supplementary Table S10). These SNPs were spread on three lncRNA regions and five T-UCRs (Supplementary Table S9). Even though the vast majority of the found associations were in cis, the specific lncRNAs or T-UCRs of the regions of interest were not associated with any of the SNPs. Altogether, the SNPs had 10322 associations ranging from 352 to 1151 associations per SNP. These associations contain 5858 genomic elements, including genes, pseudogenes, and expressed sequence tags. Each genetic element was associated with 1-17 SNPs. We focused on SNP and region associations with genes that were also INQUISIT-predicted target genes. There were two such genes, CPEB4 and TIAL1. Two SNPs, rs17695092 and rs1564823 in region uc.184, associated strongly in cis with CPEB4, p = 7.33 × 10 −61 (after FDR correction 3.76 × 10 −55 ) and p = 3.66 × 10 −64 (after FDR correction 3.75 × 10 −58 ) with beta coefficients of −0.317 and 0.324, respectively ( Table 7). These SNPs have the lowest p-values of the METABRIC analysis and they are in strong linkage disequilibrium (r 2 = 1.000). Both rs17695092 and rs1564823 as well as the T-UCR uc.184 are located within CPEB4 gene: rs17695092 lies in the intron 2, while rs1564823 and uc.184 are situated in the 3 UTR of the gene.
Three SNPS, rs4752331, rs3009879, and rs12569630 in uc.313, associated in cis with TIAL1 in METABRIC (Table 8). Only rs3009879 was predicted to target TIAL1 by INQUISIT. The three SNPs are linked as r 2 between rs3009879 and rs4752331 is 0.681, and between rs3009879 and rs12569630 r 2 = 0.967. Rs3009879 is intronic, located in the TIAL1, while rs4752331 and rs12569630 are located 7.3 kb downstream and 6.1 kb upstream of the TIAL1, respectively. However, while the p-values range from 0.0013 to 0.0078, none survives FDR correction (all FDR corrected are p-values >0.9). The beta coefficient for rs4752331 and rs12569630 variants is 0.04, and for rs3009879 -0.04.

DISCUSSION
In this study, we looked into the connection between lncRNAs and T-UCRs and breast cancer risk. The connection was investigated by identifying putative breast cancer risk SNPs in BCAC data located in or near lncRNAs and T-UCRs, assessing the SNPs' functional effects using heuristic scoring method INQUISIT that predicts target genes for risk SNPs by combining genomic information from multiple sources, and performing eQTL analysis. These analysis methods are especially suitable for gaining insight into the role of SNPs located in the areas flanking the lncRNAs and T-UCRs and not directly affecting their sequence. All the SNPS found in this study to be associated with breast cancer were flanking SNPSs. Of the 1303 breast cancer risk associated SNPs in 12 lncRNAand 26 T-UCR loci in the study, 251 were in loci not previously reported by BCAC (7 lncRNA and 8 T-UCR), and for 60 of these in two lncRNA regions and 4 T-UCR, INQUISIT predicted a target gene. For three of these genes, also an eQTL association was found in METABRIC or GTEx eQTL analysis ( Table 4).
INQUISIT analysis predicted GABPB1-AS1 as the target for two SNPs, rs71124350, and rs28489579, and the same SNPgene association was seen in GTEx eQTL analysis of normal mammary tissue ( Table 5). Results of a query to the ncRNA eQTL database support the eQTL association of GABPB1-AS1, and SNPS rs71124350 and rs28489579, although the database did not include these specific SNPs but others in strong linkage disequilibrium with them. In eQTL analysis of METABRIC breast cancer tissue data, CPEB4 was found to be associated with SNP rs17695092, and the same SNP had CPEB4 also as an INQUISIT target gene. Similarly, rs3008979 and TIAL1 had METABRIC association and were a SNP-predicted target gene pair, although the p-values for TIAL1 eQTL association did not survive FDR correction. In addition to these loci with functional data available, other candidate regions were identified containing SNPs with the defined p-value and highly significant FDR for breast cancer risk association (Supplementary Table S5).
The two GABPB1-AS1 targeting SNPs, rs71124350 and rs28489579, are linked (r 2 = 0.8996) and located near each other. According to a database of human enhancers, between the two SNPs lies a GABPB1-AS1 enhancer site (GeneHancer ID GH15I050390). This site is not a direct enhancer of GABPB1, and concordantly rs71124350 and rs28489579 do not have an eQTL association with GABPB1. As the minor alleles of both rs71124350 and rs28489579 are also associated with a small decrease in breast cancer risk ( Table 6), these findings suggest that the decrease in GABPB1-AS1 expression associates with decreased breast cancer risk. GABPB1-AS1 is an lncRNA located in 15q21.2, partially overlapping GABPB1 read from the opposite stand. There are reports of non-coding RNAs and the protein-coding genes they overlap displaying coordinated expression and function, which can be synergistic or antagonistic (39,40). Commonly, the role of antisense RNAs is to bind the sense-oriented mRNA, and thus block its translation. There are no reports on how GABPB1-AS1 affects the expression of GABPB1, but they share common promotor/enhancer loci according to GeneHancer: of the 18 promoter/enhancer regions associated with GABPB1-AS1, nine were also associated with GABPB1. GABPB1 is a transcription factor and an activator of BRCA1 expression (Atlas et al., 2000). If we assume the antisense -sense relationship between GABPB1-AS1 and GABPB1 to be an antagonistic one, it would suggest that GABPB1-AS1 downregulates GABPB1, which in turn would lead to repression of BRCA1. This would be consistent with the results of this study: SNPs associated with reduced GABPB1-AS1 are also associated with reduced breast cancer risk, and this effect could be the result of the increased GABPB1 expression leading to increased BRCA1 expression. However, GABPB1-AS1 was selected for this study based on positive correlation between GABPB1-AS1 and BRCA1 expression. It is possible that the regulatory relationships are more complex than seen here, and the correlation between overall expression levels may not imply causation. Further research is required to clarify the functional interactions between these genes, as at this point, we can only speculate on the functional role of GABPB1-AS1in breast cancer predisposition.
For the other two discovered loci, the regions were included as T-UCR harboring loci but the discovered risk SNPs were associated in eQTL and INQUISIT analyses with protein coding genes: rs17695092 in uc.184 with CPEB4, and rs3009879 in uc.313 with TIAL1. Uc.184 and uc.313 are located in CPEB4 and TIAL1, respectively. However, T-UCR expression is challenging to study, as they do not appear in expression databases. This is at least partly due to the difficulty in separating intragenic T-UCR expression from the expression of its host gene. Mestdagh et al. (2010) found uc.184 expression to be inseparable from CPEB4 expression, while uc.313 expression was found to be independent of TIAL1 expression. However, Mestdagh et al. looked at the expressions in neuroblastoma and the situation in breast tissue is unknown. Nevertheless, uc.187 and uc.313 are likely to play a substantial role in the correct function of their host genes, as such conservation is unlikely to remain intact by chance. Uc.184 and uc.313 are located in the 3 UTR and in an intron, respectively, and alterations in these regions often have a major regulatory effect on the function of a gene (Li and Yuan, 2017;Park et al., 2018). The fidelity of these regions may be essential to the correct function of the CPEB4 and TIAL1.
The 3 UTR of the CPEB4 contains 13 of the 17 SNPs that target CPEB4 in INQUISIT prediction and one of the two SNPs with CPEB4 expression association in METABRIC. T-UCR uc.184 is also located there. CPEB4 is a member of a CPEB family of proteins that bind RNA in a sequence-specific manner, contain two RNA recognition motifs, two zinc fingers and a regulatory N-terminal region (Hake and Richter, 1994;Fernandez-Miranda and Mendez, 2012). CPEBs regulate translation by controlling the polyadenylation of their target genes (Mendez and Richter, 2001;Richter, 2007). There are no previous reports of CPEB4 affecting breast cancer risk, but overexpression of CPEB4 is reported in breast cancer, and the overall survival of patients with high expression of CPEB4 is shorter (Sun et al., 2015;Lu et al., 2017). Ectopic CPEB4 expression has been suggested to promote EMT, migration and invasion of breast cancer cells, while silencing the expression of CPEB4 reduces these events (Lu et al., 2017). Our results imply that CPEB4 may also play a role in the breast cancer development as the intronic SNP rs17695092 associates with both reduced CPEB4 expression, and reduced breast cancer risk ( Table 7). It is to be noted that the METABRIC dataset consists of breast cancer samples, and the effect is not seen in the eQTL analysis in the GTEx dataset of normal mammary tissue samples. The difference could be due to difference in statistical power, as the METABRIC dataset includes over 1300 breast cancer samples, whereas the GTEx dataset is 251 normal breast cancer tissues. It is notable that the cell line used in the CHiA-PET analysis from which the INQUISIT results for rs17695092 were gathered was MCF7, which is a breast cancer cell line. This requires further research, as does the role of the uc.184 in the 3 UTR of the CPEB4. Uc.313 is located in the intron 5 or 6 of the TIAL1, depending on the transcript (and in a single transcript, NM_001323964.1, out of the eleven UCSC annotations of the RefSeq RNAs, it partially overlaps exon six). Of the twelve SNPs that target TIAL1 in INQUISIT prediction, or as METABRIC association, the majority are located downstream of the gene, three are in the TIAL1, all intronic, and one is located upstream of the gene ( Table 8). The SNP with TIAL1 as both INQUISIT target and METABRIC association, rs3009879, is one of the three intronic variants. Rs3009879 does not appear to overlap any regulatory sequence elements (assessed by using Ensembl genome browser 92 and GeneHancer in GeneCards), but as it does target TIAL1 in INQUISIT, a connection discovered by the CHiA-PET method, it suggests involvement in a chromatin interaction. It is worth noting that in METABRIC eQTL analysis, the significance of rs3009879 association withTIAL1 expression was p = 0.0013, but it did not survive FDR correction. Thus, it is also possible that the eQTL association of this variant withTIAL1 is an artefact.
TIAL1 (also known as TIAR), is a ubiquitously expressed RNA binding protein that contains three N-terminal RNA recognition motifs and a C-terminal glutamine-rich prion-like domain, which is found to aggregate during the formation of cytoplasmic stress granules (Dember et al., 1996;Gilks et al., 2004;Kim et al., 2013). TIAL1 is a negative regulator of BRCA1: it is shown to block translation, and at least in chronic myeloid leukemia cells, reduce the protein expression of BRCA1 which leads to aneuploidy, spindle toxin resistance, and genomic instability (Deutsch et al., 2003;Wolanin et al., 2010;Podszywalow-Bartnicka et al., 2014). If TIAL1 has the same effect on BRCA1 protein expression in breast cancer, it is plausible that SNPs that increase TIAL1 expression also increase breast cancer risk, as is the case with rs3009879 (Supplementary Table S9).
Previously, SNPs with genome-wide significant associations (p < 5 × 10 −8 ) with breast cancer risk have been reported in several genomic regions containing lncRNAs Milne et al., 2017). In this study, we aimed to identify additional candidate loci for further studies. We report here putative breast cancer risk SNPs predicted to functionally target GABPB1-AS1 lncRNA, and associating with its expression, as well as SNPs in two genes, CPEB4 and TIAL1, hosting ultraconserved regions, uc.184 and uc.313, respectively. Further research is needed to validate these findings and candidate genes, and elucidate the functional mechanisms involved. In addition, other regions containing SNPs with the defined p-value and highly significant FDR for breast cancer risk association, but currently lacking the functional data, may warrant further studies.

DATA AVAILABILITY STATEMENT
The analyses were based on summary results of the Breast Cancer Association Consortium (BCAC), available online at: http://bcac. ccge.medschl.cam.ac.uk/.

ETHICS STATEMENT
All participating BCAC studies were approved by their appropriate institutional ethics review boards for the initial BCAC study. This study uses only publicly available BCACsummary data, no individual data.

AUTHOR CONTRIBUTIONS
SK and HN designed the study. MS and SK carried out the data and eQTL analyses, wrote the main manuscript text and prepared the figures and the tables. CB provided clinical expertise and critically reviewed the manuscript. JB and GC-T provided the INQUISIT analysis. All authors contributed to and approved the final manuscript.