Impute.me: an open source, non-profit tool for using data from DTC genetic testing to calculate and interpret polygenic risk scores

To date, interpretation of genomic information has focused on single variants conferring disease risk, but most disorders of major public concern have a polygenic architecture. Polygenic risk scores (PRS) give a single measure of disease liability by summarising disease risk across hundreds of thousands of genetic variants. They can be calculated in any genome-wide genotype data-source, using a prediction model based on genome-wide summary statistics from external studies. As genome-wide association studies increase in power, the predictive ability for disease risk will also increase. While PRS are unlikely ever to be fully diagnostic, they may give valuable medical information for risk stratification, prognosis, or treatment response prediction. Public engagement on the potential use and acceptability of PRS is therefore becoming important. However, the current public perception of genetics is that it provides 'Yes/No' answers about the presence/absence of a condition, or the potential for developing a condition, which is not the case for common, complex disorders with polygenic architecture. Meanwhile, unregulated third-party applications are being developed to satisfy consumer demand for information on the impact of lower risk variants on common diseases that are highly polygenic. Often applications report results from single SNPs and disregard effect size, which is highly inappropriate for common, complex disorders where everybody carries risk variants. Tools are therefore needed to communicate our understanding of genetic predisposition as a continuous trait, where a genetic liability confers risk for disease. Impute.me is one such tool, whose focus is on education and information on common, complex disorders with polygenic architecture. Its research-focused open-source website allows users to upload consumer genetics data to obtain PRS, with results reported on a population-level normal distribution. Diseases can only be browsed by ICD10-chapter-location or alphabetically, thus prompting the user to consider genetic risk scores in a medical context of relevance to the individual. Here we present an overview of the implementation of the impute.me site, along with analysis of typical usage-patterns, which may advance public perception of genomic risk and precision medicine.

1 Introduction

In clinical genetics, testing for rare strong-effect causal variants is routinely performed in the healthcare system to confirm a diagnosis, or to evaluate individual risk suspected from anamnestic information [Baig et al 2016], and in such instances the use of genome sequencing is expanding [Byrjalsen et al 2018]. Meanwhile, outside of the healthcare system, direct-to-consumer (DTC) genetics is expanding rapidly, providing the public with access to individual genetic data profiles and to interpretation of common genetic variants derived from genotyping microarrays. This is developing as a sprawling industry of consumer services with widely diverging standards, including third-party genome analysis services. These services typically provide individual results from analysis of single, common SNPs with (at best) weak effects. They are therefore severely misaligned with the current state of the art, which at least for common, complex disease is to use polygenic risk scores (PRS) to estimate the combined risk of

We believe that the goal of the academic genetics community should extend beyond theory. This means engaging with the public and assisting those who seek information, even when it means helping them to interpret their own genomic data. We therefore developed impute.me as an online web-app for analysis and education in personal genetic analysis. The web-app is illustrated in figure 1. A user can download their raw data from any major DTC vendor and then upload it at impute.me.

Uploaded files are checked and formatted according to procedures that have been developed to handle most types of microarray-based consumer genetics data, including an imputation step. The data are then subjected to automated analysis scripts, including polygenic risk score calculations. This covers more than 2000 traits, browsable in different interface types (modules).

Each module is designed with the goal of putting findings in as relevant a context as possible, prompting users to see common variant genetics as a support tool rather than a diagnosis finder. The aim is to provide information as broadly as possible to offer a real alternative to the widespread practice of reporting on weak single SNP genotypes for any trait, even though that entails generation of some reports that are below any sensible threshold for clinical usability. We hope that having this as an open and accessible resource for everyone will be of help to the debate on what exactly constitutes clinical usability beyond high-risk pathogenic variants.

In this paper we describe i) development and setup, ii) validation and testing, iii) evaluation of usage, and iv) future directions for impute.me. In the section Development and Setup we discuss some of the challenges faced when developing a full personal-genome scoring pipeline. The goal of this section is to motivate and explain the choices made in development. In the second section, Validation and Testing, we use public biobank data from individuals who have consented to genetic research to test the effect of the impute.me scores on known disease outcomes. The purpose of this section is to test and validate scores, as well as to investigate consequences of some of the challenges that were raised in the first section. In the third section, Evaluation of Usage, we evaluate usage metrics of impute.me users. The goal of this section is to shed light on behaviour patterns of individuals who use DTC genetics for health questions and offer recommendations that may be of use in other personal-genome scoring pipelines. Finally, in the section Future Directions we discuss our views on future directions, particularly with respect to improving how genetic findings are presented to people.

This is a provisional file, not the final typeset article

The second challenge is calculation of robust PRS estimates that are accurate, irrespective of the source of the data. This is particularly important to an application utilized by people from around the world leveraging data from dozens of different vendors and data types. Importantly, PRS calculated from GWAS of a population of (e.g.) European ancestry, will perform better

The third challenge is presentation. For a single rare large-effect variant, such as for the pathogenic variants in the BRCA genes conferring very high risk of cancers (odds-ratio >10; Figure 2A

Validation and testing

To evaluate pipelines on individuals with known disease outcomes, we investigated 242 samples from the CommonMind data set. The CommonMind data set includes patients with schizophrenia, patients with bipolar disorder, and controls, of European ancestry and of African ancestry. For each disorder and each ancestry group the full impute.me pipelines were applied, including imputation and PRS calculation. Additionally, SNP-sets corresponding to each of three major DTC companies were extracted and the PRS re-calculated. This was done to test the hypothesis that PRS calculation in mixed SNP-sets poses particular challenges with regard to missing SNPs. Such sets of genotyped SNPs, which differ from sample to sample, are an unavoidable consequence of working with online data uploads.

We found that disease prediction strength, measured as variability explained, corresponded well to theoretical

Of importance to this, we found that PRS prediction in mixed samples of non-imputed data causes severe problems. When training PRS algorithms, a SNP-set is pre-specified. The pipelines evaluated here were trained with HapMap3 as SNP-set. Similar choices are made in other published PRS. However, such SNP-sets may not match the SNPs available in downloadable raw data from DTC vendors. We therefore tested what prediction strength would be possible when using raw data directly from DTC vendors, both in a uniform setting (e.g. "all individuals use 23andme v4 data") and in a mixed setting (e.g. "individuals have data from different vendors"). We found that in the uniform setting roughly half the predictive strength remained when using genotype data that is not imputed to match the HapMap3 SNP-set (figure 3, row 2 and 4). In the mixed setting, virtually no predictive strength remained (figure 3, row 3 and 6). The mixed setting is the reality that is faced, both by third-party analytical services and by DTC vendors with different chip versions. Imputation is therefore likely to be an essential requirement in such scenarios.

To compare these findings with approaches that look at one SNP at a time, we extracted the SNPedia/Promethease SNPs that were indicated as associated with schizophrenia [Cariaso et al 2011]. All cases (n=25) and all controls (n=39) had at least one risk variant from at least one of the 139 SNPs indicated as schizophrenia-associated. When focusing on SNPs that had the SNPedia/Promethease-defined "magnitude" level (sic) at > 1.5, we found that 80% of the SCZ cases (20 of 25) had at least one SNPedia/Promethease risk variant. Among the healthy controls 84% (33 of 39) had at least one such risk variant (p=0.9 for difference in proportions). In other words, it is not very predictive to know if you have a schizophrenia SNP. This illustrates the importance of considering more than one SNP at a time.

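The comparison of carrier proportions between cases (20 of 25) and controls (33 of 39) can be checked with a short calculation. The text does not state which test produced the reported p-value; the sketch below assumes a 2x2 chi-square test with Yates continuity correction, and the function name `yates_chi2_pvalue` is our own.

```python
import math


def yates_chi2_pvalue(a, b, c, d):
    """P-value of a 2x2 chi-square test with Yates continuity correction.

    Table layout: [[a, b], [c, d]], here rows = cases/controls and
    columns = carriers/non-carriers of at least one risk variant.
    Assumes all expected cell counts are non-zero.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            # Yates correction: shrink |observed - expected| by 0.5, not below zero.
            diff = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            chi2 += diff * diff / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Counts from the text: 20 of 25 cases and 33 of 39 controls carry a risk variant.
p_value = yates_chi2_pvalue(20, 5, 33, 6)
print(f"p-value for difference in proportions: {p_value:.2f}")
```

A p-value this close to 1 means that carrier status for these single SNPs does not separate cases from controls in this sample.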
Finally, we compared pipeline reproducibility using two genome-data files, one obtained from MyHeritage and one from Ancestry.com, but sampled from the same person. After processing through the impute.me pipelines the correlation between PRS values over 1468 traits was r=0.933 between the two samples. Traits that showed discrepancy between the two data files typically were based on only a few SNPs, of which one did not meet imputation quality thresholds for one of the data files.

Finally, we have observed that usage of health genetic data surprisingly often is not just a test-and-forget event. When plotting query count as a function of time from first data access, we find an expected pattern of intense browsing in the hours and days after first data access (Figure 4D). However, many users re-visit their data even months and years after first data access, perhaps implying that results are considered and saved and then re-visited at a later time in a different context.

figure 1C). Currently we have registered the SNP-heritability for 294 of the reported traits, available as an experimental option called 'plot heritability'. We believe that a main future direction is to experiment and expand on how to best communicate this to

$$\text{Zero-centered-score} = \sum_{\text{SNP}} \left( \beta_{\text{SNP}} \cdot \text{Effect-allele-count}_{\text{SNP}} - \text{Population-score}_{\text{SNP}} \right)$$

$$\text{Z-score} = \frac{\text{Zero-centered-score}}{\text{Standard-deviation}_{\text{population}}}$$

where $\beta_{\text{SNP}}$ (or log(odds ratio)) is the reported effect size for the SNP effect allele, $\text{frequency}_{\text{SNP}}$ is the allele frequency for the effect allele, and $\text{Effect-allele-count}_{\text{SNP}}$ is the allele count from genotype data (0, 1, or 2).

In the all-SNP calculation, the scaling is similar, but done empirically, i.e. based on previous impute.me users of matching ethnicity. This mode of scaling is also available as an optional functionality in the top-SNP calculations, and generally seems to match well with the default 1000 Genomes super-population scaling.
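The top-SNP scaling described above can be illustrated with a minimal sketch. Two parts are assumptions, because their definitions are not shown in this text: the population score of a SNP is taken as its expected contribution, 2 × frequency × beta, and the population standard deviation is derived for independent SNPs in Hardy-Weinberg equilibrium. The function names and the dictionary keys are hypothetical.

```python
import math


def zero_centered_score(snps):
    """Sum of per-SNP contributions minus their assumed population expectation.

    Each SNP is a dict with hypothetical keys:
      beta      - reported effect size (or log odds ratio) of the effect allele
      frequency - effect-allele frequency in the reference population
      count     - effect-allele count in the genotype (0, 1 or 2)
    """
    total = 0.0
    for snp in snps:
        # Assumption: expected contribution of this SNP in the population.
        population_score = 2.0 * snp["frequency"] * snp["beta"]
        total += snp["beta"] * snp["count"] - population_score
    return total


def assumed_population_sd(snps):
    """Assumed SD of the score: independent SNPs in Hardy-Weinberg equilibrium."""
    variance = sum(
        2.0 * s["frequency"] * (1.0 - s["frequency"]) * s["beta"] ** 2 for s in snps
    )
    return math.sqrt(variance)


def z_score(snps):
    """Zero-centred score divided by the assumed population standard deviation."""
    return zero_centered_score(snps) / assumed_population_sd(snps)


# Illustrative values only, not taken from any real GWAS.
example_snps = [
    {"beta": 0.2, "frequency": 0.3, "count": 2},
    {"beta": -0.1, "frequency": 0.5, "count": 1},
]
example_z = z_score(example_snps)
```

Under these assumptions, a genotype whose allele counts equal the population average gives a Z-score of zero, which is what allows results to be placed on a population-level normal distribution.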
The all-SNP scores were derived using weightings from the LDpred algorithm [Vilhjálmsson et al 2015]. This algorithm adjusts the effect of each SNP allele for those of other SNP alleles in linkage disequilibrium (LD) with it, and also takes into account the likelihood of a given allele to have a true effect according to a user-defined parameter, which here was taken as wt1, i.e. the full set of SNPs. The algorithm was directed to use HapMap3 SNPs that had a minor allele frequency >0.05, Hardy-Weinberg equilibrium P>1e-05 and genotype yield >0.95, consistent with our expectation that these would be the best imputed SNPs after full pipeline processing.
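The three SNP inclusion thresholds can be expressed as a simple filter. This is a sketch only: it assumes the per-SNP statistics have already been computed elsewhere, and the `SnpStats` record, the `passes_qc` function and the SNP identifiers are hypothetical names, not part of LDpred or the impute.me code.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Pre-computed quality statistics for one SNP (hypothetical record type)."""
    snp_id: str
    minor_allele_frequency: float
    hwe_p_value: float
    genotype_yield: float


def passes_qc(snp, min_maf=0.05, min_hwe_p=1e-5, min_yield=0.95):
    """Apply the thresholds from the text: MAF > 0.05, HWE P > 1e-05, yield > 0.95."""
    return (
        snp.minor_allele_frequency > min_maf
        and snp.hwe_p_value > min_hwe_p
        and snp.genotype_yield > min_yield
    )


# Illustrative records only; the identifiers are placeholders.
candidates = [
    SnpStats("snp_a", 0.21, 0.40, 0.99),   # passes all three thresholds
    SnpStats("snp_b", 0.02, 0.40, 0.99),   # fails: minor allele frequency too low
    SnpStats("snp_c", 0.21, 1e-7, 0.99),   # fails: deviates from Hardy-Weinberg
    SnpStats("snp_d", 0.21, 0.40, 0.90),   # fails: genotype yield too low
]
kept = [snp.snp_id for snp in candidates if passes_qc(snp)]
```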

Pipeline Testing

The CommonMind genotypes measured with the microarray of the type H1M were downloaded along with phenotypic information. Each sample was processed through the impute.me pipelines, using the batch-upload functionality. Reported ethnicity was compared with pipeline (genotype) assigned ethnicity and found to be concordant. After pipeline completion, we extracted three PRS for each sample, corresponding to SCZ all-SNP,

of customers from different DTC vendors, with distributions re-drawn 100 times. We estimated the predictive ability of the PRS using Nagelkerke's R² and AUC.
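The two predictive-ability measures named above can be sketched as follows. This is not the analysis code behind the reported results: AUC is computed by direct pairwise comparison of cases and controls, and Nagelkerke's R² is computed from log-likelihoods, where the log-likelihood of the fitted logistic model is assumed to come from a separate model fit. All function names are our own.

```python
import math


def auc(scores, labels):
    """Probability that a random case scores above a random control (ties count half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def null_log_likelihood(labels):
    """Log-likelihood of an intercept-only model (requires both cases and controls)."""
    n = len(labels)
    n_cases = sum(labels)
    p = n_cases / n
    return n_cases * math.log(p) + (n - n_cases) * math.log(1.0 - p)


def nagelkerke_r2(ll_null, ll_model, n):
    """Cox-Snell R² rescaled by its maximum attainable value."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Illustrative scores and case/control labels, not real data.
example_scores = [0.1, 0.4, 0.35, 0.8]
example_labels = [0, 0, 1, 1]
example_auc = auc(example_scores, example_labels)
example_ll_null = null_log_likelihood(example_labels)
```

A model that fits no better than the intercept-only model gives a Nagelkerke's R² of 0, while a model with log-likelihood 0 (perfect prediction) gives 1.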

Usage evaluation

A log data freeze was performed 2019-06-08, by making a copy of all usage log files and then removing the uniqueID of each user. This was done to prevent it from being linked with the genetic data of that user. The exception was the publicly available permanent test user with ID id_613z86871, which was lifted out before analysis and is not included in other summary statistics. Generally, a user corresponds to an uploaded genome with a unique md5sum. Click-through rates were calculated as fraction of users that performed any query in the module in question, e.g. the precision medicine module was only launched in September 2018 and therefore only counts clicks from people who have used it. Plots were generated using base-R version 3.4.2 and Cytoscape version 3.71.

The manuscript describes an approach for obtaining polygenic risk scores (PRS) for any individual using direct-to-consumer genetics. PRS are considered state of the art for genetic prediction in common, complex diseases, and the availability of such methods therefore has the potential to expand the usage of genetics beyond rare Mendelian disorders. At the same time, the usage and interpretation of direct-to-consumer genetics is by many considered a controversial issue that needs more regulatory oversight.