<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">818574</article-id>
<article-id pub-id-type="doi">10.3389/fgene.2022.818574</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Technology and Code</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Calculating Polygenic Risk Scores (PRS) in UK Biobank: A Practical Guide for Epidemiologists</article-title>
<alt-title alt-title-type="left-running-head">Collister et&#x20;al.</alt-title>
<alt-title alt-title-type="right-running-head">Calculating PRS in UK Biobank</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Collister</surname>
<given-names>Jennifer A.</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1514650/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Xiaonan</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1565412/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Clifton</surname>
<given-names>Lei</given-names>
</name>
</contrib>
</contrib-group>
<aff>
<institution>Nuffield Department of Population Health</institution>, <institution>University of Oxford</institution>, <addr-line>Oxford</addr-line>, <country>United&#x20;Kingdom</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1010183/overview">Hugues Aschard</ext-link>, Institut Pasteur, France</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/47090/overview">Vincent Frouin</ext-link>, Neurospin, France</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/24184/overview">Wei-Min Chen</ext-link>, University of Virginia, United&#x20;States</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Jennifer A. Collister, Jennifer.collister@ndph.ox.ac.uk</corresp>
<fn fn-type="other">
<p>This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>02</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>818574</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>11</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>01</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Collister, Liu and Clifton.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Collister, Liu and Clifton</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>A polygenic risk score estimates the genetic risk of an individual for some disease or trait, calculated by aggregating the effect of many common variants associated with the condition. With the increasing availability of genetic data in large cohort studies such as the UK Biobank, inclusion of this genetic risk as a covariate in statistical analyses is becoming more widespread. Previously this required specialist knowledge, but as tooling and data availability have improved it has become more feasible for statisticians and epidemiologists to calculate existing scores themselves for use in analyses. While tutorial resources exist for conducting genome-wide association studies and generating new polygenic risk scores, fewer guides exist for the simple calculation and application of existing genetic scores. This guide outlines the key steps of this process: selection of suitable polygenic risk scores from the literature, extraction of relevant genetic variants and verification of their quality, calculation of the risk score and key considerations of its inclusion in statistical models, using the UK Biobank imputed data as a model dataset. Many of the techniques in this guide will generalize to other datasets; however, we also focus on some of the specific techniques required for using data in the formats UK Biobank have selected. This includes some of the challenges faced when working with large numbers of variants, where the computation time required by some tools is impractical. While we have focused on only a couple of tools, which may not be the best ones for every given aspect of the process, one barrier to working with genetic data is the sheer volume of tools available, and the difficulty for a novice to assess their viability. By discussing in depth a couple of tools that are adequate for the calculation even at large scale, we hope to make polygenic risk scores more accessible to a wider range of researchers.</p>
</abstract>
<kwd-group>
<kwd>polygenic risk score</kwd>
<kwd>UK Biobank</kwd>
<kwd>genetic risk score</kwd>
<kwd>worked example</kwd>
<kwd>polygenic score</kwd>
</kwd-group>
<contract-sponsor id="cn001">Oxford University<named-content content-type="fundref-id">10.13039/501100000769</named-content>
</contract-sponsor>
<contract-sponsor id="cn002">Cancer Research UK<named-content content-type="fundref-id">10.13039/501100000289</named-content>
</contract-sponsor>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>A polygenic risk score (PRS), sometimes called polygenic score (PGS) or genetic risk score (GRS), is an estimate of an individual&#x2019;s genetic risk for some trait, obtained by aggregating and quantifying the effect of many common variants (usually defined as minor allele frequency &#x2265;1%) in the genome, each of which can have a small effect on a person&#x2019;s genetic risk for a given disease or condition. A PRS is typically constructed as the weighted sum of a collection of genetic variants, usually single nucleotide polymorphisms (SNPs) defined as single base-pair variations from the reference genome. The resulting score is approximately normally distributed in the general population, with higher scores indicating higher risk (<xref ref-type="fig" rid="F1">Figure&#x20;1</xref>).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Distribution of a polygenic risk score for breast cancer among individuals with and without registry-verified breast cancer events in UK Biobank (4,789 cases). Score used was 313-SNP PRS from (<xref ref-type="bibr" rid="B30">Mavaddat et&#x20;al., 2019</xref>).</p>
</caption>
<graphic xlink:href="fgene-13-818574-g001.tif"/>
</fig>
<p>The basic equation for the PRS of an individual <inline-formula id="inf1">
<mml:math id="m1">
<mml:mi>j</mml:mi>
</mml:math>
</inline-formula>&#x20;is:</p>
<p>Eq. (1): Standard equation to calculate a weighted polygenic risk score<disp-formula id="equ1">
<mml:math id="m2">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>R</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mi>i</mml:mi>
<mml:mi>N</mml:mi>
</mml:munderover>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2217;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</disp-formula>where N is the number of SNPs in the score, <inline-formula id="inf2">
<mml:math id="m3">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the effect size (or beta) of variant <inline-formula id="inf3">
<mml:math id="m4">
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula> and <inline-formula id="inf4">
<mml:math id="m5">
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the number of copies of the effect allele of SNP <inline-formula id="inf5">
<mml:math id="m6">
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula> in the genotype of individual&#x20;<inline-formula id="inf6">
<mml:math id="m7">
<mml:mi>j</mml:mi>
</mml:math>
</inline-formula>.</p>
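<p>Eq. (1) can be illustrated with a minimal sketch for a single individual, assuming the effect sizes and dosages are already held in two lists in the same SNP order. The function name and the toy values are hypothetical and are not taken from any published score.</p>
<preformat>
```python
def polygenic_risk_score(betas, dosages):
    """Weighted sum of Eq. (1) for one individual.

    betas   -- effect size of each SNP in the score
    dosages -- copies of the effect allele carried by the individual,
               in the same SNP order as betas (may be fractional
               for imputed data)
    """
    if len(betas) != len(dosages):
        raise ValueError("need exactly one dosage per effect size")
    return sum(beta * dosage for beta, dosage in zip(betas, dosages))


# Hypothetical three-SNP score and one individual's dosages.
toy_betas = [0.5, -0.25, 1.0]
toy_dosages = [0, 1, 2]
toy_score = polygenic_risk_score(toy_betas, toy_dosages)  # 1.75
```
</preformat>
<p>In practice the same sum is evaluated for every individual in the dataset, which is why dedicated tools are preferred once the number of SNPs or samples becomes large.</p>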
<p>The effect sizes, or betas, are often obtained from a genome-wide association study (GWAS) known as the &#x201c;base&#x201d; data (see <xref ref-type="table" rid="T1">Table&#x20;1</xref>: Glossary), wherein each genetic marker in turn is tested for association with the trait/disease of interest, and effect sizes are estimated.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Glossary.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Term</th>
<th align="center">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Allele</td>
<td align="left">An alternative form of a genetic variant</td>
</tr>
<tr>
<td align="left">Alternate id</td>
<td align="left">In the UK Biobank multi-allelic SNPs are represented as multiple SNPs with different alleles but the same rsID and same position on the chromosome. In order to have a unique identifier for each SNP, an &#x201c;alternate_id&#x201d; was created that is typically the rsID, chr:pos or Affymetrix identifier followed by the reference and alternate alleles</td>
</tr>
<tr>
<td align="left">Base data</td>
<td align="left">Typically GWAS summary statistics containing SNP identifiers, risk alleles and effect sizes</td>
</tr>
<tr>
<td align="left">Genome Build</td>
<td align="left">The genome build is a common &#x201c;reference genome&#x201d; developed by combining the sequences most commonly observed across available individual genomes to create a representative genome against which individual genomes can be compared</td>
</tr>
<tr>
<td align="left">Genotype data</td>
<td align="left">Genotyping is the identification of the genetic variants in the DNA of an individual. This is typically done using arrays or chips, which contain probes that target specific locations in the DNA. These locations contain known variants of interest&#x2014;so genotyping is good at identifying which known variants a person has, but not at finding new variants</td>
</tr>
<tr>
<td align="left">Genotype Imputation</td>
<td align="left">Genotype imputation uses statistical inference with a reference panel to estimate genotypes at locations that were not directly genotyped</td>
</tr>
<tr>
<td align="left">Heritability</td>
<td align="left">Heritability is the proportion of observable (phenotypic) variation among individuals of a population that is due to genetic variation between the individuals</td>
</tr>
<tr>
<td align="left">Linkage Disequilibrium (LD)</td>
<td align="left">Linkage disequilibrium (LD) is a measure of the correlation between neighbouring genetic variants that are more likely to be inherited together because of their physical proximity, leading to association within a population</td>
</tr>
<tr>
<td align="left">Locus</td>
<td align="left">Physical location of a gene or DNA polymorphism on a chromosome (plural &#x201c;loci&#x201d;)</td>
</tr>
<tr>
<td align="left">Multi-allelic SNPs</td>
<td align="left">When there is more than one possible variant nucleotide (in addition to the reference) at a location, then we say this location is &#x201c;multi-allelic&#x201d;</td>
</tr>
<tr>
<td align="left">Next generation sequencing</td>
<td align="left">Sequencing enables the exact sequence of bases in a length of DNA to be determined. This technique can be used on targeted areas such as the exome, although it is becoming increasingly cost effective to do whole genome sequencing</td>
</tr>
<tr>
<td align="left">Phenotype</td>
<td align="left">The phenotype of an organism is its observable characteristics, for example its physical appearance</td>
</tr>
<tr>
<td rowspan="2" align="left">rsID</td>
<td align="left">The rsID for a SNP is the unique RefSNP ID number identifying the &#x201c;reference SNP cluster&#x201d; containing this SNP in dbSNP. This cluster contains all SNPs that map to the same location on the genome</td>
</tr>
<tr>
<td align="left">Since genome assemblies are still a work in progress, occasionally there will be changes that alter our understanding of where a refSNP is located, so that it may co-locate with another existing refSNP. In these cases, the higher refSNP number is retired and all SNPs are reassigned to the refSNP with the lower number</td>
</tr>
<tr>
<td align="left">Single Nucleotide Polymorphism (SNP)</td>
<td align="left">A single nucleotide polymorphism (or single nucleotide variant) is a location on the genome where a single DNA nucleotide that differs from that in the reference genome has been identified</td>
</tr>
<tr>
<td align="left">Target data</td>
<td align="left">The data in which the PRS is developed, using effect sizes from the base data. Multiple PRS may be calculated, using different thresholds for association, and the one with best performance is selected</td>
</tr>
<tr>
<td align="left">Validation data</td>
<td align="left">The data in which the PRS is calculated and used in analyses. These analyses may validate the association between the PRS and the trait of interest</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In more advanced methods of PRS development, &#x201c;target&#x201d; data might be used to tune parameters or perform model selection (<xref ref-type="bibr" rid="B26">Ma and Zhou, 2021</xref>). These approaches include the construction of multiple PRS based on different threshold values for SNP association with the trait of interest, the shrinkage of betas, and adjustment for linkage disequilibrium using techniques such as pruning and clumping (<xref ref-type="bibr" rid="B7">Choi et&#x20;al., 2020</xref>).</p>
<p>Once a PRS has been developed, it is important for the association between the PRS and the trait of interest to be replicated in an independent sample, referred to as &#x201c;validation&#x201d; data. This is done to guard against overfitting, which can lead to inflated estimates of association. The PRS can then be calculated in other datasets and used for a wide range of analyses (<xref ref-type="bibr" rid="B24">Lewis and Vassos, 2020</xref>; <xref ref-type="bibr" rid="B46">Wray et&#x20;al., 2021</xref>).</p>
<p>There is particular interest in adding PRS to existing risk prediction models (<xref ref-type="bibr" rid="B11">Elliott et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B18">Inouye et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B21">Lee et&#x20;al., 2019</xref>; <xref ref-type="bibr" rid="B40">Sun et&#x20;al., 2021</xref>), which could allow them to be incorporated into clinical guidelines, enabling clinicians to identify individuals who may be at higher risk of a given condition, or who may benefit from more aggressive treatment to manage the condition.</p>
<p>There has also been increasing use of PRS in Mendelian Randomisation to establish the causal effect of risk factors on clinical outcomes, mainly due to simplicity of use, increased power and avoidance of weak instrument bias (<xref ref-type="bibr" rid="B31">MV et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B13">Gajendragadkar et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B47">Zekavat et&#x20;al., 2021</xref>).</p>
<p>As increasingly many PRS are developed, initiatives such as the Polygenic Score Catalog<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref> and Cancer PRS-Web<xref ref-type="fn" rid="fn2">
<sup>2</sup>
</xref> have begun to host and curate the metadata required to calculate the scores, making them more accessible for future research (<xref ref-type="bibr" rid="B12">Fritsche et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B20">Lambert et&#x20;al., 2021</xref>). Despite this, it seems more common for new scores to be developed, offering only minimal improvements in population level risk prediction, than for existing scores to be used in further analyses.</p>
<p>In this paper, we outline the necessary considerations when selecting an existing PRS from the literature for use in new analyses, including discussion of the information required for the calculation to be reproducible. We provide a step-by-step walkthrough of how to calculate an existing PRS in an independent dataset, from extracting SNPs to the necessary quality control (QC) checks that should be performed prior to calculating the PRS. We focus in particular on imputed data, using UK Biobank v3 imputed data (March 2018) as an example, and we consider only SNPs on autosomes.</p>
<p>After discussing the various steps required to obtain and calculate a PRS, we present a worked example using a PRS for LDL-Cholesterol (LDL-C) and a brief discussion of the statistical considerations when including a PRS in a model. Detailed code examples are provided in the online materials<xref ref-type="fn" rid="fn3">
<sup>3</sup>
</xref> on GitHub, along with notes on technical considerations.</p>
</sec>
<sec id="s2">
<title>2 Materials and Methods</title>
<sec id="s2-1">
<title>2.1 Software Considerations</title>
<p>Genetic data can be stored in a range of different formats, and due to the large size of the data it is often compressed to save space, resulting in files that are not directly human-readable and require dedicated software tools or packages. Many such genetic software tools are designed to run on Linux, and in this paper we will assume access to a Linux system with adequate storage space for the&#x20;data.</p>
<p>Our example data, the UK Biobank v3 imputed data, is made available in BGEN v1.2 format (<xref ref-type="bibr" rid="B2">Band and Marchini, 2018</xref>), which is the format output by the IMPUTE imputation software (<xref ref-type="bibr" rid="B28">Marchini and Howie, 2010</xref>). There is a range of software tools that can be used to read and manipulate this data, and the choice between them depends on a combination of computation time, software compatibility and personal preference. In this paper we will focus on three: bgenix,<xref ref-type="fn" rid="fn4">
<sup>4</sup>
</xref> QCTOOL v2<xref ref-type="fn" rid="fn5">
<sup>5</sup>
</xref> and PLINK 2<xref ref-type="fn" rid="fn6">
<sup>6</sup>
</xref> (<xref ref-type="bibr" rid="B5">Chang et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B2">Band and Marchini, 2018</xref>), summarized in <xref ref-type="table" rid="T2">Table&#x20;2</xref>.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Comparison between genetic software for various usages.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th colspan="3" align="center">Genetic software</th>
</tr>
<tr>
<th align="left">Usage</th>
<th align="center">bgenix</th>
<th align="center">QCTOOL</th>
<th align="center">PLINK</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Extract SNPs</td>
<td align="left">Yes, very quickly, although can only specify up to 9,980 SNPs by chromosome and position identifier</td>
<td align="left">Yes, and has useful wildcard feature to extract from all chromosome files in one step, but slow</td>
<td align="left">Yes, have to extract per chromosome, slow for BGEN data as it has to auto-convert the entire file not just the required SNPs</td>
</tr>
<tr>
<td align="left">Conduct QC</td>
<td align="left">No</td>
<td align="left">Yes, it computes summary statistics but filtering has to be done in a separate step, and with additional tools (such as awk or R)</td>
<td align="left">Yes, fast, it can compute summary statistics and apply filtering. Not all commands are suitable for use on imputed data</td>
</tr>
<tr>
<td align="left">Compute PRS</td>
<td align="left">No</td>
<td align="left">Yes but poorly documented</td>
<td align="left">Yes, with many options</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Bgenix is a utility that was developed alongside the BGEN file format to index and retrieve subsets from the .bgen data files. The accompanying cat-bgen utility can be used to concatenate BGEN&#x20;files.</p>
<p>QCTOOL v2 was the tool used by UK Biobank to generate the minor allele frequency and imputation information metrics released alongside the imputed data. It can be used to produce per-SNP and per-sample summary statistics, and perform filtering of the dataset. However, it can be slow to run for larger datasets.</p>
<p>A more scalable alternative is PLINK 2 (<xref ref-type="bibr" rid="B5">Chang et&#x20;al., 2015</xref>), which we recommend for the routine quality control (QC) process described in this paper. A selection of PLINK 2 commands useful for such QC is summarized in <xref ref-type="table" rid="T3">Table&#x20;3</xref>. While PLINK 1.9 has a similar feature set and could also be used, it does not directly support the BGEN v1.2 file format, and so an interim conversion step would be required.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>PLINK 2 commands for summary statistics and filtering.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Function</th>
<th rowspan="2" align="center">Summary statistics</th>
<th colspan="2" align="center">As exclusion criteria</th>
</tr>
<tr>
<th align="center">Option</th>
<th align="center">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Allele frequency</td>
<td align="left">
<ext-link ext-link-type="uri" xlink:href="https://www.cog-genomics.org/plink/2.0/basic_stats">--freq</ext-link>
</td>
<td align="left">--maf [threshold]</td>
<td align="left">Include SNPs with MAF above [threshold] (default &#x3d; 0.01)</td>
</tr>
<tr>
<td align="left">SNP call rate</td>
<td align="left">
<ext-link ext-link-type="uri" xlink:href="https://www.cog-genomics.org/plink/2.0/basic_stats">--missing</ext-link>
</td>
<td align="left">--geno [threshold]</td>
<td align="left">Exclude SNPs with missing call rates exceeding the [threshold] (default &#x3d; 0.1)</td>
</tr>
<tr>
<td align="left">Filter SNPs</td>
<td align="left"/>
<td align="left">--exclude [file]</td>
<td align="left">Exclude SNPs listed in [file]</td>
</tr>
<tr>
<td align="left">Filter samples</td>
<td align="left"/>
<td align="left">--keep [file]</td>
<td align="left">Retains only the samples listed in [file], all others are excluded</td>
</tr>
<tr>
<td align="left">HWE</td>
<td align="left">
<ext-link ext-link-type="uri" xlink:href="https://www.cog-genomics.org/plink/2.0/basic_stats">--hardy</ext-link>
</td>
<td align="left">--hwe [threshold]</td>
<td align="left">Exclude SNPs with <italic>p</italic>-values below [threshold]</td>
</tr>
<tr>
<td align="left">Linkage Disequilibrium (LD)</td>
<td align="left">
<ext-link ext-link-type="uri" xlink:href="https://www.cog-genomics.org/plink/1.9/ld">--r2</ext-link>&#x2a;</td>
<td align="left">
<ext-link ext-link-type="uri" xlink:href="https://www.cog-genomics.org/plink/2.0/ld">--indep-pairwise</ext-link> [window][step][threshold]</td>
<td align="left">Pruning within a sliding [window], moving across the genome in steps of size [step] and filtering out any SNPs with LD r<sup>2</sup> higher than [threshold]</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>&#x2a; Command in PLINK 1.9.</p>
</fn>
</table-wrap-foot>
</table-wrap>
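<p>To make the allele frequency criterion in Table&#x20;3 concrete, the following minimal sketch computes a minor allele frequency from per-sample dosages of one SNP and compares it with a threshold. It assumes the dosages count a fixed allele on a 0 to 2 scale with no missing values. The function name and toy data are hypothetical, and the sketch is not a description of how PLINK 2 implements its filter.</p>
<preformat>
```python
def minor_allele_frequency(dosages):
    """Minor allele frequency of one SNP from per-sample dosages (0 to 2)."""
    if not dosages:
        raise ValueError("need at least one sample")
    # Each sample carries two alleles, so divide by twice the sample count.
    counted_allele_frequency = sum(dosages) / (2.0 * len(dosages))
    return min(counted_allele_frequency, 1.0 - counted_allele_frequency)


# Hypothetical dosages for four samples at a single SNP.
toy_snp_dosages = [2, 2, 2, 1]
toy_maf = minor_allele_frequency(toy_snp_dosages)  # 0.125
passes_filter = toy_maf >= 0.01  # threshold chosen for illustration only
```
</preformat>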
<p>In this paper we demonstrate the actual calculation of the PRS in PLINK 2, but it is numerically straightforward and can be computed in any scripting language such as R if sufficient computer memory is available. Dedicated PRS tools like PRSice-2 (<xref ref-type="bibr" rid="B8">Choi and O&#x2019;Reilly, 2019</xref>) can also be used, but these were designed for those wishing to develop a new PRS from scratch, offering more complex functionalities and assuming a level of domain expertise that may be off-putting for a beginner/casual&#x20;user.</p>
</sec>
<sec id="s2-2">
<title>2.2 Choosing a Polygenic Risk Score</title>
<p>In order to include a polygenic risk score in analyses, the first step is to select an existing PRS for the phenotypic trait or outcome of interest. PRS are sometimes made available in the supplementary materials of the papers where they are derived, but are increasingly being made available in online repositories such as the PGS Catalog (<xref ref-type="bibr" rid="B20">Lambert et&#x20;al., 2021</xref>), which improve discoverability with the intention of improving the reproducibility of genetic research.</p>
<sec id="s2-2-1">
<title>2.2.1 Outcome</title>
<p>The research objective is the first consideration when choosing a PRS. Since any given PRS is associated with a single phenotypic trait (e.g., height, blood pressure) or medical condition/outcome (e.g., breast cancer), when choosing a PRS for use in analysis it is important to select a score that has been derived for an appropriate trait or condition.</p>
<p>When attempting to replicate (or validate) the association found between a given PRS and a trait/outcome, it is important to understand exactly how this trait/outcome was defined in the development of the PRS, as it will need to be defined as similarly as possible within the validation dataset. For measured traits (e.g., cholesterol), attention must typically be paid to units (e.g., mg/dL or mmol/L) and to whether adjustments have been made for subgroups (e.g., correcting cholesterol for statin users) in order to produce reliable results.</p>
<p>An alternative objective could be to investigate whether a PRS for a trait (for example a measured biomarker such as cholesterol) is associated with an outcome linked with that trait (such as heart disease).</p>
</sec>
<sec id="s2-2-2">
<title>2.2.2 Performance</title>
<p>When going to the trouble of including a PRS in analyses, ideally it should be one that provides as much additional information as possible. The performance of a PRS can be measured in a variety of ways (for example, one could consider the risk ratio for the outcome of interest between the top and bottom percentiles of the PRS), and its stated performance should be evaluated in the context of the research&#x20;goals.</p>
<p>Metrics commonly used to evaluate a PRS include the pseudo-R<sup>2</sup>, which indicates the amount of phenotypic variance explained by the PRS (<xref ref-type="bibr" rid="B22">Lee et&#x20;al., 2012</xref>), the Brier score, and the area under the ROC curve (AUC). Some PRS repositories are starting to make this information available alongside the scores to facilitate comparison (<xref ref-type="bibr" rid="B12">Fritsche et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B3">Becker et&#x20;al., 2021</xref>).</p>
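<p>As an illustration of one of these metrics, the sketch below computes the AUC as the proportion of case-control pairs in which the case has the higher score, counting ties as one half. The function name and toy scores are hypothetical, and the pairwise loop is only practical for small examples. For a cohort the size of UK Biobank an established statistical package would be the more sensible choice.</p>
<preformat>
```python
def area_under_roc_curve(case_scores, control_scores):
    """AUC as the proportion of case-control pairs ranked correctly."""
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    favourable_pairs = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                favourable_pairs += 1.0   # case ranked above control
            elif case == control:
                favourable_pairs += 0.5   # tie counts as half
    return favourable_pairs / (len(case_scores) * len(control_scores))


# Hypothetical PRS values for three cases and four controls.
toy_case_scores = [1.2, 0.4, 2.0]
toy_control_scores = [0.1, 0.4, -0.5, 1.5]
# 9.5 of the 12 case-control pairs favour the case.
toy_auc = area_under_roc_curve(toy_case_scores, toy_control_scores)
```
</preformat>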
<p>Larger base/target datasets give more power to detect association of SNPs with the trait of interest, and have been shown to yield scores with higher predictive capability (<xref ref-type="bibr" rid="B23">Lello et&#x20;al., 2019</xref>). In addition, it has been found that aggregating SNPs that are not themselves associated with a trait at a statistically significant <italic>p</italic>-value threshold can still result in a significantly associated score (<xref ref-type="bibr" rid="B1">Agerbo et&#x20;al., 2015</xref>), meaning that PRS are getting larger&#x2014;some contain hundreds of thousands, or even millions of SNPs. While a large PRS including many SNPs contains more information and is likely to have better performance than a smaller PRS, there are diminishing returns here and access to computational resources may impose a practical limit on the size of PRS&#x20;used.</p>
<p>PRS will perform best in populations of the same ancestry as those in which they were derived (<xref ref-type="bibr" rid="B10">Duncan et&#x20;al., 2019</xref>). This is particularly important if the analysis data contains primarily non-White individuals, as although there is ongoing effort to increase the diversity of genetic data, currently most available PRS are for individuals of White ethnicity. If the analysis population contains a mixture of ancestries we recommend a sensitivity analysis in a subpopulation with genetic ancestry as similar as possible to that in which the PRS was derived.</p>
</sec>
<sec id="s2-2-3">
<title>2.2.3 Technical Considerations</title>
<p>It is important to avoid sample overlap between the data in which the PRS was developed (base and target), and the data in which the PRS will be used in analyses. If the same individuals are present across these datasets this can inflate the observed association between the PRS and the trait of interest&#x2014;this can also occur if the datasets contain closely related individuals.</p>
<p>Since it may not be possible to access raw genetic data from the base/target datasets to check for duplicate or related individuals directly, we recommend that the datasets in which potential scores were developed are reviewed in order to select one where there are unlikely to be duplicated or related individuals in the intended validation&#x20;data.</p>
<p>Finally, if the genomic positions in the GWAS where the SNPs were identified were not assigned on the same genomic build as the intended analysis data then additional software tools, such as LiftOver (<xref ref-type="bibr" rid="B16">Hinrichs et&#x20;al., 2006</xref>), may be required to standardise&#x20;this.</p>
</sec>
<sec id="s2-2-4">
<title>2.2.4 Information Needed From the Original Polygenic Risk Score</title>
<p>At a minimum, the information needed to replicate a PRS is:<list list-type="simple">
<list-item>
<p>&#x2022; The list of SNPs included in the score. These may be given as &#x201c;dbSNP Reference SNP numbers&#x201d; (refSNP or rsID), or as base-pair positions on a chromosome.</p>
</list-item>
<list-item>
<p>&#x2022; The effect (and preferably also the non-effect) allele for each&#x20;SNP.</p>
</list-item>
<list-item>
<p>&#x2022; The effect size (weighting) for each SNP for the condition of interest.</p>
</list-item>
<list-item>
<p>&#x2022; The genome&#x20;build.</p>
</list-item>
</list>
</p>
<p>These could be the raw results from a GWAS filtered to SNPs of interest, or may have had further PRS development techniques applied.</p>
<p>The effect size may be given as a beta (weighting) or as an Odds Ratio (OR) or Hazard Ratio (HR), depending on the original analysis and how the authors chose to present the score. It is important to understand the form the weights are provided in to know if any transformation is necessary, and how to interpret the resulting PRS&#x2014;for example, OR and HR will need to be log-transformed to obtain the weights for use in the PRS calculation.</p>
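<p>The log transformation described above can be sketched as follows. This is a minimal example assuming the published weights are odds ratios or hazard ratios on their natural scale. The function name and toy values are hypothetical.</p>
<preformat>
```python
import math


def weight_from_ratio(ratio):
    """Convert an odds ratio or hazard ratio to a beta by taking the natural log."""
    if not ratio > 0:
        raise ValueError("a ratio must be strictly positive")
    return math.log(ratio)


# Hypothetical odds ratios for three SNPs.
toy_odds_ratios = [1.0, 2.0, 0.5]
toy_weights = [weight_from_ratio(ratio) for ratio in toy_odds_ratios]
# A ratio of 1 maps to a weight of 0; ratios of 2 and 0.5 give weights
# of equal size and opposite sign.
```
</preformat>
<p>The sign of the resulting weight shows whether the effect allele increases or decreases risk, which is worth checking against the reported effect allele before the score is calculated.</p>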
<p>Sometimes additional information such as the effect allele frequency (EAF) is also provided. Ensuring that the allele frequencies in the validation data are consistent with those observed in the base/target data is a good check to perform when such data are available, and it can give greater confidence when dealing with ambiguous SNPs. We will discuss this further in <xref ref-type="sec" rid="s2-4">Section&#x20;2.4</xref>.</p>
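<p>A check of this kind might look like the sketch below, which flags strand-ambiguous SNPs (allele pairs A/T or C/G, which read the same on both strands) and SNPs whose reported and observed effect allele frequencies differ by more than a tolerance. The function names and the tolerance of 0.1 are illustrative assumptions, not recommended values.</p>
<preformat>
```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(effect_allele, other_allele):
    """True when the two alleles are complementary bases (A/T or C/G)."""
    return COMPLEMENT.get(effect_allele.upper()) == other_allele.upper()


def eaf_differs(reported_eaf, observed_eaf, tolerance=0.1):
    """True when the two effect allele frequencies differ by more than tolerance."""
    return abs(reported_eaf - observed_eaf) > tolerance


# Hypothetical SNP: alleles A/T, reported EAF 0.30, observed EAF 0.70.
# It is ambiguous and its frequencies disagree, so it needs manual review.
needs_review = is_strand_ambiguous("A", "T") and eaf_differs(0.30, 0.70)
```
</preformat>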
<p>Online repositories such as the PGS Catalog may provide a schema<xref ref-type="fn" rid="fn7">
<sup>7</sup>
</xref> detailing the possible columns of information available about each score, and will have ensured uniform headings across scores.</p>
</sec>
</sec>
<sec id="s2-3">
<title>2.3 Extracting SNPs</title>
<p>As we mentioned briefly in <xref ref-type="sec" rid="s2-1">Section 2.1</xref>, the data we are discussing in this paper is UKB v3 imputed data, which contains &#x223c;93M autosomal variants for &#x223c;500,000 samples. The data is made available in BGEN v1.2 files, a binary version of the &#x201c;Oxford&#x201d; .gen and .sample file format, where trios of genotype probabilities for each SNP are stored in the .bgen file with a corresponding .bgen.bgi index file, and data about the individuals is stored in a .sample file providing participant IDs unique to each application. The genetic data is split by chromosome in files ranging from 40 to 200&#xa0;GB.</p>
<p>When choosing which software tool to use to extract specific SNPs from the bulk genetic data, two main considerations are speed and compatibility with the data format. While PLINK 2 has support for BGEN v1.2 format, in order to extract a given list of SNPs, it will first auto-convert the entire data file to PLINK 2 binary format (.pgen, .pvar, .psam). This can be time-consuming considering the large size of UKB imputed data and is not lossless&#x2014;PLINK 2 collapses the trios of raw genotype probabilities into single dosages according to a given threshold value (see <xref ref-type="sec" rid="s2-6-2">Section 2.6.2</xref> for more information).</p>
<p>For this reason we recommend bgenix, which was designed for use on BGEN format data and makes use of an SQLite index file (.bgen.bgi) to quickly filter the required SNPs from the raw UKB imputed data files. One current limitation of bgenix is that, while any number of SNPs can be specified by rsID, it is only possible to specify up to 9,980 distinct SNPs by chromosome and position in one command.</p>
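<p>One way to work within the limit on chromosome and position specifications is to split a long list into batches. The Python sketch below only builds the argument lists and does not run them. The <monospace>-g</monospace> and <monospace>-incl-range</monospace> options and the <monospace>chr:pos-pos</monospace> range format are assumed from the bgenix documentation and should be checked against the installed version. The file name and positions are hypothetical.</p>
<preformat><![CDATA[
```python
MAX_RANGES = 9980  # per-command limit on chr:pos specifications noted above

def bgenix_commands(bgen_path, positions, max_ranges=MAX_RANGES):
    """Build one bgenix argument list per batch of (chrom, pos) pairs.

    Each position is written as a single-base range "chrom:pos-pos".
    The commands are only constructed here, not executed.
    """
    ranges = [f"{chrom}:{pos}-{pos}" for chrom, pos in positions]
    commands = []
    for start in range(0, len(ranges), max_ranges):
        batch = ranges[start:start + max_ranges]
        commands.append(["bgenix", "-g", bgen_path, "-incl-range", *batch])
    return commands

# Hypothetical example: 25,000 positions on one chromosome need three commands.
positions = [("22", 16_000_000 + i) for i in range(25_000)]
commands = bgenix_commands("chr22.bgen", positions)
```
]]></preformat>
<p>The output of each command would then be written to its own file and the resulting files combined before calculating the&#x20;score.</p>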
<p>Due to differences in genotyping arrays, sometimes some of the SNPs included in the PRS may not be available in the validation data. In this case, it is important to report what proportion were available&#x2014;and if a high proportion are missing it may be worth looking for proxies or considering a different&#x20;PRS.</p>
</sec>
<sec id="s2-4">
<title>2.4 Aligning SNPs Between Base and Validation Data</title>
<p>We have previously mentioned that it is important to be aware of the genome build used in both the validation data and in the data within which the PRS was developed. There are a few other differences that are possible between genetic datasets: they could have been typed using different genotyping platforms or arrays, with different strand orientations, or imputed using different software&#x20;tools.</p>
<p>All of these things can result in slight differences in the way each SNP is labelled and presented, and it is important to ensure that the correct variants have been identified for inclusion in the&#x20;PRS.</p>
<sec id="s2-4-1">
<title>2.4.1 Strand-Flipping</title>
<p>Since the betas of our PRS are an estimate of the effect of one allele (the &#x201c;effect&#x201d; or &#x201c;risk&#x201d; allele) of the SNP compared to the other (&#x201c;non-effect&#x201d; allele), it is important that the dosages we calculate are the number of copies of that effect allele. However, the alleles of any given SNP are not always given consistently between datasets. We illustrate five different situations in <xref ref-type="table" rid="T4">Table&#x20;4</xref>, and describe the methods needed to align or &#x201c;harmonise&#x201d; the&#x20;data.</p>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Five examples of possible disagreements between PRS and validation data, when data harmonisation may be required. We illustrate five different situations in the table: Perfect agreement, labelling disagreement, strand flip, strand flip and labelling disagreement, palindromic (ambiguous) SNP.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left"/>
<th rowspan="2" align="left"/>
<th colspan="2" align="center">PRS summary data file</th>
<th colspan="2" align="center">Validation data</th>
</tr>
<tr>
<th align="center">Effect allele</th>
<th align="center">Non-effect allele</th>
<th align="center">Effect allele</th>
<th align="center">Non-effect allele</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">1</td>
<td align="left">Expected scenario - perfect agreement</td>
<td align="center">A</td>
<td align="center">C</td>
<td align="center">A</td>
<td align="center">C</td>
</tr>
<tr>
<td align="left">2</td>
<td align="left">PRS and validation data disagree on labelling of effect allele</td>
<td align="center">A</td>
<td align="center">C</td>
<td align="center">C</td>
<td align="center">A</td>
</tr>
<tr>
<td align="left">3</td>
<td align="left">&#x201c;Strand flip&#x201d;</td>
<td align="center">A</td>
<td align="center">C</td>
<td align="center">T</td>
<td align="center">G</td>
</tr>
<tr>
<td align="left">4</td>
<td align="left">Strand flip and labelling disagreement</td>
<td align="center">A</td>
<td align="center">C</td>
<td align="center">G</td>
<td align="center">T</td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">Palindromic</td>
<td align="center">A</td>
<td align="center">T</td>
<td align="center">T</td>
<td align="center">A</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>One convention is for the less frequently occurring allele (minor allele) to be considered the effect allele, since it is a change from the population norm; under this labelling, an effect allele could be inversely associated with the condition of interest. An alternative approach is to label the alleles that increase risk of a condition as the effect alleles. Where two datasets have taken different approaches to this labelling, or when the less frequent allele changes between populations, the labels could be inverted between our datasets (see Row 2, <xref ref-type="table" rid="T4">Table&#x20;4</xref>).</p>
<p>When the effect and non-effect allele are inverted between datasets then this can be resolved automatically by some software (e.g., PLINK 2) or manually by relabelling the effect and non-effect allele in the PRS summary data, and inverting the effect size accordingly (since the effect size is the additive effect of each copy of the effect allele compared to the baseline of homozygous non-effect allele, we would multiply by -1 to obtain the inverse effect size).</p>
<p>A more complex situation arises when the datasets were genotyped using different DNA strand conventions. Although recent GWAS reports are almost always in reference to the forward strand as a consequence of imputation to a common reference panel, this is not always the case, and we may need to ensure that our datasets are harmonised prior to analyses (<xref ref-type="bibr" rid="B15">Hartwig et&#x20;al., 2016</xref>).</p>
<p>If one dataset was genotyped in reference to the forward strand and the other in reference to the backward strand, then the &#x201c;backward&#x201d; data would list the nucleotides that pair with the bases on the forward strand. Any instance of &#x201c;A&#x201d; on the forward strand would be &#x201c;T&#x201d; on the backward, &#x201c;C&#x201d; on forward would be &#x201c;G&#x201d; on backward and vice versa (see Rows 3 and 4, <xref ref-type="table" rid="T4">Table&#x20;4</xref>).</p>
<p>Some software (e.g., PRSice-2) can handle strand flips automatically; for others (e.g., PLINK 2) these will need to be identified and resolved manually.</p>
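<p>The scenarios in <xref ref-type="table" rid="T4">Table&#x20;4</xref> can be expressed as a small classification routine. The Python sketch below is illustrative only and is not how any particular software tool implements harmonisation. It assumes single-nucleotide alleles coded A, C, G or T, and the function name and return labels are our own. The returned multiplier is applied to the PRS weight so that the weight refers to the effect allele as labelled in the validation&#x20;data.</p>
<preformat><![CDATA[
```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def harmonise(prs_effect, prs_other, val_effect, val_other):
    """Classify one SNP following the rows of Table 4.

    Returns (status, multiplier); multiplier is applied to the PRS weight
    so that it refers to the validation data's effect allele, or None when
    the SNP cannot be aligned from allele codes alone.
    """
    prs = (prs_effect, prs_other)
    val = (val_effect, val_other)
    flipped = (COMPLEMENT[val_effect], COMPLEMENT[val_other])

    if COMPLEMENT[prs_effect] == prs_other:
        return "palindromic", None               # row 5: ambiguous
    if prs == val:
        return "match", 1                        # row 1
    if prs == val[::-1]:
        return "swapped", -1                     # row 2
    if prs == flipped:
        return "strand flip", 1                  # row 3
    if prs == flipped[::-1]:
        return "strand flip and swapped", -1     # row 4
    return "mismatch", None                      # alleles cannot be reconciled
```
]]></preformat>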
</sec>
<sec id="s2-4-2">
<title>2.4.2 Ambiguous SNPs</title>
<p>Ambiguity arises when the SNP is palindromic (i.e.,&#x20;its alleles are nucleotides that pair with each other in a DNA molecule, such as A/T, see Row 5, <xref ref-type="table" rid="T4">Table&#x20;4</xref>). If the effect allele frequencies (EAFs) from the base data are available then we can compare them to the frequencies in our data and identify the alleles accordingly, but when the EAFs are close to 50% we cannot tell whether the effect and non-effect allele have been inverted, or whether the DNA strand is flipped. In these cases, or when allele frequencies in the base data are not available, then we cannot be certain about applying our weighting in the correct direction and should therefore exclude the&#x20;SNP.</p>
<p>In PLINK 2, this can be achieved by first computing EAFs using the <monospace>--freq</monospace> command, then filtering the output (.afreq) using awk to get a list of ambiguous SNPs [e.g., palindromic SNPs with EAF between 40 and 60% (<xref ref-type="bibr" rid="B6">Chen et&#x20;al., 2018</xref>)]. Finally, the PLINK 2 command <monospace>--exclude</monospace> can be used to filter out the listed&#x20;SNPs.</p>
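<p>The same rule can be sketched in Python on a table of alleles and frequencies that has already been read into memory. The record layout, SNP identifiers and function name below are hypothetical, and the 40 to 60% window is the example range quoted&#x20;above.</p>
<preformat><![CDATA[
```python
PALINDROMIC = {frozenset("AT"), frozenset("CG")}

def is_ambiguous(allele1, allele2, eaf=None, lower=0.40, upper=0.60):
    """Flag a SNP whose strand cannot be resolved.

    A palindromic SNP is ambiguous when no effect allele frequency is
    available, or when the frequency lies between `lower` and `upper`.
    """
    if frozenset((allele1, allele2)) not in PALINDROMIC:
        return False
    if eaf is None:
        return True
    return lower <= eaf <= upper

# Hypothetical records: (SNP ID, allele 1, allele 2, effect allele frequency)
snps = [
    ("snp1", "A", "T", 0.48),   # palindromic, frequency near 50%: exclude
    ("snp2", "A", "T", 0.12),   # palindromic, but frequency is informative
    ("snp3", "A", "C", 0.50),   # not palindromic
    ("snp4", "C", "G", None),   # palindromic, no frequency available
]
to_exclude = [snp_id for snp_id, a1, a2, eaf in snps if is_ambiguous(a1, a2, eaf)]
```
]]></preformat>
<p>The resulting list of identifiers could be written to a text file, one per line, for use as an exclusion&#x20;list.</p>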
</sec>
<sec id="s2-4-3">
<title>2.4.3&#x20;Multi-Allelic SNPs</title>
<p>Multi-allelic SNPs have multiple possible alternate alleles for one reference allele, and these can be represented and identified in different ways in different data formats. In the UKB imputed data, these multi-allelic SNPs have been stored as a series of bi-allelic variants, sharing the same rsID and chromosome position and with the same listed reference allele but different alternate alleles.</p>
<p>The rsID and &#x201c;chr:pos&#x201d; identifiers are therefore not sufficient to uniquely identify one SNP, and allele information must be incorporated. This can be important during SNP extraction and PRS calculation, since we wish to ensure that we are including the correct alleles in our PRS calculation. In addition, many software tools require a unique identifier for each SNP. We discuss this further in the <ext-link ext-link-type="uri" xlink:href="https://2cjenn.github.io/PRS_Pipeline/">Online Materials</ext-link>.</p>
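<p>A minimal Python sketch of building allele-aware identifiers is shown below. The identifier format (<monospace>chr:pos_ref_alt</monospace>), the rsIDs and the positions are hypothetical choices for illustration. Any format that combines position and both alleles would serve the same&#x20;purpose.</p>
<preformat><![CDATA[
```python
from collections import Counter

def unique_id(chrom, pos, ref, alt):
    """Combine position and alleles into one identifier, e.g. '1:1000_A_G'."""
    return f"{chrom}:{pos}_{ref}_{alt}"

# Hypothetical tri-allelic site stored as two bi-allelic records with one rsID.
records = [
    ("rs0001", "1", 1000, "A", "G"),
    ("rs0001", "1", 1000, "A", "T"),
    ("rs0002", "1", 2000, "C", "T"),
]

# rsIDs that appear more than once mark multi-allelic sites.
rsid_counts = Counter(rsid for rsid, *_ in records)
multi_allelic = sorted(rsid for rsid, n in rsid_counts.items() if n > 1)
ids = [unique_id(chrom, pos, ref, alt) for _, chrom, pos, ref, alt in records]
```
]]></preformat>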
</sec>
<sec id="s2-4-4">
<title>2.4.4 Compare Allele Frequencies</title>
<p>When the source data for the PRS makes effect allele frequencies available, then a good check is to compare the frequencies of these alleles in the validation data. This can be helpful not only for dealing with palindromic SNPs but also as a general sanity&#x20;check.</p>
<p>While allele frequencies are unlikely to be identical between datasets, as the population will contain a different group of individuals and may be of different ancestries, it is reassuring if the frequencies are similar.</p>
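<p>As an illustration of this check, the Python sketch below flags SNPs whose effect allele frequency differs between datasets by more than a chosen tolerance. The tolerance of 0.10, the SNP identifiers and the frequencies are arbitrary assumptions. An appropriate tolerance will depend on the populations being&#x20;compared.</p>
<preformat><![CDATA[
```python
def frequency_discrepancies(base_eaf, validation_eaf, tolerance=0.10):
    """Return SNP IDs whose effect allele frequency differs by more than
    `tolerance` between the two datasets, with the absolute difference."""
    flagged = {}
    for snp_id, base_freq in base_eaf.items():
        if snp_id not in validation_eaf:
            continue  # SNP unavailable in the validation data
        diff = abs(base_freq - validation_eaf[snp_id])
        if diff > tolerance:
            flagged[snp_id] = diff
    return flagged

# Hypothetical frequencies; snp2 looks as though its alleles may be inverted
# (0.20 in one dataset against 0.81 in the other, roughly 1 - 0.20).
base = {"snp1": 0.31, "snp2": 0.20, "snp3": 0.45}
validation = {"snp1": 0.29, "snp2": 0.81}
flagged = frequency_discrepancies(base, validation)
```
]]></preformat>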
</sec>
</sec>
<sec id="s2-5">
<title>2.5 Quality Control</title>
<p>When using an existing PRS, it is important to first ensure that it is of good quality and is appropriate for the analysis data. Errors in genotype data can have many causes, including mix-ups or contamination of the samples, and malfunctions of the genotype probes. Without removing these errors, the resulting analyses may have reduced power and validity.</p>
<p>There are a range of quality control considerations for genetic data that aim to identify and exclude potential data errors. In this section we will discuss these checks and indicate which may be relevant when calculating an existing PRS, outlined in <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>. The threshold values for many of these checks can be arbitrary and will vary depending on the purpose of the analysis, but we will give some examples from the literature.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Summary of quality control and alignment&#x20;steps.</p>
</caption>
<graphic xlink:href="fgene-13-818574-g002.tif"/>
</fig>
<p>The authors who developed the PRS should have provided documentation detailing the quality control (QC) performed on the base and target data, and being able to identify the steps taken is useful for determining if the PRS is suitable for the intended analyses. Since PRS are normally derived from GWAS summary statistics, the data will most likely have been subject to the typical GWAS QC checks, described in detail elsewhere (<xref ref-type="bibr" rid="B36">Reed et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B29">Marees et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B7">Choi et&#x20;al., 2020</xref>).</p>
<p>Both the genetic variants included in the analysis (SNPs) and the individuals in the analysis population (samples) should have undergone these quality checks. A standard process could involve filtering at the SNP level first, followed by sample level filtering, and finally filtering SNPs based on Hardy-Weinberg equilibrium (HWE), as suggested by (<xref ref-type="bibr" rid="B36">Reed et&#x20;al., 2015</xref>). The rationale for this is that HWE can be influenced by the population structure of the sample, and we will discuss this further in <xref ref-type="sec" rid="s2-5-3">Section 2.5.3</xref>. Alternatively, sometimes SNP and sample filtering are iteratively applied with increasingly stringent thresholds (<xref ref-type="bibr" rid="B29">Marees et&#x20;al., 2018</xref>).</p>
<p>In the case of imputed genotyping data these QC checks are typically performed on the directly called data prior to imputation, which means both that the imputation is conducted using high quality data, and that any lower quality data that was excluded may then be imputed. After imputation, the quality of each imputed variant is calculated, and those that were poorly imputed may then be excluded from further analyses. When using data that has already been imputed it may still be worth running further checks on the data, for example to use more stringent thresholds than were applied prior to imputation, depending on the intended analysis.</p>
<p>The focus of our discussion, the UK Biobank data, was genotyped by Affymetrix, who only provided genotype calls for SNPs and samples that satisfied their QC<xref ref-type="fn" rid="fn8">
<sup>8</sup>
</xref>. UK Biobank then applied a QC pipeline designed to accommodate both the large-scale, diverse population and the broad range of research questions the data would be used for, and made summary statistics available in the Data Showcase to facilitate further QC by researchers (<xref ref-type="bibr" rid="B4">Bycroft et&#x20;al., 2018</xref>). These include variant-level statistics computed in QCTOOL for the imputed data (&#x201c;Imputation MAF &#x2b; info&#x201d; files<xref ref-type="fn" rid="fn9">
<sup>9</sup>
</xref>) and downloadable variables (Category 100313, Genotyping process and sample QC<xref ref-type="fn" rid="fn10">
<sup>10</sup>
</xref>) which indicate lower quality samples.</p>
<sec id="s2-5-1">
<title>2.5.1 SNP QC</title>
<p>The SNP QC required during the development of a PRS is described in detail elsewhere (<xref ref-type="bibr" rid="B36">Reed et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B29">Marees et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B7">Choi et&#x20;al., 2020</xref>), but we provide a brief overview to give a rough understanding of the rationale behind each&#x20;check.</p>
<p>It is also important to ensure that the SNPs required for our chosen PRS are of sufficient quality in our intended analysis data. Any variants that were poorly genotyped in this data may warrant exclusion so they do not compromise the power of the score. We indicate which quality control metrics should be inspected when calculating an existing PRS, with examples for some that may be of situational interest.</p>
<sec id="s2-5-1-1">
<title>2.5.1.1 Linkage Disequilibrium</title>
<p>Linkage disequilibrium (LD) is a measure of the correlation between neighbouring genetic variants that are more likely to be inherited together because of their physical proximity, leading to association within a population. As in classic statistical modelling, multicollinearity can lead to problems with the model, and so any SNPs in high LD will typically have been identified and removed during the development of the PRS by methods such as &#x201c;pruning&#x201d; or &#x201c;clumping&#x201d; (<xref ref-type="bibr" rid="B35">Priv&#xe9; et&#x20;al., 2019</xref>).</p>
<p>Since patterns of LD may vary among populations, particularly those of different ancestries, it may be of interest to verify that the SNPs in the PRS remain independent in the analysis data (<xref ref-type="bibr" rid="B37">Sawyer et&#x20;al., 2005</xref>).</p>
<p>In addition, when calculating a PRS for a condition such as Alzheimer&#x2019;s disease that has established high-risk variants (APOE e4), one may wish to exclude such variants from the polygenic score in order to consider them separately in the statistical modelling. In this case, we advise checking that no variants in the score are in LD with the high risk variant(s).</p>
<p>Details of how to investigate and filter on LD statistics using PLINK 2 can be found in the appendix of our online materials.<xref ref-type="fn" rid="fn11">
<sup>11</sup>
</xref>
</p>
</sec>
<sec id="s2-5-1-2">
<title>2.5.1.2 Imputation Information</title>
<p>Genotype imputation is the estimation of missing genotype calls by statistical inference. Increasingly, imputation is being used not only to fill in missing data caused by genotyping errors, but also to estimate the genotypes of variants that were not directly assayed, in order to increase the number of SNPs available in the&#x20;data.</p>
<p>The &#x201c;imputation information&#x201d; statistic is a measure of imputation quality which typically takes values between 0 and 1, where 0 indicates complete uncertainty and 1 represents complete certainty about the imputed genotype. Depending on the software used, there are a few different information metrics that can be used to assess the quality of imputed data, but they are generally highly correlated (<xref ref-type="bibr" rid="B28">Marchini and Howie, 2010</xref>).</p>
<p>The UK Biobank carried out imputation on the genotype data using SHAPEIT3 and IMPUTE4 to statistically infer the genotypes of variants that had not been directly called in the genotyping array, and those which were missing or had been set to missing in central UKB quality control. They used QCTOOL (-snp-stats) to calculate the imputation information, and made it available to researchers in the &#x201c;MAF &#x2b; Info&#x201d; files (UKB Resource 1967<xref ref-type="fn" rid="fn12">
<sup>12</sup>
</xref>). Bycroft et&#x20;al. advise that &#x201c;<italic>An information score of &#x3b1; in a sample of M individuals indicates that the amount of data at the imputed marker is approximately equivalent to a set of perfectly observed genotype data in a sample size of &#x3b1;M</italic>&#x201d; and note that an information measure of 0.3 should yield good power to detect association given the large sample size of UKB (<xref ref-type="bibr" rid="B4">Bycroft et&#x20;al., 2018</xref>).</p>
<p>If the PRS was developed on imputed data then the authors will normally have set a threshold imputation information score at which SNPs were eligible for inclusion. However, a variant that was well imputed in the base/target data may have been poorly imputed in the intended analysis data, so it is worth checking that all imputed SNPs in the score are of good quality.</p>
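<p>The interpretation of the information score quoted above, and a simple threshold check, can be sketched in Python as follows. The SNP identifiers and scores are hypothetical, and the default threshold of 0.3 is simply the value mentioned above. A more stringent value can be passed where the analysis calls for&#x20;it.</p>
<preformat><![CDATA[
```python
def effective_sample_size(info, n_samples):
    """Approximate number of perfectly observed genotypes equivalent to an
    imputed marker with information score `info` in `n_samples` people."""
    return info * n_samples

def poorly_imputed(info_scores, threshold=0.3):
    """Return the IDs of score SNPs whose information falls below `threshold`."""
    return sorted(snp for snp, info in info_scores.items() if info < threshold)

# Hypothetical information scores for three SNPs in a score.
info_scores = {"snp1": 0.99, "snp2": 0.85, "snp3": 0.22}
low_quality = poorly_imputed(info_scores)
n_equivalent = effective_sample_size(0.3, 500_000)
```
]]></preformat>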
</sec>
<sec id="s2-5-1-3">
<title>2.5.1.3 Minor Allele Frequency</title>
<p>For a given SNP, the allele which is most common in the population is known as the &#x201c;major&#x201d; allele and the less common allele(s) are &#x201c;minor.&#x201d; The minor allele frequency (MAF) indicates how rare a variant is&#x2014;typically a minor allele with frequency &#x3e;5% is considered &#x201c;common&#x201d; while those between 1 and 5% are &#x201c;low frequency&#x201d; and MAF &#x3c;1% is said to be &#x201c;rare.&#x201d;</p>
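<p>These frequency bands can be written as a short Python helper. This is a sketch using the cut-offs given above. The function names are our own, and the treatment of values exactly at 1% or 5% is an assumption.</p>
<preformat><![CDATA[
```python
def minor_allele_frequency(allele_freq):
    """Frequency of the less common allele at a bi-allelic SNP."""
    return min(allele_freq, 1 - allele_freq)

def frequency_class(maf):
    """Label a variant using the conventional 1% and 5% cut-offs."""
    if maf > 0.05:
        return "common"
    if maf >= 0.01:
        return "low frequency"
    return "rare"
```
]]></preformat>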
<p>If the frequency of the minor allele of a SNP is too low then we will not have adequate power to make meaningful statistical statements. Similarly when using imputed genotyping data, the imputation information of a SNP is likely to be correlated with its MAF, since there is less power available for imputing rare&#x20;SNPs.</p>
<p>It is therefore common for SNPs with MAF below a certain threshold to have been excluded during GWAS and the development of PRS. The threshold for such exclusion varies depending on the aims of the original analysis and the size of the dataset - larger datasets give more power, and allow for the analysis of rarer variants.</p>
<p>Note however that the allele frequency is dependent on the population under study - for example some alleles are more common in individuals of particular ancestry. It is possible some SNPs will be rarer in the intended analysis data than in the data where the PRS was developed, in which case a decision must be made on whether to include&#x20;them.</p>
</sec>
<sec id="s2-5-1-4">
<title>2.5.1.4 SNP Call Rate</title>
<p>The call rate for a SNP is the proportion of individuals with non-missing data for that SNP. If a SNP has a low call rate then it may have been poorly assayed, and including it may result in spurious data (<xref ref-type="bibr" rid="B43">Turner et&#x20;al., 2011</xref>). SNPs with a low call rate are therefore often excluded.</p>
<p>In the case of imputed genotype data, any assayed SNPs with call rate below a chosen threshold are generally considered poor quality and excluded prior to imputation. These excluded SNPs may then have their genotypes imputed, along with any missing calls in other SNPs, resulting in a complete data&#x20;set.</p>
</sec>
</sec>
<sec id="s2-5-2">
<title>2.5.2 Sample QC</title>
<p>The word &#x201c;sample&#x201d; in this context refers to the individuals whose genetic data we are working with (like sample size in statistics). As with the genetic variants, the goal is to make sure that all individuals included in the study have high quality data, and the criteria considered during the calculation of the PRS are typically those used in&#x20;GWAS.</p>
<p>When calculating an existing PRS, the QC again depends on the aims of the analysis. If it is an association analysis for example, evaluating the strength of association between the PRS and some trait or outcome of interest, then the focus is on the data at a population level, and exclusion of related individuals and restriction to a single ethnic group may be desirable, or included in sensitivity analyses. Alternatively, if the goal is to model how the PRS would perform if incorporated into clinical guidance, perhaps simulating a theoretical intervention to be offered at a given risk threshold, then one might wish to calculate the PRS for all individuals except those for whom there is reason to believe there were errors in genotyping.</p>
<p>Within the UK Biobank data, QC was performed to identify a subset of high quality, unrelated samples for use in the calculation of principal components. The details of the principal components analysis (PCA) are beyond the scope of this paper, and are described elsewhere (<xref ref-type="bibr" rid="B4">Bycroft et&#x20;al., 2018</xref>). In short, UK Biobank used them to supplement the ethnic groups self-reported by participants and identify a group of individuals considered to be genetically of &#x201c;White British ancestry.&#x201d; This White British ancestry subset is made available to researchers in UKB Data Field 22006.<xref ref-type="fn" rid="fn13">
<sup>13</sup>
</xref>
</p>
<p>In addition, a directly downloadable variable (UKB Data Field 22020<xref ref-type="fn" rid="fn14">
<sup>14</sup>
</xref>) is provided which indicates whether a participant&#x2019;s genetic data met the quality control checks required to be used in the calculation of these principal components (<xref ref-type="bibr" rid="B4">Bycroft et&#x20;al., 2018</xref>). These checks comprised:<list list-type="simple">
<list-item>
<p>&#x2022; Exclude individuals who were outliers for heterozygosity or missing&#x20;rates.</p>
</list-item>
<list-item>
<p>&#x2022; Exclude individuals with a missing rate &#x3e;0.02 on autosomes.</p>
</list-item>
<list-item>
<p>&#x2022; Exclude individuals with sex discordance (between the phenotypic and genetically inferred sex), or for whom genetic sex could not be determined.</p>
</list-item>
<list-item>
<p>&#x2022; Exclude individuals who are not in a maximal set of unrelated individuals up to 3rd degree.</p>
</list-item>
</list>
</p>
<p>We will go through the rationale for each of these exclusions in the following sections.</p>
<sec id="s2-5-2-1">
<title>2.5.2.1 Heterozygosity</title>
<p>Heterozygosity is when an individual has two different alleles at a locus&#x2014;an individual with the same allele on both chromosomes is homozygous at that locus. Heterozygosity is typically higher in individuals from mixed ethnic backgrounds, and lower in individuals whose parents are closely related. Extreme heterozygosity can indicate poor sample quality, and thus outliers are typically excluded.</p>
<p>The UK Biobank has performed central checks and identified individuals with extreme heterozygosity that is not explained by ancestry. These outlying individuals, alongside those who were outliers for missing data (see &#x201c;Sample call rate&#x201d;), are listed in UKB Data Field 22027.<xref ref-type="fn" rid="fn15">
<sup>15</sup>
</xref>
</p>
</sec>
<sec id="s2-5-2-2">
<title>2.5.2.2 Sample Call Rate</title>
<p>The sample call rate is defined as the proportion of SNPs with non-missing data for this sample. This is analogous to the SNP call rate, but for individuals instead of SNPs. Individuals with a low call rate have a high proportion of missing genetic data, which could indicate poor quality.</p>
<p>In the UK Biobank central checks, individuals who were outliers for missingness prior to imputation were identified. These individuals, along with those who were outliers for heterozygosity are listed in UKB Data Field 22027.</p>
</sec>
<sec id="s2-5-2-3">
<title>2.5.2.3 Sex Discordance</title>
<p>Sex discordance occurs when the sex inferred from the X and Y chromosomes does not match that reported by the participant. Although it could be due to gender reassignment or sex-chromosome aneuploidy, it could also indicate unreliable data, and individuals with sex discordance are therefore generally excluded. The genetically determined sex of individuals in UK Biobank is made available in UKB Data Field 22001<xref ref-type="fn" rid="fn16">
<sup>16</sup>
</xref> and can be compared to the gender reported at baseline, UKB Data Field&#x20;31<xref ref-type="fn" rid="fn17">
<sup>17</sup>
</xref>.</p>
</sec>
<sec id="s2-5-2-4">
<title>2.5.2.4 Relatedness</title>
<p>If the data contains participants who are closely related then their genomes would be more similar than those of unrelated individuals, which can lead to biased estimations in population-level analyses. In the UK Biobank, kinship coefficients were estimated for all pairs of individuals using KING software (<xref ref-type="bibr" rid="B27">Manichaikul et&#x20;al., 2010</xref>), and a rough categorisation of relatedness is available in UKB Data Field 22021.</p>
<p>When excluding related individuals, note that only <italic>n</italic>-1 from every cluster of <italic>n</italic> related individuals need to be removed in order for the remaining population to be unrelated. The UK Biobank Data Field 22020 restricts to a maximal subset of unrelated (to the 3rd degree) individuals who were not sex discordant or outliers for missingness or heterozygosity. This is the subset of participants that was used by UK Biobank to calculate the genetic principal components, and the algorithm by which they were selected is discussed in detail elsewhere (<xref ref-type="bibr" rid="B4">Bycroft et&#x20;al., 2018</xref>).</p>
<p>Note that while for many analyses the subset identified by UK Biobank is adequate and convenient, it did not take disease status into account when removing related individuals. For rare outcomes it may be advisable to construct a new maximal unrelated subpopulation that preferentially retains individuals with the condition of interest.</p>
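<p>One simple way to construct such a subset is a greedy removal procedure, sketched below in Python. This is an illustrative heuristic of our own and not the algorithm used by UK Biobank. It is not guaranteed to return the largest possible unrelated set, and the participant identifiers and related pairs are hypothetical.</p>
<preformat><![CDATA[
```python
def unrelated_subset(individuals, related_pairs, cases=frozenset()):
    """Greedy sketch: repeatedly drop one individual who still has relatives,
    dropping non-cases before cases and, within each group, those with the
    most remaining relatives, until no related pair is left.

    This is a heuristic and is not guaranteed to give the largest possible set.
    """
    relatives = {person: set() for person in individuals}
    for a, b in related_pairs:
        relatives[a].add(b)
        relatives[b].add(a)

    kept = set(individuals)
    while True:
        related = [p for p in kept if relatives[p]]
        if not related:
            return kept
        # Non-cases rank first; then most relatives; then ID for determinism.
        drop = max(related, key=lambda p: (p not in cases, len(relatives[p]), str(p)))
        for other in relatives.pop(drop):
            relatives[other].discard(drop)
        kept.discard(drop)

# Hypothetical data: a related trio containing one case, a related pair,
# and one individual with no relatives in the data.
people = ["a", "b", "c", "d", "e", "f"]
pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
kept = unrelated_subset(people, pairs, cases={"b"})
```
]]></preformat>
<p>In this example two of the trio and one of the pair are removed, and the individual with the condition of interest is&#x20;retained.</p>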
</sec>
</sec>
<sec id="s2-5-3">
<title>2.5.3&#x20;Hardy-Weinberg Equilibrium</title>
<p>The Hardy-Weinberg Equilibrium (HWE) is a principle that states that allele and genotype frequencies in a stable population without evolutionary influences will stay constant between generations. Deviation from HWE indicates that genotype frequencies differ significantly from their expected values, which could indicate genotyping errors; such variants are therefore often excluded from analyses (<xref ref-type="bibr" rid="B29">Marees et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B48">Zhao et&#x20;al., 2018</xref>). Note that HWE is sensitive to population structure if allele frequencies differ between subpopulations, so the population should be stratified by ethnicity prior to testing&#x20;HWE.</p>
<p>In the UK Biobank genotyping data, variants were tested for HWE within each genotyping batch among individuals of homogeneous European ancestry (computed via PCA), and were set to missing at a threshold of <italic>p</italic>&#x20;&#x3c; 10<sup>&#x2212;12</sup> prior to imputation.</p>
<p>It is important to be aware that HWE is an assumption of many genotype imputation methods, including the IMPUTE2 program (<xref ref-type="bibr" rid="B17">Howie et&#x20;al., 2009</xref>). If such methods have been used, it may then not be appropriate to test whether the resulting imputed variants conform to&#x20;HWE.</p>
<p>The PLINK 2 command <monospace>--hwe</monospace> will filter out variants which deviate from HWE with a <italic>p</italic>-value beyond the given threshold (<xref ref-type="bibr" rid="B45">Wigginton et&#x20;al., 2005</xref>; <xref ref-type="bibr" rid="B14">Graffelman and Moreno, 2013</xref>). Note that the HWE test used in PLINK 2 does not appropriately account for the uncertainty in imputed data (<xref ref-type="bibr" rid="B38">Shriner, 2011</xref>, <xref ref-type="bibr" rid="B39">2013</xref>).</p>
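<p>To make the principle concrete, the Python sketch below compares observed genotype counts with those expected under HWE using a chi-square goodness-of-fit test with one degree of freedom. This asymptotic test is a simplification chosen for illustration and is not the exact test described in the references above. Like the exact test, it applies to hard-called genotype counts and takes no account of imputation uncertainty.</p>
<preformat><![CDATA[
```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 degree of freedom) for HWE.

    Returns (statistic, p_value) from observed genotype counts.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)     # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0, 1.0                 # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```
]]></preformat>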
</sec>
</sec>
<sec id="s2-6">
<title>2.6 Calculating Dosages</title>
<p>Imputed genotypes are generally given probabilistically, rather than as discrete values. For example, a particular SNP with alleles A and B is represented in .bgen as the trio of genotype probabilities <inline-formula id="inf7">
<mml:math id="m8">
<mml:mrow>
<mml:mi mathvariant="normal">&#x2119;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf8">
<mml:math id="m9">
<mml:mrow>
<mml:mi mathvariant="normal">&#x2119;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>A</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf9">
<mml:math id="m10">
<mml:mrow>
<mml:mi mathvariant="normal">&#x2119;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>B</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> for each individual.</p>
<p>A directly genotyped SNP will have probability 1 of one genotype and 0 for the others, but at an imputed SNP an individual might have, for example, a 90% probability of being homozygous for allele A (genotype AA) and a 10% probability of being heterozygous (genotype&#x20;AB).</p>
<p>To calculate a PRS, we want to convert this information on genotype probabilities into a single number per SNP giving the &#x201c;dosage&#x201d; of the effect allele. We are assuming additive genetic effects, where the phenotypic expression increases for each copy of the effect allele.</p>
<p>There are two main ways of doing this - allelic or hard-call dosages. The method used should be reported to allow for replication of the PRS and any results.</p>
<sec id="s2-6-1">
<title>2.6.1 Allelic Dosages</title>
<p>The allelic dosages are real numbers, <inline-formula id="inf10">
<mml:math id="m11">
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> calculated as the expected number of copies of the effect allele<disp-formula id="equ2">
<mml:math id="m12">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="normal">&#x2119;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="normal">&#xa0;&#x2119;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>where A is the non-effect allele and B is the effect allele.</p>
<p>Although it is obviously not biologically plausible for an individual to actually have fractional copies of a variant, this provides a dosage value that incorporates some of the uncertainty of the imputed genotype&#x20;calls.</p>
<p>See the PLINK 2 command <monospace>--export A</monospace> for exporting allelic dosage into a separate file, which can be read in R for easy inspection.</p>
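<p>As a minimal sketch of the allelic dosage formula above, the following Python snippet computes the expected number of copies of effect allele B from a genotype probability trio. The function name <monospace>allelic_dosage</monospace>, the tolerance check and the example values are illustrative assumptions and are not part of any genetics tool.</p>
<preformat><![CDATA[
```python
def allelic_dosage(p_aa, p_ab, p_bb, tol=1e-6):
    """Expected copies of effect allele B: 2 * P(BB) + P(AB)."""
    total = p_aa + p_ab + p_bb
    # Assumption: a valid probability trio sums to 1 (within rounding error).
    if abs(total - 1.0) > tol:
        raise ValueError(f"genotype probabilities sum to {total}, expected 1")
    return 2.0 * p_bb + p_ab


# Hypothetical individual: 90% AA, 10% AB, 0% BB, with B as the effect allele.
dosage_b = allelic_dosage(0.90, 0.10, 0.00)
print(round(dosage_b, 2))
```
]]></preformat>
<p>If A were the effect allele instead, the same trio would be passed in reverse order, giving an expected dosage of 1.9 copies of A.</p>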
</sec>
<sec id="s2-6-2">
<title>2.6.2&#x20;Hard-Called Dosages</title>
<p>Hard-called, or thresholded, dosages are integer values, <inline-formula id="inf11">
<mml:math id="m13">
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> for SNP <inline-formula id="inf12">
<mml:math id="m14">
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula> in individual <inline-formula id="inf13">
<mml:math id="m15">
<mml:mi>j</mml:mi>
</mml:math>
</inline-formula>, that are obtained by choosing a threshold value at which to round the expected (allelic) dosage to a whole number.</p>
<p>For example, if we set the threshold to 0.1 in PLINK 2 using <monospace>--hard-call-threshold 0.1</monospace>, the hard-call dosage will be assigned as follows:<disp-formula id="equ3">
<mml:math id="m16">
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>l</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0.0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mn>0.1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0.9</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mn>1.1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mn>1.9</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mn>2.0</mml:mn>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>g</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>While this provides us with data that looks the same as directly called genotypes, and can be stored in the same file formats, it also loses information: if we convert our entire dataset to hard-calls under a given threshold, we can no longer recover the original genotype probabilities or change the hard-call threshold&#x20;used.</p>
<p>Note also that once the genotype probabilities have been collapsed into a single expected dosage, we can get the same hard-call dosage value for two genotype probability trios that convey very different certainty about the underlying genotype (see <xref ref-type="table" rid="T5">Table&#x20;5</xref>).</p>
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Hard-call vs. allelic dosages: genotype probability trios and allelic and hard-called dosages for 2 SNPs of a theoretical individual.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="center">P (AA)</th>
<th align="center">P (AB)</th>
<th align="center">P(BB)</th>
<th align="center">Allelic Dosage (B)</th>
<th align="center">Hard-call Dosage (B)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">SNP1</td>
<td align="char" char=".">0.22</td>
<td align="char" char=".">0.50</td>
<td align="char" char=".">0.28</td>
<td align="char" char=".">1.06</td>
<td align="char" char=".">1</td>
</tr>
<tr>
<td align="left">SNP2</td>
<td align="char" char=".">0.02</td>
<td align="char" char=".">0.90</td>
<td align="char" char=".">0.08</td>
<td align="char" char=".">1.06</td>
<td align="char" char=".">1</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In this example, the individual has an allelic dosage of 1.06 copies of allele B for both SNPs, which would result in them being categorised as heterozygous when using hard-call dosages with a threshold of 0.1. However, their imputed probability of having the heterozygous genotype for SNP1 is much lower than it is for&#x20;SNP2.</p>
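<p>The thresholding rule and the Table&#x20;5 example can be reproduced with a short Python sketch. The helper names and the small floating-point tolerance <monospace>eps</monospace> are our own assumptions for illustration; this is not PLINK 2 code, and PLINK 2 may treat values exactly on a threshold boundary differently.</p>
<preformat><![CDATA[
```python
def allelic_dosage(p_ab, p_bb):
    """Expected copies of effect allele B: 2 * P(BB) + P(AB)."""
    return 2.0 * p_bb + p_ab


def hard_call(dosage, threshold=0.1, eps=1e-9):
    """Return 0, 1 or 2 if the dosage lies within `threshold` of that integer,
    otherwise None (missing). `eps` absorbs floating-point rounding at the edges."""
    nearest = round(dosage)
    if abs(dosage - nearest) <= threshold + eps:
        return int(nearest)
    return None


# Genotype probability trios (P(AA), P(AB), P(BB)) taken from Table 5.
table5 = {"SNP1": (0.22, 0.50, 0.28), "SNP2": (0.02, 0.90, 0.08)}

results = {}
for snp, (p_aa, p_ab, p_bb) in table5.items():
    d = allelic_dosage(p_ab, p_bb)
    # Store allelic dosage, hard call, and the probability of the called genotype.
    results[snp] = (round(d, 2), hard_call(d), p_ab)

for snp, row in results.items():
    print(snp, row)
```
]]></preformat>
<p>Both SNPs receive the same allelic and hard-called dosage, while the third value in each tuple, P(AB), shows how different the underlying certainty is.</p>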
<p>See the PLINK 2 command <monospace>--import-dosage-certainty</monospace> to use hard-called dosages and discard the values with low certainty.</p>
</sec>
</sec>
<sec id="s2-7">
<title>2.7 Calculating the Polygenic Risk Score</title>
<p>While occasionally a risk score may be computed as the unweighted sum of effect allele dosages (&#x201c;allele count model&#x201d;), the most common approach is to weight each allele dosage by its effect size, as described in Eq. (1), and that is the method we will focus on&#x20;here.</p>
<p>The actual calculation of a PRS is numerically straightforward and can be computed directly in any standard scripting language, such as R or SAS, as a matrix multiplication of SNP dosages per individual by betas per SNP. Recall that if the effect sizes in the PRS were given as odds ratios or hazard ratios, they will need to be log-transformed at this&#x20;point.</p>
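<p>To illustrate how simple the calculation is, the sketch below computes a weighted PRS in plain Python for a small dosage matrix (one row per individual, one column per SNP). The odds ratios, dosages and function name are hypothetical values chosen for illustration; the odds ratios are log-transformed to obtain betas before use.</p>
<preformat><![CDATA[
```python
import math


def weighted_prs(dosages, betas):
    """Sum of beta_i * dosage_ij over all SNPs for one individual."""
    if len(dosages) != len(betas):
        raise ValueError("need exactly one dosage per SNP weight")
    return sum(b * d for b, d in zip(betas, dosages))


# Hypothetical effect sizes reported as odds ratios; log-transform to betas.
odds_ratios = [1.10, 0.95, 1.25]
betas = [math.log(o) for o in odds_ratios]

# Hypothetical effect allele dosages: rows are individuals, columns are SNPs.
dosage_matrix = [
    [2.0, 1.06, 0.0],
    [1.0, 0.00, 2.0],
]

# Equivalent to the matrix product of dosages (individuals x SNPs) and betas.
scores = [weighted_prs(row, betas) for row in dosage_matrix]
print(scores)
```
]]></preformat>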
<p>However, for large scores it can be more convenient to use genetics tools such as PLINK 2, which uses the <monospace>--score</monospace> command to calculate linear risk scores for each individual and has some configuration options built in to handle missing data and standardisation of the&#x20;score.</p>
<sec id="s2-7-1">
<title>2.7.1 Missing Genotype Data</title>
<p>Although this guide primarily deals with imputed genotype data and advocates the use of allelic dosages, we will briefly outline some of the techniques used to handle missing data in the calculation of a&#x20;PRS.</p>
<p>Directly genotyped data, or imputed data that has been hard-called, may contain missing data and although individuals and SNPs with a high proportion of missingness are typically excluded as part of the quality control, there can still be some genotypes missing for some individuals.</p>
<p>One common approach to dealing with missing data for a SNP is to use the effect allele frequency in the population in place of the missing dosage for the individual (analogous to mean imputation in statistical analyses). This is the default approach in PLINK 2, but can be disabled by adding the <monospace>no-mean-imputation</monospace> modifier to the <monospace>--score</monospace> command.</p>
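<p>A minimal Python sketch of this kind of mean imputation is given below. It assumes that a missing dosage is represented as <monospace>None</monospace> and that the substituted value is the expected dosage under the population frequency, i.e., ploidy multiplied by the effect allele frequency; the function name and data are hypothetical, and the exact scaling used internally by any particular tool should be checked in its documentation.</p>
<preformat><![CDATA[
```python
def mean_impute(dosages, effect_allele_freqs, ploidy=2):
    """Replace missing dosages (None) with the expected dosage for that SNP.

    Assumption: expected dosage = ploidy * effect allele frequency.
    """
    return [
        ploidy * freq if dosage is None else dosage
        for dosage, freq in zip(dosages, effect_allele_freqs)
    ]


# Hypothetical individual with a missing dosage at the second SNP.
dosages = [2.0, None, 1.0]
effect_allele_freqs = [0.30, 0.25, 0.60]

imputed = mean_impute(dosages, effect_allele_freqs)
print(imputed)
```
]]></preformat>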
<p>Alternatively missing genotypes can be ignored, and any SNPs for which an individual is missing a dosage value will not contribute to the score. In this case, it is advisable to find the average PRS per individual by dividing by the number of non-missing SNP dosages. This prevents scores of individuals with missing genetic data from being consistently lower than scores of individuals with complete data, which would result in bias towards lower&#x20;risk.</p>
<p>Since each individual (i.e.,&#x20;sample) could be missing a different number of SNPs, each participant&#x2019;s total PRS should be divided by their number of non-missing alleles; our averaged PRS is calculated as<disp-formula id="equ4">
<mml:math id="m17">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>R</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mi>i</mml:mi>
<mml:mi>N</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2217;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>&#x2217;</mml:mo>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>where <inline-formula id="inf14">
<mml:math id="m18">
<mml:mi>P</mml:mi>
</mml:math>
</inline-formula> is the ploidy of the individual (2 in this case since human autosomes are diploid), and <inline-formula id="inf15">
<mml:math id="m19">
<mml:mrow>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the number of non-missing variants observed for individual&#x20;<inline-formula id="inf16">
<mml:math id="m20">
<mml:mi>j</mml:mi>
</mml:math>
</inline-formula>.</p>
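<p>The averaged score defined above can be sketched in a few lines of Python. Missing dosages are represented as <monospace>None</monospace> and are skipped, and the sum is divided by the ploidy multiplied by the number of non-missing SNPs. The function name and example values are illustrative assumptions.</p>
<preformat><![CDATA[
```python
def averaged_prs(dosages, betas, ploidy=2):
    """Weighted sum over non-missing SNPs divided by ploidy * M_j."""
    observed = [(b, d) for b, d in zip(betas, dosages) if d is not None]
    if not observed:
        return None  # no usable genotypes for this individual
    weighted_sum = sum(b * d for b, d in observed)
    return weighted_sum / (ploidy * len(observed))


betas = [0.1, 0.2, 0.3]

# Complete data: (0.2 + 0.2 + 0.3) / (2 * 3)
complete = averaged_prs([2.0, 1.0, 1.0], betas)

# Second SNP missing: (0.2 + 0.3) / (2 * 2)
partial = averaged_prs([2.0, None, 1.0], betas)

print(complete, partial)
```
]]></preformat>
<p>Because the denominator shrinks with the number of missing SNPs, the score of the individual with missing data is not pulled towards zero in the way a raw sum would be.</p>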
<p>This averaging approach is also the default in PLINK 2, and the resulting averaged PRS is output in the &#x201c;&#x3c;Score name&#x3e;_AVG&#x201d; column of a PLINK format sample score file (.sscore). If a non-averaged PRS is preferred, then the <monospace>cols&#x3d;scoresums</monospace> modifier can be specified.</p>
</sec>
<sec id="s2-7-2">
<title>2.7.2 Transforming the Polygenic Risk Score for Use in Analyses</title>
<p>Once the PRS has been computed, there are a variety of transformations that can be applied for either comparison to other scores or to produce easily interpretable results in analyses.</p>
<p>As the number of SNPs included in a PRS increases, so does the theoretical range of the score. For example, a hypothetical individual who was homozygous for all risk alleles (dosage &#x3d; 2) could have a score of 20 for a 100 SNP PRS where all betas were 0.1, but a score of 200,000 for a 1,000,000 SNP PRS with betas of 0.1. This means we cannot directly compare the scores for PRS containing different numbers of&#x20;SNPs.</p>
<p>In order to compare such scores we may therefore wish to average the total PRS by the number of SNPs which ensures a similar scale regardless of the number of SNPs used. Be aware, however, that by discarding the absolute value of the PRS, we compromise our ability to identify outliers, compare the PRS across samples, or detect the effect of natural selection (<xref ref-type="bibr" rid="B7">Choi et&#x20;al., 2020</xref>).</p>
<p>For use in association studies, one common approach is to categorise PRS into quantiles for ease of interpretation. Often tertiles, quartiles, quintiles, or deciles are used, or the top 1% are compared to the middle quintile. This allows easy comparison of &#x201c;high risk&#x201d; individuals to &#x201c;average&#x201d; ones, especially given that there is currently no well-established cut-off threshold to define a &#x201c;high PRS&#x201d; (<xref ref-type="bibr" rid="B9">Cupido et&#x20;al., 2021</xref>).</p>
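<p>As a hedged illustration of this categorisation, the Python sketch below assigns each score to a quintile using cut points estimated from the sample itself. The function name and the toy scores are hypothetical, and the default interpolation method of <monospace>statistics.quantiles</monospace> is only one of several reasonable conventions for defining cut points.</p>
<preformat><![CDATA[
```python
import bisect
import statistics


def quantile_groups(scores, n_groups=5):
    """Assign each score to a group 1..n_groups using sample quantile cut points."""
    cuts = statistics.quantiles(scores, n=n_groups)  # n_groups - 1 cut points
    return [bisect.bisect_right(cuts, s) + 1 for s in scores]


# Hypothetical PRS values for 100 individuals.
scores = [float(i) for i in range(1, 101)]
groups = quantile_groups(scores)

# Number of individuals falling in each quintile.
counts = {g: groups.count(g) for g in sorted(set(groups))}
print(counts)
```
]]></preformat>
<p>Changing <monospace>n_groups</monospace> to 3, 4 or 10 gives tertiles, quartiles or deciles respectively.</p>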
<p>In order to include a PRS as a continuous variable in regression models, it is often standardised to have mean &#x3d; 0 and SD &#x3d; 1, so that the effect in the model can be given in units of 1 SD of the PRS. This transformation also serves as a pre-processing step when combining multiple PRS into one. For example, we might wish to average PRS for similar traits (e.g., systolic blood pressure, diastolic blood pressure and pulse pressure) into one combined &#x201c;blood pressure&#x201d; risk score for analysis as demonstrated in (<xref ref-type="bibr" rid="B32">Pazoki et&#x20;al., 2018</xref>), or construct a &#x201c;meta&#x201d; PRS combining multiple PRS for one trait across studies (<xref ref-type="bibr" rid="B18">Inouye et&#x20;al., 2018</xref>).</p>
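<p>The standardisation and combination steps described above might be sketched as follows in Python. The two input scores and their values are hypothetical, and simple unweighted averaging of the standardised scores is only one possible way of combining them.</p>
<preformat><![CDATA[
```python
import statistics


def standardise(scores):
    """Rescale scores to mean 0 and (sample) standard deviation 1."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("cannot standardise a constant score")
    return [(s - mean) / sd for s in scores]


# Hypothetical raw PRS for two related traits in the same five individuals.
sbp_prs = [0.12, 0.30, 0.18, 0.25, 0.05]
dbp_prs = [1.4, 2.2, 1.9, 2.8, 1.1]

z_sbp = standardise(sbp_prs)
z_dbp = standardise(dbp_prs)

# Unweighted average of the standardised scores, per individual.
combined = [(a + b) / 2 for a, b in zip(z_sbp, z_dbp)]
print([round(c, 3) for c in combined])
```
]]></preformat>
<p>In practice the mean and SD used for standardisation should come from the analysis population, and the choice of population should be reported.</p>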
<p>The PRS is also generally kept as a continuous variable when it is incorporated in risk prediction models, as we see in (<xref ref-type="bibr" rid="B11">Elliott et&#x20;al., 2020</xref>; A. <xref ref-type="bibr" rid="B21">Lee et&#x20;al., 2019</xref>). It is still necessary to assess the linearity assumption in the model building stage (i.e.,&#x20;linear association between PRS and outcome), as outlined in (<xref ref-type="bibr" rid="B40">Sun et&#x20;al., 2021</xref>).</p>
<p>Each transformation has its own limitations; we advise readers to choose one carefully based on their analysis objective.</p>
</sec>
</sec>
<sec id="s2-8">
<title>2.8 PRS in Statistical Models</title>
<p>One of the general statistical considerations when incorporating PRS in a model is to account for population genetic structure to avoid bias, which can be achieved by adjusting for genetic principal components (PCs) in the model (<xref ref-type="bibr" rid="B33">Price et&#x20;al., 2006</xref>) or by more advanced methods such as mixed models (<xref ref-type="bibr" rid="B34">Price et&#x20;al., 2010</xref>). Typically, the first 10 genetic PCs are considered as possible confounders, although this number is routine but arbitrary (<xref ref-type="bibr" rid="B36">Reed et&#x20;al., 2015</xref>). Even when the analysis population is restricted to a single ethnic group, the genetic PCs can capture population structure that is not captured by self-reported ethnicity. In UKB, the first 40&#xa0;PCs are available for researchers to download (Data Field 22009<xref ref-type="fn" rid="fn18">
<sup>18</sup>
</xref>) (<xref ref-type="bibr" rid="B4">Bycroft et&#x20;al., 2018</xref>).</p>
<p>Similarly, bias can arise when the data was genotyped using different arrays or across multiple batches, which is increasingly common as the size of studies increases (<xref ref-type="bibr" rid="B43">Turner et&#x20;al., 2011</xref>). It is therefore standard practice to adjust for genotyping array (<xref ref-type="bibr" rid="B18">Inouye et&#x20;al., 2018</xref>). In UK Biobank the first &#x223c;50,000 people were genotyped using the UK BiLEVE Axiom Array, while the rest of the cohort were genotyped using the UK Biobank Axiom Array. Genotyping was performed in 106 batches of about 4,700 individuals, using a custom genotype calling pipeline developed by Affymetrix. Information on both the array and batch number for each participant is made available for researchers (Data Field 22000<xref ref-type="fn" rid="fn19">
<sup>19</sup>
</xref>), and UK Biobank internal quality control of the data was performed within batches to account for any batch-level discrepancies.</p>
</sec>
</sec>
<sec id="s3">
<title>3 Results</title>
<p>We have developed a pipeline that, when supplied with a list of SNPs and betas, can extract required SNPs, apply chosen QC and calculate a PRS using bgenix and PLINK 2. For the full code, and additional documentation of technical aspects, see Online Materials: PRS Pipeline on GitHub.<xref ref-type="fn" rid="fn20">
<sup>20</sup>
</xref>
</p>
<sec id="s3-1">
<title>3.1 Worked Example</title>
<p>We chose the PRS for low-density lipoprotein cholesterol developed by <xref ref-type="bibr" rid="B19">Klarin et&#x20;al. (2018)</xref> in the Million Veteran Program data, because it is a relatively recent PRS that provides a comprehensive selection of SNPs in the context of the current literature. It consists of 223&#x20;lipid-associated SNPs with weights derived in the 2017 Global Lipids Genetics Consortium (GLGC) exome array analysis (<xref ref-type="bibr" rid="B25">Liu et&#x20;al., 2017</xref>), in association analyses that were adjusted for sex, age, age squared and up to four principal components.</p>
<p>In addition, previous work has already been done applying this PRS within the UK Biobank (<xref ref-type="bibr" rid="B41">Trinder et&#x20;al., 2020a</xref>; <xref ref-type="bibr" rid="B42">Trinder et&#x20;al., 2020b</xref>) and these results have been returned to the UKB and made available, so we are able to validate our results against theirs.</p>
<p>The SNP list and betas for LDL-C were obtained from Supplementary Table&#x20;11 of Klarin et&#x20;al., and were labelled under genome build GRCh37.75. The PRS is also available from the PGS Catalog with polygenic score ID PGS000115<xref ref-type="fn" rid="fn21">
<sup>21</sup>
</xref> (<xref ref-type="bibr" rid="B20">Lambert et&#x20;al., 2021</xref>).</p>
<sec id="s3-1-1">
<title>3.1.1 Validation Data</title>
<p>Our validation dataset is the UK Biobank (UKB), a prospective cohort study of &#x223c;500,000 volunteers of middle and old age (40&#x2013;69&#xa0;years) in the UK. All UKB participants were genotyped, yielding directly called data for around 850,000 genetic variants. Variants that failed quality control were excluded, and data for a further &#x223c;90 million genetic variants was then imputed. Variant IDs were assigned according to the Genome Reference Consortium Human Build 37 (GRCh37) reference genome (<xref ref-type="bibr" rid="B4">Bycroft et&#x20;al., 2018</xref>), and the data was aligned such that the first allele given in the .bgen files is the reference allele on the forward strand (UK Biobank Resource 531<xref ref-type="fn" rid="fn22">
<sup>22</sup>
</xref>).</p>
<p>Note that individuals who have withdrawn from the UKB cohort have had their IDs replaced with negative numbers in the sample file. This maintains the order of the remaining IDs, so they still line up with the genetic data, but enforces exclusion of withdrawn participants, as they can no longer be joined to the phenotypic&#x20;data.</p>
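<p>A minimal Python sketch of how withdrawn participants could be dropped while keeping the remaining samples aligned with the genetic data is shown below. The sample IDs and dosages are invented for illustration, and the helper name is our own.</p>
<preformat><![CDATA[
```python
def active_sample_indices(sample_ids):
    """Positions of participants whose ID has not been replaced by a negative number."""
    return [i for i, sid in enumerate(sample_ids) if sid > 0]


# Hypothetical sample file order; negative IDs mark withdrawn participants.
sample_ids = [1000011, -1, 1000027, -2, 1000035]

# Hypothetical dosages for one SNP, in the same order as the sample file.
dosages = [0.0, 1.0, 2.0, 1.06, 0.1]

keep = active_sample_indices(sample_ids)

# Select by position so IDs and dosages stay aligned after filtering.
kept_ids = [sample_ids[i] for i in keep]
kept_dosages = [dosages[i] for i in keep]
print(kept_ids, kept_dosages)
```
]]></preformat>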
<p>In the GLGC exome array analysis where the weights for the LDL-C PRS were derived, LDL-cholesterol was measured in mg/dL, and therefore the weights <inline-formula id="inf17">
<mml:math id="m21">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represent the increase of LDL-C in mg/dL for each unit increase in dosage of <inline-formula id="inf18">
<mml:math id="m22">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>N</mml:mi>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. In the UK Biobank, LDL-cholesterol in mmol/L was measured in each participant at baseline, by blood samples taken for assays. We therefore convert the LDL-C measurements from mmol/L to mg/dL by multiplying by&#x20;38.67.</p>
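<p>The unit conversion amounts to a single multiplication, sketched here in Python with hypothetical LDL-C measurements. The constant 38.67 is the factor quoted in the text; the variable and function names are our own.</p>
<preformat><![CDATA[
```python
MG_DL_PER_MMOL_L = 38.67  # conversion factor for LDL-C quoted in the text


def ldl_mmol_to_mgdl(value_mmol_per_l):
    """Convert an LDL-C measurement from mmol/L to mg/dL."""
    return value_mmol_per_l * MG_DL_PER_MMOL_L


# Hypothetical baseline LDL-C measurements in mmol/L.
ldl_mmol = [2.5, 3.5, 4.2]
ldl_mgdl = [ldl_mmol_to_mgdl(v) for v in ldl_mmol]
print([round(v, 1) for v in ldl_mgdl])
```
]]></preformat>
<p>After conversion, the measured LDL-C is on the same scale as the weights, so the association between PRS and measured LDL-C can be interpreted in mg/dL.</p>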
</sec>
<sec id="s3-1-2">
<title>3.1.2 SNP Extraction and Review of QC</title>
<p>We used bgenix (<xref ref-type="bibr" rid="B2">Band and Marchini, 2018</xref>) to extract SNPs for the PRS from UKB imputation data. All 223 variants were available in the UKB imputed genetic data, and there were no multi-allelic or ambiguous SNPs. We verified that the allele frequencies of the SNPs were similar (within 0.1 percentage point) in our data to those reported in the supplementary materials of (<xref ref-type="bibr" rid="B19">Klarin et&#x20;al., 2018</xref>).</p>
<p>In the GLGC analysis where the weights were derived, the quality control conducted centrally across 73 contributing studies included removal of ambiguous variants, exclusion of variants with call rate &#x3c;0.9 or HWE <italic>p</italic> value &#x3c;1 &#xd7; 10<sup>&#x2212;7</sup> (<xref ref-type="bibr" rid="B25">Liu et&#x20;al., 2017</xref>). In the MVP data where the PRS was developed, the threshold values used for imputation information and minor allele frequency were 0.3 and 0.0003 respectively (<xref ref-type="bibr" rid="B19">Klarin et&#x20;al., 2018</xref>).</p>
<p>We chose to exclude SNPs with imputation information &#x3c;0.4 within the UK Biobank data (<italic>n</italic>&#x20;&#x3d; 1), since this is a common threshold used in the literature (<xref ref-type="bibr" rid="B49">Zheng et&#x20;al., 2012</xref>). We also excluded rare SNPs with MAF &#x3c;0.005 (<italic>n</italic>&#x20;&#x3d; 4). After these exclusions, we had 218 SNPs remaining (<xref ref-type="fig" rid="F3">Figure&#x20;3</xref>).</p>
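<p>The two SNP-level exclusion rules described above can be expressed as a simple filter. The Python sketch below uses invented SNP records and identifiers; only the thresholds (imputation information of 0.4 and MAF of 0.005) come from the text, and in practice these metrics would be read from the output of a genetics tool.</p>
<preformat><![CDATA[
```python
def passes_qc(info, effect_allele_freq, min_info=0.4, min_maf=0.005):
    """Keep a SNP only if it meets both the info and the MAF threshold."""
    # The minor allele frequency is the smaller of the two allele frequencies.
    maf = min(effect_allele_freq, 1.0 - effect_allele_freq)
    return info >= min_info and maf >= min_maf


# Hypothetical per-SNP summary metrics.
snps = [
    {"rsid": "rs_hyp1", "info": 0.98, "eaf": 0.31},
    {"rsid": "rs_hyp2", "info": 0.35, "eaf": 0.12},   # fails imputation information
    {"rsid": "rs_hyp3", "info": 0.91, "eaf": 0.998},  # fails MAF (0.002)
    {"rsid": "rs_hyp4", "info": 0.45, "eaf": 0.006},
]

kept = [s["rsid"] for s in snps if passes_qc(s["info"], s["eaf"])]
excluded = [s["rsid"] for s in snps if not passes_qc(s["info"], s["eaf"])]
print(kept, excluded)
```
]]></preformat>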
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Flowchart showing quality control exclusions in worked example of LDL-C PRS in UK Biobank data.</p>
</caption>
<graphic xlink:href="fgene-13-818574-g003.tif"/>
</fig>
<p>When investigating the impact of these exclusions (<xref ref-type="fig" rid="F4">Figure&#x20;4</xref>), we saw that the SNPs we excluded due to MAF included the SNPs with the lowest remaining imputation information. This is unsurprising, since SNPs with lower MAF are generally less well imputed. In addition, we observed that these SNPs had some of the larger absolute effect&#x20;sizes.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Imputation information against beta of each SNP in LDL-C PRS. Navy dashed line is our imputation information threshold of 0.4, and SNPs are coloured by our MAF threshold of 0.005.</p>
</caption>
<graphic xlink:href="fgene-13-818574-g004.tif"/>
</fig>
<p>We excluded participants (<italic>n</italic>&#x20;&#x3d; 80,296) who were not flagged in UK Biobank Data Field 22020, which indicates the subset of participants that met quality control for use in the calculation of principal components.</p>
</sec>
<sec id="s3-1-3">
<title>3.1.3 Polygenic Risk Score Calculation and Validation</title>
<p>We calculated the PRS using allelic dosages in PLINK 2 with the <monospace>cols&#x3d;scoresums</monospace> option to get the raw (non-averaged) values.</p>
<p>Since the PRS was developed among primarily White individuals, we restricted our validation population to UK&#x20;Biobank participants of genetically White British ancestry (using UKB Data Field 22006). Among this population the PRS was approximately normally distributed (<xref ref-type="fig" rid="F5">Figure&#x20;5</xref>).</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Histogram of LDL-C PRS with overlaid density&#x20;plot.</p>
</caption>
<graphic xlink:href="fgene-13-818574-g005.tif"/>
</fig>
<p>Plotting the PRS against baseline LDL-C (<xref ref-type="fig" rid="F6">Figure&#x20;6</xref>) we saw good association between the PRS and the measured LDL-C (<italic>R</italic>
<sup>2</sup> &#x3d;&#x20;0.27).</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Association between LDL-C PRS and measured LDL-C at baseline among genetically White British UK Biobank participants.</p>
</caption>
<graphic xlink:href="fgene-13-818574-g006.tif"/>
</fig>
<p>We compared our calculated PRS with the one returned to the UK Biobank by Trinder et&#x20;al. (UKB Return 2142<xref ref-type="fn" rid="fn23">
<sup>23</sup>
</xref>) and found almost perfect correlation (<italic>R</italic>
<sup>2</sup> &#x3d; 0.99). However, when inspecting a scatterplot of the scores (<xref ref-type="fig" rid="F7">Figure&#x20;7</xref>) we observed differences in the raw values.<list list-type="simple">
<list-item>
<p>&#x2022; We had allowed the betas to be either positive or negative, while in the calculation of the returned score all SNPs had been aligned such that the betas were positive. This resulted in our scores being consistently smaller.</p>
</list-item>
<list-item>
<p>&#x2022; We had used allelic dosages, while the returned score had used hard-called dosages. This led to the parallel banding effect on the&#x20;plot.</p>
</list-item>
<list-item>
<p>&#x2022; Our quality control metrics differed slightly from those used in Trinder et&#x20;al., leading to slightly different exclusions.</p>
</list-item>
</list>
</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Comparison of PRS calculated using allelic and hard-call dosages (Pearson&#x2019;s correlation coefficient &#x3d; 0.99). PRS used was 223-SNP score from (<xref ref-type="bibr" rid="B19">Klarin et&#x20;al., 2018</xref>), with hard-called dosage approach from (<xref ref-type="bibr" rid="B42">Trinder et&#x20;al., 2020b</xref>).</p>
</caption>
<graphic xlink:href="fgene-13-818574-g007.tif"/>
</fig>
<p>While both approaches are completely reasonable, the resulting scores are not directly comparable. This demonstrates the importance of carefully reading the methods used in the initial calculation of the PRS, in particular if the intent is to compare the performance or association in a new dataset with the initial publication.</p>
</sec>
</sec>
<sec id="s3-2">
<title>3.2 Time and Computation Requirements</title>
<p>For this 223 SNP PRS, we ran each part with each of the three software tools discussed in this paper where possible. As a comparison, we also ran a 118,388 SNP PRS for breast cancer (PGS000511<xref ref-type="fn" rid="fn24">
<sup>24</sup>
</xref>) (<xref ref-type="bibr" rid="B12">Fritsche et&#x20;al., 2020</xref>). The computation times are presented in <xref ref-type="table" rid="T6">Table&#x20;6</xref>, and are not intended as an overall performance analysis of each tool, but rather as an indication of their relative speeds and scalability to larger datasets.</p>
<table-wrap id="T6" position="float">
<label>TABLE 6</label>
<caption>
<p>Comparison of times taken. Please note that absolute times may vary depending on the computational power of the system used; our interest is in the relative performance of the&#x20;tools.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="center">bgenix</th>
<th align="center">QCTOOL v2</th>
<th align="center">PLINK 2</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td colspan="4" align="left">223 variants</td>
</tr>
<tr>
<td align="left">&#x2003;SNP extraction</td>
<td align="center">53&#xa0;s</td>
<td align="center">2,696&#xa0;s</td>
<td align="center">18,403&#xa0;s</td>
</tr>
<tr>
<td align="left">&#x2003;QC</td>
<td align="center">&#x2014;</td>
<td align="center">795&#xa0;s</td>
<td align="center">7&#xa0;s</td>
</tr>
<tr>
<td align="left">&#x2003;PRS calculation</td>
<td align="center">&#x2014;</td>
<td align="center">&#x2014;</td>
<td align="center">1&#xa0;s</td>
</tr>
<tr>
<td colspan="4" align="left">100&#xa0;k variants</td>
</tr>
<tr>
<td align="left">&#x2003;SNP extraction</td>
<td align="center">2,681&#xa0;s</td>
<td align="center">&#x3e;108&#xa0;k s (exceeded 30&#xa0;h limit)</td>
<td align="center">20,821&#xa0;s</td>
</tr>
<tr>
<td align="left">&#x2003;QC</td>
<td align="center">&#x2014;</td>
<td align="center">7,942&#xa0;s</td>
<td align="center">76&#xa0;s</td>
</tr>
<tr>
<td align="left">&#x2003;PRS calculation</td>
<td align="center">&#x2014;</td>
<td align="center">&#x2014;</td>
<td align="center">256&#xa0;s</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Bgenix is the clear leader in terms of SNP extraction speed from BGEN files, as it was designed for this file format and takes advantage of the index file. While QCTOOL v2 offers a convenient wildcard feature to read from all the chromosome files in one command, it takes a long time to read the data and does not scale well to larger scores. PLINK 2 can rapidly extract data from its native .pgen file format, but in order to manipulate BGEN files it first auto-converts them to .pgen, which takes approximately 25&#xa0;min per chromosome on the full imputed&#x20;data.</p>
<p>QCTOOL can calculate per-SNP or per-sample summary statistics quickly for small numbers of SNPs, but this scales poorly for large scores. In addition, some external tool (e.g., awk or R) is then needed to filter the resulting statistics by the desired exclusion thresholds, and then a separate extraction step must be used to apply these filters, which has not been included in our timings.</p>
<p>As previously discussed, PLINK 2 needs to convert the dataset to pgen format the first time it is read, but this only needs to be done once for a given score. Once the data has been converted, PLINK 2 can compute summary metrics and apply quality control thresholds in a single command, and does this rapidly even for large datasets.</p>
<p>Although the QCTOOL list of options includes the -risk-score command for PRS calculation, this is poorly documented and we have not explored it here. PLINK 2 can calculate even large PRS within a reasonable&#x20;time.</p>
</sec>
</sec>
<sec id="s4">
<title>4 Discussion</title>
<p>The continual hunt for &#x201c;novel&#x201d; variants associated with any given trait means new PRS are constantly being developed, using variants and effect sizes identified in GWAS conducted on ever-growing meta-analyses of multiple datasets. This results in a wide array of scores for any given trait, with only minor improvements in predictive power beyond some threshold number of variants included.</p>
<p>However, the more datasets are used in the development of a PRS, the fewer datasets remain in which the score can be validated and used. We argue that there is value to be gained from using existing PRS in analyses, to validate and replicate the association and to investigate the potential for incorporating such scores in clinical practice. A PRS that has been incorporated in many analyses may become an &#x201c;industry standard&#x201d; score, and will result in more comparable research outputs than if many different scores were&#x20;used.</p>
<p>Authors who develop PRS clearly hope that these scores will be used by others, and initiatives like the PGS Catalog and the Genetic Risk Prediction Studies (GRIPS) Statement have gone a long way towards making this possible by homogenising the reporting of the necessary information for replicating a PRS (<xref ref-type="bibr" rid="B20">Lambert et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B44">Wand et&#x20;al., 2021</xref>).</p>
<p>Indeed, recent work (<xref ref-type="bibr" rid="B3">Becker et&#x20;al., 2021</xref>) has made existing PRS even more accessible by arranging to make a selection of pre-calculated scores available for download within large datasets such as the UK Biobank. However, while this may offer a simple way for researchers without a genetics background to include PRS in their analyses, convenience should not displace the need to critically evaluate the appropriateness of the score and the quality control applied.</p>
<p>In addition, even though the UK Biobank requests that all derived outputs are returned to them to be made available for other researchers to download, calculated PRS are not always returned and so cannot always be retrieved. Researchers who hope to use the same score are thus often obliged to reproduce the calculation, since direct sharing of UK Biobank data between studies is not permitted.</p>
<p>In this paper, we outlined the background concepts of PRS, compared genetic software tools for particular usage scenarios, and discussed the various QC metrics commonly used when working with genetic data, highlighting ways to best utilise resources provided by UKB. We provide our &#x201c;PRS pipeline,&#x201d;<xref ref-type="fn" rid="fn25">
<sup>25</sup>
</xref> an easily modifiable and reusable script that takes an input file of betas and calculates the&#x20;PRS.</p>
<p>In addition, we pointed out details that are often neglected in the existing literature but are crucial for reproducible work, such as different approaches to dosage computation. Finally, we discussed how PRS are computed and transformed, so that they are appropriate for the research objective and statistical analyses.</p>
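<p>As a hedged illustration of why the dosage approach and any subsequent transformation should be reported, the sketch below contrasts two ways of turning imputed genotype probabilities into a dosage, and then standardises a set of raw scores. The function names and the 0.9 hard-call threshold are hypothetical choices for this example only and are not taken from any specific tool or from our pipeline.</p>
<preformat>
```python
import statistics

# Illustrative sketch: two ways to derive a dosage for allele B from
# genotype probabilities P(AA), P(AB), P(BB), plus z-score standardisation.
# The 0.9 threshold is a hypothetical example value.


def expected_dosage(p_ab, p_bb):
    """Expected count of allele B: one copy from AB, two copies from BB."""
    return p_ab + 2.0 * p_bb


def hard_call_dosage(p_aa, p_ab, p_bb, threshold=0.9):
    """Return 0, 1 or 2 if the most likely genotype reaches the threshold.

    Returns None (treated as missing) when no genotype is certain enough.
    """
    probs = [p_aa, p_ab, p_bb]
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


def standardise(scores):
    """Convert raw scores to z-scores using the sample mean and SD."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(s - mean) / sd for s in scores]


# The same uncertain genotype gives different answers under each approach.
print(expected_dosage(0.85, 0.10))
print(hard_call_dosage(0.05, 0.85, 0.10))
print(standardise([1.0, 2.0, 3.0]))
```
</preformat>
<p>For the uncertain genotype in the example, the expected dosage retains the uncertainty as a fractional value, whereas the hard-call approach treats the genotype as missing; two analyses of the same score can therefore differ if this choice is not stated.</p>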
<sec id="s4-1">
<title>4.1 Limitations</title>
<p>In this paper, we have focussed on the calculation of existing PRS for use in statistical analyses and modelling, and have not discussed techniques used to develop a new PRS or &#x201c;real-world&#x201d; applications of PRS in a clinical context. If PRS development is of interest, we recommend the published guides for conducting GWAS and developing a PRS by <xref ref-type="bibr" rid="B7">Choi et&#x20;al. (2020)</xref> and <xref ref-type="bibr" rid="B29">Marees et&#x20;al. (2018)</xref>. Both provide online tutorials<xref ref-type="fn" rid="fn26">
<sup>26</sup>
</xref>,<xref ref-type="fn" rid="fn27">
<sup>27</sup>
</xref> using either simulated or publicly available data (e.g., HapMap). Many applications have been proposed based on the analysis of PRS and these are discussed and showcased elsewhere, from exploring association of PRS with traits/outcomes, to assessing whether PRS improves existing risk prediction models (<xref ref-type="bibr" rid="B11">Elliott et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B18">Inouye et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B21">Lee et&#x20;al., 2019</xref>; <xref ref-type="bibr" rid="B40">Sun et&#x20;al., 2021</xref>), and investigating causal inference via Mendelian Randomisation (<xref ref-type="bibr" rid="B19">Klarin et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B24">Lewis &#x26; Vassos, 2020</xref>; <xref ref-type="bibr" rid="B46">Wray et&#x20;al., 2021</xref>).</p>
<p>We also concentrated on the UK Biobank imputed data; while the methods we outlined are more generally applicable, our assessment of the available software tools is specific to the BGEN v1.2 format. The UK Biobank is a large-scale, widely used cohort study, and is one of the most comprehensive genetic and health data resources currently available.</p>
<p>While the UK Biobank is launching a Research Analysis Platform (RAP) for online data access, the methods discussed in this paper will still be applicable for users who choose to download the data to work locally rather than incurring computation fees in the cloud. In addition, it is possible that the tools described in this guide may be made available on the platform.</p>
</sec>
</sec>
</body>
<back>
<sec id="s5">
<title>Data Availability Statement</title>
<p>This research has been conducted using the UK Biobank Resource under Application Number 33952. Requests to access the data should be made via application to UK Biobank.</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>JC, LC and XL wrote the manuscript, which was conceived by LC. JC designed the pipeline. JC and XL developed the code and produced the online tutorial.</p>
</sec>
<sec id="s7">
<title>Funding</title>
<p>The UK Biobank study was supported by the Wellcome Trust, Medical Research Council, Department of Health, Scottish government, and Northwest Regional Development Agency. It has also received funding from the Welsh Assembly government and British Heart Foundation. The analyses here were funded by Cancer Research UK (grant no. C16077/A29186), and supported by the Nuffield Department of Population Health, Oxford University.</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<ack>
<p>The authors are grateful to UK Biobank participants and the study team for making the data available. We thank Prof. D. J. Hunter for his advice and support. Computation used the Oxford Biomedical Research Computing (BMRC) facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.</p>
</ack>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<fn id="fn1">
<label>1</label>
<p><ext-link ext-link-type="uri" xlink:href="https://www.pgscatalog.org/">https://www.pgscatalog.org/</ext-link>
</p>
</fn>
<fn id="fn2">
<label>2</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://prsweb.sph.umich.edu:8443/">https://prsweb.sph.umich.edu:8443/</ext-link>
</p>
</fn>
<fn id="fn3">
<label>3</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://2cjenn.github.io/PRS_Pipeline/">https://2cjenn.github.io/PRS_Pipeline/</ext-link>
</p>
</fn>
<fn id="fn4">
<label>4</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://enkre.net/cgi-bin/code/bgen/doc/trunk/doc/wiki/bgenix.md">https://enkre.net/cgi-bin/code/bgen/doc/trunk/doc/wiki/bgenix.md</ext-link>
</p>
</fn>
<fn id="fn5">
<label>5</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://www.well.ox.ac.uk/%7Egav/qctool_v2/">https://www.well.ox.ac.uk/~gav/qctool_v2/</ext-link>
</p>
</fn>
<fn id="fn6">
<label>6</label>
<p><ext-link ext-link-type="uri" xlink:href="https://www.cog-genomics.org/plink/2.0/">https://www.cog-genomics.org/plink/2.0/</ext-link>
</p>
</fn>
<fn id="fn7">
<label>7</label>
<p><ext-link ext-link-type="uri" xlink:href="https://www.pgscatalog.org/downloads/#scoring_columns">https://www.pgscatalog.org/downloads/#scoring_columns</ext-link>
</p>
</fn>
<fn id="fn8">
<label>8</label>
<p><ext-link ext-link-type="uri" xlink:href="https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/affy_data_generation2017.pdf">https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/affy_data_generation2017.pdf</ext-link>
</p>
</fn>
<fn id="fn9">
<label>9</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/ukb_genetic_data_description.txt">https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/ukb_genetic_data_description.txt</ext-link>
</p>
</fn>
<fn id="fn10">
<label>10</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=100313">https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id&#x3d;100313</ext-link>
</p>
</fn>
<fn id="fn11">
<label>11</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://2cjenn.github.io/PRS_Pipeline/#Appendices">https://2cjenn.github.io/PRS_Pipeline/&#x23;Appendices</ext-link>
</p>
</fn>
<fn id="fn12">
<label>12</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=1967">https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id&#x3d;1967</ext-link>
</p>
</fn>
<fn id="fn13">
<label>13</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=22006">https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id&#x3d;22006</ext-link>
</p>
</fn>
<fn id="fn14">
<label>14</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=22020">https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id&#x3d;22020</ext-link>
</p>
</fn>
<fn id="fn15">
<label>15</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=22027">https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id&#x3d;22027</ext-link>
</p>
</fn>
<fn id="fn16">
<label>16</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=22001">https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id&#x3d;22001</ext-link>
</p>
</fn>
<fn id="fn17">
<label>17</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=31">https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id&#x3d;31</ext-link>
</p>
</fn>
<fn id="fn18">
<label>18</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=22009">https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id&#x3d;22009</ext-link>
</p>
</fn>
<fn id="fn19">
<label>19</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=22000">https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id&#x3d;22000</ext-link>
</p>
</fn>
<fn id="fn20">
<label>20</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://2cjenn.github.io/PRS_Pipeline/">https://2cjenn.github.io/PRS_Pipeline/</ext-link>
</p>
</fn>
<fn id="fn21">
<label>21</label>
<p><ext-link ext-link-type="uri" xlink:href="https://www.pgscatalog.org/score/PGS000115/">https://www.pgscatalog.org/score/PGS000115/</ext-link>
</p>
</fn>
<fn id="fn22">
<label>22</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ctsu.ox.ac.uk/crystal/ukb/docs/ukb_genetic_data_description.txt">https://biobank.ctsu.ox.ac.uk/crystal/ukb/docs/ukb_genetic_data_description.txt</ext-link>
</p>
</fn>
<fn id="fn23">
<label>23</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://biobank.ndph.ox.ac.uk/ukb/dset.cgi?id=2142">https://biobank.ndph.ox.ac.uk/ukb/dset.cgi?id&#x3d;2142</ext-link>
</p>
</fn>
<fn id="fn24">
<label>24</label>
<p><ext-link ext-link-type="uri" xlink:href="https://www.pgscatalog.org/score/PGS000511/">https://www.pgscatalog.org/score/PGS000511/</ext-link>
</p>
</fn>
<fn id="fn25">
<label>25</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://2cjenn.github.io/PRS_Pipeline/">https://2cjenn.github.io/PRS_Pipeline/</ext-link>
</p>
</fn>
<fn id="fn26">
<label>26</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://choishingwan.github.io/PRS-Tutorial/">https://choishingwan.github.io/PRS-Tutorial/</ext-link>
</p>
</fn>
<fn id="fn27">
<label>27</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://github.com/MareesAT/GWA_tutorial/">https://github.com/MareesAT/GWA_tutorial/</ext-link>
</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Agerbo</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>P. F.</given-names>
</name>
<name>
<surname>Vilhj&#xe1;lmsson</surname>
<given-names>B. J.</given-names>
</name>
<name>
<surname>Pedersen</surname>
<given-names>C. B.</given-names>
</name>
<name>
<surname>Mors</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>B&#xf8;rglum</surname>
<given-names>A. D.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>Polygenic Risk Score, Parental Socioeconomic Status, Family History of Psychiatric Disorders, and the Risk for Schizophrenia</article-title>. <source>JAMA Psychiatry</source> <volume>72</volume> (<issue>7</issue>), <fpage>635</fpage>&#x2013;<lpage>641</lpage>. <pub-id pub-id-type="doi">10.1001/JAMAPSYCHIATRY.2015.0346</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Band</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Marchini</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). <source>BGEN: A Binary File Format for Imputed Genotype and Haplotype Data</source>. <publisher-name>BioRxiv</publisher-name>, <fpage>1</fpage>&#x2013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1101/308296</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Becker</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Burik</surname>
<given-names>C. A. P.</given-names>
</name>
<name>
<surname>Goldman</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Jayashankar</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Bennett</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Resource Profile and User Guide of the Polygenic Index Repository</article-title>. <source>Nat. Hum. Behav.</source> <volume>5</volume>, <fpage>1744</fpage>&#x2013;<lpage>1758</lpage>. <pub-id pub-id-type="doi">10.1038/s41562-021-01119-3</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bycroft</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Freeman</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Petkova</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Band</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Elliott</surname>
<given-names>L. T.</given-names>
</name>
<name>
<surname>Sharp</surname>
<given-names>K.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>The UK Biobank Resource with Deep Phenotyping and Genomic Data</article-title>. <source>Nature</source> <volume>562</volume> (<issue>7726</issue>), <fpage>203</fpage>&#x2013;<lpage>209</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-018-0579-z</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chang</surname>
<given-names>C. C.</given-names>
</name>
<name>
<surname>Chow</surname>
<given-names>C. C.</given-names>
</name>
<name>
<surname>Tellier</surname>
<given-names>L. C.</given-names>
</name>
<name>
<surname>Vattikuti</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Purcell</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.&#x20;J.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Second-generation PLINK: Rising to the Challenge of Larger and Richer Datasets</article-title>. <source>GigaScience</source> <volume>4</volume> (<issue>1</issue>), <fpage>7</fpage>. <pub-id pub-id-type="doi">10.1186/s13742-015-0047-8</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>L. M.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Garg</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>T. T. T.</given-names>
</name>
<name>
<surname>Pokhvisneva</surname>
<given-names>I.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>PRS-on-Spark (PRSoS): A Novel, Efficient and Flexible Approach for Generating Polygenic Risk Scores</article-title>. <source>BMC Bioinformatics</source> <volume>19</volume> (<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1186/S12859-018-2289-9</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Choi</surname>
<given-names>S. W.</given-names>
</name>
<name>
<surname>Mak</surname>
<given-names>T. S.-H.</given-names>
</name>
<name>
<surname>O&#x2019;Reilly</surname>
<given-names>P. F.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Tutorial: a Guide to Performing Polygenic Risk Score Analyses</article-title>. <source>Nat. Protoc.</source> <volume>15</volume> (<issue>9</issue>), <fpage>2759</fpage>&#x2013;<lpage>2772</lpage>. <pub-id pub-id-type="doi">10.1038/s41596-020-0353-1</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Choi</surname>
<given-names>S. W.</given-names>
</name>
<name>
<surname>O&#x27;Reilly</surname>
<given-names>P. F.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data</article-title>. <source>GigaScience</source> <volume>8</volume> (<issue>7</issue>), <fpage>1</fpage>&#x2013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1093/gigascience/giz082</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cupido</surname>
<given-names>A. J.</given-names>
</name>
<name>
<surname>Tromp</surname>
<given-names>T. R.</given-names>
</name>
<name>
<surname>Hovingh</surname>
<given-names>G. K.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>The Clinical Applicability of Polygenic Risk Scores for LDL-Cholesterol: Considerations, Current Evidence and Future Perspectives</article-title>. <source>Curr. Opin. Lipidol.</source> <volume>32</volume> (<issue>2</issue>), <fpage>112</fpage>&#x2013;<lpage>116</lpage>. <pub-id pub-id-type="doi">10.1097/MOL.0000000000000741</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Duncan</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Gelaye</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Meijsen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ressler</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Feldman</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Analysis of Polygenic Risk Score Usage and Performance in Diverse Human Populations</article-title>. <source>Nat. Commun.</source> <volume>10</volume> (<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1038/s41467-019-11112-0</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Elliott</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bodinier</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Bond</surname>
<given-names>T. A.</given-names>
</name>
<name>
<surname>Chadeau-Hyam</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Evangelou</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Moons</surname>
<given-names>K. G. M.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Predictive Accuracy of a Polygenic Risk Score-Enhanced Prediction Model vs a Clinical Risk Score for Coronary Artery Disease</article-title>. <source>JAMA</source> <volume>323</volume> (<issue>7</issue>), <fpage>636</fpage>&#x2013;<lpage>645</lpage>. <pub-id pub-id-type="doi">10.1001/jama.2019.22241</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fritsche</surname>
<given-names>L. G.</given-names>
</name>
<name>
<surname>Patil</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Beesley</surname>
<given-names>L. J.</given-names>
</name>
<name>
<surname>VandeHaar</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Salvatore</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Cancer PRSweb: An Online Repository with Polygenic Risk Scores for Major Cancer Traits and Their Evaluation in Two Independent Biobanks</article-title>. <source>Am. J.&#x20;Hum. Genet.</source> <volume>107</volume> (<issue>5</issue>), <fpage>815</fpage>&#x2013;<lpage>836</lpage>. <pub-id pub-id-type="doi">10.1016/j.ajhg.2020.08.025</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gajendragadkar</surname>
<given-names>P. R.</given-names>
</name>
<name>
<surname>Von Ende</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Ibrahim</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Valdes-Marquez</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Camm</surname>
<given-names>C. F.</given-names>
</name>
<name>
<surname>Murgia</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Assessment of the Causal Relevance of ECG Parameters for Risk of Atrial Fibrillation: A Mendelian Randomisation Study</article-title>. <source>PLoS Med.</source> <volume>18</volume> (<issue>5</issue>), <fpage>e1003572</fpage>. <pub-id pub-id-type="doi">10.1371/JOURNAL.PMED.1003572</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Graffelman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Moreno</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>The Mid P-Value in Exact Tests for Hardy-Weinberg Equilibrium</article-title>. <source>Stat. Appl. Genet. Mol. Biol.</source> <volume>12</volume> (<issue>4</issue>), <fpage>433</fpage>&#x2013;<lpage>448</lpage>. <pub-id pub-id-type="doi">10.1515/sagmb-2012-0039</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hartwig</surname>
<given-names>F. P.</given-names>
</name>
<name>
<surname>Davies</surname>
<given-names>N. M.</given-names>
</name>
<name>
<surname>Hemani</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Davey Smith</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Two-sample Mendelian Randomization: Avoiding the Downsides of a Powerful, Widely Applicable but Potentially Fallible Technique</article-title>. <source>Int. J.&#x20;Epidemiol.</source> <volume>45</volume>, <fpage>1717</fpage>&#x2013;<lpage>1726</lpage>. <pub-id pub-id-type="doi">10.1093/ije/dyx028</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hinrichs</surname>
<given-names>A. S.</given-names>
</name>
<name>
<surname>Karolchik</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Baertsch</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Barber</surname>
<given-names>G. P.</given-names>
</name>
<name>
<surname>Bejerano</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Clawson</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2006</year>). <article-title>The UCSC Genome Browser Database: Update 2006</article-title>. <source>Nucleic Acids Res.</source> <volume>34</volume>, <fpage>D590</fpage>&#x2013;<lpage>D598</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkj144</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Howie</surname>
<given-names>B. N.</given-names>
</name>
<name>
<surname>Donnelly</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Marchini</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-wide Association Studies</article-title>. <source>PLoS Genet.</source> <volume>5</volume> (<issue>6</issue>), <fpage>e1000529</fpage>. <pub-id pub-id-type="doi">10.1371/JOURNAL.PGEN.1000529</pub-id> </citation>
</ref>
<ref id="B18">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Inouye</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Abraham</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Nelson</surname>
<given-names>C. P.</given-names>
</name>
<name>
<surname>Wood</surname>
<given-names>A. M.</given-names>
</name>
<name>
<surname>Sweeting</surname>
<given-names>M. J.</given-names>
</name>
<name>
<surname>Dudbridge</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <source>Genomic Risk Prediction of Coronary Artery Disease in Nearly 500,000 Adults: Implications for Early Screening and Primary Prevention</source>. <publisher-name>BioRxiv</publisher-name>, <fpage>1</fpage>&#x2013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1101/250712</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Klarin</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Damrauer</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Cho</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Y. V.</given-names>
</name>
<name>
<surname>Teslovich</surname>
<given-names>T. M.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Genetics of Blood Lipids Among &#x223c;300,000&#x20;Multi-Ethnic Participants of the Million Veteran Program</article-title>. <source>Nat. Genet.</source> <volume>50</volume> (<issue>11</issue>), <fpage>1514</fpage>&#x2013;<lpage>1523</lpage>. <pub-id pub-id-type="doi">10.1038/s41588-018-0222-9</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lambert</surname>
<given-names>S. A.</given-names>
</name>
<name>
<surname>Gil</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Jupp</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ritchie</surname>
<given-names>S. C.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Buniello</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>The Polygenic Score Catalog as an Open Database for Reproducibility and Systematic Evaluation</article-title>. <source>Nat. Genet.</source> <volume>53</volume> (<issue>4</issue>), <fpage>420</fpage>&#x2013;<lpage>425</lpage>. <pub-id pub-id-type="doi">10.1038/s41588-021-00783-5</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Mavaddat</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wilcox</surname>
<given-names>A. N.</given-names>
</name>
<name>
<surname>Cunningham</surname>
<given-names>A. P.</given-names>
</name>
<name>
<surname>Carver</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Hartley</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>BOADICEA: a Comprehensive Breast Cancer Risk Prediction Model Incorporating Genetic and Nongenetic Risk Factors</article-title>. <source>Genet. Med.</source> <volume>21</volume> (<issue>8</issue>), <fpage>1708</fpage>&#x2013;<lpage>1718</lpage>. <pub-id pub-id-type="doi">10.1038/s41436-018-0406-9</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>S. H.</given-names>
</name>
<name>
<surname>Goddard</surname>
<given-names>M. E.</given-names>
</name>
<name>
<surname>Wray</surname>
<given-names>N. R.</given-names>
</name>
<name>
<surname>Visscher</surname>
<given-names>P. M.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>A Better Coefficient of Determination for Genetic Profile Analysis</article-title>. <source>Genet. Epidemiol.</source> <volume>36</volume> (<issue>3</issue>), <fpage>214</fpage>&#x2013;<lpage>224</lpage>. <pub-id pub-id-type="doi">10.1002/gepi.21614</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lello</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Raben</surname>
<given-names>T. G.</given-names>
</name>
<name>
<surname>Yong</surname>
<given-names>S. Y.</given-names>
</name>
<name>
<surname>Tellier</surname>
<given-names>L. C. A. M.</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>S. D. H.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Genomic Prediction of 16 Complex Disease Risks Including Heart Attack, Diabetes, Breast and Prostate Cancer</article-title>. <source>Sci. Rep.</source> <volume>9</volume> (<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-019-51258-x</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lewis</surname>
<given-names>C. M.</given-names>
</name>
<name>
<surname>Vassos</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Polygenic Risk Scores: From Research Tools to Clinical Instruments</article-title>. <source>Genome Med.</source> <volume>12</volume> (<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1186/s13073-020-00742-5</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>D. J.</given-names>
</name>
<name>
<surname>Peloso</surname>
<given-names>G. M.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Butterworth</surname>
<given-names>A. S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Mahajan</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Exome-wide Association Study of Plasma Lipids in &#x3e;300,000 Individuals</article-title>. <source>Nat. Genet.</source> <volume>49</volume> (<issue>12</issue>), <fpage>1758</fpage>&#x2013;<lpage>1766</lpage>. <pub-id pub-id-type="doi">10.1038/ng.3977</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Genetic Prediction of Complex Traits with Polygenic Scores: A Statistical Review</article-title>. <source>Trends Genet.</source> <volume>37</volume> (<issue>11</issue>), <fpage>995</fpage>&#x2013;<lpage>1011</lpage>. <pub-id pub-id-type="doi">10.1016/j.tig.2021.06.004</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Manichaikul</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Mychaleckyj</surname>
<given-names>J.&#x20;C.</given-names>
</name>
<name>
<surname>Rich</surname>
<given-names>S. S.</given-names>
</name>
<name>
<surname>Daly</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Sale</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>W.-M.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Robust Relationship Inference in Genome-wide Association Studies</article-title>. <source>Bioinformatics</source> <volume>26</volume> (<issue>22</issue>), <fpage>2867</fpage>&#x2013;<lpage>2873</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btq559</pub-id> </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marchini</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Howie</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Genotype Imputation for Genome-wide Association Studies</article-title>. <source>Nat. Rev. Genet.</source> <volume>11</volume> (<issue>7</issue>), <fpage>499</fpage>&#x2013;<lpage>511</lpage>. <pub-id pub-id-type="doi">10.1038/nrg2796</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marees</surname>
<given-names>A. T.</given-names>
</name>
<name>
<surname>de Kluiver</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Stringer</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Vorspan</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Curis</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Marie-Claire</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>A Tutorial on Conducting Genome-wide Association Studies: Quality Control and Statistical Analysis</article-title>. <source>Int. J.&#x20;Methods Psychiatr. Res.</source> <volume>27</volume> (<issue>2</issue>), <fpage>e1608</fpage>. <pub-id pub-id-type="doi">10.1002/mpr.1608</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mavaddat</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Michailidou</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Dennis</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lush</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Fachal</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Polygenic Risk Scores for Prediction of Breast Cancer and Breast Cancer Subtypes</article-title>. <source>Am. J.&#x20;Hum. Genet.</source> <volume>104</volume> (<issue>1</issue>), <fpage>21</fpage>&#x2013;<lpage>34</lpage>. <pub-id pub-id-type="doi">10.1016/j.ajhg.2018.11.002</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Holmes</surname>
<given-names>M. V.</given-names>
</name>
<name>
<surname>Asselbergs</surname>
<given-names>F. W.</given-names>
</name>
<name>
<surname>Palmer</surname>
<given-names>T. M.</given-names>
</name>
<name>
<surname>Lanktree</surname>
<given-names>M. B.</given-names>
</name>
<name>
<surname>Nelson</surname>
<given-names>C. P.</given-names>
</name>
<name>
<surname>Dale</surname>
<given-names>C. E.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>Mendelian Randomization of Blood Lipids for Coronary Heart Disease</article-title>. <source>Eur. Heart J.</source> <volume>36</volume> (<issue>9</issue>), <fpage>539</fpage>&#x2013;<lpage>550</lpage>. <pub-id pub-id-type="doi">10.1093/eurheartj/eht571</pub-id> </citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pazoki</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Dehghan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Evangelou</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Warren</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Caulfield</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Genetic Predisposition to High Blood Pressure and Lifestyle Factors</article-title>. <source>Circulation</source> <volume>137</volume> (<issue>7</issue>), <fpage>653</fpage>&#x2013;<lpage>661</lpage>. <pub-id pub-id-type="doi">10.1161/CIRCULATIONAHA.117.030898</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Price</surname>
<given-names>A. L.</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>N. J.</given-names>
</name>
<name>
<surname>Plenge</surname>
<given-names>R. M.</given-names>
</name>
<name>
<surname>Weinblatt</surname>
<given-names>M. E.</given-names>
</name>
<name>
<surname>Shadick</surname>
<given-names>N. A.</given-names>
</name>
<name>
<surname>Reich</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Principal Components Analysis Corrects for Stratification in Genome-wide Association Studies</article-title>. <source>Nat. Genet.</source> <volume>38</volume> (<issue>8</issue>), <fpage>904</fpage>&#x2013;<lpage>909</lpage>. <pub-id pub-id-type="doi">10.1038/ng1847</pub-id> </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Price</surname>
<given-names>A. L.</given-names>
</name>
<name>
<surname>Zaitlen</surname>
<given-names>N. A.</given-names>
</name>
<name>
<surname>Reich</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>New Approaches to Population Stratification in Genome-wide Association Studies</article-title>. <source>Nat. Rev. Genet.</source> <volume>11</volume> (<issue>7</issue>), <fpage>459</fpage>&#x2013;<lpage>463</lpage>. <pub-id pub-id-type="doi">10.1038/nrg2813</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Priv&#xe9;</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Vilhj&#xe1;lmsson</surname>
<given-names>B. J.</given-names>
</name>
<name>
<surname>Aschard</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Blum</surname>
<given-names>M. G. B.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Making the Most of Clumping and Thresholding for Polygenic Scores</article-title>. <source>Am. J.&#x20;Hum. Genet.</source> <volume>105</volume> (<issue>6</issue>), <fpage>1213</fpage>&#x2013;<lpage>1221</lpage>. <pub-id pub-id-type="doi">10.1016/j.ajhg.2019.11.001</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reed</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Nunez</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kulp</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Reilly</surname>
<given-names>M. P.</given-names>
</name>
<name>
<surname>Foulkes</surname>
<given-names>A. S.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A Guide to Genome&#x2010;wide Association Analysis and post&#x2010;analytic Interrogation</article-title>. <source>Statist. Med.</source> <volume>34</volume> (<issue>28</issue>), <fpage>3769</fpage>&#x2013;<lpage>3792</lpage>. <pub-id pub-id-type="doi">10.1002/sim.6605</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sawyer</surname>
<given-names>S. L.</given-names>
</name>
<name>
<surname>Mukherjee</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Pakstis</surname>
<given-names>A. J.</given-names>
</name>
<name>
<surname>Feuk</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Kidd</surname>
<given-names>J.&#x20;R.</given-names>
</name>
<name>
<surname>Brookes</surname>
<given-names>A. J.</given-names>
</name>
<etal/>
</person-group> (<year>2005</year>). <article-title>Linkage Disequilibrium Patterns Vary Substantially Among Populations</article-title>. <source>Eur. J.&#x20;Hum. Genet.</source> <volume>13</volume> (<issue>5</issue>), <fpage>677</fpage>&#x2013;<lpage>686</lpage>. <pub-id pub-id-type="doi">10.1038/sj.ejhg.5201368</pub-id> </citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shriner</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Approximate and Exact Tests of Hardy-Weinberg Equilibrium Using Uncertain Genotypes</article-title>. <source>Genet. Epidemiol.</source> <volume>35</volume> (<issue>7</issue>), <fpage>632</fpage>&#x2013;<lpage>637</lpage>. <pub-id pub-id-type="doi">10.1002/gepi.20612</pub-id> </citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shriner</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Impact of Hardy-Weinberg Disequilibrium on post-imputation Quality Control</article-title>. <source>Hum. Genet.</source> <volume>132</volume> (<issue>9</issue>), <fpage>1073</fpage>&#x2013;<lpage>1075</lpage>. <pub-id pub-id-type="doi">10.1007/s00439-013-1336-x</pub-id> </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Pennells</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Kaptoge</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Nelson</surname>
<given-names>C. P.</given-names>
</name>
<name>
<surname>Ritchie</surname>
<given-names>S. C.</given-names>
</name>
<name>
<surname>Abraham</surname>
<given-names>G.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Polygenic Risk Scores in Cardiovascular Risk Prediction: A Cohort Study and Modelling Analyses</article-title>. <source>PLoS Med.</source> <volume>18</volume> (<issue>1</issue>), <fpage>e1003498</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pmed.1003498</pub-id> </citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Trinder</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Francis</surname>
<given-names>G. A.</given-names>
</name>
<name>
<surname>Brunham</surname>
<given-names>L. R.</given-names>
</name>
</person-group> (<year>2020a</year>). <article-title>Association of Monogenic vs Polygenic Hypercholesterolemia with Risk of Atherosclerotic Cardiovascular Disease</article-title>. <source>JAMA Cardiol.</source> <volume>5</volume> (<issue>4</issue>), <fpage>390</fpage>&#x2013;<lpage>399</lpage>. <pub-id pub-id-type="doi">10.1001/jamacardio.2019.5954</pub-id> </citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Trinder</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Walley</surname>
<given-names>K. R.</given-names>
</name>
<name>
<surname>Boyd</surname>
<given-names>J.&#x20;H.</given-names>
</name>
<name>
<surname>Brunham</surname>
<given-names>L. R.</given-names>
</name>
</person-group> (<year>2020b</year>). <article-title>Causal Inference for Genetically Determined Levels of High-Density Lipoprotein Cholesterol and Risk of Infectious Disease</article-title>. <source>Arterioscler. Thromb. Vasc. Biol.</source> <volume>40</volume> (<issue>1</issue>), <fpage>267</fpage>&#x2013;<lpage>278</lpage>. <pub-id pub-id-type="doi">10.1161/ATVBAHA.119.313381</pub-id> </citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Turner</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Armstrong</surname>
<given-names>L. L.</given-names>
</name>
<name>
<surname>Bradford</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Carlson</surname>
<given-names>C. S.</given-names>
</name>
<name>
<surname>Crawford</surname>
<given-names>D. C.</given-names>
</name>
<name>
<surname>Crenshaw</surname>
<given-names>A. T.</given-names>
</name>
<etal/>
</person-group> (<year>2011</year>). <article-title>Quality Control Procedures for Genome-wide Association Studies</article-title>. <source>Curr. Protoc. Hum. Genet.</source> <volume>Chapter 1</volume>, <fpage>Unit 1.19</fpage>. <pub-id pub-id-type="doi">10.1002/0471142905.hg0119s68</pub-id> </citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wand</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lambert</surname>
<given-names>S. A.</given-names>
</name>
<name>
<surname>Tamburro</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Iacocca</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>O&#x2019;Sullivan</surname>
<given-names>J.&#x20;W.</given-names>
</name>
<name>
<surname>Sillari</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Improving Reporting Standards for Polygenic Scores in Risk Prediction Studies</article-title>. <source>Nature</source> <volume>591</volume>, <fpage>211</fpage>&#x2013;<lpage>219</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-021-03243-6</pub-id> </citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wigginton</surname>
<given-names>J.&#x20;E.</given-names>
</name>
<name>
<surname>Cutler</surname>
<given-names>D. J.</given-names>
</name>
<name>
<surname>Abecasis</surname>
<given-names>G. R.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>A Note on Exact Tests of Hardy-Weinberg Equilibrium</article-title>. <source>Am. J.&#x20;Hum. Genet.</source> <volume>76</volume> (<issue>5</issue>), <fpage>887</fpage>&#x2013;<lpage>893</lpage>. <pub-id pub-id-type="doi">10.1086/429864</pub-id> </citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wray</surname>
<given-names>N. R.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Austin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>McGrath</surname>
<given-names>J.&#x20;J.</given-names>
</name>
<name>
<surname>Hickie</surname>
<given-names>I. B.</given-names>
</name>
<name>
<surname>Murray</surname>
<given-names>G. K.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>From Basic Science to Clinical Application of Polygenic Risk Scores</article-title>. <source>JAMA Psychiatry</source> <volume>78</volume> (<issue>1</issue>), <fpage>101</fpage>&#x2013;<lpage>109</lpage>. <pub-id pub-id-type="doi">10.1001/jamapsychiatry.2020.3049</pub-id> </citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zekavat</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Honigberg</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Pirruccello</surname>
<given-names>J.&#x20;P.</given-names>
</name>
<name>
<surname>Kohli</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Karlson</surname>
<given-names>E. W.</given-names>
</name>
<name>
<surname>Newton-Cheh</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Elevated Blood Pressure Increases Pneumonia Risk: Epidemiological Association and Mendelian Randomization in the UK Biobank</article-title>. <source>Med</source> <volume>2</volume> (<issue>2</issue>), <fpage>137</fpage>&#x2013;<lpage>148</lpage>. <comment>e4</comment>. <pub-id pub-id-type="doi">10.1016/j.medj.2020.11.001</pub-id> </citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Jing</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Samuels</surname>
<given-names>D. C.</given-names>
</name>
<name>
<surname>Sheng</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Shyr</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Strategies for Processing and Quality Control of Illumina Genotyping Arrays</article-title>. <source>Brief. Bioinform.</source> <volume>19</volume> (<issue>5</issue>), <fpage>765</fpage>&#x2013;<lpage>775</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbx012</pub-id> </citation>
</ref>
<ref id="B49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zheng</surname>
<given-names>H.-F.</given-names>
</name>
<name>
<surname>Ladouceur</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Greenwood</surname>
<given-names>C. M. T.</given-names>
</name>
<name>
<surname>Richards</surname>
<given-names>J.&#x20;B.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Effect of Genome-wide Genotyping and Reference Panels on Rare Variants Imputation</article-title>. <source>J.&#x20;Genet. Genomics</source> <volume>39</volume> (<issue>10</issue>), <fpage>545</fpage>&#x2013;<lpage>550</lpage>. <pub-id pub-id-type="doi">10.1016/j.jgg.2012.07.002</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>