Genetic Analyses of Common Infections in the Avon Longitudinal Study of Parents and Children Cohort

The burden of infections on an individual and public health is profound. Many observational studies have shown a link between infections and the pathogenesis of disease; however a greater understanding of the role of host genetics is essential. Children from the longitudinal birth cohort, the Avon Longitudinal Study of Parents and Children, had 14 antibodies measured in plasma at age 7: Alpha-casein protein, beta-casein protein, cytomegalovirus, Epstein-Barr virus, feline herpes virus, Helicobacter pylori, herpes simplex virus 1, influenza virus subtype H1N1, influenza virus subtype H3N2, measles virus, Saccharomyces cerevisiae, Theiler’s virus, Toxoplasma gondii, and SAG1 protein domain, a surface antigen of Toxoplasma gondii measured for greater precision. We performed genome-wide association analyses of antibody levels against these 14 infections (N = 357 – 5010) and identified three genome-wide signals (P < 5×10-8), two associated with measles virus antibodies and one with Toxoplasma gondii antibodies. In an association analysis focused on the human leukocyte antigen (HLA) region of the genome, we further detected 15 HLA alleles at a two-digit resolution and 23 HLA alleles at a four-digit resolution associated with five antibodies, with eight HLA alleles associated with Epstein-Barr virus antibodies showing strong evidence of replication in UK Biobank. We discuss how our findings from antibody levels complement other studies using self-reported phenotypes in understanding the architecture of host genetics related to infections.


INTRODUCTION
The individual and public health burden of infectious diseases can be substantial. Beyond the immediate impact of infections, exposure to infections has been associated with the development of noncommunicable diseases such as cardiovascular disease, cancer, and autoimmune disease. For example, common infections such as Epstein-Barr virus has been implicated with nasopharyngeal carcinoma (1) and multiple sclerosis (2,3); Helicobacter pylori has been linked to myocardial infarction (4,5) and ischaemic heart disease (6); and cytomegalovirus has been shown to have a role in the development of atherosclerosis (7). To date, examples of large genome-wide association studies (GWAS) of infections have been performed in COVID-19 Host Genetics investigating people with measured SARS-CoV-2 infection (8), 23andMe examining common infections using retrospective self-reporting (9), UK Biobank using serological measurements of antibody response and seropositivity to antigens (10), and similarly in the Rotterdam Study and Study of Health in Pomerania cohorts using serological measures for people infected with Helicobacter pylori (11). These differences in defining infection are important to note for their merits and limitations. For example, serological infection studies (8,(10)(11)(12) tend to use a continuous measure of antibody levels. Variations in antibody levels can be illustrated graphically by plotting the distribution of antibody responses across individual, and there is an inherent difficulty in determining binary infection status. Higher antibody levels could plausibly be due to response to infection, or naturally high levels in unexposed individuals. In contrast, low antibody levels could indicate no infection, or poor or waning antibody response to infection in exposed individuals. Serological measurements can also be limited by the lack of specificity of antibody response to the antigen of interest. Self-report infection studies (9,13) can also be limited by inaccurate infection measurements as people might erroneously think they have had an infection or be unaware that they had been infected. If the measurement error in this case is random then that will dilute the statistical power of the study. However, if study participants are mistaking one infection for another infection, then this can introduce bias and heterogeneity into the study. These self-report genome-wide association studies have contributed to providing insight into host susceptibility to infection and the interplay between immune response and host-pathogen interactions (14), but it is important to triangulate their results against contrasting designs such as objective serological measurements in order to determine reliability.
In the current study, we expand on previous GWAS of serological infection measures by investigating the genetic architecture of antibodies against 14 infections in children using the Avon Longitudinal Study of Parents and Children (ALSPAC) cohort by: (1) Investigating the options for analysing antibody titers to model antibody levels in terms of whether using measures as a continuous variable or thresholding to define infection status is most appropriate; (2) Identifying genome-wide single nucleotide polymorphisms (SNPs) strongly associated with the antibodies; (3) Identifying HLA alleles strongly associated with the antibodies; (4) Assessing consistency of genetic signals associated with the antibodies at different time points in ALSPAC and in an independent cohort (UK Biobank). (5) Evaluating whether any of the genetic signals identified overlap with SARS-CoV-2 infection (COVID-19 Host Genetics Initiative). This study extends the scope of understanding infection susceptibility and host genetics by increasing the range of infections examined and investigates infections in children.

Study Sample
Pregnant women resident in Avon, UK with expected dates of delivery 1st April 1991 to 31st December 1992 were invited to take part in the ALSPAC study (15,16). The initial number of pregnancies enrolled is 14,541 (for these at least one questionnaire has been returned or a "Children in Focus" clinic had been attended by 19/07/99). Of these initial pregnancies, there was a total of 14,676 foetuses, resulting in 14,062 live births and 13,988 children who were alive at one year of age.
When the oldest children were approximately seven years of age, an attempt was made to bolster the initial sample with eligible cases who had failed to join the study originally. As a result, when considering variables collected from the age of seven onwards (and potentially abstracted from obstetric notes) there are data available for more than the 14,541 pregnancies mentioned above. The number of new pregnancies not in the initial sample (known as Phase I enrolment) that are currently represented on the built files and reflecting enrolment status at the age of 24 is 913 (456, 262 and 195 recruited during Phases II, III and IV respectively), resulting in an additional 913 children being enrolled. The phases of enrolment are described in more detail in the cohort profile paper and its update (15,16). The total sample size for analyses using any data collected after the age of seven is therefore 15,454 pregnancies, resulting in 15,589 foetuses. Of these 14,901 were alive at one year of age.
A 10% sample of the ALSPAC cohort, known as the Children in Focus (CiF) group, attended clinics at the University of Bristol at various time intervals between 4 to 61 months of age. The CiF group were chosen at random from the last 6 months of ALSPAC births (1432 families attended at least one clinic). Excluded were those mothers who had moved out of the area or were lost to follow-up, and those partaking in another study of infant development in Avon.
The study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool (http://www.bristol.ac.uk/alspac/researchers/ourdata/). Ethical approval for the study was obtained from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees. Consent for biological samples has been collected in accordance with the Human Tissue Act (2004).

Phenotype Measurement
ALSPAC children were invited to participate in clinical assessments at around seven years of age ("Focus @ 7") between September 1998 and October 2000 (17). Attendees at this clinic were invited to give blood samples. Additional blood samples were also available from the "Children in Focus" (CiF) group, a 10% randomly selected subset of ALSPAC children at around five years of age between September 1998 and October 2000, as well as "Focus @ 11+" at 11 years collected between January 2003 and January 2005, and "TeenFocus 3" at 15 years collected between October 2006 and November 2008 (17). Antibodies against 14 infections were assessed at these four clinical time points.
Whole blood samples were collected and processed by centrifugation at 3500rpm, for 10 minutes at 4-5°C (17). Subsequently, the plasma fraction from the whole blood was aliquoted out and temporarily stored at -20°C before being stored long term at -70/80°C. In preparation for analysis, EDTA plasma samples were plated out into 96-well plates. Enzyme-linked immunosorbent assay (ELISA) was performed using a specific antigen for each infection of interest to measure IgG antibody levels or IgA antibody levels, in the case for Sacchraromyces cerevisiae.
All target antigens and source of target antigens for each infection of interest have been detailed elsewhere (17). Serum antibodies were measured using solid-phase enzyme immunoassay protocols derived from Dickerson et al., 2013 (18). Briefly, assays were performed using microtiter plates coated with antigens immobilized onto a solid-phase surface reacted sequentially with diluted aliquots of participant and standard control serum samples, enzyme-labelled anti-human IgG and lastly enzyme substate, with plate washing separating each step. Positive and negative controls were run on each assay plate and the source of antigen varied by assay. The standard control serum samples were defined by low but detectable levels of antibodies to the antigen of interest. The amount of colour that was generated by the reaction between the soluble substrate and the antigen-bound enzyme was quantitated using a microplate colorimeter at a 450nm wavelength. Assays for the same antigen were standardized by using different microtiter plates and generating standard curves from standard samples run on each assay plate. Antibody level was expressed as a ratio of optical density of the tested sample to that of the standard sample.
For each antibody, measurements of optical density were read directly from the ELISA plate. The ratio to standards were then derived from the standards measured on each plate. These ratios were then standardised to produce z-scores with a mean of two and a standard deviation of one per plate (Ratio to standard minus the mean ratio to standard then divided by the standard deviation per plate, plus two) (17). In addition, the data were further transformed using the rank-based inverse normal approach to generate normal distributions.
Not all participants were measured for all 14 antibodies at every clinical time point. The antibody levels for each antigen at the sevenyear time point were analysed in the primary analysis as they represented the largest sample size, and the distribution of z-scores for each infection did not largely differ across time points.
Genotyping and Imputation ALSPAC children were genotyped by 23andMe using the Illumina HumanHap550 quad chip genome-wide SNP genotyping (Illumina, Inc., San Diego, CA) subcontracting from the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. Genotypes were called with Illumina GenomeStudio and PLINK (v1.07) (19) was used to perform quality control measures. These measures were performed on an initial set of 9,912 individuals, including children who participated at the Focus @ 7 clinic, and 609,203 directly genotyped SNPs. Individuals were removed from further analysis if they had extreme autosomal heterozygosity, >3% missingness, undetermined X chromosome heterozygosity, and insufficient sample replication [<0.8 identityby-descent (IBD)]. In addition, population stratification was assessed using multidimensional scaling of genome-wide identity-by-state pairwise distances using HapMap v2 (release 22) European (CEU), Han Chinese (CHB), Japanese (JPT) and Yoruba (YRI) populations as references, with individuals with non-European ancestry excluded. SNPs were removed if they had a minor allele frequency less than 1%, a call rate of <95%, or displayed a Hardy-Weinberg equilibrium P value of less than 5×10 -7 . Cryptic relatedness was assessed as the proportion of identity by descent (IBD >0.1) described previously (20).
After quality control steps, a total of 9,115 individuals and 500,527 SNPs passed these filters. ALSPAC children were phased using ShapeIt v2 (21) to phase the HRC panel (39,235,157 SNPs). Genotype imputation was performed with Michigan Imputation Server (22) using the Haplotype Reference Consortium (HRCr1.1) panel of approximately 31,00 phased whole genomes.
Two software packages, HLA*IMP:03 (23) and SNP2HLA (24), were used to impute classical human leukocyte antigens for within the major histocompatibility complex (MHC) located on chromosome six with a range of approximately 4 Mb. A comparison of the imputations was performed by comparing the allele dosages obtained for classical two-and four-digit resolution HLA alleles (for HLA-A, HLA-B, HLA-C, HLA-DPB1, HLA-DRB1) and subsequent association analyses using antibodies measured in ALSPAC children.
For HLA*IMP:03 (v.0.1.0) (23), HLA imputation was performed using its online imputation service. Prior to upload into the automated online imputation service, ALSPAC genotype data was converted to phased haplotype files in five batches due to memory size limitations. Outputs from the online service were converted into a dosage format, similar to the SNP2HLA dosage output, for association testing by calculating the expected dosage of each allele. This was performed on each individual by summing the two posterior probabilities calculated for a given classical allele.
Concordance of HLA imputation using the two packages were compared by assessing the difference in dosages by providing a call threshold, T, to determine concordance (-0.5 ≤ T ≤ 0.5) or discordance (-0.5 ≥ T ≥ 0.5) between the two approaches. The proportion of concordant and discordant HLA alleles at each given HLA loci were then tabulated. In addition, allele frequencies for each HLA loci were calculated for each HLA imputation approach, and values were compared to the T1DGC reference panel.

Statistical Analysis
Comparison of Treating Antibody Level Data as Continuous or Thresholded Variables It would be desirable to be able to stratify individuals into seropositive and seronegative for each antibody analysed. However, determining where to call the threshold for continuous antibody data is complex due to the unavailability of reliable assays on reliably determined controls to establish uninfected population baseline distributions ( Figure 1). For example, cases could be defined as: (1) Infected, if it is assumed that everyone who is exposed produces an antibody response; (2) High responders, which includes only a subset of people exposed, and controls are a mixture of low responders and unexposed; (3) A mixture of exposed and unexposed individuals if background levels of unexposed vary substantially. We sought to investigate this challenge by examining whether we could estimate thresholds that best represented seropositivity using several approaches, (1) Ratio to antigen/plate standards ≥ 1, as suggested by Barnes et al., 2015 (21); (2) A subjective threshold based on the selecting the trough between two obvious peaks where observed for the ratio to standard distributions; and two methods that used the results of genetic association analyses to determine the best threshold: (3) The threshold that exhibited the most signal for each antibody (as assessed by GWAS Q-Q plots of seven arbitrary thresholds: 5% cases 95% control; 10% cases 90% control; 25% cases 75% control; 50% cases 50% control; 75% cases 25% control; 90% cases 10% control; 95% cases 5% control), and (4) The threshold that best recapitulated GWAS results (at a suggestive p-value of 1x10 -6 ) from an analysis where the antibody measures were treated as a continuous variable. (NB -This method was selected in place of performing genetic correlation analyses as genetic correlations could not distinguish between thresholds, and many antibodies lacked the sample size requirements to perform the analyses).

Association Testing
Genome-wide association analyses were performed using PLINK (v2.0) (www.cog-genomics.org/plink/2.0/) (25) for each of the 14 ALSPAC antibodies using rank-based inverse normal transformed continuous z-score measures at age seven. In addition, logistic regression was performed applying the various thresholds previously described, for the 14 ALSPAC antibodies at age seven. In addition, linear regression was implemented using PLINK (v2.0) (www.cog-genomics.org/ plink/2.0/) (25) on all four ALSPAC clinical time points for all measured antibodies, and these continuous z-score measures were transformed using rank-based inverse normal transformation to make them comparable. Furthermore, stratified association analyses was performed based on measles, mumps and rubella vaccination status using antibodies against measles virus of children at the seven-year clinic. All analyses were carried out using unrelated individuals (IBD < 0.1, corresponds to a relatedness at a first-cousin level) that had available phenotypic and genetic data. Analyses were adjusted for age, sex and the first 10 principal components, and assumed an additive model for genetic effects. SNPs were removed from analyses if they had a minor allele frequency less than 1% and imputation quality (INFO) score of less than 0.8. Genome-wide results were clumped to identify independent loci (using an r2 threshold of 0.1 and a distance of 250 kB in PLINK's clumping procedure). Genome-wide significance was considered to be a threshold of P < 5 × 10 -8 , and we also considered a suggestive threshold of P < 1 × 10 -6 .
Association testing of the HLA alleles were performed using the dosage outputs from both imputation methods, HLA*IMP:03 (23) and SNP2HLA (24), for each of the 14 antibodies measured in ALSPAC also using the rank-based inverse normal continuous z-score variables from age 7. To aid comparison of methods, classical HLA alleles were only included with the binary markers indicating the presence of the HLA allele or absence of the HLA allele. Probabilistic genotypes were used for each marker to account for uncertainty in imputation. Linear regression Blue represents the distribution of uninfected individuals; Red represents the distribution of infected individuals. The green, orange, and purple dashed lines represent cut-off points. This figure depicts the theoretical overlap in antibody distribution of uninfected individuals (represented in blue) and infected individuals (represented in red), and the coloured dashed lines denotes how cut-offs may be selected. For example, the green threshold would include all infected individuals as cases, but also include a proportion of uninfected individuals as cases. Alternatively, the purple threshold would result in no uninfected individuals included as cases, but miss some that are infected. The orange threshold may be ideal as it would minimise the proportion of individuals misclassified as either cases or controls. However, the true underlying two distributions are not separately observed and so can only ever be estimated from the overall distribution.
analyses was performed using PLINK (v1.90) (www.coggenomics.org/plink/1.9/) (25) and adjusted for age, sex and the first 10 principal components. To calculate the HLA-wide multiple-testing correction threshold, the Type 1 Diabetes Genetics Consortium (N = 5,225) genetic data was used to produce a correlation matrix of all HLA alleles and MatSpDlite (http://neurogenetics.qimrberghofer.edu.au/matSpDlite/) (26) was used to determine the effective number of HLA alleles. Furthermore, Equation five from Li and Ji et al., 2005 (27) was employed to determine the HLA-wide correction threshold required to keep the Type 1 error rate at 5%.

UK Biobank
Comprehensive description of UK Biobank's cohort profile, serological measures and analyses performed are available in Supplementary Material. In brief, antibodies against five infections were available for replication: Cytomegalovirus, Epstein-Barr virus, herpes simplex virus 1, Helicobacter pylori, and Toxoplasma gondii. Analyses were performed using the initial assessment time point for all antigens (N = 9,430). Antibody measurements were rank-based inverse normal transformed for all GWAS and HLA association analyses and beta estimates presented are on this scale. To give an idea of the magnitude of these effects, the main results are also presented using the z-scores of the ratio-to-standard measures, where beta estimates are a change in standard deviation per allele.

COVID-19 Host Genetics Initiative
To investigate genetic overlap, ALSPAC antibody association results and the SARS-CoV-2 GWAS meta-analysis findings (C2_ALL_eur) (N = 1,683,784) (8) were compared by performing a look-up of genome-wide and suggestive SNPs from ALSPAC, and genome-wide SNPs from the SARS-CoV-2 GWAS meta-analyses. As the association analyses are presented on different scales, only direction of association and P-values were compared.

Descriptive Characteristics
Distribution of the ratio to standards of the 14 antibodies are shown in Figure 2. Six of the 14 infections had a sample size >1000 individuals (Cytomegalovirus, Epstein-Barr virus, feline herpes virus, Helicobacter pylori, Toxoplasma gondii, SAG1 protein domain), and eight had <700 individuals (Alpha-casein protein, Beta-casein protein, herpes simplex virus 1, influenza virus subtype H1N1, influenza virus subtype H3N2, measles virus, Saccharomyces cerevisiae) ( Table 1). Differences in sample size were because not all individuals were measured for all 14 antibodies. Two of the 14 antibodies showed bimodal z-score distributions (Cytomegalovirus and Epstein-Barr virus, Figure 2), and all antibody distributions showed a strong positive skew. Participants were on average 90 months old (Range: 82 -113 months) when measures were taken, and approximately half (47-52% dependent on infection) were female (17).

Antibody Thresholds
We explored the utility of analysing the 14 antibodies as continuous or thresholded variables in ALSPAC children. Figure 2 and Table 1 shows summary results of each of the tested thresholding approaches (comprehensive results shown in Supplementary Table 1 and Supplementary Figure 1).
Four threshold approaches (as described in the methods) were used: (1) Ratio to standards greater than or equal to 1; (2) A subjective threshold based visually identifying the trough in distributions that were clearly bimodal; (3) The threshold which demonstrated the most signal in a GWAS analysis; (4) The threshold that best recapitulated continuous GWAS results. These were also compared to proportions of seroprevalence measures reported in the literature ( Table 1).
Thresholds suggested by the four methods were inconsistent for all antibody measures and the proportions of seropositive individuals according to the ratio to standards thresholds were not comparable to published seroprevalences in the United Kingdom for each infection (Table 1), with the exception of cytomegalovirus and Epstein-Barr virus. Only two measured antibodies had obvious bimodal distributions (Cytomegalovirus and Epstein-Barr virus, Table 1 and Figure 2), and the suggested antibody thresholds overlap with the published United Kingdom seroprevalences (17% versus 15-25% for cytomegalovirus, and 36% versus 35-54% for Epstein-Barr virus) ( Table 1). Seven of the 14 antibodies had most GWAS signals with the rank-based inverse normal transformed continuous measures, as opposed to any of the thresholds ( Table 1 and Supplementary Figure 1). Where a thresholded GWAS gave more signal, the suggested thresholds were again inconsistent with the other approaches and published seroprevalences, with the exception of cytomegalovirus that demonstrated a slight overlap between thresholded GWAS with the most signal and published seroprevalence (25-95% versus 15-25%) ( Table 1 and Supplementary Figure 1).
We compared the top SNP associations (P < 1.0×10 -6 ) from the continuous GWASs across the thresholded GWASs for each specific antibody, to see if any thresholds gave similar association results on an individual locus basis. The thresholds which gave the most consistent results (to continuous) are shown in Table 1. Again, the suggested thresholds were rarely consistent with any of the other thresholding approaches. The inconsistencies observed between the various methods for selecting a threshold, and the fact that most had more genetic signal when a continuous measure was used, suggest that selecting a threshold to define seropositivity might not be the best way to analyse the data. Therefore, in subsequent analyses we use antibody measures as continuous variables.

Genome-Wide Association Analyses of Antigen-Specific Antibodies
We performed genome-wide association analyses of all continuous antibody measures at the seven-year clinic in ALSPAC and identified a total of three SNPs with strong associations (P < 5 × 10 -8 ) to two antibodies (Measles virus or T. gondii) and 26 suggestive SNPs (P < 1 × 10 -6 ) associated  Tables 2, 3, respectively, with Manhattan plots and Q-Q plots illustrated in Supplementary Figure 1 and Figure 2. Furthermore, one association, between rs36020612 and the T. gondii antibody, was retained after adjusting for the 14 antibodies tested (P < 3.6×10 -9 ). The strongest association for measles virus was rs506576 (EA = T, P = 1.0×10 -8 , b = 0.49, 95% CI = 0.33, 0.66), an intergenic SNP located between LOC105370271 and SCEL on chromosome 13q22.3. Followed by rs28617484 (EA = C, P = 3.5×10 -8 , b = -0.51, 95% CI = -0.69, -0.33), also an intergenic SNP, between LINC01920 and LINC02612 on chromosome 2q23.3. In addition, stratified analysis by measles, mumps and rubella (MMR) vaccination status was performed on the genome-wide and suggestive SNPs. Findings in Table 4 demonstrated that children that had been vaccinated did not contribute largely to the genetic signals identified (P > 0.001), with environmental exposure of the measles virus showing greater genetic signal (P > 2.6×10 -6 ).
We then assessed the consistency of genetic signals from the seven-year clinic at the five-year clinic, 11-year clinic, and the 15year clinic time points. In summary, six SNPs showed good evidence for association at other timepoints (P < 4.3 ×10 -4 , adjusted for the three additional timepoints and the 29 associations tested), with all estimates showing consistency in the direction of effects ( Table 5, with comprehensive results of all antibodies shown in Supplementary Table 2).
The strongest associations at different time points were demonstrated between rs36020612 (GRAMD1B) and antibodies against T.gondii, with effect sizes similar to that observed at the 7-year clinic (b = 0.20) also seen at the 11-year clinic (b = 0.20, P = 1.6 × 10 -9 ) and the 15-year clinic (b = 0.21, P = 3.4 × 10 -8 ), but this association was not seen at the earlier five-year clinic (P = 0.45).
Furthermore, rs35030589 was shown to be associated with antibodies against H. pylori at the seven-year clinic (b = -0.17) and the eleven-year clinic (b = -0.15, P = 1.9×10 -4 ), with comparable effect sizes shown at the two time points. These associations however were not able to be replicated at the fiveyear and 15-year clinic.
For antibodies against herpes simplex virus, rs10961934 showed the strongest associations at different timepoints. The effect size at the seven-year clinic (b = 0.53) were similarly demonstrated at the 11-year clinic (b = 0.48, P = 5.6×10 -5 ) and the 15-year clinic (b = 0.53, P = 1.4×10 -4 ), although this association was not observed at the five-year clinic (P < 0.07).
Lastly, replication of rs1634083 associated with antibodies against measles virus was shown at the 11-year clinic (b = 0.29, P = 8.9 × 10 -5 ), and the effect size was similar to the seven-year clinic (b = 0.33). Neither of the SNPs that associated strongly with measles showed good evidence for association at other timepoints, with the effect appearing to attenuate with age beyond seven-years.
Furthermore, we attempted to replicate the associations identified in the ALSPAC seven-year clinic in an independent  (Table 6). One association, between rs186721582 and antibodies against EBV showed strong evidence of replication (EA = G, P = 7.3×10 -10 , b = -0.09, 95% CI = 0.06, 0.12), an intronic variant located in genes SUGCT and LOC105375245. However, the effect size was in the opposite direction.
We also conducted the analysis in the opposite direction, with a GWAS in UK Biobank as a discovery cohort and ALSPAC sevenyear clinic as replication. Five antibodies in UK Biobank overlapped with ALSPAC (Cytomegalovirus, Epstein-Barr virus, Helicobacter pylori, herpes simplex virus 1, and Toxoplasma gondii), and findings are shown in Table 7. In total, 12 associations met the genome-wide threshold (P < 5×10 -8 ) with Epstein-Barr virus (10 associations),   herpes simplex virus 1 (one association) and T. gondii (one association) in UK Biobank ( Table 7). In ALSPAC, five of the associations with antibodies against Epstein-Barr virus replicated (P < 0.001): rs9264759, rs1043620, rs2523502, rs3117139 and rs3096695, with effects in the same direction compared to UK Biobank, with the exception of rs1043620 in which the effect was in the opposite direction ( Table 7). The remaining two EBV antibody associations did not replicate (P > 0.485). In addition, three EBV antibody associations were not able to be replicated as no suitable LD proxy SNP was identified (R 2 < 0.6) in ALSPAC, and similarly none of the associations with antibodies against herpes simplex virus 1 or T. gondii were able to be replicated.

Evaluation of Genetic Overlap of ALSPAC Antibodies With Published SARS-CoV-2 GWAS Meta-Analyses
We performed a lookup of genetic signals identified to be genomewide and suggestively associated with the ALSPAC seven-year clinic antibodies ( Tables 2, 3 Table 4) and only 11 SNPs exhibited the same direction of effect.  We also evaluated the seven SNPs identified as strongly associated with SARS-CoV-2 antibodies (8) in the ALSPAC seven-year clinic antibody GWASs. However, there was little evidence for associations between these seven SNPs and any of the 14 antibodies tested in ALSPAC (only three of 98 tested had P < 0.05, in line with chance expectations (Supplementary Table 5).

Association Between HLA Alleles and Antigen Specific Antibodies
Imputation of the HLA region was conducted in ALSPAC using two methods (HLA : IMP*03 and SNP2HLA). We found very high concordance between posterior probabilities of imputed genotypes between the two methods at two-and four-digit resolutions (Supplementary Tables 6, 7). In addition, very high concordance was shown between allele frequencies for HLA loci measured using the two approaches and compared to the Type 1 Diabetes Genetics Consortium (T1DGC) HLA reference panel (Supplementary Tables 8, 9).
We performed HLA association analyses for 14 antibodies at two-and four-digit resolutions using both HLA imputation methods in ALSPAC. The associations that exceeded the HLAwide corrected -value thresholds at a 2-digit resolution (P < 6.2 × 10 -4 ) and at a 4-digit resolution (P < 5.0 × 10 -4 ) are shown in Table 8 (Full results and Q-Q plots illustrated in Supplementary  Tables 10, 11 and Supplementary Figures 3, 4). Associations for five antibodies with 15 HLA alleles were identified at a two-digit resolution ( Table 8), and four antibodies with 23 HLA alleles at a four-digit resolution ( Table 8). Findings using the two HLA imputation methods, HLA : IMP*03 and SNP2HLA, produced similar results (Supplementary Tables 10, 11).
We then attempted replication of the ALSPAC HLA associations in UK Biobank (Supplementary Methods) for EBV and H. pylori antibodies (other infections were not available). All eight Epstein Barr virus antibodies associations showed strong replication signals with the same direction of effect, with P <1×10 -9 . However, the four associations with H. pylori did not replicate ( Table 8).
We also performed HLA-wide association testing in UK Biobank with replication in ALSPAC for five overlapping antibodies: Cytomegalovirus, Epstein-Barr virus, H. pylori, herpes simplex virus 1, and SAG1 protein domain. Full UK Biobank results are shown in Supplementary Tables 12, 13 and Q-Q plots are illustrated in Supplementary Figure 5. In summary, 22 HLA associations were identified with four antibodies at a 2-digit resolution (P < 7.1×10 -5 ), and 17 HLA associations were identified with five antibodies at a four-digit resolution (P < 2.8×10 -5 ) in UK Biobank (Supplementary  Table 14). Four HLA alleles at a 2-digit resolution (P < 0.003) and five HLA alleles at a four-digit resolution (P < 0.003) associated with Epstein-Barr virus antibodies replicated in ALSPAC, with all effect estimates consistently showing the same direction of effects (Supplementary Table 14).

DISCUSSION
In this study, we investigated host genetic contributions to measured antibodies in the ALSPAC cohort. In determining how to analyse antibodies, we found little evidence to support thresholding in order to categorise individuals into seropositive and seronegative. Our analysis suggested that analysing the antibody data as continuous variables is more appropriate.
point. Associations linking these loci with their respective antibodies have not been previously reported. Assessing the 26 associations that met a suggestive threshold (P < 1 × 10 -6 ), the association between rs186721582 and Epstein-Barr virus antibodies replicated in UK Biobank. The associations between rs36020612 and T. gondii did not replicate (P > 5.9 × 10 -4 ) and antibody data on the measles virus was not available. Stratifying individuals by MMR vaccine status for SNPs associated with antibodies against measles virus at the seven-year clinic showed stronger signal for individuals with an unknown status (Table 4), with comparable distributions between status. This suggests that antibody response to measles virus is more likely through environmental exposure or could be attributable to issues when using ELISAs such as high background and weak signal intensity leading to stochastic antibody levels, or potentially due to the larger sample size in the 'unknown' group. Replication of the SNPs at several clinical time points in ALSPAC (i.e. five, seven-, 11-, and 15years) showed no associations observed at age five, but several associations persisted beyond age seven (FHV, H. pylori, HSV1, measles virus, T. gondii). For FHV antibodies, H. pylori and measles virus, the observed associations disappeared at the 15year time point, which could be attributable to the reduction in sample size, lack of phenotypic variance, and possible false-positive associations identified at the seven-year time point.
Of particular interest in the main ALSPAC GWAS, was the association between rs36020612 located on GRAMD1B with antibodies against T. gondii ( Table 2). Expression of this gene has been related to lymphocyte, erythrocyte and antibody traits which could suggest a link to immune response. Specifically, chronic lymphocytic leukaemia (36,(38)(39)(40), lymphocyte TABLE 8 | HLA alleles at a 2-digit resolution (P < 6.2×10 -4 ) and 4-digit resolution (P < 5.0×10 -4 ) associated with antibodies measured at the ALSPAC seven-year clinic and replication in UK Biobank (P < 0.004). percentage and count (41), mean corpuscular haemoglobin concentration (42,43), erythrocyte distribution width (42,44), blood protein levels (45), and immunoglobulin M (IgM) antibody levels (35,46). However, while this association persisted at later time points in ALSPAC, it did not replicate in the UK Biobank. In addition, rs36020612 was not shown to be associated with antibodies for SAG1 protein domain (P = 0.78), an antigen of T. gondii¸with the effect size in the opposite direction. This lack of association could be due to the difference in statistical power contributing to the sample size of SAG1 protein domain (N = 1228) compared to T. gondii (N = 5010). We also specifically investigated associations between the major histocompatibility complex (MHC) class II region in ALSPAC and identified 20 associations of interest with antibodies against beta-casein protein, Epstein-Barr virus, feline herpes virus, H. pylori, and S. cerevisiae ( Table 8).
Associations with antibodies against Epstein-Barr virus again replicated in UK Biobank.
The strongest association we observed in the HLA region in ALSPAC and UK Biobank was between HLA-DRB1*15:01 and Epstein-Barr virus antibodies, which has been previously reported (47) and is a well-established genetic risk factor associated with increased risk for MS (48)(49)(50). HLA-DRB1:15*01 positive humanized mice with EBV infection were similarly shown to demonstrate dysregulation in immune response through attenuated CD4 + and CD8 + T cell activation and decreased efficiency to control EBV viral loads (51). Of the other five 4-digit HLA alleles that showed strong association with EBV, HLA-DRB5*01:01 and HLA : DQB1*06:02 are in strong linkage disequilibrium with the previously discussed HLA : DRB1*15:01 (52). HLA-B*07:02 has also been investigated for its association with EBV and the latent-specific CD8 + T cell response as well as it link to MS (53,54). However, no current literature has researched HLA-C*07:02 and HLA-DRB5*99:01 in respect to their relationship with EBV.
Four HLA alleles at a four-digit resolution were associated with H. pylori antibodies, with HLA-DQB1*05:01 showing the strongest association followed by HLA-DQA1*01:01, HLA-DRB1*01:01, and lastly HLA-DQB1*03:01. All four HLA alleles were unable to be replicated in UK Biobank (P < 5.0 × 10 -4 ) which could possibly be a result of false-positive associations in ALSPAC at the seven-year clinic, and lack of phenotypic variance. To date, no studies have investigated the association of these HLA alleles and H. pylori in individuals of European ancestry. Studies using other ancestral populations have shown that HLA-DQB1*05:01 strongly increased the risk of gastric cancer in individuals of Mexican descent (55), and similarly in a Chinese population HLA-DRB1*01 was demonstrated to strongly increase in the risk of gastric cancer in H. pylori-positive individuals (56). This type of cancer is commonly associated with H. pylori infection, which is the is the strongest known risk factor for the development of this disease (57,58). At present, no literature has not shown a relationship between both HLA-DQA1*01:01 and HLA-DQB1*03:01 with H. pylori.
There are several limitations to this study. Firstly, the antibody levels used for each antigen of interest could be limited by their indication of infection as varying levels of antibody response could be indicative of actual antibody response to the antigen of interest, cross-reactivity of unrelated antibodies, naturally high antibody levels, low antibody levels due to poor antibody response to infection, or low antibody levels as a result of no infection. In particular, in the case of FHV and Theiler's virus, the issue of cross-reactivity of unrelated antibodies is plausible as these viruses are not known to be zoonotic. Secondly, we were unable to identify all published seroprevalences of our infections of interest in Table 1 that were comparable to children of approximately seven-years old in the United Kingdom around 1998-2000, in order for similar comparisons to the time period and geographical location that blood samples were obtained from the ALSPAC children at seven-years of age. Only two seroprevalences, CMV and EBV, were comparable, while published seroprevalences were not identified for six infections, and the remaining infections were identified from adult populations or national records. Thirdly, the potential limited exposure to the infections of interest in the ALSPAC sample of children as a result of their young age may have reduced power to detect associations, as viruses such as cytomegalovirus, Epstein-Barr virus and herpes simplex virus 1 show an increase in prevalence into adulthood (28,(59)(60)(61). We attempted to address this issue by assessing genetic signals at later clinical time points, and replicate the findings in the adult cohort, UK Biobank. Fourthly, the poor evidence in some instances of genetic signal replication at different timepoints in the ALSPAC data may be owing to the small sample sizes of some of the measured antibodies. Eight antigens had a sample size of less than 700 individuals at the seven-year clinic which is the time point with the largest sample size for all measured antibodies. As a result, identified top hits associated with antibodies against these infections may potentially be false positive findings. An approach to maximise sample size could have been to keep all individuals with antibody levels measured against infections at all clinical time points and remove any duplicate individuals retaining only individuals with antibody levels measured at the latest time point. This approach could increase potential cases as older individuals may be more likely, by virtue of time, to become infected through environmental exposure, however this approach could potentially introduce heterogeneity. In addition, not all infections of interest had measured antibodies in UK Biobank to test for replication, and furthermore there were differences in assays used to measure antibody response between cohorts. Lastly, the interpretation of the genetic signals is difficult to disentangle using association studies as these signals could be related to susceptibility to infection and persistence of the antibodies. For example, studies have shown a reduction in antibody levels to influenza virus subtype H1N1 a year to 18 months after initial vaccination (62)(63)(64)(65). In contrast, the two-dose vaccination programme of the measles virus has been demonstrated to produce a sustained protective antibody persistence (66)(67)(68). In addition, if a genetic effect is shown to be associated with higher antibody levels, is the interpretation that it influences the chance of being infected, or that it influences the antibody response to infection, or potentially that it influences a long-lasting antibody response. To disentangle this challenge better approaches to measuring antibody responses against infection are required, such as the development of reliable uninfected baseline population distributions. Furthermore, analyses from in vivo and candidate gene studies to provide stronger evidence in support of these findings, and to examine the possible functional roles of the genetic variants and genes to elucidate the underlying biological mechanisms, is required.
In summary, our study has confirmed known HLA allele associations with Epstein-Barr virus in a younger age group (seven-years) and replicated results in the adult population, UK Biobank. In addition, in the discovery phase we have identified four potentially novel genetic associations, with rs36020612 showing strong evidence of association with T. gondii in children at the seven-year clinic, and at later clinical timepoints. The location of this SNP on GRAMD1B is of particular interest as expression of this gene has been shown to be related to lymphocyte traits. We also found strong evidence of two SNPs (rs506576 and rs28617484) associated with antibodies against measles virus, however we were unable to test for replication in UK Biobank due to data unavailability. Furthermore, we observed strong evidence of replication of the suggestive SNP (rs186721582) associated with Epstein-Barr virus antibodies in UK Biobank, suggesting the genetic effect is stronger in adults. Our study provides a useful resource for future studies looking to use ALSPAC antibody measurements as well as HLA alleles. It indicates that for future use of the measured antibodies in ALSPAC the most appropriate approach is to use the antibody level measures as a continuous variable. This study has highlighted the potential for identification of host genetic risk factors for several common infections, and demonstrates that if similar data is collected in other cohorts, future meta-analysis GWAS is likely to be a fruitful endeavour in uncovering the genetic and biological mechanisms of infection susceptibility.

DATA AVAILABILITY STATEMENT
The ALSPAC study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool (http://www.bristol.ac.uk/alspac/researchers/ourdata/). Access to ALSPAC research data must be requested using the formal procedures and is subject to eligibility, the ALSPAC funder's terms and conditions and University of Bristol policies and procedures. Requests to access these datasets should be directed to alspac-exec@bristol.ac.uk.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by ALSPAC Ethics and Law Committee and the Local Research Ethics Committees. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.