Quantitative Human Paleogenetics: What can Ancient DNA Tell us About Complex Trait Evolution?

Genetic association data from national biobanks and large-scale association studies have provided new prospects for understanding the genetic evolution of complex traits and diseases in humans. In turn, genomes from ancient human archaeological remains are now easier than ever to obtain, and provide a direct window into changes in frequencies of trait-associated alleles in the past. This has generated a new wave of studies aiming to analyse the genetic component of traits in historic and prehistoric times using ancient DNA, and to determine whether any such traits were subject to natural selection. In humans, however, issues about the portability and robustness of complex trait inference across different populations are particularly concerning when predictions are extended to individuals that died thousands of years ago, and for which little, if any, phenotypic validation is possible. In this review, we discuss the advantages of incorporating ancient genomes into studies of trait-associated variants, the need for models that can better accommodate ancient genomes into quantitative genetic frameworks, and the existing limits to inferences about complex trait evolution, particularly with respect to past populations.


INTRODUCTION
The last decade has seen dramatic advances in our understanding of the genetic architecture of polygenic traits (Visscher et al., 2017). The advent of genome-wide association studies (GWAS), with large sample sizes and deep phenotyping of individuals, has led to the identification of thousands of loci associated with complex traits and diseases (MacArthur et al., 2017;Bycroft et al., 2018;Buniello et al., 2019). The resulting associations, and their inferred effect sizes, have enabled the development of so-called polygenic risk scores (PRS), which summarise either the additive genetic contribution of single nucleotide polymorphisms (SNPs) to a quantitative trait (e.g., height), or the increase in probability of a binary trait (e.g., major coronary heart disease) (Dudbridge, 2013). For some well-characterised medical traits, like cardiovascular disease, the predictive value of PRS has led to their adoption in clinical settings (Knowles and Ashley, 2018); however, the accuracy of PRS remains limited to populations closely related to the original GWAS cohort  and can vary within populations due to age, sex and socioeconomic status (Mostafavi et al., 2020). Ancient genomics has yielded considerable insights into natural selection on large-effect variants (Malaspinas, 2016;Dehasque et al., 2020), and an increasing number of studies are also now utilizing ancient genomes to learn about polygenic adaptation; the process by which natural selection acts on a trait with a large number of genetic loci, leading to changes in allele frequencies at many sites across the genome. Among these studies, the most commonly inferred complex traits are pigmentation and standing height.

ANCIENT DNA AND COMPLEX TRAIT GENOMICS
Skin, hair and eye pigmentation are among the least polygenic complex traits; though more than a hundred pigmentationassociated loci have been found, their heritability is largely dominated by large-effect common SNPs (Sulem et al., 2007;Eiberg et al., 2008;Han et al., 2008;Sturm et al., 2008;Hider et al., 2013;Liu et al., 2015;O'Connor et al., 2019). Additionally, several of these variants have signatures of past selective sweeps detectable in present-day genomes (Lao et al., 2007;Sabeti et al., 2007;Pickrell et al., 2009;Rocha, 2020). Nevertheless, genomic analyses in previously understudied populations-like sub-Saharan African groups-suggest that perhaps hundreds of skin pigmentation alleles of small effect remain to be found (Martin et al., 2017b). Similarly, recent studies have shown that eye pigmentation is far more polygenic than previous thought (Simcoe et al., 2021). Recent quantitative and molecular genomic studies are painting an increasingly complex picture of the architecture of these traits, featuring more considerable roles for epistasis, pleiotropy and small-effect variants than were previously assumed (for an extensive review of skin pigmentation, see Quillen et al., 2019).
Recently, ancient DNA (aDNA) studies have attempted to reconstruct pigmentation phenotypes in ancient human populations, although the extent to which these predictions are accurate remains uncertain. These reconstructions have been mostly focused on ancient individuals from Western Eurasia, due to the relatively higher abundance of SNP-phenotype associations from European-centric studies, and the poor portability of gene-trait associations to more distantly related populations (Martin et al., 2017a. For example, Olalde et al. (2014) queried pigmentation-associated SNPs in genomes of Mesolithic hunter-gatherer remains from western and central Eurasia, and suggested that the lighter skin colour characteristic of Europeans today was not widely present in the continent before the Neolithic. González-Fortes et al. (2017) analysed Mesolithic and Eneolithic genomes from central Europe, and inferred dark hair, brown eyes and dark skin pigmentation for the Mesolithic individuals and dark hair, light eyes, and lighter skin pigmentation for an Eneolithic individual. Similarly, Brace et al. (2019) inferred pigmentation phenotypes for Mesolithic and Neolithic genomes from western Europe, and reported that the so-called "Cheddar Man, " a Mesolithic individual found in England, had blue/green eyes and dark to black skin, in contrast to later Neolithic individuals with dark to intermediate skin pigmentation. Contrastingly, Günther et al. (2018) found elevated frequencies of light skin pigmentation alleles in individuals from the Scandinavian Mesolithic, suggestive of early environmental adaptation to life at higher latitudes. These reconstructions have also been carried out in individuals with no skeletal remains; for example, Jensen et al. (2019) used pigmentation-associated SNPs to infer the skin, hair and eye colour of a female individual whose DNA was preserved in a piece of birch tar "chewing gum." Some aDNA studies have sought to systematically investigate how pigmentation-associated variants were introduced and evolved in the European continent. Wilde et al. (2014) was one of the first studies to provide aDNA-based evidence that skin, hair, and eye pigmentation-associated alleles have been under strong positive selection in Europe over the past 5,000 years. The first large-scale population genomic studies (Allentoft et al., 2015;Haak et al., 2015;Mathieson et al., 2015) showed that major effect alleles associated with light eye colour likely rose in frequency in Europe before alleles associated with light skin pigmentation. More recently, Ju and Mathieson (2021) argued that the increase in light skin pigmentation in Europeans was primarily driven by strong selection at a small proportion of pigmentationassociated loci with large effect sizes. When testing for polygenic adaptation using an aggregation of all known pigmentationassociated variants, they did not detect a statistically significant signature of selection.
The other trait that has shared comparable prominence with pigmentation in the aDNA literature is standing height. In contrast to pigmentation, the genetic architecture of height is highly polygenic (Yang et al., 2015;Bycroft et al., 2018;Yengo et al., 2018). The heritability of this trait is dominated by a large number of alleles with small effect sizes, and shows strong evidence for negative selection in present-day populations . Studies of the genetic component of height in ancient populations have shown that ancient West Eurasian populations were, on average, more highly differentiated for this trait than present-day West Eurasian populations, and more so than one would predict from genetic drift alone (Mathieson et al., 2015;Martiniano et al., 2017;Cox et al., 2019). Cox et al. (2019) compared predicted genetic changes in height in ancient populations to inferred height changes estimated via skeletal remains. They concluded that the changes in inferred standing height were partially predicted by genetics; with both measures remaining relatively constant between the Mesolithic and Neolithic, and increasing between the Neolithic and Bronze Age. A follow-up study by Cox et al. (2021) used polygenic scores for height to show that PRS predicts 6.8% of the observed variance in femur length in ancient skeletons, after controlling for other variables. This is approximately one quarter of the predictive accuracy of PRS in present-day populations; which the authors attribute to the low-coverage aDNA data used in their study. Contrastingly, Marciniak et al. (2021) used the discordance between PRS for height, calculated from aDNA, and height inferred from the corresponding skeletal remains, to argue that Neolithic individuals were shorter than expected due to either poorer nutrition or increased disease burden, relative to hunter-gatherer populations.
However, the inference of standing height from skeletal remains is not without its own problems. Both Cox et al. (2021) and Marciniak et al. (2021) used the method developed by Ruff et al. (2012) to estimate stature from skeletal remains. Nevertheless, their respective estimates of stature-based on femur length-varied between some of the individuals included in both studies. Where multiple skeletal elements were available for ancient individuals, Marciniak et al. (2021) also produced separate stature estimates from femur, tibia, humerus and radius length, which varied substantially within some individuals; highlighting the uncertainty in estimates of stature from skeletal remains.

INFERRING COMPLEX TRAITS IN ARCHAIC HOMINIDS
The availability of genome sequences from archaic humans, like Neanderthals and Denisovans, has greatly expanded our understanding of their demographic history and interactions with modern humans (Meyer et al., 2012;Prüfer et al., 2014Prüfer et al., , 2017. However, little is known about complex traits in archaic humans, besides what can be inferred directly from their skeletal remains. In the case of Denisovans, such remains are presently limited to a few teeth, a mandible and other small bone fragments, making it difficult to make confident inferences of their biology (Meyer et al., 2012;Sawyer et al., 2015;Slon et al., 2017;Chen et al., 2019). However, past admixture events with archaic human groups have left a genetic legacy in present-day people, providing a possible inroad to study archaic human biology . Today, around 2% of the genomes of non-African humans are known to be descended from Neandertals, and an additional ∼5% of the genomes of people in Oceania can be traced back to Denisovans (Sankararaman et al., , 2016Vernot and Akey, 2014;Vernot et al., 2016).
Knowledge about admixture between archaic and modern humans has led to a recent flurry of exploratory studies concerning the potential impact of archaic variants on complex traits in present-day populations. Various approaches have been used to identify introgressed archaic DNA putatively under positive selection in modern humans (Khrameeva et al., 2014;Sankararaman et al., 2014Sankararaman et al., , 2016Vernot and Akey, 2014;Perry et al., 2015;Gittelman et al., 2016;Vernot et al., 2016;Racimo et al., 2017b). Overall, these studies have shown that archaic DNA is linked to pathways related to metabolism, as well as skin and hair morphology. Via association studies, Neanderthal variants in specific loci have been shown to influence several disease and immune traits, as well as skin and hair colour, behavioural traits, skull shape, pain perception and reproduction Dannemann et al., 2016;Sams et al., 2016;Gunz et al., 2019;Skov et al., 2020;Pääbo, 2020, 2021;Zeberg et al., 2020a,b).
Additionally, comparisons between the combined phenotypic effects of Neandertal variants and frequency-matched nonarchaic variants have revealed that Neanderthal DNA is overproportionally associated with neurological and behavioural phenotypes, as well as viral immune responses and type 2 diabetes (Quach et al., 2016;Simonti et al., 2016;Dannemann and Kelso, 2017;Dannemann, 2021). These groups of phenotypes may be linked to environmental factors, such as ultraviolet light exposure, pathogen prevalence and climate, that substantially differed between Africa and Eurasia. It has been suggested that the over-proportional contribution of Neandertal DNA to immunity and behavioural traits in present-day humans might be a reflection of adaptive processes in Neandertals to these environmental differences. In comparison, much less is known about the impact of Denisovan DNA on complex traits, because limited phenotypic data are presently available from presentday populations. However, individual Denisovan-like haplotypes found in high frequencies in some human populations have been associated with high altitude adaptation and fat metabolism (Huerta-Sánchez et al., 2014;Racimo et al., 2017a).
One key limitation to these approaches is that only about 40-50% of the Neandertal genome can be recovered in presentday humans, and therefore discoverable in such analyses Vernot and Akey, 2014;Skov et al., 2020). Furthermore, the majority of tested cohorts used for such studies are of European ancestry, which limits analyses to archaic variants present in these populations. This is particularly notable since Neandertal phenotype associations in European and Asian populations have been shown to contain populationspecific archaic variants (Dannemann, 2021). It has also been shown that negative selection, soon after admixture, has played an important role in removing some of the missing segments of archaic DNA (Harris and Nielsen, 2016;Juric et al., 2016;Petr et al., 2019). It is therefore possible that missing segments of archaic DNA had strong phenotypic effects. For archaic DNA that does persist in present-day populations, much of it is segregating at low allele frequencies, making it difficult to confidently link it to phenotypic effects.
Furthermore, it remains questionable how transferable any phenotypic associations are between modern and archaic humans, given the difficulties of transferring associations between present-day populations (Martin et al., 2017a;Duncan et al., 2019). All of the above studies have used gene-trait association information from analyses carried out in modern humans. It remains undetermined if the phenotypic effects of archaic DNA in present-day populations are a reliable proxy for phenotypic effects in archaic humans themselves.
Recent studies have also aimed to predict the phenotypic effects of archaic DNA without relying on introgression in present-day populations (see Figure 1). Colbran et al. (2019) used a machine learning algorithm, trained on genetic variation in present-day humans, to infer putative regulatory effects on variation present only in Neandertal genomes. Gokhman et al. (2020a,b) used aDNA damage patterns to infer a DNA methylation map of the Denisovan genome, and linked the inferred regulatory patterns to loss-of-function phenotypes, in order to predict their skeletal morphology and vocal and facial anatomy. It remains to be seen how successful these approaches are at predicting archaic human phenotypes. A possible inroad into validation could rest on functional assays for testing and evaluating the phenotypic impact of archaic DNA (Dannemann et al., 2020;Dannemann and Gallego Romero, 2021;Trujillo et al., 2021).

THE CHALLENGE OF DETECTING POLYGENIC ADAPTATION IN ANCIENT POPULATIONS
Perhaps the most fascinating question about the evolution of complex traits in humans is whether they were subject to natural selection. Current methods to detect polygenic adaptation have mainly focused on present-day populations; using either differences between populations, or variation within them, to identify polygenic adaptation. For example, Berg and Coop (2014)  . Whichever method is used, significant caveats must be addressed before attributing differences in such scores to polygenic adaptation (Novembre and Barton, 2018;Coop, 2019;Rosenberg et al., 2019). Most of these issues affect both presentday and ancient populations, but many are especially problematic when working with ancient genomes.
A prominently reported example of polygenic adaptation is that of selection for increasing height across a north-south gradient in Europe (Turchin et al., 2012;Berg and Coop, 2014;Robinson et al., 2015;Zoledziewska et al., 2015;Guo et al., 2018;Racimo et al., 2018;Berg et al., 2019b;Chen et al., 2020). Most studies which described this signal based their analyses on effect size estimates from the GIANT consortium, a GWAS meta-analysis encompassing 79 separate studies (Wood et al., 2014). Concerningly, follow-up work using the larger and more homogeneous UK Biobank cohort failed to replicate the signal of polygenic adaptation for height (Berg et al., 2019a;Sohail et al., 2019). A recent systematic comparison across a range of GWAS cohorts has further shown that the results of these tests are highly dependent on the ancestry composition of the cohort used to obtain the effect size estimates (Refoyo-Martínez et al., 2021). These analyses showed that residual stratification in GWAS meta-and mega-analyses can result in inflated effect size estimates that, in turn, can lead to spurious signals of selection. The effects of this residual stratification may be exacerbated for ancient populations with non-uniform relatedness to present-day GWAS cohorts (see Figure 2).
Residual stratification is a major concern for GWAS, even among a relatively homogeneous cohort like the UK Biobank. Zaidi and Mathieson (2020) used simulations to show that fine-scale recent demography can confound GWAS which has been corrected for stratification using common variants only. Failure to adequately correct for localised population structure can lead to spurious associations between a trait and lowfrequency variants that happen to be common in areas of atypical environmental effect. This finding is problematic as most GWAS have been conducted on either SNP array data, or on genomes imputed from SNP array data (Visscher et al., 2017). For example, GWAS summary statistics from the UK Biobank are based on imputed genomes (Bycroft et al., 2018). A limitation of this approach is that the accuracy of imputed genotypes are inversely correlated with the minor allele frequencies (MAF) of variants in the reference panel. Additionally, rare variants that are not segregating in the reference panel cannot be imputed at all. As a result, imputed genomes are specifically depleted in the rare variants needed to adjust for stratification from recent demography.
For large sample sizes, low-frequency variants (MAF ≤ 0.05) make a significant contribution to the heritability of many complex traits (Mancuso et al., 2016;Hartman et al., 2019), but FIGURE 2 | Potential effect of population stratification in GWAS. Two ancestral populations, X and Y, have contributed differing ancestry proportions to present-day individuals. Due to non-genetic environmental effects, individuals with a larger proportion of population Y ancestry have higher values for a measured trait. This may lead to biased GWAS effect size estimates, which associate population Y ancestry with increasing values of the trait. When used to make inferences about the past, this would lead to systematically inflated polygenic scores for this trait in samples from population Y. the role of rare variants is less well established. Both empirical and simulation studies have shown that for traits under either negative or stabilising selection, there is an inverse correlation between effect size and MAF (Simons et al., 2018;Schoech et al., 2019;Durvasula and Lohmueller, 2021). For the many traits thought to be under negative selection , large effect variants that are rare in present-day populations may have had higher allele frequencies in ancient populations due to selection. This makes polygenic scores for ancient individuals especially sensitive to bias from GWAS effect size estimates ascertained from common variants only. Conversely, where present-day rare variants with large-effect sizes are known, higher frequencies in ancient populations would result in more accurate PRS predictions, due to their larger contribution to the overall genetic variance.
A recent analysis indicated that a substantial component of the unidentified heritability for anthropometric traits like height and BMI lies within large effect rare variants, some with MAF as low as 0.01% (Wainschtein et al., 2019). However, using GWAS to recover variant associations for SNPs as rare as this would require hundreds of thousands of whole-genomes, substantially exceeding the largest whole-genome GWAS published to date (e.g., Taliun et al., 2021). The consequence of this missing heritability may be particularly acute for trait prediction in ancient samples, as large-effect rare variants which contributed to variability in the past may no longer be segregating in present-day populations. Indeed, simulations suggest that the genetic architecture of complex traits is highly specific to each population, and that negative selection enriches for private variants, which contribute to a substantial component of the heritability of each trait (Durvasula and Lohmueller, 2021). Empirical studies have also identified that functionally important regions, including conserved and regulatory regions, are enriched for population-specific effect sizes, and that this pattern may have been driven by directional selection .
In addition to these issues, the majority of SNP associations inferred from GWAS are likely not the causal alleles. Instead, GWAS predominantly identifies SNPs which are in high linkage disequilibrium (LD) with causal alleles. Most GWAS also assume a model in which all complex trait heritability is additive and well tagged by SNPs segregating in the cohort; although some GWAS do include non-additive models (e.g., Guindo-Martínez et al., 2021). Consequently, effect size estimates are contingent on the LD structure of the cohort in which they were ascertained. Due to recombination, this LD structure decays through time, and is reshaped by the population history in which selection processes are embedded.
Over the last decade, paleogenomic studies have repeatedly demonstrated that the evolutionary histories of human populations are characterized by recurrent episodes of divergence, expansion, migration and admixture (reviewed in Pickrell and Reich, 2014;Skoglund and Mathieson, 2018). For example, in West Eurasia, four major ancestry groups have contributed to the majority of present-day genetic variation (Jones et al., 2015). As such, the LD structure of present-day British individuals-which underpins effect size estimates from the UK Biobank-was substantially different prior to the Bronze Age, when the most recent of these major admixture episodes occurred (Allentoft et al., 2015;Haak et al., 2015). To improve ancestral trait prediction, new methods which explicitly model the haplotype structure of both ancient populations and present-day GWAS cohorts are needed.
In aggregate, these issues combine to substantially diminish the portability of polygenic scores between populations. Indeed, in present-day populations, the predictive accuracy of PRS degrades approximately linearly with increasing genetic distance from the cohort used to ascertain the GWAS (Scutari et al., 2016;Martin et al., 2017aMartin et al., , 2019Kim et al., 2018;Bitarello and Mathieson, 2020;Mostafavi et al., 2020;Majara et al., 2021). Even within a single ancestry group, the correlation between PRS calculated from different discovery GWAS shows considerable variance (Schultz et al., 2021). However, the extent to which the issue of PRS portability also affects ancient populations, which are either partially or directly ancestral to the GWAS cohort, are yet to be determined.
In cases where a robust signal of polygenic adaptation can be identified, care must still be taken when interpreting which trait was actually subject to directional selection. Due to the highly polygenic nature of most complex traits, there is a high rate of genetic correlation between phenotypes (Shi et al., 2017;Ning et al., 2020). This can occur when correlated traits share causal alleles (i.e., pleiotropy) or where casual alleles are in high LD with each other. Consequently, selection acting on one specific trait can generate a spurious signal of polygenic adaptation for multiple genetically correlated traits. Recently, Stern et al. (2021) developed a method for conditional testing of polygenic adaptation to address this problem. When considered in a joint test, previously identified signals of selection for educational attainment and hair colour in British individuals were significantly attenuated by the signal of selection for skin pigmentation (Stern et al., 2021). However, this approach can only untangle genetic correlations between traits which have been measured in GWAS cohorts, leaving open the possibility that selection is acting on an unobserved yet correlated trait. Indeed, many GWAS traits are either coarse proxy measures with substantial socio-economic confounding (e.g., educational attainment), or narrow physiological measurements (e.g., levels of potassium in urine); neither of which are likely to have been direct targets of polygenic adaptation. In practice, the truly adaptive phenotype is rarely directly observable, and all measured traits are genetically correlated proxies at various levels of abstraction.

LIMITATIONS AND CAVEATS SPECIFIC TO ANCIENT DNA
In addition to all of the general issues and caveats discussed above, working with ancient DNA also involves a range of issues that are particular to the degraded nature of the data; such as post-mortem damage, generally low average sequence coverage, short fragment lengths, reference bias, and microbial and human contamination (Gilbert et al., 2005;Dabney et al., 2013;Renaud et al., 2019;Peyrégne and Prüfer, 2020). All of these factors affect our ability to correctly infer ancient genotypes; and therefore, to construct accurate polygenic scores or infer polygenic adaptation.
A common strategy for dealing with the low endogenous fraction of aDNA libraries is to use in-solution hybridisation capture to retrieve specific loci, or a set of predetermined SNPs (Avila-Arcos et al., 2011;Cruz-Dávalos et al., 2017). This approach has substantial advantages in on-target efficiency, at the cost of ascertainment bias. For example, in the case of the popular "1240k" capture array, targeted SNPs were predominantly ascertained in present-day individuals (Fu et al., 2015;Haak et al., 2015). Consequently, an unknown fraction of the true ancestral variation is lost during capture. This is further exacerbated by the generally low coverage of most aDNA libraries; for which a common practice is to draw a read at random along each position in the genome, to infer "pseudohaploid" genotypes. When used to compute polygenic scores for ancient populations, only a subset of GWAS variants can be used, which substantially reduces predictive accuracy. Cox et al. (2021) estimate that the combined effect of low-coverage and pseudo-haploid genotypes reduced their predictive accuracy by approximately 75%, when compared to present-day data.
An alternative approach is to perform low-coverage shotgun sequencing, followed by imputation, using a large reference panel (Ausmees et al., 2019;Hui et al., 2020). This has the dual advantages of reducing ascertainment bias and increasing the number of GWAS variants available to calculate polygenic scores. However, imputation itself introduces a new source of bias, particularly if the reference panel is not representative of the ancestries found in the low-coverage samples. Nevertheless, the level of imputation bias can be empirically estimated by downsampling highcoverage aDNA libraries and testing imputed genotypes against direct observations (e.g., Margaryan et al., 2020). Where a suitable reference panel exists, recently developed methods for imputation from low-coverage sequencing data Rubinacci et al., 2021) show great promise for ancient DNA studies (e.g., Clemente et al., 2021).
Even under ideal conditions, in which exact polygenic scores for ancient populations are known a priori, interpreting differences in mean PRS between groups still requires careful consideration. For many polygenic traits, the variance between population means is lower than the variance within populations. As a result, differences in population level polygenic scores have limited predictive value for inferring the physiology or behaviour of individual people in the past. Genetics plays only a partial role in shaping phenotypic diversity, and differences in polygenic scores between individuals, or populations, does not automatically translate into differences in the expressed phenotype. Indeed, for some complex traits, an inverse correlation has been observed; in which polygenic scores have been steadily decreasing over recent decades, whilst the measured phenotype has been increasing [e.g., educational attainment (Kong et al., 2017;Abdellaoui et al., 2019)]. This highlights the substantial role of environmental variation in shaping phenotypic diversity. For ancient populations, we must also consider the wide variation in culture, diet, health, social organisation and climate which will have mediated any potential differences in population level polygenic scores. Furthermore, ancient populations are likely to have experienced a heterogeneous range of selective pressures. What we observe in present-day populations is not the result of a single directional process, but instead represents a mosaic of haplotypes which were shaped by different fitness landscapes, at varying levels of temporal depth.
Lastly, in most cases, we cannot directly observe phenotypes in the ancient individuals whose genomes have been studied. This greatly limits our ability to compare the genetically predicted value of a trait to its expressed phenotype, raising the question: are predictions of most ancient phenotypes inherently unverifiable? For well-preserved traits, like standing height, there is considerable variability in estimates produced from different skeletal elements and between different studies (Cox et al., 2021;Marciniak et al., 2021). For traits that do not preserve well in the archaeological record, the prospects of validation are much poorer. These include not only soft tissue measurements (e.g., pigmentation or haemoglobin counts), but also personality and mental health traits that require an individual to be alive to be properly measured or diagnosed. Furthermore, some phenotypes are non-sensical outside of a modern context. Whilst it is possible to build a polygenic score for "time spent watching television" (UK Biobank code: 1070), it is not clear how to interpret any potential differences one might find between Mesolithic huntergatherers and Neolithic farmers. This problem extends more generally to all phenotypes which have strong gene-environment interactions, in which the expression of the trait may have been substantively different in the past due to diverse environmental conditions (e.g., the interplay between BMI and diet).

PROSPECTS FOR THE FUTURE
The growth in the number of ancient genomes currently shows little signs of slowing, nor does the increasing availability of gene-trait association data. Predictably, efforts to perform trait predictions in ancient individuals will also continue to grow. We believe that increased emphasis on limitations and caveats in the way we study and communicate these findings will enable a better understanding of what we can and cannot predict with existing models.
As a working assumption, polygenic scores from any single GWAS should be considered unreliable in an ancient trait reconstruction analysis. Researchers should only trust observed signals of trait evolution if those patterns hold across multiple independent GWAS (e.g., Chen et al., 2020), and preferably where each of these GWAS has been performed on a large cohort with homogeneous ancestry (Refoyo-Martínez et al., 2021).
We also need to better understand how well GWAS effect size estimates, ascertained in present-day populations, generalise to ancient populations that are only partially ancestral to the GWAS cohort. One approach to this would be to use simulations, under a plausible demographic scenario, to explore how the predictive accuracy of PRS degrades through time and across the boundaries of major ancestral migrations.
Traits that are preserved in the fossil record can provide a degree of partial benchmarking (Cox et al., , 2021; however, the genetic components of variation are often only partially explained by polygenic scores, and environmental components almost always play large roles in expressed trait variation, often dwarfing the contribution of polygenic scores. Furthermore, only a few-largely osteological-traits are well preserved over time, so these comparisons will always be limited in scope. That being said, there are several promising avenues of research that could serve to improve genetic trait prediction in ancient populations. An existing approach to improve the portability of PRS across ancestries is to prioritise variants with predicted functional roles (Amariuta et al., 2020;Weissbrod et al., 2020). This approach aims to improve PRS portability in present-day populations by reducing the fraction of spurious associations due to the cohort specific LD structure of the GWAS reference panel. Another promising approach is to jointly model PRS using GWAS summary statistics from multiple populations (Márquez-Luna et al., 2017;Ruan et al., 2021;Turley et al., 2021). By including information from genetically distant groups, these methods can account for the variance in effect sizes inferred between GWAS cohorts. This multi-ancestry approach holds particular promise for ancient populations, as it may help to identify variant associations which are segregating in only a subset of present-day populations, but which were more widespread in the past.
These studies also underscore the importance of studying the ancestral haplotype backgrounds on which beneficial, deleterious or neutral alleles spread. Recent studies have shown that tests of selection on individual loci can gain power by explicitly modelling patterns of ancestry across the genome (Pierron et al., 2018;Hamid et al., 2021). Strong selective signals might be masked by post-selection admixture processes, but might become evident once the ancestry of the selected haplotypes is explicitly modelled (Souilmi et al., 2020). This phenomenon is also likely to affect polygenic adaptation studies, particularly when the degree of correlation between genetic score differences and differences in ancestral haplotype backgrounds is expected to be high, for example, after admixture between populations that have been evolving in isolation for long periods of time.
A promising avenue of research is developing around new methods for approximately inferring ancestral recombination graphs (ARG) via the construction of tree sequences (Kelleher et al., 2019;Speidel et al., 2019), which have recently been extended to incorporate non-contemporaneous sampling Wohns et al., 2021). An ARG is a model which contains a detailed description of the genealogical relationships in a set of samples, including the full history of gene trees, ancestral haplotypes and recombination events which relate the samples to each other at every site in the genome (Griffiths and Marjoram, 1997). One potential advantage of an ARG is that it may be used to help mitigate issues with the portability of polygenic scores. By building an ARG composed of both ancient samples and the present-day cohorts used to ascertain the GWAS associations, one could potentially determine which haplotypes are shared between the GWAS cohort and the ancient populations; thereby reducing effect size bias in populations that are only partially ancestral to the GWAS cohort.
Another area in which ancient genomes offer unique potential is in detecting polygenic adaptation in response to environmental change. The time-series nature of ancient genomes provides the potential for the incorporation of paleoclimate reconstructions (e.g., Brown et al., 2018) into tests of polygenic adaptation, in a manner that is not possible with present-day data alone.
Ultimately, the ancient genomics community must come to terms with the limitations of genetic hindcasting. Ancient genomes provide an unprecedented window into our past, but this window is often blurry and distorted. There is still a lot of information waiting to be obtained from ancient DNA, and some of the blurriness might ultimately come into focus as computational methods continue to improve. But we must also accept the fact that many aspects of past human biology-including physical characteristics and disease susceptibility-might be irrevocably lost to the tides of history. Ancient genome sequences are, after all, molecular fossils: imperfect and degraded records of lives that ceased to exist long ago.

AUTHOR CONTRIBUTIONS
EI-P and FR reviewed and edited the manuscript. All authors wrote the original draft of the manuscript and approved the submitted version. FUNDING EI-P was supported by the Lundbeck Foundation (grant R302-2018-2155) and the Novo Nordisk Foundation (grant NNF18SA0035006). FR and RM were supported by a Villum Fonden Young Investigator award to FR (project no. 00025300). Additionally, FR was supported by the COREX ERC Synergy grant (ID 951385). MD was supported by the European Union through the Horizon 2020 Research and Innovation Programme under grant no. 810645 and the European Regional Development Fund Project No. MOBEC008.

ACKNOWLEDGMENTS
We thank the members of the Racimo group for their helpful advice and discussions, and thank the reviewers and editor for their constructive feedback. Figure 1 was created with Biorender.com.