Accuracy of gene expression prediction from genotype data with PrediXcan varies across diverse populations

Predicting gene expression from genetic data has garnered significant attention in recent years. PrediXcan is one of the most widely used gene-based association methods for testing the association between imputed gene expression values and a phenotype, because it offers insight into the relationship between complex traits and the component of gene expression that can be attributed to genetic variation. The prediction models for PrediXcan, however, were obtained using supervised machine learning methods and training data from the Depression Genes and Networks (DGN) study and the Genotype-Tissue Expression (GTEx) project, where the majority of subjects are of European descent. Many genetic studies include samples from multi-ethnic populations, and in this paper we assess the accuracy of gene expression predictions with PrediXcan in diverse populations. Using transcriptomic data from the GEUVADIS (Genetic European Variation in Health and Disease) RNA sequencing project and whole genome sequencing data from the 1000 Genomes project, we evaluate and compare the predictive performance of PrediXcan in an African population (Yoruban) and four European populations. Prediction results are obtained using a range of models from PrediXcan weight databases, and Pearson’s correlation coefficient is used to measure prediction accuracy. We demonstrate that the predictive performance of PrediXcan varies across populations (F-test p-value < 0.001), with prediction accuracy lowest in the Yoruban sample relative to the European samples. Moreover, the performance of PrediXcan varies not only among distant populations but also among closely related populations. We also find that the qualitative performance of PrediXcan for the populations considered is consistent across all weight databases used.


In the past decade, genome-wide association studies (GWAS) have identified thousands of genetic variants significantly associated with a wide range of human phenotypes. The vast majority of these studies, however, were conducted in samples from European ancestry populations [1][2][3][4][5]. Differences in allele frequencies, genetic architecture, and linkage disequilibrium (LD) patterns across ancestries suggest that GWAS discoveries can fail to generalize across populations, and recent publications have provided compelling evidence that GWAS findings often do not transfer from European

the Wald test to assess the significance of the coefficient for each gene and excluded the genes whose corresponding p-values were above the significance level of 0.05.

We then calculated Pearson's correlation coefficient, $r$, between observed and predicted expression values for every gene, in each population separately. A few genes had constant predicted gene expression levels across all subjects. Because the correlation is undefined when one of the variables is constant, we excluded those genes. Thus, for every gene we had five Pearson's correlation coefficients, one per population. Note that we used $r$ instead of the square of Pearson's correlation, $r^2$, in order to take the direction of the correlation into account. Using $r^2$ as a measure of predictive accuracy can be misleading because a large proportion of genes have predicted and observed expression values that are negatively correlated.

To assess how the training of prediction models with different populations affects prediction accuracy, we used a linear mixed effect model approach.
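The per-gene, per-population correlation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the nested-dictionary layout and the function names `pearson_r` and `correlations_by_population` are hypothetical. Dropping a gene whenever its correlation is undefined in any one population is an assumption made here for simplicity.

```python
import math


def pearson_r(x, y):
    """Return Pearson's r, or None when either input is constant (r undefined)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    # A constant variable has zero variance, so the correlation cannot be computed.
    if len(set(x)) == 1 or len(set(y)) == 1:
        return None
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def correlations_by_population(observed, predicted):
    """Compute r for every gene in each population separately.

    observed / predicted: {gene: {population: [value per subject]}}
    (hypothetical layout). Genes with an undefined r in any population
    are excluded (an assumption of this sketch).
    """
    result = {}
    for gene, obs_by_pop in observed.items():
        per_pop = {
            pop: pearson_r(obs, predicted[gene][pop])
            for pop, obs in obs_by_pop.items()
        }
        if all(r is not None for r in per_pop.values()):
            result[gene] = per_pop
    return result
```

Keeping the sign matters: a gene whose predictions are perfectly anti-correlated with the observations has $r = -1$ but $r^2 = 1$, so $r^2$ alone would rank it as perfectly predicted.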
After filtering out poorly predicted genes, we fit the following model:

$$r_{ij} = \beta_0 + \beta_1 I_{FIN,i} + \beta_2 I_{GBR,i} + \beta_3 I_{TSI,i} + \beta_4 I_{YRI,i} + \gamma_i + \varepsilon_{ij},$$

where $r_{ij}$ is the correlation coefficient for gene $i$ in population $j$; and $I_{FIN,i}$, $I_{GBR,i}$, $I_{TSI,i}$, and $I_{YRI,i}$ are indicator variables that are equal to 1 if the gene correlation was calculated on the population indicated in the subscript, and otherwise are equal to 0. Thus, we modeled population as a categorical predictor, with the CEU population as a reference. To account for variation between genes, we included a random intercept $\gamma_i$ for each gene and we assumed that $\gamma_i \sim N(0, \sigma_\gamma^2)$. We

Using DGN, GTEx WB and GTEx LCL models and sequence data, we predicted gene expression for 10387, 5432 and 2777 genes, respectively (see Table 2).

populations than for any of the European populations, regardless of the weight database used, and this trend is even more obvious after the filtering process.
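To make the role of the CEU reference category in the mixed model concrete, the population-level mean correlations it implies can be written out. The symbols $\beta_0, \dots, \beta_4$ are assumed names for the intercept and the coefficients of the FIN, GBR, TSI, and YRI indicators, in that order. Because the random intercept $\gamma_i$ has mean zero, it drops out of the expectations.

```latex
% Assumed notation: \beta_0 is the intercept; \beta_1..\beta_4 multiply the
% FIN, GBR, TSI, and YRI indicators. These symbol names are an assumption.
\begin{align*}
E[r_{ij} \mid j = \mathrm{CEU}] &= \beta_0, \\
E[r_{ij} \mid j = \mathrm{FIN}] &= \beta_0 + \beta_1, \\
E[r_{ij} \mid j = \mathrm{GBR}] &= \beta_0 + \beta_2, \\
E[r_{ij} \mid j = \mathrm{TSI}] &= \beta_0 + \beta_3, \\
E[r_{ij} \mid j = \mathrm{YRI}] &= \beta_0 + \beta_4, \\
H_0 &: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0.
\end{align*}
```

Under this reading, each $\beta_k$ is the difference in mean prediction accuracy between the corresponding population and CEU, and $H_0$ states that mean accuracy is the same in all five populations. The source does not spell out the hypothesis behind the reported F-test, so treating $H_0$ as that hypothesis is an interpretation rather than a stated fact.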

Afterwards, we binned the genes into six categories based on the gene correlation coefficients (see Table 3).
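The binning step can be illustrated with a short sketch. The cut points used for the six categories are not reproduced in this text, so the edges and labels below are hypothetical placeholders chosen only to yield six bins. `bin_label` and `bin_counts` are likewise illustrative names.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges: the text states there are six categories, but the
# actual cut points (Table 3) are not given here, so these are placeholders.
EDGES = [-0.1, 0.0, 0.1, 0.3, 0.5]
LABELS = [
    "r < -0.1",
    "-0.1 <= r < 0",
    "0 <= r < 0.1",
    "0.1 <= r < 0.3",
    "0.3 <= r < 0.5",
    "r >= 0.5",
]


def bin_label(r):
    """Map a correlation coefficient to its category (left-closed intervals)."""
    return LABELS[bisect_right(EDGES, r)]


def bin_counts(correlations):
    """Count genes per category for one population, keeping empty bins."""
    counts = Counter(bin_label(r) for r in correlations)
    return {label: counts.get(label, 0) for label in LABELS}
```

Running `bin_counts` once per population on that population's per-gene correlations gives one row of counts per population, which is the general shape of a table such as Table 3.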

As can be seen in the violin plots in Figure 1, both databases based on whole blood perform similarly,

In this work, we evaluated PrediXcan performance and compared it across five geographically diverse populations using multiple weight databases. Models from all seven weight databases were trained mostly on subjects of European ancestry; three of the databases were derived from LCL and the remaining four from whole blood. As a measure of prediction accuracy, we computed correlation coefficients for each gene in all populations and used the linear mixed models framework to quantify the differences in prediction performance across populations. We also investigated whether whole blood models could be used for predicting gene expression levels in LCL.

Overall, PrediXcan accurately predicted gene expression for some genes; however, the majority of genes had very poor correlation between measured and predicted expression levels. For almost half the genes, the correlation was negative. As expected, prediction accuracy was higher when the training and testing cohorts were of similar ancestry; i.e., models trained on Europeans performed better in subjects of European descent and worst in the African subjects. Surprisingly, prediction accuracy varied even among the European populations, with the Finnish, British, and Italian populations having significantly higher accuracy than the CEU. These results held under all the weight databases we considered. Lastly, LCL-trained models outperformed whole-blood-trained models, although the prediction accuracy was similar for many of the genes.

A recent study reported results consistent with our findings and suggested that gene expression models should be trained on genetically similar populations [16]. Lack of genomic data from diverse

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AM and TT conceived the idea, designed the analysis, interpreted the results, and wrote the paper.

AM ran the analysis.