The Estimated Prevalence of N-Linked Congenital Disorders of Glycosylation Across Various Populations Based on Allele Frequencies in General Population Databases

Congenital disorders of glycosylation (CDG) are a widely acknowledged group of metabolic diseases. PMM2-CDG is the most frequently diagnosed CDG, with a prevalence as high as one in 20,000. In contrast, the prevalence of other CDG types remains unknown. This study aimed to analyze the estimated prevalence of different N-linked protein glycosylation disorders. We extracted allele frequencies for diverse populations from The Genome Aggregation Database (gnomAD), encompassing variant frequency information from 141,456 individuals. To identify pathogenic variants, we used the ClinVar database as a primary source. High confidence loss-of-function variants as defined by the LOFTEE algorithm were also classified as pathogenic. After summing up population frequencies for pathogenic alleles, estimated disease birth prevalence values with confidence intervals were calculated using a Bayesian method. We first validated our approach using two more common recessive disorders (cystic fibrosis and phenylketonuria) by showing that the estimated prevalences calculated from population allele frequencies were in accordance with previously published epidemiological studies. Among the 27 autosomal recessive N-glycosylation disorders assessed, the only disease with an estimated birth prevalence higher than one in 100,000 was PMM2-CDG (both in all gnomAD individuals and in those with European ancestry). The combined prevalence of 27 different N-glycosylation disorders was around one in 22,000 Europeans but varied considerably across populations. We present estimated prevalence data from diverse populations and explain the possible pitfalls of this analysis. Still, we are confident that these data will guide CDG research and clinical care to identify CDG across populations.


INTRODUCTION
Congenital disorders of glycosylation (CDG) are a growing group of metabolic diseases with at least 137 defects (Ondruskova et al., 2021). According to the International Classification of Inherited Metabolic Disorders (ICIMD) database, 32 are classified as N-linked protein glycosylation defects (Ferreira et al., 2021). Based on population allele frequencies, the expected birth prevalence of the most common defect, PMM2-CDG, could be as high as 1:20,000 (Schollen et al., 2000), and in later reports, 1:77,000 to 1:286,000 (Vals et al., 2018; Yildiz et al., 2020). Mainly because of PMM2-CDG, type 1 defects are more frequently diagnosed than type 2 defects (Peanne et al., 2017; Medrano et al., 2019), but compared to PMM2-CDG, other defects are much rarer. The estimated combined prevalence of CDG among the Saudi population is 14 per million (Alsubhi et al., 2017), whereas in Poland, the prevalence is approximately one case per million (Lipinski et al., 2021). The worldwide individual prevalence of different defects remains unknown.
The emergence of large-scale population-based exome and genome sequencing studies and their aggregation into even more extensive cross-population databases like gnomAD make allele frequency data available from more than 100,000 individuals across different populations (Karczewski et al., 2020). Disease prevalence can be estimated when these data are combined with clinical variant classification databases like ClinVar (Landrum et al., 2018) and variant loss-of-function predictions. This approach has previously been used to assess the disease prevalence of different limb-girdle muscular dystrophy subtypes across populations (Liu et al., 2019). Although direct calculations using the Hardy-Weinberg equation can be used, the methods developed by Liu et al. (2019) make use of more advanced Bayesian statistics to add confidence intervals to the prevalence estimates.
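For orientation, the direct Hardy-Weinberg calculation mentioned above can be sketched as follows. This is a minimal illustration, not the method used in this study: it assumes that all pathogenic variants in a gene sit on different alleles, so their frequencies can be summed into one combined frequency q, and that affected individuals carry two pathogenic alleles (frequency q squared). The function name and the allele frequencies are hypothetical.

```python
def hardy_weinberg_prevalence(pathogenic_allele_freqs):
    """Expected birth prevalence of an autosomal recessive disease.

    Assumes Hardy-Weinberg equilibrium and that each pathogenic variant
    lies on a separate allele, so frequencies can simply be added.
    """
    q = sum(pathogenic_allele_freqs)  # combined pathogenic allele frequency
    return q * q                      # frequency of biallelic genotypes


# Hypothetical allele frequencies for three pathogenic variants in one gene.
example_freqs = [0.004, 0.0007, 0.0003]
prevalence = hardy_weinberg_prevalence(example_freqs)
print(f"about 1 in {round(1 / prevalence):,}")
```

A point estimate of this kind carries no measure of uncertainty, which is the gap the Bayesian approach is meant to fill.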
This study aimed to analyze the estimated prevalence of N-linked protein glycosylation defects across different populations based on allele frequencies in general population databases.

MATERIALS AND METHODS
Only N-linked protein glycosylation defects with autosomal recessive inheritance patterns (27/32) were included in the analysis. The list was based on the ICIMD database (Ferreira et al., 2021). We extracted allele frequencies for different populations from The Genome Aggregation Database (gnomAD v2.1.1) (Karczewski et al., 2020), encompassing variant frequency information from 141,456 individuals. We defined variants as pathogenic if they were classified as pathogenic or likely pathogenic in the ClinVar database (version 20210119) (Landrum et al., 2018) or were marked as high confidence loss-of-function (LoF) variants by the LOFTEE algorithm (Karczewski et al., 2020). In the case of conflicting annotations, high confidence LoF variants marked as (likely) benign in ClinVar were considered non-pathogenic. In addition, variants not passing all quality filters in the gnomAD variant call set were excluded from the sum of pathogenic alleles.
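The variant selection rules described above can be expressed as a small decision function. The sketch below is illustrative only: the `Variant` record, its field names, and the significance strings are hypothetical stand-ins for the corresponding gnomAD and ClinVar annotations, not the actual column names used in the analysis.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the ClinVar significance categories of interest.
PATHOGENIC = {"Pathogenic", "Likely pathogenic", "Pathogenic/Likely pathogenic"}
BENIGN = {"Benign", "Likely benign", "Benign/Likely benign"}


@dataclass
class Variant:
    passes_gnomad_filters: bool          # all gnomAD quality filters passed
    clinvar_significance: Optional[str]  # None if the variant is not in ClinVar
    loftee_high_confidence: bool         # high confidence LoF according to LOFTEE


def is_pathogenic(v: Variant) -> bool:
    """Apply the inclusion rules from the Methods to one variant."""
    if not v.passes_gnomad_filters:
        return False  # low-quality calls are never counted
    if v.clinvar_significance in PATHOGENIC:
        return True   # ClinVar (likely) pathogenic
    if v.loftee_high_confidence:
        # High confidence LoF counts unless ClinVar calls it (likely) benign.
        return v.clinvar_significance not in BENIGN
    return False
```

For example, `is_pathogenic(Variant(True, "Likely benign", True))` returns `False`, reflecting the rule for conflicting annotations.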
The calculations were carried out across all seven gnomAD populations (African/African American, Latino/Admixed American, non-Finnish European, Finnish, Ashkenazi Jewish, East Asian, and South Asian). The number of individuals per population ranged from 5,185 (Ashkenazi Jewish) to 64,603 (non-Finnish European). In addition, we calculated prevalence estimates for the Estonian population (2,418 individuals), which is a subpopulation of non-Finnish Europeans. After obtaining population frequencies for pathogenic alleles, estimated disease birth prevalence values with confidence intervals were calculated using Bayesian statistics adapted from the previously published study by Liu et al. (2019). To ensure reproducibility and robustness, no manual variant curation was performed and no gene-based exceptions were made (e.g., for PMM2-CDG, p.Arg141His homozygotes were not excluded although homozygosity for this variant is known to be embryonically lethal). For validation, we used two common recessive disorders (cystic fibrosis and phenylketonuria) and compared the estimated disease prevalence with the data from epidemiological studies conducted in different populations.
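To give a feel for how an interval can be attached to a prevalence estimate, the following sketch uses a simple Monte Carlo approximation. It is not the procedure of Liu et al. (2019), whose details are not reproduced here. It assumes a uniform Beta(1, 1) prior on each variant's allele frequency, treats variants as independent, and squares the summed frequency under Hardy-Weinberg equilibrium. The allele counts are hypothetical, and the chromosome number assumes that all 64,603 non-Finnish European individuals were genotyped at every site, which is a simplification.

```python
import random


def prevalence_interval(allele_counts, n_chromosomes, draws=5_000, seed=0):
    """Return (lower, median, upper) of a 95% interval for birth prevalence.

    With a uniform prior, observing k pathogenic alleles among n chromosomes
    gives a Beta(k + 1, n - k + 1) posterior for that variant's frequency.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    samples = []
    for _ in range(draws):
        # Draw one plausible frequency per variant and add them up.
        q = sum(rng.betavariate(k + 1, n_chromosomes - k + 1)
                for k in allele_counts)
        samples.append(q * q)  # biallelic genotype frequency
    samples.sort()
    lower = samples[int(0.025 * draws)]
    median = samples[draws // 2]
    upper = samples[int(0.975 * draws) - 1]
    return lower, median, upper


# Hypothetical allele counts for three pathogenic variants.
counts = [120, 15, 3]
chromosomes = 2 * 64_603  # two chromosomes per individual (assumed full coverage)
lower, median, upper = prevalence_interval(counts, chromosomes)
print(f"1 in {round(1 / median):,} "
      f"(95% interval 1 in {round(1 / upper):,} to 1 in {round(1 / lower):,})")
```

Because prevalence is reported as "1 in N", the upper bound of the frequency corresponds to the smaller denominator.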

RESULTS
The prevalence of cystic fibrosis and phenylketonuria estimated from population allele frequencies was in accordance with previously published data (Supplementary Table 1). Results from published epidemiological studies fell within the 95% confidence intervals in the non-Finnish European, Estonian, and Finnish populations, where reliable population prevalence is known.
The selected list of N-linked protein glycosylation defects, their estimated prevalence, and the number of reported cases are presented in Table 1. The full list of all defects in all assessed populations is shown in Supplementary Table 2. PMM2-CDG showed the highest estimated prevalence, calculated from 71 different pathogenic variants, and was most prevalent in the European (1 in 27,000), Ashkenazi Jewish (1 in 20,000), and admixed American (1 in 64,000) populations. All other N-linked protein glycosylation defects had a much lower prevalence.
FUK-CDG and MAN2B2-CDG, which were both only recently reported and classified as N-linked protein glycosylation defects, showed a high estimated prevalence, especially in East Asians, where FUK-CDG prevalence was estimated at 1:12,000 and MAN2B2-CDG at 1:11,000.
The estimated combined prevalence of N-linked protein glycosylation defects in Europeans was one in 22,000, or one in 24,000 after excluding FUK-CDG and MAN2B2-CDG. In East Asians, however, the total CDG prevalence was estimated at 1:5,613, dropping to 1:121,935 after excluding FUK-CDG and MAN2B2-CDG. CDG seems to be more common in the Finnish (1:16,000) and Ashkenazi Jewish (1:14,000) populations. In Estonians, the combined CDG birth prevalence is estimated at around 1:50,000.

DISCUSSION
This study presents the estimated prevalence of different N-linked protein glycosylation defects calculated from population allele frequencies. As the CDG group involves at least 137 defects, we emphasize that we included only 27 autosomal recessive defects affecting protein N-glycosylation and excluded defects that affect multiple glycosylation pathways. The methods used make several assumptions and do not take variant- and gene-specific properties into account. For example, the calculations assume that pathogenic variants are independent and not on the same allele. Also, Hardy-Weinberg equilibrium is assumed but may not hold in some populations with more consanguinity. All diseases are considered to result from a biallelic loss-of-function mechanism; however, we cannot exclude the possibility that for some CDG, only specific variants (e.g., certain missense variants) cause the disease.
In addition, this method did not consider the possibility that some variant combinations might not be compatible with life. Since we cannot exclude this phenomenon in these genes, the chance of p.Arg141His homozygosity was also included when calculating the estimated prevalence of PMM2-CDG. However, as seen from the subanalysis of the Estonian population, this does not change the prevalence greatly. In Estonia, the estimated prevalence of PMM2-CDG was 1:77,000 with the exclusion of p.Arg141His homozygosity and 1:62,000 with its inclusion (Vals et al., 2018). This example shows that the estimated prevalence of different defects might be somewhat overestimated.
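The size of this effect can be illustrated with simple arithmetic on the two Estonian figures quoted above. The sketch assumes that the two estimates differ only by the expected frequency of p.Arg141His homozygotes and that Hardy-Weinberg proportions hold, so that this frequency equals the squared allele frequency. Since the published figures are rounded, the derived values are approximate, and the variable names are illustrative.

```python
# Estonian PMM2-CDG estimates quoted in the text.
with_homozygotes = 1 / 62_000     # p.Arg141His homozygotes counted
without_homozygotes = 1 / 77_000  # p.Arg141His homozygotes excluded

# Expected frequency of p.Arg141His homozygotes implied by the difference.
homozygote_freq = with_homozygotes - without_homozygotes

# Share of the inclusive estimate that is attributable to these homozygotes.
share = homozygote_freq / with_homozygotes

# Allele frequency that would produce this homozygote frequency under
# Hardy-Weinberg equilibrium (an implied value, not a measured one).
implied_allele_freq = homozygote_freq ** 0.5

print(f"homozygotes: about 1 in {round(1 / homozygote_freq):,}")
print(f"share of inclusive estimate: {share:.1%}")
print(f"implied allele frequency: {implied_allele_freq:.4f}")
```

Under these assumptions, p.Arg141His homozygotes account for slightly less than one fifth of the inclusive estimate.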
On the other hand, some aspects can cause an underestimation of disease prevalence. For example, the data currently rely on gnomAD variants, and thus ultra-rare pathogenic variants not seen in gnomAD are not counted toward the prevalence estimate. Also, only small variants (single nucleotide substitutions and small indels) were included, and thus pathogenic copy-number variants (deletions) may raise the disease prevalence estimates. Regarding missense variants, only those classified as pathogenic in ClinVar were included, while some other missense, as well as synonymous, intronic, and regulatory variants, may truly be pathogenic but were missed because such information is lacking in databases. One possibility to improve the estimates would be to include missense variants with high pathogenicity predictions. However, as shown by Liu et al. (2019) and consistent with our experience (data not shown), this leads to an overestimation of prevalence. Thus, we did not include any missense variant prediction tools in our final calculations. To be on the conservative side, we eliminated all low-quality variants in gnomAD; however, some of them, especially indels, may be true and pathogenic. Of note, the presented results rely on data extracted from two publicly available databases, ClinVar and gnomAD, at a single time point. Had other databases been used, we would expect slightly different results; however, confidence intervals are shown to address this variability. Moreover, as the databases update both allele frequencies and pathogenicity classifications, a follow-up study is planned to investigate whether the prevalence estimates change after some years.
Despite many pitfalls, the validation with two well-known recessive diseases (cystic fibrosis and phenylketonuria) supported the method's accuracy and usability for estimating disease prevalence. From our previous epidemiological studies in Estonia, the prevalence of phenylketonuria is known to be 1:6,700 (Lillevali et al., 2018), which agrees with the gnomAD-based estimate of 1:6,604 (95% CI 1:4,269-1:12,073). For cystic fibrosis, the birth prevalence in Estonia is 1:7,743 (Kahre, 2004), which compares well with the calculated estimate of 1:8,482 (95% CI 1:5,269). For other well-studied populations such as Europeans and Finnish, the data agree well too (Supplementary Table 1).
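The validation criterion, that the epidemiologically observed prevalence falls inside the estimated interval, can be checked mechanically. The sketch below applies it to the Estonian phenylketonuria figures given above; the helper name is hypothetical, and prevalences are handled as "1 in N" denominators.

```python
def within_interval(observed, ci_bound_a, ci_bound_b):
    """Check whether an observed '1 in N' denominator lies inside an interval.

    The two bounds may be given in either order.
    """
    low, high = sorted((ci_bound_a, ci_bound_b))
    return low <= observed <= high


# Estonian phenylketonuria figures from the text (denominators of 1:N).
observed_pku = 6_700
estimated_pku = 6_604
ci_pku = (4_269, 12_073)

pku_validated = within_interval(observed_pku, *ci_pku)
relative_difference = abs(observed_pku - estimated_pku) / observed_pku
print(pku_validated, f"{relative_difference:.1%}")
```

The observed and estimated denominators differ by less than 2%, well inside the width of the interval.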
As shown in Table 1, compared to PMM2-CDG, all other N-linked protein glycosylation defects are less common. Still, our reported prevalence of different defects is consistent with published clinical observations. In 2016, an informal inquiry was conducted in various European laboratories to evaluate the number of screening-positive and molecularly confirmed CDG-I and CDG-II patients (Peanne et al., 2017). Based on their results, the most common type 1 N-glycosylation defect in Europe was PMM2-CDG, followed by ALG6-CDG, ALG1-CDG, and MPI-CDG. If we compare this distribution with our estimates for non-Finnish Europeans, a similar pattern is seen for type 1 N-glycosylation defects. The most common type 2 N-glycosylation defects, according to the inquiry, were MAN1B1-CDG, MGAT2-CDG, and B4GALT1-CDG. In the non-Finnish European population, MAN1B1-CDG is followed by MOGS-CDG, which does not show an abnormal serum transferrin profile with routine CDG screening (De Praeter et al., 2000) and therefore was not reported by laboratories. Compared to these two defects, MGAT2-CDG and B4GALT1-CDG show a much lower prevalence. Unexpectedly, two recently described defects, FUK-CDG and MAN2B2-CDG, showed a prevalence higher than one per million. To date, FUK-CDG has been described in only two individuals and MAN2B2-CDG in only one (Ng et al., 2018; Verheijen et al., 2020). For FUK, we found 119, and for MAN2B2, 84 different possibly disease-causing alleles matching our definition of pathogenicity. The disagreement between the relatively high estimated prevalence and the low number of reported cases could be explained by the phenomenon where homozygosity or compound heterozygosity of some variants may not be viable, as is the case with p.Arg141His homozygosity in PMM2 (Matthijs et al., 1998). Also, FUK and MAN2B2 are not included in smaller gene panels and are identified only by whole-exome sequencing. In theory, this may contribute to some extent to the underdiagnosis of these defects.
Notably, one of the pathogenic missense variants reported by Ng et al. (2018), NM_145059.3:c.2980A>C (p.Lys994Gln), reaches an allele frequency of 0.002 in the East Asian population, and one homozygous individual is present in gnomAD. Although mild phenotypes cannot be excluded in gnomAD individuals, non-pathogenicity of this allele, at least in the homozygous state, should also be considered. Of note, FUK and MAN2B2 both lack homozygous LoF variant carriers in gnomAD, which hints that biallelic LoF may be an actual pathogenic mechanism for those genes.
Although the individual prevalence of each N-linked protein glycosylation defect is very low, the combined prevalence for this group of CDG is notable. If all 27 defects are included, the combined prevalence in non-Finnish Europeans is one in 22,000. If we exclude FUK-CDG and MAN2B2-CDG, the prevalence in Europeans is slightly lower, one in 24,000. Compared to available data (Vals et al., 2018; Lipinski et al., 2021), the expected and observed prevalence differ. The observed prevalence is lower and might be underestimated because patients remain undiagnosed for reasons such as unrecognized phenotypes, negative screening results, or limited access to metabolic and molecular studies. In the Estonian population, the combined prevalence of 25 defects is 1:50,000. Therefore, with a yearly birth rate of 13,000-14,000, one individual with a protein N-glycosylation defect should be born about every 4 years, which is consistent with our clinical data.
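The expected interval between affected births in Estonia follows directly from the combined prevalence and the yearly number of births. A minimal sketch of that arithmetic, using the figures quoted above and an illustrative function name:

```python
def years_between_cases(prevalence_denominator, births_per_year):
    """Expected number of years between affected births for a 1:N prevalence."""
    return prevalence_denominator / births_per_year


# Combined Estonian prevalence of 1:50,000 and 13,000-14,000 births per year.
slowest = years_between_cases(50_000, 13_000)  # fewer births, longer wait
fastest = years_between_cases(50_000, 14_000)  # more births, shorter wait
print(f"one case every {fastest:.1f} to {slowest:.1f} years")
```

Both ends of the range round to about 4 years.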
In summary, we broaden the knowledge about N-linked protein glycosylation disorders, which hopefully helps raise awareness about the prevalence of this CDG subgroup.