A Perception on Genome-Wide Genetic Analysis of Metabolic Traits in Arab Populations

Despite dedicated nation-wide efforts to raise awareness against the harmful effects of fast-food consumption and sedentary lifestyle, the Arab population continues to struggle with an increased risk for metabolic disorders. Unlike the European population, the Arab population lacks well-established genetic risk determinants for metabolic disorders, and the transferability of established risk loci to this population has not been satisfactorily demonstrated. The most recent findings have identified over 240 genetic risk loci (with ~400 independent association signals) for type 2 diabetes, but thus far only 25 risk loci (ADAMTS9, ALX4, BCL11A, CDKAL1, CDKN2A/B, COL8A1, DUSP9, FTO, GCK, GNPDA2, HMG20A, HNF1A, HNF1B, HNF4A, IGF2BP2, JAZF1, KCNJ11, KCNQ1, MC4R, PPARγ, SLC30A8, TCF7L2, TFAP2B, TP53INP1, and WFS1) have been replicated in Arab populations. To our knowledge, large-scale population- or family-based association studies are non-existent in this region. Recently, we conducted genome-wide association studies on Arab individuals from Kuwait to delineate the genetic determinants for quantitative traits associated with anthropometry, lipid profile, insulin resistance, and blood pressure levels. Although these studies led to the identification of novel recessive variants, they failed to reproduce the established loci. However, they provided insights into the genetic architecture of the population, the applicability of genetic models based on recessive mode of inheritance, the presence of genetic signatures of inbreeding due to the practice of consanguinity, and the pleiotropic effects of rare disorders on complex metabolic disorders. This perspective presents analysis strategies and study designs for identifying genetic risk variants associated with diabetes and related traits in Arab populations.

diabetes (4,5), and metabolic syndrome (6) in the last few decades. Diabetes is sweeping through the Middle East; according to the International Diabetes Federation Atlas, 8th Edition (2017), the age-adjusted comparative prevalence of diabetes (18-99 years) in the Middle East and North Africa region is 10.5%, which is the second highest after the North America and Caribbean region. Up to 30% of native adult hospital visitors in Kuwait are afflicted with T2DM (4). The prevalence among adults in the countries of the Arabian Peninsula (Kingdom of Saudi Arabia 31.6%, Oman 29%, Kuwait 25.4%, Bahrain 25.0%, and United Arab Emirates 25.0%) is significantly associated with high per capita gross domestic product (GDP) and energy consumption (7). T2DM results from a complex interplay of adverse lifestyle exposures and genetic predisposition. Despite the high prevalence of T2DM, convincingly determined T2DM genetic risk variants are lacking for the Arab population, as are reports that sufficiently demonstrate the replication of established risk variants.
Studies illustrating a replication of established risk loci in the Arab population are generally based on targeted genotyping, whereas those failing to observe the established loci are GWA-based. GWA studies for quantitative traits in the Arab population in Kuwait failed to demonstrate established loci at genome-wide significance but instead led to the identification of novel risk loci (Table 2); these novel loci are often reported in the literature as associated with biological processes relating to the T2D traits (Table S2). Moreover, even at nominal p-values, very few established markers are identified in the data sets from the Kuwaiti population (Table 2-Items V-VI). The exemplary susceptibility gene loci, namely PPARγ, KCNJ11, TCF7L2, SLC30A, ABCC8, HHEX, CDKN2A, IGF2BP2, CDKAL1, and FTO, which are known to be well replicated in other ethnic population groups, were not identified in our studies on the Kuwaiti population. Similarly, only two established T2DM risk gene loci (CDKAL1 and TCF7L2) were identified using an imputed data set of 5,000,000 single nucleotide polymorphisms (SNPs) in the Lebanese population, which has a higher affinity to European populations than other Middle Eastern populations do (21). It was also observed that the established markers do not necessarily replicate across different Arab population groups. For example, Mtiraoui et al. (17) (Table 1) illustrated differences between the Levant Arabs (from Lebanon) and North African Arabs (from Tunisia) by demonstrating the association of TCF7L2 in both groups, IGF2BP2 and PPARγ exclusively in the Lebanese group, and KCNJ11 and SLC30A8 exclusively in the Tunisian group.
The partial overlap of established markers and the differential replication of established markers across Arab populations, along with the identification of novel loci, are also observed in the case of other complex disorders, such as rheumatoid arthritis, myocardial infarction/coronary artery disease, prostate cancer, and breast cancer (37)(38)(39)(40). Genetic association studies in ethnic populations such as Arabs are therefore of high interest, not only to re-confirm findings from other ethnic groups but also because they may lead to the identification of novel T2DM risk loci.

Study Cohorts
The above-mentioned observations in Arab populations are associated with the size of the study cohorts, which is considerably smaller than that used conventionally in global GWA studies (Table 1). This deficit affects the strength of the studies and leads to issues such as non-consideration of markers that have become rare in Arab populations. Other shortcomings of the study designs are inconsistencies (such as in age) between case and control cohorts and the failure to include, in the set of markers tested for replication, those that are in LD with established markers.

Differences in Phenotype Profiles Between Arab and Global Populations
While most of the well-characterized T2DM genes appear to be associated with β-cell dysfunction, diabetes observed in the Arab population is supposedly associated with obesity. This is underlined by the following observations in the Arab population: the prevalence of obesity in T2DM patients is high (41), Arab   (54)(55)(56)(57)(58). Examples of T2DM risk genetic loci, which are also associated with rare recessive disorders, are LIPC (Hepatic lipase deficiency), PDX1 (Pancreatic agenesis 1), ENPP1 (Hypophosphatemic rickets, also associated with obesity), WFS1 (Wolfram syndrome 1), and SLC2A2 (Fanconi-Bickel syndrome).

ANALYSIS STRATEGIES FOR GWA STUDIES IN ARAB POPULATIONS
The current analysis strategies for GWA studies in Arab populations and suggested extensions are presented in Figure 1 and are discussed below.

Participant Recruitment and Sample Size
The sample size has a linear relationship with the number of identified risk loci; a plateau has not been observed for any trait to date (68). Recruiting the required large number of participants from the intended sample population continues to be challenging in Arab countries (69,70). Clinical research in Arab countries experiences a lack of public outreach capabilities and coordination between research institutes and hospitals or Ministries of Health. This challenge can be circumvented by understanding public perception and attitude toward medical research and by seeking out means to increase public trust and awareness of clinical research in the Arab population.
The optimum sample size is determined by various factors, including the homogeneity of the population, prevalence of the disorder, variance in the trait measurements, genetic models used in the association tests, number of markers tested in the study, allele frequencies of the risk variants, effect sizes, genomic control inflation rates, desired Type I error rates, and type of study design (quantitative trait association, case-control studies with unrelated individuals or family-based trios, or sibling case-control designs) (71,72). A number of tools are available to calculate the optimum sample size. Commonly used tools include the Genetic Association Study (GAS) Power Calculator for one-stage genetic association studies (http://csg.sph.umich.edu/abecasis/cats/gas_power_calculator/) (73); the CaTS Power Calculator for two-stage genome-wide association studies (http://csg.sph.umich.edu//abecasis/CaTS/index.html) (74); Quanto for various study designs (including the matched case-control, case-sibling, case-parent, and case-only designs) (http://biostats.usc.edu/Quanto.html) (75); and QpowR for two-stage study designs with unrelated individuals (https://msu.edu/~steibelj/JP_files/QpowR.html).
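To illustrate how a few of these factors interact, the following minimal Python sketch approximates the sample size needed for a single-variant quantitative trait association test. It is not a substitute for the tools listed above: it assumes unrelated individuals, a 1-degree-of-freedom test, and a normal approximation, and the function name `required_sample_size` and its default values (genome-wide significance of 5e-8, 80% power) are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(var_explained, alpha=5e-8, power=0.80):
    """Approximate N needed to detect a variant that explains the fraction
    `var_explained` of trait variance (1-df test, normal approximation)."""
    if not 0.0 < var_explained < 1.0:
        raise ValueError("var_explained must lie strictly between 0 and 1")
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided significance threshold
    z_power = z.inv_cdf(power)              # quantile for the desired power
    # Required non-centrality is (z_alpha + z_power)^2 = N * h2 / (1 - h2)
    return ceil((z_alpha + z_power) ** 2 * (1.0 - var_explained) / var_explained)


if __name__ == "__main__":
    for h2 in (0.01, 0.005, 0.001):
        print(f"variance explained {h2:.3f}: N ~ {required_sample_size(h2)}")
```

Under these assumptions, a variant explaining 1% of trait variance requires roughly 3,900 individuals at genome-wide significance, and the requirement grows approximately in inverse proportion to the variance explained.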

Differences in the Extent of Inbreeding Among Subgroups in the Arab Population
The rate of consanguinity and the extent of resultant inbreeding differ among Arab countries as well as within a single country. For example, among the three genetic substructures of the Kuwaiti Arab population (76)(77)(78)(79), the subgroups of Saudi Arabian tribe ancestry and Persian ancestry exhibit high inbreeding coefficients (0.04226 and 0.025742, respectively), indicating endogamy, whereas the nomadic Bedouin subgroup exhibits a lower inbreeding coefficient (0.00274), indicating heterogamy (76). Similar observations have been made with population substructures in Qatar (80). The genetic heterogeneity between the subgroups warrants subgroup-specific genetic association analysis using large sample sizes, particularly for those with higher rates of inbreeding.
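For readers unfamiliar with the quantity, the sketch below shows one common way of estimating an individual inbreeding coefficient from genotype data, the method-of-moments estimator F = (O - E) / (M - E), where O is the observed number of homozygous genotypes, E the number expected under Hardy-Weinberg equilibrium, and M the number of genotyped markers. This is an illustration only; it is not claimed to be the procedure used to obtain the coefficients cited above, and the function name and input layout are assumptions.

```python
def inbreeding_coefficient(genotypes, allele_freqs):
    """Method-of-moments inbreeding coefficient for one individual.

    genotypes:    alternate-allele counts per SNP (0, 1, 2) or None if missing
    allele_freqs: alternate-allele frequency per SNP in the reference population
    """
    observed_hom = 0
    expected_hom = 0.0
    n_called = 0
    for g, p in zip(genotypes, allele_freqs):
        if g is None or not 0.0 < p < 1.0:
            continue  # skip missing calls and monomorphic SNPs
        n_called += 1
        if g != 1:
            observed_hom += 1
        # Probability of a homozygous genotype under Hardy-Weinberg equilibrium
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    if n_called == 0:
        raise ValueError("no informative SNPs")
    return (observed_hom - expected_hom) / (n_called - expected_hom)
```

Values near zero indicate random mating, positive values indicate an excess of homozygosity, as expected under consanguinity, and negative values indicate an excess of heterozygosity.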

Comorbid Conditions as Confounders
Diabetes is often comorbid with other complex chronic disorders (81). The most prevalent comorbid disorders that occur in "concordant" form, wherein both disorders represent parts of the same overall pathophysiological risk profile, include hypertension, coronary artery disease, and peripheral vascular disease (82). An example of comorbid disorders being influenced by unique environmental factors is the increased risk for diabetes in patients with learning disabilities, physical or sensory disabilities, and mental health problems. The co-occurrence of schizophrenia and diabetes may partly be driven by shared genetic factors (64). Comorbidities can also be consequences of hyperglycemia. Tests for genetic associations need to consider comorbid conditions. To advance metabolic disorder research in the Arabian Peninsula, two steps are immediately required: the identification, from existing literature, of overlaps between metabolic disorders and both rare genetic and common disorders, and the formation of a catalog of causal factors/markers for overlapping traits. Careful and extensive phenotyping in relation to rare and less frequent disorders must be prioritized during sampling. These variables can then be used as covariates in association tests.

Genetic Models Based on Recessive Mode of Inheritance for Association Tests
Consanguinity in successive generations cumulatively increases inbreeding levels, the frequency of recessive genotypes, and the proportion of homozygous gene regions in Arab populations (76,83). An overwhelming proportion (63%) of the disorders documented in the Catalog for Transmission Genetics in Arabs (CTGA) (84) follows a recessive mode of inheritance. Therefore, it is not surprising that the risk variants reported in our studies appeared when the genetic model based on a recessive mode of inheritance was used (Table 2), and that the majority of the risk variants were harbored in runs of homozygosity (ROH) segments (24). However, when established markers appeared in the Kuwaiti population data sets, it was invariably when the association tests used additive models. Genetic predisposition to T2DM clearly does not follow a recessive mode of inheritance, although in certain populations homozygosity of susceptibility alleles can further increase the population prevalence of the disease. It is recommended that population genetics studies in the Arab region use recessive models in addition to additive models, especially when a large number of homozygous segments and/or recessive risk genotypes are observed in a sufficiently large number of individuals in the study cohort.
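The difference between the two models lies only in how genotypes are coded before regression. The following minimal Python sketch (function names and the toy data are illustrative assumptions, and covariate adjustment is omitted) codes the count of risk alleles as 0/1/2 under the additive model and as 1 for homozygous carriers versus 0 otherwise under the recessive model, and then fits a simple least-squares slope.

```python
def encode(genotypes, model):
    """Code risk-allele counts (0, 1, 2) for an additive or recessive test."""
    if model == "additive":
        return [float(g) for g in genotypes]
    if model == "recessive":
        # Only individuals homozygous for the risk allele are coded as carriers
        return [1.0 if g == 2 else 0.0 for g in genotypes]
    raise ValueError(f"unknown model: {model}")


def ols_slope(x, y):
    """Least-squares slope of y on x (per-allele or per-genotype effect)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


if __name__ == "__main__":
    genotypes = [0, 0, 1, 1, 2, 2]
    trait = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0]  # only homozygotes are shifted
    for model in ("additive", "recessive"):
        print(model, ols_slope(encode(genotypes, model), trait))
```

In this toy example, where only homozygous carriers differ in trait value, the recessive coding recovers the full shift of 2 units, whereas the additive coding spreads it into a per-allele effect of about 1 unit that fits the heterozygotes poorly.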

Estimation of Acceptable Effect Size
Large sample sizes enable the identification of causal variants with small effect sizes in GWA studies. Although the estimation of sample sizes for a 2-stage design (discovery phase followed by replication) is generally strictly followed in GWA studies, the estimation of the true effect size explainable by a given sample size is often neglected. Under additive models (wherein heterozygous and rare homozygous genotypes are tested against reference homozygous genotypes with an additive effect), the proper selection of associated variants in the discovery phase is not usually difficult, because such models do not unusually inflate the effect size values for common markers. However, under the recessive model, variants that show ≥5% frequency (i.e., associated with few rare homozygous genotypes) can show unusually large effect sizes. Therefore, an estimation of the acceptable effect size for a desired percentage of variance in a given trait (mean ± SD) and sample size at 80% power is pivotal for restricting SNP associations with undesirably high effect sizes and inappropriate p-values resulting from the recessive effect.
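One possible way of deriving such a bound is sketched below in Python; it is an illustration under stated assumptions rather than the procedure used in our studies. The sketch first computes the smallest fraction of trait variance detectable with a given sample size at 80% power (normal approximation, 1-df test) and then converts that fraction into the effect size it implies for homozygous carriers, using the relationship variance explained = beta^2 f (1 - f), where f = q^2 + Fq(1 - q) is the frequency of the homozygous risk genotype for risk allele frequency q and inbreeding coefficient F. All function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_variance_fraction(n, alpha=5e-8, power=0.80):
    """Smallest fraction of trait variance detectable with n samples."""
    z = NormalDist()
    ncp = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)
    return ncp ** 2 / (n + ncp ** 2)


def recessive_effect_for_variance(var_fraction, risk_allele_freq,
                                  trait_sd=1.0, inbreeding=0.0):
    """Effect (in trait units) of the homozygous risk genotype that
    corresponds to explaining `var_fraction` of the trait variance."""
    q = risk_allele_freq
    f = q * q + inbreeding * q * (1.0 - q)  # homozygote frequency
    if not 0.0 < f < 1.0:
        raise ValueError("homozygote frequency must lie between 0 and 1")
    return trait_sd * sqrt(var_fraction / (f * (1.0 - f)))


if __name__ == "__main__":
    h2 = detectable_variance_fraction(2000)
    for q in (0.05, 0.10, 0.30):
        beta = recessive_effect_for_variance(h2, q)
        print(f"allele frequency {q:.2f}: effect of about {beta:.2f} SD")
```

With 2,000 individuals, a recessive variant with a risk allele frequency of 5% would need an effect of nearly 3 SD in homozygous carriers to be detectable at genome-wide significance, which illustrates why associations driven by a handful of rare homozygotes report very large effect sizes and deserve scrutiny.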

Joint Analysis-Combining Results From Discovery and Replication Phases
GWA studies follow a 2-stage design irrespective of whether they address quantitative or binary trait association. A strategy involving the joint analysis of data from both stages, which is increasingly used, has been advocated as providing an increased capability of detecting genetic associations (74). Such a meta-analysis increases the capacity for detecting weak genotypic effects. An example of the resultant increased power is observed in the case of three established markers from CETP (rs3764261, rs1864163, and rs1800775), which do not attain genome-wide significance in the discovery cohort of our studies but do so in the joint analysis (Table 2-Item V).
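The gain in power from combining stages can be illustrated with a fixed-effect, inverse-variance weighted meta-analysis, which is one common way of combining per-stage estimates; it is shown here as a minimal Python sketch and is not necessarily the exact joint-analysis procedure of (74). The function name and the numerical inputs are illustrative assumptions.

```python
from math import erfc, sqrt


def fixed_effect_meta(estimates):
    """Combine (beta, standard_error) pairs by inverse-variance weighting.

    Returns the pooled beta, its standard error, and a two-sided p-value.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = sqrt(1.0 / total_weight)
    z = beta / se
    p_value = erfc(abs(z) / sqrt(2.0))  # two-sided normal tail probability
    return beta, se, p_value


if __name__ == "__main__":
    discovery = (0.10, 0.02)    # hypothetical effect and standard error
    replication = (0.10, 0.02)
    print("discovery only:", fixed_effect_meta([discovery]))
    print("joint analysis:", fixed_effect_meta([discovery, replication]))
```

In this hypothetical example, neither stage alone reaches the genome-wide threshold of 5e-8 (z = 5 in each stage), whereas the pooled estimate does, because the pooled standard error shrinks by a factor of the square root of 2.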

Heteroscedasticity and Trait Transformation
Quantitative traits (associated with complex disorders) used in association studies often violate the assumption of normality. Methods commonly used to handle non-normal traits include natural logarithm and rank-based inverse-normal transformations (85). Although the merits of such transformations are questionable (86), their use has been increasing. Elaborate assessments regarding the extent to which such transformations mask the true phenotype variability are lacking in the literature. Owing to high inbreeding and the prevalence of autosomal recessive disorders in the Arab population (87), the segregation of rare homozygous alleles may exert relatively larger effects on quantitative trait variations. Hence, earmarking any outlying data dispersion as heteroscedastic may conceptually be incorrect. Furthermore, performing trait transformations may mask the actual effects of such loci on the variation of quantitative traits. Thus, to avoid false positive or negative associations, it is advised to perform association tests, using appropriate genetic models, with both the raw and transformed traits, and to simultaneously adjust the models for disorders (rare and common) that overlap with the metabolic syndrome.
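The masking effect can be made concrete with a minimal Python sketch of the rank-based inverse-normal transformation. The Blom offset of 3/8 used here is a common convention and is an assumption, as the offset is not specified in the text; the function name is likewise illustrative.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.375):
    """Rank-based inverse-normal transformation (ties get the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the group of tied values starting at i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    z = NormalDist()
    return [z.inv_cdf((r - offset) / (n - 2.0 * offset + 1.0)) for r in ranks]


if __name__ == "__main__":
    mild = rank_inverse_normal([4.8, 5.1, 5.3, 5.6, 6.0])
    extreme = rank_inverse_normal([4.8, 5.1, 5.3, 5.6, 19.0])
    print(mild[-1], extreme[-1])
```

Because only the ranks enter the transformation, the largest value is mapped to the same score whether it is 6.0 or 19.0. A carrier of a rare homozygous genotype with a genuinely extreme trait value therefore becomes indistinguishable from an ordinary upper-range observation, which is the reason for testing both raw and transformed traits.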

Relatedness and Loss of Samples
The imprecise modeling of genetic relatedness and population stratification among study subjects results in a substantial inflation of test statistics and spurious associations. Moreover, randomly ascertained sample sets in the Arab population exhibit extensive relatedness due to the prevalent practices of polygamy and consanguinity. Therefore, detailed quality control procedures for relatedness and admixture must be performed in Arab studies. Accordingly, our studies are performed using the following steps: (i) assessment of relatedness among participants to the extent of third-degree relatives and removal of one sample per pair of related participants, (ii) ancestry estimation using ADMIXTURE (88) and removal of samples that deviate abnormally from the component ancestry elements previously established for the three Kuwaiti population subgroups (76), and (iii) delineation of principal components using EIGENSTRAT (89) and removal of outlying samples. These exhaustive steps, aimed at reducing false positive findings, lead to a substantial loss of samples. Thus, although the use of unrelated individuals in GWA studies is the norm, the use of recent sophisticated algorithms [such as BOLT-LMM (90), FaST-LMM (91), and EMMAX (92)], which account for kinship structures and ancestries within a sample set, may offer larger power to the study by retaining more samples.
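The sample loss incurred in step (i) depends on how related pairs are resolved. The Python sketch below shows a greedy variant that repeatedly removes the participant with the largest number of related partners, which tends to retain more samples than dropping one member of every pair independently; it is an illustration and not the exact procedure used in our studies. The kinship threshold of 0.0442, commonly used as the lower bound for third-degree relatives, the input format, and the function name are assumptions.

```python
def prune_related(samples, kinship_pairs, threshold=0.0442):
    """Greedily remove samples until no related pair remains.

    samples:       list of sample identifiers
    kinship_pairs: iterable of (sample_a, sample_b, kinship_coefficient)
    threshold:     pairs at or above this kinship are treated as related
    """
    related = {s: set() for s in samples}
    for a, b, kinship in kinship_pairs:
        if kinship >= threshold and a != b and a in related and b in related:
            related[a].add(b)
            related[b].add(a)
    removed = set()
    while related:
        # Sample with the most related partners still in the data set
        worst = max(related, key=lambda s: len(related[s]))
        if not related[worst]:
            break  # no related pairs are left
        for partner in related.pop(worst):
            related[partner].discard(worst)
        removed.add(worst)
    kept = [s for s in samples if s not in removed]
    return kept, removed


if __name__ == "__main__":
    pairs = [("A", "B", 0.25), ("A", "C", 0.06), ("C", "D", 0.01)]
    print(prune_related(["A", "B", "C", "D"], pairs))
```

In the toy example, removing the single participant A, who is related to both B and C, resolves both related pairs and keeps three of the four samples. Mixed-model methods avoid even this loss by modeling the kinship rather than discarding individuals.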

Quantitative Trait Association Studies Using "All" Diabetic Cohort
Quantitative trait association studies are usually conducted in population-based cohorts comprising both people with diabetes and individuals who are diabetes-free at the time of participant recruitment. Association tests are usually adjusted for obesity, diabetes, and medication regimen. However, it is important to extend the studies to cohorts composed entirely of diabetic or prediabetic participants so that the identified genetic determinants become more readily applicable to diabetes care and treatment (93).

Case for Whole-Genome Homozygosity Association (WGHA) Methodologies
High inbreeding within the Arab population makes it a promising resource for the discovery of ROH and segments of identity by descent (IBD). ROH indicates an ancient shared common ancestry, whereas IBD indicates a recent ancestry. Both features are reportedly effective in delineating population demography and recessive components of Mendelian and complex phenotypes (76). The risk loci identified in our studies often overlay ROH regions, some of which are "novel." Currently, a promising new concept of "whole-genome homozygosity association (WGHA) methodology" for identifying genetic susceptibility loci harboring recessive variants (94) is being developed. Exemplary works include the identification of "risk ROH" for schizophrenia (95) and adult height (96). Tools such as LOHAS (97), which use either whole-genome sequence or genotype data in cohorts of either related or unrelated individuals, are now available for performing WGHA tests under both case-control and quantitative trait association study designs.
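As a simplified illustration of what an ROH segment is, the Python sketch below scans the genotypes of one individual along one chromosome and reports uninterrupted stretches of non-heterozygous calls that exceed a minimum number of SNPs and a minimum physical length. Established tools use sliding windows and tolerate occasional heterozygous or missing calls; this sketch does not, and the function name and thresholds are assumptions chosen only for demonstration.

```python
def find_roh(positions, genotypes, min_snps=50, min_length_bp=1_000_000):
    """Return (start_bp, end_bp, n_snps) for each run of homozygosity.

    positions: base-pair positions, sorted, on a single chromosome
    genotypes: allele counts per SNP (0, 1, 2); None (missing) extends a run
    """
    segments = []
    start = None  # index of the first SNP in the current run
    # A trailing heterozygous sentinel closes a run that reaches the last SNP
    for i, g in enumerate(list(genotypes) + [1]):
        if g == 1:
            if start is not None:
                end = i - 1
                n_snps = end - start + 1
                length = positions[end] - positions[start]
                if n_snps >= min_snps and length >= min_length_bp:
                    segments.append((positions[start], positions[end], n_snps))
                start = None
        elif start is None:
            start = i
    return segments


if __name__ == "__main__":
    pos = [i * 100_000 for i in range(20)]
    calls = [0] * 8 + [1] + [2] * 11
    print(find_roh(pos, calls, min_snps=10, min_length_bp=1_000_000))
```

In a homozygosity association test, the presence or absence of such a segment at a given locus, rather than the genotype at a single SNP, serves as the predictor tested against case-control status or a quantitative trait.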

Case for Family-Based Genetic Association Studies in Arab Population
The large number of T2DM genetic loci identified to date using unrelated people explains only a relatively small proportion of the observed heritability (familial clustering) of T2DM (8,98). Possible explanations for the "missing heritability" include the role of rare variants, copy number variants, indels and more complex rearrangements, gene-environment interactions, and epigenetics (98)(99)(100)(101)(102). Family-based designs allow the segregation of rare variants in a pedigree; multiple copies of such rare variants facilitate the detection of their effects. Family-based studies require fewer samples than population-based studies and offer advantages in terms of quality control, robustness to population stratification, and uniformity in exposure to environmental factors or lineage-specific diseases. They also offer the potential to combine linkage and association data. The Arab population, which is largely consanguineous, offers a large potential for family-based designs, as the population can show familial gene clustering for diabetes and metabolic traits. However, except for a few studies, such as the "Oman Family Study" (103)(104)(105)(106) and the study on an extended family from the UAE by Al Safar et al. (22), no notable familial study for diabetes risk loci has been reported in the Arab population. Both of the abovementioned studies confirmed well-established gene loci but failed to identify any novel "rare" variants. Considerable attention needs to be paid to appropriate study designs, as family data continue to provide important information in the search for trait loci (107). Ideally, recruitment should target large pedigrees/extended families, particularly those containing several sub-families suitable for either parent-offspring or sibling designs, with high inbreeding and roots traceable up to at least six generations with deduced consanguinity data.

Epigenetic Mechanisms of T2DM Genetic Risk Factors and Environmental Factors in Arab Population
As mentioned earlier, the post-oil era brought a rapid shift in the eating and physical activity habits of the Arab population. Environmental and lifestyle factors (including diet, obesity, physical activity, tobacco smoking, and environmental pollutants) can influence epigenetic mechanisms, such as DNA methylation, histone acetylation, and microRNA expression; these modifications can result in altered gene expression with effects on the regulation of specific genes. Epigenome-wide association studies (EWASs) that examine the role of epigenetic modifications in the etiology and progression of metabolic disorders (108)(109)(110)(111)(112) and diabetes (113)(114)(115)(116) have recently emerged. Most such EWASs of T2D and obesity have focused on Caucasian populations; however, a recent study on the Arab population from Qatar (117) identified one novel CpG association at DQX1 at genome-wide significance for T2D and replicated eight previously reported associations involving

FIGURE 2 | An integrative approach as direction of future T2DM genetics research in Arab populations. EWAS, epigenome-wide association studies; GWAS, genome-wide association studies; WGS, whole-genome sequencing; miR-eQTL, microRNA expression quantitative trait loci; meQTL, methylation quantitative trait loci; eQTL, expression quantitative trait loci; CNV, copy number variation; SNP, single nucleotide polymorphism.

AN INTEGRATIVE APPROACH AS DIRECTION OF FUTURE T2DM GENETICS RESEARCH IN ARAB POPULATIONS
It has become increasingly evident that epigenetics, genetics, and environment are likely to interact with one another to define an individual's risk of diabetes and obesity (118). Integration of data on expression quantitative trait loci (eQTL), which represent regulatory loci, with genetic variants identified from GWA studies can give new insights into the identification of causal genes for T2DM (119,120). The ability of epigenetic modifications and of miRNA expression (and, more broadly, of noncoding RNAs) to modulate gene expression has enabled the incorporation of such data in research on the pathogenesis of T2DM (121)(122)(123). The consideration of expression data and epigenome data, along with large-scale GWA data on genotyped and imputed SNPs and copy number variations, in association studies for T2DM is depicted as a future direction for diabetes research (Figure 2).

CONCLUSION
The failure to convincingly replicate a large number of Euro-centric risk variants for T2DM in Arab populations may have resulted from several factors, including study design and strength, the low prevalence of causative Euro-centric risk variants in the Arab population, or gene-environment interactions that masked the effect of the Euro-centric risk variants. Moreover, epidemiological studies have illustrated the lack of global risk assessment tools fitted to the Arab population (124). The performance of global genetic risk assessment tools (based on Euro-centric markers) in other populations is also questionable (125). The discrepancy in the applicability of Euro-centric genetic risk variants to the Arab population could be resolved by performing large-scale genome-wide surveys (a combination of GWAS, exome and genome sequencing, and imputation) of the Arab population with diabetes. Detailed functional assessments should also be performed for loci that are identified in the Arab population and that interact with Euro-centric risk loci as part of common gene networks or physiologic processes.

AUTHOR CONTRIBUTIONS
TT and FA-M conceptualized the study design. TT and PH developed the manuscript. All the authors participated in discussions. JA, MA-F, JT, and FA-M critically reviewed the manuscript.

FUNDING
The study was funded by the Kuwait Foundation for Advancement of the Sciences (KFAS) (Dasman Diabetes Institute project number RA 2016-026).