GWAS of depression in 4,520 individuals from the Russian population highlights the role of MAGI2 (S-SCAM) in the gut-brain axis

We present the results of the depression Genome-wide association studies study performed on a cohort of Russian-descent individuals, which identified a novel association at chromosome 7q21 locus. Gene prioritization analysis based on already known depression risk genes indicated MAGI2 (S-SCAM) as the most probable gene from the locus and potential susceptibility gene for the disease. Brain and gut expression patterns were the main features highlighting functional relatedness of MAGI2 to the previously known depression risk genes. Local genetic covariance analysis, analysis of gene expression, provided initial suggestive evidence of hospital anxiety and depression scale and diagnostic and statistical manual of mental disorders scales having a different relationship with gut-brain axis disturbance. It should be noted, that while several independent methods successfully in silico validate the role of MAGI2, we were unable to replicate genetic association for the leading variant in the MAGI2 locus, therefore the role of rs521851 in depression should be interpreted with caution.

. Distributions of height and weight in the cohort and filtration thresholds used in the study. References 19

Cohort phenotyping
The data was collected from December 2017 to February 2020. Subjects with mood disorders symptoms were identified through self-reports through an online questionnaire and two phenotypic blocks were constructed. The study included respondents over 18 years old, both sexes, with a height between 140 and 220 cm and weight between 40 kg and 150 kg, who agreed to participate and provide their genetic information for the study.
1. HADS-based phenotypes. Categorical and quantitative phenotypes were based on the Hospital Anxiety and Depression Scale (HADS) 1 . The questionnaire had 7 questions, each of which was offered 4 options of answer, assessed from 0 to 3 points depending on the severity of the symptom.
2. DSM-based phenotypes. Alternative categorical phenotypes in our study were assessed with the original online self-questionnaire based on DSM-5 diagnostic criteria with funnel filters in the questions structure. We used diagnostic criteria for major depressive disorder (MDD) (lifetime and current phenotypes), bipolar disorder (BD) (lifetime and current depression phenotypes) and generalized anxiety disorder (GAD). The MDD group consisted of subjects who answered positively to questions №1 and №2 of the first block (subparagraphs 1.1 and 2.1 are not considered separately), and also selected at least 3 items in question №3. Subjects that marked "yes" for item №4 were excluded from the MDD group. Such subjects fell into the broad BD group. The GAD group consisted of subjects who answered positively to questions №1, №2, №3 of the second block, and also noted at least 3 items in question №4.

MDD and broad
Russian language adaptations of HADS and DSM scales were used in the study, which were previously validated in the Russian population 3,4 . Tightening the criteria for subjects from the MDD group in the form of obligatory two positive answers to questions №1 and №2, which correspond to the two main symptoms of depression according to the DSM-5, was done deliberately for greater homogenity of the sample. In addition, to exclude subjects with probable bipolar disorder from this group, the main criterion for a manic episode was added, but without additional symptoms, which allows us to speak only of an extended phenotype of this disorder.

Control groups
The main control group consisted of participants without any DSM-based (MDD, BD, GAD) or categorical HADS (HADS-Depression and HADS-Anxiety) phenotypes.

Genetic data processing, gene mapping and prioritization methods
The R package fastman 5 was used to build the Manhattan plot for the GWAS results, and the QQ-plot was generated using a custom script. The associated haplotype, lying between the CEU population recombination hotspots identified from HapMap3 data, was visualized with LocusZoom 1.4 6 . The package LDlinkR 7,8 was used to calculate the values of r2 and D for variants within the locus based on CEU data (from the 1000 Genomes Project, 2015 9 ). The package gwasforest 10 was used to generate the forest plots for the variants in the peak locus for comparison of their effect directionality and size in a range of depression GWAS studies.
The variants from the GWAS meta-analysis summary statistics on depression performed by Howard et al., 2019 11 , from the GWAS on ICD-coded MDD phenotype reported by Howard et al. in 2018 12 , and from the GWAS on the lifetime MDD phenotype from Cai et al., 2020 13 , which have surpassed the threshold p-values of 5x10 -8 , 1x10 -6 and 5x10 -5 correspondingly along with the variants associated with linear HADS-D scale with p-value <5x10 -6 have been linked to corresponding genes using POSTGAP 14 . The GPrior 15 preprocessing module was then used to summarize the POSTGAP results to acquire genelevel features. Furthermore, each gene has been annotated with its expression values in a range of brain cell types and regions using data from Dropviz 16 , NeuroExpresso 17 and Allen Brain Atlas 18 . The latter annotation was performed using the R package ABAData 19 . BiomaRT 20 was used to convert mouse gene symbols to human. True and validation sets of genes for GPrior were generated from the genes, which were reported as significantly associated with depression in the GWAS Catalog 21 . Feature importance analysis for prioritization was performed using the R package randomForest 22 .
Data on consensus normalized MAGI2 expression values in a range of tissues and antibody staining levels for its product between cell types were obtained from the human protein atlas [23][24][25] .
The R package fgsea 26 was used to estimate enrichment score of the IBS sigmoid colon differential expression data from the study by Videlock et al. 27 with the genes associated with the linear HADS-D scale. The KEGG pathway enrichment analysis was performed using the clusterProfiler 28 R package.
The genes, associated with the variants that achieved p-values of 5x10 -6 in depression GWAS based on the DSM scale (with the logistic model), and the HADS scale (with the linear and logistic models), along with the genes from the set of differential expression analysis in IBS mentioned above, were annotated with GO terms using the data from the ontologySimilarity package 29 . The terms were filtered by the GO IC supplied with this package (the threshold of 0.5 was used). Based on the obtained GO 30,31 annotations, frequency of the terms associated with each gene was computed (the package stopwords 32 and an additional custom list of stop words were used for filtering the words before the computation). Thus, frequencies of terms occurring in GO annotations of the corresponding genes were obtained for the DSM and HADS scales along with the IBS data. PCA was then performed to understand the differences between them based on the terms. The FactoMineR 33 library was used to perform the PCA and the factoextra43 package was used to visualize the results. Kmeans from the stats R package was used to cluster the terms based on their PCA coordinates. Finally, the GOSemSim 34 R package was used to perform a semantic similarity estimation between the sets of genes, associated with each scale model combination considered, and the IBS gene sets (the set of differentially expressed genes in the sigmoid colon in IBS and the set of genes from the cAMP WGCNA module from the study by Videlock et al.).
We used SUPERGNOVA to investigate local genetic covariance between IBD and depression using ieu-a-294 and depression data for HADS and DSM presented here. 145 approximately independent regions with common variants from 1000 Genomes project phase III (rare variants with MAF < 5% were filtered out), for which data were available in all studies, were used in the analysis. The regions were found among 2353 approximately independent regions generated with LDetect for European population provided with SUPERGNOVA 35 . Ggupset 36 was used to generate the plots comparing composition of covariated regions with ieu-a-294 in the depression GWAS. Hclust function (with default settings) from stats R package 37 was used to perform hierarchical clustering of the studied depression scales based on composition of nominally significant (p<0.05) covariated regions with IBD (ieu-a-294).
PRSice-2 38 was used to calculate PRS scores for IBD based on Khera, 2018 39 study (polygenic risk score (PGS) catalog ID PGS000017) and from Coleman et al., 2020 40 (PGS ID PGS000193). Glm function from stats R package was used to model the relationship between IBD PRS and HADS-D scores.

Imputation validation
To eliminate the possibility of imputation artifacts for the leading variant -rs521851, we examined directly genotyped variants in this locus and rs12112897 (p=8.95x10 -6 , beta=0.426) demonstrated similar effect size to the imputed leading variant. Furthermore, we evaluated the accuracy of the imputation by comparing allele frequencies for rs521851 in the studied cohort and a whole genome sequencing panel of 107 individuals sampled from the Russian population (clients of Genotek Ltd. not ascertained for phenotypic status at the time of recruitment), and no statistically significant differences were found (AF depression cohort = 0.1088; AF whole-genome sequencing = 0.1038; gnomad NFE AF = 0.1000) (Sup . Table S3).       (A) Comparison between intersection sizes of the set of the genes, associated with the variants with p-value<5e-6 in logistic HADS-D GWAS with differentially expressed genes in sigmoid colon mucus of IBS-C patients compared with healthy controls (ibs cs sample; the data is from Videlock et al., 2018) and 100 random sets of genes with the same sigmoid colon expression profile (random cs sample; expression data is obtained from the GTEx portal*). (B) Comparison between intersection sizes of the set of the genes, associated with the variants with p-value<5e-6 in DSM GWAS with differentially expressed genes in sigmoid colon mucus of IBS-C patients compared with healthy controls (ibs cs sample; the data is from Videlock et al., 2018) and 100 random sets of genes with the same sigmoid colon expression profile (random cs sample; expression data is obtained from the GTEx portal*). *GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct