Candidate Regulators of Dyslipidemia in Chromosome 1 Substitution Lines Using Liver Co-Expression Profiling Analysis

Dyslipidemia is a major risk factor for cardiovascular disease. Although many genetic factors have been unveiled, a large fraction of the phenotypic variance still needs further investigation. Chromosome 1 (Chr 1) harbors multiple gene loci that regulate blood lipid levels, and identifying functional genes in these loci has proved challenging. We constructed a mouse population, Chr 1 substitution lines (C1SLs), where only Chr 1 differs from the recipient strain C57BL/6J (B6), while the remaining chromosomes are unchanged. Therefore, any phenotypic variance between C1SLs and B6 can be attributed to the differences in Chr 1. In this study, we assayed plasma lipid and glucose levels in 13 C1SLs and their recipient strain B6. Through weighted gene co-expression network analysis of liver transcriptome and “guilty-by-association” study, eight associated modules of plasma lipid and glucose were identified. Further joint analysis of human genome wide association studies revealed 48 candidate genes. In addition, 38 genes located on Chr 1 were also uncovered, and 13 of which have been functionally validated in mouse models. These results suggest that C1SLs are ideal mouse models to identify functional genes on Chr 1 associated with complex traits, like dyslipidemia, by using gene co-expression network analysis.


INTRODUCTION
Plasma lipid levels of total cholesterol (CHOL), high-density lipoprotein cholesterol (HDL-C), Low-density lipoprotein cholesterol (LDL-C), and triglycerides (TG), are major contributors to cardiovascular diseases (Kathiresan et al., 2007). Current evidence demonstrates that both environmental and genetic factors contribute to these lipid levels. Therefore, discovery of the genetic regulators would be beneficial to determine individual susceptibility to dyslipidemia and eventually for developing gene therapies.
Recent genome wide association studies (GWAS) in humans have linked hundreds of genetic loci to plasma lipid metabolism, including genes APOE, PCSK9, CETP, LIPC, LPL, and APOA5 (Willer et al., 2013a;Helgadottir et al., 2016). Furthermore, several rare variants have been uncovered with next generation sequencing technology (Natarajan et al., 2018). Although significant achievements have been made, the identified genetic loci only explain a small portion of the phenotypic variance, suggesting most of the genetic regulators remain unknown.
Mouse models have been widely used for deciphering regulatory genes of quantitative traits. Hundreds of genetic loci have been identified through quantitative trait loci (QTL) mapping in F2 or backcross mouse populations (http://www. informatics.jax.org/). However, it's challenging to identify causative genes within QTLs. During the past decade, mouse genetic reference populations, such as BXD recombinant inbred strains , Collaborative Cross (Consortium, 2012), Hybrid Mouse Diversity Panel (Ghazalpour et al., 2012), and chromosome substitution strains (CSSs) (Nadeau et al., 2000), have significantly accelerated the precise QTL mapping and functional gene identification through improved mapping power and resolution (Buchner and Nadeau, 2015). CSSs, which typically involve two inbred strains with significant phenotypic differences, are a panel of inbred strains by backcrossing the donor and recipient parents over 10 generations. The final panel contains the entire genome information of both strains, and each CSS carries one intact donor chromosome in the genetic background of the recipient strain. Therefore, any phenotypic differences between CSSs and recipient strain can be attributed to the substituted chromosome. This allows for easy detection of genes for multi-genic traits and quick identification of QTLs through linkage analysis in F2 population and fine mapping with congenic strains. Previously, we proposed a novel strategy of developing a Chr 1-specific CSS substitution line (C1SL) to dissect the complex traits. With this strategy, Chr 1 of the recipient strain C57BL/6J (B6) was replaced by different wild mice individually (Xiao et al., 2010;Xu et al., 2016). Compared to CSSs, C1SLs are suitable for both association studies and systems genetics analysis.
It is well known that genes do not act in isolation, but interact with one another to regulate complex traits. In addition, coexpressed genes usually have similar biological functions or are involved in same biochemical pathways. Therefore, building gene networks would provide an alternate way to identify potential regulators and gain insight into the underlying mechanisms of lipid metabolism (Stuart et al., 2003). To date, several algorithms have been developed to construct gene networks (Henry et al., 2014), and weighted gene co-expression network analysis (WGCNA) is the most widely used (Langfelder and Horvath, 2008). In addition to constructing gene networks, this method also allows one to summarize hub genes and module eigengenes (MEs). These can be used to subsequently identify trait-associated modules by performing "guilt-by-association" between phenotypes and eigengenes.
Several studies have demonstrated that Chr 1 harbors multiple genetic loci that regulate plasma lipid and glucose levels (Orozco et al., 2009;Leduc et al., 2011). In order to identify the casual genes, we measured plasma lipid and fasting glucose levels in C1SLs and quantified transcriptome levels of liver with RNA-seq technique. By combining gene co-expression network analysis with human GWAS and gene functional annotation, several plasma lipid and glucose regulating candidate genes, especially those located on Chr 1, were identified ( Figure 1).

Mice and Diet
All animal procedures were performed in accordance with guidelines of the Laboratory Animal Committee of Donghua University. 13 C1SLs and one B6 strain of adult male mice (an average of seven per strain; n = 97) were housed in a room maintained at 18-22°C with a 12-h light and 12-h dark cycle (6:00 A.M. to 6:00 P.M.). All animals were given a chow diet (M01-F25; Shanghai SLAC Laboratory Animal, Shanghai, China) for eight weeks, then fed with D12450B diet containing 4.3% fat, 19.2% protein, and 67.3% carbohydrate (Research Diets, New Brunswick, USA) until sacrificed by cervical dislocation at 20 weeks of age.

Experiment Measurement
Blood was collected into 1.5-ml tubes with EDTA by retroorbital bleeding from mice fasted for 4 h in the morning. Blood serum was separated by centrifugation at 2,500g for 15 min and frozen at −20°C until performing cholesterol enzymatic assays assay. Enzymatic assays for CHOL, HDL-C, LDL-C, TG, and glucose (GLU) were performed with biochemical blood analyzer (Hitachi 7180; Hitachi, Tokyo, Japan) by Sino-British SIPPR/ B&K Lab Animal (Shanghai, China).

RNA Isolation and Quality Control
RNA was extracted from liver tissues using RNAiso Plus reagent (TaKaRa Biotechnology, Dalian, China) according to the manufacturer's protocol. RNA quality was analyzed using NanoDrop 2000c and Bioanalyzer. Samples with A260/A280 of 1.8-2.0 and RNA integrity number greater than 8 were subsequently used for sequencing library preparation.

RNA Library Preparation and Sequencing
Twenty nine mRNA samples (two samples per strain except for strain LY) were used for RNA library preparation and sequencing. The poly(A) mRNA isolation was performed using Poly(A) mRNA Magnetic Isolation Module or rRNA removal Kit. The mRNA fragmentation and priming was performed using First Strand Synthesis Reaction Buffer and Random Primers. First strand cDNA was synthesized using ProtoScript II Reverse Transcriptase and the second-strand cDNA was synthesized using Second Strand Synthesis Enzyme Mix. The purified double-stranded cDNA by beads was then treated with End Prep Enzyme Mix to repair both ends and add a dA-tailing in one reaction, followed by a T-A ligation to add adaptors to both ends. Size selection of Adaptor-ligated DNA was then performed using beads, and fragments of ∼420 bp (with the approximate insert size of 300 bp) were recovered. Each sample was then amplified by PCR for 13 cycles using P5 and P7 primers, with both primers carrying sequences which can anneal with flow cell to perform bridge PCR and P7 primer carrying a six-base index allowing for multiplexing. The PCR products were cleaned up using beads, validated using an Qsep100 (Bioptic, Taiwan, China), and quantified by Qubit3.0 Fluorometer (Invitrogen, Carlsbad, USA). Then libraries with different indices were multiplexed and sequenced on Illumina X-ten instrument (Illumina, San Diego, USA) by GENEWIZ (Suzhou, China) according to the manufacturer's instructions. Sequencing was carried out using a 2x150bp paired-end (PE) configuration.
FIGURE 1 | Schematic of the methodology. A total of 14 strains (13 C1SLs and one recipient strain B6) were involved in this study. The upper panel shows the characteristics of C1SLs genome background. Orange bars represent B6 chromosome while the others represent different donor chromosomes from wild mice. Blood lipid and fasting glucose levels were measured at 20 weeks of age. Liver gene co-expression network was constructed with WGCNA. The trait-associated modules were identified through testing the association between traits and MEs. The candidate genes were further dominated by integrating human GWAS and HMDC data.

Reads Mapping and Quantification
Reads were aligned to the mouse reference genome (GRCm38) using Tophat2 (Kim et al., 2013) with default parameters. The Cufflinks program Cuffnorm (Trapnell et al., 2010) was used to generate tables of expression values (Fragments Per Kilobase of transcript per Million mapped reads, FPKM) which were normalized for library size based on GRCm38 gene annotation downloaded from iGenome (https://support.illumina.com/ sequencing/sequencing_software/igenome.html). Expression data were further filtered to remove genes that had less than 1 FPKM in 20% or more samples and then log-transformed with log2 (FPKM+1).

Weighted Gene Co-Expression Network Analysis (WGCNA)
Log2 transformed expression values were analyzed with WGCNA package in R (Langfelder and Horvath, 2008) to construct gene co-expression networks. Briefly, a correlation matrix was obtained by calculating pair-wise Pearson correlation coefficients between all genes across all samples. Then, a soft thresholding power b = 6 was chosen based on scale-free topology (R 2 > 0.9) to generate weighted adjacency matrix. The adjacency matrix was further transformed into Topological Overlap Matrix which assesses transcript interconnectedness. Following this, a dissimilarity measure was calculated. Genes were aggregated into modules by hierarchical clustering based on Topological Overlap Matrix and further refined using the dynamic tree cut algorithm. ME is the first principal component of a given module, and it was used to evaluate the module membership, which assessed the importance of genes in the network.

Candidate Gene Analysis Using Publicly Available Resources
We prioritized the candidates using the following public resources:  (Macarthur et al., 2016). GRASP includes available genetic association studies with p value <0.05. GWAS Catalog collects SNP-trait associations with p value <1 × 10 −6 . In the present study, mapped genes or genes nearest to the marker with p value < 1 × 10 −4 were used to looking for overlap with module gene lists. 3. Gene expression atlas across mouse tissue. Gene expression profiles for 22 mouse tissues, which were generated by the Mouse ENCODE project using RNA-seq (Yue et al., 2014), were queried from NCBI (https://www.ncbi.nlm.nih.gov/).
We define genes with "high liver expression" as those with an expression level in liver greater than threefold of the mean expression value across the 22 tissues.

Identification of Genetic Variants for the Candidate Genes
Genetic variants between C1SLs and B6 were identified with whole genome sequencing as previously described (Xu et al., 2016). Variant annotation was performed using Variant Effect Predictor (Mclaren et al., 2016).

C1SLs Exhibits Broad Phenotypic Variability in Plasma Lipid and Fasting Glucose Levels
In this study, plasma lipid (CHOL, HDL-C, LDL-C, and TG) and fasting glucose levels of 13 C1SLs and one recipient strain B6 were examined using enzymatic assays ( Figure 2A). Assay results demonstrate broad phenotypic variability with fold change 1.62 in GLU, 1.55 in CHOL, 1.51 in HDL-C, 2.11 in LDL-C and 1.58 in TG (Figures 2B-F and Supplementary Data S1). Compared to the C1Sls, recipient strain B6 showed relatively low levels of GLU, CHOL, HDL-C, and LDL-C and a high level of TG.

WGCNA Identifies Several Modules Significantly Associated With Plasma Lipid and Fasting Glucose Levels
We carried out high throughput RNA-seq using Ilumina X-ten platform to comprehensively quantify the gene abundance of liver tissue for 29 samples. A total of~2.3 billion reads were obtained, ranging from 26 million to 0.42 billion per sample (Supplementary Data S2). The raw reads were mapped onto the mouse genome with an average of 80% of the read pairs that are properly assigned. Gene expression levels were generated and normalized with Cuffnorm program. Further filtration was applied (See Materials and Methods), which resulted in 10,525 genes for subsequent analysis (Supplementary Data S3).
To identify regulatory genes for plasma lipid and glucose levels. We constructed gene co-expression networks using WGCNA. With the soft-thresholding power parameter (b = 6) determined by the scale-free topology ( Figures 3A, B), a total of 24 modules (after excluding module gray) were identified ( Figures 3C, D and Supplementary Data S3). The module size (i.e., the total number of genes in a module) varies significantly, ranging from 39 genes in module M5 to 2,141 genes in module M24. Among those modules, M19 (83 genes) is significantly associated with all five traits ( Figure  3D), while M1 (930 genes), M14 (99 genes), M20 (389 genes), M21 (117 genes) are significantly linked to TG level and M7 (491 genes), M8 (311 genes), and M12 (247 genes) are associated with fasting glucose level ( Figure 3D).

Gene Prioritizing of Trait-Associated Modules
M19 is the module most significantly associated with all of the traits. There are 83 genes in this module, and 74 are significantly correlated (p < 0.05) with these traits and MEs simultaneously (Supplementary Data S4). Gene ontology (biological process) enrichment analysis revealed that these genes are significantly enriched in lipid metabolism and gluconeogenesis regulation (Supplementary Data S5). In addition, 14 genes are found in human GWAS with p value <1 × 10 −4 ( Figure 4A and Supplementary Data S6) and 11 genes are known regulators for blood lipid or glucose metabolism ( Figure 4C). Four of them, Creg1, Abcc3, Cyp2b9, and Cyp26a1, are highly expressed in liver (Supplementary Figure 1). More importantly, the module hub gene, Tmem176a, is significantly correlated with blood lipid levels ( Figure 4B and Supplementary Data S4) and have been mapped to CHOL in human GWAS with a p value of 2 × 10 −8 (Supplementary Data S6).
For other modules associated with fasting glucose level (M7, M8, and M12), 377 genes are significantly associated with fasting glucose and their MEs simultaneously (Supplementary Data S8). Among them, eight genes are found in human in GWAS ( Figure 6A and Supplementary Data S6), and 27 genes are known glucose metabolism regulators ( Figures 6B-D). Furthermore, three of them, Pck1, Fads1, and Gckr, are highly expressed in liver (Supplementary Figure 1).

Prioritizing Causative Genes on Chr 1
Due to the fact that C1SLs only differ from B6 strain by one chromosome (Figure 1), we believe the phenotypic differences are partly driven by the genetic variations on Chr 1. In this study, a total of 38 genes in the trait-associated modules were found to be located on Chr 1. Of which, 35 harbor missense single nucleotide polymorphisms (SNPs), and all have 3' or 5' UTR variants ( Table 1). In addition, several genes have been associated with the traits in mouse models, including Creg1 and Aox1 in module M19; Phlpp1, Nr5a2, Rnf149, Ncoa2, and Abl2 in module M1; Mogat1, Igfbp2, and Col3a1 in module M20; G0s2, Crp, and Ppox in module M21, M7, and M12, respectively.
Human GWAS in relation to blood lipid and glucose metabolism have identified hundreds of associated genes (Kathiresan et al., 2007;Kathiresan et al., 2008;Dupuis et al., 2010;Willer et al., 2013a;Hwang et al., 2015;Siewert and Voight, 2018). However, most variants identified so far only explain a small portion of phenotypic variance, leaving the majority of heritability unexplained (Manolio et al., 2009). The inability to uncover the remaining spectrum of variance is related to multiple factors, including sample size, genetic structure, rare variants, and gene-gene interactions (Manolio et al., 2009;Parker and Palmer, 2011). In addition, stringent thresholds of p-value with high multiple testing corrections is also believed to exclude many positive signals (Lee et al., 2011;Lee and Lee, 2018). Joint analysis of human GWAS and mouse genetics would help to "rescue" some of the 'missing' heritability (Parker and Palmer, 2011;Ashbrook et al., 2014;Ashbrook et al., 2015;Wang et al., 2016). In the present study, we identified 48 genes in the trait- (C) Twenty four co-expression modules were identified from the liver RNA-seq dataset. WGCNA cluster dendrogram groups genes (n = 10,520) measured across C1SLs and its recipient strain B6 liver into distinct gene modules (M1-24) defined by dendrogram branch cutting. (D) Module-trait associations. Each row corresponds to a module column to a trait. Each cell contains the corresponding correlation and p-value. The table is color-coded by correlation according to the color legend. The hub genes were indicated aside each module.
C1SLs are aimed to identify genes associated with complex traits on Chr 1 by performing association studies or systems genetics analysis. However, the current study only included 13 C1SLs and one recipient strain B6 and performing association studies in such a small number of strains could result in many false positives due to the low statistical power (Flint and Eskin, 2012). Therefore, we used the systems genetics strategy, gene coexpression network analysis, to prioritize the novel candidate genes-especially those located on Chr 1. A total of 38 Chr 1 genes are found in the eight trait-associated modules with an average of 4.75 genes in each. This number is far less than QTL genes identified by linkage studies in F2 mouse segregation population or association studies in mouse reference populations (Buchner and Nadeau, 2015). In addition, we found at least one gene for each module that has been implicated in regulation of plasma lipid or glucose metabolism (Table 1). Therefore, this approach could allow for identification of functional genes (Chr 1) more efficiently than using previous methods and mouse population.
In summary, we identified eight gene networks associated with blood lipid and glucose levels by performing gene coexpression network analysis in C1SL mice population. Further joint analysis of human GWAS resulted in 48 candidate functional genes. In addition, 38 genes on Chr 1, including 13 well characterized genes, are prioritized as causative genes. However, these genes still need further studies to illustrate their potential functional roles. With the development of other C1SLs and further achiving of sequencing data, Co-expression network analysis on C1SLs can provide us a new avenue for identifying other causative genes for complex traits on Chr 1.  Figure 4C.

DATA AVAILABILITY STATEMENT
All raw reads were submitted to NCBI Sequence Read Archive with the accession number SRP198324.