Zebrafish Chromosome 14 Gene Differential Expression in the fmr1hu2787 Model of Fragile X Syndrome

Zebrafish represent a valuable model for investigating the molecular and cellular basis of Fragile X syndrome (FXS). Reduced expression of the zebrafish FMR1 orthologous gene, fmr1, causes developmental and behavioural phenotypes related to FXS. Zebrafish homozygous for the hu2787 non-sense mutation allele of fmr1 are widely used to model FXS, although FXS-relevant phenotypes seen from morpholino antisense oligonucleotide (morpholino) suppression of fmr1 transcript translation were not observed when hu2787 was first described. The subsequent discovery of transcriptional adaptation (a form of genetic compensation), whereby mutations causing non-sense-mediated decay of transcripts can drive compensatory upregulation of homologous transcripts independent of protein feedback loops, suggested an explanation for the differences reported. We examined the whole-embryo transcriptome effects of homozygosity for fmr1hu2787 at 2 days post fertilisation. We observed statistically significant changes in expression of a number of gene transcripts, but none from genes showing sequence homology to fmr1. Enrichment testing of differentially expressed genes implied effects on lysosome function and glycosphingolipid biosynthesis. The majority of the differentially expressed genes are located, like fmr1, on Chromosome 14. Quantitative PCR tests did not support that this was artefactual due to changes in relative chromosome abundance. Enrichment testing of the “leading edge” differentially expressed genes from Chromosome 14 revealed that their co-location on this chromosome may be associated with roles in brain development and function. The differential expression of functionally related genes due to mutation of fmr1, and located on the same chromosome as fmr1, is consistent with R.A. Fisher’s assertion that the selective advantage of co-segregation of particular combinations of alleles of genes will favour, during evolution, chromosomal rearrangements that place them in linkage disequilibrium on the same chromosome. However, we cannot exclude that the apparent differential expression of genes on Chromosome 14 genes was, (if only in part), caused by differences between the expression of alleles of genes unrelated to the effects of the fmr1hu2787 mutation and made manifest due to the limited, but non-zero, allelic diversity between the genotypes compared.

After trimming alignment was performed using STAR v2.5.3a to the Danio rerio genome included in Ensembl Release 94 (GRCz11). Aligned reads were counted using featureCounts (Liao et al., 2014) for each gene if alignments were unique and overlapped strictly within exonic regions. Undetectable genes (genes which contained less than one counted alignment in at least 4 of the 8 samples) were excluded from the analysis.

Initial differential gene expression analysis:
We first performed an initial differential gene expression analysis using the generalised linear model functionality of edgeR. EdgeR uses a negative binomial variance function and estimates dispersions using the Cox-Reid profile-adjusted likelihood (CR) method (Robinson et al., 2010). We specified a design matrix with the wild type genotype as the intercept (or baseline) and the effect of homozygosity for the hu2787 allele of fmr1 as a coefficient (Table S1). After dispersions were estimated and the negative binomial model was fitted, likelihood ratio tests were performed to determine which genes were significantly differentially expressed (FDR-adjusted p-value < 0.05) in the fmr1 hu2787/hu2787 samples. We identified 14 differentially expressed genes in this differential expression test and these genes mostly had low/average expression levels ( Figure S2). This bias may impact gene set enrichment analysis and should be corrected for, as gene sets with low-medium expressed genes will appear as enriched for being upregulated.

Figure S2
: Mean-difference (MD) plot displaying the average expression level (logCPM) against the log 2 fold change (logFC) of each gene. Differentially expressed genes are coloured in red. The blue fitted line from a generalised additive model (gam) identified a small bias for lowly expressed genes to be upregulated.
In response to the observed bias, cqn (Hansen et al., 2012) was used to rectify this issue as it may be derived from either a length or GC artefact. GC and length information for each gene was obtained from the Ensembl database (GRCz11, release 98). For input to cqn, GC content and length were taken as weighted averages and simple averages respectively. We then performed an additional differential gene expression analysis, including the offset term generated by cqn when fitting the negative binomial model. The likelihood ratio tests from this model identified 21 differentially expressed genes. Comparison of the MD plots before and after cqn show that the observed bias was mostly removed, suggesting the gene GC content and length were contributing factors. Figure S4: Comparison of mean difference plots before and after conditional quantile normalisation (cqn). The bias evident in a) the pre-cqn plots is no longer present in b) the post-cqn plot, suggesting GC and length bias were contributing factors. Blue fit lines are derived from generalised additive models.
We next tested for over-representation of genes based on which chromosome they are located on using goseq (Young et al., 2010). Goseq tests whether there is over-representation of pre-defined gene sets amongst the set of DE genes. It does not take into account the magnitude or direction of the fold change. However, it can take into account a bias of a gene being classified DE due to its GC content. We found that genes on chromosome 14 were highly over-represented in the DE genes (Table S2). We next tested for over-representation of the KEGG and HALLMARK gene sets within the DE gene list. We downloaded the KEGG and HALLMARK gene sets from the Molecular Signatures database (MSigDB) as a .gmt file with human gene Entrez identifiers. The human gene Entrez identifiers were converted to zebrafish Ensembl identifiers using a mapping file downloaded from the Ensembl Biomart web server (https://m.ensembl.org/biomart). Some genes in the KEGG gene sets did not contain a zebrafish orthologue. Therefore, the gene sets occasionally contained only a small number of genes and this would not be particularly informative. For this reason, only KEGG gene sets which contained > 10 zebrafish genes were retained for analysis.
We identified that the KEGG gene sets for lysosome and glycosphingolipid biosynthesis globo series were significantly over-represented in the set of DE genes. The HALLMARK gene set for early estrogen response approached significance for over-representation. The results from enrichment testing for HALLMARK gene sets within the set of DE genes is found in Table S3 and the results for the KEGG gene sets are found in Table S4.  Next, we performed gene set enrichment analysis (GSEA) on all detectable genes in the RNA -seq experiment to obtain a more complete view on the changes to gene expression due to fmr1 genotype. We performed GSEA using the fry (Wu et al., 2010;Ritchie et al., 2015) algorithm from the limma package. Fry approximates the ROAST method, which uses residual space rotation rather than permutations to determine the significance of a gene set showing changes to gene expression (Wu et al., 2010). Using this method, we did not find any significantly altered gene sets (KEGG, HALLMARK or chromosome position) after FDR adjustment of the mixed p-value. The top 10 results are shown in the Table S5.

Analysis of genes on chromosome 14:
The set of genes located on chromosome 14 was found to be over-represented in the DE list by goseq, and was the gene set identified by fry to contain the most DE genes (Table S5), although it did not attain the threshold for significance (p<0.05). We hypothesised that the most differentially expressed genes from chromosome 14 might be enriched in a particular biological pathway which could explain the over-representation detected by goseq (see Table S2 above).
To investigate this, we performed the fast implementation of the GSEA (Subramanian et al., 2005) algorithm, fgsea (Korotkevich et al., 2021), on the gene sets for chromosome position to obtain a list of the "leading edge" genes for chromosome 14. The first step for fgsea is to generate a ranked list of all genes in the experiment. We ranked genes based on the statistical significance of their differential expression (without direction). A visual representation of the ranked list is shown in Figure S5.

Figure S5: Visual representation of the ranked list.
All detectable genes were ranked on the statistical significance of their differential expression, resulting in the most differentially expressed genes at the start of the ranked list (i.e. smallest pvalue), and the least differentially expressed genes at the end of the ranked list (largest p -value).
We used this ranked list as input for fgsea, which tests whether there is an enrichment of predefined gene sets at either end of the ranked list. We considered a gene set to be enriched if the Bonferroni adjusted p-value was < 0.05. Table S6 gives the significantly enriched chromosome position gene sets identified by fgsea. The leading edge genes for each gene set from the fgsea algorithm are the most highly ranked genes of a gene set in the ranked list and contribute most to its enrichment. Figure S6 shows the enrichment plots of genes found on chromosomes 14 and 22 relative to the ranked list. The leading edge genes are the genes which are positioned in the ranked list before the peak of the enrichment score. The x-axis shows the ranked list of genes, with black bars indicating a gene from A) chromosome 14 and B) chromosome 22, and missing bars indicating a gene not from either chromosome. The y -axis gives the enrichment score, with the green line indicating the running enrichment score along the ranked list. The genes located before the peak are the leading edge subset.
To determine whether the leading edge subset of genes from chromosome 14 are enriched in any biological pathways, we performed over-representation analysis on the chromosome 14 leading edge against all detectable genes in the RNA-seq experiment. We found that the KEGG gene sets for RNA polymerase and neuroactive ligand receptor interactions were found to be significantly enriched after correction for multiple testing by FDR, but not by the more stringent Bonferroni method ( Table  S7).  The Kyoto Encyclopedia of Genes and Genomes, KEGG, pathways for A, neuroactive ligand receptor interactions and B, RNA polymerase, with the intensity of the colour representing the log 2 FC of each gene. Genes which are present in the leading edge of chromosome 14 are indicated with the orange boxes. Plots are adapted from Pathview (Luo and Brouwer, 2013) and displayed with permission of KEGG, (Kanehisa and Goto, 2000).
As an alternative viewpoint, we performed over-representation analysis on the chromosome 14 leading edge subset against all genes on chromosome 14. No significant over-representation was found (Table S8). However, all three genes from the ALS and long term potentiation gene sets which are found on chromosome 14 are present in the leading edge subset, suggesting a possible coregulatory mechanism.

Testing the influence of w2 (nacre) mutation contamination
Some of the parents used to generate the clutches of larvae in this experiment were heterozygous for the w2 allele of melanocyte inducing transcription factor a (mitfa), a C to T point mutation which results in a premature stop codon in exon 3 of mifta (Lister et al., 1999). (Apparently this allele was present in the original stock of fmr1 hu2787 fish we imported into Australia.) We first assessed the proportion of reads which aligned to the w2 allele of mifta relative to the wild type allele to estimate the proportion of parents who carried the w2 allele. Sample "D" and "S5" had relatively high proportions of w2-aligned reads. Samples S2, S4 and S8 had relatively low proportions of w2-aligned reads. Samples A, G and L had no reads aligned to the w2 allele. Exploratory analysis between the proportion of reads aligning to the w2 allele and the first four principal components of the dataset (which capture approximately 85% of the total variability) did not identif y any strong correlations.
To determine whether the proportion of w2 reads affects the results of a differential gene expression test, we used the proportion of w2 reads as a blocking variable (as a categorical factor, with no, low and high w2 contamination as the levels) in a limma voom analysis with the duplicateCorrelation function. A comparison between a ranking statistic (sign(logFC) x -log10(p-value)) showed marginal changes to the results of the DE analysis. Therefore, the presence of the w2 allele in mifta likely has minimal effects on DE analysis and can be ignored ( Figure S8). Figure S8: Presence of the w2 allele of mifta does not greatly affect the transcriptomes of pooled fmr1 -/larvae.
A) Alignment of reads to the w2 site of melanocyte inducing transcription factor a (mitfa, on chromosome 6, position 43429185 according to the GRCz11 build of the zebrafish genome). B) Percentage of reads aligning to the w2 site for each sample. C) Pearson correlations between the first four principal components of the conditional-quantile normalised expression values for each gene, and the proportion of reads aligning to the w2 site (prop_nacre), fmr1 genotype, and RNA-seq library size (lib.size). D) Proportion of w2 (nacre) reads shown against the first four principal component values. Minimal correlation is observed. However, the standard errors (grey) for the regression lines in blue are large, suggesting that the effect is not highly significant. E) Scatterplot showing a ranking statistic (sign(logFC) * -log10(p-value)) for each gene with and without using the proportion of w2 reads as a blocking variable in a limma voom analysis. The most highly ranked genes (i.e. the most differentially expressed) do not change when including the proportion of w2 reads in the model.