In Vivo Clonal Analysis Reveals Random Monoallelic Expression in Lymphocytes That Traces Back to Hematopoietic Stem Cells

Evaluating the epigenetic landscape in the stem cell compartment at the single-cell level is essential to assess the cells' heterogeneity and predict their fate. Here, using a genome-wide transcriptomics approach in vivo, we evaluated the allelic expression imbalance in the progeny of single hematopoietic stem cells (HSCs) as a read-out of epigenetic marking. After 4 months of extensive proliferation and differentiation, we found that X-chromosome inactivation (XCI) is tightly maintained in all single-HSC-derived hematopoietic cells. In contrast, the vast majority of the autosomal genes did not show clonal patterns of random monoallelic expression (RME). However, a persistent allele-specific autosomal transcription in HSCs and their progeny was found in a small number of cases, none of which has been previously reported. These data show that: 1) XCI and RME in the autosomal chromosomes are driven by different mechanisms; 2) the previously reported high frequency of genes under RME in clones expanded in vitro (up to 15%) is not found in clones undergoing multiple differentiation steps in vivo; 3) prior to differentiation, HSCs have stable patterns of autosomal RME. We propose that most RME patterns in autosomal chromosomes are erased and established de novo during cell lineage differentiation.

The list of imprinted genes was retrieved from https://www.geneimprint.com and Tucci et al., 2019. A total of 167 genes were discoverable in the annotation file used in this study (ftp://ftp.ensembl.org/pub/release-68/gtf/Mus_musculus.GRCm38.68.gtf). The "Bias" threshold was set at 15% expression for the silenced allele, and the inclusion criterion was "all samples with a measurement". Additionally, an exclusion criterion for missing data points was applied: for B cells (total = 11 samples), at least 6 non-missing values were required, and genes with fewer than 6 were classified as "maybe"; for T cells (total = 5 samples), one failed sample was enough for the gene to be classified as "maybe". Of the reported imprinted genes, 55 were expressed in our B cell samples and 62 in our T cell samples. To our knowledge, there are no accounts of imprinted genes in B or T cells in the literature. Of the 55 imprinted genes expressed in B cells, only 4 have been confirmed as imprinted in a related lymphoid tissue, the spleen (asterisks [*]; Andergassen et al., 2017), while only 5 of the 62 "imprinted" genes that were also expressed in T cells were imprinted in the thymus (Andergassen et al., 2017). The rows in the table are ordered first by bias (yes > maybe > no), then by imprinting status. Red highlights genes with nonrandom bias in both B and T cells. Blue highlights the genes described in the literature as imprinted and expressed biallelically in both B and T cells in our study. Purple highlights genes with nonrandom bias in both B and T cells and described in the literature as imprinted. Some genes described in the literature as imprinted do not reach the nonrandom AI bias threshold in our samples (<15% expression for the silenced allele) but are nonetheless biased (e.g., Impact; see also Supplementary Figure 6). The only confirmed lymphoid imprinted gene (in spleen or thymus) we studied that is not biased in our B or T cell samples is Commd1.

Supplementary Table 4. List of all statistically significant AI differences between samples for the 14 genes identified as putatively RME in B cells (related to Figure 4A). The P-value is the one associated with each comparison after applying the QCC correction to the binomial test (Mendelevich et al., 2021). The dAI is the difference between AI values. Clonality indicates whether the sample originated from the expansion of 1, 50 or 200 transplanted cells. Rows highlight comparisons between only monoclonal samples (shaded) or only polyclonal samples (unshaded).

Supplementary Table 5. Probability of finding AI for any given gene in the monoclonal animals due to somatic indels and single nucleotide variants (null hypothesis). Literature-based parameter values that are overestimations (as explained in the comments) are shown in bold. We prioritized studies of murine cells (references in bold) and used studies on human cells as the second-best option (references in italics). Most parameter values based on assumptions (in italics) are also overestimations. This estimation suggests that not all AI patterns of the 14 genes we identify can be explained by somatic mutations. The most notable example is Pkp3, for which we show the probability (which led us to reject the null hypothesis of a genetic mutation explanation for the AI pattern).

Parameter | Value for estimation | Comment | Ref.
SNV per HSC (ns) | 195 | Overestimation. We used the HSC with the highest number of mutations (195). The average number is around 105. The animals used for WGS were 8 months old, and ours were <5 months (mutations accumulate over time). | (Druce, 2021)
Indels per HSC (ni) | 42 | Overestimation. The HSC with the highest number of mutations (42) was used. The average number is around 26. |
Total number of genes expressed by a cell (ng) | 24,000 | Overestimation. The average number of expressed genes in a given cell is 10,700, but all genes were assumed to be expressed. |
Intergenic mutations (Pintergenic) | | These mutations were assumed to have no impact on AI. |
Mutations in cis | | |
Impact of indels in exons (Piie) | 1.000 | Overestimation. All indels are assumed to lead to nonsense-mediated mRNA decay, which is unlikely. |
Impact of indels in introns (Piii) | 0.200 | Overestimation. 20% of indels in introns are assumed to lead to AI, which is unlikely. |
Impact of indels in regulatory regions (Piir) | 0.200 | Overestimation. 20% of indels in regulatory regions are assumed to lead to AI, which is unlikely. |
Frequency of indels in exons (Pfie) | 0.015 | The values were assumed to be identical to the frequencies of SNVs. |
Frequency of indels in introns (Pfii) | 0.261 | |
Frequency of indels in regulatory regions (Pfir) | 0.020 | |
Impact of SNVs in regulatory regions on transcription (Pisr) | 0.200 | 20% of SNVs in regulatory regions are assumed to lead to AI, which is unlikely. |
Frequency of SNVs in exons (Pfse) | 0.015 | Empirical estimations. | (Druce, 2021)
Frequency of SNVs in introns (Pfsi) | 0.261 | |
Frequency of SNVs in regulatory regions (Pfsr) | 0.020 | |
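
To illustrate how the parameters listed for Supplementary Table 5 could be combined into a per-gene probability, the following is a minimal sketch and not the authors' calculation: the legend does not state the formula. It assumes that (i) the expected number of AI-causing mutations per cell is the sum, over region classes, of mutation count × region frequency × impact; (ii) SNVs in exons and introns contribute nothing, because no impact value is listed for them; (iii) mutations are spread uniformly over the ng expressed genes; and (iv) a Poisson approximation gives the chance of at least one hit per gene. All variable and function names are hypothetical.

```python
import math

# Parameter values copied from the table; the dictionary keys are illustrative names.
PARAMS = {
    "n_s": 195,      # SNVs per HSC
    "n_i": 42,       # indels per HSC
    "n_g": 24_000,   # genes assumed to be expressed by a cell
    "i_ie": 1.000,   # impact of indels in exons
    "i_ii": 0.200,   # impact of indels in introns
    "i_ir": 0.200,   # impact of indels in regulatory regions
    "f_ie": 0.015,   # frequency of indels in exons
    "f_ii": 0.261,   # frequency of indels in introns
    "f_ir": 0.020,   # frequency of indels in regulatory regions
    "i_sr": 0.200,   # impact of SNVs in regulatory regions
    "f_sr": 0.020,   # frequency of SNVs in regulatory regions
}


def expected_ai_mutations_per_cell(p):
    """Expected number of AI-causing mutations in one HSC (assumed additive model)."""
    indels = p["n_i"] * (
        p["f_ie"] * p["i_ie"] + p["f_ii"] * p["i_ii"] + p["f_ir"] * p["i_ir"]
    )
    # Assumption: only regulatory SNVs are counted, since the table lists
    # no impact value for SNVs in exons or introns.
    snvs = p["n_s"] * p["f_sr"] * p["i_sr"]
    return indels + snvs


def prob_ai_per_gene(p):
    """Probability that a given gene carries at least one AI-causing mutation."""
    rate_per_gene = expected_ai_mutations_per_cell(p) / p["n_g"]
    return 1.0 - math.exp(-rate_per_gene)  # Poisson: P(at least one event)


if __name__ == "__main__":
    print(f"expected AI-causing mutations per cell: {expected_ai_mutations_per_cell(PARAMS):.4f}")
    print(f"probability per gene: {prob_ai_per_gene(PARAMS):.2e}")
```

Because every input is an upper bound, the result of such a model would itself be an upper bound under the stated assumptions; a different combination rule would give a different number.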

Supplementary Figures
Supplementary Figure 1. Ly5.1 and Ly5.2 pan-leukocytic markers were used to distinguish recipient and donor cells in reconstituted animals, respectively. Ly5.1 and Ly5.2 do not label the CAST progenitor line, and when CAST is crossed with B6 Ly5.1/Ly5.1 or B6 Ly5.2/Ly5.2, to produce the recipient and donor F1 animals, respectively, the recipient and donor cells are distinguishable by these two markers. Blood samples of progenitor and descendants (F1) were lysed for red cells, stained with FITC-conjugated anti-Ly5.2 and PE-conjugated anti-Ly5.1, and analyzed using FACSCanto.
Supplementary Figure 2. Percentages of chimerism identified in the blood of reconstituted animals for 16 experiments at 12 weeks post-injection (orange dots, monoclonal animals; blue dots, polyclonal animals). An animal was considered reconstituted if the chimerism percentage was above 1%. The sequenced samples in this study belong to experiments 6, 13, and 15 (marked with an asterisk (*)).

Supplementary Figure 3. Representative plots of pre-sorted and post-sorted B/T-cell populations of an animal reconstituted with a single HSC. Cells from the spleen and thymus of recipient animals were isolated, stained for B-cell markers with PE anti-Ly5.2, FITC anti-Ly5.1, and PE-Cy7 anti-CD19 and APC anti-IgM (splenocytes), or T-cell markers with PE-Cy7 anti-CD4 and BV605 anti-CD8 (thymocytes), and sorted on a FACSAria. The cells were gated on PI− to exclude dead cells and on Ly5.2+/Ly5.1− to obtain pure donor cells, and then on CD19+/IgM+ to select B cells or on CD4+/CD8+ to select T cells. The purity of sorted cells was assessed by analyzing 150-250 of the sorted cells.
Supplementary Figure 4. Monoclonality screening was used to confirm whether the recipient system was reconstituted with a single HSC. The cDNA Sanger sequencing chromatograms cover a region with two SNPs in the Xist locus that allow us to assign the Xist transcript to the CAST or B6 X chromosome. Due to XCI, when a single cell is used for the reconstitution, a single peak is expected at the position of each SNP; when multiple cells are used for reconstitution, two peaks should be observed at each SNP position. Samples E6.1, E6.2, E13.1, E13.2 and E13.5 are cells expanded after the injection of multiple HSCs (polyclonal samples); samples E6.42, E6.43, E13.24, E13.29 and E15.10 are cells expanded after the injection of a single HSC (monoclonal samples).

AI of X-linked genes for B and T cells. By convention, an Xi allelic imbalance (AI) of 1 means that the gene is 100% expressed from the allele on the inactive X chromosome (Xi); an Xi AI of 0 means that only the active X-linked allele was detected. Dots represent genes with expression higher than 10 TMM-normalized counts, and only genes that were statistically different from the threshold at least once are shown. Yellow dots represent monoclonal samples; a dotted violet stroke surrounding a yellow dot denotes statistical significance for that sample. Red dots represent the median of the AI observed for polyclonal and control samples (which are otherwise excluded from this top panel). Statistical significance was calculated by comparing the AI with the sample-corrected threshold using a binomial test and QCC correction. The threshold was calculated per sample as 0.1 (the value usually found in the literature) plus the median AI of all X-linked genes in the sample. (C), (D) Abundance (TMM-normalized counts) of the same genes and same samples represented in (A), (B). In addition, individual polyclonal and control samples are shown, as well as samples with abundance <10. Violet dots represent the monoclonal samples in which the AI significantly deviates from the sample-corrected threshold. Yellow dots represent the other monoclonal samples, blue dots the polyclonal samples, and black dots the control samples. Genes in violet (x-axis) were identified as escapees using three criteria: 1) only samples with abundance higher than 10 were considered; 2) the median AI in the control samples (polyclonal and control samples) was balanced (0.5±0.2); and 3) the AI of the gene was statistically different from the threshold in at least two samples, irrespective of the tissue.

Supplementary Figure 8. Pairwise comparisons of AI between animals for B and T cells, with values of Pearson's correlation coefficient and the number of genes with a significant differential AI after applying QCC correction on the binomial tests. Abundance values are TMM-normalized counts.
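
The sample-corrected threshold and the three escapee criteria given in the X-linked AI legend above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function names and input layout are hypothetical, the outcome of the QCC-corrected binomial test is taken as a precomputed boolean rather than recomputed, and the abundance filter is assumed to apply to control samples as well.

```python
from statistics import median

MIN_ABUNDANCE = 10  # TMM-normalized counts (criterion 1 in the legend)


def sample_threshold(xi_ai_all_x_genes, base=0.1):
    """Per-sample threshold: 0.1 plus the median Xi AI of all X-linked genes."""
    return base + median(xi_ai_all_x_genes)


def is_escapee(tested, controls):
    """Apply the three escapee criteria to one gene.

    tested:   list of (abundance, significant) pairs, one per sample of any
              tissue; `significant` is the precomputed test outcome against
              that sample's threshold (hypothetical input format).
    controls: list of (abundance, ai) pairs for polyclonal and control samples.
    """
    # Criterion 1: only samples with abundance higher than 10 are considered.
    kept_significance = [sig for abundance, sig in tested if abundance > MIN_ABUNDANCE]
    kept_control_ai = [ai for abundance, ai in controls if abundance > MIN_ABUNDANCE]
    if not kept_control_ai:
        return False
    # Criterion 2: the median AI of the control samples is balanced (0.5 +/- 0.2).
    balanced = abs(median(kept_control_ai) - 0.5) <= 0.2
    # Criterion 3: significant deviation in at least two samples, any tissue.
    return balanced and sum(kept_significance) >= 2


if __name__ == "__main__":
    print(sample_threshold([0.0, 0.02, 0.04]))
    print(is_escapee([(50, True), (30, True), (5, True)], [(40, 0.45), (60, 0.55)]))
```

Adding the per-sample median to the fixed 0.1 cutoff means that a sample with a higher background level of Xi-derived reads needs a proportionally higher AI before a gene is called as deviating.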

Supplementary Figure 10. Pairwise comparisons of AI between Abelson-immortalized B-cell clones, with values of Pearson's correlation coefficient and the number of genes with a significant differential AI after applying QCC correction on the binomial tests. Abundance values are TMM-normalized counts.
Supplementary Figure 11. Location of the 14 genes with persistent clone- and allele-specific autosomal transcriptional states across the distributions of locus size, open reading frame (ORF) size, and expression in long-term hematopoietic stem cells (LT-HSCs), including all protein-coding genes. Gene sizes were obtained with custom scripts from the GTF file of the latest release of the GENCODE mouse genome annotation (http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M27/gencode.vM27.annotation.gtf.gz). ORFs were generated from the downloaded GENCODE transcript