Single-Cell Transcriptomics Reveals Spatial and Temporal Turnover of Keratinocyte Differentiation Regulators

Keratinocyte differentiation requires intricately coordinated spatiotemporal expression changes that specify epidermis structure and function. This article utilizes single-cell RNA-seq data from 22,338 human foreskin keratinocytes to reconstruct the transcriptional regulation of skin development and homeostasis genes, organizing them by differentiation stage and also into transcription factor (TF)–associated modules. We identify groups of TFs characterized by coordinate expression changes during progression from the undifferentiated basal to the differentiated state and show that these TFs also have concordant differential predicted binding enrichment in the super-enhancers previously reported to turn over between the two states. The identified TFs form a core subset of the regulators controlling gene modules essential for basal and differentiated keratinocyte functions, supporting their nomination as master coordinators of keratinocyte differentiation. Experimental depletion of the TFs ZBED2 and ETV4, both predicted to promote the basal state, induces differentiation. Furthermore, our single-cell RNA expression analysis reveals preferential expression of antioxidant genes in the basal state, suggesting keratinocytes actively suppress reactive oxygen species to maintain the undifferentiated state. Overall, our work demonstrates diverse computational methods to advance our understanding of dynamic gene regulation in development.

with outlier expression. Second, we reduced MAGIC's diffusion time parameter t to prevent over-smoothing of imputed expression values used to calculate correlations.
Estimates of Pearson correlation are strongly affected by outliers. In our study, these outliers were removed by filtering out cells lowly expressing genes expressed by the bulk of keratinocytes. Specifically, for each foreskin keratinocyte, we calculated the sum of imputed expression across genes expressed ( ≥ 1 UMI raw data) in at least 1% of all keratinocytes. Stagewise distributions of these summed expression values identified outlier cells in each stage ( Figure S9); by removing cells in the lowest percentiles (see Table S6 for stage-wise thresholds), we mitigated a skew in the distribution of gene correlations ( Figure S10, top row).
MAGIC's diffusion-based imputation algorithm mitigates dropout effects by replacing raw expression values with a weighted average of expression values of cells with similar low dimensional representation. The extent of local averaging increases with diffusion time t, and large t can over-smooth expression values, thereby averaging out true biological variation and thus strengthening spurious correlations. We observed this effect in the broadening of distributions of Pearson correlations calculated using t=10, compared to the same distribution calculated using t=4 ( Figure S10, middle row). The diffusion time parameter t=10 was previously selected for this dataset based on recovery of simulated dropout events (Cheng et al., 2018), a useful metric for assuring that key expression values are not lost at the single cell level. Recognizing that the optimal value of the t parameter may depend on the type of downstream analysis and wishing to reduce spurious correlations, we used the expression values obtained from our imputation pipeline with t=4 and filtered for outliers using the above summed expression criteria (Figure S10, bottom row and Table S6) as imputed expression values in all analyses downstream of keratinocyte stage identification.
Clustering transcription factor expression trajectories and super-enhancer differential motif enrichment. We performed hierarchical clustering of stage-wise mean expression values to identify dynamic TFs showing similar differentiation trajectories. Keratinocyte TFs (Methods: Identification of keratinocyte-specific genes and transcription factors) were filtered to include only those whose maximum value of mean imputed expression across stages 1-7 was at least 1.75-fold higher than the minimum across the same set; to discard lowly expressed TFs, the minimum was set to 5 counts per million (cpm) when it was less than this threshold. The stage-wise mean expression values of these dynamic TFs were converted to log cpm with pseudocount 1 and then clustered using Pearson correlation distance and average linkage.
To relate regulatory activity measured by TF expression to regulatory activity measured by abundance of functional TF binding sites, we performed differential motif enrichment analysis in super-enhancers (SEs) characterizing BK vs. DK states. We obtained hg19 coordinates of BK and DK SEs from the authors of Klein et al. (2017) (referred to as NHEK-P SE and NHEK-D SE in that publication) and used Bedtools (Quinlan and Hall, 2010) to define BK-specific SEs not overlapping any DK SEs and DK-specific SEs not overlapping BK SEs. Next, we collected position-specific scoring matrices associated with our Keratinocyte TFs from the JASPAR (Mathelier et al., 2016), TRANSFAC (Matys et al., 2006), and Hocomoco (Kulakovskiy et al., 2018) databases, as well as those published in Jolma et al. (2013). FIMO (version 5.0.1) (Grant et al., 2011) was used to scan BK-and DK-specific SEs with each motif using default parameters plus the max-strand option and a 0 th order Markov background model given by the background frequencies of single nucleotides in the union of BK-and DK-specific SEs. This produced a table of motif hit counts for each TF motif in each BK-or DK-specific SE. Motifs were tested for differential enrichment of hit counts per unit length between BK-specific and DK-specific SEs using the Mann-Whitney U test followed by Benjamini-Hochberg multiple hypothesis correction. We accepted motifs with adjusted p-values less than 10 ;< as differentially enriched and used the asymptotic normality of the U statistic under the null hypothesis to measure the magnitude and direction of enrichment as the z-score of the U statistic: where and are the mean and standard deviation of U under the null (Mann and Whitney, 1947). Specifically, letting 3 and D denote the number of BK-and DK-specific SEs, When motifs from multiple databases yielded differential enrichment for the same TF or TF dimer, we selected the strongest motif, by calling the length of the shortest candidate motif ℓ and then ranking the motifs by the sum of Kullback-Leibler divergence from the 0 th -order background across ℓ most divergent bases. Finally, some TFs, such as JUN, FOS and FOSL1, were associated with several different motifs either as monomers or components of heterodimers; in the case of differential enrichment for these functionally distinct motifs, we assigned to the TF the mean of the U statistic z-scores for these enriched motifs.

Prioritization of knockdown targets
We prioritized Candidate Keratinocyte TFs according to log-fold change, during differentiation, of putative targets selected from the set of Keratinocyte Genes (Methods: Identification of keratinocyte specific genes and transcription factors). To identify regulatory targets, we first partitioned Candidate Keratinocyte TFs, denoted here by the set , according to their pattern of differential expression between the BK and DK states (Methods: Differential expression). The set (IJ) contained TFs differentially upregulated in the BK state; the set (KJ) contained TFs differentially upregulated in the DK state; and, the set (LKM) contained TFs not differentially expressed between the two states. Next, we considered as potential targets the set G of Keratinocyte Genes differentially expressed between the BK and DK state. Activating and inhibiting relationships between elements of T and G were assigned based on the strength of correlation calculated across cells specific to each partition of . More precisely, for TFs in the partitions (IJ) , (KJ) and (LKM) , we computed correlations across cells in stages 1-4, 4-7 and 1-7, respectively, leading to the Pearson correlation coefficients Q,R where percentile( , ) denotes the fg percentile of the set A. Finally, each TF in each partition (^) was assigned a differentiation-promoting score, score( ), by summing log2 expression fold-changes between DK and BK states for all elements of G passing the thresholds ] (^) and ; (^) : In this equation, I denotes the indicator function, ( ) is the log2 fold-change of expression for gene between the DK and BK states (Methods: Differential Expression) and signj Q,R (^) k accounts for the activating or inhibiting effect of TF on gene . Figure S4 shows the resulting differentiation-promoting scores along with log2-fold expression changes between imputed single-cell data averaged over the DK and BK states and between keratinocytes cultured in high (1.2 mM Ca) and low (0.07 mM Ca) calcium conditions. RNAi knockdown experiments tested the basal promoting function of four TFs with top five negative differentiation-promoting scores, after removing HOXA1 which was lowly expressed (less than 5 FPKM) in the keratinocytes cultured in in-vitro basal/proliferative conditions.

Regulatory network construction
Regulatory networks were constructed for the BK and DK states as follows. For the BK state, we considered Keratinocyte TFs with motifs enriched in BK SEs compared to DK SEs as putative BK regulators. Similarly, we took Keratinocyte Genes not downregulated in the BK state compared to the DK state as putative BK targets. Signed similarity scores Q,R between genes and were calculated using the soft thresholding method of Zhang and Horvath (2005): and average linkage. TF modules were called using the "inconsistent" criteria in SciPy's fcluster function with parameters depth=2 and threshold=0.75 (Jones et al., 2001-). Putative BK target genes were also organized by hierarchical agglomerative clustering. Each target gene was represented by a vector of similarity scores between the gene and all putative BK regulators. These vectors were clustered using Euclidean distance and average linkage. Like TF modules, gene modules were called using the "inconsistent" criteria of the fcluster function with parameters depth=4 and threshold=2.15 ( Figure S5A). We identified regulatory relationships between pairs of identified Gene and TF Modules by applying thresholding to the distribution of magnitude of mean similarity scores between all pairs: xy mean Q∈z ,R∈I Q,R y ∶ ∈ TF Modules, B ∈ Gene Modules •, ( Figure S5B). TF-Gene Module pairs with mean signed similarly score magnitude exceeding the threshold of Figure S5C were identified as having activating or inhibiting regulatory relationships ( Figure S5D) and were the focus of further investigation. The DK state network was constructed in the same manner as the BK state subject to the following changes: putative DK regulators were selected for motif enrichment in DK SEs compared to BK SEs; putative DK targets were Keratinocyte Genes not downregulated in the DK state compared to the BK state; calculation of Pearson correlations used single cells in stages 4-7; identification of TF modules used the fcluster function with parameters depth=2 and threshold=0.75; and identification of target gene modules used the fcluster function with parameters depth=16 and threshold=3.2 ( Figure S8A-D).

Antioxidant analysis
Genes annotated for antioxidant function were downloaded from the AmiGO2 database (version 2.5.12) (Carbon et al., 2009) and filtered to include only those genes expressed in more than 1% of all keratinocytes in scRNAseq data (Table S2). Genes with dynamic expression in foreskin keratinocytes (log2 fold-change between minimum and maximum stage-wise mean expression for stages 1-7 greater than 1, with the minimum set to 5 imputed cpm when it was less than this threshold) were selected for hierarchical agglomerative clustering. We clustered genes represented as vectors of log2 stage-wise mean imputed cpm with pseudocount 1 using Pearson correlation distance and average linkage.
To test the significance of size enrichment of the cluster showing peak expression in stages 1-3, we generated a null distribution of maximum cluster sizes using a permutation approach. For each of 10,000 iterations, we independently permuted the elements of each log2 stage-wise mean expression vector and repeated the hierarchical clustering procedure identifying four clusters. The p-value was calculated from the percentile of the observed cluster size in the distribution of simulated maximum cluster sizes.