Pattern analysis of pluripotency reveals the molecular basis of naïve, primed and formative states

The molecular regulatory network underlying stem cell pluripotency has been intensively studied, and we now have a reliable ensemble model for the ‘average’ pluripotent cell. However, evidence of significant cell-to-cell variability suggests that the activity of this network varies within individual stem cells, leading to differential processing of environmental signals and variability in cell fates. Here, we adapt a method originally designed for face recognition to infer regulatory network patterns within individual cells from single-cell expression data. Using this method we identify three distinct pluripotency configurations - corresponding to naïve, primed and ‘formative’ states - and associate these configurations with particular combinations of regulatory network activity archetypes that govern different aspects of the cell’s response to environmental stimuli, cell cycle status and core information processing circuitry. These results show how variability in cell identities arise naturally from alterations in underlying regulatory network dynamics and demonstrate how methods from machine learning may be used to better understand single cell biology, and the collective dynamics of cell communities.


Integrating regulatory interactions with single cell data
We first sought to obtain a reliable training dataset of protein expression patterns in pluripotent cells across multiple intracellular information levels, including the protein abundance of core transcription factors 4;5 , the phosphorylation status of signaling pathways [6][7][8] and global transcriptional activity based on histone acetylation 9;10 . Such systems-level proteomic information at single-cell resolution is currently only available through immunolabeling followed by mass-cytometry, a highly specialized technique that is available to only a small number of groups 36 . Thus, we sourced a relevant training dataset from the literature 37  To interrogate this data, we sought to supplement it by constructing a directed regulatory network specific to the features (transcription factors, surface epitopes, phosphorylation, etc.) that had been quantified (Fig. 2).
Features (that is, proteins profiled) in this signed, directed regulatory network are represented as nodes and regulatory interactions between features are represented as edges between pairs of nodes (an edge is positive if it is activating, and negative if the edge is inhibiting). Evidence for node interactions was extracted from transcription factor binding data from ChIPBase 2.0 41 , and information on other known interactions were sourced from the Kyoto Encyclopedia of Genes and Genomes (KEGG) 42 and Reactome 43 (see Table S1 for details). Unconnected nodes, such as the inert GFP reporter, and cell cycle markers pH3 and IdU were removed from the analysis. The resulting network G contains 27 nodes, connected by 124 edges (Fig. 2a).
The overall structure of G is conveniently encoded in the network adjacency matrix, where s = +1 for activating interactions, and s = −1 for inhibitory interactions.
The first step in our process consists of combining this regulatory network with the single cell expression training set. Trivially, the expression data represents the activity of the nodes in the network within each cell, but does not take into account regulatory interactions between nodes. To incorporate this information, we assumed that the activity of each edge within the network is determined by the signal intensities of both interaction partners within the individual cell. Accordingly, denoting the vector of expression values in a given  Table S1). The network accounts for multiple molecular information processing mechanisms, at multiple different spatial locations in the cell, including interactions between: transcriptional regulators (green squares), chromatin modifiers (petrol octagons), cell cycle factors (sea green rounded squares), signalling cascades (light green circles), and surface molecules (yellow diamonds).
cell by v, we created a weighted adjacency matrix W conditions as a training dataset and held back the remaining data to test the model.

Regulatory networks characterize alternate states of pluripotency
Once the training data had been produced, we conducted principal component analysis. In the same way that the principal components (PCs) in the eigenface routine may be reshaped and interpreted as facial archetypes from which individual portraits may be reconstructed, the principal components here may be reshaped and interpreted as network archetypes from which pluripotent cell identities may be reconstructed. However, while only ∼ 5% of the PCs are required for accurate face recognition, we found that (for both NG and NN mESCs) ∼ 23% of the PCs were required to explain 95% of the variance in our training data (Fig. 3a). The larger number of PCs required is not unexpected, and is reflective of the high levels of noise that are characteristic of high-throughput single cell data 44 . Therefore, rather than using the proportion of variance explained to determine the appropriate number of PCs to retain for subsequent analysis, we sought to identify the minimal number needed to preserve the natural clustering structure in the data.
We found that four distinct clusters of cells were readily identifiable in the full dataset (natural clustering structure was obtained by fitting to a Gaussian mixture model to the data and selecting the most appropriate model, here with four components, by minimizing the Bayesian information criterion [BIC], see Fig. 3d and Fig. S1d). This natural clustering was robustly retained when projecting the data onto the first three PCs (Fig. 3b); higher components only added noise to this basic clustering structure. This analysis suggests that PCs 1-3 account for the biological variability present in the data, while higher components primarily correspond to technical variability.
Since the PCs are linear combinations of the underlying features (here, network edges) each one may be thought of as regulatory network archetype, and the expression pattern of each cell in the training data may therefore be reconstructed as a weighted sum of these archetypes. By analogy with eigenface routine, we will call these network archetypes eigen-networks. Since PCs 1-3 account for the biological variability in the data, the structure of the eigen-networks associated with these components are of particular interest. The first eigennetwork (PC1 in Fig. 4a) naturally separated cells into two subsets (compare Fig. 3b), based upon overall activity of regulatory interactions (Fig. 3c). A subset of cells with low overall edge expression (cluster 1 in Fig. 3b) primarily contained apoptotic cleaved Casp3-positive cells (Fig. 3c) and cell cycle arrested cells (Fig. S1g) that lacked expression of the hallmarks of pluripotency, such as Oct4, Nanog and Klf4 (Fig. 3c). In contrast to this small subset, the majority of cells displayed high overall expression of pluripotency related-factors, including Oct4 (Fig. 3c).
The majority pluripotent population identified by the first eigen-network naturally separated into 2 distinct further sub-populations (clusters 2, 3 in tent state (cluster 2), which are characterized by low Nanog and Klf4 expression, and more sporadic connectivity between signaling and transcriptional controls (Fig. 3c).
In addition to these primary populations we also observed small subset of cells (∼ 2%) that could be  S1f-g). Rather, since Klf4 has been implicated in driving the transition from an epiblast stem cell phenotype to an embryonic stem cell phenotype 27 , we conjecture that this population is an intermediate between the naïve and primed pluripotent states. In accordance with this notion, we observe that it has the highest total within cluster variance, indicating the presence of substantial cell-cell variation (see Fig. S1e), which is typically found in cells transitioning from one state to another 45 . While the full nature of this state has yet to be determined, it is consistent with the recently proposed 'formative' phase of pluripotency, characterized by dissolution of core pluripotency sustaining mechanisms and increased interaction with the extracellular matrix 30 .
To investigate this possibility further we constructed representative networks for each of the four identified states using the first three eigen-networks and the weight vector corresponding to the centroid for each cluster (see Fig. 4b). The resulting networks may be thought of as representations of the characteristic patterns of network activity within each of the four states we identified. These networks show that: (1) the naïve state is characterized by strong co-regulatory activity between members of the core transcriptional circuit and strong integration of signalling pathways with this core sub-network (Fig. 4b). (2) By contrast, the formative state is characterized by partial dissolution of the core transcriptional circuit (in particular a loss of Nanog, Sox2 and p53 activity), which is accompanied by changes in cell-cell (CD54) and cell-matrix (CD73, CD44) mediated signaling. However, cells in this state continue to perceive environmental signals via the LIF/Stat3 signaling pathway (Fig. 4b), indicating continued receptivity to environmental cues. (3) The subsequent primed state is marked by a further dissolution of the core transcriptional circuit, including the loss of Klf4 regulatory activity (Fig. 4b) and a decrease in LIF/Stat3 signaling (Fig. 4b), suggesting that these cells are transitioning out of pluripotency. Accordingly, the primed state is also marked by the positive regulation of EpCAM (Fig. 4b), suggesting the onset of cell polarization, as is observed in the epiblast of the egg cylinder in vivo 46 .
In summary, this analysis revealed the presence of four distinct cellular communities, each characterized by different levels of activity of regulatory network archetypes, within mouse ES cell populations cultured in 0i conditions. To determine how general these results were we also examined network expression patterns mESCs cultured in 2i conditions, which stimulate Wnt signaling activity and reduce Erk-phosphorylation using small molecule inhibitors of MEK, and thereby shield the core transcriptional circuitry from extrinsic differentiation cues 39 . In accordance with the nature of these conditions we found that populations 1, 2 and 4 (corresponding to differentiated, primed and formative pluripotency states) were comprehensively depleted in mESCs cultured in 2i conditions, while cluster 3 (corresponding to the naïve state) was robustly maintained (Fig. 3e). These results re-affirm the potency of these conditions to purify the ground state of pluripotency, and provide mechanistic insight into the molecular mode of action of these conditions. Taken together these results indicate the presence of three distinct states of pluripotency each distinguished by characteristic patterns of regulatory network activity within individual cells.

Individual cells transition through distinct network activity states during reprogramming
To further investigate the biological importance of the regulatory network archetypes we had identified we then sought to determine their temporal expression during cellular reprogramming of somatic cells to pluripotency.
During cellular reprogramming, pluripotency regulatory network activity is typically initially established through the ectopic expression of four trans-genes, Oct4, Sox2, Klf4 and c-Myc (OSKM) 40 . Subsequently, the concerted action of these core reprogramming factors leads to profound changes to the cellular phenotype, ultimately re-instating a self-sustaining pluripotent identity in a small proportion of cells. The dynamics of this process are thought to be initially driven by low frequency stochastic events followed by the deterministic To analyze this data we first fit our training data (expression patterns of NG mESCs cultured in 0i conditions) projected onto the first three eigen-networks (as described above) with a Gaussian mixture model (GMM) with four components. This GMM may be thought of as an estimate of the joint probability density function P(x) for the training data, projected onto the first three PCs (where x ∈ R 3 identifies points in PC space). We We first observed that the surprisal remained high, and approximately constant, for the first 10-12 days of reprogramming (Fig. 5a), indicating that cells in the starting population (in this case NG MEFs) consistently exhibited expression patterns that are unusual for pluripotent cells, as expected. However, around days 10-12 the population split into two distinct sub-populations: a majority sub-population in which the surprisal remained high, and a minority sub-population in which the surprisal was substantially reduced, suggesting the emergence of population of pioneer partially reprogrammed cells (Fig. 5a). Over the next ∼ 20 days the proportion of cells in the low surprisal sub-population gradually increased, indicating the consolidation and proliferation of a robustly pluripotent population of cells (Fig. 5a).
To better understand the identity of this emerging pluripotent sub-population we sought to relate it to the three alternate pluripotency states we had identified (see Fig. 5). To do so we used our fitted GMM to classify each cell in the time-course into one of the four populations identified in the training data (Fig. 5b). Since numerous cells, particularly at the beginning of the time-course, did not resolve well onto any of the clusters in the training data (which is to be expected, since they are not pluripotent) we also incorporated a fifth class to capture those cells with network activity states that were distinct from those found in the training data (for details see Methods).
This analysis revealed that specific instances of regulatory network activity define distinct phases of the reprogramming process (Fig. 5b).
Initially These data suggest that reprogrammed cells do not emerge in significant numbers until after after dox is withdrawn, at which point the pluripotency regulatory network begins to assume a more natural configuration.
These observations are in accordance with the notion that activation of the OSKM transgenes prevent cells from entering a stabilization phase of reprogramming in which the pluripotent state becomes fully established 48 .
Notably, at around the same time there is an apparent reduction in the frequency of cluster 4 cells, which are marked by low Sox2 and p53 activity, indicating that these cells only exist transiently during reprogramming.
Since this population is more variable than the naïve and primed pluripotent populations, it may also mark the handover from the early stochastic phase of reprogramming, in which the activation of OSKM transgenes initiate transformation, to the late deterministic phase, in which the pluripotent cell identities are consolidated by endogenous regulatory mechanisms 47 .
Taken together these results indicate that reprogrammed MEF cells enter pluripotency via the formative state. It remains to be seen if this is a general characteristic of reprogramming, or if this particular route is due to the fact that the MEF starting population has a mesenchymal origin that happens to be more similar to the formative state than it is to the other pluripotent identities.

Discussion
The notion that there is a single well-defined pluripotent stem cell identity has been rapidly eroded by advances in single cell analysis methods, which are now revealing ever greater varieties of pluripotency 18-20;39 . Collectively, these results suggest that pluripotency is not a single phenotype but instead is a property that spans a continuum of observable cell states 1;30;39;52-54 . This is in part because the densely connected pluripotency regulatory network is rich in feedback loops which both stabilize pluripotency, and endow pluripotent cells with a remarkable phenotypic plasticity 5;55 . Hence, to fully understand pluripotency, strategies to decipher regulatory networks at single cell resolution are needed.
There have been a number of notable advances to this end, particularly with regard to methods for inferring and analyzing regulatory networks directly from single cell data, which can reveal aspects of regulatory control that are inaccessible to study with ensemble techniques such as ChIP-Seq 14;47;54;56 . For example, Trott and coworkers have inferred regulatory network activity from correlation patterns in single cell data in different stem cell sub-populations, and related these different activity patterns to different aspects of the stem cell identity 14 .
Similarly, Stumpf (not the current author) and colleagues have used powerful notions from information theory to more precisely identify regulatory interactions from single cell time-course data 56 . However, single cell data is inherently noisy, and consequently large numbers of cells are needed to gain the statistical power to accurately distinguish functional from spurious interactions 56 .
To circumvent this problem here we have presented a method that incorporates prior knowledge of regulatory interactions directly into single cell expression patterns, rather than inferring regulatory interactions from the data itself, and uses this prior knowledge to dissect the regulatory processes that give rise to different states of pluripotency. This approach is similar to that taken by Teschendorff and colleagues, who, by projecting single cell data onto a known regulatory network, find that pluripotency can be remarkably well related to systemslevel emergent network properties 17 . We anticipate that as single cell profiling methods develop we will see concurrent advances in the statistical methods needed to investigate and interrogate the resulting data: indeed, new statistical advances will be essential to fully realize the power of these new and emerging technologies. We expect that Bayesian methods, which use known regulatory interactions as a prior to guide learning of functional interactions directly from single cell data, will combine the benefits of the two approaches to this problem and may therefore be particularly powerful.
In summary, we have adapted a simple image analysis method to infer the presence of four distinct patterns of pluripotency, based on the activity patterns of three regulatory network archetypes within individual cells. The power of our method is not due to its mathematical or computational sophistication -indeed, it is mathematically and computationally straightforward -but rather in the biological interpretation it allows. As such it provides a simple example of how methods from machine learning may be easily adapted to address biological questions in an intuitive way. In particular, using this method we have identified a novel pluripotent state, which appears to be an intermediate between the well-known naïve and primed states (see Fig. 6) and shares many of the putative properties of a recently proposed 'formative' state 30 . Cells in this state are still dependent on LIF/Stat3 signaling yet are characterized by partial dissolution of the core transcriptional regulatory circuit and distinct changes in cell-cell and cell-matrix interactions. Furthermore, these cells only appear at low frequency in 0i culture conditions and transiently during the early stages of cellular reprogramming of MEFs to pluripotency. Taken together these results suggest that this 'formative' state is a temporary intermediate in which the feedback mechanisms that stabilize the core pluripotency circuit become weakened and cells begin to become competent for lineage allocation. It remains to be seen how the population we have identified relates to recent observations of formative pluripotency characterized by loss of Rex1 expression and genome wide reorganization 57 . We anticipate that the coming years will see greater advances in single cell profiling and analysis methods that will enable us to address this question, and identify with greater precision the regulatory networks that control the maintenance and exit from pluripotency. In summary, these data contain measurements of 46 features taken at the single-cell level by mass cytometry, from two separate engineered mouse embryonic stem cell (mESC) lines NG (Nanog-GFP) and NN (Nanog-Neomycin). Each mESC line contains doxycycline (dox) inducible gene cassettes for Oct4, Sox2, Klf4 and c-Myc used for secondary reprogramming to pluripotency from somatic mouse embryonic fibroblasts (MEFs).

Single-cell expression data
Data includes the expression profiles of mESCs in steady state pluripotent stem cell culture conditions containing

Software and computer code
Analyses were performed in R 58 version 3.3.2. Computer code used in this study is available as a R-markdown file from https://github.com/passt/Eigen-Networks.

Data availability
Data used in this study is available from Cytobank (accession 43324).