Mapping Mammalian Cell-type-specific Transcriptional Regulatory Networks Using KD-CAGE and ChIP-seq Data in the TC-YIK Cell Line

Mammals are composed of hundreds of different cell types with specialized functions. Each of these cellular phenotypes are controlled by different combinations of transcription factors. Using a human non islet cell insulinoma cell line (TC-YIK) which expresses insulin and the majority of known pancreatic beta cell specific genes as an example, we describe a general approach to identify key cell-type-specific transcription factors (TFs) and their direct and indirect targets. By ranking all human TFs by their level of enriched expression in TC-YIK relative to a broad collection of samples (FANTOM5), we confirmed known key regulators of pancreatic function and development. Systematic siRNA mediated perturbation of these TFs followed by qRT-PCR revealed their interconnections with NEUROD1 at the top of the regulation hierarchy and its depletion drastically reducing insulin levels. For 15 of the TF knock-downs (KD), we then used Cap Analysis of Gene Expression (CAGE) to identify thousands of their targets genome-wide (KD-CAGE). The data confirm NEUROD1 as a key positive regulator in the transcriptional regulatory network (TRN), and ISL1, and PROX1 as antagonists. As a complimentary approach we used ChIP-seq on four of these factors to identify NEUROD1, LMX1A, PAX6, and RFX6 binding sites in the human genome. Examining the overlap between genes perturbed in the KD-CAGE experiments and genes with a ChIP-seq peak within 50 kb of their promoter, we identified direct transcriptional targets of these TFs. Integration of KD-CAGE and ChIP-seq data shows that both NEUROD1 and LMX1A work as the main transcriptional activators. In the core TRN (i.e., TF-TF only), NEUROD1 directly transcriptionally activates the pancreatic TFs HSF4, INSM1, MLXIPL, MYT1, NKX6-3, ONECUT2, PAX4, PROX1, RFX6, ST18, DACH1, and SHOX2, while LMX1A directly transcriptionally activates DACH1, SHOX2, PAX6, and PDX1. Analysis of these complementary datasets suggests the need for caution in interpreting ChIP-seq datasets. (1) A large fraction of binding sites are at distal enhancer sites and cannot be directly associated to their targets, without chromatin conformation data. (2) Many peaks may be non-functional: even when there is a peak at a promoter, the expression of the gene may not be affected in the matching perturbation experiment.

Mammals are composed of hundreds of different cell types with specialized functions. Each of these cellular phenotypes are controlled by different combinations of transcription factors. Using a human non islet cell insulinoma cell line (TC-YIK) which expresses insulin and the majority of known pancreatic beta cell specific genes as an example, we describe a general approach to identify key cell-type-specific transcription factors (TFs) and their direct and indirect targets. By ranking all human TFs by their level of enriched expression in TC-YIK relative to a broad collection of samples (FANTOM5), we confirmed known key regulators of pancreatic function and development. Systematic siRNA mediated perturbation of these TFs followed by qRT-PCR revealed their interconnections with NEUROD1 at the top of the regulation hierarchy and its depletion drastically reducing insulin levels. For 15 of the TF knockdowns (KD), we then used Cap Analysis of Gene Expression (CAGE) to identify thousands of their targets genome-wide (KD-CAGE). The data confirm NEUROD1 as a key positive regulator in the transcriptional regulatory network (TRN), and ISL1, and PROX1 as antagonists. As a complimentary approach we used ChIP-seq on four of these factors to identify NEUROD1, LMX1A, PAX6, and RFX6 binding sites in the human genome. Examining the overlap between genes perturbed in the KD-CAGE experiments and genes with a ChIP-seq peak within 50 kb of their promoter, we identified direct transcriptional targets of these TFs. Integration of KD-CAGE and ChIPseq data shows that both NEUROD1 and LMX1A work as the main transcriptional activators. In the core TRN (i.e., TF-TF only), NEUROD1 directly transcriptionally activates the pancreatic TFs HSF4, INSM1, MLXIPL, MYT1, NKX6-3, ONECUT2, PAX4, PROX1, RFX6, ST18, DACH1, and SHOX2, while LMX1A directly transcriptionally INTRODUCTION Regulation of gene expression by combinations of transcription factors (TFs) is a fundamental process that determines cellular identity and functions. TFs have the ability to recognize and bind short sequence motifs throughout the genome, and, either alone or in combination with other TFs, modulate mRNA levels in a cell until it acquires the predetermined phenotype (Mitchell and Tjian, 1989;Wray et al., 2003). In humans it has been estimated that there are at least 411 different cell types (Vickaryous and Hall, 2006) and 1500-2000 different transcription factors (Roach et al., 2007;Vaquerizas et al., 2009;Wingender et al., 2015), with ∼430 TFs expressed at appreciable levels in any given primary cell type . Identifying key cell type specific transcription factors and their targets is fundamental to understanding cellular states, and is important for regenerative medicine where efforts are made to direct differentiation of stem cells toward a medically relevant cell type (Cahan et al., 2014).
Over the years, multiple approaches to map the targets of TFs have been developed. Computational approaches that predict TF targets based upon their co-expression with a given TF and/or the presence of a transcription factor binding site motif (TFBS) in their promoter regions have helped to identify direct targets (Wasserman and Sandelin, 2004;Tompa et al., 2005;Valouev et al., 2008;FANTOM Consortium et al., 2009); however, these are purely predictive methods and the validation rate, when experimental validations are carried out, is low. Motif prediction methods are limited as the vast majority of our TFs have no well-defined TFBS, and TFs from the same family bind very similar motifs. Even for those cases where a motif is known, the information content is so low that the majority of binding site predictions will likely be false positives (Wasserman and Sandelin, 2004). Lastly, unless the expression levels of the TFs themselves are taken into consideration, inaccurate predictions can be made where a binding event may be predicted as important despite the fact that the corresponding TF is not even present in the cell.
Alternatively, TF targets can be identified experimentally. Experimental perturbation of TFs (Hilger-Eversheim et al., 2000) followed by expression profiling can identify global sets of genes affected by the given TF. This is a powerful approach, but does not discriminate direct from indirect targets (genes regulated by TFs which are regulated by the perturbed TF). Another experimental approach directly determines physical binding sites in the genome using protocols such as ChIP-CHIP, DamID or ChIP-seq (van Steensel and Henikoff, 2000;Horak et al., 2002;Robertson et al., 2007). The caveat with these methods lies in that they do not distinguish functional from non-functional binding. By combining the perturbation and physical interaction approaches we can overcome the limitations of each.
The remaining issue, however, is the scale of the problem. TFtarget interactions vary between cell types as there are different combinations of transcription factors expressed and different chromatin configurations in each cell type. Thus, ultimately, what we need is a compendium of cell type specific regulatory networks for every cell type that makes up the human body. Given its scale, the problem necessitates prioritization of the cell type to be studied and the sets of TFs considered. We need ways to identify which TFs are most important to a given cell type.
Recently, the FANTOM5 project used single molecule sequencing to generate CAGE (Kanamori-Katayama et al., 2011) across a large collection of human and mouse primary cells, cell lines and tissue samples, providing a nearly comprehensive set of human and mouse, promoter and enhancer regions and their expression profiles (Andersson et al., 2014;Forrest et al., 2014). Importantly, for the prioritization of key TFs, the FANTOM5 CAGE data boasts expression profiles for 94% (1665/1762) of human TFs; this can be used to generate cell-type-specific ranked lists (expression relative to median across almost 1000 samples). What emerged from those lists is that the TFs with the most enriched expression in a given primary cell type often had phenotypes relevant to that cell type [e.g., mutations of osteoblast enriched TFs resulted in bone phenotypes, hematopoietic stem cell enriched TFs in blood phenotypes and inner ear hair cell enriched TFs in deafness ]. These enriched TFs are therefore likely key components of cell-type-specific transcriptional regulatory networks (TRNs). To probe cell type enriched TFs in more detail, we explored an integrated approach for dissecting TRNs using siRNA knock-down, qRT-PCR, CAGE (Shiraki et al., 2003), and ChIP-seq (Robertson et al., 2007).
The large numbers of cells required for our systematic studies made it necessary to find an easily expandable cell line. Reviewing the FANTOM5 expression profiles, we chose an interesting cell line, TC-YIK (Ichimura et al., 1991), derived from an argyrophilic small cell carcinoma (ASCC) of the uterine cervix, which expresses insulin and showed enriched expression for dozens of pancreatic transcription factors. We show that TC-YIK cells express 75% of a set of genes previously reported as islet cell specific and 85% of a set of genes previously reported as beta cell specific. Given the difficulty in obtaining primary human beta cells for research, our results may be of interest to studying pancreatic transcriptional regulation, with the caveat that we are only using TC-YIK as an experimentally tractable cell line model to examine the prediction of key TFs; it is a non-islet-cell insulinoma and therefore the regulatory edges inferred here may not generalize to primary islet cells.
Using newly created genome-wide datasets on TC-YIK enriched TFs, and a comparative set of non-enriched TFs, we sought to determine the importance of each factor in maintaining the TC-YIK cell state. Knock-down followed by CAGE profiling allowed us to identify, genome-wide, the set of genes affected by each TF, while integration with ChIP-seq data on the same factors allowed us to further discriminate direct from indirect TF targets. We present the results of the TC-YIK analysis and show that the combination of CAGE and ChIP-seq on key TFs is a powerful approach for studying mammalian transcriptional networks and necessary for dissection of direct and indirect edges. An overview of the datasets used, our analysis and the main findings are summarized in the workflow shown in Figure 1.
This work is part of the FANTOM5 project. Data download, genomic tools and co-published manuscripts have been summarized at http://fantom.gsc.riken.jp/5/.

The TC-YIK Cell Line Expresses Pancreatic Islet Cell Transcripts
Previously, TC-YIK cells were shown to generate neurosecretory granules and express chromogranin A (CHGA) and gastrin (GAST; Ichimura et al., 1991). A systematic review of endocrine hormones and peptides detected in TC-YIK confirmed CHGA and GAST were expressed at high levels and revealed also expression of insulin (INS), ghrelin (GHRL), and transthyretin (TTR; Table 1). All of these proteins [insulin, gastrin (GAST;Wang et al., 1993;Rooman et al., 2002;Téllez et al., 2011),   Date et al., 2002;Wang et al., 2008;Arnes et al., 2012), transthyretin (TTR; Refai et al., 2005;Su et al., 2012), and chromogranin A (CHGA, a precursor of pancreatic chromostatin; Cetin et al., 1993)] play key roles in the pancreas ( Table 1). In contrast to insulin, which is a biomarker for pancreatic beta cells, somatostatin (SST), glucagon (GCG), and islet amyloid polypeptide (IAPP), the biomarkers for pancreatic delta, alpha, and gamma cells, respectively, were lowly expressed or absent in TC-YIK cells. We next examined the expression of genes described in the beta cell gene atlas (Kutlu et al., 2009) as being specifically expressed in human islets. We find that 75% of the 938 human islet tissue specific genes reported by the authors are detected in TC-YIK [Supplementary Table 1, ≥ 5 tags per million (TPM)]. The authors provide a further subset of 445 genes that are enriched in alpha and/or beta cells and overlap the islet specific list (76 are expressed > 2-fold higher in alpha cells and 153 are expressed > 2-fold higher in beta cells). In TC-YIK, we find that 65% of these alpha cell enriched genes and 85% of the beta cell enriched genes are detected (Supplementary Table 2, ≥ 5 TPM). From this review we conclude that, although TC-YIK does not completely recapitulate the beta cell transcriptome, it shares significant similarity to islet cells. For this reason TC-YIK is sufficiently interesting for the purposes of an investigative study integrating CAGE and ChIP-seq data. Lastly, although there are rare reports of non-islet-cell insulinomas that ectopically express insulin [e.g., kidney (Ramkumar et al., 2014), liver (Furrer et al., 2001), brain (Nakamura et al., 2001)] and additional cases of argyrophilic small cell carcinoma (ASCC) of cervix (Kiang et al., 1973;Seckl et al., 1999), ours is the first report to our knowledge that identifies a non-islet-cell line (TC-YIK) where the majority of the beta cell program is active.

Pancreatic Transcription Factors are Enriched in TC-YIK cells
To identify TC-YIK-enriched-transcription factors, we ranked all 1665 human TFs according to their expression in TC-YIK cells relative to the median expression across the 988 human samples in the FANTOM5 phase 1 collection . The highest ranked TF was NEUROD1, a factor known to be key in the differentiation of beta cells and insulin production (Itkin-Ansari et al., 2005;Guo et al., 2012). Furthermore, of the 42 most TC-YIK enriched TFs (enrichment score > 1.25, ∼18-fold enrichment over median expression levels), 33 were previously implicated in pancreatic biology, including direct regulators of insulin (Sander and German, 1997), key factors for islet cell development (Wang et al., 2005;Guo et al., 2011), genes associated with diabetes (Foti et al., 2005) and with pancreatic endocrine tumors (Johansson et al., 2008; Table 2,  Supplementary Table 3). CAGE profiling of the mouse orthologs throughout pancreatic development (also profiled in FANTOM5) detected 33 of the 42 TFs in at least one stage with most changing expression levels over time (Supplementary Figure 1). This added support for a further seven of the remaining nine TFs enriched in TC-YIK (ASCL2, HLF, HSF4, IRF6, IRF8, MYRF, and NPAS3) as likely important factors in pancreatic development.

Assessing the Interconnection of Key TFs
A key question is whether the cell type enriched TFs identified in FANTOM5 are key regulators of the cellular state and whether these enriched factors are more (or less) important than housekeeping TFs that are more broadly expressed. Logic would suggest that those TFs expressed in an enriched manner are more likely to be regulated by other enriched TFs, and that their targets are also more likely to be enriched. To test our assumption, we first carried out siRNA perturbation of a set of enriched and non-enriched (but expressed) TFs in TC-YIK cells and assessed their effect on expression of enriched and non-enriched targets by qRT-PCR.
Multiple siRNAs were tested for each enriched factor and the one with the best efficiency was kept; siRNAs for 26 TFs reduced expression below 50%, a further 7 suboptimal siRNAs reduced expression to 51-77% of that of the scrambled control, while for the remaining TFs we were unable to find an efficient siRNA (Supplementary Table 4). An additional 8 non-enriched TFs were also perturbed below 50% ( Table 2). After perturbation, RNA was extracted and qRT-PCR was used to measure the knock-down response in a 41 × 52 matrix of expression changes, where 41 columns represent the TFs that were perturbed and 52 rows represent the measured qRT-PCR values of target genes after perturbation (Supplementary Table 5). Experiments were carried out in triplicate and knock-down was assessed relative to a scrambled siRNA sequence. Of the ∼2000 potential (TF-target) edges tested, 551 were up-or down-regulated 1.5-fold or more [threshold as used in our previous studies ].
Looking at the number of affected targets for each TF knockdown (out degree) and the number of knock-downs that affected each TF (in degree; summarized in Supplementary Table 6) we identified NEUROD1 as a key activator at the top of the hierarchy. NEUROD1 knock-down caused down-regulation of 21 of the 52 tested targets (the most influenced being PAX4, followed by GHRL, INS, GAST, CHGA, GCK, RFX6, and PAX6). In an analogous way, ISL1 was the main antagonist in the network, where its knock-down affected 11 targets, all of which were upregulated (among those CHGA, LMX1A, PAX4, and NEUROD1). Other likely key TFs, RFX6, SHOX2, PAX6, MNX1, LMX1A, and PDX1 also strongly affected several targets.   Table 3 with references to pancreatic biology. Experiments used in this paper (Si, siRNA perturbation; CA, cap analysis of gene expression; CS, ChIP-seq). TC-YIK enriched factors that were not tested by siRNA were excluded due to oligo design or knock-down efficiency problems. Of note, knock-down of 28 of the 33 TFs enriched in TC-YIK and 7 of the 8 non-enriched TFs affected insulin expression levels, with the enriched factors NEUROD1, DACH1, RFX6, and the non-enriched TFs HMGB2, GTF3A, and HMGA1 knockdown causing the greatest decreases in insulin transcript levels (Figure 2). Interestingly, knock-down of the non-enriched TF TATA binding protein (TBP) led to the highest increase in insulin transcript, which may indicate a shift in the balance between TATA dependent and TATA independent transcription.

Identifying Genome-wide TF Targets using Knock-down and Cage
The above section focused on a limited and biased set of 52 target transcripts. We next applied CAGE [KD-CAGE; (Vitezic et al., 2010)] to identify genome-wide the sets of promoters that were perturbed after knock-down of 15 of the enriched TFs and all 8 non-enriched TFs using the same RNA samples as used in the qRT-PCR. Notably the fold changes observed by CAGE and qRT-PCR were highly correlated  Figure 2), indicating the suitability of CAGE for this experiment.

(Supplementary
Promoters specifically affected by the TF knock-downs in comparison to scrambled siRNA control samples were then identified using edgeR (Robinson et al., 2010;Supplementary Table 7). Similar numbers of affected promoters were detected for enriched and non-enriched TFs; between 8229 and 19,467 and between 9922 and 18,362 promoters respectively (Supplementary Table 8). For six of the TF knock-downs (HMGA1, NEUROD1, LMX1A, SHOX2, NROB2, GATA4, RFX6), there were at least twice as many down-regulated promoters as up-regulated ones, suggesting that these factors work as activators. Conversely, for knock-down of MNX1 and TBP we observed at least twice as many up-regulated promoters as down-regulated ones, suggesting they work as repressors ( Figure 3A).

Identifying TFs Important for Maintaining Cell State
To understand which TFs are responsible for maintaining the TC-YIK cell state, we next identified a set of 4639 promoters with enriched expression (>3-fold) in TC-YIK compared to median expression in FANTOM5. We refer to this set as TC-YIK-enriched-promoters, and to the remainder as nonenriched-promoters. We then used these sets to separate TFs into synergists or antagonists to the cell fate: if perturbation of a TF causes down-regulation of a significantly larger fraction of TC-YIK-enriched-promoters than non-enriched-promoters, then this would suggest that the factor in question is important for maintaining the TC-YIK state (pro-TC-YIK); similarly, if the perturbation led to up-regulation of a significantly larger fraction of TC-YIK-enriched-promoters than non-enriched-promoters, this would suggest that the factor antagonizes the TC-YIK state (anti-TC-YIK).
Starting from the assumption that TC-YIK state is maintained by regulation of TC-YIK-enriched-promoters, we checked, for each TF knock-down, whether TC-YIK-enriched-promoters were more likely to be affected (either up-or down-regulated) compared to a random event. Knock-down of all factors resulted in significantly more TC-YIK-enriched-promoters being perturbed (in either direction) than expected (hypergeometric probability test, Supplementary Table 8), and testing the upand down-regulated sets separately also showed that for all perturbations significantly more TC-YIK-enriched-promoters were up-regulated and significantly more TC-YIK-enrichedpromoters were down-regulated than expected by chance. This suggests that all tested TFs contribute to some extent to the maintenance of the TC-YIK state (Supplementary Table 8, Figure 3B).
Of particular note, NEUROD1 knock-down led to downregulation of 50% of the TC-YIK-enriched-promoters, and ISL1 knock-down led to up-regulation of the most TC-YIK-enrichedpromoters compared to the other factors, suggesting that they are pro-and anti-TC-YIK factors respectively ( Figure 3B). To examine this in more detail we calculated the ratios of TC-YIK-enriched-promoters to non-enriched-promoters in the up-regulated sets over the down-regulated sets. High ratios correspond to anti-TC-YIK TFs and low ratios correspond to pro-TC-YIK TFs ( Figure 3C). To compare these ratios systematically we used Chi-square with Yates correction to test for significant differences (Supplementary Table 8).
In summary, it appears that both enriched and non-enriched factors contribute to the TC-YIK TRN and that, intriguingly, despite ISL1 and PROX1 both being enriched in TC-YIK, they seem to be antagonists to the system (Figure 3D).
Protein-DNA Edge Mapping by ChIP-seq of NEUROD1, LMX1A, RFX6, and PAX6 As the perturbation edges identified above could be either direct or indirect, we next used ChIP-seq data for four of the TC-YIK enriched factors to generate a paired complimentary dataset which would identify the genomic binding sites of the same factor. Integration of these two edge types (KD-CAGE and ChIP-seq) should allow us to discriminate direct from indirect edges. Biological duplicates for each factor were generated and ChIP-seq binding peaks were called relative to input chromatin using MACS (Zhang et al., 2008). We note that the number of peaks called for the same target in different biological replicates varied (NEUROD1: 7195 and 14,949 peaks, LMX1A: 7622 and 7361 peaks, PAX6: 587 and 7866 peaks, RFX6: 960 and 1659 peaks). To be conservative we only used peaks that were called as reproducible with 90% likelihood using the irreproducible discovery rate (Li et al., 2011) method (IDR ≤ 0.1) which yielded 144 RFX6 peaks, 190 PAX6 peaks, 4506 NEUROD1 peaks and 2166 LMX1A peaks. Scanning these peaks for known TFBS motifs using HOMER (Heinz et al., 2010) found significant enrichment for the relevant motifs (NeuroD1/Homer motif was found in 46% of NEUROD1 peaks, 7.4% of background; Lmx1a-mouse/Jaspar-9% of LMX1A peaks, 4.7% of background; PAX6/SwissRegulon-11% of PAX6 peaks, 2.2% of background, Supplementary Figure 3). For RFX6 there is no known motif; however, the motifs of other RFX family members, and in particular RFX5, were enriched (37% of RFX6 peaks and 3% of background). De-novo motif finding on the RFX6 ChIP-seq data identified a novel motif that is found in 58% of RFX6 peaks and 4% of background sequences. This motif closely resembles, but is different from, other RFX family motifs ( Figure 4A).
Examining the distribution of binding in the genome, we observed that the four factors often bound in combination at the same sites, and seldom bound at promoters. For example in the RERE locus we observed co-binding of NEUROD1 and LMX1A, and NEUROD1 and RFX6, respectively, at distinct sites (see boxes in Figure 4B). Genome wide, co-binding of two or more of these enriched factors was common, with more than half of the RFX6 and PAX6 sites overlapping a LMX1A or NEUROD1 site ( Figure 4C).
Given (1) the paucity of promoter proximal binding of these factors and (2) the ample similarity between TC-YIK cellular program and endocrine program, we compared the binding sites to a map of open chromatin sites in human islet cells. Pasquali et al. (2014) integrated FAIRE-seq, and ChIP-seq of H2A.Z, H3K4me1, H3K4me3, H3K27ac, and CTCF to classify open sites in the genome of human islets as promoters (C1), poised/inactive enhancers (C2), active enhancers (C3), CTCF-bound sites (C4), and other open sites (C5). In our ChIP-seq data, we found that between 46 and 62% of peaks overlapped at least one of these open chromatin sites (this was comparable to the overlap seen by the authors for their own TF ChIP-seq experiments; 48 to 81% for NKX2.2, PDX1, FOXA2, NKX6.1, and MAFB). For those peaks overlapping the islet cell open sites, we observed enriched binding at active enhancer sites and depletion of promoter sites for all four factors (Figure 4D, Supplementary Table 10), suggesting that these factors primarily work at enhancers.
In support of this observation, both NEUROD1 and PAX6 have been reported previously to bind enhancer regions (Andersen et al., 1999;Aota et al., 2003;Scardigli et al., 2003;Inoue et al., 2007;Babu et al., 2008), and a recent PAX6 ChIPseq dataset in neuroectoderm cells identified multiple PAX6 regulated enhancers, and reported that less than 2% of 16,000 PAX6 peaks are near TSS of coding genes (Bhinge et al., 2014). In the case of RFX6 there is still little known about its functional targets. Other RFX family members have been reported to be bound at enhancers (Reith et al., 1994;Maijgren et al., 2004;Creyghton et al., 2010;Watts et al., 2011), and in the Pasquali et al. study an RFX motif was over-represented at islet cell enhancer clusters (Pasquali et al., 2014). Intriguingly, RFX6 had twice as many peaks overlapping class C5 than expected, suggesting that RFX binding may be one of the earliest events at opening of sites (Niesen et al., 2005). For LMX1A, ours is the first report of its involvement at enhancers.

Integration of ChIP-seq and KD-CAGE Data to Identify Direct Transcriptional Targets of TFs
By combining KD-CAGE with ChIP-seq data for LMX1A, NEUROD1, PAX6, and RFX6, we hoped to identify directly regulated promoters (that is, promoters perturbed in the knockdown experiments that also had matching nearby ChIP-seq signal). In the case of NEUROD1 and LMX1A, we observed that promoters closest to a matching ChIP-seq peak were indeed affected. In particular for NEUROD1, almost 80% of promoters within 1 kb of a NeuroD1 ChIP-seq peak were down-regulated and for LMX1A almost 70% of promoters within 1 kb of an Lmx1a ChIP-seq peak were down-regulated ( Figure 5A). Both cases indicate that these factors work primarily as transcriptional activators. As one moves further away from a ChIP-seq peak the fraction of down-regulated promoters drops, however, even at distances greater than 5 kb (up to 100 kb) from a TSS we observed a higher proportion of down-regulated TSS compared to that seen for those >100 kb away, suggesting that both factors can affect gene expression in cis from neighboring enhancer elements (the closer the element, the higher the probability of being affected). Repeating the analysis only using peaks with or without a TFBS motif showed no significant differences in the fractions of TSS likely to be affected. In fact, for the case of LMX1A and NEUROD1 the fraction of perturbed TSS increased at shorter distances relative to a ChIP-seq peak, regardless of whether the ChIP-seq peak overlapped a motif or not (Supplementary Table 11). In the case of RFX6 and PAX6, we observed no such distance-dependent effect, suggesting that either these factors work predominantly via distal sites or that the small number of ChIP-seq peaks observed for these two factors confounded the analysis.
Finally it is worth noting that not all proximal sites appear to be functional. For NEUROD1 and LMX1A respectively, 17 and 18% of the TSSs within 1 kb of a ChIP-seq peak for the same factor were unaffected in the knock-down. An example is shown for the EYS locus. ChIP-seq and TFBS predictions support binding of LMX1A and NEUROD1 at the EYS promoter, but only NEUROD1 perturbation affected EYS expression levels ( Figure 5B; other examples are shown in Supplementary Figure 4).

Role of NEUROD1 and LMX1A in the TC-YIK TRN
Our original objective had been to integrate KD-CAGE and ChIP-seq to identify directly regulated targets (in this case of NEUROD1, LMX1A, PAX6, and RFX6). However, based on the results above, we conclude that the majority of binding events happen at enhancers, and only in the case of NEUROD1 and LMX1A where we observed enrichment for perturbed TSS at shorter distances to the TSS can we infer direct promoter mediated edges. For these two factors, we considered TSS that are down-regulated at least 1.5-fold and with a ChIP-seq peak at a distance of less than 50 kb as likely direct targets. This identified 317 and 1543 directly regulated promoters for LMX1A and NEUROD1 respectively (Supplementary Table 12). Finally, to understand the hierarchy of these factors we checked whether they directly regulate any of the other TC-YIK enriched TFs identified in the beginning of the paper. Focusing on the core network (TF-TF) we find that both NEUROD1 and LMX1A directly target 12 and 4 TC-YIK enriched TFs, respectively, but do not directly regulate each other ( Figure 5C).

CONCLUSION
In this paper we have introduced an experimental strategy to elucidate cell type specific transcriptional regulatory networks. We start by identifying cell type enriched transcription factors (pre-computed lists for all primary cell types available online from the FANTOM web resource (Lizio et al., 2015) http:// fantom.gsc.riken.jp/5/) and then use a combination of siRNA perturbation, CAGE and ChIP-seq to identify their direct and indirect targets. This strategy leverages the strengths of both approaches. Application of CAGE to siRNA perturbed samples identifies affected genes and ChIP-seq identifies directly bound targets. We show that ChIP-seq alone is insufficient  Table 4) that are directly regulated by NEUROD1 or LMX1A. To be called a direct target, we require at least one TSS of the target gene to be down-regulated 1.5-fold with a p-value of 0.05 and it must be within 50 kb of a ChIP-seq peak for the same factor. to discriminate functional from non-functional bound sites, while perturbation approaches alone cannot unequivocally discriminate direct from indirect targets. It is important to precise that we are not questioning the power of ChIP methods in identifying direct and indirect binding (Gordan et al., 2009); the novelty of our approach lies in demonstrating that even in the presence of a TF-DNA interaction, regulation of target genes can happen only if the site of interaction is functional. This work highlights an important and yet undervalued matter, as in many previous publications researchers have assumed the nearest gene to, or any gene within a fixed distance of, a ChIPseq peak, is a direct target (Shin et al., 2009;Bottomly et al., 2010;Tallack et al., 2010;Schodel et al., 2011). This is clearly an oversimplification. We have shown that almost a fifth of TSS within 1 kb of a NEUROD1 or LMX1A ChIP-seq peak are unaffected in matching siRNA knock-down. This could mean that these sites are non-functional or that they are cell-context dependent (Osmanbeyoglu et al., 2012;Whitfield et al., 2012).
Aside from exploring this strategy to build TRNs, we have introduced TC-YIK as a model to study transcriptional regulation of pancreatic genes. There is a need for such cell line models, as the majority of viable post mortem islet cell material is used for transplants into diabetic patients, thus pancreatic beta cells for research are difficult to obtain. Moreover, the isolation of pure beta cell populations, the lack of protocols to expand them in culture and the number of cells required to carry out extensive perturbation and chromatin immuno-precipitation experiments are prohibitive. We have shown by CAGE profiling that 85% of the beta cell genes identified by the beta cell gene atlas (Kutlu et al., 2009) are expressed in TC-YIK and that NEUROD1, LMX1A, PAX6, and RFX6 binding sites in TC-YIK are enriched at islet cell active enhancer sites. Furthermore, TC-YIK cells express key transcription factors known to be involved in pancreatic cell development and differentiation, including NEUROD1, PDX1, and FOXA2 (Wang et al., 2002;Itkin-Ansari et al., 2005;Guo et al., 2012). In fact, 33 of the top 42 most TC-YIK enriched TFs are implicated in pancreatic biology. In addition, 33 homolog TFs are expressed in developing mouse pancreas. On this account, we, for the first time, find evidence of ASCL2, HLF, HSF4, IRF6, IRF8, C11orf9/MYRF, and NPAS3 playing a role in pancreatic neuroendocrine gene expression and development. The only two TFs without prior references in the literature or detectable expression in the FANTOM5 mouse pancreatic samples were SIX3 and DLX6, respectively. Despite this, DLX6 expression has previously been reported in earlier pancreatic stages (E12.5 and E13.5; Gasa et al., 2004). This thorough review shows that the majority of transcription factors with enriched expression in TC-YIK have a role in pancreatic development and thus, TC-YIK is an important cell line model for studying transcriptional regulation of pancreatic gene expression.
Genome-wide expression profiling of the perturbed samples by CAGE revealed multiple insights. The majority of TF knockdowns led to more down-regulated genes than up-regulated ones, suggesting these TFs primarily work as activators, in agreement with the arguments of Hurst et al. (2014). From this logic, we predict HMGA1, NEUROD1, LMX1A, SHOX2, NROB2, GATA4, RFX6 as likely activators and MNX1 and TBP as likely repressors. Although there is the possibility that a predicted activator is in fact a repressor of an activator and a predicted repressor is an activator of a repressor, we find that both GATA4 (Rojas et al., 2008) and LMX1A (Andersson et al., 2006) have direct evidence as transcriptional activators and MNX1 (William et al., 2003) has been confirmed as a transcriptional repressor. By incorporating ChIP-seq data we can verify the roles of TFs directly. For both NEUROD1 and LMX1A we show that they work as direct transcriptional activators. This clarifies the role of NEUROD1 as a previous work reported it as both a transcriptional repressor and activator (Itkin-Ansari et al., 2005). Integration of the CAGE and ChIP-seq data clearly shows that >75% of TSS proximal to NEUROD1 are down-regulated in NEUROD1 knock-down ( Figure 5A). In the previous work by Itkin-Ansari et al. the authors used perturbation (over-expression) alone and assumed SST downregulation upon NEUROD1 over-expression indicated it was a target that was directly transcriptionally repressed; we think it is more likely that NEUROD1 indirectly antagonizes SST expression via other pancreatic TFs. This highlights the value of using both perturbation and ChIP-seq approaches.
In terms of what the application of our strategy to TC-YIK has told us about pancreatic gene expression, and the hierarchy of TFs, firstly we have shown that not only enriched (MNX1, NEUROD1, SHOX2, PAX4, NROB2, HOPX, RFX6, MLXIPL, GATA4, LMX1A, PAX6, ASCL1) but also non-enriched factors (ATF5, TAF10, HMGA1, TCF25, TAF9, HMGB2, GTF3A) contribute to the maintenance of the TC-YIK state. It is thus important to consider housekeeping TFs, too, when building cell-specific TRNs since they often work cooperatively with state specific factors (Ravasi et al., 2010). Our analysis also identified ISL1 and PROX1 as likely antagonists to the state. It may be that these antagonists help maintain a stem/progenitor like state (Wang et al., 2005;Eberhardt et al., 2006). We show that NEUROD1 and LMX1A are both directly activating multiple other pancreatic TFs, and that based on our data they do not directly regulate each other ( Figure 5C).
Finally, building cell-type-specific TRNs will require further work and integration of newer data types. In the case of RFX6 and PAX6 we made no predictions of their direct targets as there were few peaks bound at promoter regions and there was no enrichment for perturbed TSS near these peaks. This could be due to lower quality or less efficient antibodies used for the two factors, or could reflect lower expression levels compared to the other factors. Despite this, for all four factors (including the higher quality NEUROD1 and LMX1A experiments) the majority of peaks were at putative enhancer regions. In conclusion, mammalian TRN models will need to incorporate distal regulatory elements as well, as proximal elements. To address this issue in the future we will need to use protocols such as ChIA-PET (Fullwood et al., 2009) and HiC (Dixon et al., 2012) to link distal elements with the TSS that they regulate. We believe that such chromatin conformation methods combined with KD-CAGE and ChIP-seq have the potential to identify gold standard regulatory events at both promoters and enhancers, and are key to understanding how each cell type is wired.

Selection of Transcription Factors Significantly Enriched in TC-YIK for siRNA Knock Down
A pre-computed list of TFs with enriched expression in TC-YIK was downloaded from FANTOM5's sample browser SSTAR [direct link: http://fantom.gsc.riken.jp/5/sstar/FF:10589-108D4, see FANTOM web resource (Lizio et al., 2015)]. Enrichment is based on expression in the sample compared to the median expression across all samples in the FANTOM5 collection. The enrichment score is defined as log10[(expression in TC-YIK + 1)/(median expression in FANTOM5 + 1)]. The top 33 genes with enriched expression in TC-YIK were targeted for siRNA knock-down using stealth siRNAs from Invitrogen. As a comparison we also targeted a set of 8 non enriched TFs (TAF9, TAF10, ATF5, GTF3A, TCF25, TBP, HMGA1, HMGB2) that were expressed in TC-YIK at similar levels. In addition to these TFs, six target genes (INS, CHGA, GHRL, GCK, GAST, TTR) and five additional target TF genes where we were unable to find effective siRNAs (ASCL2, CBFA2T2, CDX2, INSM1, TFAP2A) were also added to the set. The combined set was used for systematic siRNA KD in triplicate of one factor at a time followed by qRT-PCR measurements of the perturbed genes in a Matrix RNAi design as described in Tomaru et al. (2009). siRNA sequences, knockdown efficiency and primers used in qRT-PCR are provided in Supplementary Table 9.

Genome-wide KD-CAGE
KD experiments followed by CAGE were profiled (see below) to obtain genome-wide promoter activities. Of the 41 most enriched TFs that were selected for Matrix RNAi, 15 among the most perturbed and all 8 non-enriched genes were chosen for siRNA transfection followed by CAGE. The 15 enriched TFs targeted for CAGE analysis were selected in a semi-random fashion that favored TFs that affected insulin expression in the qRT-PCR results (Figure 2). NEUROD1, DACH1, RFX6, ASCL1, PAX6, MNX1, HOPX, MLXIPL, LMX1A, SHOX2, GATA4, and PAX4 knock-down significantly reduced INS transcript levels. PROX1, NR0B2, and ISL1 were selected based on their reported roles in pancreatic biology as putative repressors, rather than their effect on INS levels. Experiments were carried out in biological triplicate, and scrambled siRNA samples were prepared as negative control. While the KD method has been previously described (Vitezic et al., 2010), we used a new variant of CAGE developed for the Illumina Hiseq 2500 called nAnT-iCAGE (Murata et al., 2014). Briefly, 5 µg of RNA was used for each sample and libraries were combined in 8plex using different barcodes. Tags were de-multiplexed and mapped to the human genome (hg19) using BWA (Li and Durbin, 2010), yielding an average of 8.9 M mapped counts per sample (map quality > 20). Expression tables were made by counting the numbers of mapped tags falling under the 184,827 robust CAGE peaks regions identified in FANTOM5 . Differential expressed promoters in TF knock-downs vs scrambled controls were identified using edgeR (Robinson et al., 2010) with a significance threshold of 0.05.

Chromatin Immunoprecipitation Assay
Chromatin was prepared and immunoprecipitation carried out as described previously .
List All experiments were carried out as biological duplicates. Immunoprecipitated and input chromatin samples were incorporated into 4-plex ChIP-seq libraries using the NEBnext kit (New England Biolabs). Libraries were labeled with a 6 bp barcode and then pooled to be sequenced on Illumina HiSeq2000.
Sequencing results were mapped to the human genome (hg19) using BWA software (Li and Durbin, 2010) providing an average of ∼180 M mapped tags per lane (or, alternatively, ∼45 M per sample), with a mapping rate of >96%. After mapping we performed peak calling using MACS software (Zhang et al., 2008) with the recommended default parameter settings for point binding type of events [mfold= (Refai et al., 2005;Tompa et al., 2005), bandwidth=300]. We additionally used Irreproducible Discovery Rate analysis (Li et al., 2011), to identify reproducible peaks which were used for downstream analysis.

Motif Enrichment Analysis
We used HOMER software for de-novo motif discovery (Heinz et al., 2010), as well as to calculate over-representation of known motifs. Known motifs provided with HOMER (v4.6, 3-29-2014) were expanded by importing all known NEUROD1, LMX1A, PAX6, and RFX motifs from SwissRegulon (Pachkov et al., 2007), JASPAR (Bryne et al., 2008), UniPROBE (Newburger and Bulyk, 2009), and HOCOMOCO (Kulakovskiy et al., 2013), into HOMER before carrying out the scan. We used the function findMotifsGenome.pl to discover motifs in all reproducible peaks for each factor (genomic regions from hg19) with the option "-mask" to filter out bindings on repeats. The target sequences are the regions under the peaks and the background regions are randomly sampled sequences from the genome (Hg19) with similar GC content as the target sequences.

Gene Ontology Enrichment Analysis
The R Bioconductor GOstats package (Falcon and Gentleman, 2007) was used to obtain gene ontology enrichment scores. For the ChIP-seq GO analysis was performed on bound TSSs, while for the CAGE KD experiments, the up-and down-regulated genes were analyzed separately. For both analyses, all genes expressed in TC-YIK (>1 TPM) were used as the background.

Data Access
This work is part of the FANTOM5 project. Data download, genomic tools and co-published manuscripts have been summarized at http://fantom.gsc.riken.jp/ 5/. A ZENBU genome browser view displaying TC-YIK related expression data can be accessed at this URL: http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=e3YeqamiJBW hbPgPq59ubD;loc=hg19::chr14:93349815..93441266 [Reviewer username: lizio2014-review@riken.jp, password: lizio2014 (note: if problems after logging in, re-enter the URL and try again. Password will be removed at publication)]. All sequencing data used in this study has been deposited to DDBJ Read Archive (http://www.ddbj.nig.ac.jp/) with accession number DRA002420 (CAGE data) and DRA002468 (ChIP-seq data). CAGE expression profiles and enrichment of TFs for TC-YIK cell line are part of the FANTOM5 main data set. siRNA perturbations, CAGE-KD, and ChIP-seq experiments were generated separately for this study. Additional material can be found at the following URL (http://fantom.gsc.riken.jp/5/ suppl/Lizio_et_al_2014/?cultureKey=&q=5/suppl/Lizio_et_al_ 2014 Reviewer username: m.lizio, password: m.lizio).

AUTHOR CONTRIBUTIONS
AF designed the study and wrote the manuscript; ML carried out all bioinformatics analyses and wrote the manuscript; YI carried out the siRNA perturbations, qRT-PCR and chromatin immunoprecipitation experiments with help from AK; MI provided the CAGE libraries; TL and AH mapped the CAGE data; YN provided the TC-YIK cell line; JS helped with visualization in ZENBU; HK contributed to the ChIP-seq analysis and provided the set of CAGE peaks; HS, HK, PC, YH, and AF supervised the project.  Supplementary Table 1 | Human islet cell enriched transcripts. Detection of human islet cell enriched transcripts from the beta cell gene atlas (Kutlu et al., 2009) in TC-YIK.

Supplementary Table 2 | Rat alpha and beta cell enriched transcripts.
Detection of human orthologs of rat alpha and beta cell enriched transcripts from the beta cell gene atlas (Kutlu et al., 2009) in TC-YIK.