Candidate Phyla Radiation Roizmanbacteria From Hot Springs Have Novel and Unexpectedly Abundant CRISPR-Cas Systems

The Candidate Phyla Radiation (CPR) comprises a huge group of bacteria that have small genomes that rarely encode CRISPR-Cas systems for phage defense. Consequently, questions remain about their mechanisms of phage resistance and the nature of phage that infect them. The compact CRISPR-CasY system (Cas12d) with potential value in genome editing was first discovered in these organisms. Relatively few CasY sequences have been reported to date, and little is known about the function and activity of these systems in the natural environment. Here, we conducted a genome-resolved metagenomic investigation of hot spring microbiomes and recovered CRISPR systems mostly from Roizmanbacteria that involve CasY proteins that are divergent from published sequences. Within population diversity in the spacer set indicates current in situ diversification of most of the loci. In addition to CasY, some Roizmanbacteria genomes also encode large type I-B and/or III-A systems that, based on spacer targeting, are used in phage defense. CRISPR targeting identified three phage represented by complete genomes and a prophage, which are the first reported for bacteria of the Microgenomates superphylum. Interestingly, one phage encodes a Cas4-like protein, a scenario that has been suggested to drive acquisition of self-targeting spacers. Consistent with this, the Roizmanbacteria population that it infects has a CRISPR locus that includes self-targeting spacers and a fragmented CasY gene (fCasY). Despite gene fragmentation, the PAM sequence is the same as that of other CasY reported in this study. Fragmentation of CasY may avoid the lethality of self-targeting spacers. However, the spacers may still have some biological role, possibly in genome regulation. The findings expand our understanding of CasY diversity, and more broadly, CRISPR-Cas systems and phage of CPR bacteria.


INTRODUCTION
The Candidate Phyla Radiation (CPR) comprises a huge fraction of Domain Bacteria. The scale of the radiation remains unclear, but it may include as much as 26-50% of all bacterial diversity Parks et al., 2017;Schulz et al., 2017). The CPR bacteria uniformly have small genomes (often ∼1 Mbp) and limited biosynthetic capacity (Brown et al., 2015;Anantharaman et al., 2016;Hug et al., 2016;. Most are thought to be symbionts, in some cases cell surface attached (episymbionts), that depend on other bacteria for basic cellular building blocks (for review, see . A previous meta-analysis found that only 2.4% of organisms from the Parcubacteria (OD1) and Microgenomates (OP11) superphyla encode CRISPR-Cas systems in their genomes, as compared to 47.4% in archaea and 24.4% in non-CPR bacteria (Burstein et al., 2016). The authors noted that when CRISPR-Cas systems occur in CPR bacteria they tend to be different from those found in other bacteria. Four genomes from Dojkabacteria (WS6), Parcubacteria (OD1) and Roizmanbacteria were previously recognized to encode CRISPR-Cas12a (Cpf1) systems (Zetsche et al., 2015), and more recently, six genomes were reported encoding a newly recognized compact CasY effector enzyme that has genome editing potential .
Several potential explanations for the low frequency of CRISPR-Cas systems in CPR bacteria have been suggested (Burstein et al., 2016). Small genome size may favor use of more compact restriction-modification systems for phage defense and low ribosome content may preclude sufficiently fastacting CRISPR-Cas systems required for effective interference (Burstein et al., 2016). Symbiotic lifestyles, characterized by close association between multiple cells and a host cell, could lead to higher phage densities, which may cause selection of defense systems other than CRISPR-Cas (Westra et al., 2015). It has also been suggested that CPR bacteria may not have the RecBCD mechanism identified in non-CPR Bacteria to curtail self-targeting spacer acquisition (Levy et al., 2015;. As few phage that infect CPR bacteria have been reported (Paez-Espino et al., 2016;Dudek et al., 2017), it is difficult to know how common phage that infect these bacteria might be. Phage particles in the process of infecting CPR bacterial cells have been observed via cryogenic electron microscopy (Luef et al., 2015). However, the sequences of phage associated with CPR bacteria are unusually difficult to identify in metagenomic datasets, in part due to the lack of CRISPR spacers that could be used to link them to host cells via CRISPR targeting (Andersson and Banfield, 2008). Further, like phage, CPR genomes encode a very high proportion of novel proteins , which obscures identification of potential prophage regions. Finally, phage structural proteins may be too divergent from those of well-studied phage to be identified. To date, phage have only been reported for bacteria from two CPR phyla, Absconditabacteria (previously SR1) and Saccharibacteria (previously TM7) (Paez-Espino et al., 2016;Dudek et al., 2017). Thus, there is a potentially huge knowledge gap related to the existence and diversity of CPR phage. This motivates the search for new CPR genomes with CRISPR-Cas systems that could potentially provide links to additional examples of phage that replicate in these bacteria.
In the current study, we investigated the microbiomes of a series of hot springs in Tibet. CPR bacteria are relatively abundant in these thermal environments, and some of their genomes encode interesting and unusual CRISPR-Cas systems. Although uncommon overall, CRISPR-Cas systems are surprisingly frequently encoded in the genomes of members of the Roizmanbacteria, and multiple different systems coexist in some genomes. We identified many new examples of systems based on CasY and uncovered an intriguing example of a locus with self-targeting spacers and a fragmented CasY gene. We identified CPR phage for which complete, curated genomes were reconstructed, as well as prophage in other genomes. Thus, our analyses provide new insights into CPR biology, their phage and the diversity of the relatively unstudied CRISPR-CasY system.

Study Site, Sampling and Physicochemical Determination
Hot spring (40.8-84.9 • C) sediment samples were collected from Tibet Plateau (China) in August 2016 (Supplementary Table S1). As described previously (Song et al., 2013), sediment samples were collected from the hot spring pools using a sterile iron spoon into 50 ml sterile tubes, transported to the lab on dry ice, and stored at −80 • C for DNA extraction. Temperature, dissolved oxygen (DO) and pH were determined in situ and the other physicochemical parameters were analyzed in the laboratory (Supplementary Table S1).

DNA Extraction, Sequencing, Quality Control and Metagenomic Assembly
Genomic DNA was extracted from sediment samples using the FastDNA SPIN kit (MP Biomedicals, Irvine, CA, United States) according to the manufacturer's instructions. The DNA samples were purified for library construction, and sequenced on an Illumina HiSeq2500 platform with PE (paired-end) 150 bp kits. The raw data of each metagenomic dataset were filtered to remove Illumina adapters, PhiX and other Illumina trace contaminants with BBTools, and low quality bases and reads using Sickle (version 1.33 1 ). The high-quality reads of each sample were assembled using metaSPADES (version 3.10.1) (Bankevich et al., 2012) with a kmer set of 21, 33, 55, 77, 99, and 127.

HMM-Based Search of CasY Proteins and Confirmation of CRISPR-CasY System
The six CasY proteins reported previously  were aligned using Muscle (Edgar, 2004), and filtered to remove those columns comprising 95% or more gaps with TrimAL (Capella-Gutiérrez et al., 2009). A HMM model was built based on the filtered alignment using hmmbuild 2 (Eddy, 1998) with default parameters, hmmsearch was used to search all the proteins predicted by Prodigal from scaffolds. Those hits with an e-value < 10 −5 were manually checked, and the online tool CRISPRs finder (Grissa et al., 2008) was used to identify the Cas1 protein and CRISPR loci. Only those scaffolds detected with CasY, Cas1, and CRISPR array were retained for further analyses. Other CRISPR-Cas systems identified in these genomes based on the presence of Cas proteins and CRISPR arrays were also analyzed in this study.

Extension and Manual Curation of CasY Scaffolds
Those scaffolds with partial CasY representatives were manually extended as follows: (1) mapping the high quality reads to the corresponding scaffolds using bowtie2 with default parameters; (2) filtering the mapping files using mapped.py (part of the ra2 suite) to remove those PE reads with two or more mismatches to the assembled scaffold across both reads combined; (3) importing the filtered mapping files into Geneious and mapping using the "Map to Reference" function; (4) extending the scaffolds at the partial CasY protein ends; (5) performing the first 4 steps again (multiple times if necessary) until full length CasY proteins were obtained.
The extended scaffolds and other full-length CasY scaffolds were checked for any potential assembly errors using ra2.py 2 , the general strategy was described previously (Brown et al., 2015). Errors reported as unresolved by ra2.py were fixed manually in Geneious using unplaced paired reads that were mapped to the scaffolding gaps.

Coverage Calculation, Genome Binning, Genome Curation and Completeness Assessment
The high quality reads were mapped to the corresponding assembled scaffolds using bowtie2 with default parameters and the coverage of each scaffold calculated as the total number of bases mapped to it divided by its length. For each sample, scaffolds over 2,500 bp were assigned to preliminary draft genome bins using MetaBAT with default parameters, considering both tetranucleotide frequencies (TNF) and scaffold coverage information. The clustering of scaffolds from the bins and the unbinned scaffolds was visualized using ESOM with a min length of 2,500 bp and max length of 5,000 bp as previously described (Dick et al., 2009). Misplaced scaffolds were removed from bins and unbinned scaffolds whose segments were placed within the bin areas of ESOMs were added to the bins. Scaffolds ≥ 1,000 bp from each sample were uploaded to ggKbase 3 . The ESOMcurated bins with interesting CasY-bearing scaffolds were further evaluated based on consistency of GC content, coverage and taxonomic information, and scaffolds identified as contaminants were removed. The genome bins with CRISPR-CasY systems

Gene Prediction and Metabolic Prediction
The protein-coding genes of the curated genomes (see above) were predicted using Prodigal (-m single) (Hyatt et al., 2010), and searched against KEGG, UniRef100, and UniProt for annotation, and metabolic pathways were reconstructed. The 16S rRNA genes were predicted based on HMM models, as previously described (Brown et al., 2015). The ribosome binding site sequence was obtained via the Prodigal gene prediction results.

CRISPR Loci Reconstruction and Spacer Identification
For all the confirmed CRISPR-CasY and other CRISPR-Cas systems, the quality reads were aligned to the scaffolds from the corresponding sample using bowtie2 with default parameters (Brown et al., 2015;Langmead and Salzberg, 2012). Any unmapped reads of read pairs were mapped to the scaffolds in Geneious using the function of "Map to Reference, " then the CRISPR loci were manually reconstructed, allowing for spacer set diversification and loss of spacer-repeat units in some cells. Thus, it was possible to place most reads in an order that reflects the locus evolutionary history. For each CRISPR locus, all the reads that mapped were extracted, and spacers between two direct repeats were used for target searches (see below).

Spacers Target Search and Identification of (Pro)phage Scaffolds
All the spacer sequences from each CRISPR locus were dereplicated, then the sequences were searched against scaffolds from related samples using BLASTn with the following parameters: -task blastn-short, -dust no, -word_size 8. Those scaffolds with 0 mismatch and 100% alignment coverage to one or more spacers were manually checked for phage-specific proteins, including capsid, phage, virus, prophage, terminase, prohead, tape measure, tail, head, portal, DNA packaging, as described previously (Dudek et al., 2017).

In silico Determination of Protospacer Adjacent Motif (PAM)
To determine the PAM of the CRISPR-CasY systems in Roizmanbacteria genomes, for each CRISPR spacer with a target in two complete phage genomes from QZM (see section "Results"), the upstream 5 bp and downstream 5 bp of the targeted DNA strand were searched manually and the PAM was determined and visualized using Weblogo (Crooks et al., 2004). The PAM analyses for other CRISPR-Cas systems analyzed in this study were performed in the same way.
For tree construction, protein sequences datasets were aligned using Muscle (Edgar, 2004). The 16S rRNA gene sequences were aligned using the SINA alignment algorithm (Edgar, 2004;Pruesse et al., 2012) through the SILVA web interface (Pruesse et al., 2007). All the alignments were filtered using TrimAL (Capella-Gutiérrez et al., 2009) to remove those columns comprising more than 95% gaps. For the 16 RP, ambiguously aligned C and N termini were removed and the amino acid sequences, which were concatenated in the order as stated above (alignment length, 2654 aa). The phylogenetic trees were reconstructed using RAxML version 8.0.26 with the following options: -m PROTGAMMALG (GTRGAMMAI for 16S rRNA phylogeny) -c 4 -e 0.001 -# 100 -f a (Capella-Gutiérrez et al., 2009;Stamatakis, 2014). All the trees were uploaded to iTOL v3 for visualization and formatting (Letunic and Bork, 2006).

Data Availability
The reconstructed CPR (representative genome of each group) and their infecting phage genomes reported in the current study were deposited at NCBI within BioProject PRJNA493250 (BioSample SUB4567369), under the accession numbers of SAMN10133524-SAMN10133631. The CPR and phage genomes, and also those four unbinned scaffolds detected with CasY systems can be explored and downloaded from ggKbase 4 following publication of this manuscript. Note that registration by provision of an email address is required prior to data download.

Newly Reconstructed Roizmanbacteria and Woesebacteria Genomes With CRISPR-Cas Systems
Candidate Phyla Radiation bacteria comprised up to 43.1% of the hot spring communities (9.5% on average) (Supplementary Figure S1A and Supplementary Table S1). Generally, we found that the samples from higher temperature hot springs had lower relative abundance of CPR (Supplementary Figure S1B). We selected 17 genomes that encode CRISPR-Cas systems for curation ( Figure 1A and Table 1). Based on rpS3 protein taxonomic analysis, one Woesebacteria genome and 12 Roizmanbacteria genomes encode CRISPR-CasY systems, and four other Roizmanbacteria genomes encode only type III-A CRISPR-Cas systems. Both of these phylum-level groups place within the Microgenomates (OP11) (Brown et al., 2015;Hug et al., 2016). Phylogenetic analyses based on 16 RPs with published Roizmanbacteria genomes (43 dereplicated in total) indicated the divergence of the newly reconstructed Roizmanbacteria from previously published genomes ( Figure 1A). The new Roizmanbacteria genomes were assigned to two distinct classes based on their 16S rRNA gene sequences (Yarza et al., 2014) and/or average nucleotide identity (ANI) (Figure 1A and Supplementary Figure S2). Five of the genomes represent two different strains, with an ANI of 98.39% (clade 1; Figure 1A), and the other 11 genomes belong to the same family (clade 2; Figure 1A). Genomes in clade 1 and 2 were assigned to groups ( Figure 1A and Table 1).

CRISPR-CasY Detected in Roizmanbacteria and Woesebacteria Genomes
We identified 69 CasY candidates (see methods), 17 of which are on scaffolds with a Cas1 protein and CRISPR locus (Supplementary Table S3). Of these, 12 scaffolds could be assigned to Roizmanbacteria genomes and one to a Woesebacteria genome ( Table 1). The other four scaffolds with CRISPR-CasY systems could not be binned, but were also included in our analyses ( Figure 1B).
The CRISPR-CasY systems from Roizmanbacteria and Woesebacteria have a different architecture than those reported previously , with CasY and Cas1 proteins on the same side of the CRISPR locus ( Figure 1A). The Roizmanbacteria CasY proteins have similar lengths of 1,252-1,256 aa, whereas that found in the Woesebacteria is 1,304 aa (Supplementary Table S3), comparable to lengths of previously reported CasY [1,153-1,287 aa; ]. Phylogenetic analyses of CasY proteins showed that the newly reported Roizmanbacteria and Woesebacteria sequences are most closely related to CasY.1 from Candidatus Katanobacteria [WWE3; ] ( Figure 1B).
CasY is an effector protein of Type V CRISPR-Cas systems. To date, all reported Type V CRISPR-Cas systems have RuvClike nuclease domains Chen and Doudna, 2017). Comparative analyses of all CasY proteins reported in this study and CasY.1 with Cpf1, C2c1, and C2c3 references (Shmakov et al., 2015), identified all the catalytic residues within the three conserved motifs of RuvC-I, RuvC-II, and RuvC-III ( Figure 1B), suggesting the RuvC domains in the new CasY proteins are active nucleases. On the other hand, we detected divergence in other regions of the CasY proteins from different sampling sites ( Figure 1B).

Other CRISPR-Cas Systems Identified in Roizmanbacteria Genomes
A Type III-A system was detected in all 11 clade 2 Roizmanbacteria genomes, seven of which encode more than one type of system ( Figure 1A, Table 1 Table S3), which are used for acquisition of new spacers (Nuñez et al., 2014;Shmakov et al., 2015;Hille et al., 2018). In detail, III-A systems in C2-Gp4 (for short of Clade2-Group4, same for others hereafter, Table 1) and C2-Gp5 have both Cas1 and Cas2. C2-Gp6 and C2-Gp7 possess Cas1 but not Cas2. Four genomes in C2-Gp3 lack both Cas1 and Cas2 but have a Mor transcription activator family protein (Figure 1A, Supplementary Table S3, and Supplementary Figure S3A). However, the CRISPR-Cas system in C2-Gp3 may be non-functional because the repeats are imperfect. A fragment of the C2-Gp3 genomes encodes the 16 ribosomal proteins used for phylogenetic analyses and a restriction-modification system that may instead be used for phage defense (Figures 2A,B).
A Type I-B system was identified in two Roizmanbacteria genomes belonging to the same genus (C2-Gp5 and C2-Gp7), but not in the C2-Gp6 genomes, despite the fact that C2-Gp5 and C2-Gp7 are very closely related to C2-Gp6 (ANI = 99% and 16S similarity = 98.9%). Comparative genomic analyses showed that the Type I-B system is located between genes encoding a secreted  Frontiers in Microbiology | www.frontiersin.org cysteine-rich protein and a lamin tail domain protein that are present in both genomes (Supplementary Figure S3B). Two very short hypothetical proteins were detected between the cysteinerich and lamin tail domain proteins in the C2-Gp6 genomes (Supplementary Figure S3B). However, NCBI BLAST and HMM searches indicate no homology of the hypothetical proteins to any known proteins or functional domains, respectively, and no significant similarity to the Cas proteins of Type I-B systems in the C2-Gp5 and C2-Gp7 genomes.

CRISPR-Cas12a Systems in Published Roizmanbacteria Genomes
We investigated 131 published Roizmanbacteria genomes available from NCBI to identify all CRISPR-Cas systems that occur in these bacteria (Supplementary Table S4). The CRISPR-Cas12a system (Cpf1), which was identified in one Roizmanbacteria genome (Zetsche et al., 2015), occurred in four Roizmanbacteria genomes from two classes ( Figure 1A and Supplementary Figures S2, S4), one of them in the class containing Roizmanbacteria clade II with type I-B and III-A systems (see above). Interestingly, the CRISPR-Cas12a systems reported in Zetsche et al. (2015) from Candidatus Roizmanbacteria bacterium CG_4_9_14_0_2_um_filter_39_13 included two Cas12a proteins. We refer to the one near the CRISPR locus as Cas12a, and the other as Cas12a' (Figure 1A). Phylogenetic analyses of Cas12a and Cas12a' proteins (previously reported and identified in this study) indicated those in CPR genomes could be assigned into at least three groups (Supplementary Figure S4A). Group 1 includes the Cas12a proteins from the two genomes with both Cas12a and Cas12a' , and is highly divergent from other Cas12a proteins. Group 2 includes the Cas12a' of Candidatus Roizmanbacteria bacterium CG_4_9_14_0_2_um_filter_39_13, along with the Cas12a proteins from another two genomes. Group 3 includes Cas12a' of Candidatus Roizmanbacteria bacterium GW2011_GWA2_37_7 (Zetsche et al., 2015) and clusters together with Cas12a from non-CPR Bacteria and Archaea.
The RuvC domains (-I, -II, and -III) of the CPR Cas12 and Cas12' group 2 and 3 proteins include all the conserved catalytic residues in Supplementary Figure S4B. However, in group 1 proteins, the conserved RuvC-II glutamic acid catalytic residue "E" was substituted by asparagine "N, " and in RuvC-III asparagine "N" was substituted to valine "V." These substitutions suggest that the Cas12a in the systems with both Cas12a and Cas12a' may not perform cleavage as documented previously (Zetsche et al., 2015).

Roizmanbacteria-Infecting Phage From Podoviridae and Siphoviridae
A total of 1,118 spacers perfectly targeted (100% match and 100% alignment coverage; see section "Materials and Methods") 565 unique scaffolds. Of these, 156 were targeted by two or more spacers (153 from the QZM samples of the current study) (Supplementary Table S5). Eleven of the CRISPR spacertargeted scaffolds encode a phage capsid protein, which was used as a marker for phylogenetic analyses (Figure 3). Five additional scaffolds encoding a similar capsid protein were identified by a BLAST search. Capsid proteins were also predicted from the Absconditabacteria (SR1) phage (8 out of 17 with capsid genes identified) and included in the phylogenetic analyses. The (pro)phage identified in this study as well as the Absconditabacteria phage were assigned to either the Podoviridae clade or the Siphoviridae clade (Figure 3). The complete Saccharibacteria phage that lacks an identifiable capsid protein (Dudek et al., 2017) is most closely related to Siphoviridae phage based on comparison of its terminase with annotated sequences in the NCBI database.
One scaffold (QZM_B3_scaffold_44) from a C2-Gp3 Roizmanbacteria was targeted by multiple spacers. Detailed analyses indicate that this region is a prophage, with a length of approximately 27 kbp (Figure 2C), and is among the first prophage reported in CPR bacterial genomes. This prophage is predicted to encode 40 protein coding genes, including a phage integrase, terminase, prohead protein, major tail protein, tail tape measure protein, tail fiber protein and lysozyme. Nineteen of the ORFs were targeted by 41 CRISPR spacers from CasY-based systems, all of which were from Roizmanbacteria (Figures 1, 4B). BLAST comparison detected highly similar scaffolds in the other three genomes of the C2-Gp3 group (Table 1 and Figure 2) and also unbinned scaffolds in QZM_A2_1, QZM_A2_3, and QZM_A3, suggesting that this is a common Roizmanbacteria prophage. However, when reads of other QZM-related samples were mapped to QZM_B3_scaffold_44, the prophage region showed much higher coverage in QZM_B1 and QZM_B4 than the flanking region (Supplementary Figure S5). Further, a subset of reads that circularize the phage genome were detected. These observations indicate that the prophage existed as phage particles in these two samples.
One putative phage scaffold (Supplementary Table S5) could be circularized, and circularization of the genome was confirmed by paired-end read mapping. The length of complete phage genome QZM_A2_Phage_33_19 is 31,813 bp, with a GC content of 32.9% (Figure 4A). Another two scaffolds (Supplementary Table S5) were manually curated to generate another complete phage genome QZM_B3_Phage_33_79, with a length of 30,824 bp and GC content of 32.5% (Figure 4B). Phage QZM_A2_Phage_33_19 and QZM_B3_Phage_33_79 share high sequence similarity, and they are probably closely related strains.
A total of 53 and 52 open reading frames (ORFs) were predicted from QZM_A2_Phage_33_19 and QZM_B3_Phage_33_79, respectively (Figures 4A,B and Supplementary Figure S6). Of these, 46 shared an average amino acid identity of 98%. ORFs common to both phage encode capsid, terminase, lysozyme and tail proteins. Although these two genomes are highly similar, 7 and 6 FIGURE 3 | Phylogeny of capsid proteins used for taxonomic assignment of Roizmanbacteria-infecting phage (or prophage) in this study. The Roizmanbacteria-infecting phage and prophage are shown in red, and those with spacer targets are indicated by triangles. Squares indicate phage determined to be similar based on their capsid protein sequences. The previously reported Absconditabacteria (SR1) phage are included for comparison. . The total number of spacers that target each gene is listed in parentheses following the protein annotation. The spacers targeting the phage genome from a given CRISPR-Cas system are indicated by bars on the dotted inner rings (see Figure 1 for CRISPR-Cas system type). Bars are colored by genome of origin (see top right). The non-shared proteins between these two phage genomes are indicated by green circles and numbered, their annotations are shown at the right. Hyp, hypothetical protein.
(C) The coverage information of these two phage genomes in QZM-related samples.
non-shared ORFs were detected in QZM_A2_Phage_33_19 and QZM_B3_Phage_33_79, respectively. Among those 13 nonshared ORFs, 4 are related to phage replication (Figures 4A,B), including one replication protein in QZM_A2_ Phage_33_19, and one transcriptional regulator and two replication proteins in QZM_B3_Phage_33_79. We used the divergent region between the two genomes to calculate the coverage of the phage in all QZM-related samples and found that they co-occur in most samples ( Figure 4C).
A total of 63 spacers targeted 26 ORFs in QZM_A2_Phage_33_19, and 52 spacers targeted 22 ORFs in QZM_B3_Phage_33_79 (Figures 4A,B), but no spacers targeted the intergenic regions of the two phage genomes. The majority of spacers with targets (49 and 39, respectively) were from the CRISPR-CasY systems in QZM Roizmanbacteria genomes, and all the other targeting spacers were from the Type I-B and III-A systems of C2-Gp7. No spacer from QZM_B4_Woesebacteria_36_36 (78 unique spacers) and the other 10 type III-A systems targeted the two complete phage genomes.
For phylogenetic analyses, we searched the NCBI database for capsid proteins similar to those in the genomes reported here (Figure 3) and identified a scaffold containing a similar capsid ORF that was binned into a Roizmanbacteria genome [Candidatus Roizmanbacteria bacterium RIFOXYA1_FULL_41_12; ] (Supplementary Figure S2). Comparative analyses showed a close relationship between the sequences of this prophage and the two complete phage mentioned above, including homologies for the capsid and two terminase proteins. In addition, these genes and several other hypothetical proteins share gene arrangements (Supplementary Figure S6). Thus, we conclude the two phage genomes reported are the full sequences for lysogenic (temperate) phage found in Roizmanbacteria genomes.

An Unusual CRISPR-CasY System With a Fragmented CasY Effector and Self-Targeting Spacers
Among the candidate CasY sequences from the Tibet hot springs predicted protein dataset were three adjacent partial proteins on a scaffold from sample GD2_1. In combination, the three ORFs appear to comprise a fragmented CasY protein (defined as "fCasY"). We identified Cas1 and a CRISPR locus adjacent to the fCasY (Figure 5A). Read mapping to the scaffold revealed that the CasY was fragmented by two mutations. One involves deletion of A (from "AAAAA" to "AAAA") and introduces a TAA stop codon five amino acids downstream. This mutation occurred in all the mapped reads, indicating that all the cells have CasY fragmented at this position. The second mutation is a single nucleotide substitution from "C" to "T, " which introduces a TAA stop codon. This mutation was detected in 82% of the mapped reads. Interestingly, however, the three conserved motifs (RuvC-I, -II, and -III) are preserved in the largest protein fragment and all the catalytic residues are shared with functional CasY proteins (Figures 1B, 5A). We identified the ribosome binding site (RBS) for fragments 1 and 2 as TAA, the same RBS associated with 353 of 946 ORFs of this Roizmanbacteria genome. The longest fragment is predicted to have an RBS of AAT, which was only shared by 55 ORFs.
The fCasY locus includes 22 unique spacers, six of which were detected only once in the mapped reads ( Figure 5B). We reconstructed the CRISPR locus ( Figure 5B) and found that all of the single copy spacers are at the locus end that is closest to the Cas1 protein. As in prior studies, we infer that these were recently added to the diversifying end of the CRISPR locus in a subset of cells. Interestingly, 12 out of the 22 unique spacers target the scaffolds of the C2-Gp5 genome, which encodes the fCasY system ( Figure 5C and Supplementary Table S6). In detail, 11 spacers targeted Roizmanbacteria genes, including those encoding a PINc domain ribonuclease, two permeases, a sigma-70 RNA polymerase and three hypothetical proteins with transmembrane domains. Only one spacer matched an intergenic region, which is next to two tRNAs (His and Thr). This spacer was recently acquired, as it is encoded on three reads that also sampled part of the leader sequence ( Figure 5B and Supplementary Table S6). Several of the self-targeting spacers are located in the old end of the locus ( Figure 5B) and occurred in majority of the cells in the population. Thus, we infer that Roizmanbacteria with these selftargeting spacers have survived for a substantial period of time.
In addition to the fCasY locus, we identified type III-A and I-B CRISPR-Cas systems in the C2-Gp5 genome. Notably, one spacer from the type III-A and I-B systems and two fCasY spacers target a complete 34,706 bp phage genome GD2_3_Phage_34_19 (Supplementary Figure S7 and Supplementary Table S6) assigned to Podoviridae. A Cas4-like protein was detected in this phage genome (Supplementary Figure S7). As phage with Cas4-like proteins can induce their hosts to acquire self-targeting spacers (Hooton and Connerton, 2014), the presence of this protein may explain acquisition of self-targeting spacers by the C2-Gp5 genome.
Spacers from the loci of C2-Gp5 target other putative phage scaffolds (Supplementary Table S5). For example, one fCasY spacer targets GD2_3_scaffold_2486, which encodes a putative phage gene. Spacers from both fCasY and I-B systems target GD2_2_scaffold_18083, which encodes a phage tail tape measure protein. Two spacers from the type I-B system target GD2_3_scaffold_517, which encodes a capsid protein that is distantly related to that in the prophage of C2-Gp3 (Figure 3).

PAMs 5 -TA and 5 -TG Are Shared by CasY and fCasY Systems
The PAM is used for the acquisition of spacers into the CRISPR array and is important for target recognition and cleavage (Hille et al., 2018). We determined the probable PAM of the CasY systems reported here to target the two complete phage genomes (QZM_A2_Phage_33_19 and QZM_B3_Phage_33_79). Among all the 39 unique target locations on these two phage genomes (88 spacers in total), 20 had a potential 5 TA PAM and 14 had a potential 5 TG PAM (Supplementary Figure S8 and Supplementary Table S6). Moreover, the one spacer in the CRISPR-CasY system of the C1-Gp1 genome that targets GD2_3_Phage_34_19 also has a 5 TA PAM (Supplementary Figure S7). Previously, the PAM determined for the CasY.1 of Candidatus Katanobacteria using an in vitro approach was a 5 TA, and both 5 TA (dominant) and 5 TG PAMs occur, based on in vivo data . For the fCasY, we checked to see if the self-targeting spacers have the same PAM as that of other CasY proteins. If this was not the case, the genomic region matching the spacer may not be recognized as a target by the fCasY CRISPR system. Among the 12 self-targeting spacers, 7 have 5 TA and 4 have 5 TG PAMs and one has a possible 5 AT PAM (Supplementary Table S6). Among the 5 fCasY spacers targets on phage scaffolds, two have 5 TA PAMs and two have 5 TG PAMs.
In combination the results indicate that both general CasY proteins and fCasY in this study use the 5 TA/TG PAM sequences for spacer acquisition and protospacer recognition. We identified a few targets with other PAM sequences (Supplementary Table S6), but it is possible that these targets have mutated the PAM sites during their evolutionary history, as previously documented (Paez-Espino et al., 2015).

Potential Phage-Host Genetic Interactions
When examining the genomic context of CRISPR-CasY systems we noted four very short genes located next to the CRISPR array in the C2-Gp7 genome (Supplementary Figure S9). All four genes had at least one homolog in the three complete phage and one prophage (BLASTp e-value thresholds = 1e-5) and when two or more homologs were identified in the same genome, they were together. However, homologs were not identified in the other newly reconstructed and previously reported Roizmanbacteria genomes (Supplementary Table S4). The four genes in the C2-Gp7 genome and phage and prophage shared >83% (up to 99%) nucleotide identity with >80% alignment coverage, but none had a NCBI blast hit with similarity >38% (>50 alignment coverage). Given this, and the deduction that QZM_A2_Phage_33_19 and QZM_B3_Phage_33_79 infect C2-Gp7 Roizmanbacteria (based on CRISPR spacer targeting), we conclude that there may have been lateral transfer of novel proteins related to phage-host interactions between Roizmanbacteria and their phage.

DISCUSSION
Candidate Phyla Radiation bacteria account for a huge amount of diversity within the Bacterial domain, but the mechanisms of their interactions with phage and the phage that infect them have remained largely undocumented. In part, this is due to scant information about their CRISPR-Cas systems, despite extensive genomic sampling from a wide variety of sites in nature (Burstein et al., 2016Dudek et al., 2017;. In this study, we report an unexpected diversity of CRISPR-Cas systems in the genomes of bacteria from the CPR phylum of Roizmanbacteria, both from newly reconstructed sequences from multiple hot spring sediments of Tibet, China (Supplementary Table S1) and some previously published genomes. Most of them are CasY-based systems ( Figure 1A and Table 1). These new sequences constrain more and less highly conserved regions of CasY proteins, information that may be important in future efforts directed at tailoring the properties of genome-editing enzymes.
The finding that some of the Roizmanbacteria genomes encode multiple CRISPR-Cas systems, including the relatively large types I-B and III-A, is unexpected, given the overall paucity of systems in CPR bacteria, and their small genome sizes ( Figure 1B). We infer that these systems are mostly active, given the identification of targets on potential phage scaffolds and evidence for locus diversification. Considering that majority of the spacers with targets on the three complete phage and one prophage were from CRISPR-CasY systems (Figures 2C, 4A,B), it seems that CasY is the primary CRISPR-Cas system used by these bacteria for phage defense. In the case of the Roizmanbacteria with only a degenerate Type III-A CRISPR-Cas system, defense may rely upon a restriction-modification system, as suggested previously for CPR bacteria that lack any CRISPR-Cas system (Burstein et al., 2016) (Figure 2). In support of this correlation, restriction-modification systems were not detected in those Roizmanbacteria with seemingly functional CRISPR-Cas systems ( Figure 1A). The discovery of two copies Cas12a proteins in a single system of two genomes is an additional case of unexpected investment in CRISPR-Cas-based phage defense by CPR bacteria (Figure 1A and Supplementary Figure S2). Overall, the genomes of Roizmanbacteria contained three of the six types of CRISPR-Cas systems reported so far (i.e., types I, III, and V), expanding our understanding of the investment of CPR bacteria in CRISPR-Cas-based defense.
The availability of a pool of CRISPR spacers enabled discovery of three Roizmanbacteria-infecting phage for which complete genomes were reconstructed, and one prophage (Figures 2-4 and Supplementary Figure S7). These are the first reported phage infecting members of the Microgenomates superphylum of the CPR. All of these phage, along with the previously reported CPR phage, were assigned to Podoviridae and Siphoviridae of the Caudovirales order (Figure 3). The phylogenetic relatedness and genetic similarity among the Podoviridae phage obtained in this study and a Roizmanbacteria prophage deposited at NCBI (Figure 3 and Supplementary Figure S6), and also the potential phage-host genetic interactions (Supplementary Figure S9), may indicate stable and similar host-phage relationships in a variety of habitats.
An interesting aspect of the CRISPR-CasY analyses was the fCasY system in one Roizmanbacteria that includes a locus with self-targeting spacers. It may be significant that a Cas4like protein is encoded in the genome of a phage that replicates in this Roizmanbacteria, given that a Cas4-like protein in a Campylobacter sp. phage was suggested to facilitate acquisition of self-targeting spacers into the CRISPR-Cas system of its host (Hooton and Connerton, 2014). Roizmanbacteria lack the RecBCD mediated double-stranded DNA break repair complex, the only documented mechanism for avoidance of self-targeting spacer acquisition (Levy et al., 2015). Thus, it is plausible that the phage-encoded Cas4-like protein led to acquisition of the self-targeting spacers, which should result in autoimmunity (Stern et al., 2010).
Autoimmunity can be avoided via loss of cas genes, mutated repeats adjacent to self-targeting spacers, extended basepairing with the upstream flanking repeat, and the absence of a PAM in the chromosomal region matched by the spacer (Stern et al., 2010), none of which were observed here. Autoimmunity also could be countered via loss of cas gene function. Interestingly, the fCasY harbored conserved RuvC domains and catalytic residues found in intact CasY proteins (Figures 1B, 5). However, given the relatively high abundance of Roizmanbacteria with fCasY in the community (1.37%), we infer that the fCasY protein fragmentation led to loss of cleavage function, preventing autoimmunity. It is possible that the region of the fCasY protein responsible for binding to the target sequence is encoded on a different gene fragment than that encoding the nuclease domain, so that the CRISPR RNA does not recruit the protein fragment with nuclease function.
The presence of old end CRISPR locus spacers that target the host chromosome suggests that the fCasY has been present in the genomes of the Roizmanbacteria C2-Gp5 population for some time. Why has this gene, or the entire locus, not been lost? It is possible that the spacers of the fCasY locus retain some function, for example in gene regulation (possibly involving binding of CRISPR RNAs to the DNA during transcription). Experiments will be required to determine whether fragments of fCasY can reassemble and bind to the genomic regions targeted by the self-targeting spacers (without cleavage) and to determine if the spacer-directed binding domain is on fragment 1 or 2 ( Figure 5A).

CONCLUSION
CRISPR-Cas systems are unexpectedly common in a subset of CPR bacteria, and the number, variety and potential functional diversity of these systems is greater than expected. It is already established that CRISPR-CasY systems from these intriguing and enigmatic bacteria will have biotechnological value. Lessons from natural system studies such as reported here may provide information about CasY sequence variety and function that may be useful in enzyme engineering. Beyond this, the new information about CPR bacteria, their phage and the mechanisms of their interactions expands our understanding of the complex phenomena that shape the structure and functioning of natural microbial communities.

DATA AVAILABILITY
The datasets generated for this study can be found in NCBI, PRJNA493250.

AUTHOR CONTRIBUTIONS
L-XC and JB designed the study. W-JL supported the metagenomic sequencing. L-XC performed the metagenomic assembly, HMM search, and scaffold extension and curation. L-XC and JB performed genome binning and curation. L-XC and JB conducted data analyses with input from BA-S, RM, and JD. L-XC and JB wrote the manuscript. All authors read and approved the final manuscript.