Identification and Analysis of Long Repeats of Proteins at the Domain Level

Amino acid repeats play an important role in the structure and function of proteins. Analysis of long repeats in protein sequences helps to clarify their abundance, structure and function in the protein universe. In the present study, amino acid repeats of length >50 (long repeats) were identified in a non-redundant set of UniProt sequences using the RADAR program. The underlying structures and functions of these long repeats were analyzed using Gene3D for structural domains, Pfam for functional domains, and the enzyme and non-enzyme functional classification for the catalytic and binding functions of the proteins. From a structural perspective, these long repeats occur predominantly in certain architectures such as sandwich, bundle, barrel, and roll, and within these architectures they are abundant in the superfolds. The lengths of the repeats within each fold are not uniform, giving rise to different structures for different functions. We also observed that long repeats lie in the domain regions of the family and are involved in the function of the proteins. After grouping by enzyme and non-enzyme classes, we observed that long repeats are abundant in proteins with specific catalytic and binding functions. This study thus examines long repeats in the protein sequence universe, beyond the well-characterized short tandem repeats, together with their structures and functions at the domain level. The results suggest that long repeats may play an important role in the structure and function of protein domains.


INTRODUCTION
Amino acid repeats are ubiquitous in protein sequences and often correspond to structural and functional units of proteins. The length of these repeats varies considerably, from homo repeats of a single amino acid (Jorda and Kajava, 2010), oligopeptide repeats of 2-20 residues (Fraser and MacRae, 1973) and solenoid repeats of 20-40 residues to larger repetitions of length >50 called domain repeats (Andrade et al., 2001). These repeats occur as a single pair or as multiple copies in a tandem/non-tandem manner and are useful for structural packing or for one or more interactions with ligands (Katti et al., 2000; Luo and Nijveen, 2014). It has been observed that many proteins of length >500 contain internal repeats, suggesting the importance of repeats in producing larger proteins (Marcotte et al., 1998). These repeats possess weak sequence identities due to extensive divergence, but retain similar folds and functions (Holm and Sander, 1993). It has also been found that long stretches of perfect repetitions are infrequent in protein sequences even though they fold into recurrent structural motifs (Turjanski et al., 2016). Many methods and algorithms, such as Fourier transformation, short string extension, sequence-sequence alignment, and sequence profile comparison, have been introduced to identify such diverged sequence repeats, including insertions and deletions, without prior knowledge. Web-based servers such as the Internal Repeat Finder, RADAR, REPRO, TRUST, XSTREAM, HHRepID, T-REKS, and PTRStalker (Pellegrini et al., 1999, 2012; George and Heringa, 2000; Heger and Holm, 2000; Szklarczyk and Heringa, 2004; Newman and Cooper, 2007; Biegert and Söding, 2008; Jorda and Kajava, 2009) implement these techniques to detect amino acid repeats in proteins.
Analysis of larger proteins has demonstrated that significant portions of proteins are composed of domains, the conserved parts of proteins that can fold and function independently. The folded domains can either serve as modules for building up large assemblies or provide specific catalytic or binding functions. It has been found that repeats of length >50 residues often correspond to conserved regions that are present in proteins as single or multiple copies and contribute to protein function (Hemalatha et al., 2007). Our analysis of sequence repeats in proteins with known 3D structures in the PDB (Berman et al., 2014) has shown that the repeats retain similar folds in spite of divergence, conserving the structure and function of the proteins, and that repeats located in one or two domains from the same family contain conserved motifs required for function (Mary Rajathei and Selvaraj, 2013). Further, the conservation of inter-residue interactions in domain repeats has been analyzed in terms of long-range contacts, surrounding hydrophobicity and pair-wise interaction energy (Mary et al., 2015). A database, IR-PDB, of repeats in the sequences of proteins in the PDB has been developed for analyzing the impact of repeats in proteins (Selvaraj and Rajathei, 2017).
The widely used sequence database UniProtKB (UniProt Consortium T, 2017) contains more than 500,000 sequences, with annotations of well-characterized repeats such as tetratricopeptide, leucine-rich, ankyrin, and armadillo/HEAT repeats. However, there has been no survey of repeats of length >50 in the UniProt sequences, which may provide insights into their role in the structure, function and evolution of proteins. In the present study, we have analyzed the occurrence of long repeats and their underlying structures and functions in a non-redundant set of UniProt sequences. Since repeats exceeding 50 residues are large enough to fold independently into stable domains (Kajava, 2012), we used Gene3D for structural domains, Pfam for functional domains, and enzyme and non-enzyme annotations for specific catalytic and binding functions to characterize the structure and function of long-repeat proteins. Long repeats were found in about 23% of the considered proteins. Analysis of their structure reveals that these repeats are predominantly observed in the sandwich, bundle, barrel and roll architectures. We observed that the repeats lie in the domains responsible for protein function. Further, long repeats occur in both enzyme and non-enzyme proteins. While long repeats are found in all the major enzyme classes, they are most abundant among ligases and isomerases. Among the non-enzyme proteins, namely DNA binding, metal binding, calcium binding, and nucleotide binding (NP) proteins, these repeats are observed more often in nucleotide binding and DNA binding proteins. The present analysis describes the occurrence of long repeats and their structures and functions at the domain level.

Data Collection
A collection of 555,100 proteins along with their assigned UniProt ID, amino acid sequence, protein name, protein family, enzyme function, and non-enzyme functions such as DNA binding, calcium binding, metal binding, and NP binding, as well as other annotations of the sequences from the Pfam, Gene3D, PDB, and DisProt databases, was downloaded from UniProtKB/Swiss-Prot (UniProt Consortium T, 2017) and stored in a file. Pfam is a database of protein domain families that assigns the domains, as well as their functional regions (Finn et al., 2014). Gene3D (Lewis et al., 2018) is a database that assigns the structure of a protein according to the CATH hierarchy of class, architecture and fold, expressed as numerical values (Dawson et al., 2017). At the class level (C), the numerical value 1 denotes the all-alpha class, 2 the all-beta class, and 3 a mixture of alpha and beta. Likewise, numerical values are assigned at the Architecture level (A) based on the arrangement of secondary structures in 3D space, and at the Topology/Fold level (T) based on the connectivity of the secondary structural elements. The PDB IDs of proteins with known 3D structures were obtained from the PDB database (http://www.rcsb.org/pdb/home/home.do). The intrinsically disordered regions of proteins, curated from the literature, are available in the DisProt database (Piovesan et al., 2017). A non-redundant representative set of 126,945 sequences that share <50% sequence identity was obtained by clustering the 555,100 sequences using the CD-HIT web server (Fu et al., 2012). The overall workflow is summarized as a flowchart (Figure 1).

Finding Sequence Repeats of the Proteins Using RADAR
The presence of internal repeats in each protein sequence was identified using the repeat detection program RADAR (Heger and Holm, 2000), which was downloaded from https://sourceforge.net/projects/repeatradar. RADAR is efficient for ab initio detection of repeats of length >15 in a single sequence by aligning the sequence against itself and by generating a sequence profile from the multiple sequence alignment of the repeat units. RADAR evaluates the statistical significance of the observed repeats by computing a Z-score for each repeat unit (McLachlan, 1983; Heringa and Argos, 1993). The Z-score of a repeat unit is the number of standard deviations by which the score of the repeat unit lies above the mean. The score of each unit is determined from a profile derived from the multiple alignment of the repeat units, without considering end-gaps. RADAR reports repeats with Z-scores >6. An in-house Perl program that calls the RADAR executable was written to detect internal repeats in all sequences of the dataset in a single run. Proteins containing repeats of length >50 were considered for further analysis.
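The Z-score definition and the two selection thresholds can be illustrated with a minimal Python sketch. It is not the RADAR implementation: the background scores, the use of the sample standard deviation, and the function names `z_score` and `is_long_significant` are assumptions made only for illustration.

```python
from statistics import mean, stdev

def z_score(unit_score, background_scores):
    """Standard deviations by which unit_score lies above the background mean.

    Assumption: the sample standard deviation is used; the background
    distribution used by RADAR itself is not specified here.
    """
    mu = mean(background_scores)
    sigma = stdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (unit_score - mu) / sigma

def is_long_significant(unit_length, z, min_length=50, min_z=6.0):
    """Apply the two cut-offs from the text: repeat length >50 and Z-score >6."""
    return unit_length > min_length and z > min_z

# Hypothetical background scores, for illustration only.
background = [10.0, 12.0, 11.0, 9.0, 13.0]
z = z_score(40.0, background)
print(round(z, 2), is_long_significant(62, z))
```

Both cut-offs are strict inequalities, so a repeat unit of exactly 50 residues is not counted as a long repeat in this sketch.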

Finding the Structure of Long Repeats Proteins
The UniProt IDs of proteins having long repeats were extracted, together with their Gene3D structural domain assignments, using a Perl program. The names of the class, architecture and fold of each protein were then obtained through a CATH search, and the proteins were grouped by these names for further analysis of the architectures and folds of proteins with repeats.

Finding the Functional Domains of Long Repeats Proteins
The UniProt IDs of long repeat proteins were extracted and the Pfam domains assigned to the sequences were identified. The domain regions and the functional residues of the proteins were obtained by searching the Pfam database (Finn et al., 2014), and repeats in the domain regions were identified by manual inspection. The similarity of the repeats within a protein and within a protein family was measured as % sequence identity using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) implemented in the ggsearch36 program of the FASTA-36.3.5b package (Pearson, 2000). Needleman-Wunsch alignment scores were calculated using the BLOSUM50 scoring matrix (Henikoff and Henikoff, 1992) with a penalty of −12 for gap opening and −2 for gap extension. Further, the functional involvement of the repeats in domains was analyzed at the structural level using the PDBsum server with the PDB ID as input (Laskowski et al., 2018).
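Once two repeat units have been globally aligned, the % sequence identity can be computed from the gapped alignment. The following Python sketch shows one way to do this. It does not reproduce ggsearch36: the aligned strings are made-up toy sequences, and the choice of denominator (all alignment columns, including those with a gap on one side) is an assumption.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns in which both residues are identical.

    Assumption: the denominator counts every column except those that are
    gaps in both sequences; other definitions of % identity exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # an all-gap column carries no information
        columns += 1
        if x == y:  # both are residues here, since double gaps were skipped
            identical += 1
    if columns == 0:
        raise ValueError("alignment has no informative columns")
    return 100.0 * identical / columns

# Toy aligned strings, for illustration only (not real repeat units).
identity = percent_identity("ACDEFG-HIK", "ACDQFGWHIK")
print(identity)
```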

Finding the Enzyme and Non-enzyme Functions of Long Repeats Proteins
The assigned enzyme numbers (EC) of long-repeat proteins were extracted. The first level of the EC number corresponds to one of seven enzyme classes: Oxidoreductases (EC 1), Transferases (EC 2), Hydrolases (EC 3), Lyases (EC 4), Isomerases (EC 5), Ligases (EC 6), and Translocases (EC 7). The enzymes were grouped according to these numbers for further analysis. The non-enzyme proteins annotated with DNA binding, calcium binding, metal binding and NP binding were also extracted and grouped according to these annotations.
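The grouping by first-level EC number can be sketched in a few lines of Python. This is an illustrative sketch rather than the in-house program: the protein identifiers are hypothetical, the EC numbers are sample inputs, and the function name `ec_class` is an assumption.

```python
from collections import Counter

# First-level EC numbers and the seven enzyme classes named in the text.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}

def ec_class(ec_number):
    """Return the enzyme class name for an EC number such as 'EC 6.1.1.7'."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()  # drop an optional 'EC' prefix
    first_level = int(text.split(".")[0])
    return EC_CLASSES[first_level]

# Hypothetical protein identifiers mapped to sample EC numbers.
annotations = {
    "protein_A": "EC 6.1.1.7",
    "protein_B": "5.3.1.1",
    "protein_C": "3.4.21.111",
    "protein_D": "6.3.2.1",
}
class_counts = Counter(ec_class(ec) for ec in annotations.values())
print(class_counts.most_common())
```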

Abundance of Proteins Having Long Repeats
Amino acid repeats of length >15 were found in 85,726 (67%) of the 126,945 non-redundant UniProt protein sequences (Supplementary Data File 1). Long repeats were found in 29,768 of these proteins (35% of the repeat-containing proteins, or about 23% of the full set). These repeats are present as a single pair or as multiple copies in a tandem/non-tandem manner. For example, N-acetylmuramoyl-L-alanine amidase Rv3717 protein ( 466-4,590/4,593-5,634). From the length distribution of long repeats and the distribution of repeat numbers against the number of proteins (Figure 2), we observed that repeat lengths <200 occur in more than 90% of the proteins, with an average of 100 residues, and that the units are repeated 2-5 times, with repeat numbers of 2 (61%) and 3 (26%) in most long-repeat proteins. The Z-score values of the repeats were extracted, and 74,089 out of 74,154 repeat units have Z-scores >6. Among these, 66,400 repeat units have Z-scores >20. This suggests that most of the observed repeats are statistically significant.
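The proportions quoted in this section depend on which set is taken as the denominator. The short Python sketch below recomputes them from the counts reported above; the variable and function names are chosen here only for illustration.

```python
# Counts reported in this section.
TOTAL_NONREDUNDANT = 126_945   # non-redundant UniProt sequences
WITH_REPEATS_GT15 = 85_726     # proteins with repeats of length >15
WITH_LONG_REPEATS = 29_768     # proteins with repeats of length >50
REPEAT_UNITS = 74_154          # long repeat units with a Z-score
UNITS_Z_GT6 = 74_089
UNITS_Z_GT20 = 66_400

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

repeat_share = percent(WITH_REPEATS_GT15, TOTAL_NONREDUNDANT)
long_share_of_repeats = percent(WITH_LONG_REPEATS, WITH_REPEATS_GT15)
long_share_of_all = percent(WITH_LONG_REPEATS, TOTAL_NONREDUNDANT)
z6_share = percent(UNITS_Z_GT6, REPEAT_UNITS)
z20_share = percent(UNITS_Z_GT20, REPEAT_UNITS)

for label, value in [
    ("repeats >15 / all proteins", repeat_share),
    ("long repeats / repeat-containing proteins", long_share_of_repeats),
    ("long repeats / all proteins", long_share_of_all),
    ("units with Z > 6", z6_share),
    ("units with Z > 20", z20_share),
]:
    print(f"{label}: {value:.1f}%")
```

The 35% figure is therefore relative to the repeat-containing proteins, whereas the 23% figure is relative to the whole non-redundant set.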

Analysis of the Structure of Long Repeats Proteins
The structural class, architecture and fold of 14,176 proteins (48%) were obtained from the structural domain-based Gene3D assignments. Among these, some proteins have two or more Gene3D assignments. In this study, the 10,504 proteins with a single Gene3D assignment were considered for further analysis (Supplementary Data File 2). For example, the Annexin A1 (P04083) protein has a single Gene3D assignment of 1.10.220, which means that it belongs to the alpha class (1), with orthogonal bundle architecture (10) and the Annexin V domain fold (220).
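Selecting proteins with a single Gene3D assignment and splitting the CATH code into its levels can be sketched as follows. This Python sketch is not the Perl program used in the study: the second identifier and its codes are hypothetical, and the names `CathCode`, `parse_cath` and `single_assignment` are assumptions.

```python
from typing import NamedTuple

class CathCode(NamedTuple):
    cath_class: int      # C level: 1 alpha, 2 beta, 3 alpha/beta
    architecture: int    # A level
    topology: int        # T level (fold)

CLASS_NAMES = {1: "alpha", 2: "beta", 3: "alpha/beta"}

def parse_cath(code):
    """Split a code such as '1.10.220' into its C, A and T levels."""
    parts = code.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected at least three levels in {code!r}")
    c, a, t = (int(p) for p in parts[:3])  # any further levels are ignored
    return CathCode(c, a, t)

def single_assignment(assignments):
    """Keep only proteins that carry exactly one Gene3D assignment."""
    return {pid: codes[0] for pid, codes in assignments.items() if len(codes) == 1}

# P04083 (Annexin A1) is the example from the text; the second entry is hypothetical.
assignments = {
    "P04083": ["1.10.220"],
    "hypothetical_protein_X": ["3.40.50", "2.60.40"],
}
selected = single_assignment(assignments)
annexin = parse_cath(selected["P04083"])
print(CLASS_NAMES[annexin.cath_class], annexin.architecture, annexin.topology)
```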

Analysis of Long Repeats at the Architecture Level
Following the CATH domain-based hierarchy (http://www.cathdb.info/browse/tree), the presence of long repeats in the different architectures of the alpha (α), beta (β), and alpha/beta (α/β) classes was examined (Figure 3). Of the five architectures of the α class, repeats were observed in four, namely orthogonal bundle, up-down bundle, α horseshoe and α/α barrel. Among these, substantial numbers were present in the bundle and horseshoe architectures. In the β class, repeats were present in 13 out of 20 architectures, most often in sandwich, propeller, roll and barrel. Likewise, repeats were found in 10 out of 14 architectures of the α/β class, most often in 3-layer (αβα) sandwich, 2-layer sandwich, α/β barrel and αβ-complex. Combining the architectures from the different classes shows that repeats occur preferentially in the sandwich, bundle, barrel and roll architectures compared with the others.

Analysis of Long Repeats at the Fold Level
The occurrence of repeats in the different folds of the sandwich, bundle, barrel and roll architectures was then examined. In the α class, repeats were observed in 84 out of 287 folds of the orthogonal bundle and 32 out of 101 folds of the up-down bundle. In the β class, 12 out of 43 folds of the β sandwich, 18 out of 48 folds of the β barrel, and 13 out of 40 folds of the β roll architecture contained repeats. In the α/β class, repeats were observed in 47 out of 126 folds of the 3-layer (αβα) sandwich, 57 out of 224 folds of the 2-layer sandwich, 6 out of 18 folds of the α/β barrel and 16 out of 58 folds of the α/β roll. Among these, some folds were observed in a greater number of proteins than others (Figure 4). In the α class, the Arc Repressor Mutant Subunit A fold and the four Helix Bundle fold of the bundle architectures were observed in most of the proteins (Table 1). In the β class, the Immunoglobulin-like fold and Jelly Roll fold of the β-sandwich, the PH-domain fold of the β-roll, and the OB fold of the β-barrel were observed most. In the α/β class, the Rossmann fold of the 3-layer (αβα) sandwich, the TIM Barrel of the αβ barrel, Herpes Virus-1 followed by the Alpha-Beta plaits fold of the 2-layer sandwich, and Ubiquitin-like (UB roll) of the αβ roll were observed most. The results reveal that long repeats occur predominantly in folds that accommodate diverse protein structures.

Analysis of Long Repeats for Structural Repeats
The long repeats in proteins with known 3D structure (as available from UniProt annotation) were analyzed for structural repeats. The proteins with tandem repeats were identified and analyzed at the structural level. We observed that long tandem repeats form structural repeats in the up-down and orthogonal bundle folds of the α class, the Immunoglobulin, Jelly Roll and OB folds of the β class, and the Rossmann fold, TIM barrel, α/β plait, and UB roll of the α/β class. Figure 5 shows the structural repeats of proteins in these folds. Further, we found that the lengths of the repeats are not uniform and vary considerably within each fold. Figure 6 shows the considerable variation in length, as well as in secondary structure, among proteins possessing the Rossmann fold, which usually contains a βαβαβ secondary structure arrangement. The Desulfovibrio vulgaris CbiK(P) cobaltochelatase (PDB ID: 2XVY) contains two repeats with a βαβαβαβ secondary structure arrangement of length 103 (Figure 6A) (Malay et al., 2009), whereas Thermoplasma volcanium phosphoribosyl pyrophosphate synthetase (PDB ID: 3MBI) contains two repeats of βαβαβαβα of length 121 (Figure 6B) (Cherney et al., 2011). These results suggest that length variation of repeats within the Rossmann fold is associated with additional α-helices, β-strands, and coil regions. Thus, long repeats of different lengths account for structural differences within a fold.

Analysis of Long Repeats for Intrinsic Disordered Region
Intrinsically disordered regions (IDRs) were found for 51 (<1%) of the long-repeat proteins using the DisProt database. When the predisposition of long repeats for IDRs was analyzed, most of the repeats were located in structured regions. However, we also identified long repeats in IDRs. For example, the Nucleoporin NUP1 (P20676) protein of length 1,076 contains tandem repeats of length 62 in the region (352-399/403-462/522-564/666-728/731-778/779-840/849-906/907-972/978-1,031), which has been identified as an IDR (300-1,078). This analysis suggests that long repeats are structured in most proteins, while a few of them lie in IDRs.
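Whether the repeat units fall inside an annotated region can be checked by parsing the repeat coordinates and testing containment. The Python sketch below uses the NUP1 coordinates quoted above; the function names `parse_regions` and `within` are assumptions, and the same check could be applied to a domain region in place of an IDR.

```python
def parse_regions(text):
    """Parse 'start-end/start-end/...' into (start, end) tuples.

    Thousands separators such as '1,031' are removed before parsing.
    """
    regions = []
    for part in text.replace(",", "").split("/"):
        start, end = part.split("-")
        regions.append((int(start), int(end)))
    return regions

def within(region, container):
    """True if region lies entirely inside container (inclusive bounds)."""
    return container[0] <= region[0] and region[1] <= container[1]

# Repeat units and IDR of Nucleoporin NUP1 (P20676) as quoted in the text.
nup1_repeats = parse_regions(
    "352-399/403-462/522-564/666-728/731-778/779-840/849-906/907-972/978-1,031"
)
nup1_idr = (300, 1078)

all_in_idr = all(within(unit, nup1_idr) for unit in nup1_repeats)
print(len(nup1_repeats), all_in_idr)
```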

Analysis of Functions of Long Repeats at the Domain Level
Pfam domain assignments were found for 26,750 (90%) of the proteins, indicating that the repeats occur in proteins belonging to functional domain families. Grouping by protein family showed repeats in 5,258 distinct protein families. Some protein families have long repeats in a large number of their member proteins (Supplementary Data File 3). Figure 7 lists 36 protein families, such as Class II aminoacyl-tRNA synthetase, Ser/Thr Protein kinase, Class I aminoacyl-tRNA synthetase, Cytochrome P450, Mitochondrial carrier (TC 2.A.29), G-protein coupled receptor 1, and ABC transporter, that have repeats in more than 40 member proteins. We observed long repeats of varying lengths in the domains of these families. For example, the Peptidase S8 family contains long repeats in 41 member proteins (Table 2). Among these, the repeats of 38 proteins lie in the Peptidase S8 domain, with varying repeat lengths. Figure 8 shows the repeat regions of some of these proteins, as well as their alignment covering the Peptidase S8 domain regions. The similarity between the repeats in the Peptidase S8 domain, both within a protein and among the member proteins of the Peptidase S8 family, was computed in terms of % sequence identity. For example, a sequence identity of 29% was observed for the repeats (157-215/228-313) within the Peptidase domain (157-401) of the Aqualysin-1 protein (P08594) (Table 2). Further, the sequence similarities of the repeat unit (157-215) of this protein with the repeat units in the Peptidase S8 domain of the 37 other member proteins were also computed. We observed that 65% of the repeats were in the range of 20-40% sequence identity and the remaining repeats were in the range of 10-20% identity. This observation suggests that the repeats within a protein, as well as within a protein family, are considerably diverged.
FIGURE 7 | List of the 36 protein families that have long repeats in more than 40 member proteins.
Further, repeats in the domains are involved in function through functional residues (highlighted in red). For example, the regions (162-173) and (197-207) of the repeats (157-215/228-313) of Aqualysin-1 (UniProt ID P08594) contain the functional residues VYVIDTGIRTTH and HGTHVAGTIGG characteristic of serine proteases (Figure 8). The functional involvement of the repeats was also examined in the structures of the proteins using PDBsum. For example, the functionally involved residues (highlighted in red) of the repeats (157-215/228-313) in the structure of Aqualysin-1 (PDB ID 4DZT) were identified using a PDBsum search (Figure 9). This suggests that these repeats occur in the domains of the family and contribute to the function of the proteins.

Analysis of Enzyme and Non-enzyme Functions of Long Repeats
Further, enzyme functions were found for 13,333 proteins and non-enzyme functions for 2,437 proteins, giving a total of 15,770 (53%) of the long-repeat proteins. Of the 13,333 enzymes having long repeats, Ligases (35.91%) form the largest group, followed by Isomerases (28.98%), Translocases (11.81%), Transferases (11.12%), Lyases (5.12%), Hydrolases (4.74%), and Oxidoreductases (2.32%). Among the 2,437 non-enzyme proteins, NP binding proteins (48.09%) form the largest group, followed by DNA binding (30.44%), metal binding (16.94%), and calcium binding (4.51%). These observations suggest the importance of long repeats in both the catalytic and binding functions of proteins, apart from serving as modules of large assemblies.

DISCUSSION
Our survey of long repeats in a non-redundant set of UniProt sequences has highlighted the occurrence of these repeats, which play an important role in the structure and function of protein domains. Previous studies have focused on the structural and functional implications of proteins with homo repeats (Uthayakumar et al., 2012), fibrous repeats (Parry, 2005) and different well-characterized repeats of length 5-50 (Andrade et al., 2001). Therefore, an in-depth study of long repeats in UniProt sequences was carried out for a better understanding of the correspondence of repeat sequences with their structures and functions. In this study, we used the RADAR program for internal repeat detection, since it often detects both tandem and interspersed repeats of larger size. Our earlier studies of repeats (Mary Rajathei and Selvaraj, 2013; Mary et al., 2015) have shown the ability of RADAR to detect repeats of length >50 that are structurally similar and conserved in a 3D structure environment. Further, the sensitivity and accuracy of RADAR repeats, by comparison with Pfam, indicate good coverage, accurate alignments, and reasonable repeat borders (Heger and Holm, 2000). The identified repeats range in length from 50 to 1,759 residues and have diverged through insertions and deletions, but the Z-scores calculated by RADAR show their statistical significance. From a structural perspective, long repeats tend to occur abundantly in the sandwich, barrel, bundle, and roll architectures. Within these architectures, they are predominantly observed in the superfolds: the up-down and orthogonal bundle of the α class, the Immunoglobulin, Jelly Roll and OB folds of the β class, and the Rossmann fold, TIM barrel, α/β plait, and UB roll of the α/β class.
The adoption of classic super-secondary elements (αα, βαβ, ββ) and the repetitive duplication of a small stable unit may be possible reasons for the abundance of larger duplications in these folds (Thornton et al., 1999). For example, the (βα)8 repeat in the TIM barrel evolved through repetitive duplication of a small stable (βα) unit (Lang et al., 2000). It has been observed that repeats in the folds may fulfill the physical demands (stable and fast-folding conformation) on the protein chain during evolution, in order to meet cellular function (Lupas et al., 2001). Further, it has been shown that the structural symmetries present in the superfolds (6 out of 10) may also require larger duplications during protein evolution (Brych et al., 2003). Kim et al. (2010), through their SymD (detecting symmetry in protein structures) method, have identified 33 folds that contain 10 or more symmetric domains. There is considerable overlap between the symmetric folds they identified and those observed in the present work (Figure 5). We observed that long repeats of different lengths within a fold account for the structural differences of proteins with different functions. Further, the analysis of the predisposition of long repeats for disordered regions has shown that long-repeat proteins are mostly structured and form stable folds. In contrast, short tandem repeats have been observed to be highly disordered and do not adopt a single defined configuration for a specific function (Tompa, 2012; Habchi et al., 2014; van der Lee et al., 2014).
Further, repeats have been analyzed within specific domains of a family, since protein function can be inferred through the domain (Rentzsch and Orengo, 2013). We found that repeats in the domain regions of the family are involved in function through functional residues. Earlier, we analyzed repeats in individual PDB proteins and found that repeats located in one or two domains from the same family, as well as repeats lying outside the domains, are involved in the function of the proteins (Mary Rajathei and Selvaraj, 2013). We observed that the lengths of repeats in the domains of a family are not uniform. Further, the sequence identity of the repeats within a protein and within the Peptidase S8 domain family is low, which may be a consequence of their divergence over time. Earlier, it was observed that repeat proteins are indeed repetitive in their families, exhibiting abundant stretches of short perfect repetitions (Turjanski et al., 2016). The varying lengths of the repeats in the structures of a fold, as well as in the functional domains of a family, suggest that long repeats are considerably diverged and may not overlap. However, further studies are needed to understand the conservation of long repeats in the structure and function of proteins.
Further, we observed long repeats in all seven enzyme classes, with the greatest abundance in ligases and isomerases. Among the non-enzyme proteins, long repeats are observed in DNA binding, calcium binding, metal binding and NP binding proteins, with NP binding and DNA binding accounting for the largest numbers of proteins. However, further studies are needed to understand why certain enzyme and non-enzyme classes contain more long repeats than others. This shows that long repeats not only serve as modules of large assemblies but also contribute to the catalytic or binding functions of the proteins.
Regarding the evolution of the well-characterized short tandem repeats in many evolutionary lineages, it has been postulated that repeat-containing proteins are cheaper to evolve than sequences arising de novo, as the repeat units are thermodynamically stable (Andrade et al., 2001; Andersson et al., 2015). Through our analysis, we observed long repeats in stable folds serving different functions, which suggests that long repeats may play a role in the evolution of proteins with stable folds and novel functions.

CONCLUSIONS
The present large-scale study has focused on the presence of long repeats in a non-redundant set of the entire annotated UniProtKB/Swiss-Prot database and reveals that long repeats are found in 23% of the proteins. Regarding their three-dimensional structures, they are found in certain structural folds that have arisen through repetitive duplication of small stable units. Further, long repeats of different lengths within each fold are observed in different protein structures. From a functional perspective, these repeats are found in proteins with both enzyme and non-enzyme functions. Hence, long repeats may have a role in the evolution of proteins with stable folds and novel functions.

DATA AVAILABILITY STATEMENT
The UniProt annotated sequence files and the RADAR output files were analyzed for this study. Major results are available as Supplementary Material.

AUTHOR CONTRIBUTIONS
DR developed the computer programs in Perl for this study and drafted the manuscript. SP supported the analysis and computation of the data. SS conceived the idea and helped in the preparation of the manuscript.