Tandem Repeats in Proteins: Prediction Algorithms and Biological Role

Pellegrini, Marco

doi:10.3389/fbioe.2015.00143

REVIEW article

Front. Bioeng. Biotechnol., 24 September 2015

Sec. Computational Genomics

Volume 3 - 2015 | https://doi.org/10.3389/fbioe.2015.00143

Tandem Repeats in Proteins: Prediction Algorithms and Biological Role

MP
Marco Pellegrini ^*

Laboratory for Integrative Systems Medicine (LISM), Istituto di Informatica e Telematica, and Istituto di Fisiologia Clinica, Consiglio Nazionale delle Ricerche, Pisa, Italy

Article metrics

View details

Citations

9,8k

Views

2,5k

Downloads

Abstract

Tandem repetitions in protein sequence and structure is a fascinating subject of research which has been a focus of study since the late 1990s. In this survey, we give an overview on the multi-faceted aspects of research on protein tandem repeats (PTR for short), including prediction algorithms, databases, early classification efforts, mechanisms of PTR formation and evolution, and synthetic PTR design. We also touch on the rather open issue of the relationship between PTR and flexibility (or disorder) in proteins. Detection of PTR either from protein sequence or structure data is challenging due to inherent high (biological) signal-to-noise ratio that is a key feature of this problem. As early in silico analytic tools have been key enablers for starting this field of study, we expect that current and future algorithmic and statistical breakthroughs will have a high impact on the investigations of the biological role of PTR.

Introduction

A seminal paper (Andrade et al., 2001) reports the observation that repetitive subsequences that appear in tandem repetitions (TR) within the protein primary sequence often form integrated assemblies when these residues are mapped to their corresponding three-dimensional folded conformation. These TR confer multiple binding opportunities and may play a structural role by giving rigidity to a protein, and by exposing functional domains. Moreover, Andrade et al. (2001) remark that tandem repeated structures should not be assimilated to the traditional notions of domains and motifs that may appear singly or in multiple interspersed copies in each protein (while they can be repeated across families of protein), since they constitute a rather distinct class. They also remark that repeats in protein sequences are usually hard to detect because on average the repeating unit is relatively short, and moreover there can be considerable sequence divergence among units of the same TR. We will refer throughout this article to these repetitive sub-sequences as Protein Tandem Repeats (PTR or Protein-TR, for short).

A study by Marcotte et al. (1998) indicates that internal subsequence repetitions in protein primary structure are quite widespread. They have been detected in about 14% of all the then known proteins, with eukaryotic proteins being three times more as likely to have internal repeats than prokaryotic ones. More recent measurements in (Pellegrini et al., 2012) give a count of about 25% of the proteins in the Uniprot database (Apweiler et al., 2004) holding a PTR of length at least 20 aa.

A recent survey of some algorithmic aspects of PTR detection in protein sequences is in Luo and Nijveen (2014). In this survey, we will touch lightly on the multi-faceted aspects of PTR research, including prediction algorithms, databases, early classification efforts, mechanisms of PTR formation and evolution, and synthetic PTR design. We also touch on the rather open issue of the relationship between PTR and flexibility (or disorder) in proteins.

Protein-TR Detection Algorithms Based on Sequence

Structural and functional properties of Protein-TR are often preserved also in presence of high divergence among the subsequences corresponding to the PTR units, both at the level of DNA coding sequence and at the level of AA sequence. This property makes automatic PTR detection a challenging task, and a variety of approaches have been implemented since the late 1990s. More recently, a tendency to integrating basic sequence data with evolutionary or biochemical annotations has emerged. Table 1 reports the list of sequence-based algorithms.

Table 1

Name	Type	Year	Reference	Notes
INTREP	Alg	1999	Pellegrini et al. (1999)	http://nihserver.mbi.ucla.edu/Repeats/
prospero	Alg	1999	Mott (1999)	http://www.well.ox.ac.uk/rmott/ARIADNE/prospero.shtml
REP	Alg	2000	Andrade et al. (2000)	http://www.bork.embl.de/~andrade/papers/rep/search.html
RADAR	Alg	2000	Heger and Holm (2000)	https://github.com/AndreasHeger/radar/
REPRO	Alg	2000	George and Heringa (2000)	http://www.ibi.vu.nl/programs/reprowww/
TRUST	Alg	2004	Szklarczyk and Heringa (2004)	http://www.ibi.vu.nl/programs/trustwww/
REPPER	Alg	2005	Gruber et al. (2005)	http://toolkit.tuebingen.mpg.de/repper/
HHrep	Alg	2006	Soding et al. (2006)	http://toolkit.tuebingen.mpg.de/hhrep
TRED	Alg	2006	Sokol et al. (2007)	Available upon request
XSTREAM	Alg	2007	Newman and Cooper (2007)	http://jimcooperlab.mcdb.ucsb.edu/xstream/
HHRepID	Alg	2008	Biegert and Soding (2008)	http://toolkit.tuebingen.mpg.de/hhrepid/
ARD2	Alg	2009	Palidwor et al. (2009)	http://cbdm.mdc-berlin.de/~ard2/
T-REKS	Alg	2009	Jorda and Kajava (2009)	http://bioinfo.montp.cnrs.fr/
REPETITA	Alg	2009	Marsella et al. (2009)	http://protein.bio.unipd.it/repetita/
PTRStalker	Alg	2012	Pellegrini et al. (2012)	http://bioalgo.iit.cnr.it/
TRDistiller	Alg	2014	Richard and Kajava (2014)	Available upon request

Synthetic table of resources for PTR studies: sequence-based algorithms.

Interestingly, early algorithms by Marcotte et al. (1998), Pellegrini et al. (1999), and Andrade et al. (2000) were instrumental to the first PTR classification efforts, while more recent tools have been aimed at providing web-server-based utilities, or at populating databases.

REP in Andrade et al. (2000) is one of the first PTR detection algorithms which uses a homology-based method to identify statistically significant protein repeats.

Other early methods developed for finding TRs in proteins are based on detecting sub-optimal alignments in the self-alignment matrix generated by the Smith-Waterman algorithm (or similar methods). Some methods developed along this line are Internal Repeat Finder (Marcotte et al., 1998; Pellegrini et al., 1999), prospero (Mott, 1999), RADAR (Heger and Holm, 2000), REPRO (Heringa and Argos, 1993; George and Heringa, 2000), and TRUST (Szklarczyk and Heringa, 2004). These methods often detect both tandem and interspersed repeats.

XSTREAM (Newman and Cooper, 2007) uses a seed expansion approach, while Jorda and Kajava (2009) proposed T-REKS, which uses a clustering approach based on k-means.

The systems HHrep (Soding et al., 2006) and HHRepID (Biegert and Soding, 2008) are instead based on building and matching Hidden Markov Models for the repeating substrings to be sought (not necessarily tandem).

Some approaches based on neural networks aim at detecting particular repetitive structures. For example, Palidwor et al. (2009) developed a classification technique for detecting alpha-rods repeats, a specific important repetitive structure [see also Rubinson and Eichman (2012)].

For the class of protein solenoid repeats, REPETITA, by Marsella et al. (2009), uses several AA biochemical properties (including polarity, secondary structure, molecular volume, electric charge, and codon diversity) and a discrete fourier transform approach to detect self-similarities.

Pellegrini et al. (2012) propose the notion of fuzzy TR (FTR) for proteins, which is based on using a normalized BLOSUM-weighted edit distance between AA sub-strings and in assuming that in a FTR, even if the constitutive unit elements may be pairwise at high divergence, there exists an “origin” string, not necessarily still part of the protein in exam, that is at a relatively small divergence from any of its unit elements. Here, the notion of high/low divergence is relative to the divergence between random AA strings under the chosen weighted edit distance. An exhaustive search of FTRs in long proteins is computationally demanding, since the bare definition leads to an NP-hard problem. Thus, an efficient heuristic is used in PTRStalker to guess the candidate “origin” strings.

Gruber et al. (2005) propose REPPER a meta searching approach that combines the output of different algorithms. A web-based meta-search server that allows to run and compare easily several tools on the same input is also described in Schaper et al. (2015).

Shapper et al. (Schaper et al., 2012; Anisimova et al., 2015) propose a statistical method based on phylogenetic fingerprints and ML-estimation that, in conjunction with one or more standard predictors, is able to filter out predicted TR that are more likely to be false-positive.

As screening large portions of protein sequence DB looking for TR patterns is time consuming, Richard and Kajava (2014) propose a pre-screening tool (TRDistiller) whose purpose is to quickly filter out proteins that almost surely do not contain a TR, while retaining for further analysis the proteins carrying a TR with high probability.

As the list of possible tools to choose from becomes longer, there is an emerging need for guidance on which tool is most suitable for a given task. Unfortunately, at the best of my knowledge, no such comprehensive comparative study has been attempted yet. More limited comparative tests can be found in Pellegrini et al. (2012) where five methods (RADAR, TRUST, T-REKS, XSTREAM, and PTRStalker) are compared in their ability to detect very long PTRs (≥ 4000 AA), with XSTREAM and PTRStalker emerging as the best choice for this task. A second test is aimed at detecting dimeric proteins by five tools (RADAR, TRUST, HHRep, HHRepID, and PTRStalker), with PTRStalker, TRUST, and HHRepID being able to successfully uncover such dimeric structures in some of the tested proteins. In Jorda and Kajava (2009), four methods (T-REKS, XSTREAM, Internal Repeat Finder, and TRED) are compared by the number of sequences they could identify as holding a PTR longer than 14 AA in the SWISSPROT database, with T-REKS giving the highest number (almost doubling the closest competitor). In Marsella et al. (2009), three methods (REPETITA, TRUST, and RADAR) are compared to assess their ability in guessing the correct periodicity in solenoid repeats, with REPETITA having an edge over the other two methods.

Protein-TR Detection Algorithms Based on Structure

Functional features are more readily linked to the structural features of a protein rather than to their primary sequence, thus available structural data should also be used to detect protein 3d symmetries and repetitive 3d motifs (Goodsell and Olson, 2000). However, only for a fraction of the known protein sequences, the corresponding 3D conformation could be determined, therefore the range of applicability of structure-based methods is limited w.r.t. the range of the sequence-based methods.

In this case, the algorithmic challenge lies in the multidimensional nature of the data, and on the fact that the space of rigid transformations (rotations, translations) as well as the inherent flexibility of proteins must be taken into account when attempting to match 3d substructures in order to detect the PTR periodicity. Table 2 reports the list of structure-based algorithms.

Table 2

Name	Type	Year	Reference	Notes
DAVROS	Alg	2004	Murray et al. (2004)	http://www.ebi.ac.uk/~murray/davros/
OPAAS	Alg	2004	Shih and Hwang (2004)	http://www.ibms.sinica.edu.tw/
Swelfe	Alg	2008	Abraham et al. (2008)	http://wwwabi.snv.jussieu.fr/public/Swelfe/
RQA	Alg	2009	Chen et al. (2009)
Gplus	Alg	2009	Guerler et al. (2009)	http://agknapp.chemie.fu-berlin.de/gplus/
SymD	Alg	2010	Kim et al. (2010)	http://symd.nci.nih.gov/
ProSTRIP	Alg	2010	Sabarinathan et al. (2010)	http://cluster.physics.iisc.ernet.in/prostrip/
RAPHAEL	Alg	2012	Walsh et al. (2012b)	http://protein.bio.unipd.it/raphael/
Frustratometer	Alg	2013	Parra et al. (2013)	http://www.proteinphysiologylab.tk/
ConSole	Alg	2014	Hrabe and Godzik (2014)	http://console.sanfordburnham.org/
PRIGSA	Alg	2014	Chakrabarty and Parekh (2014)	http://bioinf.iiit.ac.in/PRIGSA/

Synthetic table of resources for PTR studies: structure-based algorithms.

In Murray et al. (2002), both the sequence and the structure signals are integrated within a continuous wavelet transform approach to detect repeating motifs. In particular, the sequence is represented by values of the Kyte–Doolittle hydrophobicity scale, while structure is characterized via the relative accessible surface area. This approach has been shown to be successful on most of the well known types of repetitive motifs.

DAVROS (Murray et al., 2004) is a PTR prediction system that builds upon a structural alignment program (SAP) that evaluates internal structural symmetries via a protein self-similarity matrix and employs a Fourier Transform approach to identify strong signals over the noisy background.

Swelfe (Abraham et al., 2008) finds internal repeats by combining three abstraction levels. Swelfe quickly identifies statistically significant internal repeats in DNA sequence, in the amino acid sequence and in the 3D structures using dynamic programing. The associated web server also shows the relationships between repeating feature at each level and facilitates visualization of the results.

SymD (Kim et al., 2010) is an algorithm that aims at detecting internal spatial symmetries of proteins. It uses the alignment method in Kim et al. (2009) on pairs of structure formed by the target protein and its shifted versions built by all circular permutations of its residues. Although not all PTR give rise to symmetric 3D structures, many do, therefore this approach often indicates the presence of a PTR. Other methods based on this symmetry detection approach are RQA (Chen et al., 2009), OPAAS (Shih and Hwang, 2004), and Gplus (Guerler et al., 2009).

ProSTRIP (Sabarinathan et al., 2010) uses dynamic programing to find similar structural repeats in a protein structure encoded by the protein backbone dihedral angles.

RAPHAEL (Walsh et al., 2012b) is a more recent method for the detection of solenoids in protein structures. It aims at mimicking the periodicity and distance patterns detection criteria a human curator is likely to exploit when assessing the presence of a solenoid visually. In particular, the candidate protein is subject to a random rotation and translation, and subsequently for each of the three C-alpha coordinates a projection is performed. This operation produces a profile curve, in which the distance between consecutive local maxima is a candidate periodicity value. By averaging over multiple random rotations and translations, a robust period estimation is attained. Additional simple rules allow to further detect non-periodic residues interspersed in the solenoid periodic structure.

Parra et al. (2013) use the structural alignment tool TopMatch (Sippl, 2008) to search exhaustively the space of possible sub-structures that tile a large fraction of a given structure, and thus can represent a bona fide structural repetitive element of the input protein.

PRIGSA (Chakrabarty and Parekh, 2014) represents distance information among residues in an adjacency matrix, and it is based on the observation that similar sub-structures can be recognized as unique profiles of the principal eigenspectra of this matrix.

ConSole (Hrabe and Godzik, 2014) aims at detecting solenoid domains having as input structural information, by searching repetitive patterns in a contact matrix, which, for every pair of residues i,j in a protein, encodes a value 1 if the two residues have at least a pair of heavy atoms at Euclidean distance below a threshold t (set at t = 4.5 Å). Ad hoc rules are further applied in order to handle insertions in the solenoid repetitive patterns.

As in the case of sequence-based methods, very few comparative studies among the proposed structure-based tools have been done. In Kim et al. (2010), six methods (DAVROS, OPAAS, Swelfe, RQA, Gplus, and SymD) are compared in their ability to identify characteristic symmetries in fold families from CATH, SCOP, and ASTRAL databases, with SymD having an overall better performance. In Sabarinathan et al. (2010), two methods (ProSTRIP and Swelfe) are compared over well known families of repeat proteins, for the task of detecting periodicity and exact repeat positions. On well known PTR proteins, both methods detect approximatively the correct period, however, ProSTRIP detects more repeating units. On the harder class of multidomain proteins ProSTRIP is also better at guessing the correct periodicity. In Walsh et al. (2012b) five methods (both sequence and structure based) are compared (namely Swelfe, RAPHAEL, REPETITA, TRUST, and RADAR) in their ability to guess the PTR periodicity, with RAPHAEL giving better predictions, when we allow for a slackness of 5 AA in the predicted value. For exact predictions, RAPHAEL, REPETITA, and TRUST are about equivalent.

Databases for Protein-TR

Information about PTR can be retrieved as annotations in general purpose integrated protein databases. However, such annotations often cover only the well studied PTR, therefore in recent years a number of special purpose repositories have been assembled with the objective of making large scale PTR analysis easier. We list here in Table 3 only DBs that are available on-line at the present time, as many older published articles refer to DBs no longer available.

Table 3

Name	Type	Year	Reference	Notes
RepSeq	DB	2007	Depledge et al. (2007)	http://www.repseq.org/
PRDB	DB	2012	Jorda et al. (2012)	http://bioinfo.montp.cnrs.fr/
ProRepeat	DB	2012	Luo et al. (2012)	http://prorepeat. bioinformatics.nl/
PTRStalkerDB	DB	2012	Pellegrini et al. (2012)	http://bioalgo.iit.cnr.it/
RepeatsDB	DB	2013	Di Domenico et al. (2013)	http://repeatsdb.bio.unipd.it/

Synthetic table of resources for PTR studies: databases.

RepSeq (Depledge et al., 2007) is a specialized DB for PTR in lower eukaryotic pathogens.

PRDB is a PTR database that supports queries on protein tandem repeats found in sequence data bases. Currently, it holds about 1.25M PTR extracted from the Swissprot, PDB, and NR databases in early 2010 using the T-REKS detection tool (Jorda et al., 2012). This database has been instrumental for uncovering original biological correlations in Jorda et al. (2010).

PTRStalkerDB lists the PTR found with the PTRStalker method on the SwissProt database release 57.15 of March 2, 2010 that contains 515,203 sequence entries.

ProRepeat (Luo et al., 2012) is a curated and integrated data base and analysis platform for research on the biological features of amino acid tandem repeats. ProRepeat collects PTR of protein sequences listed in the UniProt knowledge base from different species; moreover, it includes 85 completely sequenced eukaryotic proteomes from the RefSeq collection. The latest datasets used in ProRepeat are UniProtKB Release May 2011 and RefSeq Release 40.

RepeatsDB (Di Domenico et al., 2013) is a database of annotated tandem repeat protein structures that uses both a state of the art detection method (RAPHAEL) and manual curation to survey the protein structures listed in PDB. The latest version 2.0.0 (beta) released in 2015 holds 10,039 PTR structures (including manually classified and predicted PTR). Automated updates every 3 months are planned.

Although progress in the area of databases for PTR has come about in the past few years, there is also much scope for improvement, in particular, as the amount of proteomic data increases rapidly, it is important to maintain the PTR databases aligned with the latest releases of the reference protein sequence and structure. Also, given the variety of algorithms and approaches to PTR prediction, DB that uses one single algorithm as source of data could suffer for the specific algorithm’s biases, and more robust prediction could be obtained instead by using multiple detecting algorithms.

Classification of Protein-TR

Kajava (2012) reports an extensive survey of bioinformatic tools to support various analysis of TR in proteins, including tools for identification of TR in proteins, databases reporting PTR (either exclusively, or as an annotation in a larger protein DB), classification of repetitive 3D structures, and tools for structural prediction targeting proteins with PTR (as opposed to globular ones).

Early surveys by Marcotte et al. (1998), Andrade et al. (2001), and Kajava (2001) are very much concerned with the task of identifying specific classes of proteins highly characterized by their PTR content with the aim of finding corresponding structural and functional regularities. Andrade et al. (2001) propose a taxonomy of six main classes (β-propellers, β-trefoils, TPR-like, Ankyrin-like, Armadillo/HEAT-like and Leucine-Rich). Instead Kajava (2001) uses a classification based on the repeating unit length (1–2 residues = class I crystalline aggregates, 3–4 residues = class II fibrous proteins, 5–40 residues = class III solenoid-like proteins, and class IV beads-on-string proteins with repeats longer than 30 residues folded into globular domains). Later in Kajava (2012), a refinement of this classification by splitting class III into two sub-classes of solenoid and non-solenoid structures has been proposed. The database RepeatsDB (Di Domenico et al., 2013) uses the classification proposed by Kajava (2012).

Mechanisms of Protein-TR Expansion during Evolution

Björklund et al. (2006) and Moore et al. (2008) analyze the internal sequence similarity in proteins of several species and note that the domain repeats are often expanded through simultaneous duplications of several domains in one event, while the duplication of one domain at a time is a less common event. Moreover, many of the repeats appear to have been duplicated in the middle of the repeat region. This behavior is in contrast to the evolution of other proteins that mainly happens through additions of single domains at either terminus of the protein. No common mechanism for the expansion of all repeats could be detected in this study, for example, duplication patterns show no dependence on the size of the domains. Repeat expansion in some families can possibly be explained by shuffling of exons but exon shuffling does not appear to be a general formation mechanism.

Some domain families show distinct specific duplication patterns, for example, nebulin domains have mainly been expanded with groups of seven domains at a time, while duplications of other domain families involve varying numbers of domains for each event. A more detailed analysis of nebulin domains evolution is in Björklund et al. (2010).

By mapping the Protein TR back onto their coding DNA sequences, Street et al. (2006) study the conservation of intron/exon patterns across several species and show evidence that subdivide the repeat protein genes into two classes. The first class has random-length exons that are likely produced by accumulating introns though random insertion within the array of repetitive units. The second class is composed exclusively of exons corresponding to the multiple of the repeating unit, and thus is likely to be formed by local duplications of intron/exon modules.

Protein-TR Evolutionary Conservation

In Schaper et al. (2014), it is described a proteome-wide analysis of the evolution of TR in human proteins, using a database of 61 eukaryotes. The main finding is that the vast majority of human PTR are ancient, with TR unit number and order preserved intact since remote speciation events. Moreover, no human PTR shows evidence of a recent duplication or deletion event. Thus, presumably, most PTRs fold into stable and conserved structures, indispensable for their function. Similar findings for plants are shown in Schaper and Anisimova (2015). The analysis of PTR in Drosophila melanogaster reported in Ponting et al. (2001) led to the identification of novel PTR in the products of disease-related human genes homologous to those in Drosophila melanogaster.

Protein-TR in Protein Design

Different structures which arise from tandem arrays of a repeated structural motif have generated significant interest with respect to protein engineering and synthetic protein design (Forrer et al., 2003, 2004; Main et al., 2003, 2005; Javadi and Itzhaki, 2013). Several results are reported in these articles about re-engineering of PTR binding specificities, with attention to protein folding kinetics and protein stability.

Sawyer et al. (2013) present a “module-based” design approach in which modules composed of tandem repeats are aligned to identify repeat-specific features that will be important to include in future repeat protein design templates.

Parmeggiani et al. (2015) describe a general database-driven approach for reliable generation of synthetic stable modular repeat proteins. Concomitant to the distillation of general design principles for PTR engineering, research activities have been also concentrated toward specific classes of Protein-TR which have shown a more promising potential for applications (Stumpp et al., 2015). A notable example is that of Designed Ankyrin Repeat Proteins (DARPins) (Binz2003) that have been extensively studied [see a recent survey by Plückthun (2015) and references therein], since they provide a biochemically stable scaffold for designing protein variants able to recognize targets with affinity and specificity that are equal or possibly superior to that of antibodies. Similar promising studies focus also on armadillo repeat proteins (Reichen et al., 2014) and leucine-rich-repeat proteins (Park et al., 2015).

Order, Disorder, and Protein-TR

While our view of protein functions is often linked to the presence of a well defined 3-dimensional protein conformations, it has been recognized (Tompa, 2002) that many important protein functions are also linked to proteins (or regions within a protein) that lack a folded structure, but display a highly flexible random-coil-like conformation under physiological conditions [named intrinsically unstructured proteins (IUP) or intrinsically unstructured regions (IUR)].

The concept of order and disorder in protein segments (Dunker et al., 2001; Tompa, 2002) has been often investigated in correlation with the presence or absence of PTR at the sequence level. For example in Tompa (2002), 21 IUP are examined, and further 21 cases are cited in Dunker et al. (2001). It is noticed that IUR often correspond to regions of low compositional complexity (low sequence entropy) and sometimes to repetitive sub-sequences in fibrillar proteins. Tompa and Fersht (2009) discuss in detail the cases of PTR in PEVK regions of human Titin, in prion proteins and in the CTD domain of RNA polymerase. These findings on specific instances are, however, hard to generalize.

A general property observed by Jorda et al. (2010) is that higher level of repeat perfection correlates positively with the disordered state of protein sub-chains.

The emergence of IUP/IUR prediction tools, such as IUpred (Dosztányi et al., 2005), ESpritz (Walsh et al., 2012a), and DISOPRED (Jones and Cozzetto, 2015), to name a few, and comprehensive databases of IUP/IUR, such as DisProt (Sickmeier et al., 2007) and MobiDB 2.0 (Potenza et al., 2015), can be quite useful for finding generalizable connections between PTR and ordered/disordered states of protein regions.

Correlation of Protein-TR with Other Protein Properties

In Turutina et al. (2006), the sequences of the Swiss-Prot protein families are analyzed in order to detect family-specific latent periodicity fingerprints induced by PTR, using the method in Korotkov et al. (2003), and 94 such protein families are reported as well-characterized by such fingerprints.

A complete analysis of PDB sequences using RADAR is reported in Rajathei and Selvaraj (2013), where a good correlation among PTR, structural similarity, and functionally involved residues is highlighted.

In Mularoni et al. (2007) and Mularoni et al. (2010), the function and evolution of a particular class of PTR formed by repetitions of a single AA are investigated (homo-TR). These two studies concentrated on human and mouse homo-TR of length four. The protein stabilizing properties of homo-TR are also reported in Katti et al. (2000). A more general statistical analysis of homo-PTR in human proteins is in Jorda and Kajava (2010).

Conclusion

The present survey on Protein-TR touches several aspects of this research fields, including detection algorithms (Sections “Protein-TR Detection Algorithms Based on Sequence” and “Protein-TR Detection Algorithms Based on Structure”), databases (Section “Databases for Protein-TR”), classification (Section “Classification of Protein-TR”), the relationship between PTR and biologically relevant concepts (Sections “Mechanisms of Protein-TR Expansion During Evolution,” “Protein-TR Evolutionary Conservation,” “Order, Disorder, and Protein-TR,” and “Correlation of Protein-TR with Other Protein Properties”), and it highlights also recent progress in the design of synthetic PTR (Section “Protein-TR in Protein Design”).

Although there has been steady progress in the last 15 years in devising new prediction tools, both sequence and structure based, very little comparative or integrative work has been done. Most of the proteome-wise studies use only one tool to define and detect PTR and draw conclusions on PTR distributions and statistics. Though this approach was completely justified in the pioneering times (late 1990s and early 2000s), it is necessary now to refine these methodologies and make full use of the wealth of algorithms and approaches devised in the last decade. A more robust assessment of the distribution and annotations of PTR over the entire proteome could be attained by applying and merging the outcomes of multiple tools. In this context, the manually curated databases of PTRs can provide the necessary validation benchmarks.

From the point of view of the design of prediction tools, one open challenge is to devise sequence-based tools that are able to come close to the performance of structure-based tools. Thus providing higher quality PTR predictions for a larger pool of sequenced proteins.

Funding

The present work is partially supported by the Flagship project InterOmics (PB. P05), funded by the Italian Ministry for Instruction University and Research (MIUR) and CNR organizations, and by the joint IIT-IFC Laboratory of Integrative Systems Medicine (LISM).

Statements

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1
AbrahamA.-L.RochaE. P. C.PothierJ. (2008). Swelfe: a detector of internal repeats in sequences and structures. Bioinformatics24, 1536–1537.10.1093/bioinformatics/btn234
2
AndradeM. A.Perez-IratxetaC.PontingC. P. (2001). Protein repeats: structures, functions, and evolution. J. Struct. Biol.134, 117–131.10.1006/jsbi.2001.4392
3
AndradeM. A.PontingC. P.GibsonT. J.BorkP. (2000). Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol.298, 521–537.10.1006/jmbi.2000.3684
4
AnisimovaM.PečerskaJ.SchaperE. (2015). Statistical approaches to detecting and analyzing tandem repeats in genomic sequences. Front. Bioeng. Biotechnol.3:31.10.3389/fbioe.2015.00031
5
ApweilerR.BairochA.WuC. H.BarkerW. C.BoeckmannB.FerroS.et al (2004). Uniprot: the universal protein knowledgebase. Nucleic Acids Res.32(Suppl. 1), D115–D119.10.1093/nar/gkh131
6
BiegertA.SodingJ. (2008). De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics24, 807–814.10.1093/bioinformatics/btn039
7
BjörklundA. K.LightS.SagitR.ElofssonA. (2010). Nebulin: a study of protein repeat evolution. J. Mol. Biol.402, 38–51.10.1016/j.jmb.2010.07.011
8
BjörklundÅK.EkmanD.ElofssonA. (2006). Expansion of protein domain repeats. PLoS Comput. Biol.2:e114.10.1371/journal.pcbi.0020114
9
ChakrabartyB.ParekhN. (2014). Prigsa: protein repeat identification by graph spectral analysis. J. Bioinform. Comput. Biol.12, 1442009.10.1142/S0219720014420098
10
ChenH.HuangY.XiaoY. (2009). A simple method of identifying symmetric substructures of proteins. Comput. Biol. Chem.33, 100–107.10.1016/j.compbiolchem.2008.07.026
11
DepledgeD. P.LowerR. P.SmithD. F. (2007). Repseq – a database of amino acid repeats present in lower eukaryotic pathogens. BMC Bioinformatics8:122.10.1186/1471-2105-8-122
12
Di DomenicoT.PotenzaE.WalshI.Gonzalo ParraR.GiolloM.MinerviniG.et al (2013). Repeatsdb: a database of tandem repeat protein structures. Nucleic Acids Res.42, D352–D357.10.1093/nar/gkt1175
13
DosztányiZ.CsizmokV.TompaP.SimonI. (2005). Iupred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics21, 3433–3434.10.1093/bioinformatics/bti541
14
DunkerA.LawsonJ.BrownC. J.WilliamsR. M.RomeroP.OhJ. S.et al (2001). Intrinsically disordered protein. J. Mol. Graph. Model.19, 26–59.10.1016/S1093-3263(00)00138-8
- CrossRef
- Google Scholar
15
ForrerP.BinzH. K.StumppM. T.PlückthunA. (2004). Consensus design of repeat proteins. Chembiochem5, 183–189.10.1002/cbic.200300762
16
ForrerP.StumppM. T.BinzH.PlückthunA. (2003). A novel strategy to design binding molecules harnessing the modular nature of repeat proteins. FEBS Lett.539, 2–6.10.1016/S0014-5793(03)00177-7
17
GeorgeR.HeringaJ. (2000). The repro server: finding protein internal sequence repeats through the web. Trends Biochem. Sci.25, 515–517.10.1016/S0968-0004(00)01643-1
- CrossRef
- Google Scholar
18
GoodsellD. S.OlsonA. J. (2000). Structural symmetry and protein function. Annu. Rev. Biophys. Biomol. Struct.29, 105–153.10.1146/annurev.biophys.29.1.105
19
GruberM.SodingJ.LupasA. N. (2005). REPPER-repeats and their periodicities in fibrous proteins. Nucleic Acids Res.33(Suppl._2), W239–W243.10.1093/nar/gki405
20
GuerlerA.WangC.KnappE.-W. (2009). Symmetric structures in the universe of protein folds. J. Chem. Inf. Model.49, 2147–2151.10.1021/ci900185z
21
HegerA.HolmL. (2000). Rapid automatic detection and alignment of repeats in protein sequences. Proteins41, 224–237.10.1002/1097-0134(20001101)41:2<224::AID-PROT70>3.0.CO;2-Z
22
HeringaJ.ArgosP. (1993). A method to recognize distant repeats in protein sequences. Proteins17, 391–411.10.1002/prot.340170407
23
HrabeT.GodzikA. (2014). Console: using modularity of contact maps to locate solenoid domains in protein structures. BMC Bioinformatics15:119.10.1186/1471-2105-15-119
24
JavadiY.ItzhakiL. S. (2013). Tandem-repeat proteins: regularity plus modularity equals design-ability. Curr. Opin. Struct. Biol.23, 622–631.10.1016/j.sbi.2013.06.011
25
JonesD. T.CozzettoD. (2015). Disopred3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics31, 857–863.10.1093/bioinformatics/btu744
26
JordaJ.BaudrandT.KajavaA. V. (2012). Prdb: protein repeat database. Proteomics12, 1333–1336.10.1002/pmic.201100534
27
JordaJ.KajavaA. V. (2009). T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm. Bioinformatics25, 2632–2638.10.1093/bioinformatics/btp482
28
JordaJ.KajavaA. V. (2010). “Protein homorepeats: sequences, structures, evolution, and functions,” in Advances in Protein Chemistry and Structural Biology, Vol. 79, ed. McPhersonA. (Waltham, MA: Academic Press), 59–88.
- Google Scholar
29
JordaJ.XueB.UverskyV. N.KajavaA. V. (2010). Protein tandem repeats: the more perfect, the less structured. FEBS J.277, 2673–2682.10.1111/j.1742-4658.2010.07684.x
30
KajavaA. V. (2001). Review: proteins with repeated sequencestructural prediction and modeling. J. Struct. Biol.134, 132–144.10.1006/jsbi.2000.4328
31
KajavaA. V. (2012). Tandem repeats in proteins: from sequence to structure. J. Struct. Biol.179, 279–288.10.1016/j.jsb.2011.08.009
32
KattiM. V.Sami-SubbuR.RanjekarP. K.GuptaV. S. (2000). Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications. Protein Sci.9, 1203–1209.10.1110/ps.9.6.1203
33
KimC.BasnerJ.LeeB. (2010). Detecting internally symmetric protein structures. BMC Bioinformatics11:303.10.1186/1471-2105-11-303
34
KimC.TaiC.-H.LeeB. (2009). Iterative refinement of structure-based sequence alignments by seed extension. BMC Bioinformatics10:210.10.1186/1471-2105-10-210
35
KorotkovE. V.KorotkovaM. A.KudryashovN. A. (2003). Information decomposition method to analyze symbolical sequences. Phys. Lett. A312, 198–210.10.1016/S0375-9601(03)00641-8
- CrossRef
- Google Scholar
36
LuoH.LinK.DavidA.NijveenH.LeunissenJ. A. M. (2012). Prorepeat: an integrated repository for studying amino acid tandem repeats in proteins. Nucleic Acids Res.40, D394–D399.10.1093/nar/gkr1019
37
LuoH.NijveenH. (2014). Understanding and identifying amino acid repeats. Brief. Bioinformatics15, 582–591.10.1093/bib/bbt003
38
MainE. R.JacksonS. E.ReganL. (2003). The folding and design of repeat proteins: reaching a consensus. Curr. Opin. Struct. Biol.13, 482–489.10.1016/S0959-440X(03)00105-2
39
MainE. R.LoweA. R.MochrieS. G.JacksonS. E.ReganL. (2005). A recurring theme in protein engineering: the design, stability and folding of repeat proteins. Curr. Opin. Struct. Biol.15, 464–471.10.1016/j.sbi.2005.07.003
40
MarcotteE. M.PellegriniM.YeatesT. O.EisenbergD. (1998). A census of protein repeats. J. Mol. Biol.293, 151–160.10.1006/jmbi.1999.3136
- CrossRef
- Google Scholar
41
MarsellaL.SiroccoF.TrovatoA.SenoF.TosattoS. C. (2009). Repetita: detection and discrimination of the periodicity of protein solenoid repeats by discrete fourier transform. Bioinformatics25, i289–i295.10.1093/bioinformatics/btp232
42
MooreA. D.BjorklundA. K.EkmanD.Bornberg-BauerE.ElofssonA. (2008). Arrangements in the modular evolution of proteins. Trends Biochem. Sci.33, 444–451.10.1016/j.tibs.2008.05.008
43
MottR. (1999). Local sequence alignments with monotonic gap penalties. Bioinformatics15, 455–462.10.1093/bioinformatics/15.6.455
44
MularoniL.LeddaA.Toll-RieraM.AlbàM. M. (2010). Natural selection drives the accumulation of amino acid tandem repeats in human proteins. Genome Res.20, 745–754.10.1101/gr.101261.109
45
MularoniL.VeitiaR. A.AlbàM. M. (2007). Highly constrained proteins contain an unexpectedly large number of amino acid tandem repeats. Genomics89, 316–325.10.1016/j.ygeno.2006.11.011
46
MurrayK. B.GorseD.ThorntonJ. M. (2002). Wavelet transforms for the characterization and detection of repeating motifs. J. Mol. Biol.316, 341–363.10.1006/jmbi.2001.5332
47
MurrayK. B.TaylorW. R.ThorntonJ. M. (2004). Toward the detection and validation of repeats in protein structure. Proteins57, 365–380.10.1002/prot.20202
48
NewmanA.CooperJ. (2007). Xstream: a practical algorithm for identification and architecture modeling of tandem repeats in protein sequences. BMC Bioinformatics8:382.10.1186/1471-2105-8-382
49
PalidworG. A.ShcherbininS.HuskaM. R.RaskoT.StelzlU.ArumughanA.et al (2009). Detection of alpha-rod protein repeats using a neural network and application to huntingtin. PLoS Comput. Biol.5:e1000304.10.1371/journal.pcbi.1000304
50
ParkK.ShenB. W.ParmeggianiF.HuangP.-S.StoddardB. L.BakerD. (2015). Control of repeat-protein curvature by computational protein design. Nat. Struct. Mol. Biol.22, 167–174.10.1038/nsmb.2938
51
ParmeggianiF.HuangP.-S.VorobievS.XiaoR.ParkK.CaprariS.et al (2015). A general computational approach for repeat protein design. J. Mol. Biol.427, 563–575.10.1016/j.jmb.2014.11.005
52
ParraR. G.EspadaR.SánchezI. E.SipplM. J.FerreiroD. U. (2013). Detecting repetitions and periodicities in proteins by tiling the structural space. J. Phys. Chem. B117, 12887–12897.10.1021/jp402105j
53
PellegriniM.MarcotteE. M.YeatesT. O. (1999). A fast algorithm for genome-wide analysis of proteins with repeated sequences. Proteins35, 440–446.10.1002/(SICI)1097-0134(19990601)35:4<440::AID-PROT7>3.0.CO;2-Y
54
PellegriniM.RendaM. E.VecchioA. (2012). Ab initio detection of fuzzy amino acid tandem repeats in protein sequences. BMC Bioinformatics13(Suppl. 3):S8.10.1186/1471-2105-13-S3-S8
55
PlückthunA. (2015). Designed ankyrin repeat proteins (darpins): binding proteins for research, diagnostics, and therapy. Annu. Rev. Pharmacol. Toxicol.55, 489–511.10.1146/annurev-pharmtox-010611-134654
56
PontingC. P.MottR.BorkP.CopleyR. R. (2001). Novel protein domains and repeats in drosophila melanogaster: insights into structure, function, and evolution. Genome Res.11, 1996–2008.10.1101/gr.198701
57
PotenzaE.DomenicoT. D.WalshI.TosattoS. C. E. (2015). Mobidb 2.0: an improved database of intrinsically disordered and mobile proteins. Nucleic Acids Res.43, 315–320.10.1093/nar/gku982
58
RajatheiD. M.SelvarajS. (2013). Analysis of sequence repeats of proteins in the {PDB}. Comput. Biol. Chem.47, 156–166.10.1016/j.compbiolchem.2013.09.001
59
ReichenC.MadhurantakamC.PlückthunA.MittlP. R. (2014). Crystal structures of designed armadillo repeat proteins: implications of construct design and crystallization conditions on overall structure. Protein Sci.23, 1572–1583.10.1002/pro.2535
60
RichardF. D.KajavaA. V. (2014). Trdistiller: a rapid filter for enrichment of sequence datasets with proteins containing tandem repeats. J. Struct. Biol.186, 386–391.10.1016/j.jsb.2014.03.013
61
RubinsonE. H.EichmanB. F. (2012). Nucleic acid recognition by tandem helical repeats. Curr. Opin. Struct. Biol.22, 101–109.10.1016/j.sbi.2011.11.005
62
SabarinathanR.BasuR.SekarK. (2010). Prostrip: a method to find similar structural repeats in three-dimensional protein structures. Comput. Biol. Chem.34, 126–130.10.1016/j.compbiolchem.2010.03.006
63
SawyerN.ChenJ.ReganL. (2013). All repeats are not equal: a module-based approach to guide repeat protein design. J. Mol. Biol.425, 1826–1838.10.1016/j.jmb.2013.02.013
64
SchaperE.AnisimovaM. (2015). The evolution and function of protein tandem repeats in plants. New Phytol.206, 397–410.10.1111/nph.13184
65
SchaperE.GascuelO.AnisimovaM. (2014). Deep conservation of human protein tandem repeats within the eukaryotes. Mol. Biol. Evol.31, 1132–1148.10.1093/molbev/msu062
66
SchaperE.KajavaA. V.HauserA.AnisimovaM. (2012). Repeat or not repeat? Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Res.40, 10005–10017.10.1093/nar/gks726
67
SchaperE.KorsunskyA.MessinaA.MurriR.PečerskaJ.StockingerH.et al (2015). Tral: tandem repeat annotation library. Bioinformatics31, 3051–3053.10.1093/bioinformatics/btv306
68
ShihE. S.HwangM.-J. (2004). Alternative alignments from comparison of protein structures. Proteins56, 519–527.10.1002/prot.20124
69
SickmeierM.HamiltonJ. A.LeGallT.VacicV.CorteseM. S.TantosA.et al (2007). Disprot: the database of disordered proteins. Nucleic Acids Res.35(Suppl. 1), D786–D793.10.1093/nar/gkl893
70
SipplM. J. (2008). On distance and similarity in fold space. Bioinformatics24, 872–873.10.1093/bioinformatics/btn040
71
SodingJ.RemmertM.BiegertA. (2006). HHrep: de novo protein repeat detection and the origin of TIM barrels. Nucleic Acids Res.34(Suppl._2), W137–W142.10.1093/nar/gkl130
72
SokolD.BensonG.TojeiraJ. (2007). Tandem repeats over the edit distance. Bioinformatics23, e30–e35.10.1093/bioinformatics/btl309
73
StreetT. O.RoseG. D.BarrickD. (2006). The role of introns in repeat protein gene formation. J. Mol. Biol.360, 258–266.10.1093/bioinformatics/btl309
74
StumppM. T.ForrerP.BinzH. K.PluckthunA. (2015). Repeat Protein from Collection of Repeat Proteins Comprising Repeat Modules. US Patent 9,006,389.
- Google Scholar
75
SzklarczykR.HeringaJ. (2004). Tracking repeats using significance and transitivity. Bioinformatics20(Suppl. 1), i311–i317.10.1093/bioinformatics/bth911
76
TompaP. (2002). Intrinsically unstructured proteins. Trends Biochem. Sci.27, 527–533.10.1016/S0968-0004(02)02169-2
77
TompaP.FershtA. (2009). Structure and Function of Intrinsically Disordered Proteins. Baca Raton, FL: Chapman and Hall/CRC.
- Google Scholar
78
TurutinaV. P.LaskinA. A.KudryashovN. A.SkryabinK. G.KorotkovE. V. (2006). Identification of amino acid latent periodicity within 94 protein families. J. Comput. Biol.13, 946–964.10.1089/cmb.2006.13.946
79
WalshI.MartinA. J. M.Di DomenicoT.TosattoS. C. E. (2012a). Espritz: accurate and fast prediction of protein disorder. Bioinformatics28, 503–509.10.1093/bioinformatics/btr682
80
WalshI.SiroccoF. G.MinerviniG.Di DomenicoT.FerrariC.TosattoS. C. E. (2012b). Raphael: recognition, periodicity and insertion assignment of solenoid protein structures. Bioinformatics28, 3257–3264.10.1093/bioinformatics/bts550

Summary

Keywords

proteins, tandem repeats, biological significance, protein TR detection algorithms, protein TR properties

Citation

Pellegrini M (2015) Tandem Repeats in Proteins: Prediction Algorithms and Biological Role. Front. Bioeng. Biotechnol. 3:143. doi: 10.3389/fbioe.2015.00143

Received

12 June 2015

Accepted

07 September 2015

Published

24 September 2015

Volume

3 - 2015

Edited by

John Hancock, The Genome Analysis Centre, UK

Reviewed by

Silvio C. E. Tosatto, University of Padua, Italy; Michelle M. Simon, Medical Research Council, UK

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Marco Pellegrini, Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Via Moruzzi 1, Pisa 56124, Italy, marco.pellegrini@iit.cnr.it

Specialty section: This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Computational Genomics

REVIEW article

Tandem Repeats in Proteins: Prediction Algorithms and Biological Role

Abstract

Introduction

Protein-TR Detection Algorithms Based on Sequence

Protein-TR Detection Algorithms Based on Structure

Databases for Protein-TR

Classification of Protein-TR

Mechanisms of Protein-TR Expansion during Evolution

Protein-TR Evolutionary Conservation

Protein-TR in Protein Design

Order, Disorder, and Protein-TR

Correlation of Protein-TR with Other Protein Properties

Conclusion

Funding

Statements

Conflict of interest

References

Summary

Outline

Cite article

Article metrics

REVIEW article

Tandem Repeats in Proteins: Prediction Algorithms and Biological Role

Abstract

Introduction

Protein-TR Detection Algorithms Based on Sequence

Protein-TR Detection Algorithms Based on Structure

Databases for Protein-TR

Classification of Protein-TR

Mechanisms of Protein-TR Expansion during Evolution

Protein-TR Evolutionary Conservation

Protein-TR in Protein Design

Order, Disorder, and Protein-TR

Correlation of Protein-TR with Other Protein Properties

Conclusion

Funding

Statements

Conflict of interest

References

Summary

Outline

Cite article

Share article

Article metrics