- Molinaroli College of Engineering and Computing, University of South Carolina, Columbia, SC, United States
Introduction: In this study, we introduce the design and implementation of PDBMine, a large-scale, queryable platform for mining sequence-structure statistics from the Protein Data Bank (PDB). PDBMine enables rapid analysis of local conformational trends across proteins by extracting dihedral angles and sequence patterns at scale. In addition to the design and implementation of PDBMine, we also present results validating its ability to return structurally meaningful information.
Methods: We first assess the accuracy of its dihedral angle distributions by comparing them to established Ramachandran space and verifying expected behaviors of residues such as glycine and proline. We then use PDBMine to analyze the statistical properties of amino acid subsequences of length
Results: Our findings reveal that longer
Discussion: These results highlight PDBMine’s potential as a versatile tool for structure validation, statistical modeling, and probing the principles that govern sequence-structure compatibility in proteins.
1 Introduction
Over the past three decades, significant advancements have been made in the field of protein structure determination, driven by both experimental and computational approaches (Kühlbrandt, 2014; Senior et al., 2020). Since the completion of the Human Genome Project (Collins et al., 2003), the demand for rapid and cost-effective structural elucidation of proteins has intensified, leading to the establishment of large-scale initiatives such as the Structural Genomics Initiative and the Protein Structure Initiative. These efforts aimed to accelerate the discovery of protein structures and expand our understanding of protein function, interactions, and evolution. However, experimental techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) remain time-intensive and resource-demanding (Bax and Clore, 2019; Bai et al., 2015), necessitating the development of computational structure prediction methods as an alternative.
To advance computational modeling, initiatives such as the Critical Assessment of Structure Prediction (CASP) (Moult et al., 1995) were established, providing a rigorous benchmarking framework for evaluating protein structure prediction algorithms. Early computational approaches, such as homology modelling (Marti-Renom et al., 2000), leveraged evolutionary relationships between proteins to infer structural information. More recently, deep learning-based models, exemplified by AlphaFold2 (Jumper et al., 2021), have demonstrated remarkable accuracy in predicting protein structures directly from their primary sequences, revolutionizing the field of structural bioinformatics. These computational breakthroughs heavily depend on data mining and knowledge extraction from the Protein Data Bank (PDB) (Berman et al., 2000), which serves as the primary repository of experimentally determined structures.
However, despite its indispensable role in structure prediction, the PDB remains limited for large-scale, data-driven analyses. Originally conceived as an archival system for individual structures rather than for geometric pattern mining (Kleywegt and Jones, 1998), it requires extensive post-processing to extract features such as torsion angles or residue-level relationships. This limits its direct use in machine learning workflows that rely on structured, readily accessible geometric data (Shortle, 1999).
To overcome these limitations, we introduced PDBMine (Cole et al., 2019), a reformulation of the PDB designed for efficient large-scale structural analysis. PDBMine provides direct access to geometrically relevant structural attributes such as dihedral and omega angles, burial distances of residues, and water accessibility, enabling rapid query-based examination of sequence-structure relationships. In this report, we demonstrate the validity of PDBMine by comparing its Ramachandran distributions to the traditionally accepted structural restraints (Lovell et al., 2003). Additionally, we extend conventional structural analysis by introducing higher-dimensional Ramachandran data, providing previously unexplored insights into backbone conformations. Beyond validation, we illustrate PDBMine’s practical applications through a series of case studies that highlight its uses in structural bioinformatics, protein design, and predictive modeling. By expanding the scope of structural analysis to multi-residue dihedral distributions, PDBMine offers a new paradigm for understanding local protein structure formation.
2 Materials and methods
2.1 Overview of PDB
Established in 1971, the Protein Data Bank (PDB) (Berman et al., 2000) is the world’s primary repository for three-dimensional structural data of biological macromolecules. The PDB currently archives 238,089 protein structures (statistics obtained on 20 June 2025) and is an indispensable resource for structural and computational biologists, enabling advancements in drug discovery, protein engineering, and molecular dynamics simulations.
The PDB is well regarded for its archives, serving as a comprehensive repository of biomolecular structures with rich metadata that includes items such as the atomic coordinates, experimental methods, and structural resolutions, among other important information. It also supports a range of structure determination techniques, such as X-ray crystallography, nuclear magnetic resonance spectroscopy, and cryo-electron microscopy, ensuring broad coverage of biomolecular diversity. All of this data is freely available and universally accessible, and intuitive search tools enable users to filter data by sequence or structure to facilitate efficient exploration of the database.
Despite its extensive data repository, the PDB’s original archival design limits its use for modern large-scale data mining. Its organizational framework does not inherently support complex queries for structural parameters such as dihedral angles or residue-level spatial constraints, features that are essential for advanced structural analysis.
The advent of artificial intelligence (AI) and machine learning (ML) has brought transformative potential to structural biology. We have seen the predictive and generative modeling power of applications such as AlphaFold2, which requires large, well-organized datasets. While such predictive frameworks have revolutionized the field, their accuracy and reliability remain fundamentally grounded in experimentally determined structures. The training, validation, and benchmarking of models like AlphaFold2 and RoseTTAFold (Baek et al., 2021) depend directly on empirical data curated within the PDB, making the continued accessibility and reformulation of experimental information critical for sustaining progress in predictive modeling. However, in its current state, the PDB’s structure presents several challenges for the development of AI/ML applications. The success of AI-driven approaches often depends on a dataset that enables systematic mining of patterns, and machine learning models often rely on structured and interpretable input data. While the PDB stores Cartesian atomic coordinates, these coordinates are high-dimensional and often redundant for pattern recognition tasks. Dihedral angles provide a lower-dimensional and biologically interpretable representation of local backbone conformation, and are therefore a natural feature space for machine learning and predictive modeling.
Thus, while the PDB remains a vital resource for structural biology, its focus on archival functions limits its applicability in contemporary computational research. Addressing these limitations is crucial to harnessing the full potential of AI and ML technologies, and enhancing access to systematically organized datasets will be increasingly important as these fields evolve.
2.2 Overview of PDBMine
PDBMine was developed to address these limitations by providing a more efficient framework for large-scale structural data mining (Cole et al., 2019). It reformulates atomic coordinate data into an analytically tractable representation by systematically extracting backbone dihedral angles (
As noted earlier, PDB’s archiving of structural data typically falls into Cartesian representation of the atomic coordinates (X, Y, Z coordinates); these values are less intuitive for understanding local conformations, folding patterns, or secondary structure prediction. Instead, dihedral space representation (
Additionally, PDBMine extends Ramachandran-based restraints beyond just a single amino acid by including contextual sequence information. For instance, while it is well demonstrated that glycine exhibits broader dihedral flexibility than other amino acids, the influence of neighboring residues on its conformational preferences remains poorly understood (Lakshmi et al., 2014). PDBMine enables such analysis by grouping dihedral distributions across multi-residue motifs, including pairs (e.g., Gly-Pro), triplets (e.g., Gly-Pro-Gly), and longer k-mers. This approach allows for a richer characterization of local structural tendencies, capturing how sequence context modulates backbone geometry.
By systematically indexing these multi-residue dihedral spaces, PDBMine provides data-driven insights that can be used to guide protein modeling, constrain predictions of viable backbone conformations, and reduce the computational complexity associated with protein structure determination. The ability to mine dihedral distributions in the context of various k-mer sequences allows PDBMine to provide statistically derived structural constraints. This information can be integrated into computational approaches for structure prediction, molecular dynamics simulations, and validation of machine learning models aimed at protein folding.
PDBMine was originally introduced in 2019 as a reformulated database of protein structural information optimized for dihedral angle analysis (Cole et al., 2019). Since then, substantial improvements have been made to the underlying infrastructure to enhance usability, scalability, and performance. In particular, a major update in 2021 re-engineered the platform to support RESTful web services, containerized deployment, and high-speed queries.
2.3 Software implementation aspects
2.3.1 Data preparation and processing
To construct a comprehensive local dataset of backbone conformations, we systematically downloaded all protein structures from the PDB. For each protein chain or model, backbone dihedral angles (
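As an illustration of the geometry behind this step, the following minimal sketch computes one dihedral angle from four atom positions. For a protein backbone, phi is defined by the atoms C(i-1), N(i), CA(i), C(i) and psi by N(i), CA(i), C(i), N(i+1). The function name `dihedral_deg` and the toy coordinates are assumptions made for illustration; this is not PDBMine's own code, whose preprocessing relies on DSSP-generated files as described in Section 2.3.4.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees, in (-180, 180]) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b2) / b2_len
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not taken from any PDB entry): the central bond lies on z.
cis = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Using `atan2` rather than `acos` keeps the sign of the angle, which is needed to distinguish regions of Ramachandran space that differ only in handedness.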
2.3.2 Sequence window search and indexing
PDBMine employs a window-based indexing strategy. When a user submits an amino acid sequence of length
Figure 1. Sequence windowing and dihedral angle retrieval in PDBMine. A query sequence of length
2.3.3 User interface and usability
PDBMine is distributed via a Docker container, which enables users to run the application locally without needing a dedicated server. This version supports the same interactive query capabilities, including the ability to submit amino acid sequences in either one-letter (e.g., E Y V) or three-letter (e.g., Glu Tyr Val) codes and to define custom window sizes. The system processes each query and returns downloadable CSV files containing the matched
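The two accepted input notations can be reconciled with a small normalization step. The sketch below is a hypothetical helper (the name `normalize_query` and its error handling are assumptions, not PDBMine's actual parser) that converts whitespace-separated one-letter or three-letter codes into a single one-letter string.

```python
# Standard three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
ONE_LETTER = set(THREE_TO_ONE.values())


def normalize_query(text):
    """Convert 'E Y V', 'Glu Tyr Val', or 'EYV' into the string 'EYV'."""
    residues = []
    for token in text.split():
        key = token.upper()
        if key in THREE_TO_ONE:
            # A recognized three-letter code takes priority.
            residues.append(THREE_TO_ONE[key])
        elif all(ch in ONE_LETTER for ch in key):
            # Otherwise treat the token as a run of one-letter codes.
            residues.append(key)
        else:
            raise ValueError(f"unrecognized residue token: {token!r}")
    return "".join(residues)
```

One design caveat: an unspaced token such as `ALA` is ambiguous (alanine, or Ala-Leu-Ala in one-letter codes). This sketch resolves it in favor of the three-letter code; a real interface would need to state its own rule.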
Docker-based distribution simplifies deployment and ensures that PDBMine can be executed in a consistent, reproducible environment across operating systems. The platform is designed to support a wide range of structural biology applications, allowing users to perform reproducible and large-scale structural analyses without dedicated infrastructure. To complement the containerized distribution, we have also restored a publicly accessible instance of the current PDBMine backend at https://ifestos.cse.sc.edu/PDBMine, providing users with a convenient browser-based interface for testing and exploratory queries without requiring a local build.
2.3.4 Performance and deployment
The computational performance of PDBMine depends primarily on the user’s available resources and deployment configuration. To assess system performance, we deployed PDBMine on an AWS t3.2xlarge instance (8 vCPUs, 32 GB RAM) and conducted several benchmark tests. A query on a 25-residue sequence with a 7-mer window completed in approximately 1 s, while the same sequence queried with a 2-mer window completed in 18 s. To further evaluate scalability, 100 proteins of varying lengths were selected at random from the PDB and queried sequentially with the same window size. The results remained consistent across sequence lengths, indicating that query time did not degrade with increasing input size.
The preprocessing step required to construct the full database (i.e., generating DSSP files and residue–position sets) took approximately 27 h on the same instance. Although precalculated DSSP annotations are publicly available, we chose to generate them locally to ensure consistent formatting with our parser and to avoid potential disruptions from future changes to external file formats or access methods. This design also enables seamless incremental updates, as only newly added or modified PDB entries require reprocessing. As such, this is a one-time cost incurred only during initial database construction or subsequent updates.
The complete Docker configuration, source code, and RESTful API framework are publicly available at https://github.com/ValafarLab/PDBMine. This containerized deployment allows users to reproduce performance benchmarks locally or on cloud infrastructure (e.g., AWS) and to access PDBMine’s functionality through its documented REST endpoints. The modular architecture ensures reproducible results while maintaining flexibility for users with different computational configurations.
As PDBMine continues to evolve, future updates will focus on integrating more advanced analytical tools and machine-learning models to enhance predictive insights into protein structure and function.
3 Results
In this section we explore various experiments to validate that PDBMine is functioning as intended and to demonstrate its utility in structural biology. We first validate the results of PDBMine by comparing the dihedral angles reported by PDBMine to the expected Ramachandran restraints. We then explore the abundance of different amino acids and their combinations in the PDB. Both of these exercises will serve to validate the accuracy of PDBMine by comparing its results to established structural benchmarks. Finally, we demonstrate the utility of PDBMine in a number of applications including structural motif discovery by examining the dihedral angle distributions of specific amino acid sequences and their surrounding residues.
Throughout the following sections, PDBMine is queried using the standard user interface method described in Section 2.3. Although the same PDBMine application is used throughout this work, we format our queries to PDBMine using two different methods. Whenever we query PDBMine with the entire amino acid sequence of a protein, we choose a window size and PDBMine applies the shifting window method as illustrated in Figure 1 to query the PDB with each window. This is referred to as Method 1 of querying PDBMine. Whenever we query PDBMine with a short sequence of amino acids, on the other hand, we specify a window size equal to the length of the subsequence. This way, PDBMine’s standard shifting window does not apply and the method only forms a single window. This method is used when we are only interested in the dihedral angles of a short subsequence of amino acids (4 residues, for example), rather than the entire sequence of a protein. This is referred to as Method 2 of querying PDBMine, and is summarized along with Method 1 as follows:
• Method 1 Query: PDBMine is queried with a full protein sequence of
• Method 2 Query: PDBMine is queried with a short subsequence of
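The two query styles differ only in how the window size relates to the input length, which can be sketched as follows. The helper `sequence_windows` is a hypothetical illustration of the shifting-window idea, not PDBMine's implementation, and the short input strings are toy examples.

```python
def sequence_windows(sequence, window_size):
    """Return (start_index, window) pairs for every contiguous window."""
    if not 1 <= window_size <= len(sequence):
        raise ValueError("window_size must be between 1 and len(sequence)")
    last_start = len(sequence) - window_size
    return [(i, sequence[i:i + window_size]) for i in range(last_start + 1)]


# Method 1 style: a longer sequence scanned with a shorter shifting window.
method1 = sequence_windows("PQPKVLT", 4)

# Method 2 style: the window size equals the subsequence length,
# so exactly one window is formed and no shifting takes place.
method2 = sequence_windows("NKPGD", len("NKPGD"))
```

A sequence of length n scanned with window size k produces n - k + 1 windows, so setting k = n collapses the scan to the single window used in Method 2.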
3.1 Validation of PDBMine results
3.2 Diversity of dimeric, trimeric, tetrameric, and pentameric fragments
We next explored the performance and viability of PDBMine in global PDB searching tasks. Certain amino acid residues, such as Leucine and Alanine, are known to occur in higher relative abundance than others in the proteome (Shen et al., 2006; Nacar, 2023). To further validate PDBMine and investigate relative abundances of amino acids, we conducted a global search of every possible
In addition to exploring the frequency of individual
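The size of such a global search grows as a power of 20, which the following minimal sketch makes concrete. The names `all_kmers` and `search_space_size` are assumptions introduced for illustration, and the residue ordering (alphabetical by three-letter code) is simply one convenient choice.

```python
from itertools import product

# One-letter codes ordered by three-letter name: Ala, Arg, Asn, Asp, Cys, ...
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def all_kmers(k):
    """Lazily generate every possible amino acid subsequence of length k."""
    for letters in product(AMINO_ACIDS, repeat=k):
        yield "".join(letters)


def search_space_size(k):
    """Number of distinct k-mers over the 20 canonical amino acids."""
    return len(AMINO_ACIDS) ** k


for k in (2, 3, 4, 5):
    print(k, search_space_size(k))
```

Generating the k-mers lazily avoids holding the full list in memory, which starts to matter at pentamers, where the space already contains 3.2 million sequences.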
3.3 Restricted representation of contextualized sequences
PDBMine offers unique insights into the statistics of dihedral angles for
We define the target residue as the center residue in the case of an odd-length
Expanding on the idea of contextualized dihedral angle representations, we also examined the effect of increasing window size (i.e.,
3.4 Applications of PDBMine in structural motif discovery
PDBMine can be used to aid protein structure determination by mining the Protein Data Bank (PDB) for occurring sequence-structure motifs, making it especially useful in fragment-based modeling approaches. It can improve de novo structure prediction by providing high-quality, sequence-specific fragment libraries, and support homology modeling by identifying structural motifs that match regions of the target sequence, including poorly conserved loops. In crystallographic model building, PDBMine helps model ambiguous or missing regions by suggesting plausible structural fragments. Additionally, it supports structure-guided mutagenesis and protein design by evaluating the structural viability of new sequence motifs. PDBMine can also enhance machine learning-based structure prediction by contributing structural priors or features derived from known sequence-structure relationships.
In this work, we highlight the use of PDBMine in protein structure validation. The application stems from the concept of restricted representations that have been observed in other structures with similar subsequence motifs. When queried with longer, more contextualized amino acid sequences, PDBMine identifies the restricted R-space that residues of interest tend to occupy. This concept can be expanded beyond a single residue by examining the dihedral angle distributions of all residues in a subsequence.
When queried, PDBMine reports the set of dihedral angles for a target residue given its neighboring residues and a window size. Equation 3 describes the likelihood of the dihedral angles for residue
In general, two approaches can be conceived to implement the sequence and structural constraints described in Equation 4. The first approach is to impose the constraints during the course of querying PDBMine’s database. The second approach is to utilize data analytics to organize the results after querying. Since the first approach is computationally intensive, we adopt the second. More specifically, this approach involves clustering dihedral tuples from each fragment in a high-dimensional space. For example, for a window size of 5, each matching fragment found by PDBMine would contain
The hyperdimensional clustering algorithm that we have utilized in this report is summarized in Algorithm 1 below. Continuing the example with the subsequence NKPGD, we: (1) query PDBMine using Method 2 and collect, for example, 2000 full-length fragment matches, each with five dihedral angle pairs, one for each residue in the subsequence. (2) Assemble these results into a list of 2000 10-dimensional data points. (3) Apply Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) (Campello et al., 2015) to find groups of similar points in the 10-dimensional space. Clustering was performed using the angular distance on the
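Because dihedral angles are periodic, the distance used for clustering has to treat -179° and 179° as neighbors. The sketch below shows one way to build such a distance for the 10-dimensional points described above. The exact distance definition is not reproduced in this text, so the choice made here (per-angle wrap-around differences combined by a Euclidean norm) is an assumption, as are all function names.

```python
import math


def wrap_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def angular_distance(u, v):
    """Distance between two dihedral tuples, respecting periodicity."""
    if len(u) != len(v):
        raise ValueError("tuples must have the same number of angles")
    return math.sqrt(sum(wrap_diff(a, b) ** 2 for a, b in zip(u, v)))


def distance_matrix(points):
    """Symmetric pairwise distance matrix for a list of dihedral tuples."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = angular_distance(points[i], points[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Toy 10-dimensional points: five (phi, psi) pairs per fragment, made-up values.
fragments = [
    (-60, -45, -65, -40, -70, 150, 80, 10, -120, 130),
    (-62, -43, -63, -42, -72, 152, 82, 8, -118, 128),
    (-120, 130, -125, 135, -60, 179, 60, 30, -60, -45),
]
matrix = distance_matrix(fragments)
```

A matrix of this kind could then be passed to an HDBSCAN implementation that accepts precomputed distances (for example, scikit-learn's `HDBSCAN` with `metric="precomputed"`), rather than letting the clusterer apply a plain Euclidean metric that ignores the wrap-around.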
As illustrated in Figure 2, each cluster identified by Algorithm 1 represents a combination of the residue-level clusters found in the independent 2D distributions. For example, when querying PDBMine with a dimer subsequence
We hypothesized, however, that the majority of these potential clusters are not exhibited in the PDB—as in, certain combinations of individual clusters in a subsequence do not occur. The clusters that are represented then serve as baselines for what dihedral angles and, in turn, 3D conformations, are possible for different amino acid sequences. To gain insight into the behavior of these clusters, we applied Algorithm 1 to a large sample size of short amino acid sequences. For each
Figure 2. Demonstration of the high dimensional clustering found by Algorithm 1 for some dimer. Scatter points represent clusters found in each 2D distribution independently. In this demonstration, matches for the first residue only form a cluster in the Beta sheet region. Matches for the second residue form clusters in the Beta sheet and Alpha-helical regions. Across both residues, Algorithm 1 will find the two clusters that form in 4D space, as shown by dotted lines.
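The relationship between per-residue clusters and potential full-sequence clusters is a Cartesian product, which the following minimal sketch enumerates. The cluster labels and the fragment assignments are made-up toy values, and the function names are assumptions for illustration only.

```python
from itertools import product
from math import prod


def potential_combinations(per_residue_clusters):
    """All combinations of one cluster label per residue position."""
    return list(product(*per_residue_clusters))


def count_potential(per_residue_clusters):
    """Number of potential combinations without enumerating them."""
    return prod(len(clusters) for clusters in per_residue_clusters)


def observed_combinations(fragment_labels):
    """Distinct label combinations that actually occur among fragments."""
    return {tuple(labels) for labels in fragment_labels}


# The dimer of Figure 2: one cluster for residue 1, two for residue 2.
figure2_dimer = [["beta"], ["beta", "alpha"]]
figure2_potential = potential_combinations(figure2_dimer)

# A second toy dimer in which only some combinations are populated.
toy_dimer = [["beta", "alpha"], ["beta", "alpha"]]
toy_fragments = [("beta", "beta"), ("alpha", "alpha"), ("beta", "beta")]
unobserved = set(potential_combinations(toy_dimer)) - observed_combinations(toy_fragments)
```

In the second toy case, four combinations are possible but only two are populated, which mirrors the situation where some combinations of residue-level clusters never occur together.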
4 Discussion
4.1 Validation
The approximately 3 million dihedral angles retrieved from PDBMine are shown in Figure 3. The results are categorized into three groups based on the amino acid: Glycine, Proline, and all other residues. As expected, the distributions adhere to established Ramachandran-space restraints for each group. For the Glycine residues (174k), we observe high-density regions corresponding to the five distinct conformational clusters typically exhibited by this residue (Ho and Brasseur, 2005). Similarly, results for Proline residues (150k) adhere to the expected restriction of
Figure 3.
4.2 Results of dimer, trimer, and tetrameric fragments
We conducted a global search of all
Figure 4. Relative probability of each amino acid found by three methods. In blue are the probabilities in the PDB found by querying PDBMine for each individual amino acid. In orange are the probabilities of amino acids in the SwissProt database found by Shen et al. (Shen et al., 2006). In green are the probabilities in a subset of the PDB found by Nacar (Nacar, 2023). The red line shows the number of occurrences of each amino acid in the PDB, found through PDBMine.
Next, we examined the results of the global search for the 400 possible dimers (Ala-Ala, Ala-Arg,
Figure 5. (A) Shows the joint probability of Dimers solely based on the observed probability found for each amino acid independently:
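The comparison between independence-based and observed probabilities can be sketched in a few lines. Under independence, the expected probability of a k-mer is the product of the probabilities of its residues. The counts below are made-up toy numbers over a two-letter alphabet, not PDBMine output, and `independence_ratio` is a hypothetical helper name.

```python
import math


def independence_ratio(residue_counts, kmer_counts):
    """Map each k-mer to (observed, expected, observed / expected).

    `expected` assumes residues occur independently of their neighbors.
    Every residue appearing in a k-mer must have a nonzero count.
    """
    total_residues = sum(residue_counts.values())
    total_kmers = sum(kmer_counts.values())
    p = {aa: n / total_residues for aa, n in residue_counts.items()}
    result = {}
    for kmer, n in kmer_counts.items():
        expected = math.prod(p[aa] for aa in kmer)
        observed = n / total_kmers
        result[kmer] = (observed, expected, observed / expected)
    return result


# Toy counts for illustration only.
residues = {"A": 60, "G": 40}
dimers = {"AA": 50, "AG": 10, "GA": 10, "GG": 30}
ratios = independence_ratio(residues, dimers)
```

A ratio above 1 marks a k-mer that occurs more often than independence predicts, and a ratio below 1 marks one that is underrepresented.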
In addition to their biological relevance, some of these overrepresented
These patterns continue with
Table 1. Number of occurrences found by PDBMine for each
Next, as before, the number of occurrences found in the PDB for each
Figure 6. Observed vs. Joint Probabilities of Dimers, Trimers, Tetramers, and Pentamers. On the x-axes are the joint probabilities of each
The fact that we begin to see unrepresented
4.3 Restricted representation of contextualized sequences
To investigate the influence of sequence context on dihedral angle distributions, we examined the residue Alanine (A) in two different pentameric contexts: DEAKK and GLALS. Using Method 2, we retrieved approximately 380 matches for each subsequence. The estimated 2D (
Figure 7. Dihedral angle distributions queried from PDBMine for the residue Alanine occurring in different pentamers from the protein 6POO: GLALS at residues 318-322 and DEAKK at residues 372-376. Overlaid are the Kernel Density Estimations of the dihedral angles for each. Here, the same residue demonstrates different patterns due to the amino acids that neighbor it. Both queries yielded about 380 matches.
We next examined the effect of increasing surrounding context when querying PDBMine for a residue of interest. Figure 8 shows the results of Method 2 queries using four subsequences of increasing length centered on the Lysine at position 75 of the protein 2RJ7. As the sequence length increases, the results become more context-specific and converge toward the dihedral angle values reported from X-ray crystallography. With three residues of context, the subsequence PKVL yields 2,360 matches. The resulting distribution of dihedral angles for the Lysine spans the entire Ramachandran space, indicating that a 4-mer is not specific enough to yield a meaningful conformational distribution. As subsequence length increases, however, the distribution of dihedral angles begins to converge around the beta sheet region. When queried with six residues of context, the subsequence PQPKVLT yields only 160 matches, but these matches cluster tightly around the experimentally determined dihedral angles for Lysine-75 in 2RJ7:
Figure 8. Ramachandran plots of PDBMine results for amino acid subsequences of the protein 2RJ7. Each plot shows the dihedral angles of Lysine (K) found when PDBMine is queried with the overlapping subsequences shown on subplot titles. The chosen subsequences demonstrate the effect of gradually increasing the number of surrounding residues sent to PDBMine. Shown in black X markers are the dihedral angles of the Lysine in the model of the protein found through X-ray crystallography. Shown in blue are the results from PDBMine. The background outlines the typical Ramachandran space of dihedral angles.
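One way to put a number on how tightly a set of matches concentrates is to use circular statistics, since ordinary means and variances are misleading for periodic angles. The sketch below is an illustrative suggestion rather than the analysis performed in this work; the function names and the example angles are assumptions.

```python
import math


def circular_mean_deg(angles):
    """Mean direction of angles in degrees, returned in (-180, 180]."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))


def resultant_length(angles):
    """Mean resultant length: near 1 when tightly clustered, near 0 when spread."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.hypot(s, c) / len(angles)


# Made-up psi values: a broad set and a tight set near the beta region.
broad = [-170, -60, 0, 60, 130, 175]
tight = [128, 131, 135, 129, 133, 130]
print(resultant_length(broad), resultant_length(tight))
print(circular_mean_deg(tight))
```

Applied separately to the phi and psi values returned for each subsequence length, a rising resultant length would indicate the kind of convergence visible in Figure 8.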
4.4 Applications of PDBMine - structure validation results
To demonstrate PDBMine’s potential to serve as a method of predicted structure validation, we employed Algorithm 1 to conduct a study of 10.8k amino acid sequences of lengths 4, 5, 6, and 7. For each sequence, the number of dihedral conformation clusters exhibited in the PDB was found through PDBMine and Algorithm 1. The results are shown in Figure 9.
Figure 9. The results of Algorithm 1 on PDBMine queries for 10,800 randomly selected
Given that each residue in a sequence of length
Table 2. Possible number of clusters found by PDBMine and Algorithm 1 for amino acid subsequences of different lengths compared to the observed number.
An example of PDBMine results clustered by Algorithm 1 is shown in Figure 10, illustrating the implications of these experimental results. While each individual residue-level subplot exhibits 1-3 clusters, not all combinations of these clusters across the full subsequence appear in the PDB, as expected. This provides valuable insight into which conformational formations of a
Figure 10. Example of a PDBMine distribution for each residue of a 5-residue window, NKPGD, which occurs at positions 98-102 in the protein 7EIK. Each subplot shows the dihedral angles queried from PDBMine for one residue of the subsequence. The red X marks show the dihedral angles of each residue in the X-ray Crystallography model of 7EIK, with red dotted lines outlining their connection. The remaining dotted lines are representative of four different clusters, found by PDBMine and Algorithm 1.
Figure 11. Protein 7EIK with residues 1-97 in blue, 98-102 in red, and 103-125 (end) in yellow. The first model (labelled X-Ray) shows the structure of 7EIK found through X-ray crystallography and retrieved from the PDB. In each of the following molecules, the dihedral angles of residues 98-102 (red) are set to the values found in each cluster shown in Figure 10. Clusters 0 to 3 are shown in order (labelled C0-C3).
One of the most persistent challenges in structure modeling lies in accurately reconstructing loops and flexible regions, where experimental density is often incomplete and conformational heterogeneity is high. The dihedral angle clusters retrieved by PDBMine provide an empirical foundation for constraining such regions during model building. Recent work by Pandala et al. (Pandala et al., 2025) leveraged PDBMine-derived torsion angle distributions to evaluate and refine AlphaFold2 models from the CASP14 competition, demonstrating that PDBMine-based likelihood analysis can pinpoint residues in flexible or mismodeled regions and guide structural correction. Similarly, in our own analyses, the distinct clusters identified for the NKPGD subsequence define a finite set of backbone conformations that can be mapped onto missing or ambiguous residues. These empirically observed
5 Conclusion
PDBMine provides a flexible and interpretable platform for analyzing large-scale sequence–structure relationships in the Protein Data Bank. In this work, we demonstrated how PDBMine can be used to explore amino acid sequence frequencies, evaluate the conformational effects of local sequence context, and perform high-dimensional clustering of dihedral angles across protein fragments. These applications highlight how PDBMine supports a range of research needs, from identifying enriched sequence motifs to validating predicted structures and modeling context-specific conformational variability.
Beyond demonstrating platform capabilities, our results reveal several structural principles underlying protein architecture. A key finding is the systematic breakdown of statistical independence in amino acid occurrence as
Additionally, we demonstrated that the structural preferences of individual residues are highly sensitive to their local context. As subsequence length increases, PDBMine returns more constrained dihedral angle distributions that converge toward experimentally observed geometries. High-dimensional clustering of full-sequence dihedral tuples revealed that, despite a combinatorial explosion in potential conformations (e.g.,
Together, these results demonstrate the range of structural questions that PDBMine can help address. What we have presented here is just a small sample of what is possible. PDBMine was designed to be general and extensible; the examples in this paper serve as starting points for more detailed applications. Future work could include building machine learning models based on PDBMine-derived features, integrating side-chain geometries such as
Data availability statement
Publicly available datasets were analyzed in this study. The original structural data is publicly available from the Protein Data Bank (PDB) at: https://www.rcsb.org. The processed dataset used in this study is not hosted online due to size constraints. However, the complete PDBMine source code, Docker configuration, and documentation for setup and usage are publicly available through the Valafar Lab GitHub repository at: https://github.com/ValafarLab/PDBMine. A publicly accessible instance of the current PDBMine backend is also available at: https://ifestos.cse.sc.edu/PDBMine for convenient browser-based access. Because this deployment is hosted on departmental infrastructure, long-term availability is not guaranteed; the supported and fully reproducible access methods remain the Docker container and REST API.
Author contributions
MA: Writing – original draft, Writing – review and editing, Investigation, Validation, Visualization. CL: Writing – original draft, Writing – review and editing. AH: Methodology, Software, Writing – review and editing. CO: Conceptualization, Methodology, Software, Writing – review and editing. HV: Conceptualization, Writing – review and editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number P20GM103499. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Alfaro, J. A., Zheng, R., Persson, M., Letts, J. A., Polakowski, R., and Bai, Y. (2008). ABO(H) Blood group A and B glycosyltransferases recognize substrate via specific conformational changes. J. Biol. Chem. 283 (15), 10097–10108. doi:10.1074/jbc.M708669200
Amrein, B., Schmid, M., Collet, G., Cuniasse, P., Gilardoni, F., Seebeck, F. P., et al. (2012). Identification of two-histidines one-carboxylate binding motifs in proteins amenable to facial coordination to metals. Metallomics 4 (4), 379–388. doi:10.1039/c2mt20010d
Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., et al. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876. doi:10.1126/science.abj8754
Bai, X. C., McMullan, G., and Scheres, S. H. W. (2015). How cryo-EM is revolutionizing structural biology. Trends Biochem. Sci. 40, 49–57. doi:10.1016/j.tibs.2014.10.005
Bax, A., and Clore, G. M. (2019). Protein NMR: boundless opportunities. J. Magnetic Reson. 306, 187–191. doi:10.1016/j.jmr.2019.07.037
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., et al. (2000). The protein data bank. Nucleic Acids Res. 28, 235–242. doi:10.1093/nar/28.1.235
Campello, R. J. G. B., Moulavi, D., Zimek, A., and Sander, J. (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 10, 1–51. doi:10.1145/2733381
Cole, C., Ott, C., Valdes, D., and Valafar, H. (2019). “PDBMine: a reformulation of the protein data bank to facilitate structural data mining,” in 2019 international conference on computational science and computational intelligence (CSCI) (IEEE), 1458–1463.
Collins, F. S., Green, E. D., Guttmacher, A. E., and Guyer, M. S. on behalf of the US National Human Genome Research Institute (2003). A vision for the future of genomics research. Nature 422, 835–847. doi:10.1038/nature01626
Grigoriu, S., Bond, R., Cossio, P., Chen, J. A., Ly, N., Hummer, G., et al. (2013). The molecular mechanism of substrate engagement and immunosuppressant inhibition of calcineurin. PLOS Biol. 11, 1–13. doi:10.1371/journal.pbio.1001492
Hao, Y., Amandine, C., Javier, N., Singh, A., Chen, H., Manzanillo, P., et al. (2021). Structures and mechanism of human glycosyltransferase β1,3-N-acetylglucosaminyltransferase 2 (B3GNT2), an important player in immune homeostasis. J. Biol. Chem. 296, 0021–9258. doi:10.1074/jbc.RA120.015306
Hegg, E. L., and Que, L. Jr. (1997). The 2-His-1-carboxylate facial triad: an emerging structural motif in mononuclear non-heme iron(II) enzymes. Eur. J. Biochem. 250 (3), 625–629. doi:10.1111/j.1432-1033.1997.t01-1-00625.x
Hekkelman, M. L., Salmoral, D. Á., Perrakis, A., and Joosten, R. P. (2025). DSSP 4: FAIR annotation of protein secondary structure. Protein Sci. 34, e70208. doi:10.1002/pro.70208
Ho, B. K., and Brasseur, R. (2005). The Ramachandran plots of glycine and pre-proline. BMC Struct. Biol. 5, 14. doi:10.1186/1472-6807-5-14
Hochuli, E., Döbeli, H., and Schacher, A. (1987). New metal chelate adsorbent selective for proteins and peptides containing neighbouring histidine residues. J. Chromatogr. A 411, 177–184. doi:10.1016/S0021-9673(01)93640-5
Jing, Q., Li, H., Fang, J., Linda, J. R., Martásek, P., Thomas, L. P., et al. (2013). In search of potent and selective inhibitors of neuronal nitric oxide synthase with more simple structures. Bioorg. and Med. Chem. 21 (17), 5323–5331. doi:10.1016/j.bmc.2013.06.014
Joint Center for Structural Genomics (2009). Crystal structure of hypothetical protein from Thermobaculum terrenum. Available online at: https://www.rcsb.org/structure/3IHU.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. doi:10.1038/s41586-021-03819-2
Kabsch, W., and Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637. doi:10.1002/bip.360221211
Kleywegt, G. J., and Jones, T. A. (1996). Phi/Psi-chology: Ramachandran revisited. Structure 4, 1395–1400. doi:10.1016/S0969-2126(96)00147-5
Kleywegt, G. J., and Jones, T. A. (1998). Databases in protein crystallography. Acta Crystallogr. Sect. D. Biol. Crystallogr. 54, 1119–1131. doi:10.1107/S0907444998009378
Kühlbrandt, W. (2014). The resolution revolution. Science 343, 1443–1444. doi:10.1126/science.1251652
Lakshmi, B., Sinduja, C., Archunan, G., and Srinivasan, N. (2014). Ramachandran analysis of conserved glycyl residues in homologous proteins of known structure. Protein Sci. 23, 843–850. doi:10.1002/pro.2468
Lovell, S. C., Davis, I. W., Arendall III, W. B., de Bakker, P. I. W., Word, J. M., Prisant, M. G., et al. (2003). Structure validation by Cα geometry: ϕ, ψ and Cβ deviation. Proteins Struct. Funct. Bioinforma. 50, 437–450. doi:10.1002/prot.10286
Lu, Y., Berry, S. M., and Pfister, T. D. (2001). Engineering novel metalloproteins: design of metal-binding sites into native protein scaffolds. Chem. Rev. 101 (10), 3047–3080. doi:10.1021/cr0000574
Manne, K., Chattopadhyay, D., Agarwal, V., Blom, A. M., Khare, B., Chakravarthy, S., et al. (2020). Novel structure of the N-terminal helical domain of BibA, a group B streptococcus immunogenic bacterial adhesin. Acta Crystallogr. D. 76 (8), 759–770. doi:10.1107/S2059798320008116
Marti-Renom, M. A., Stuart, A. C., Fiser, A., Sánchez, R., Melo, F., and Šali, A. (2000). Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophysics Biomol. Struct. 29, 291–325. doi:10.1146/annurev.biophys.29.1.291
Moult, J., Pedersen, J. T., Judson, R., and Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins Struct. Funct. Bioinforma. 23, ii–iv. doi:10.1002/prot.340230303
Nacar, C. (2023). The frequencies of amino acids in secondary structural elements of globular proteins. Clin. Exp. Health Sci. 13, 261–266. doi:10.33808/clinexphealthsci.1239176
Pandala, N., Brown, K. G., and Valafar, H. (2025). “Protein structure validation using PDBMine and data analytics approaches,” in Computational science and computational intelligence. Editors H. R. Arabnia, L. Deligiannidis, F. Shenavarmasouleh, S. Amirian, and F. Ghareh Mohammadi (Cham: Springer Nature Switzerland), 315–328.
Ramachandran, G., Ramakrishnan, C., and Sasisekharan, V. (1963). Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99. doi:10.1016/S0022-2836(63)80023-6
Samuels, E. R., and Sevrioukova, I. F. (2021). Rational design of CYP3A4 inhibitors: a one-atom linker elongation in ritonavir-like compounds leads to a marked improvement in the binding strength. Int. J. Mol. Sci. 22, 852. doi:10.3390/ijms22020852
Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710. doi:10.1038/s41586-019-1923-7
Shen, S., Kai, B., Ruan, J., Torin Huzil, J., Carpenter, E., and Tuszynski, J. A. (2006). Probabilistic analysis of the frequencies of amino acid pairs within characterized protein sequences. Phys. A 370, 651–662. doi:10.1016/j.physa.2006.03.004
Shortle, D. (1999). Structure prediction: the state of the art. Curr. Biol. 9, R205–R209. doi:10.1016/S0960-9822(99)80130-4
Van Beusekom, B., Damaskos, G., Hekkelman, M. L., Salgado-Polo, F., Hiruma, Y., and Perrakis, A. (2021). LAHMA: structure analysis through local annotation of homology-matched amino acids. Acta Crystallogr. Sect. D. Struct. Biol. 77 (1), 28–40. doi:10.1107/S2059798320014473
Zheng, W., Kong, B., Tang, W., Zhu, J., and Chen, Y. (2022). Crystal structure of rd4-bd1 in complex with lt-872-297. Available online at: https://www.rcsb.org/structure/7EIK.
Keywords: dihedral angles, Protein Data Bank, protein structure validation, Ramachandran plot, sequence-structure relationship, structural bioinformatics
Citation: Azeem M, Lee C, Hein A, Ott C and Valafar H (2026) Reformulation of the protein databank for real-time search of geometrical attributes of protein structures. Front. Mol. Biosci. 12:1694750. doi: 10.3389/fmolb.2025.1694750
Received: 28 August 2025; Accepted: 04 December 2025; Published: 13 January 2026.
Edited by:
Steffen P. Graether, University of Guelph, Canada
Reviewed by:
Chang Liu, The University of Texas at Austin, United States
Valerio Piomponi, AREA Science Park, Italy
Copyright © 2026 Azeem, Lee, Hein, Ott and Valafar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Homayoun Valafar, homayoun@cse.sc.edu
Christopher Ott