Protein structure prediction in the era of AI: Challenges and limitations when applying to in silico force spectroscopy

Mechanoactive proteins are essential for a myriad of physiological and pathological processes. Guided by the advances in single-molecule force spectroscopy (SMFS), we have reached a molecular-level understanding of how mechanoactive proteins sense and respond to mechanical forces. However, even SMFS has its limitations, including the lack of detailed structural information during force-loading experiments. That is where molecular dynamics (MD) methods shine, bringing atomistic details with femtosecond time-resolution. However, MD heavily relies on the availability of high-resolution structural data, which is not available for most proteins. For instance, the Protein Data Bank currently has 192K structures deposited, against 231M protein sequences available on Uniprot. But many are betting that this gap might become much smaller soon. Over the past year, the AI-based AlphaFold created a buzz on the structural biology field by being able to predict near-native protein folds from their sequences. For some, AlphaFold is causing the merge of structural biology with bioinformatics. Here, using an in silico SMFS approach pioneered by our group, we investigate how reliable AlphaFold structure predictions are to investigate mechanical properties of Staphylococcus bacteria adhesins proteins. Our results show that AlphaFold produce extremally reliable protein folds, but in many cases is unable to predict high-resolution protein complexes accurately. Nonetheless, the results show that AlphaFold can revolutionize the investigation of these proteins, particularly by allowing high-throughput scanning of protein structures. Meanwhile, we show that the AlphaFold results need to be validated and should not be employed blindly, with the risk of obtaining an erroneous protein mechanism.

(FnBPB), Serine-aspartate repeat protein C (SdrC), Serine-aspartate repeat protein D (SdrD), Serineaspartate repeat protein E (SdrE), Bone sialoprotein binding protein (BBP) and Collagen binding adhesin (Cna). S, signal peptide; N-ter domains N1 to N3; B, homologous repeats; R, serine-aspartate or fibronectin binding repeats; W, wall-spanning region; M, membrane anchor; C, cytoplasmic tail. The ligand binding region is formed in a cleft between the N2 and N3 domains. Regions colored in gray had no defined structure predicted by AlphaFold 2. Peptide binding domain normally involves a gap between N2 and N3 regions, with exception of Cna, where it is located between N1 and N2 regions.  (Fg ), colored in orange. The locking strand is highlighted in green. We notice its different conformation from what is expected for the "dock, lock and latch" mechanism as illustrated in Figure 2A.

Protein structure prediction and processing
We selected 54 S. aureus adhesins from the adhesion superfamily (InterPro: IPR008966) to have their full-length sequence (~1k residues) modelled. After eliminating redundancy, 42 sequences were retrieved from the Uniprot database according to the accession numbers described on Table S1. We used AlphaFold 2 (Jumper et al., 2021) through the VMD QwikFold plugin batch mode (Gomes et al., 2022) to construct the models for the full length apo proteins. Current implementation of AlphaFold 2 can generate up to 24 predictions for each sequence. All models were ranked by the predicted Local Distance Difference Test (pLDDT) quality scores.
From the same protein family, we selected 27 S. aureus adhesins to be modelled in complex with peptides from the human extracellular matrix (Table S2). The full-length sequences retrieved from Uniprot. Before structure prediction, the sequences were trimmed to contain only the N2 and N3 domains, necessary to dock and lock the peptides. To do that, all sequences were aligned using MAFFT (Nakamura et al., 2018). HMMER (Eddy, 2011) was used to generate a hidden-markov model profile using 3 S. aureus adhesins crystal structures that contained the Ig-like domains N2 and N3: bone sialoprotein binding protein, clumping factor B and serine-aspartate repeat-containing protein E (PDB IDs: 5CFA, 4F1Z, 5WTA, respectively). This profile was used to search against the aligned sequences and select the corresponding regions. Sequences for fibrinopeptides alfa, beta, gamma, in addition to complement factor H peptide were retrieved from available crystal structures for bone-sialoprotein binding protein (BBP), serine-aspartate repeat-containing protein G (SdrG from S. epidermidis), clumping factor A (ClfA) and serine-aspartate repeat-containing protein E (SdrE) (PDB IDs: 5CFA, 1R17, 2VR3, 5WTB, respectively). Proteins were later paired with each peptide based on information available on Uniprot. We used AlphaFold Multimer (Evans et al., 2022) through the QwikFold batch mode to construct the models. Five predictions were generated for each protein complex and ranked by the predicted interface template modelling (ipTM) scores, used by AlphaFold Multimer. The best ranked model for each complex was selected for steered molecular dynamics simulations.
Some adhesin:peptide complexes displayed very low force profiles upon SMD simulations. These were remodeled using Modeller (Eswar et al., 2008) using as templates BBP or SdrE crystal structures. Models were generated using standard parameters and the structures followed the same protocol described at the next session.

Steered molecular dynamics (SMD) simulations
SMD simulations were conducted using NAMD 3 (Phillips et al., 2020). All systems were prepared using the VMD QwikMD interface (Humphrey et al., 1996;Ribeiro et al., 2016) where the proteins were solvated with TIP3 water model (Jorgensen and Jenson, 1998) and the total charge neutralized using NaCl 0.15 mol/L ion concentration. The CHARMM36 (Best et al., 2012) force field was used to describe the system and the simulations were performed under periodic boundary conditions in the NpT ensemble with temperature maintained at 300 K using Langevin dynamics for temperature and pressure coupling, the latter kept at 1 bar. A distance cut-off of 11.0 Å was applied to short-range nonbonded interactions, whereas long-range electrostatic interactions were treated using the particle-mesh Ewald (PME) (Darden et al., 1993) method. Before the SMD simulations all the systems were submitted to an energy minimization protocol for 1,000 steps. Additionally, an MD simulation with position restraints in the protein backbone atoms was performed for 1 ns, with temperature ramping from 0k to 300 K in the first 0.5 ns, which served to pre-equilibrate the system. For SMD, adhesins were C-terminal anchored, while peptides were pulled at a constant speed of 5x10-5 Å/ps with a 5 kcal/mol/Å 2 spring constant for 10 ns. Production runs were generated in ten replicas for each complex. We also selected 3 S. aureus crystallographic structures of adhesin:peptide complexes: BBP, ClfA and SdrE (PDB IDs: 5CFA, 2VR3 and 5WTB) to be simulated as control. Root mean square deviation (RMSD) values were calculated for the equilibrium simulation pre-SMD, for all systems. Protein images were generated using VMD; plots were rendered using python scripts.