Ensemble deep learning models for protein secondary structure prediction using bidirectional temporal convolution and bidirectional long short-term memory

Protein secondary structure prediction (PSSP) is a challenging task in computational biology. However, existing models with deep architectures do not extract deep long-range features from long sequences sufficiently or comprehensively. This paper proposes a novel deep learning model to improve PSSP. In the model, our proposed bidirectional temporal convolutional network (BTCN) extracts the bidirectional deep local dependencies in protein sequences segmented by the sliding window technique, the bidirectional long short-term memory (BLSTM) network extracts the global interactions between residues, and our proposed multi-scale bidirectional temporal convolutional network (MSBTCN) further captures the bidirectional multi-scale long-range features of residues while preserving the hidden layer information more comprehensively. In particular, we also propose that fusing the features of 3-state and 8-state PSSP can further improve the prediction accuracy. Moreover, we propose and compare multiple novel deep models by combining BLSTM with a temporal convolutional network (TCN), a reverse temporal convolutional network (RTCN), a multi-scale temporal convolutional network (MSTCN), BTCN, and MSBTCN, respectively. Furthermore, we demonstrate that the reverse prediction of secondary structure outperforms the forward prediction, suggesting that amino acids at later positions have a greater impact on secondary structure recognition. Experimental results on benchmark datasets including CASP10, CASP11, CASP12, CASP13, CASP14, and CB513 show that our methods achieve better prediction performance compared to five state-of-the-art methods.


Introduction
As a major research hotspot in bioinformatics, protein secondary structure prediction (PSSP) is undoubtedly an important task (Yang et al., 2018). The protein primary structure consists of a linear arrangement of amino acid residues (Kumar et al., 2020). The secondary structure is a specific spatial structure formed by the peptide chain curling or folding according to a certain rule. Further folding based on the secondary structure can form the tertiary structure. As a bridge connecting the primary and tertiary structures, the improvement of PSSP not only helps us understand the structure and function of proteins but also better predicts the tertiary structure (Wang et al., 2008;Yaseen and Li, 2014a;Wang et al., 2017). In addition, PSSP can also facilitate drug design. However, biological techniques for PSSP are time-consuming and expensive, so we can use computers and deep learning (LeCun et al., 2015) methods to improve secondary structure prediction.
Generally, the eight classes of protein secondary structure are G (3₁₀-helix), H (α-helix), I (π-helix), E (β-sheet), B (β-bridge), S (bend), T (turn), and C (coil) (Kabsch and Sander, 1983). Three classes of protein secondary structure can be formed by classifying H, G, and I as H (helix), E and B as E (strand), and other structures as C (coil) (Yaseen and Li, 2014b;Ma et al., 2018;Zhang et al., 2018). In recent years, research on 3-state PSSP has been more thorough; however, the 8-state secondary structure provides richer protein structural information.
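The 8-to-3 reduction described above is a fixed mapping; a minimal Python sketch:

```python
# The standard 8-to-3 reduction: H/G/I -> H (helix), E/B -> E (strand),
# everything else -> C (coil).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands
    "S": "C", "T": "C", "C": "C",   # bends, turns, coil -> coil
}

def reduce_to_q3(q8_labels):
    """Convert a string of 8-state labels into the 3-state alphabet."""
    return "".join(Q8_TO_Q3[c] for c in q8_labels)
```

This makes explicit that every 8-state label has exactly one 3-state image, which is what later allows the 3-state and 8-state predictions to be fused.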
In the early days of research, machine learning methods such as support vector machines (Hua and Sun, 2001;Yang et al., 2011), neural networks (Qian and Sejnowski, 1988;Faraggi et al., 2012), and k-nearest neighbors (Salzberg and Cost, 1992;Bondugula et al., 2005) were widely used for PSSP. Furthermore, the PSIPRED server used two feedforward neural networks to predict secondary structure (McGuffin et al., 2000). The JPred4 server used the JNet algorithm to improve accuracy (Drozdetskiy et al., 2015). However, these methods cannot extract the global information in the sequence well.
With the development and improvement of deep learning in recent years, neural networks with deep architectures have achieved remarkable results in various fields. The deep learning method can not only reduce the computational complexity but also effectively utilize the extracted information to improve the prediction accuracy. The SSpro applied profiles, BRNN and structural similarity to PSSP (Magnan and Baldi, 2014). The SPIDER3 server used the LSTM-BRNN model for 3-state PSSP (Heffernan et al., 2017). The SPOT-1D used ResNet to improve the SPIDER3 server (Hanson et al., 2019). The SAINT combined the self-attention mechanism and the Deep 3I network to improve PSSP (Uddin et al., 2020). However, these methods have complex network structures and high computational costs. In addition, Zhou et al. proposed a supervised generative stochastic network to predict secondary structure (Zhou and Troyanskaya, 2014). The DeepCNF combined conditional neural fields and shallow neural networks for prediction (Wang et al., 2016). Wang et al. (2017) proposed a deep recurrent encoder-decoder network for classification. The Porter 5 classifier used multiple BRNNs for prediction (Torrisi et al., 2018). The DeepCNN used multi-scale convolution to extract secondary structure features (Busia and Jaitly, 2017). The NetSurfP-2.0 combined CNN and LSTM to extract local and long-range interactions (Klausen et al., 2019). These methods can improve PSSP performance, but they are not only insufficient for long-range feature extraction but also fail to establish a good balance between local features and long-range features.
In recent years, temporal convolutional network (TCN) (Bai et al., 2018) has achieved remarkable performance (Lea et al., 2017), while outperforming popular models such as recurrent neural networks in most fields. TCN can only extract unidirectional features, but secondary structure prediction is influenced by past and future amino acids. To this end, we propose a bidirectional TCN (BTCN) by improving TCN, which can extract bidirectional deep dependencies between amino acids. Due to the waste of hidden layer information in BTCN, we further propose a multi-scale BTCN (MSBTCN), which can not only extract bidirectional features but also better preserve the feature information of intermediate residual blocks. However, MSBTCN may also introduce unnecessary information.
For high-dimensional long protein sequences, most existing methods with deep architectures not only lack long-range feature extraction capability but also ignore deep dependencies. In addition, a single model cannot extract key information in complex residue sequences and has great limitations. Therefore, this paper proposes a novel deep learning model that uses BTCN, a bidirectional long short-term memory (BLSTM) (Graves and Schmidhuber, 2005) network, and MSBTCN to improve the accuracy of PSSP. In the proposed model, the BTCN module using the sliding window technique can extract bidirectional deep local dependencies in protein sequences. The BLSTM module can extract the global interactions between amino acid residues. The MSBTCN module can further capture bidirectional deep long-range dependencies between residues, while better fusing and optimizing features. Our method can effectively utilize longer-term bidirectional feature information to model complex sequence-structure relationships. Due to the close correlation between 3-state and 8-state PSSP, we also propose to fuse the features of 3-state and 8-state PSSP for classification based on the model. Furthermore, this paper compares our proposed six novel deep models for PSSP by combining BLSTM with TCN, reverse TCN (RTCN), multi-scale TCN (MSTCN), BTCN, and MSBTCN, respectively. To evaluate the prediction performance of the model, we compare it with state-of-the-art methods on benchmark datasets. Experimental results show that our methods achieve better performance, effectively addressing incomplete and insufficient feature extraction.
The main contributions of this paper are as follows: 1) We propose BTCN by improving TCN, which can extract bidirectional deep dependencies in sequences. To enable BTCN to extract local features, we preprocess the sequences using a sliding window technique. 2) We further propose MSBTCN, which can not only extract bidirectional deep features between residues but also better preserve the information of hidden layers. 3) We propose a novel deep learning model using BTCN, BLSTM and MSBTCN, which outperforms five state-of-the-art methods and improves the prediction accuracy of secondary structure. 4) We propose multiple novel deep learning models by combining BLSTM with TCN, RTCN, MSTCN, BTCN, and MSBTCN respectively, which can effectively solve the disadvantage of low long-range dependency extraction ability in long sequences. 5) We experimentally demonstrate that the reverse prediction of secondary structure is superior to the forward prediction, suggesting that the recognition of secondary structure is more correlated with amino acids at later positions. 6) We experimentally demonstrate that the fusion of 3-state and 8-state PSSP features can further improve the prediction performance of the secondary structure, which also provides a new idea for PSSP.
Materials and methods

Bidirectional long short-term memory networks (BLSTM)

As shown in Figure 1, BLSTM consists of a forward LSTM (Hochreiter and Schmidhuber, 1997) and a backward LSTM. LSTM can automatically decide to discard unimportant information and retain useful information. For a standard LSTM cell at time t, the input feature is denoted as x_t, the output is denoted as h_t, and the cell state is denoted as c_t. The forget gate f, the input gate i, and the output gate o in the LSTM unit are calculated as follows:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid function, W is the weight matrix, b is the bias term, ⊙ is the element-wise multiplication, and tanh is the hyperbolic tangent function.
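As an illustration, the standard LSTM gate equations can be written as a single NumPy step. The weight layout and dictionary keys below are illustrative choices, not taken from the paper:

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing the standard gate equations.

    W: dict of weight matrices W["f"], W["i"], W["o"], W["c"], each of
    shape (hidden, hidden + input); b: dict of bias vectors.
    """
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    f = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i = sigmoid(W["i"] @ z + b["i"])           # input gate
    o = sigmoid(W["o"] @ z + b["o"])           # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate cell state
    c_t = f * c_prev + i * c_tilde             # element-wise (⊙) update
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

A BLSTM then runs one such recurrence forward over the sequence and another backward, concatenating the two hidden states at each position.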

Temporal convolutional networks (TCN)
TCN has superior performance in sequence processing while avoiding the gradient problem during training. In addition, TCN also has the characteristics of fast calculation speed, low memory, parallel operation and flexible receptive field.

Causal convolutions
TCN uses a one-dimensional fully convolutional network architecture, where the length of the input layer is the same as the length of each hidden layer, and zero padding is added to keep the front and back layers the same length. Therefore, TCN can map sequences of any length to output sequences of the same length. Furthermore, the network uses causal convolution, where the output at the current time is only determined by the feature inputs at the current time and past time. Therefore, information in TCN does not leak from the future to the past.
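The padding scheme behind causal convolution can be sketched in NumPy: left-padding with k − 1 zeros keeps the output the same length as the input while ensuring the output at time t depends only on inputs up to t.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1D causal convolution: y[t] = sum_i kernel[i] * x[t - i].

    Left-padding with (k - 1) zeros preserves the sequence length and
    prevents any leakage from the future to the past.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])
```

Changing any future input x[t'] with t' > t leaves y[t] untouched, which is the "no leakage" property the text describes.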

Dilated convolutions
However, causal convolution has inevitable limitations when dealing with sequences that require long-term historical information. Therefore, the network uses dilated convolution to increase the receptive field and obtain very long effective historical information, which is defined as:

F(s) = Σ_{i=0}^{k−1} f(i) · x_{s−d·i}

where F(s) is the dilated convolution operation, x is the input feature, d is the dilation factor, f is the filter, s is the element of the sequence, k is the filter size, and s − d·i represents the past direction.
As the number of layers and the dilation factor continue to increase (d = 2 i at level i), the output of the top layer will contain a wider range of input information.
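Under the configuration described above (dilation d = 2^i at level i), the receptive field of the stack can be computed directly. The helper below assumes one dilated convolution per level unless stated otherwise:

```python
def tcn_receptive_field(kernel_size, num_levels, convs_per_level=1):
    """Receptive field of stacked dilated causal convolutions with
    dilation d = 2**i at level i: each level adds
    convs_per_level * (kernel_size - 1) * 2**i positions of context."""
    rf = 1
    for i in range(num_levels):
        rf += convs_per_level * (kernel_size - 1) * 2 ** i
    return rf
```

For example, four levels with kernel size 3 already cover 31 positions, which is why the top layer sees a wide range of the input.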

FIGURE 1
The architecture of BLSTM.
Frontiers in Bioengineering and Biotechnology frontiersin.org

Residual connections
As shown in Figure 2, the network introduces residual connections to ensure the training stability of high-dimensional input, which is defined as:

o = Activation(X + F(X))

where X represents the input of the block and F(X) represents the output of the block after a series of operations.
The TCN architecture consists of multiple residual blocks. As shown in Figure 2, the residual block contains dilated causal convolutional layers, weight normalization layers, ReLU layers, and dropout layers. The TCN adds the input of each block to the output of the block (including a 1 × 1 convolution on the input when the number of channels between the input and output do not match).
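The residual shortcut with channel matching can be sketched as follows; in this sketch a fixed averaging matrix stands in for the learned 1 × 1 convolution, and `transform` is a stand-in for the block's dilated-convolution stack:

```python
import numpy as np

def residual_block(x, transform, channels_out):
    """Residual connection o = X' + F(X), where X' is X projected by a
    1x1 convolution whenever the input and output channel counts differ.
    x: array of shape (channels, length)."""
    fx = transform(x)                       # F(X): dilated convs, norm, ...
    if x.shape[0] != channels_out:
        # Stand-in for the learned 1x1 conv: average the input channels.
        proj = np.ones((channels_out, x.shape[0])) / x.shape[0]
        x = proj @ x
    return x + fx
```

With matching channel counts and a zero transform the block reduces to the identity, which is what makes deep stacks of such blocks stable to train.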

FIGURE 2
The architecture of TCN.

FIGURE 3
The architecture of BTCN.

The proposed bidirectional TCN (BTCN)
Since TCN uses dilated causal convolution, it can only transmit information from the past to the future. However, the recognition of secondary structure is not only determined by amino acids at previous positions but also influenced by amino acids at later positions. The unidirectionally transported TCN obviously cannot satisfy the comprehensive extraction of amino acid features, so we propose a BTCN model to adequately capture the bidirectional deep dependencies between residues.
As shown in Figure 3, the architecture of BTCN consists of forward TCN and backward TCN. Since the dilated causal convolution performs one-way operation on the sequence, we input the reverse sequence to the backward TCN for reverse feature extraction of the network.
Letting X = (x_1, x_2, …, x_L) denote a protein sequence of length L and X← = (x_L, x_{L−1}, …, x_1) denote its reverse, the BTCN can be expressed as follows:

Ŷ = TCN→(X)
Ŷ_1 = TCN←(X←)
Output = softmax(W · 1DCov(Ŷ ⊕ Ŷ_1←) + b)

where TCN→ is the forward TCN whose input is the forward sequence X, TCN← is the backward TCN whose input is the reverse sequence X←, ⊕ is the addition operation of the matrix, Ŷ and Ŷ_1 are the outputs of the forward and backward TCN respectively, Ŷ_1← is the reverse matrix of Ŷ_1, 1DCov is the 1D convolution operation of the residual block, W and b are the weight matrix and bias term of the fully connected layer, softmax is the activation function for classification, and Output is the final output of BTCN.
The output ŷ_t of the network at the current time t is therefore determined by the input of the entire sequence:

ŷ_t = f(x_1, x_2, …, x_L)

We denote the forward TCN→ as TCN; because the input of the backward TCN← is the reverse sequence, it is also called reverse TCN and denoted as RTCN. Therefore, the architecture of the network is BTCN = 1DCov(TCN + RTCN), where 1DCov can further optimize the features. In the network, TCN operates on inputs at time t and before (x_1, x_2, …, x_t), and RTCN operates on inputs at time t and after (x_L, x_{L−1}, …, x_t). Therefore, the network can utilize bidirectional deep interactions to facilitate secondary structure recognition through forward and backward extraction of residue features. Furthermore, BTCN is not limited to PSSP; it applies to all sequences that require global semantics.
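The composition BTCN = 1DCov(TCN + RTCN) can be sketched with stand-in callables for the trained sub-networks (the three function arguments below are placeholders, not the paper's implementation):

```python
import numpy as np

def btcn(x, tcn_forward, tcn_backward, conv1d):
    """BTCN = 1DCov(TCN(X) ⊕ RTCN(X)): the backward branch runs a TCN on
    the time-reversed sequence and its output is reversed back before the
    element-wise addition. x: array of shape (features, length)."""
    y_fwd = tcn_forward(x)                        # Ŷ  = TCN→(X)
    y_bwd = tcn_backward(x[:, ::-1])[:, ::-1]     # Ŷ₁←: reverse in, reverse out
    return conv1d(y_fwd + y_bwd)                  # fuse with a 1D convolution
```

Reversing the backward branch's output before the addition is what aligns the two feature maps position-by-position.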

The proposed multi-scale bidirectional TCN (MSBTCN)
In a unidirectional TCN, the output ŷ_q of the q-th residual block is:

ŷ_q = ResidualBlock_q(ŷ_{q−1})

where the input ŷ_{q−1} is the output of the previous block. As the number of layers in BTCN increases, the receptive field of the network continues to expand. However, since BTCN adopts the dilated convolutional architecture, the hidden layers in the middle of the network lose a lot of important feature information. Therefore, we further propose the MSBTCN model to more comprehensively utilize residue features for classification. The improved MSBTCN can not only extract bidirectional multi-scale features but also better preserve the key information of the intermediate residual blocks. As shown in Figure 4, the MSBTCN can utilize the complete information of all layers for prediction, which effectively prevents the waste of weight information in hidden layers. The output ŷ of a unidirectional MSTCN with n residual blocks fuses the outputs of all blocks:

ŷ = 1DCov(ŷ_1 ⊕ ŷ_2 ⊕ ⋯ ⊕ ŷ_n)

The improved MSBTCN can not only capture the bidirectional deep features but also utilize the key information of the intermediate residual blocks for prediction.
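The multi-scale idea, keeping every residual block's output rather than only the last, can be sketched as follows. Concatenation is used here as one plausible fusion operator; the paper's Figure 4 may fuse the scales differently:

```python
import numpy as np

def multi_scale_outputs(x, blocks):
    """Run residual blocks sequentially and keep every block's output so
    the prediction can use all scales, not only the deepest one (the idea
    behind MSTCN/MSBTCN). Returns the per-block outputs stacked along the
    feature axis. x: (features, length); blocks: list of callables."""
    outputs, y = [], x
    for block in blocks:
        y = block(y)          # ŷ_q = ResidualBlock_q(ŷ_{q-1})
        outputs.append(y)
    return np.concatenate(outputs, axis=0)
```

A plain TCN would discard everything in `outputs` except the last entry; keeping all of them is what "preserving the hidden layer information" refers to.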

Overall architecture of the proposed model
To better improve the prediction of secondary structure, as shown in Figure 5, the proposed model uses BTCN, BLSTM and MSBTCN to extract deep interactions of residue sequences. The proposed model can be divided into five parts: input, BTCN module, BLSTM module, MSBTCN module and output.
In the input part, we first transform the protein data into 20-dimensional PSSM features and 21-dimensional one-hot features.

FIGURE 4
The architecture of MSBTCN.
Then, we use the hybrid feature PSSM + one-hot of size 41 × L as the input of the model, where L is the length of the protein sequence.
In the BTCN module, to enable the network to extract local features, we use the sliding window technique to segment the input features into short sequences of 41 × W, where W is the window size. The input and output of BTCN for sequence-to-sequence prediction must be the same length, so we put the secondary structure label corresponding to the segmented amino acid feature at the W position and fill the remaining positions with 0. We then use the modified BTCN to extract bidirectional deep local dependencies in amino acid sequences. Since W is generally less than 20, four residual blocks are sufficient to capture bidirectional amino acid information in the sequence. We use three dilated causal convolutional layers with the same dilation factor in the residual block. After the dilated causal convolutional layer, we add an instance normalization layer to accelerate model convergence, a ReLU activation layer to prevent vanishing gradients, and a spatial dropout layer to avoid overfitting. We use the Transform layer to process the extracted local features into 20 × L sequences. Then, the Concatenate layer merges the local features with the input features into 61 × L sequences.
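The sliding-window segmentation can be sketched as follows. The centring and symmetric zero-padding are assumptions for illustration, since the paper specifies only the window width W:

```python
import numpy as np

def sliding_windows(features, window):
    """Segment a (41, L) feature matrix into L windows of shape
    (41, window), zero-padding both ends so that each residue sits at
    the centre of its own window (window assumed odd)."""
    half = window // 2
    padded = np.pad(features, ((0, 0), (half, half)))
    return np.stack([padded[:, i:i + window]
                     for i in range(features.shape[1])])
```

Each of the L windows then becomes one short 41 × W input for the BTCN module.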
In the BLSTM module, we use two bidirectional LSTM layers with powerful analytical capabilities to extract key global interactions in protein sequences. Additionally, we add two dropout layers to ensure gradient stabilization during training.
In the MSBTCN module, we use four and five residual blocks to optimize the extracted local and global features, respectively, while further capturing the deeper bidirectional long-range dependencies between amino acid residues, which can more comprehensively preserve the important information of the hidden layer. Since MSBTCN has a flexible receptive field and stable computation, it can interact and control sequence information more accurately, while quickly optimizing and fusing the extracted features.
In the output part, we use a residual block to process and optimize the features extracted by MSBTCN and BLSTM modules. Finally, we use a fully connected layer and softmax function to complete the classification.
It should be noted that we also extract the Concatenate layer features of the output part of the model in 3-state and 8-state PSSP respectively and fuse them into 80-dimensional features for secondary structure prediction, where the fused features are denoted as FF 3-8 . The FF 3-8 contains both 3-state and 8-state secondary structure label information, which can better model the sequence-structure mapping relationship between input features and secondary structures. The proposed method can effectively exploit more complex and longer-term global dependencies to improve the accuracy of PSSP through comprehensive processing of protein sequences.

Datasets
The PISCES (Wang and Dunbrack, 2005) server produces lists of sequences from the Protein Data Bank (PDB) based on chain-specific criteria and mutual sequence identity, which are widely used for PSSP. Therefore, we selected 14,991 proteins from the PDB to compose the CullPDB (Wang and Dunbrack, 2005) dataset based on a percentage identity cutoff of 25%, an R-factor cutoff of 0.25, and a resolution cutoff of 3 Å. To ensure the accuracy of the 8-state secondary structure information, we use the state assignments of the DSSP (Kabsch and Sander, 1983) program. We removed proteins in the training set that duplicated the test set, as well as proteins with lengths less than 40 or greater than 800. The final CullPDB dataset contains 14,562 protein chains. For better evaluation, we further randomly divide the dataset into three parts: a training set (11,650), a validation set (1,456), and a test set (1,456). All experimental results are averaged over three independent runs.
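The random 11,650 / 1,456 / 1,456 split can be reproduced schematically; the seed and shuffling procedure below are illustrative, as the paper does not specify them:

```python
import random

def split_dataset(chains, n_val=1456, n_test=1456, seed=0):
    """Randomly split protein chains into train/validation/test sets with
    the sizes used above (11,650 / 1,456 / 1,456 for 14,562 chains)."""
    chains = list(chains)
    random.Random(seed).shuffle(chains)
    val = chains[:n_val]
    test = chains[n_val:n_val + n_test]
    train = chains[n_val + n_test:]
    return train, val, test
```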
To evaluate the performance of the proposed model, we also use the CASP10 (Moult et al., 2014), CASP11 (Moult et al., 2016), CASP12 (Moult et al., 2018), CASP13 (Kryshtafovych et al., 2019), CASP14 (Kryshtafovych et al., 2021), and CB513 (Cuff and Barton, 1999) datasets as test sets, where the numbers of proteins and residues in the six benchmark datasets are shown in Table 1. The first five datasets are from the Critical Assessment of Protein Structure Prediction (CASP) website https://predictioncenter.org/.

Feature representation
In this study, we used two amino acid encoding methods: one-hot encoding and the position-specific scoring matrix (PSSM) (Jones, 1999). Protein sequence databases contain 20 standard amino acid types (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, and V) and six non-standard amino acid codes such as B, Z, and X. Since the non-standard types occur particularly rarely, they can be pooled into a single class. Therefore, we consider the protein sequence to consist of 21 amino acid types.
An amino acid sequence of length L can be represented as a 21 × L feature matrix by one-hot encoding, where 21 represents the number of amino acid types, the position corresponding to the amino acid type is 1, and the other positions are 0. Each amino acid type in one-hot encoding has an independent number, which makes the vector representations of different amino acid types mutually orthogonal, so this method can also be called orthogonal encoding.
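A minimal one-hot encoder matching the 21-type alphabet described above (the ordering of the alphabet is an illustrative choice):

```python
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # the 20 standard residue types
ALPHABET = AMINO_ACIDS + "X"           # non-standard residues pooled as "X"

def one_hot_encode(sequence):
    """Encode a residue sequence as a (21, L) orthogonal matrix: a 1 at
    the row of the residue type, 0 elsewhere; non-standard residues
    (B, Z, ...) are mapped to the pooled class."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    mat = np.zeros((len(ALPHABET), len(sequence)))
    for pos, aa in enumerate(sequence):
        mat[index.get(aa, index["X"]), pos] = 1.0
    return mat
```

Because every column has exactly one nonzero entry, the encodings of different residue types are mutually orthogonal, as the text notes.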
PSSM is a scoring matrix based on the alignment of the sequence itself with multiple sequences. This encoding method contains rich biological evolution information, so it is widely used for protein sequence representation in PSSP. In the experiments, PSSM was generated by PSI-BLAST (Altschul et al., 1997) with parameters including a threshold of 0.001 and 3 iterations. A 20 × L PSSM matrix represents a protein sequence of length L, where 20 is the number of standard amino acid types, that is, each row corresponds to one amino acid residue type.

Evaluation metrics
In this paper, we use four metrics to evaluate the performance of the proposed model: Q3 accuracy, Q8 accuracy, and the Segment overlap (Sov) (Zemla et al., 1999) scores for 3-state and 8-state PSSP.
The 8-state secondary structure classes are H, G, I, E, B, S, T, and C, while the 3-state classes are H, E, and C. Q3 and Q8 accuracy are the ratios of the number of correctly predicted residues to the total number of residues N:

Q3 = (S_H + S_E + S_C) / N × 100
Q8 = (S_H + S_G + S_I + S_E + S_B + S_C + S_T + S_S) / N × 100

where S_i (i ∈ {H, E, C} or {H, G, I, E, B, C, T, S}) is the number of correctly predicted residues of type i. Letting N_i denote the total number of observed residues of type i, the prediction accuracy Q_i of a single type i is defined as:

Q_i = S_i / N_i × 100

Sov is a metric based on the ratio of overlapping segments, which is defined as:

Sov = 100 × (1 / N_Sov) · Σ_{S_0} [ (minov(S_1, S_2) + σ(S_1, S_2)) / maxov(S_1, S_2) ] · length(S_1)

where N_Sov is the total number of residues in the protein sequence, S_1 ranges over the observed structural segments, S_2 over the predicted segments, S_0 is the set of segment pairs (S_1, S_2) with the same structure, length(S_1) is the residue length of S_1, maxov(S_1, S_2) is the length of the union of segments S_1 and S_2, and minov(S_1, S_2) is the length of their intersection. The factor σ(S_1, S_2) allows variation at the segment edges, which is defined as:

σ(S_1, S_2) = min{ maxov(S_1, S_2) − minov(S_1, S_2), minov(S_1, S_2), ⌊length(S_1)/2⌋, ⌊length(S_2)/2⌋ }
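The residue-level metrics above are straightforward to implement; a sketch of Q3/Q8 and the single-type accuracy Q_i (the segment-based Sov score is omitted here for brevity):

```python
def q_accuracy(predicted, observed):
    """Q3/Q8 accuracy: percentage of residues whose predicted state
    matches the observed state."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def per_type_accuracy(predicted, observed, state):
    """Q_i for one structure type: correctly predicted residues of that
    type over all observed residues of that type."""
    total = sum(o == state for o in observed)
    correct = sum(p == o == state for p, o in zip(predicted, observed))
    return 100.0 * correct / total if total else float("nan")
```

The same two functions serve both the 3-state and the 8-state evaluations; only the label alphabet changes.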

Performance analysis of the proposed model
To make the proposed model have good performance when dealing with long protein sequences, we conduct extensive experiments on the CullPDB dataset without using FF 3-8 . For the three modules in the proposed model, we show and analyze the effect of different hyperparameters on the prediction performance in experiments.

Effect of BTCN module parameters
To explore the effect of the sliding window size and filter parameters on the proposed model, we conduct comparative experiments on the validation and test sets. Since the recognition of secondary structure is mainly influenced by amino acids at the current and adjacent positions, we used sliding window sizes of 13, 15, 17, and 19 to segment protein sequences. As shown in Figures 6A, B, when the sliding window size is 19, the model achieves the highest Q3 and Q8 accuracy on the two datasets; when the window is too small or too large, important amino acid information at key positions is lost or ignored. Figures 6C, D show the Q3 and Q8 accuracy of the proposed model under different numbers and sizes of filters. When the number and size of filters are 512 and 5, the model achieves the best results on the validation and test sets. The main reason is that the filter size determines the local extent of capture, which affects the extraction of key features between residues. Furthermore, the number of channels in the convolution not only affects the prediction performance but also determines the model size and training time.

Effect of BLSTM module parameters
To verify the effect of the number of hidden units in the BLSTM layer on the proposed model, we conduct comparative experiments on the validation and test sets with 1,000, 1,200, 1,500, 1,800, 2,000, 2,200, and 2,500 hidden units. Figures 7A, B show that the classification accuracy of the model on the two datasets increases as the number of hidden units increases. When the number of hidden units is 2,500, the model achieves the best prediction performance in 3-state and 8-state PSSP. The main reason is that the number of hidden units determines the expressivity of high-dimensional protein sequences. However, if the number of hidden units is too large, the model will not only train more slowly but may also overfit.

Effect of MSBTCN module parameters
The performance of the MSBTCN module composed of residual blocks is closely related to the number of blocks, so we optimize the extracted local features with 3-8 residual blocks, respectively. Figures 7C, D show the 3-state and 8-state accuracy of the proposed model on the validation and test sets. The Q3 accuracy on the two datasets reaches its maximum when the model uses 4 residual blocks, while the Q8 accuracy reaches its maximum with 4 and 5 blocks, respectively. This is because the number of residual blocks determines the depth of our model: when the depth is insufficient, the model cannot capture deeper dependencies, while greater depth increases complexity and the risk of overfitting.

FIGURE 8
Reverse representation of amino acid and secondary structure sequences.
performance of the BTCN model, which can effectively capture the bidirectional deep interactions between residues and improve the prediction accuracy.
In addition, the table shows that the prediction accuracy of RTCN is consistently better than TCN on the seven datasets, and the Q8 accuracy is improved by an average of 1.18%. The reverse amino acid sequence is shown in Figure 8. The results show that the reverse prediction of the secondary structure is superior to the forward prediction, which indicates that amino acids at later positions have a greater impact on the overall recognition of the secondary structure when features are extracted unidirectionally. The main reason for the low prediction accuracy of TCN is that its broad receptive field ignores the important information of adjacent amino acids when the sliding window technique is not used, whereas prediction over the whole sequence better reflects the influence of amino acids at the front and rear positions on PSSP. The single-type prediction accuracy of TCN and RTCN is shown in Table 3. The accuracy of types H, G, B, C, S, and T improved while the accuracy of type E decreased on most datasets. This also demonstrates that the recognition of most secondary structure types is more relevant to amino acid information from later positions.

Comparison of the six proposed models
To verify the performance of different feature extraction modules in PSSP, we propose six novel deep learning models by combining BLSTM with TCN, RTCN, MSTCN, BTCN and MSBTCN, respectively. We use the same feature extraction process and parameters in all models. As shown in Table 4, the BLSTM-BTCN-MSBTCN model achieves the highest Q3 and Q8 accuracy on the eight datasets except CASP13. In addition, it can be observed that BLSTM-RTCN has better prediction performance on most datasets than BLSTM-TCN, which indicates that the reverse prediction of secondary structure can achieve higher accuracy. After the sequence is processed by the sliding window method, the effect of the amino acids in the front and rear positions on the prediction performance is not much different, so the advantage of RTCN is not obvious. The table shows that the prediction performance of the BLSTM-BTCN and BLSTM-MSBTCN models is significantly better than the models with unidirectional feature extraction on all datasets, which proves that our proposed BTCN and MSBTCN can fully exploit the bidirectional long-range interaction to improve the prediction accuracy. Although BLSTM-MSBTCN can capture multi-scale feature information, its prediction

Comparison with state-of-the-art methods
In this section, we compare the proposed model with five state-of-the-art models on the seven datasets CullPDB, CASP10, CASP11, CASP12, CASP13, CASP14, and CB513, using Q3 accuracy, Q8 accuracy, and Sov score as evaluation measures. Among the compared models, DCRNN (Li and Yu, 2016) is an end-to-end deep network that uses convolutional neural networks with different kernel sizes and recurrent neural networks with gated units to extract multi-scale local features and long-range dependencies in protein sequences. CNN_BIGRU (Drori et al., 2018) combines a convolutional network and a bidirectional GRU to predict secondary structure. DeepACLSTM (Guo et al., 2019) combines asymmetric convolutional networks and bidirectional long short-term memory networks to improve secondary structure prediction accuracy. These three algorithms are all combinations of convolutional and recurrent neural networks, but their structures differ. MUFOLD-SS (Fang et al., 2018) uses a Deep inception-inside-inception (Deep3I) network to handle local and global dependencies in sequences. ShuffleNet_SS (Yang et al., 2022) uses a lightweight convolutional network and a label-distribution-aware margin loss to improve the network's learning ability for rare classes. For a fair comparison, we train all methods on our dataset, where the input is the hybrid feature PSSM + one-hot.
The prediction results of the proposed methods and the five existing popular methods DCRNN, CNN_BIGRU, DeepACLSTM, MUFOLD-SS, and ShuffleNet_SS on benchmark datasets are shown in Tables 5 and 6. The tables show that our model consistently outperforms the five state-of-the-art methods on the seven datasets in terms of Q3 accuracy, Q8 accuracy, and Sov scores for 3-state and 8-state PSSP. This is mainly attributed to the powerful and comprehensive feature extraction capability of the proposed model, which enables bidirectional deep local and long-range interactions in residue sequences to be fully extracted and used for prediction. Compared to our model without FF 3-8, the model with FF 3-8 achieves the best 3-state PSSP performance on all datasets and the highest 8-state PSSP accuracy in most cases. The experimental results show that the important correlation between 3-state and 8-state PSSP can mutually promote the recognition of secondary structure. In particular, the accuracy of 3-state PSSP is significantly improved after adding the 8-state PSSP feature. Furthermore, our model size is 13.8 MB, while the model using FF 3-8 is 14.3 MB. The model sizes of DCRNN, CNN_BIGRU, DeepACLSTM, MUFOLD-SS, and ShuffleNet_SS are 18.1 MB, 15.8 MB, 20.6 MB, 17.6 MB, and 3.9 MB, respectively. Although our model is smaller than only four of the five popular methods, it achieves state-of-the-art performance in PSSP. For high-dimensional long sequences, our model can also effectively utilize a broad and flexible receptive field to capture longer-term key dependencies between residues, so it can better model the complex relationship between sequence and structure.

The single-type accuracy of the 8-state PSSP
In 8-state PSSP, the single-type accuracy of the proposed model without and with FF 3-8 on seven datasets is shown in Table 7. It can be seen from the table that there are obvious differences in the prediction accuracy of the eight structures. The main reason is that the frequency of occurrence of various types is too different, and the number of structure type I is almost 0. It can be observed that the accuracy of structure types G, E, B and T is

Conclusion
In this paper, we propose a novel deep learning model for PSSP using BTCN, BLSTM and MSBTCN. In the proposed model, we use a modified BTCN module to extract bidirectional deep local dependencies in protein sequences segmented by the sliding window technique. Then, we use the BLSTM module to extract the global interactions between amino acids. We also use a modified MSBTCN module to further capture the bidirectional key long-range dependencies between residues while better optimizing and fusing the extracted features, which prevents information waste in hidden layers. The proposed model has strong stability and feature extraction ability, and it can not only effectively solve the shortcomings of insufficient extraction of deep long-range dependencies in sequences but also overcome the weaknesses of each module. Due to the close correlation between the 3-state and the 8-state, we also use the fusion feature FF 3-8 based on the proposed model to further improve the performance of PSSP, which is also a new idea for PSSP. Moreover, this paper compares the six PSSP models we propose by combining BLSTM with TCN, RTCN, MSTCN, BTCN, and MSBTCN, respectively. In addition, we experimentally demonstrated that the reverse prediction of secondary structure can achieve higher accuracy, which indicates that amino acids at later positions are more correlated with secondary structure recognition than amino acids at earlier positions. We evaluate the performance of the proposed model on benchmark datasets such as CASP10, CASP11, CASP12, CASP13, CASP14, and CB513 using Q3 accuracy, Q8 accuracy and Sov score. Experimental results show that our method has better prediction performance compared to state-of-the-art methods. Our methods can fully use the diverse deep features in residue sequences for prediction to better model the complex mapping relationship between sequences and structures, thereby improving the accuracy of PSSP. 
Our models are not limited to PSSP but are applicable to all data that rely on bidirectional information. When dealing with other real sequence data, BTCN may ignore some information while MSBTCN may introduce unimportant information. Therefore, in the future, we will