AUTHOR=Cheng Ning , Chen Yue , Gao Wanqing , Liu Jiajun , Huang Qunfu , Yan Cheng , Huang Xindi , Ding Changsong TITLE=An Improved Deep Learning Model: S-TextBLCNN for Traditional Chinese Medicine Formula Classification JOURNAL=Frontiers in Genetics VOLUME=Volume 12 - 2021 YEAR=2021 URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.807825 DOI=10.3389/fgene.2021.807825 ISSN=1664-8021 ABSTRACT=Purpose: This study proposes an S-TextBLCNN model for classifying traditional Chinese medicine (TCM) formulae by efficacy. The model uses deep learning to analyze the relationship between herb efficacy and formula efficacy, which helps in further exploring the internal rules of formula combination. Methods: First, for the TCM herbs extracted from the Chinese Pharmacopoeia, natural language processing (NLP) is used to learn quantitative representations of the different herbs. Three features, herb name, herb properties, and herb efficacy, are selected to encode the herbs and construct formula-vectors and herb-vectors. Then, based on 2664 formulae for stroke collected from the TCM literature and 19 formula efficacy categories extracted from Yifang Jijie, an improved deep learning model, TextBLCNN, consisting of a bidirectional Long Short-Term Memory (Bi-LSTM) neural network and a Convolutional Neural Network (CNN), is proposed. A binary classifier is established for each of the 19 efficacy categories to classify the TCM formulae. Finally, to address the imbalance of the formula data, the over-sampling method SMOTE is applied, yielding the S-TextBLCNN model. Results: The formula-vector built from herb efficacy features performs best in the classification model, reflecting the strong relationship between herb efficacy and formula efficacy.
The TextBLCNN model achieves an accuracy of 0.858 and an F1-score of 0.762, both higher than Logistic Regression (acc = 0.561, F1-score = 0.567), SVM (acc = 0.703, F1-score = 0.591), LSTM (acc = 0.723, F1-score = 0.621), and TextCNN (acc = 0.745, F1-score = 0.644). In addition, applying the over-sampling method SMOTE to tackle data imbalance improves the F1-score by an average of 47.1% across the 19 binary classifiers. Conclusion: The combination of the formula feature representation and the S-TextBLCNN model achieves more accurate formula efficacy classification. It offers a new research direction for the study of TCM formula compatibility.
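The abstract reports that SMOTE over-sampling drives the large F1-score gains on the imbalanced formula data. As an illustration of the technique (not the authors' code; the function name and toy data below are hypothetical), a minimal NumPy sketch of basic SMOTE, which synthesizes minority-class samples by interpolating each sample toward one of its k nearest minority-class neighbours:

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from a minority-class matrix
    (rows = samples) by linear interpolation toward random neighbours."""
    rng = np.random.default_rng(rng)
    n, d = minority.shape
    # pairwise distances within the minority class; exclude self-matches
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample
    synthetic = np.empty((n_new, d))
    for i in range(n_new):
        j = rng.integers(n)                        # random minority sample
        nb = minority[rng.choice(neighbours[j])]   # one of its k neighbours
        gap = rng.random()                         # interpolation factor in [0, 1)
        synthetic[i] = minority[j] + gap * (nb - minority[j])
    return synthetic

# toy imbalanced setting: 4 minority points in 2-D, up-sampled to 8 extra
minority = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
new = smote(minority, n_new=8, k=3, rng=0)
print(new.shape)  # (8, 2)
```

The synthetic points lie on segments between existing minority samples, so the classifier sees a denser but plausible minority region rather than exact duplicates, which is why SMOTE typically helps F1 more than simple replication.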