AUTHOR=Zhang Hua, Gou Ruoyun, Shang Jili, Shen Fangyao, Wu Yifan, Dai Guojun
TITLE=Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
JOURNAL=Frontiers in Physiology
VOLUME=Volume 12 - 2021
YEAR=2021
URL=https://www.frontiersin.org/journals/physiology/articles/10.3389/fphys.2021.643202
DOI=10.3389/fphys.2021.643202
ISSN=1664-042X
ABSTRACT=Speech emotion recognition (SER) is a difficult and challenging task because of the affective variance between different speakers. The performance of SER relies heavily on the features extracted from speech signals, and how to build effective feature extraction and classification models is still under intense research. In this paper, we propose a new method for SER based on a Deep Convolutional Neural Network and Bidirectional Long Short-Term Memory with an Attention model (ADCNN-BLSTM). We first preprocess the speech samples by data enhancement and dataset balancing. Secondly, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as the DCNN input. Then a DCNN model pre-trained on the ImageNet dataset is applied to generate segment-level features, which are stacked over a sentence into utterance-level features. Next, we adopt a BLSTM to learn high-level emotional features for temporal summarization, followed by an attention layer that focuses on emotionally relevant features. Finally, the learned high-level emotional features are fed to a deep neural network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases achieve an unweighted average recall (UAR) of 87.86% and 68.50%, respectively, surpassing most popular methods and demonstrating the effectiveness of the proposed method for SER.
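
The abstract describes feeding three-channel log Mel-spectrograms (static, delta, delta-delta) to the DCNN. The following is a minimal sketch of that feature-extraction step, not the authors' code; the sample rate, number of Mel bands, and file path are illustrative assumptions.

```python
# Sketch of the three-channel log Mel-spectrogram input (static, delta, delta-delta)
# described in the abstract. Parameters (sr=16000, n_mels=64) are assumptions.
import numpy as np
import librosa

def three_channel_log_mel(wav_path, sr=16000, n_mels=64):
    """Return a (3, n_mels, frames) array: static, delta, and delta-delta channels."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                 # static channel
    delta = librosa.feature.delta(log_mel)             # first-order temporal derivative
    delta2 = librosa.feature.delta(log_mel, order=2)   # second-order derivative
    return np.stack([log_mel, delta, delta2], axis=0)

# Usage (hypothetical file): features = three_channel_log_mel("sample.wav")
# Fixed-length time segments of this array would then form the DCNN inputs.
```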
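The BLSTM-with-attention stage summarizes the stacked segment-level DCNN features of an utterance and passes the result to a DNN classifier. The sketch below, in PyTorch, illustrates one plausible form of that stage under assumed dimensions (feature size 2048, hidden size 128, four emotion classes); it is not the published implementation.

```python
# Assumed PyTorch sketch of the BLSTM + attention + DNN classifier stage.
import torch
import torch.nn as nn

class BLSTMAttentionClassifier(nn.Module):
    def __init__(self, feat_dim=2048, hidden=128, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # one attention score per time step
        self.dnn = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, x):                              # x: (batch, segments, feat_dim)
        h, _ = self.blstm(x)                           # (batch, segments, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)         # attention weights over segments
        utt = (w * h).sum(dim=1)                       # emotion-weighted utterance vector
        return self.dnn(utt)                           # emotion class logits

# Usage with random segment-level features: 8 utterances x 30 segments x 2048 dims.
# logits = BLSTMAttentionClassifier()(torch.randn(8, 30, 2048))
```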