AUTHOR=Zhao Xiaoming, Liao Yuehui, Tang Zhiwei, Xu Yicheng, Tao Xin, Wang Dandan, Wang Guoyu, Lu Hongsheng
TITLE=Integrating audio and visual modalities for multimodal personality trait recognition via hybrid deep learning
JOURNAL=Frontiers in Neuroscience
VOLUME=Volume 16 - 2022
YEAR=2023
URL=https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2022.1107284
DOI=10.3389/fnins.2022.1107284
ISSN=1662-453X
ABSTRACT=Personality trait recognition has recently become an active research topic in psychology, affective neuroscience, and artificial intelligence. To effectively exploit spatio-temporal cues in audio-visual modalities, this paper proposes a new multimodal personality trait recognition method that integrates audio and visual modalities within a hybrid deep learning framework comprising convolutional neural networks (CNN), a bi-directional long short-term memory network (Bi-LSTM), and a Transformer network. In particular, a pre-trained deep audio CNN model is used to learn high-level segment-level audio features, while a pre-trained deep face CNN model is leveraged to separately learn high-level frame-level global scene features and local face features from each frame of the dynamic video sequences. These extracted deep audio-visual features are then fed into a Bi-LSTM and a Transformer network to separately capture long-term temporal dependencies, producing the final global audio and visual features for the downstream tasks. Finally, linear regression is employed for the single audio-based and visual-based personality trait recognition tasks, followed by a decision-level fusion strategy that produces the final Big Five personality scores and interview scores. Experimental results on the public ChaLearn First Impressions V2 personality dataset demonstrate the effectiveness of our method, which outperforms the other methods compared.
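
The abstract outlines a pipeline in which pre-trained CNN features are passed through Bi-LSTM and Transformer temporal encoders, regressed to trait scores per modality, and combined by decision-level fusion. The following is a minimal PyTorch sketch of that kind of pipeline; all module names, feature dimensions, layer counts, and the equal fusion weights are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the abstract's pipeline: pre-extracted CNN features ->
# Bi-LSTM / Transformer temporal encoders -> linear regression head ->
# decision-level fusion. All hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class TemporalRegressor(nn.Module):
    """Encodes a sequence of segment/frame-level features and regresses
    the five personality traits plus the interview score (6 outputs)."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256, n_outputs: int = 6):
        super().__init__()
        # Bi-LSTM branch for long-term temporal dependencies.
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Transformer encoder branch over the same feature sequence.
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Linear regression head on the concatenated pooled representations.
        self.head = nn.Linear(2 * hidden + feat_dim, n_outputs)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) pre-extracted CNN features.
        lstm_out, _ = self.bilstm(feats)               # (B, T, 2 * hidden)
        trans_out = self.transformer(feats)            # (B, T, feat_dim)
        pooled = torch.cat([lstm_out.mean(dim=1), trans_out.mean(dim=1)], dim=-1)
        return torch.sigmoid(self.head(pooled))        # trait scores in [0, 1]


# Decision-level fusion: average the per-modality predictions
# (equal weights are an assumption made for illustration).
audio_model, visual_model = TemporalRegressor(), TemporalRegressor()
audio_feats = torch.randn(4, 20, 512)    # e.g., 20 audio segments per clip
visual_feats = torch.randn(4, 20, 512)   # e.g., 20 video frames per clip
fused_scores = 0.5 * audio_model(audio_feats) + 0.5 * visual_model(visual_feats)
print(fused_scores.shape)  # torch.Size([4, 6])
```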