AUTHOR=Wang Lu TITLE=Multimodal robotic music performance art based on GRU-GoogLeNet model fusing audiovisual perception JOURNAL=Frontiers in Neurorobotics VOLUME=Volume 17 - 2023 YEAR=2024 URL=https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2023.1324831 DOI=10.3389/fnbot.2023.1324831 ISSN=1662-5218 ABSTRACT=The field of multimodal robotic musical performing arts has recently attracted a great deal of attention, \textcolor{red}{and while this field has significant potential for innovation, open questions remain about how best to integrate visual, auditory, and affective information to achieve better performances and to process multimodal data. This paper therefore explores the innovative application of multimodal robots that integrate visual and auditory perception in music performance. } First, we introduce multimodal robots: advanced systems equipped with multiple sensory modalities such as vision and hearing. We then highlight the limitations of conventional robots in understanding emotion and artistic expression in musical performances. To address these issues, we argue for integrating audiovisual perception, including emotion analysis, to improve the quality and artistic expression of robots in musical performances. Finally, we propose a new method that integrates GRU and GoogLeNet models for sentiment analysis. \textcolor{red}{The GRU model processes audio data effectively, capturing the temporal dynamics of musical elements and modeling long-term dependencies in audio signals, which we use to extract emotional information from music data. Meanwhile, the GoogLeNet model is adept at image processing, extracting complex visual details and capturing aesthetic features from visual data, which provides powerful image recognition and understanding for our model.
This synergy deepens the understanding of musical and visual elements, resulting in more emotionally resonant and interactive robot performances. } By combining these modalities, our method achieves a comprehensive understanding of music performance. Experimental results demonstrate the effectiveness of our approach, showing significant progress in music performance through multimodal robots. Multimodal robots that merge audiovisual perception in music performance not only enrich and expand the art form but also enable diverse, personalized human-machine interaction. Additionally, our approach offers valuable insights for the development of multimodal robots in music performance. \textcolor{red}{This research demonstrates the great potential of multimodal robots in the field of music performance. It promotes the deep integration of technology and art, opening up a new realm of performing arts and human-robot interaction, creating a fascinating and innovative experience.}
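The abstract describes a GRU encoder for the audio stream fused with GoogLeNet-derived visual features for sentiment classification. The paper itself defines the full architecture; the snippet below is only a minimal pure-Python sketch of that late-fusion idea, using the standard GRU update equations. All function names, dimensions, and random weights here are illustrative assumptions, and the visual embedding is a hand-written stand-in for GoogLeNet output, not an actual GoogLeNet forward pass.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x, h, p):
    """One standard GRU step: update gate z, reset gate r, candidate n."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(a + b) for a, b in zip(matvec(p["Wn"], x), matvec(p["Un"], rh))]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def encode_audio(frames, p, hidden):
    """Run the GRU over a sequence of audio feature frames; final state is the embedding."""
    h = [0.0] * hidden
    for x in frames:
        h = gru_step(x, h, p)
    return h

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def fuse_and_classify(audio_emb, visual_emb, Wc):
    """Late fusion: concatenate the two embeddings, then linear layer + softmax."""
    return softmax(matvec(Wc, audio_emb + visual_emb))

# Toy usage with random weights (purely illustrative).
random.seed(0)
def rand_mat(r, c):
    return [[random.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]

hidden, feat, classes = 4, 3, 3
params = {k: rand_mat(hidden, feat) for k in ("Wz", "Wr", "Wn")}
params.update({k: rand_mat(hidden, hidden) for k in ("Uz", "Ur", "Un")})

frames = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]   # two audio feature frames
audio_emb = encode_audio(frames, params, hidden)
visual_emb = [0.2, -0.1, 0.4, 0.0]              # stand-in for GoogLeNet features
Wc = rand_mat(classes, hidden * 2)
probs = fuse_and_classify(audio_emb, visual_emb, Wc)  # sentiment class probabilities
```

In a real system the visual embedding would come from a pretrained GoogLeNet (e.g. its penultimate layer), and the fusion layer would be trained jointly with the GRU; this sketch only shows the data flow.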