AUTHOR=Milling Manuel, Baird Alice, Bartl-Pokorny Katrin D., Liu Shuo, Alcorn Alyssa M., Shen Jie, Tavassoli Teresa, Ainger Eloise, Pellicano Elizabeth, Pantic Maja, Cummins Nicholas, Schuller Björn W.
TITLE=Evaluating the Impact of Voice Activity Detection on Speech Emotion Recognition for Autistic Children
JOURNAL=Frontiers in Computer Science
VOLUME=4
YEAR=2022
URL=https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2022.837269
DOI=10.3389/fcomp.2022.837269
ISSN=2624-9898
ABSTRACT=Individuals with autism are known to face challenges with emotion regulation and to express their affective states in different ways. With this in mind, a growing body of research on automatic affect recognition from speech and other modalities has recently been presented, both to assist and support autistic individuals and to improve understanding of their behaviours. Beyond the emotion expressed in the voice, the dynamics of autistic children's verbal speech can be inconsistent and vary greatly amongst individuals. The current contribution outlines a voice activity detection (VAD) system specifically adapted to autistic children's vocalisations. The presented VAD system is a recurrent neural network with long short-term memory (LSTM) cells. It is trained on 130 acoustic Low-Level Descriptors (LLDs) extracted from more than 17 hours of audio recordings, which were richly annotated by experts in terms of perceived emotion as well as the occurrence and type of vocalisations. The data were collected from 25 English-speaking autistic children undertaking a human-robot interaction scenario as part of the DE-ENIGMA project. The VAD system is further utilised as a preprocessing step for a continuous speech emotion recognition (SER) task, aiming to minimise the effects of potentially confounding information such as noise, silence, or non-child vocalisations. Its impact on SER performance is compared to that of other VAD systems, including a general VAD system trained on the same data set, an out-of-the-box WebRTC VAD system, and the expert annotations. Our experiments show that the child VAD system achieves a lower performance than our general VAD system trained under identical conditions, with ROC-AUC values of 0.662 and 0.850, respectively. The SER results show varying performance across valence and arousal depending on the utilised VAD system, with a maximum CCC of 0.263 and a minimum RMSE of 0.107. Although the performance of the SER models is generally low, the child VAD system can lead to slightly improved results compared to other VAD systems, and in particular to the VAD-less baseline, supporting the hypothesised importance of child VAD systems in the discussed context.
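To give a concrete picture of the architecture the abstract describes, below is a minimal PyTorch sketch of a frame-level LSTM VAD operating on 130 LLD features per frame. The input dimension matches the feature count stated above; the hidden size, layer count, and decision threshold are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn

class FrameVAD(nn.Module):
    """Frame-level voice activity detector: an LSTM over acoustic LLDs.

    The input size of 130 mirrors the LLD feature set named in the
    abstract; hidden size and depth are illustrative assumptions.
    """

    def __init__(self, n_features: int = 130, hidden: int = 64, layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # one voice-activity logit per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_features) -> per-frame probabilities in [0, 1]
        out, _ = self.lstm(x)
        return torch.sigmoid(self.head(out)).squeeze(-1)

# Usage: score a batch of 3 sequences of 500 frames each.
model = FrameVAD()
probs = model(torch.randn(3, 500, 130))  # (3, 500) frame-wise speech probabilities
is_speech = probs > 0.5                  # binarise with an assumed 0.5 threshold
```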
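One of the comparison systems is the out-of-the-box WebRTC VAD. The paper does not say how it was invoked; a common route in Python is the py-webrtcvad binding, sketched here under the assumptions of 16 kHz, 16-bit mono PCM input and 30 ms frames (the sizes that implementation accepts).

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most); 2 is an assumed setting
SAMPLE_RATE = 16000
FRAME_MS = 30                                       # WebRTC VAD accepts 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2    # 16-bit mono PCM

def speech_frames(pcm: bytes):
    """Yield (start_ms, is_speech) for consecutive 30 ms frames of raw PCM."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        start_ms = (i // 2) * 1000 // SAMPLE_RATE
        yield start_ms, vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
```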
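The SER results are reported as concordance correlation coefficient (CCC) and RMSE. For reference, the following NumPy sketch implements both under their standard definitions (Lin's CCC, population statistics); it is not code from the paper.

```python
import numpy as np

def ccc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Concordance correlation coefficient (Lin, 1989):
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mx, my = pred.mean(), gold.mean()
    vx, vy = pred.var(), gold.var()           # population variances (ddof=0)
    cov = ((pred - mx) * (gold - my)).mean()  # population covariance
    return float(2 * cov / (vx + vy + (mx - my) ** 2))

def rmse(pred: np.ndarray, gold: np.ndarray) -> float:
    """Root mean squared error between prediction and gold annotation."""
    return float(np.sqrt(np.mean((pred - gold) ** 2)))
```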