AUTHOR=Coto-Solano Rolando, Stanford James N., Reddy Sravana K.
TITLE=Advances in Completely Automated Vowel Analysis for Sociophonetics: Using End-to-End Speech Recognition Systems With DARLA
JOURNAL=Frontiers in Artificial Intelligence
VOLUME=4
YEAR=2021
URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2021.662097
DOI=10.3389/frai.2021.662097
ISSN=2624-8212
ABSTRACT=In recent decades, computational approaches to sociophonetic vowel analysis have been steadily increasing, and sociolinguists now frequently use semi-automated systems for phonetic alignment and vowel formant extraction, including FAVE (Forced Alignment and Vowel Extraction; Rosenfelder et al. 2011; Evanini et al. 2009), the Penn Aligner (Yuan & Liberman 2008), and DARLA (Dartmouth Linguistic Automation; Reddy & Stanford 2015). Yet these systems still have a major bottleneck: manual transcription. For most modern sociolinguistic vowel alignment and formant extraction, researchers must first create manual transcriptions. This human step is painstaking, time-consuming, and resource-intensive. If it could be replaced with completely automated methods, sociolinguists could potentially tap into vast datasets that have previously gone unexplored, including legacy recordings that are underutilized for lack of transcriptions. How close are current technological tools to achieving such groundbreaking changes for sociolinguistics? Prior work (Reddy & Stanford 2015) showed that an HMM-based Automated Speech Recognition (ASR) system, trained with CMU Sphinx (Lamere et al. 2003), was accurate enough for DARLA to uncover evidence of the US Southern Vowel Shift without any human transcriptions. Even so, because that ASR system relied on a small training set, it produced numerous transcription errors. In the six years since that study, end-to-end ASR algorithms have shown considerable improvements in transcription quality; one example is the RNN-based DeepSpeech system from Mozilla (Hannun et al. 2014). The present paper combines DeepSpeech with DARLA to push the technological envelope and determine how well contemporary ASR systems can perform in completely automated vowel analyses with sociolinguistic goals. When comparing a completely automated system against a semi-automated system involving human manual work, there will always be a tradeoff between accuracy on the one hand and speed and replicability on the other (Kendall & Fruehwald 2014). Nonetheless, our study shows that, for certain large-scale applications and research goals, a completely automated approach using publicly available ASR can produce meaningful sociolinguistic results across large datasets, and these results can be generated quickly, efficiently, and with full replicability.
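To make the fully automated first step described in the abstract concrete, the sketch below shows how a recording could be transcribed with the Mozilla DeepSpeech Python bindings before any alignment or formant extraction takes place. This is an illustrative sketch, not the paper's implementation: the model, scorer, and audio file names are assumptions, and the resulting transcript would then feed a DARLA-style forced-alignment and vowel-formant-extraction stage over the same audio.

```python
# Minimal sketch: producing an ASR transcript with Mozilla DeepSpeech as the
# first stage of a completely automated vowel-analysis pipeline.
# File names (model, scorer, WAV) are placeholders, not details from the paper.
import wave

import numpy as np
import deepspeech

# Pretrained English acoustic model and external scorer released by Mozilla;
# exact file names depend on the DeepSpeech release you download.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16-bit PCM mono audio at the model's sample rate (16 kHz).
with wave.open("interview.wav", "rb") as wav:
    assert wav.getframerate() == model.sampleRate(), "resample the audio first"
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

# The transcript replaces the manual transcription step; in a DARLA-style
# workflow it would then drive forced alignment and formant extraction.
transcript = model.stt(audio)
print(transcript)
```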