AUTHOR=Hsieh Shu-Kai, Tseng Yu-Hsiang, Lian Da-Chen, Wang Chi-Wei
TITLE=Self-supervised learning for Formosan speech representation and linguistic phylogeny
JOURNAL=Frontiers in Language Sciences
VOLUME=Volume 3 - 2024
YEAR=2024
URL=https://www.frontiersin.org/journals/language-sciences/articles/10.3389/flang.2024.1338684
DOI=10.3389/flang.2024.1338684
ISSN=2813-4605
ABSTRACT=Formosan languages, spoken by the indigenous peoples of Taiwan, play a unique role in the reconstruction of Proto-Austronesian. This paper presents a real-world Formosan language speech dataset comprising 144 hours of news footage in 16 Formosan languages. One merit of the dataset is that it allows the relationships among Formosan languages to be examined in vivo. With the help of deep learning models such as wav2vec 2.0, we are able to analyze the speech data without transcriptions. Using 13 hours of validated speech as our dataset, we first train a language classifier based on XLSR-53 that classifies the 16 Formosan languages with an accuracy of 86%. We then extract the speech vector representations learned by the model and compare them with 152 manually coded linguistic typological features. The comparison suggests that the speech vectors reflect the phonological and morphological aspects of Formosan languages. In addition, these linguistic features are used to construct a linguistic phylogeny, and the resulting genealogical grouping corresponds with previous literature. We believe the dataset will open up possibilities for investigating the current real-world use of the Formosan languages and provide fertile ground for future interdisciplinary collaboration.
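
The abstract describes comparing model-derived speech vectors with manually coded typological features. One common way to make such a comparison, sketched below, is to compute pairwise distances between languages in each space and rank-correlate the two sets of distances. This is an illustration under stated assumptions, not the authors' procedure: the language labels, vectors, feature codings, and the choice of cosine and Hamming distances are all hypothetical placeholders, and the abstract does not say which comparison method the paper uses.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def hamming_distance(a, b):
    """Fraction of coded features on which two languages differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise(vectors, dist):
    """Distances for every unordered language pair, in a fixed (sorted) order."""
    names = sorted(vectors)
    return [dist(vectors[p], vectors[q]) for p, q in combinations(names, 2)]


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy inputs (NOT data from the paper): one averaged speech
# vector and one binary typological coding per placeholder language.
speech_vectors = {
    "lang_A": [1.0, 0.0],
    "lang_B": [0.9, 0.1],
    "lang_C": [0.0, 1.0],
    "lang_D": [0.1, 0.9],
}
typological_features = {
    "lang_A": [1, 1, 0, 0],
    "lang_B": [1, 1, 0, 1],
    "lang_C": [0, 0, 1, 1],
    "lang_D": [0, 0, 1, 0],
}

speech_dists = pairwise(speech_vectors, cosine_distance)
feature_dists = pairwise(typological_features, hamming_distance)
rho = spearman(speech_dists, feature_dists)
print(f"Spearman correlation between distance sets: {rho:.3f}")
```

A positive correlation would indicate that languages close together in the speech-vector space also tend to share typological codings. Because pairwise distances are not independent observations, a permutation-based procedure such as a Mantel test would be needed before attaching a significance level to the coefficient.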