AUTHOR=Zhang Shengwen, Zhang Yanxia, Liu Chao
TITLE=Listening to stars: audio-inspired multimodal learning for star classification
JOURNAL=Frontiers in Astronomy and Space Sciences
VOLUME=12
YEAR=2025
URL=https://www.frontiersin.org/journals/astronomy-and-space-sciences/articles/10.3389/fspas.2025.1659534
DOI=10.3389/fspas.2025.1659534
ISSN=2296-987X
ABSTRACT=Stellar spectral classification plays a crucial role in understanding the intrinsic properties of stars, such as their temperature, composition, and luminosity. Current methods for star classification rely primarily on template fitting, color-magnitude cuts, or machine learning models that process raw 1D spectra or 2D spectral images. These approaches, however, are limited by two main factors: (i) degeneracies in spectral features that lead to confusion between adjacent spectral types, and (ii) an overreliance on flux-versus-wavelength representations, which may overlook complementary structural information. To address these limitations, we propose a novel multimodal framework for stellar spectral classification that combines 1D and 2D spectral data with audio-derived features. Motivated by the structural similarities between stellar spectra and audio signals, we introduce, for the first time, audio-inspired feature extraction techniques, including Mel spectrograms, Mel-frequency cepstral coefficients (MFCC), and linear-frequency cepstral coefficients (LFCC), to capture frequency-domain patterns often missed by conventional methods. Our framework employs an eight-layer CNN for the 1D spectral data, an EPSANet-50 for the 2D spectral images, and a three-layer CNN for the audio-derived features. The outputs of these models are mapped to 256-dimensional vectors and fused via a fully connected layer, with attention mechanisms further enhancing the learning process. Experimental results demonstrate that while 1D spectral data with Coordinate Attention achieves an accuracy of 89.75±0.28%, the Mel spectrogram alone outperforms the raw spectral data, reaching 90.23±0.36%. Combining the 1D and 2D modalities yields 91.26±0.35%, and integrating audio features with spectra results in 89.09±0.43%. The fully multimodal approach delivers the best performance, achieving an overall accuracy of 91.79±0.11%. These findings underscore the effectiveness of incorporating audio-derived features, offering a fresh and promising approach to improving stellar spectral classification beyond existing methods.
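
As an illustration of the audio-inspired feature extraction the abstract describes, the sketch below treats a 1D stellar spectrum (flux versus wavelength) as if it were a waveform and computes the three named representations with torchaudio. The nominal "sample rate", window settings, and coefficient counts are illustrative assumptions; the abstract does not specify the paper's exact parameters.

```python
# Minimal sketch: computing Mel spectrogram, MFCC, and LFCC features from a
# 1D stellar spectrum treated as an audio-like signal. All numeric settings
# (sample rate, FFT size, hop, filter/coefficient counts) are assumptions for
# illustration, not values taken from the paper.
import torch
import torchaudio.transforms as T

SR = 16_000            # nominal sample rate assigned to the flux series (assumption)
N_FFT, HOP = 512, 128  # STFT window and hop (assumption)

flux = torch.randn(1, 3600)  # stand-in for a normalized flux-vs-wavelength array

mel = T.MelSpectrogram(sample_rate=SR, n_fft=N_FFT, hop_length=HOP, n_mels=64)(flux)
mfcc = T.MFCC(sample_rate=SR, n_mfcc=20,
              melkwargs={"n_fft": N_FFT, "hop_length": HOP, "n_mels": 64})(flux)
lfcc = T.LFCC(sample_rate=SR, n_lfcc=20,
              speckwargs={"n_fft": N_FFT, "hop_length": HOP})(flux)

# Each output is a 2D time-frequency map suitable as input to a small CNN.
print(mel.shape, mfcc.shape, lfcc.shape)
```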
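
The late-fusion scheme the abstract outlines (three branch encoders, each projected to a 256-dimensional embedding, concatenated and classified by a fully connected layer) can be sketched as follows. The encoder bodies here are lightweight stand-ins for the paper's eight-layer CNN, EPSANet-50, and three-layer CNN; the attention mechanisms are omitted, and the layer sizes and class count (seven, assuming OBAFGKM spectral types) are assumptions.

```python
# Minimal sketch of 1D-spectrum + 2D-image + audio-feature late fusion:
# each branch maps its input to a 256-d embedding; the embeddings are
# concatenated and passed through a fully connected classifier.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Generic encoder stand-in: backbone features -> 256-d embedding."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, 256)

    def forward(self, x):
        return self.proj(self.backbone(x).flatten(1))

class MultimodalClassifier(nn.Module):
    def __init__(self, n_classes: int = 7):  # class count is an assumption
        super().__init__()
        self.spec1d = Branch(nn.Sequential(   # stand-in for the eight-layer 1D CNN
            nn.Conv1d(1, 16, 7, stride=2), nn.ReLU(), nn.AdaptiveAvgPool1d(8)), 16 * 8)
        self.spec2d = Branch(nn.Sequential(   # stand-in for EPSANet-50
            nn.Conv2d(1, 16, 7, stride=2), nn.ReLU(), nn.AdaptiveAvgPool2d(4)), 16 * 16)
        self.audio = Branch(nn.Sequential(    # stand-in for the three-layer CNN
            nn.Conv2d(1, 16, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(4)), 16 * 16)
        self.head = nn.Linear(3 * 256, n_classes)  # fully connected fusion layer

    def forward(self, x1d, x2d, xaudio):
        z = torch.cat([self.spec1d(x1d), self.spec2d(x2d), self.audio(xaudio)], dim=1)
        return self.head(z)

model = MultimodalClassifier()
logits = model(torch.randn(2, 1, 3600),    # 1D spectrum
               torch.randn(2, 1, 64, 64),  # 2D spectral image
               torch.randn(2, 1, 64, 28))  # audio-derived feature map (e.g. Mel)
print(logits.shape)  # torch.Size([2, 7])
```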