- 1Planetary Environments Laboratory, NASA Goddard Space Flight Center, Greenbelt, MD, United States
- 2Tyto Athene LLC, Reston, VA, United States
- 3Earth and Planetary Sciences, Johns Hopkins University, Baltimore, MD, United States
- 4Danell Consulting Inc., Winterville, NC, United States
Mass spectrometers are powerful instruments that aim to identify unknown compounds via their mass-to-charge ratio and perform quantitative and semi-quantitative analysis. These instruments have been essential to space missions over the past several decades (e.g., Pioneer Venus, Viking, Galileo, Cassini, Mars Science Laboratory) with several more en route (e.g., JUpiter ICy moons Explorer (JUICE), Europa Clipper) or under development (e.g., Rosalind Franklin, Dragonfly). However, future missions targeting remote planetary bodies increasingly face limited data transmission rates and volumes, restricting the amount of information that can be sent back to Earth. These challenges highlight the need for onboard science autonomy to optimize science return. Machine learning (ML) and data science tools can significantly contribute to the development of science autonomy by enabling rapid interpretation and prioritization of science data. Yet, these efforts for planetary science applications are hindered by the scarcity of representative datasets for training models, especially for complex flight instruments. In this work, we build on our earlier science autonomy work using the Mars Organic Molecule Analyzer (MOMA) instrument for the Rosalind Franklin (ExoMars) mission as a proof-of-concept. We investigate the generation of artificial mass spectra through “manual” augmentation techniques and evaluate their performance on mass spectrometer (MS) data from the laser desorption/ionization mass spectrometry (LDMS) mode of the flight-like MOMA engineering test unit (ETU). We implement basic transformation-based augmentation methods such as peak intensity randomization and peak shifting (by limited, realistic m/z values), among others. We assess their scientific integrity in collaboration with instrument experts and investigate how the inclusion of generated data affects the performance of ML algorithms for mass spectral analysis.
We compare the performance of supervised learning models on predicting the chemical categories of new input mass spectra, both with and without augmented data, to evaluate the impact of these techniques. Our work provides guidelines for developing realistic augmented mass spectra without compromising scientific validity, while also contributing to the development of a mature framework for ML tools in MS data analysis, advancing science autonomy for existing and future planetary missions.
1 Introduction
1.1 Planetary science challenges–need for autonomy
As space exploration missions target more distant planetary bodies in our solar system–including Ocean Worlds (e.g., Titan, Europa, Enceladus), comets (e.g., 67P/Churyumov-Gerasimenko), and asteroids (e.g., Bennu, Ryugu, Psyche)–it is becoming clear that onboard autonomy will play an increasingly critical role in optimizing the science achieved. The Planetary Science and Astrobiology Decadal Survey (National Aeronautics and Space Administration, 2022) states that “missions targeting remote planetary bodies will face limited data rates and volumes and will require onboard science autonomy to optimize science return.” Future missions to ever more distant targets will be severely limited by stringent communication constraints, including narrow communication windows, long latencies, and restrictive bandwidth, which make current operations models based on pre-programmed data acquisition interleaved with ground-in-the-loop interactions prohibitive for ambitious missions. Under such a constrained cadence, these future missions would also be limited in their ability to respond optimally to new and uncertain environments and events.
Groundbreaking planetary missions such as Dragonfly targeting Titan in the 2030s (Barnes et al., 2021; Turtle et al., 2020), Europa Clipper on its way to Europa (Phillips and Pappalardo, 2014), Rosalind Franklin destined for Mars (Vago et al., 2017), and the Enceladus Orbilander (identified as a priority in National Academies of Sciences Engineering and Medicine (2022); Mackenzie et al. (2021)) include advanced science instruments able to collect greater data volumes than can be sent back to Earth. These missions aim to conduct in situ compositional analysis on planetary bodies where data downlink is highly constrained. Mass spectrometers (MS) are powerful instruments onboard Dragonfly, Clipper, and Rosalind Franklin (and potentially Orbilander) that aim to identify unknown compounds via their mass-to-charge ratio and perform quantitative analysis of these compounds. MS instruments have been essential to space missions over the past several decades (e.g., Pioneer Venus, Viking, Galileo, Cassini, Mars Science Laboratory) with several more en route (e.g., JUpiter ICy moons Explorer (JUICE), Europa Clipper) or under development (e.g., Rosalind Franklin, Dragonfly) (Chou et al., 2021). Instruments currently being developed for future missions are capable of generating even more data (Arevalo et al., 2019), with sampling rates already increasing from 50 samples per second for the quadrupole mass spectrometer Sample Analysis at Mars (SAM) onboard Curiosity, to 50,000 samples per second for the MOMA ion trap mass spectrometer on ExoMars, and expected to reach millions per second for next-generation orbitrap mass spectrometers. This rapid growth far exceeds the data transmission rates to Earth, which are limited by physics and communications technologies, emphasizing a widening gap between the volume of data that can be acquired and what can be transmitted back to Earth.
Greater onboard autonomy could therefore enable major advancements essential to supporting future mission operations and data collection.
Given how autonomy could enhance and enable space exploration missions, we investigate the concept of science autonomy–the ability of a science instrument to analyze its own data, make decisions about what data to send back to Earth, and adapt itself to improve subsequent measurements–for a Mars exploration mission as an initial proof-of-concept. Machine learning (ML) is a promising tool to enable the time-efficient data triage and classification required by onboard science autonomy (Chien and Morris, 2014; Thompson et al., 2015; Chien et al., 2017). In this context, autonomy depends on the ability to prioritize, filter, and interpret instrument data so that the most informative data are transmitted to Earth and used to guide subsequent operations. We leverage our previous work with the MOMA instrument on the Rosalind Franklin (ExoMars) mission (Da Poian et al., 2022b) using ML and data science tools to further investigate the application of science autonomy to mass spectrometry data for space missions. While full science autonomy would ultimately require algorithms onboard, our initial implementation focuses on supporting the science team’s decision-making process on Earth during operations on Mars by enhancing data interpretation–a step toward developing and validating approaches that can eventually transition to onboard use. Achieving this level of autonomy could benefit greatly from robust ML algorithms, which offer a promising solution for scientific data analysis during constrained space operations. However, the development of reliable ML models in this context is limited by the scarcity of representative training datasets. Mass spectrometers on space missions are optimized for their specific target environments, making them unique and expensive to build and operate, with typically only two or three instances of the instrument built per mission.
This limited availability restricts opportunities to collect diverse and statistically significant datasets for ML model development. To address this limitation, our work investigates the use of data augmentation techniques to generate scientifically consistent mass spectra that expand the limited available dataset without introducing bias. By carefully designing and validating these generated data products in collaboration with instrument experts, we aim to create training datasets that are both relevant for ML model development and suitable for advancing future onboard autonomy.
1.2 Data science and machine learning in planetary science
Data science is an interdisciplinary field that extracts insights and knowledge from data using various techniques such as statistical analysis and data mining (Cao, 2017). Data science uses mathematical and statistical theory (often in combination with programming) to describe data relationships and trends. ML is a subset of artificial intelligence (AI) that focuses on algorithm development to enable computers to learn and improve from experience (Figure 1). ML algorithms can mainly be divided into unsupervised ML techniques, aiming at finding unbiased patterns and similarities in the input dataset to group similar data into clusters (e.g., dimensionality reduction and clustering algorithms), and supervised ML techniques, aiming at using input labels to train the model to predict a specific output (e.g., classification and regression algorithms) (Carbonell et al., 1983; Ayodele, 2010).
Figure 1. Relationship between artificial intelligence (AI), machine learning (ML), deep learning (DL), and data science and basic definitions.
ML and data science have become widely investigated and adopted across both the science community and the space sector. These methods are tackling many challenges, from limited communication bandwidth and delayed data transmission to the ever-expanding volumes of scientific and engineering data. For instance, ML algorithms are used for autonomous navigation and hazard avoidance on planetary surfaces (e.g., Mars rovers; González et al., 2020), real-time anomaly detection onboard spacecraft (Hundman et al., 2018), onboard science event detection (Wagstaff et al., 2019), and mission planning optimization (Wang et al., 2022). Deep learning approaches have opened new frontiers in space exploration, such as the classification of Martian terrains (Rothrock et al., 2016) and the automatic detection of exoplanets from Kepler and TESS light curves (Malik et al., 2022). Collectively, these applications highlight the growing importance and critical role of ML algorithms in enabling more autonomous space exploration. Building on this trend, a natural step is to extend ML-driven autonomy to science instruments themselves. In the case of MS, ML methods could enable onboard science autonomy by triaging large datasets, prioritizing scientifically valuable features for downlink, and adapting measurement strategies in near-real time to maximize mission science return. As the development of such onboard capabilities through ML algorithms is constrained by the scarcity of representative training data from flight-like instruments, our work investigates data augmentation approaches to generate realistic, scientifically valid MS data to expand limited datasets without introducing bias.
1.3 Application to the Rosalind Franklin MOMA instrument
In this work, we focus on advancing ML applications for mass spectrometry data from Mars by leveraging datasets from the engineering test unit (ETU) of the Mars Organic Molecule Analyzer (MOMA) instrument planned to be onboard the Rosalind Franklin (ExoMars) rover mission (Goesmann et al., 2017; Brinckerhoff et al., 2013) (Figure 2). MOMA is a dual-ionization source mass spectrometer instrument designed to detect and characterize a broad range of organic molecules in surface and subsurface samples. MOMA is centered around a linear ion trap mass spectrometer (ITMS) interfaced with two complementary “front-end” analytical techniques: laser desorption/ionization (LDI) and gas chromatography (GC). This dual capability enables MOMA to target a broad range of organic molecules (with masses up to 1,000 Da), from volatile to nonvolatile compounds. Here we only focus on laser desorption/ionization mass spectrometry (LDMS) mode data as it is particularly valuable for the detection of nonvolatile complex organics over a wide range of molecular weights, providing insights into potential biosignatures or prebiotic chemistry on Mars (Li et al., 2017). By leveraging this LDMS dataset, we aim to advance analytical techniques to maximize scientific return.
Figure 2. Representation of the Rosalind Franklin ExoMars rover (credits: European Space Agency (ESA)) and diagram of the Mars Organic Molecule Analyzer (MOMA) instrument onboard (credits: MOMA team and ESA).
Our work aims to mature the proof-of-concept science autonomy approaches we previously developed (Da Poian et al., 2022a; Do et al., 2024) and refine ML techniques on this LDMS dataset to enhance MOMA’s scientific return. Our ML task focuses on clustering similar spectra and on the classification of mass spectra into main chemical families (10 different families/labels defined by MOMA scientists) and predicting chemical sample composition (67 labels) of new, previously unmeasured mass spectra during Mars exploration. This work provides a data analysis pipeline and internal tools to allow the MOMA science team to efficiently assess the chemical composition of Martian samples. The pipeline aims to enable scientists to quickly analyze new mass spectra, guiding operational decisions such as performing tandem mass spectrometry (i.e., MS/MS), isolating peaks of interest, adjusting laser desorption/ionization parameters, and determining if a gas chromatography mass spectrometry (GCMS) follow-up experiment is worthwhile, ultimately accelerating scientific processing in time-limited mission operations.
By using the MOMA instrument as a proof-of-concept, we are demonstrating initial capabilities that, while currently developed for Earth-based spectral analysis and interpretation, represent critical steps toward enabling science autonomy on future planetary missions. These methods could be generalized and applied to other planetary mission concepts to enable different levels of autonomy based on mission needs and instrument capabilities (Azari et al., 2021). A major challenge in developing such ML-based autonomy lies in the limited availability of representative training data from flight-like instruments. Here, we specifically investigate techniques to increase the amount of data available for ML algorithm training and evaluate how dataset volumes impact ML algorithm performance. At this stage, all ML model training in our research is performed on Earth, where computational resources enable us to systematically explore how data augmentation affects learning. However, our long-term vision is to extend these approaches toward in situ adaptability–where models deployed onboard a spacecraft could be retrained or fine-tuned using newly acquired data. Such capabilities would enable context-aware decision-making and instrument optimization during resource-limited mission operations. In this context, exploring shorter training durations (fewer epochs) is not an operational constraint in itself but a proxy for the constrained learning conditions expected in flight, allowing us to assess how augmentation enhances model generalization when data or training iterations are limited. By systematically evaluating how these generated datasets affect ML performance, we lay the groundwork for more robust and scalable ML solutions that can ultimately support autonomous MS operations in future planetary science missions.
2 Dataset and augmentation techniques
The development of reliable ML models to enable the analysis and prediction of chemical compositions using MS data is challenged by the limited availability of representative training data–a consequence of the high cost and constraints of conducting tests on flight-like instruments. When the available data are insufficiently representative, the best solution would be to collect more data in the laboratory or in the field. However, this process is time-consuming, resource-intensive, and expensive, so we explore data augmentation and data generation techniques to expand the available input dataset. The two main approaches to increasing data volume are: 1) applying algorithms (classic transformations or more complex ones like neural network (NN) techniques) to augment existing real data points (Figure 3), and 2) generating entirely synthetic data based on simulations with varying input parameters.
Figure 3. Data augmentation techniques definition, pros and cons, and examples: classic techniques (applied in this work) and neural networks-based techniques (that will be investigated in future work).
In mass spectrometry applications, modification strategies for mass spectra have traditionally focused on preprocessing steps such as baseline correction, noise reduction, normalization, and peak alignment (Ràfols et al., 2018). With the evolution of chemometrics and ML studies, data augmentation techniques including noise injection and peak intensity modification are being explored, with particular attention to preserving chemical plausibility (Bjerrum et al., 2017). Building on these concepts, recent studies have explored realistic spectral variations that mimic physical and/or instrumental effects. For instance, Wang et al. (2024) analyzed carefully-designed augmentation strategies, such as noise injection and peak value modifications (e.g., through peak averaging and square-root transformations), for deep learning models for aerosol particle classification under conditions of limited and imbalanced single-particle MS data. Their results highlight the value of specific augmentation in improving model generalization when experimental data are limited or heterogeneous. Similar approaches have long been used in other domains, such as image analysis, where data augmentation techniques consist of applying geometric transformations such as flipping, cropping, rotating, modifying colors, and injecting noise to augment the data volume (Shorten and Khoshgoftaar, 2019). Similar strategies are being adapted for MS data through both algorithmic and NN-based techniques (Figure 3). Other ML-based techniques such as Generative Adversarial Networks (GANs), a NN technique that can generate synthetic but realistic data (Goodfellow et al., 2020), are promising and will be explored in the future.
In this study, we focus on traditional data augmentation techniques to increase our training dataset by introducing slight modifications to existing data, with the goal of improving the ML model’s generalization performance, reducing overfitting, and enhancing performance on novel data.
2.1 Available dataset
Our work leverages mass spectral data from the ETU of the MOMA instrument (Figure 2) (Li et al., 2017). The ETU is a form, fit, and functional duplicate of the flight model (FM) that permits the introduction of a wide range of standard and Mars analog samples into a proxy carousel for direct LDMS analysis using a UV (266 nm) pulsed laser. The analyses performed on the ETU represent a high-fidelity analog of MOMA operation and data production. This instrument is a critical flight resource for the Rosalind Franklin mission and is thus tracked closely and maintained in a controlled configuration and high level of cleanliness, with significant operator support requirements; as such, the range and scope of new samples analyzed by the ETU is limited. To date, several hundred samples have been analyzed on the ETU, each producing thousands of individual mass spectra on average.
Our original dataset is composed of ∼30,000 LDMS spectra collected on the ETU by the science team for scientific purposes (Da Poian et al., 2022b), with engineering test data excluded and archived separately in an internal NASA database. The data augmentation we investigate here involves processes to artificially expand the available dataset by introducing statistically plausible variations to the original data. While data augmentation techniques are commonly applied in image and signal processing domains (Shorten and Khoshgoftaar, 2019; Wang and Perez, 2017), augmentation for MS data, especially in a planetary science context, requires careful consideration to preserve the underlying scientific validity of the data. Several challenges arise when designing augmentation strategies for MS data, such as:
- The preservation of scientific validity: as each peak represents a specific ion or fragment, the alteration of a peak intensity or position could lead to the wrong interpretation of the chemical signature.
- The domain-specific variability: as MS patterns depend on the instrument type and the instrument characteristics (e.g., ionization method, type of mass spectrometer, instrument and sample cleanliness, etc.), augmentation techniques should not introduce bias toward a specific “wrong” instrument behavior.
- The complex relationship between peaks: as MS peaks represent fragmentation patterns and isotopic relationships, augmentation techniques must maintain internal consistency of these complex molecular fragmentation patterns.
- The balance between scientific validity and the creation of a diverse dataset: the goal of augmentation techniques is to increase the diversity of the dataset while respecting the instrument and science fidelity.
To mitigate these challenges, we developed traditional augmentation techniques that carefully constrain modifications of MS data within chemically meaningful ranges, ensuring that the generated spectra remain representative of true instrument behavior. For instance, intensity changes were applied globally rather than to specific isolated peaks, shifts were bounded within instrument resolution limits, and noise additions were derived from experimental runs. Our augmentation strategies maintain fidelity to the scientific constraints of LDMS data while still producing a more diverse and robust dataset for input to ML algorithms.
2.2 Traditional augmentation techniques
We developed scripts for various data augmentation techniques (Table 1) designed to mimic realistic mass spectral variability due to instrumental or environmental factors. These scripts include techniques to shift peaks along the m/z axis (i.e., the x-axis values), to modify relative intensity values (i.e., the y-axis values), and to add noise (focusing on random noise). We also investigate a stretch transformation and the shifting of a subset of peaks. Each technique is assessed by MOMA scientists for its consistency with the physical principles underlying LDMS data and, more specifically, MOMA-related chemistry. The goal of scientific validity is to ensure that the transformations do not introduce artifacts inconsistent with MS science. This augmentation strategy aims to enhance ML model generalization and robustness while preserving the scientific validity of the augmented mass spectra. We introduce below the various augmentation techniques we implement in a Python script on the original MOMA dataset.
Table 1. Summary of the various classic data augmentation techniques investigated in this work, with schematics of each technique, their definition and their scientific meanings, and their limits for mass spectrometry data.
2.2.1 Shift
This technique applies a uniform shift to all m/z values in a spectrum by a small, random offset (e.g., +/- 1.5 m/z) (Table 1, row 1). The entire mass spectrum is shifted to the left or right along the x-axis (i.e., m/z axis) while the relative intensity values for each peak (i.e., y-axis) remain unchanged. Scientifically, in mass spectrometry, especially for laser-based MS instruments like MOMA, small m/z shifts across entire spectra commonly occur due to instrumental variations such as laser energy fluctuations, timing offsets in the waveform, or calibration shift in the mass analyzer. These global shifts are common, particularly during warm-up. This augmentation technique, using a very limited m/z range, simulates these calibration drifts and helps ML models learn spectral patterns independent of exact alignment with precise m/z values.
A variant approach was also tested, in which a specified percentage of randomly selected peaks were shifted by a random value within +/- 2 m/z, while the rest of the peaks remained untouched. However, this method does not reflect any physical process in most MS instruments, since peak positions are governed by the ion flight path and timing calibration, which affect all ions systematically rather than individually. As a result, we removed this augmentation and instead employed a stretch-like augmentation technique, which is more physically plausible.
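The global shift described above can be sketched as follows. This is our illustrative reconstruction, not the MOMA team's actual script; the +/- 1.5 m/z bound follows the example quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_spectrum(mz, intensity, max_shift=1.5):
    """Apply one uniform m/z offset to the whole spectrum.

    Simulates a global calibration drift: every peak moves by the same
    random amount and intensities are left untouched.
    """
    offset = rng.uniform(-max_shift, max_shift)
    return mz + offset, intensity.copy()

# Example: a CsI-like cluster series shifted as one block.
mz = np.array([133.0, 393.0, 653.0])
inten = np.array([100.0, 80.0, 15.0])
new_mz, new_inten = shift_spectrum(mz, inten)
```

Because the offset is drawn once per spectrum, peak-to-peak spacings (and thus fragmentation patterns) are preserved exactly.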
2.2.2 Stretch
Here, the most intense peak in a spectrum, also called the “base peak”, is anchored and remains unchanged (green peak in schematics of Table 1, row 2). The peaks to the left (lower m/z) are shifted slightly toward lower m/z values, while peaks to the right of the base peak (higher m/z) are shifted toward higher m/z values, with magnitude increasing with distance. This creates a “stretched” effect from the anchored base peak outward (Table 1, row 2). Specifically for linear ion trap devices such as the MOMA instrument, spectral stretching simulates calibration drift caused by factors like RF amplitude nonlinearity, space charge effects, and variable ion ejection timing due to scan rate dependencies. In practice, low-mass calibration in MOMA is usually more susceptible to space charge effects, whereas higher-mass peaks may appear stretched due to amplifier nonlinearity or frequency drifts. This augmentation is valuable for simulating the progressive mass calibration drift commonly observed when analyzing samples with high ion density, over longer periods of time, or under different environmental conditions.
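A minimal sketch of such a stretch transformation is given below, assuming a small random stretch coefficient; the 0.2% upper bound is our illustrative choice, not a documented MOMA calibration value.

```python
import numpy as np

rng = np.random.default_rng(0)

def stretch_spectrum(mz, intensity, max_coeff=0.002):
    """Stretch a spectrum outward from its base peak.

    The most intense peak is anchored; every other peak moves away from
    the anchor by an amount proportional to its distance from it, so
    displacement grows with distance, as described in the text.
    """
    anchor = mz[np.argmax(intensity)]
    coeff = rng.uniform(0.0, max_coeff)  # illustrative stretch range
    return anchor + (mz - anchor) * (1.0 + coeff), intensity.copy()

mz = np.array([50.0, 100.0, 150.0, 300.0])
inten = np.array([10.0, 100.0, 20.0, 5.0])  # base peak at m/z 100
new_mz, new_inten = stretch_spectrum(mz, inten)
```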
2.2.3 Intensity change
In this technique, we randomly select a fraction of peaks (e.g., 30%) and scale their intensities up or down within a given factor range (e.g., 0.7x - 1.3x) (Table 1, row 3). The peak positions (i.e., m/z values) on the x-axis remain unchanged.
This simulates real-world variability in relative peak intensities caused by laser focus or pulse-to-pulse stability, sample heterogeneity, or ionization efficiency. For example, in cesium iodide (CsI) samples (the MOMA calibration sample), the parent ion at m/z 133 is not always the most intense peak, and ion clusters like m/z 393 (i.e., Cs2I) may dominate. This augmentation mimics the variability and encourages ML models to rely on relative patterns, not on the absolute intensity.
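This intensity perturbation can be sketched as follows; the 30% fraction and 0.7x-1.3x range come from the examples in the text, while the implementation details are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def scale_random_peaks(mz, intensity, fraction=0.3, low=0.7, high=1.3):
    """Scale a random subset of peaks; m/z positions stay fixed.

    Mimics shot-to-shot and sample-heterogeneity variability in
    relative peak intensities without altering mass assignments.
    """
    scaled = intensity.astype(float).copy()
    n_pick = max(1, int(round(fraction * len(intensity))))
    picked = rng.choice(len(intensity), size=n_pick, replace=False)
    scaled[picked] *= rng.uniform(low, high, size=n_pick)
    return mz.copy(), scaled

mz = np.array([133.0, 393.0, 653.0, 787.0])
inten = np.array([100.0, 80.0, 15.0, 5.0])
new_mz, new_inten = scale_random_peaks(mz, inten)
```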
2.2.4 Noise addition
Here we add different types of noise to the intensity values of each peak (i.e., y-axis) (Table 1, row 4). We investigated random noise, adding uniform noise to the mass spectrum so that the intensity of each data point is modified by a random value drawn from a non-Gaussian distribution. Random noise is a plausible example of non-ideal environmental noise: it simulates chemical and instrument noise and can mimic laser shot-to-shot variations combined with electronic noise from the detector. Because this noise addition mimics both instrument and chemical noise, it can increase the robustness of ML models to artifacts and complex real-world spectra.
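A bounded uniform-noise version can be sketched as follows; the 2% noise level (relative to the base peak) is an assumed, illustrative value rather than a measured MOMA noise figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_random_noise(mz, intensity, noise_level=0.02):
    """Add bounded uniform (non-Gaussian) noise to peak intensities.

    noise_level scales the noise amplitude to the base-peak intensity;
    clipping keeps intensities non-negative, as on the real instrument.
    """
    scale = noise_level * intensity.max()
    noisy = intensity + rng.uniform(-scale, scale, size=intensity.shape)
    return mz.copy(), np.clip(noisy, 0.0, None)

mz = np.array([133.0, 393.0])
inten = np.array([100.0, 50.0])
new_mz, noisy = add_random_noise(mz, inten)
```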
2.3 Scientific consistency
Once the augmented MS data were created, we implemented a validation process in close collaboration with the MOMA science team to ensure their scientific consistency with respect to the MOMA instrument. The MOMA science team reviewed and analyzed the augmented spectra using a diagnostic process similar to that used for flight data analysis. They specifically focused on:
- the peak shapes, consistent with the MOMA instrument,
- the relative peak intensities,
- the m/z distributions and shifts introduced by augmentation, to ensure the shifts stayed within MOMA’s known calibration uncertainties,
- the noise levels, to ensure they remained within realistic instrument behavior and did not introduce non-desired artifacts or implausible features.
These criteria were assessed through a manual expert review process using visualization tools applied to MOMA ETU data. This process was iterated to refine the data augmentation parameters, ensuring that the augmented datasets maintained scientific integrity and could be used to support ML model development without introducing bias, artifacts, or misleading features.
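The review itself was manual, but the bounded criteria listed above lend themselves to a simple automated pre-screen before expert inspection. The sketch below is hypothetical: the function name and both tolerance values are our assumptions, not MOMA calibration specifications.

```python
import numpy as np

def prescreen_augmented(orig_mz, aug_mz, orig_int, aug_int,
                        max_shift=2.0, max_noise_frac=0.1):
    """Automated sanity screen run before manual expert review (illustrative).

    Checks that (1) m/z displacement stays within an assumed calibration
    tolerance and (2) intensity changes stay within an assumed noise
    budget relative to the base peak. Real review also inspects peak
    shapes and overall plausibility by eye.
    """
    shift_ok = np.all(np.abs(np.asarray(aug_mz) - np.asarray(orig_mz))
                      <= max_shift)
    noise_ok = np.all(np.abs(np.asarray(aug_int) - np.asarray(orig_int))
                      <= max_noise_frac * np.asarray(orig_int).max())
    return bool(shift_ok and noise_ok)
```

Spectra failing such a screen would be regenerated with tighter augmentation parameters rather than passed on to the science team.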
3 Methodology and neural network performance metrics
3.1 Experimental setup and methodology
In this work, we build upon earlier efforts that developed ML-based analyses of MOMA LDMS mass spectra (Da Poian et al., 2021). The original dataset consists of LDMS spectra collected from the MOMA ETU model during laboratory experiments using both individual chemical standards and complex mixtures representative of analog materials relevant to Mars exploration. These laboratory experiments covered a range of molecular families expected during MOMA flight operations, providing a diverse and representative training set for ML model development (Da Poian et al., 2022a). In collaboration with the MOMA science team, each mass spectrum was labeled in two different ways: a category label, representing the chemical family of the tested compound (e.g., “pure mineral”, “pure organic standard”), and a sample label, representing the specific physical sample analyzed (e.g., “hematite”). The dataset is moderately imbalanced, with calibration spectra being the most frequent class due to repeated tuning and performance checks on the ETU model. Each mass spectrum is represented as a one-dimensional intensity array, with both associated experimental metadata and labels compiled in a unified CSV file referred to here as the “original dataset”.
The various data augmentation methods were applied to this original dataset using custom scripts (Table 1), generating new mass spectra based on the original data. We then combined these generated spectra with the original data to form multiple training configurations with various percentages (10%, 30%, 50%, 70%, and 90%) of augmented data. In addition to varying the amount of augmented data, we also varied the proportion of calibration data (cesium iodide) used in the input data – 0%, 50%, or 100%. These various input datasets create multiple training configurations. Finally, we varied the number of training epochs–the number of times the entire training dataset is passed through the model during learning–using 5, 10, 50, 100, and 200 epochs across all configurations of input data. The various combinations of input data (i.e., amount of calibration data and amount of augmented data) and the various epoch values are represented in Figure 4.
Figure 4. Summary of the parameters we vary to create different configurations for the input dataset (i.e., percentage of augmented spectra and percentage of calibration data) and various training epochs.
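The parameter sweep described above can be enumerated programmatically. The sketch below simply builds the grid of configurations using the values quoted in the text; the variable and key names are our own.

```python
from itertools import product

AUG_FRACTIONS = [0.10, 0.30, 0.50, 0.70, 0.90]  # share of augmented spectra
CALIB_FRACTIONS = [0.0, 0.5, 1.0]               # share of CsI calibration data
EPOCH_COUNTS = [5, 10, 50, 100, 200]            # training epochs

# One dictionary per training configuration: 5 x 3 x 5 = 75 in total.
configs = [
    {"augmented": a, "calibration": c, "epochs": e}
    for a, c, e in product(AUG_FRACTIONS, CALIB_FRACTIONS, EPOCH_COUNTS)
]
```

Each configuration then corresponds to one independent training run of the neural network, allowing a like-for-like comparison of the resulting classification metrics.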
To evaluate the impact of augmented data on the model performance, we leverage the NN framework previously developed in Da Poian et al. (2022b) that uses open-source Python libraries including Scikit-Learn, TensorFlow, and Keras, on the various combinations of input datasets (Figure 5). In brief, this NN uses a fully connected feed-forward architecture designed to classify mass spectra into predefined chemical categories called “category labels”. Each spectrum is represented as a unidimensional array by indexing data to rounded, “integerized” (transformed into integers corresponding to the nominal unit mass) values ranging from 1 to 2,000. In this work focusing on the investigation of data augmentation on ML model performance, this previously-developed NN is trained independently on each dataset combination (e.g., original, augmented) to predict the category label. By comparing model performance across these configurations, we quantify how augmentation strategies influence the NN’s ability to generalize to unseen data and to robustly identify scientifically meaningful features in mass spectra, using standard classification metrics.
Figure 5. Pipeline from original data collection, to data augmentation process (scripts and checks with the subject matter experts), and neural network algorithm training for performance comparison.
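The "integerized" spectrum representation described above can be sketched as follows, assuming a centroided spectrum given as (m/z, intensity) pairs; this is an illustrative sketch of the binning step, not the actual MOMA processing code, and the example peak values are arbitrary.

```python
import numpy as np

MAX_MZ = 2000  # nominal unit-mass range 1..2000, as described in the text

def integerize_spectrum(mz_values, intensities, max_mz=MAX_MZ):
    """Bin a centroided spectrum into a fixed-length vector indexed by
    nominal (rounded) m/z. Intensities landing on the same integer m/z
    are summed; peaks outside 1..max_mz are dropped. A sketch of the
    representation described in the text, not the authors' code."""
    vector = np.zeros(max_mz, dtype=float)
    for mz, inten in zip(mz_values, intensities):
        unit_mass = int(round(mz))
        if 1 <= unit_mass <= max_mz:
            vector[unit_mass - 1] += inten
    # Normalize to the base peak so spectra are comparable across runs.
    peak = vector.max()
    return vector / peak if peak > 0 else vector

# Hypothetical three-peak spectrum.
vec = integerize_spectrum([127.9, 132.9, 259.8], [500.0, 1000.0, 250.0])
```

Each spectrum then becomes a fixed-length unidimensional array suitable as input to a fully connected NN.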
3.2 Performance metrics description
When developing a ML classification model–such as one predicting the chemical composition of samples from planetary bodies–it is essential to select performance metrics that reflect both statistical accuracy and scientific relevance when evaluating the quality of the model’s predictions. In the context of planetary MS data, these metrics help evaluate how reliably the model can distinguish between different chemical compound labels. In our case, a false positive, for instance, could represent a spectrum incorrectly classified as organic when it is not–potentially leading to an unnecessary activation of follow-up experiments (e.g., a GCMS experiment) or, in the future, an incorrect prioritization of data for downlink. On the other hand, a false negative could correspond to a missed detection of a scientifically significant compound, leading to lost science opportunities. Understanding these errors is critical for future autonomous decision-making frameworks. We summarize the most commonly used classification performance metrics below:
- Accuracy: the ratio of correctly predicted instances to the total instances in the dataset. While accuracy is a straightforward and widely reported metric, it could be misleading in the presence of class imbalance–a situation where some categories are represented by many more samples than others–or for specific tasks where the detections of false positives and false negatives carry different consequences.
- Precision: the ratio of correctly predicted positive (true positive) observations to the total predicted positive observations. Precision focuses on the accuracy of the positive predictions. It assesses how many of the predicted positive instances are actually correct, therefore it minimizes false positives and ensures that when a positive instance is predicted, it is highly reliable.
- Macro average precision (equal class treatment): the average precision calculated independently for each class, then averaged. This is useful for penalizing false positives across all classes equally, but it treats all chemical classes the same, which can overemphasize rare classes.
- Weighted average precision (accounts for class imbalance): precision computed for each class and weighted by the number of true instances per class. It reflects the model’s ability to avoid false positives while taking into account class distribution, but good performance on dominant chemical classes could mask poor performance on minor classes.
- Recall (or Sensitivity): the ratio of correctly predicted positive observations (true positives) to all actual positive observations. This metric measures the model’s ability to identify all relevant instances of a particular class; it therefore minimizes false negatives and ensures that most actual positive instances are detected.
- Macro average recall (equal class treatment): the average recall calculated per class, then averaged. This measures the model’s performance at detecting all instances of each chemical class.
- Weighted average recall (accounts for class imbalance): recall computed for each class and weighted by the number of true instances per class. This metric measures the ability to detect correct instances across all classes, weighted by the frequency of each class. Rare and important classes may not significantly impact this metric.
- F1-score: the harmonic mean of precision and recall. This metric provides a balanced assessment of a model’s performance, especially when there is a trade-off between minimizing false positives and false negatives.
- Macro average f1-score (equal class treatment): the harmonic mean of macro precision and macro recall. It provides a balanced view of how well the model avoids false positives and false negatives across all chemical classes equally but can obscure performance for frequent classes.
- Weighted average f1-score (accounts for class imbalance): f1-score for each class is calculated and weighted by the support of that class. It is a good overall performance measure when the distribution of chemical classes matters but might still mask underperformance on rare but scientifically critical chemical compositions.
- Confusion matrix: this is a tabular representation summarizing the classification model’s outcomes, showing the number of true positives, true negatives, false positives, and false negatives. It provides a comprehensive view of how the model performs across different classes.
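All of the metrics listed above are available in Scikit-Learn, which our pipeline already uses; the snippet below illustrates them on a toy example with hypothetical class labels (not the actual MOMA category labels).

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Toy example with hypothetical chemical-class labels.
y_true = ["organic", "organic", "inorganic", "calibration", "organic", "inorganic"]
y_pred = ["organic", "inorganic", "inorganic", "calibration", "organic", "inorganic"]

accuracy = accuracy_score(y_true, y_pred)                           # 5/6 correct
macro_precision = precision_score(y_true, y_pred, average="macro")  # equal class treatment
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")          # frequency-weighted

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred,
                      labels=["calibration", "inorganic", "organic"])

print(f"accuracy={accuracy:.3f}, macro recall={macro_recall:.3f}, "
      f"weighted f1={weighted_f1:.3f}")
print(cm)
```

Note how the single misclassified organic spectrum lowers macro recall (the organic class only recovers 2 of its 3 true instances) even though overall accuracy remains high.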
As our dataset is moderately imbalanced–with some chemical classes (e.g., calibration sample) represented by many more spectra than others–single-number metrics (such as overall accuracy) can be misleading. To provide a meaningful assessment of the overall performance, the choice of the metrics needs to account for both class imbalance and scientific priorities.
4 Results and discussion
4.1 Comparison of performance between original and augmented datasets
The choice of evaluation metrics to use depends on the specific task to accomplish, the available dataset (e.g., class distribution), and the scientific or operational priorities of the task (e.g., relative importance of false positives and false negatives in the problem’s context). In our case, we are particularly interested in minimizing the number of false positives, as incorrectly predicting the presence of certain chemical compounds could lead to misleading scientific interpretations or mission operations decisions. At the same time, detecting rare but scientifically significant compounds–such as potential biosignatures or key chemical markers–is critical. For this reason, we use macro recall to evaluate the model’s ability to detect all classes equally, including low-frequency compounds, and macro f1-score to provide a balanced measure of precision and recall across all classes. Additionally, we carefully look at the weighted f1-score to capture overall model performance despite the presence of class imbalance while still reflecting the trade-off between false positives and false negatives in a way that aligns with planetary science and astrobiology science goals. All the plots for these three metrics for all the configurations are summarized in Figure 6.
Figure 6. Main three performance metrics (macro average recall, macro average f1-score, weighted average f1-score) for all NN training combinations, varying the number of epochs (each color: 5, 10, 50, 100, and 200), the amount of CsI included (plots from left to right: 0%, 50%, 100%), and the amount of augmented data (x-axis: 10%, 30%, 50%, 70%, 90%).
4.1.1 Effect of augmentation percentage in the input dataset
High-level analysis: For low epochs (5 and 10), adding a substantial fraction of augmented data to the input data (i.e., ∼50% and 70%) substantially improves all three performance metrics, confirming that augmentation can effectively compensate for limited training time. However, beyond 70% of augmented data, the gains plateau or slightly drop in some configurations–especially at higher epochs. At 200 epochs, the difference between using 70% and 90% augmented data is usually small or negative.
Looking more closely, a very low level of augmentation (10%) results in only marginal improvements, if any, compared to training with the original dataset alone (0%). Moderate augmentation (30%–50%) consistently enhances performance, especially at lower epoch counts and with reduced CsI data, highlighting its value in resource-limited training scenarios. The largest performance boosts typically occur with higher augmentation levels (70%–90%), particularly when training extends beyond 50 epochs (i.e., macro recall values exceed 0.95). Yet, even in these cases, the benefits of 90% augmentation are not guaranteed to surpass those of 70%, indicating a threshold beyond which additional synthetic data may not contribute meaningfully. Overall, augmentation boosts ML model learning under specific limitations, especially when the number of epochs is constrained.
4.1.2 Effect of number of epochs
High-level analysis: The main increase in performance occurs between 5 and 10 epochs across all configurations, with a further gain from 10 to 50 epochs. Beyond 50 epochs, the improvement between 50, 100, and 200 epochs diminishes (Figure 6). This confirms that training time is indeed a limiting factor early in the learning phase, but its relevance diminishes at higher epoch counts. We notice that 50, 100, and 200 epochs often perform similarly, suggesting that 50 epochs are sufficient to train our NN algorithm within a reasonable time while attaining acceptable performance.
Although all training in this study is performed on Earth–where computational resources do not constrain training duration–the number of epochs is treated here as a controlled variable to emulate the limited retraining opportunities of resource-constrained environments such as onboard a spacecraft. In future in situ applications, models may need to be fine-tuned on newly acquired data under tight time and computational constraints. While training duration is not a constraint on Earth, evaluating performance across various training durations provides valuable insight into how data augmentation affects model generalization when learning time is limited in future in situ scenarios.
4.1.3 Effect of amount of calibration data (CsI)
Overall, we observe that across almost all augmentation/epochs configurations, including more CsI (100%) yields better performance. This effect is larger at low epoch counts and low amounts of augmented data, and smaller once the NN has both high augmentation and many epochs. Moreover, including some CsI data (50%) already gives a large benefit, and going to full 100% CsI yields smaller incremental gains.
Including CsI is most valuable at low epoch counts and low augmentation percentages; once a configuration uses more data (a larger augmentation percentage) and a higher epoch count, the benefit is marginal.
4.2 Computational training time
The number of epochs is the dominant driver of computational cost, represented by the training time (Figure 7). Increasing training from 5 to 200 epochs increases the training time by ∼102 min. Moreover, enlarging the input dataset (through the augmentation % variable) substantially increases runtime as well: a jump from the original dataset (0% augmentation) to 90% augmentation increases the training time by ∼60 min. Finally, including CsI in the input dataset has negligible impact on the training time.
Figure 7. Comparison of the computational time for the different configurations (training time in seconds, represented on a log10 scale).
As augmentation increases both the NN performance and its training time (i.e., computational cost), we can optimize this trade-off for Pareto efficiency, as described in the following section.
4.3 Pareto frontier
The Pareto frontier (or Pareto front) is used in multi-objective optimization and describes the set of solutions that are optimal and non-dominated, i.e., for which improving one objective results in a loss in another (Navon et al., 2021). In our case, the two objectives are to maximize the NN performance (i.e., highest macro recall, macro f1-score, weighted f1-score) while minimizing the computational cost (i.e., training time). As shown in Figure 8, some of our configurations achieve excellent performance but require long training times. The Pareto front identifies the configurations that offer the best trade-offs between speed and performance. Not only are specific configurations more efficient in terms of performance vs. training time, but they also provide insight into the generalization and robustness of the trained NN. In ML problems, generalization is the ability of a model to maintain high performance on unseen data, and robustness is the ability to remain stable across different data distributions or noise conditions.
Figure 8. Pareto front analysis of three main metrics (macro recall, macro f1-score, weighted f1-score) versus training time (in seconds) for NN model trained with 0%, 50%, and 100% CsI. Each panel shows individual runs (dots) with different training epochs (5, 10, 50, 100, or 200) and various augmentation percentages, and the corresponding Pareto-optimal solutions (solid lines) for each epoch. The dashed olive line marks the optimal Pareto trade-offs between the epoch numbers and the amount of augmented data. The blue star with the bold label marks the top-performing configuration.
Figure 8 shows that runs with few epochs and no augmentation complete rapidly but reach lower performance. Interestingly, for the 0% CsI experiment, the model trained with 50 epochs and no augmentation (denoted as “50e, 0%” in Figure 8) lies on the Pareto front and achieves near-optimal performance across the three metrics. For the 50% and 100% CsI experiments, the Pareto-optimal solutions are models trained with moderate augmentation (50%–70%) and 50 to 100 epochs. This suggests that a combination of high-quality calibration and controlled augmentation yields a more efficient model. Finally, more extreme configurations (e.g., 200 epochs with 90% augmentation) achieve only marginal performance gains while requiring substantial computational cost (e.g., 3–5 h).
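For readers wishing to reproduce this type of analysis, a generic Pareto-front extraction over (training time, score) pairs can be sketched as below; the run values are hypothetical, not our measured results.

```python
def pareto_front(runs):
    """Return the non-dominated (training_time, score) runs: those for
    which no other run is both at least as fast and at least as accurate,
    with a strict improvement in at least one objective. A generic
    sketch, not the authors' analysis code."""
    front = []
    for t, s in runs:
        dominated = any(
            (t2 <= t and s2 >= s) and (t2 < t or s2 > s)
            for t2, s2 in runs
        )
        if not dominated:
            front.append((t, s))
    return sorted(front)

# Hypothetical (training time in seconds, macro f1-score) pairs.
runs = [(60, 0.80), (300, 0.93), (1200, 0.95), (1500, 0.94), (90, 0.78)]
print(pareto_front(runs))  # [(60, 0.8), (300, 0.93), (1200, 0.95)]
```

With many runs, plotting only these non-dominated points (as in Figure 8) traces the frontier of achievable performance per unit of training time.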
Table 2 summarizes the effects of varying a single variable (augmentation % in the input dataset, CsI %, or number of epochs), of the interactions between two of these three variables, and of the interactions among all three variables.
Table 2. Effects of main performance metrics from single variable change (variables: augmentation %, CsI %, epochs values), from interactions between 2 variables, and interactions with all 3 variables.
In our study, high augmentation improves generalization by exposing the NN to a broader variety of training examples, reducing the risk of overfitting to the original dataset. In contrast, a high CsI percentage (high-quality calibration data) improves the model’s robustness without substantially increasing the computational cost. More epochs allow the model to refine the decision boundaries for the feature–label relationships, but the model tends to overfit after ∼100 epochs.
The Pareto front highlights the trade-off between classification performance and computational cost. In our experiments, the best trade-offs between performance and training cost tend to occur for configurations with 50%–70% augmentation, 50–100 epochs, and ≥50% CsI.
5 Conclusion
5.1 Summary of findings
Our study investigated data augmentation techniques for MS data with the goal of preserving scientific interpretability while supporting NN learning for planetary science applications. This work directly addresses one of the main barriers to enabling ML-driven science autonomy–the scarcity of representative training data from flight-like instruments. We first investigated various ways to augment our original dataset and established their scientific validity through a verification process conducted in close collaboration with instrument domain experts. We then assessed how these augmented spectra, calibration data, and training duration affect NN performance for our specific MS classification task.
Across all metrics, the use of augmented data consistently improved generalization, but the magnitude of improvement depended on the training regime. The strongest gains occurred at lower epoch counts, where augmentation effectively compensates for the limited training time. For intermediate epoch values (50 epochs), the improvement was moderate, and performance gains plateaued at higher epoch counts (100 epochs and above). Augmentation most effectively boosts NN learning under specific conditions, namely, when computational resources and training time are limited.
Similarly, the amount of calibration data used in the training set also had a measurable effect on the NN performance: including some (50%) or all (100%) of the CsI generally improved all metrics, especially at higher augmentation levels. Finally, training for more epochs increased performance, with the most noticeable improvements from 5 epochs to 50 epochs. Our Pareto front analysis shows that beyond ∼100 epochs, the training time increases while the performance returns plateau, especially with high augmentation and all CsI included. These findings demonstrate that scientifically consistent augmented data can meaningfully boost ML model learning and robustness for planetary science MS data, particularly in sparse-data regimes. By addressing the challenge of limited training data, this work lays a critical foundation for developing reliable ML models to support science autonomy in future planetary missions.
5.2 Future work
Our main goal is to support the science team with their analysis and decision-making process using data science and ML tools during operations of the MOMA instrument on Mars. As such, our work was built on a solid collaboration to fully address the science team’s needs and develop the most suitable tools while respecting mission operations constraints. Our multidisciplinary team composed of MS experts (to guide the science evaluation of the analysis tools), data science and ML experts (to develop the tools using computer science techniques), and software engineers (to develop the user interface for the tools) closely worked together from the problem statement, through the data collection and processing, to the model evaluation and deployment.
A key focus of this research was the systematic investigation of data augmentation to improve model generalization and robustness. Our findings provide concrete, data-driven recommendations for configuring NNs in planetary mission contexts with scientifically validated augmentation data. Building on this research, future work will expand toward NN-based data augmentation methods, such as Generative Adversarial Networks (GANs; Goodfellow et al., 2020) and Variational Autoencoders (VAEs), to produce high-fidelity and scientifically reliable mass spectra. GANs, for instance, use two NNs–one generating synthetic but realistic data points and the other distinguishing real from synthetic data points–to create highly realistic data, while VAEs learn latent representations that can be sampled to generate plausible new spectra. Although these techniques are computationally intensive and more complex to implement, they hold significant potential to enrich training datasets, particularly in scenarios where data are sparse. Rigorous validation by domain experts will remain essential to ensure that any synthetically generated spectra preserve the scientific integrity required for MS data in planetary science applications.
Our team is also investigating the deployment of these ML prediction algorithms on high performance space computing (HPSC) boards from Microchip to test whether the MOMA flight software and the NN models can operate in parallel. This will include assessing inference performance from the NN models on newly collected data, as well as evaluating execution time, memory capabilities, and power consumption under flight-relevant conditions.
Beyond immediate applications to the MOMA instrument onboard ExoMars, our work directly benefits future missions such as Dragonfly, which will target Titan in the 2030s and carry DraMS, a mass spectrometer based on MOMA heritage, as well as missions targeting the Ocean Worlds Europa and Enceladus. Moreover, our multidisciplinary team (MS experts, Mars and Titan science experts, and computer scientists) demonstrates the importance of collaboration between mission science teams and data scientists to develop the most efficient approach for the analysis of MS data for space missions. This collaborative approach can be leveraged for future applications to other planetary instruments by identifying common challenges (such as data processing, decision-making support, anomaly/outlier detection, and data prioritization) and producing general algorithms applicable to new instrument types.
Data availability statement
The datasets presented in this article are not readily available because the ExoMars MOMA instrument has not yet launched and the data remain proprietary to the MOMA science team. Requests to access the datasets should be directed to Victoria Da Poian, victoria.dapoian@nasa.gov; Eric Lyness, eric.i.lyness@nasa.gov; Xiang Li, xiang.li@nasa.gov.
Author contributions
VD: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review and editing. SH: Conceptualization, Resources, Supervision, Writing – review and editing. EL: Conceptualization, Funding acquisition, Investigation, Methodology, Resources, Supervision, Validation, Writing – review and editing. XL: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Methodology, Resources, Supervision, Validation, Writing – review and editing. RD: Formal Analysis, Investigation, Methodology, Validation, Writing – review and editing. WB: Funding acquisition, Resources, Supervision, Validation, Writing – review and editing. BT: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Visualization, Writing – review and editing.
Funding
The authors declare that financial support was received for the research and/or publication of this article. This research was funded through a NASA Goddard Space Flight Center Internal Research and Development (IRAD) proposal.
Acknowledgements
The authors thank the reviewers and the editors for providing thoughtful feedback, which greatly improved the clarity and quality of this manuscript. We are grateful to the NASA Planetary Environment Lab (PEL) science team members, particularly Luoth Chou and David Burtt, for their guidance, support, and valuable discussions on scientific consistency throughout this investigation. We also acknowledge the contributions of the software development team—Joe Avolio, Nick Dobson, and Brad Tse—whose expertise was essential to this project. The authors further thank colleagues from the Horst’s research group at Johns Hopkins University for their stimulating discussions and continuous support. Finally, this project used the high-performance computing resources provided by the NASA Center for Climate Simulation (NCCS) at NASA Goddard Space Flight Center.
Conflict of interest
Authors VD and EL were employed by Tyto Athene LLC.
Author RD was employed by Danell Consulting Inc.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The authors declare that Generative AI was used in the creation of this manuscript, to brainstorm ideas and to generate figures (especially Figures 2, 4, 8).
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Arevalo, R., Ni, Z., and Danell, R. M. (2019). Mass spectrometry and planetary exploration: a brief review and future projection. J. Mass Spectrom. 55 (e4454), e4454. doi:10.1002/jms.4454
Ayodele, T. O. (2010). Machine learning overview. New Adv. Mach. Learn. 2 (9–18), 16. doi:10.5772/9374
Azari, A. R., Biersteker, J. B., Dewey, R. M., Doran, G., Forsberg, E. J., Harris, C. D. K., et al. (2021). Integrating machine learning for planetary science: perspectives for the next decade. 53. doi:10.3847/25c2cfeb.aa328727
Barnes, J. W., Turtle, E. P., Trainer, M. G., Lorenz, R. D., MacKenzie, S. M., Brinckerhoff, W. B., et al. (2021). Science goals and objectives for the Dragonfly Titan rotorcraft relocatable lander. The Planetary Science Journal 2 (4), 130. doi:10.3847/PSJ/abfdcf
Bjerrum, E. J., Glahder, M., and Skov, T. (2017). Data augmentation of spectral data for convolutional neural network (CNN) based deep chemometrics. Available online at: https://arxiv.org/abs/1710.01927.
Brinckerhoff, W. B., Pinnick, V. T., van Amerom, F. H. W., Danell, R. M., Arevalo, R., and Atanassova, M. S. (2013). “Mars organic molecule Analyzer (MOMA) mass spectrometer for ExoMars 2018 and beyond,” in Proceedings of the IEEE aerospace conference (IEEE), 1–8. doi:10.1109/AERO.2013.6496942
Cao, L. (2017). Data science: a comprehensive overview. ACM Comput. Surv. 50 (3), 1–42. doi:10.1145/3076253
Carbonell, J. G., Michalski, R. S., and Mitchell, T. M. (1983). “An overview of machine learning,” in Machine learning. Editors R. S. Michalski, J. G. Carbonell, and T. M. Mitchell (Los Altos, CA: Morgan Kaufmann), 3–23. doi:10.1016/B978-0-08-051054-5.50005-4
Chien, S., and Morris, R. (2014). Space applications of artificial intelligence. AI Mag. 35 (4), 3–6. doi:10.1609/aimag.v35i4.2551
Chien, S. A., Doubleday, J., Thompson, D. R., Wagstaff, K. L., Bellardo, J., Francis, C., et al. (2017). Onboard autonomy on the intelligent payload experiment CubeSat mission. J. Aerosp. Inf. Syst. 14 (6), 307–315. doi:10.2514/1.I010386
Chou, L., Mahaffy, P., Trainer, M., Eigenbrode, J., Arevalo, R., Brinckerhoff, W., et al. (2021). Planetary mass spectrometry for agnostic life detection in the solar system. Front. Astronomy Space Sci. 8, 755100. doi:10.3389/fspas.2021.755100
Da Poian, V., Lyness, E., Brinckerhoff, W., Danell, R., Li, X., and Trainer, M. (2021). Science autonomy and the ExoMars mission: machine learning to help find life on Mars. Computer 54 (10), 69–77. doi:10.1109/MC.2021.3070101
Da Poian, V., Lyness, E., Danell, R., Theiling, B., and Brinckerhoff, W. (2022a). Application to the ExoMars mission [abstract]. In Lunar and Planetary Science Conference. Houston, TX: Lunar and Planetary Institute, 53. Available online at: https://www.hou.usra.edu/meetings/lpsc2022/pdf/1952.pdf
Da Poian, V., Lyness, E., Danell, R. M., Li, X., Theiling, B., Trainer, M., et al. (2022b). Science autonomy and space science: application to the ExoMars mission. Front. Astronomy Space Sci. 9, 848669. doi:10.3389/fspas.2022.848669
Do, S., Da Poian, V., Lyness, E., Danell, R., Li, X., and Brinckerhoff, W. (2024). “Machine learning on Mars mass spectrometry data: unlocking insights for autonomous planetary exploration poster no. 1097,” in 55th Lunar and Planetary Science Conference. Houston, TX: Lunar and Planetary Institute. Available online at: https://www.hou.usra.edu/meetings/lpsc2024/pdf/1097.pdf.
Goesmann, F., Brinckerhoff, W. B., Raulin, F., Goetz, W., Danell, R. M., Getty, S. A., et al. (2017). The mars organic molecule analyzer (MOMA) instrument: characterization of organic material in Martian sediments. Astrobiology 17 (6–7), 655–685. doi:10.1089/ast.2016.1551
González, R., Apostolopoulos, D., and Iagnemma, K. (2020). Slippage and immobilization detection for planetary exploration rovers via machine learning and proprioceptive sensing. Journal of Field Robotics. doi:10.1002/rob.21736
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2020). Generative adversarial networks. Commun. ACM 63 (11), 139–144. doi:10.1145/3422622
Hundman, K., Constantinou, V., Laporte, C., Colwell, I., and Soderstrom, T. (2018). “Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding,” in Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’18) (New York, NY: Association for Computing Machinery), 387–395. doi:10.1145/3219819.3219845
Li, X., Danell, R. M., Pinnick, V. T., Grubisic, A., van Amerom, F., Arevalo, R. D., Jr., et al. (2017). Mars organic molecule analyzer (MOMA) laser desorption/ionization source design and performance characterization. Int. J. Mass Spectrom. 422, 177–187. doi:10.1016/j.ijms.2017.03.010
MacKenzie, S. M., Neveu, M., Davila, A. F., Lunine, J. I., Craft, K. L., Cable, M., et al. (2021). The enceladus orbilander mission concept: balancing return and resources in the search for life. The Planetary Science Journal 2 (2), 77. doi:10.3847/PSJ/abe4da
Malik, A., Moster, B. P., and Obermeier, C. (2022). Exoplanet detection using machine learning. Mon. Notices R. Astronomical Soc. 513 (4), stab3692–stab5516. doi:10.1093/mnras/stab3692
National Academies of Sciences, Engineering, and Medicine (2022). Origins, worlds, and life: a decadal strategy for planetary science and astrobiology 2023–2032. Washington, DC: The National Academies Press. doi:10.17226/26522
National Aeronautics and Space Administration (2022). NASA 2022 strategic plan. Washington, DC: NASA.
Navon, A., Shamsian, A., Fetaya, E., and Chechik, G. (2021). “Learning the pareto front with hypernetworks,” in 9th International Conference on Learning Representations (ICLR 2021) (Virtual). OpenReview / International Conference on Learning Representations (ICLR), Online/Virtual Conference. Available online at: https://arxiv.org/abs/2010.04104.
Phillips, C. B., and Pappalardo, R. T. (2014). Europa clipper mission concept: exploring Jupiter’s ocean moon. EGU General Assem. 95, 165–167. doi:10.1002/2014eo200002
Ràfols, P., Vilalta, D., Brezmes, J., Cañellas, N., del Castillo, E., Yanes, O., et al. (2018). Signal preprocessing, multivariate analysis and software tools for MA(LDI)-TOF mass spectrometry imaging for biological applications. Mass Spectrom. Rev. 37 (3), 281–306. doi:10.1002/mas.21527
Rothrock, B., Kennedy, R., Cunningham, C., Papon, J., Heverly, M., and Ono, M. (2016). “SPOC: Deep learning-based Terrain classification for Mars Rover Missions (AIAA SPACE 2016 Conference Paper No. 2016-5539),” in Proceedings of the American Institute of Aeronautics and Astronautics. doi:10.2514/6.2016-5539
Shorten, C., and Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. J. Big Data 6, 60. doi:10.1186/s40537-019-0197-0
Thompson, D. R., Altinok, A., Bornstein, B., Chien, S. A., Doubleday, J., Bellardo, J., et al. (2015). Onboard machine learning classification of images by a cubesat in Earth orbit. AI Matters 1 (4), 38–40. doi:10.1145/2757001.2757010
Turtle, E. P., Trainer, M. G., Barnes, J. W., Lorenz, R. D., Hibbard, K. E., Adams, D. S., et al. (2009). “Dragonfly: in situ exploration of Titan’s organic chemistry and habitability,” in 51st Lunar and Planetary Science Conference. (United States: The Woodlands), 184–213. Available online at: https://insu.hal.science/insu-04459771/document.
Vago, J. L., Westall, F., Coates, A. J., Jaumann, R., Korablev, O., Ciarletti, V., et al. (2017). Habitability on early Mars and the search for biosignatures with the ExoMars rover. Astrobiology 17 (6-7), 471–510. doi:10.1089/ast.2016.1533
Wagstaff, K. L., Doran, G., Davies, A., Anwar, S., Chakraborty, S., Cameron, M., et al. (2019). “Enabling onboard detection of events of scientific interest for the europa clipper spacecraft,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’19) (New York, NY: Association for Computing Machinery), 2191–2201. doi:10.1145/3292500.3330656
Wang, J., and Perez, L. (2017). The effectiveness of data augmentation in image classification using deep learning. arXiv [preprint], 1–8. doi:10.48550/arXiv.1712.04621
Wang, D., Russino, J. A., Basich, C., and Chien, S. (2022). “Analyzing the efficacy of flexible execution, replanning, and plan optimization for a planetary lander,” in Proceedings of the 32nd international conference on automated planning and scheduling (ICAPS 2022) (Palo Alto, CA: Association for the Advancement of Artificial Intelligence (AAAI Press)). doi:10.1609/icaps.v32i1.19838
Wang, G., Ruser, H., Schade, J., Passig, J., Zimmermann, R., Dollinger, G., et al. (2024). “Rapid classification of aerosol particle mass spectra using data augmentation and deep learning,” in Proceedings of the 2024 IEEE Conference on Artificial Intelligence (CAI) (Piscataway, NJ: IEEE), 1167–1172. doi:10.1109/CAI59869.2024.00208
Keywords: machine learning, planetary science, data augmentation, mass spectrometry, ExoMars, sparse dataset, data science, spaceflight instrumentation
Citation: Da Poian V, Hörst SM, Lyness EI, Li X, Danell RM, Brinckerhoff WB and Theiling BP (2025) Augmenting sparse spaceflight mass spectra datasets for machine learning applications. Front. Astron. Space Sci. 12:1706125. doi: 10.3389/fspas.2025.1706125
Received: 15 September 2025; Accepted: 31 October 2025;
Published: 18 December 2025.
Edited by:
Stavro Ivanovski, National Institute of Astrophysics (INAF), Italy
Reviewed by:
Jan Lisec, Federal Institute for Materials Research and Testing (BAM), Germany
James Marshall, Mars Petcare, United Kingdom
Copyright © 2025 Da Poian, Hörst, Lyness, Li, Danell, Brinckerhoff and Theiling. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Victoria Da Poian, victoria.dapoian@nasa.gov