
ORIGINAL RESEARCH article

Front. Astron. Space Sci.

Sec. Astrostatistics

This article is part of the Research Topic: Astrostatistics and AI: Assessing Deep Learning’s Role and Ethics in Astrophysics

Habitable Exoplanet - A Statistical Search For Life

Provisionally accepted
  • University of Calcutta, Kolkata, India

The final, formatted version of the article will be published soon.

The identification of habitable exoplanets is an important challenge in modern space science, requiring the combination of planetary and stellar parameters to assess conditions that support life. Using a dataset of 5867 exoplanets from the NASA Exoplanet Archive (as of April 3, 2025), we apply Random Forest and eXtreme Gradient Boosting (XGBoost) to classify planets as habitable or non-habitable based on 32 continuous parameters, including orbital semi-major axis, planetary radius, mass, density, and stellar properties. Habitability is defined through physics-based criteria rooted in the presence of liquid water, stable climates, and Earth-like characteristics, using seven key parameters: planetary radius, density, orbital eccentricity, mass, stellar effective temperature, luminosity, and orbital semi-major axis. To keep the classification reliable, we address multicollinearity by checking the Variance Inflation Factor (VIF) and select the parameters with VIF < 5: planetary orbital period, semi-major axis, density, eccentricity, inclination; stellar effective temperature, radius, mass, metallicity, age, density, and total proper motion. Although all seven defining parameters are used for labeling, only those with low VIF (orbital semi-major axis and eccentricity, planetary density, and stellar effective temperature) are retained for modeling, supplemented by additional low-VIF parameters. Class imbalance is addressed using the Random Over-Sampling Examples (ROSE) technique, with both over- and under-sampling, to create a balanced dataset. The models achieve classification accuracies of 99.99% for Random Forest and 99.93% for XGBoost on the test set, with high sensitivity and specificity. We analyze the data distributions of the key defining parameters, revealing skewed distributions typical of exoplanet populations.
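As a rough illustration of the VIF screening described above, the following Python sketch computes VIF_j = 1/(1 - R_j^2) through the equivalent identity VIF_j = [R^-1]_jj, where R is the correlation matrix of the predictors. The data are synthetic, and the column names (planet_radius, planet_mass, eccentricity), the helper functions, and the single-pass filter are assumptions made for illustration only; the article does not state how its VIF values were computed or whether variables were removed iteratively.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment with the identity: [A | I] is reduced to [I | A^-1].
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfectly collinear predictors)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {name: VIF} for a dict of equal-length predictor columns."""
    names = list(columns)
    corr = [[1.0 if a == b else pearson(columns[a], columns[b]) for b in names]
            for a in names]
    inverse = invert(corr)
    # The j-th diagonal entry of the inverse correlation matrix is VIF_j.
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Synthetic data (NOT from the NASA Exoplanet Archive): mass is built to
# track radius closely, while eccentricity is drawn independently.
rng = random.Random(0)
radius = [rng.uniform(0.5, 4.0) for _ in range(500)]
mass = [r ** 2 * (1 + rng.gauss(0, 0.05)) for r in radius]
ecc = [rng.uniform(0.0, 0.6) for _ in range(500)]

scores = vif({"planet_radius": radius, "planet_mass": mass, "eccentricity": ecc})
# Single-pass filter using the VIF < 5 threshold quoted in the abstract.
retained = [name for name, value in scores.items() if value < 5]
print(scores, retained)
```

In this toy setup the two strongly correlated columns both exceed the threshold, while the independent column stays near 1. A single-pass filter like this drops both correlated variables at once; removing the highest-VIF variable and recomputing is a common alternative.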
Parameter uncertainties are incorporated through Monte Carlo perturbations to assess prediction stability, showing minimal impact on overall accuracy but possible biases in borderline cases. We take the intersection of the habitable exoplanets identified by the seven defining parameters and verify it against the twelve low-VIF parameters, confirming consistent classification and making habitability assessments more reliable. Our findings highlight the potential of machine learning techniques to prioritize exoplanet targets for future observations, providing a fast and understandable approach for habitability assessment.
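The Monte Carlo stability check mentioned above might look roughly like the following Python sketch: each measured parameter is perturbed with Gaussian noise scaled by its quoted uncertainty, the object is re-classified, and the fraction of draws that keep the nominal label is reported. Everything here is an assumption for illustration: `toy_classifier` and its thresholds stand in for a trained model and are not the article's criteria, the parameter values and uncertainties are invented, and the article does not specify the noise model or number of draws it used.

```python
import random


def toy_classifier(planet):
    """Hypothetical stand-in for a trained model (NOT the article's criteria)."""
    return 0.5 <= planet["radius"] <= 1.6 and planet["eccentricity"] < 0.2


def label_stability(planet, sigma, classify, n_draws=2000, seed=0):
    """Fraction of perturbed draws whose label matches the nominal label.

    planet: dict of nominal parameter values.
    sigma:  dict of 1-sigma uncertainties (missing keys are left unperturbed).
    Physical bounds (e.g. eccentricity >= 0) are not enforced in this sketch.
    """
    rng = random.Random(seed)
    nominal = classify(planet)
    agree = 0
    for _ in range(n_draws):
        draw = {key: rng.gauss(value, sigma.get(key, 0.0))
                for key, value in planet.items()}
        agree += classify(draw) == nominal
    return agree / n_draws


# Invented example objects: radius in Earth radii (assumed unit).
clear_case = {"radius": 1.0, "eccentricity": 0.05}
borderline = {"radius": 1.55, "eccentricity": 0.05}
uncertainty = {"radius": 0.1, "eccentricity": 0.02}

stable = label_stability(clear_case, {"radius": 0.05, "eccentricity": 0.02},
                         toy_classifier)
shaky = label_stability(borderline, uncertainty, toy_classifier)
print(f"clear case: {stable:.3f}, borderline case: {shaky:.3f}")
```

An object far from every decision threshold keeps its label in essentially all draws, whereas one sitting close to a threshold flips in a sizeable fraction of them, which is the kind of borderline sensitivity the abstract describes.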

Keywords: Exoplanets, habitability, machine learning, random forest, XGBoost, Habitable zone, Planetary science, Variance inflation factor

Received: 28 Jul 2025; Accepted: 12 Nov 2025.

Copyright: © 2025 Banerjee and Chattopadhyay. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Asis Kumar Chattopadhyay, akcstat@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.