1 Introduction
The Open Access Series of Imaging Studies (OASIS) provides one of the most widely used neuroimaging datasets for dementia and Alzheimer's disease research. The first release, OASIS-1, introduced cross-sectional T1-weighted MRI scans and clinical assessments from young, middle-aged, and older adults, both with and without dementia (Marcus et al., 2007). This dataset established a benchmark for studying brain morphology and cognitive decline across the adult lifespan. Subsequently, OASIS-2 expanded upon this framework by including longitudinal data from older adults with repeated imaging sessions, allowing researchers to examine disease progression over time (Marcus et al., 2010). The distinguishing factor between the two datasets is the presence of the “group” attribute in the OASIS-2 dataset. This attribute identifies each patient as dementia, non-dementia, or converted, giving direct insight into the patient's disease status. Together, these datasets have served as foundational resources for machine learning (ML) and deep learning studies in dementia prediction.
Over the past decade, researchers have applied a variety of computational models to OASIS data to improve diagnostic accuracy and early detection of dementia. Baglat et al. (2020) and Shrivastava et al. (2023) demonstrated that multiple ML classifiers, including Support Vector Machines, Random Forests, and Logistic Regression, can effectively distinguish people with dementia from people without dementia using OASIS-derived clinical and imaging features. Similarly, Basheer et al. (2021) employed deep neural network architectures, such as convolutional and capsule networks, achieving high predictive accuracy for Alzheimer's disease classification. Aslan and Özüpak (2025) compared a range of ML algorithms for automated Alzheimer's disease prediction, highlighting that gradient boosting and ensemble models outperform traditional classifiers in feature-rich neuroimaging datasets. Consistent with this, Shaik et al. (2025) used the OASIS dataset to predict dementia stages via deep learning, emphasizing the relevance of brain volumetric and cognitive test features in classification. Earlier, Lee and Abdullah (2019) had already demonstrated that combining neuroimaging with cognitive metrics can substantially improve predictive performance.
Despite this body of work, most existing studies focus on improving accuracy or model complexity, while less attention is given to bias propagation, particularly along demographic lines such as sex. Sex-based differences in brain morphology, cognitive aging, and dementia prevalence are well established, yet many ML models trained on neuroimaging data inherit or even amplify these biases. Such biases can arise from unequal representation, data preprocessing, or label inference procedures, issues that are rarely documented or quantitatively evaluated. The OASIS data are imbalanced with respect to sex: both OASIS-1 and OASIS-2 contain more female than male subjects, which induces a sex imbalance in AI models trained on them.
This study constructs a unified, cross-sectional, and sex-balanced dataset by integrating the OASIS-1 and OASIS-2 studies. The longitudinal OASIS-2 data were reformatted into cross-sectional-snapshot data and used to train a predictive model that generated dementia labels for OASIS-1, after which both datasets were merged. A K-Nearest Neighbors-based Synthetic Minority Oversampling Technique (KNN-SMOTE) method was applied to correct sex imbalance, and statistical evaluation confirmed that the combined dataset retained the original sex-related patterns while expanding sample coverage for bias-aware dementia research.
This research therefore contributes a novel harmonized dataset, OASIS-SB, for dementia research, together with an empirical demonstration of how preprocessing and augmentation pipelines can retain real-world sex-linked patterns despite class balancing. The dataset serves as a transparent and reproducible foundation for future studies exploring fairness, bias quantification, and equitable modeling in computational neuroscience. Importantly, OASIS-SB is derived exclusively from phenotypic and volumetric measurements extracted from OASIS and does not include MRI images; rather, it represents a statistically validated synthetic augmentation intended for cross-sectional analysis.
2 Methods
2.1 Dataset description
Two complementary subsets from the Open Access Series of Imaging Studies (OASIS) were used in this work: OASIS-1 and OASIS-2. The first dataset (Marcus et al., 2007) consists of cross-sectional magnetic resonance imaging (MRI) and clinical measures for adults spanning a wide age range, while the second (Marcus et al., 2010) contains repeated MRI sessions for older participants, enabling longitudinal observation of cognitive decline.
OASIS-2 includes explicit dementia ratings for each visit, whereas OASIS-1 does not. This structural difference restricts their combined use for statistical analysis. This study therefore uses OASIS-2 as a labeled reference for predictive model training and OASIS-1 as a source of additional unlabeled samples for inference. Integrating the two datasets yields a substantially larger dataset whose statistics resemble those of real-world data. Across both datasets, key clinical and anatomical variables were retained: age, sex, education (EDUC), socioeconomic status (SES), Mini-Mental State Examination (MMSE), Clinical Dementia Rating (CDR), estimated total intracranial volume (eTIV), normalized whole-brain volume (nWBV), and atlas scaling factor (ASF).
2.2 Data preprocessing and target definition
To maintain consistency across both datasets, variable names and formats were standardized before merging. Continuous variables were z-scored, and categorical attributes such as sex were numerically encoded (female = 1, male = 0).
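The standardization and encoding step can be illustrated with a minimal sketch. It assumes sex is stored as "F"/"M" strings and that z-scores use the population standard deviation; the record layout and the toy values are hypothetical and are not taken from OASIS.

```python
from statistics import fmean, pstdev

# Encoding stated in the text: female = 1, male = 0.
# The "F"/"M" string codes are an assumption about the raw column.
SEX_CODES = {"F": 1, "M": 0}


def zscore(values):
    """Return (x - mean) / std for each value, using the population std."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy records; the values are illustrative, not OASIS data.
records = [
    {"sex": "F", "age": 70.0},
    {"sex": "M", "age": 80.0},
    {"sex": "F", "age": 90.0},
]

age_z = zscore([r["age"] for r in records])
sex_encoded = [SEX_CODES[r["sex"]] for r in records]
```

After this step `age_z` has zero mean and unit variance, so distance-based methods applied later are not dominated by variables with large raw ranges such as eTIV.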
The longitudinal structure of OASIS-2 was flattened by treating each visit as an independent observation. Session identifiers were retained to allow optional subject-level aggregation. This transformation enabled compatibility with the single-timepoint structure of OASIS-1 and ensured that each row corresponded to one MRI-clinical snapshot.
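As a rough illustration of the flattening described above, the sketch below turns a nested subject-to-visits structure into one row per visit while keeping the identifiers needed for later subject-level aggregation. The nested layout, the field names, and the values are hypothetical.

```python
# Hypothetical nested layout: one entry per subject, each with a list of visits.
subjects = {
    "S001": [
        {"visit": 1, "MMSE": 29, "nWBV": 0.74},
        {"visit": 2, "MMSE": 28, "nWBV": 0.73},
    ],
    "S002": [
        {"visit": 1, "MMSE": 22, "nWBV": 0.70},
    ],
}


def flatten_visits(subjects):
    """Emit one row per visit, keeping subject and visit identifiers."""
    rows = []
    for subject_id, visits in subjects.items():
        for visit in visits:
            row = {"subject_id": subject_id}
            row.update(visit)
            rows.append(row)
    return rows


rows = flatten_visits(subjects)
```

Each element of `rows` is then one MRI-clinical snapshot, and grouping by `subject_id` recovers the original longitudinal structure if needed.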
Missing entries in the continuous SES attribute of the OASIS-2 dataset were imputed using a Random Forest Regressor trained on complete cases within OASIS-2. The model was optimized via five-fold cross-validation to minimize the mean absolute error (MAE) across numerical variables, achieving ROC-AUC = 0.9263. This approach preserved multivariate dependencies more effectively than mean substitution or deletion.
The Clinical Dementia Rating (CDR) served as the ground-truth label for supervised learning. To construct a binary target suitable for classification, subjects with CDR = 0 were assigned to the class of people without dementia, and those with CDR ≠ 0 to the class of people with dementia. Participants labeled as Converted, that is, those who transitioned from CDR = 0 to a dementia rating (CDR > 0) between visits, were excluded to avoid ambiguity in disease state. The result was a clean two-class training target.
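The target definition can be sketched as a small filter-and-label function. The column names "Group" and "CDR" and the group values are assumptions about the tabular layout; the rows are toy examples.

```python
def build_binary_target(rows):
    """Drop 'Converted' participants, then label CDR == 0 as 0 and CDR != 0 as 1."""
    kept = [r for r in rows if r["Group"] != "Converted"]
    labels = [0 if r["CDR"] == 0 else 1 for r in kept]
    return kept, labels


# Toy visits; not OASIS data.
visits = [
    {"Group": "Nondemented", "CDR": 0.0},
    {"Group": "Demented", "CDR": 0.5},
    {"Group": "Converted", "CDR": 0.0},
    {"Group": "Demented", "CDR": 1.0},
]

kept, labels = build_binary_target(visits)
```

Excluding the Converted group before labeling keeps visits whose disease state changed over time out of both classes.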
Final predictors included demographic, cognitive, and volumetric measures [Age, Sex, EDUC, SES, MMSE, CDR, eTIV, nWBV, ASF], each normalized to unit scale prior to model fitting. This standardized preprocessing pipeline yielded a high-quality, analysis-ready dataset from OASIS-2 that could be used to train predictive models and infer dementia labels in OASIS-1.
2.3 Model training and label prediction
Multiple machine learning models, including Random Forest, regression models, and boosting models, were tested to identify the best-fitting model for predicting the OASIS-1 target. Based on this initial comparison and on a previously published mental health prediction study (Dhariwal et al., 2024), an Extreme Gradient Boosting (XGBoost) classifier was trained on the preprocessed OASIS-2 dataset to predict dementia status. The model is a tree-ensemble framework optimized for binary logistic loss, with a learning rate of η = 0.03, a maximum depth of 3, and both subsample and column-sampling ratios set to 0.9. Hyperparameters were optimized using a grid search with five-fold stratified cross-validation. The data were divided into training (80%) and testing (20%) partitions, stratified by dementia label, and model performance was assessed using accuracy, recall, F1-score, and AUC.
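The stated configuration and the stratified 80/20 partition can be written down as a short sketch. The dictionary uses XGBoost parameter names; mapping the "column-sampling ratio" to `colsample_bytree` is an assumption, as are the random seed and the toy labels. The split helper is a plain-Python stand-in for whatever library routine was actually used.

```python
import random

# Settings stated in the text, written with XGBoost parameter names.
# "colsample_bytree" is an assumption: the text only says "column-sampling ratio".
xgb_params = {
    "objective": "binary:logistic",
    "learning_rate": 0.03,
    "max_depth": 3,
    "subsample": 0.9,
    "colsample_bytree": 0.9,
}


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions kept in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 60 + [1] * 40  # toy labels, not OASIS data
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class separately is what keeps the share of dementia cases the same in the training and testing partitions.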
Using five-fold cross-validation on the training data, the XGBoost classifier achieved a mean accuracy of 97.59% (±1.55%), an F1-score of 97.31% (±1.70%), and a recall of 97.02% (±1.86%). When evaluated on a held-out 20% test set, the final model obtained an accuracy of 97.33%, a recall of 94.12%, an F1-score of 96.97%, and a ROC-AUC of 0.9706. An additional discrimination metric on the test set was a PR-AUC of 0.9794.
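To make the relationship between the reported test metrics concrete, the sketch below computes accuracy, recall, and F1 from confusion-matrix counts. The counts are hypothetical: the confusion matrix is not reported here, and these values were chosen only because they reproduce the quoted test-set percentages after rounding. Other counts could give the same rounded figures.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts (not from the source), consistent with the quoted metrics.
metrics = classification_metrics(tp=32, fp=0, fn=2, tn=41)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
# percentages: accuracy 97.33, recall 94.12, precision 100.0, f1 96.97
```

Under these assumed counts, the gap between recall and accuracy comes entirely from two missed dementia cases, with no false positives.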
As shown in Figure 1, the model exhibited stable convergence with minimal overfitting across training iterations. These results indicate that the model is well fitted, without evident over- or underfitting. Feature importance scores extracted from the fitted model indicated that CDR, SES, and EDUC were the three strongest predictors of dementia classification.
Figure 1. Training and validation log loss across boosting iterations showing stable model convergence.
The tuned XGBoost model was then used to predict the “group” attribute in the unlabeled OASIS-1 dataset, inferring dementia status for each participant. Because the OASIS-2 data had been converted to snapshot-based cross-sectional data, applying the trained XGBoost model to the single-timepoint OASIS-1 data is statistically sound. For every subject, the classifier produced a binary prediction (y = 0 or 1). These inferred labels were then appended to the OASIS-1 dataset, generating a complete diagnostic column “group” consistent with the real-world correlations in OASIS-2. Finally, both datasets were concatenated to form an integrated cross-sectional dataset. This dataset had 558 patient records, with a stark imbalance between male and female patients: 163 more female subjects than male. This unified dataset served as the foundation for subsequent class-balancing and sex-bias evaluation procedures.
2.4 Class balancing by sex
A KNN-based SMOTE algorithm was applied to synthetically augment the phenotypic feature space for class balancing. The core idea at this step was to up-sample the male subjects without down-sampling the female subjects, which would discard valuable information. The algorithm generates synthetic instances for the minority class by interpolating between each male sample and its k = 5 nearest male neighbors in feature space, producing non-duplicated records that stay close to the observed data. All numeric features were standardized prior to oversampling to ensure that distance computations remained scale-independent. The resampling process was restricted to the male subgroup and continued until a 1:1 female-to-male ratio was reached. Following resampling, basic statistical diagnostics confirmed that the means and variances of key variables (Age, MMSE, eTIV, and nWBV) remained within ±2% of their pre-KNN-SMOTE values, indicating preservation of the dataset's core statistical structure. The resulting sex-balanced dataset was then used for subsequent correlation analyses to evaluate whether equal representation altered the intrinsic relationships between sex and other neuroanatomical or cognitive features.
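The interpolation at the heart of this step can be sketched in a few lines. This is a simplified illustration and not the implementation used in the study: it picks base samples at random, uses Euclidean distance on already-standardized features, and runs on toy data. The function name, the seeds, and the group sizes are hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating toward nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of the base point, searched within the minority class only
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy standardized features for the minority (male) group; not OASIS data.
data_rng = random.Random(42)
males = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(20)]
n_females = 30  # hypothetical majority count

new_males = smote_oversample(males, n_new=n_females - len(males))
balanced_male_count = len(males) + len(new_males)
```

Because every synthetic point lies on a segment between two observed male samples, it cannot fall outside the range of the observed data, which is one reason summary statistics tend to shift only slightly after oversampling.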
3 Data analysis
The KNN-SMOTE-balanced dataset, referred to as the OASIS-SB dataset (712 records), was then subjected to Pearson's correlation test to evaluate whether the sex-balanced dataset follows the original statistical structure of OASIS-2. Pearson's correlation coefficient (r) was used to quantify the linear association between each attribute of the dataset and the “sex” attribute, and significance was assessed using two-tailed p-values. The correlations were calculated against the “sex” column because the KNN-SMOTE sampling was applied along that column; the analysis therefore tests whether the oversampling preserved the sex-related statistics. Table 1 summarizes the correlation coefficients before and after application of the KNN-SMOTE procedure.
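For reference, the correlation between a binary sex code and a continuous feature can be computed as in the sketch below. It returns Pearson's r and the corresponding t statistic with n − 2 degrees of freedom; the two-tailed p-value would then be read from the t distribution (for example, `scipy.stats.pearsonr` returns both r and the p-value). The toy values are illustrative and are not OASIS measurements.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t value for testing r = 0, with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Toy data: sex coded female = 1, male = 0, with a made-up volume feature.
sex = [1, 1, 1, 0, 0, 0]
etiv = [1400, 1450, 1380, 1600, 1650, 1580]

r = pearson_r(sex, etiv)
t = t_statistic(r, len(sex))
```

With female coded as 1, a negative r means the feature tends to be smaller in female subjects, which is the sign convention behind the sex-eTIV correlation discussed in this section.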
In both the OASIS-2 and the OASIS-SB datasets, the direction and magnitude of sex-feature correlations remained nearly identical, with statistically significant p-values. This indicates that the KNN-SMOTE oversampling process retained the core underlying sex-related feature relationships. Thus, the new sex-balanced data (equal numbers of male and female subjects) show strong similarity to real-world data. The strong associations remained consistent: negative between sex and total intracranial volume (eTIV; r = −0.57) and positive between sex and atlas scaling factor (ASF; r = 0.56), in line with known anatomical differences between male and female brain volumes. Cognitive measures such as MMSE and CDR exhibited weak but statistically significant correlations with sex in both datasets, while socioeconomic status (SES) and age showed no significant association.
The preservation of correlation strength and significance across all key variables demonstrates that the class-balancing procedure equalized representation without altering the intrinsic dataset structure. This outcome confirms that the resulting OASIS-SB dataset (the balanced OASIS-1+2 combination) maintains the original sex-linked statistical characteristics of OASIS-2, thereby retaining the natural bias present in real-world neuroimaging data while providing a larger and more balanced sample for downstream analyses.
4 Conclusion
This study presents OASIS-SB, a unified, cross-sectional, and sex-balanced synthetic dataset derived from the OASIS-1 and OASIS-2 studies, enabling large-scale analysis of dementia prediction while maintaining transparency in bias propagation. By converting the longitudinal OASIS-2 data to a cross-sectional format, imputing missing values, and employing an XGBoost classifier to infer dementia status for OASIS-1 participants, we generated a comprehensive dataset with consistent diagnostic labeling. Subsequent KNN-SMOTE balancing achieved equal representation of male and female participants without distorting the underlying feature distributions.
Correlation analyses demonstrated that the relationships between sex and key anatomical or cognitive variables, particularly eTIV, ASF, nWBV, and CDR, remained statistically consistent with those observed in the original OASIS-2 dataset. This indicates that while the dataset was successfully balanced, the intrinsic sex-linked structure was preserved, providing a realistic foundation for fairness-aware modeling.
The resulting OASIS-SB dataset expands available cross-sectional data for dementia research and offers a transparent benchmark for studying sex bias in neuroimaging-based prediction models. Future work will extend this approach to multi-cohort harmonization and evaluate bias mitigation strategies using fairness-aware learning frameworks.
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary material.
Ethics statement
The data analyzed in this study were obtained from the Open Access Series of Imaging Studies (OASIS-1 and 2) dataset, which comprises human neuroimaging and clinical data collected by the Washington University Alzheimer's Disease Research Center. All participants provided written informed consent at the time of enrollment, and all study procedures were approved by the Institutional Review Board (IRB) of Washington University School of Medicine. The OASIS-2 dataset is publicly released in a fully de-identified form. The present study involved secondary analysis of anonymized data and therefore did not require additional ethical approval.
Author contributions
ND: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fncom.2025.1744217/full#supplementary-material
References
Aslan, E., and Özüpak, Y. (2025). Comparison of machine learning algorithms for automatic prediction of Alzheimer disease. J. Chin. Med. Assoc. 88, 98–107. doi: 10.1097/JCMA.0000000000001188
Baglat, P., Salehi, A. W., Gupta, A., and Gupta, G. (2020). “Multiple machine learning models for detection of Alzheimer's disease using oasis dataset,” in Re-Imagining Diffusion and Adoption of Information Technology and Systems: A Continuing Conversation, Volume 617 of IFIP Advances in Information and Communication Technology, eds. S. K. Sharma, Y. K. Dwivedi, B. Metri, and N. P. Rana (Cham: Springer), 614–622. doi: 10.1007/978-3-030-64849-7_54
Basheer, S., Bhatia, S., and Sakri, S. B. (2021). Computational modeling of dementia prediction using deep neural network: analysis on oasis dataset. IEEE Access 9, 42449–42462. doi: 10.1109/ACCESS.2021.3066213
Dhariwal, N., Sengupta, N., Madiajagan, M., Patro, K. K., Kumari, P. L., Abdel Samee, N., et al. (2024). A pilot study on AI-driven approaches for classification of mental health disorders. Front. Hum. Neurosci. 18:1376338. doi: 10.3389/fnhum.2024.1376338
Lee, K. L., and Abdullah, A. A. (2019). Machine learning approach for Alzheimer's disease detection using MRI data. J. Phys. Conf. Ser. 1372:012065. doi: 10.1088/1742-6596/1372/1/012065
Marcus, D. S., Fotenos, A. F., Csernansky, J. G., Morris, J. C., and Buckner, R. L. (2010). Open access series of imaging studies: longitudinal MRI data in nondemented and demented older adults. J. Cogn. Neurosci. 22, 2677–2684. doi: 10.1162/jocn.2009.21407
Marcus, D. S., Wang, T. H., Parker, J., Csernansky, J. G., Morris, J. C., Buckner, R. L., et al. (2007). Open access series of imaging studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. J. Cogn. Neurosci. 19, 1498–1507. doi: 10.1162/jocn.2007.19.9.1498
Shaik, N. J., Mohammad, R., Mylavarapu, H., Modukuru, H., and Mallireddy, R. R. (2025). “Predicting the stages of dementia using the oasis dataset,” in 2025 12th International Conference on Emerging Trends in Engineering & Technology- Signal and Information Processing (ICETET-SIP) (Nagpur), 1–6. doi: 10.1109/ICETETSIP64213.2025.11156215
Shrivastava, R. K., Singh, S. P., and Kaur, G. (2023). “Machine learning models for Alzheimer's disease detection using oasis data,” in Data Analysis for Neurodegenerative Disorders, Cognitive Technologies, eds. D. Koundal, D. K. Jain, Y. Guo, A. S. Ashour, and A. Zaguia (Singapore: Springer). doi: 10.1007/978-981-99-2154-6_6
Keywords: computational neuroscience, dementia prediction, fairness in AI, neuroimaging, OASIS dataset, sex bias, SMOTE, XGBoost
Citation: Dhariwal N (2026) OASIS-SB: a sex-balanced, distribution-preserving, synthetic phenotypic dataset for bias-resilient clinical prediction. Front. Comput. Neurosci. 19:1744217. doi: 10.3389/fncom.2025.1744217
Received: 11 November 2025; Revised: 20 December 2025; Accepted: 23 December 2025; Published: 16 January 2026.
Edited by:
Natasha Clarke, Université de Montréal, Canada
Reviewed by:
Hao-Ting Wang, Université de Montréal, Canada
Copyright © 2026 Dhariwal. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Naman Dhariwal, bmFtYW5kQHVtaWNoLmVkdQ==