AUTHOR=Hassani Hossein, Entezarian Mohammad Reza, Zaeimzadeh Sara, Marvian Leila, Komendantova Nadejda TITLE=An oversampling-undersampling strategy for large-scale data linkage JOURNAL=Frontiers in Big Data VOLUME=Volume 8 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2025.1542483 DOI=10.3389/fdata.2025.1542483 ISSN=2624-909X ABSTRACT=Effective record linkage in big data, particularly in imbalanced datasets, is a critical yet highly challenging task due to the inherent complexity involved. This article uses an oversampling-undersampling strategy to address linkage imbalances, enabling more accurate and efficient record linkage within large-scale datasets. The strategy increases the number of minority-class instances and reduces the dominance of the majority classes, yielding a more balanced dataset for training and testing. Sensitivity testing was carried out by varying the training-test ratio and the degree of imbalance.
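
The general idea described in the abstract, growing the minority class while shrinking the majority class, can be illustrated with plain random resampling. The sketch below is not the authors' method: the function name `rebalance`, the `target_per_class` parameter, and the toy candidate pairs are all hypothetical, and it assumes a simple two-class setting in which label 1 marks a matching record pair and label 0 a non-matching pair.

```python
import random
from collections import Counter


def rebalance(pairs, labels, target_per_class, seed=0):
    """Randomly resample every class to exactly target_per_class items.

    Hypothetical helper for illustration only. Classes smaller than the
    target are oversampled (original items kept, extras drawn with
    replacement); larger classes are undersampled without replacement.
    """
    rng = random.Random(seed)

    # Group the candidate pairs by their class label.
    by_class = {}
    for pair, label in zip(pairs, labels):
        by_class.setdefault(label, []).append(pair)

    out_pairs, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target_per_class:
            # Undersample: keep a random subset of the larger class.
            chosen = rng.sample(items, target_per_class)
        else:
            # Oversample: keep all originals and add random duplicates.
            extra = rng.choices(items, k=target_per_class - len(items))
            chosen = items + extra
        out_pairs.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_pairs, out_labels


# Toy data (assumed): 5 matching pairs and 95 non-matching pairs.
pairs = [("a%d" % i, "b%d" % i) for i in range(100)]
labels = [1] * 5 + [0] * 95

new_pairs, new_labels = rebalance(pairs, labels, target_per_class=20)
print(Counter(new_labels))  # both classes now have 20 items
```

Varying `target_per_class` changes the degree of imbalance in the resampled set, which is one simple way to set up the kind of sensitivity testing the abstract mentions.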