- 1Department of Computer Science, Green International University, Lahore, Pakistan
- 2Department of Computer Science and IT, The University of Lahore, Lahore, Punjab, Pakistan
- 3College of Information Science and Technology, Hainan Normal University, Haikou, China
- 4Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh, Saudi Arabia
- 5OTEHM, Manchester Metropolitan University, Manchester, United Kingdom
- 6Department of Computer and Electrical Engineering, MidSweden University, Sundsvall, Sweden
Millions of people worldwide suffer from β-thalassemia, an inherited blood disorder that requires precise carrier screening to avoid serious health issues. Conventional centralized screening techniques rely on combining patient data, which raises privacy and legal issues under regulations like General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). Although machine learning has increased the accuracy of diagnoses, its reliance on shared data raises issues with data security and makes it more difficult to develop collaborative models. Federated learning provides a solution by allowing multi-center collaboration and protecting privacy by training models locally at each clinical site and sharing only model parameters. In this work, a federated multi-kernel support vector machine (SVM) framework is developed, which aggregates updates through federated averaging and deploys linear, polynomial, radial basis function, and Deep kernel (DK) on client devices. We also incorporate explainable AI methods SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) to decipher forecasts and pinpoint important hematological characteristics. Our federated model performs on par with centralized methods, achieving 98.4% accuracy, 99.2% sensitivity, and 98.8% specificity when tested on 5,066 complete blood count records. The most significant predictors, according to the explainability analyses, are hemoglobin level and mean corpuscular volume. Our results open the door for scalable, transparent, and compliant β-thalassemia screening across dispersed healthcare systems by demonstrating that federated multi-kernel SVMs with Explainable Artificial Intelligence (XAI) can provide high diagnostic performance while protecting patient privacy.
1 Introduction
Thalassemia is blood related disease, derived from two Greek words, i.e., “Thalassa” means sea and “Hema” means blood. In 1936, it was first identified by George Whipple (1). It is a widespread disease affecting most of the population globally, specifically Mediterranean, Africa, the Middle East, Asia, and Southeast Asia (2, 3). Annually, around 68,000 newborns are diagnosed with various types of Thalassemia, with approximately 80–90 million people being subjected to β-Thalassemia, which makes it 1.5% of the total world’s population. It has been revealed by the reports from the World Health Organization (WHO) that about 5.1% of the global population carries the β-Thalassemia gene. Moreover, recent research has depicted that over 350 mutations are linked to β-Thalassemia, out of which only 20 account for over 80% of these caused by geographical clustering that influences mutation distributions with migration patterns (4). Another prevalent disease that is affecting people worldwide is Anemia, closely associated with Thalassemia. Anemia is defined as hemoglobin levels below 12.0 g/dL in women and 13.0 g/dL in men, as defined by WHO, with factors, i.e., sex, ethnicity, and physiological status, necessitating more refined guidelines. Multiple causes for anemia require doctors to assess hematologic parameters, health conditions, and patient history for accurate diagnosis (5). Thalassemia results in reduced hemoglobin production that leads to severe anemia as mutations affect the hemoglobin structure, primarily alpha or beta globin chains, thereby disrupting their function and further complicating the health condition (6). Thalassemia is classified based on its severity (trait, carrier, intermediate, major) and affected hemoglobin components (alpha or beta). In addition, the structure of hemoglobin is composed of two alpha and two beta chains that are crucial for oxygen transportation. Thus, deficiency in either component may lead to alpha or beta Thalassemia, respectively (7). In Pakistan, approximately 9.8 million individuals are carriers of β-Thalassemia. Thus, early detection is necessary. The high prevalence of carriers is a key concern, as there is a 25% probability of a child inheriting the disease if both parents are carriers. Compounding this, many carriers are unaware of their status due to negligence. However, timely screening could prevent severe health complications and fatalities (8). Diagnosing and distinguishing Thalassemia from other Hemoglobin (Hb) diseases is currently challenging because clinical data, including blood smear images can be unreliable. Moreover, the traditional screening techniques, for example, Deoxyribonucleic acid (DNA) testing, high-performance liquid chromatography (HPLC), hemoglobin electrophoresis, and Polymerase Chain Reaction (PCR) mutation analysis, are proven to be very effective but costly that requiring specialized equipment, making accessibility difficult in an area with limited resources (9, 10).
However, these challenges are overcome by the advancements in machine learning (ML) in various domains, including medical diagnostics, autonomous systems, and language processing. ML operates on self-improving algorithms that are capable of analyzing large datasets, recognizing complex patterns, and optimizing predictive accuracy. These algorithms model refined data distributions using statistical and probabilistic approaches, making them effective in the medical sector. It has been found that ML-based diagnostic approaches have significant applications in the medical field, such as diagnosing brain tumors, lung disorders, and iron deficiency anemia (11–14). Several AI based models have been implemented for the improvement of disease detection, including K-nearest neighbors (KNN) (15–17), deep neural networks (DNN) (18, 19), and support vector machines (SVM) (20, 21). Moreover, to combat the widespread prevalence of thalassemia, it is important to adopt advanced diagnostic systems instead of relying on conventional screening. The necessity for low-cost, accessible, and efficient screening solutions becomes challenging, especially in underprivileged regions. This study aims to establish a cost-effective and accurate method for detecting β-Thalassemia carriers using machine learning techniques on red blood cell indices obtained from complete blood count (CBC) tests. The study leverages a dataset of 5,066 patient blood tests and Federated Learning to improve model performance while keeping data decentralized. In parallel, tightening federal (HIPAA) and emerging state-level health privacy statutes (e.g., New York and California) are driving the integration of privacy-preserving and AI-based monitoring mechanisms to ensure regulatory compliance in medical AI systems (22).
This study introduces Federated Learning for β-thalassemia carrier detection to enable collaborative model training on distributed CBC datasets while avoiding the centralization of raw patient data, thereby mitigating HIPAA/GDPR-related privacy and security risks. By training multi-kernel SVMs across heterogeneous clinical sites and aggregating models rather than data, the framework aims to achieve high diagnostic accuracy, robust generalization, and regulatory-compliant, privacy-preserving large-scale screening of β-thalassemia carriers.
The proposed study integrates Federated Learning with machine learning algorithms, including Support Vector Machine (SVM) with multiple kernel variations, to develop a highly accurate β-Thalassemia screening model. This research introduces a novel method that ensures high accuracy as compared to previous approaches that relied on preprocessing techniques. This model is constructed to classify individuals as β-Thalassemia carriers with greater precision by training three machine learning models. The proposed method is reliable and affordable that providing an effective solution for early detection and prevention of the disease. Moreover, this paper is structured with Section 2 reviewing existing literature, Section 3 describing the method and material, Section 4 presenting the proposed framework, Section 5 outlining the results and discussion, and Section 6 concluding the article. The focus of this study is to provide an innovative and privacy-preserving framework for the detection of β-Thalassemia by utilizing Federated Learning to improve public health outcomes.
2 Related work
Detecting a thalassemia carrier is challenging as the symptoms are not very obvious, also medical expertise and detailed patient data are required for its identification, which makes it is difficult and at the same time expensive to assess the early detection. Thus, an automated system for predicting thalassemia carriers is crucial to prevent this disease from spreading to future generations in future. The effectiveness of machine learning models has various applications involving a significant importance of these models in carrier prediction for thalassemia patients (20). In one of the recent studies conducted in 2024 by Uçucu and Azik, a machine learning model to differentiate between iron deficiency anemia (IDA) and β-thalassemia minor (BTM) using only complete blood count (CBC) data was introduced. It was found that artificial neural networks (ANNs) outperformed by offering a rapid and cost-effective alternative to conventional methods. Thus, by the incorporation of AI, healthcare providers can more efficiently and accurately distinguish between IDA and BTM, which leads them to an improvement in patient care (23). Recent advancements in thalassemia treatments are transforming patient care. For example, in North America, these new techniques are utilized for measuring tissue iron and innovative therapies for iron removal. Life expectancy has been increased by these medical advancements, but with these survival rates, new challenges are rising, including endocrinopathies and hepatitis, significantly impacting quality of life (24). In light of these challenges, machine learning and federated learning offer substantial potential for enhancing the diagnostic process of thalassemia. Federated learning, in particular, is a promising approach, allowing multi-center collaborations while ensuring patient data security by training models locally and sharing only model parameters. This method addresses concerns related to centralized data aggregation, which often raise issues regarding data privacy under regulations like GDPR and HIPAA. Recent studies have demonstrated the effectiveness of machine learning techniques, such as multi-kernel support vector machines (SVMs), in diagnosing β-thalassemia, incorporating explainable AI (XAI) methods like SHAP and LIME to provide transparency and highlight key diagnostic features such as hemoglobin levels and mean corpuscular volume (25–27). Recent studies have demonstrated the effectiveness of machine learning techniques, such as multi-kernel support vector machines (SVMs), in diagnosing β-thalassemia, incorporating explainable AI (XAI) methods like SHAP and LIME to provide transparency and highlight key diagnostic features such as hemoglobin levels and mean corpuscular volume (25–27).
Incorporating federated learning with SVMs can significantly improve diagnostic performance across decentralized healthcare systems without compromising data privacy. For instance, federated multi-kernel SVM models have shown high diagnostic accuracy and sensitivity when tested on large datasets, making them ideal for large-scale β-thalassemia screening programs (28–30). Moreover, research into gene therapies and disease prediction models underscores the potential of combining AI with medical treatments, opening a path for more personalized care (31, 32). Furthermore, the integration of AI into medical diagnostics aligns with global regulatory frameworks, ensuring compliance with privacy standards while enhancing diagnostic capabilities (33, 34). The ongoing development of AI models, including advancements in protein localization prediction and gene-disease relationships, is expected to further enhance the effectiveness of AI-based diagnostics for β-thalassemia and other genetic disorders (35–38).
The study entitled “Identifying β-thalassemia carriers using a data mining approach: The case of the Gaza Strip, Palestine” emphasized the significance for the identification of identifying β-thalassemia carriers as a critical factor in effective genetic counseling and disease management, particularly in highly prevalent areas. The research employed data mining techniques and machine learning algorithms to analyze a dataset that contains genetic and hematological information of the Gaza Strip individuals. The results indicated that carriers can be effectively identified with these techniques to improve early detection. The application of data mining in a healthcare context in this paper introduces a novel approach that can enhance the screening programs in high-risk areas. However, this study is limited in generalizability because of the dataset’s demographic and geographic specificity (15). In 2023, Tran et al. explored the prevalence of thalassemia in Vietnam and developed a clinical decision support system (CDSS) for prenatal screening involving the analysis of over 10,000 medical records of first-time pregnant women and their partners. Four AI-based CDSSs, training machine learning models on 1,992 cases, were designed and evaluated with 1,555 cases. These AI-based systems demonstrated high accuracy in identifying thalassemia carriers, thereby suggesting AI for enhancing prenatal screening processes (39). Similarly, a study by Das et al. presented a valuable decision support scheme for beta thalassemia and HbE carriers that was effective, but while addressing data privacy, ethical implications are necessary for protecting sensitive genetic information (40). In Ferih et al. (41), researchers investigated AI-driven enhancements in thalassemia diagnosis with a review of machine learning and deep learning techniques used on clinical datasets. The results proved the effectiveness of AI in enhancing diagnostic accuracy and personalized treatment. This research’s limitations include dataset variability and broader validation for diverse populations (41).
The application of artificial intelligence and machine learning techniques has improved the identification of thalassemia gene carriers, specifically in non-anemic populations, which are often overlooked by conventional screening methods. Machine learning models are more trained on blood sample datasets, thereby ensuring higher accuracy in detecting carriers in comparison to traditional approaches. These models produce efficient results by the analysis of erythrocyte morphology and hematological parameters, i.e., hemoglobin levels, mean corpuscular volume, and red blood cell count. Many studies underscore the potential of deep learning, i.e., deep neural networks, in the early detection of thalassemia with greater prediction accuracy. However, the size and diversity of datasets remain a challenging factor for greater scalability in this research. As the AI-driven methods offer greater screening efficiency and more benefits to public health, these models also require further validation with larger and diverse datasets. This issue has been addressed by recent approaches that overcome the challenges found in identifying thalassemia gene carriers for individuals with no anemia, as all the characteristics of red blood cells can be analyzed via these innovations. For instance, research applied a deep neural network on indices of red blood cells, including hemoglobin level, mean corpuscular volume, and cell count. The data was sourced from thalassemia patients and healthy controls, where AI-based model results in improved carrier detection accuracy. These models examined erythrocyte morphology with index data, thereby surpassing conventional methods and indicating a more efficient screening method. However, the success of this methodological framework has a limitation with limited sample diversity and potential algorithmic biases that can alter the results (42).
Another article with the title as “Role of Red Cell Indices in Screening for Beta Thalassemia Trait: An Assessment of the Individual Indices and Application of Machine Learning Algorithm” utilized a dataset with red cell indices having parameters, i.e., mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and red blood cell count from beta thalassemia patients. In this research, machine learning algorithms are incorporated to improve screening accuracy. However, dataset details are not discussed in this article, which potentially affects the generalizability of results (43). Similarly, Bharath et al. (44) analyzed a dataset containing clinical and laboratory data from patients with alpha thalassemia and healthy individuals. The study demonstrated the effectiveness of machine learning classifiers to improve the screening accuracy. This research also utilized explainable AI techniques to interpret results with a greater understanding. Limitations still exist with details on dataset size and diversity (44). Another study classifies beta-thalassemia major and HbE/beta-thalassemia with the aid of deep learning techniques by applying them to image structure function images for accurate classification. The results demonstrated that the proposed study has achieved a higher accuracy, thereby showcasing the potential for deep learning in hematological diagnostics with improved precision and automated classification. The limitations include the need for extensive datasets across diverse populations (45).
Compliance measures are necessary for the Privacy concerns of AI in medical applications. Moore and Frye reviewed HIPAA’s historical background, patient health information (PHI), and privacy and security rules regarding healthcare data, emphasizing ongoing compliance challenges. The review depicted the critical role of HIPAA in protecting patient privacy and security (46). In the study (47), the authors analyzed the intersection of AI technologies and the HIPAA Privacy Rule, which focused on AI applications’ navigation in the health sector for patient privacy regulations. The article also discussed a few challenges while integrating AI into healthcare practices, along with guaranteed compliance with HIPAA standards. As digital dependence is rising, extensive data growth continues to increase with global privacy laws becoming imperative. For this purpose, Bakare et al. compared the EU’s General Data Protection Regulation (GDPR) with the U. S. privacy frameworks and analyzed all compliance challenges across jurisdictions (48). Moreover, ethical and regulatory concerns are also rising in AI integration for radiology. Pesapane et al. analyzed these challenges of AI as medical devices in the EU and U. S. frameworks and highlighted ethical considerations and regulatory landscapes that shape the AI deployment in healthcare (49). Similarly, Mennella et al. also explored the ethical and regulatory complexities of AI in healthcare and emphasized the frameworks that include AI adoption in clinical settings (50).
Ensemble learning reduces errors and manages the bias-variance trade-off to enhance robustness against noise and identify intricate data patterns by voting or weighted averaging the predictions of numerous base models. Because ensembles use different learners, they typically perform better than models in high-dimensional or class-imbalanced environments. By reducing errors and maintaining model strengths, these strategies transform weak learners into strong ones. In sensitive or remote settings, traditional ensemble systems require centralized access to all training data, which can present logistical and privacy challenges. Emerging federated and distributed ensemble frameworks balance performance gains with data confidentiality requirements by enabling collaborative training without sharing raw data, thereby overcoming these limitations. Recent developments in the classification of cancer from microarray data have demonstrated the effectiveness of hybrid feature selection techniques, which combine metaheuristic algorithms like Particle Swarm Optimization (PSO), Cuckoo Search, and Artificial Bee Colony (ABC) with filter techniques like minimum redundancy maximum relevance (mRMR). Through ensemble learning, Panigrati et al. showed how combining these techniques with classifiers such as Support Vector Machines (SVM), Random Forests, and Multilayer Perceptron (MLP) improves accuracy and lessens single-model bias. Sahu et al. built on this by using a two-stage mRMR–Binary Portia Spider Optimisation Algorithm (BPSOA) pipeline, which significantly reduced false-negative and misclassification rates while achieving peak accuracies of up to 99.79% across six cancer microarray datasets, with F1-scores exceeding 99% and Matthews Correlation Coefficients (MCCs) exceeding 96% (51, 52).
Sahoo et al., Despite targeted therapies, between 30 and 40 percent of patients with HER2-positive breast cancer experience metastases or relapse. They trained six base classifiers (SVM, logistic regression, decision tree, random forest, AdaBoost, and XGBoost) using TCGA data from 123 HER2-positive cases. They then combined the results using ensemble weighted averaging, soft voting, and hard voting. With the best performance, the weighted-averaging ensemble had 88.46% accuracy, 89.74% precision, 94.59% sensitivity, 73.33% specificity, 92.11% F-value, a Matthews correlation coefficient of 0.7107, and AUC = 0.903. This shows that combining H&E histopathology with clinical data can accurately predict metastasis and relapse, leading to better treatment decisions (53). Pand et al. describe that globally, the incidence of cancer is on the rise. Machine learning (ML) applied to high-dimensional microarray data can facilitate this screening, but without efficient feature selection, performance deteriorates. Therefore, we propose BIMSSA, a pipelined feature-selection framework that optimizes feature size using the Swarm Optimisation Algorithm (SSA) after applying Boruta and Improved Maximum Relevance Minimum Redundancy (IMRMR) to isolate relevant gene subsets. As base learners, five classifiers are used: SVM, Random Forest, Extreme Learning Machine, AdaBoost, and XGBoost. The top three are chosen by majority vote to create the final ensemble. Tested on four cancer microarray cohorts (ALL-AML, Lymphoma, MLL, and SRBCT), BIMSSA demonstrated a strong ability to predict a variety of cancer types and provided a potent diagnostic tool for researchers and clinicians with accuracies of 96.7, 96.2, 95.1, and 97.1%, respectively (54).
A framework to guarantee the confidentiality and security of patient datasets has emerged, which is privacy-preserving machine learning (PPML). Various advanced algorithms have been developed to address these concerns, while federated learning technique stands out as it enables model training across diversely distributed data sources without centralizing raw data. Differential privacy further has enhanced the security by controlling noise in datasets and preserving the privacy of all individuals. Other notable techniques that anonymize predictions through ensemble models include Private Aggregation of Teacher Ensembles (PATE). Moreover, Secure Multi-Party Computation (SMPC) allows collaborative computation without exposure to the data. De-identification processes are also important to strip datasets of personally identifiable information. In our study, federated learning is prioritized as it has greater scalability, ability to preserve data locality, and efficiency, which makes it suitable for healthcare applications.
Li et al. characterized federated learning as a decentralized machine learning that can facilitate collaborative model training across multiple entities without the exchange of data, as it is critical in privacy-sensitive sectors including healthcare, finance, and telecommunications (55). Furthermore, federated learning applications can enhance data security and improve model adaptability, thereby reducing operational costs, aligning with regulatory standards, i.e., GDPR. Both of the sources have emphasized their dual role to reduce privacy risks with diverse data resources, underscoring the significance of enabling secure and collaborative learning without compromising sensitive information (56). Shen et al. explored the transformation of distributed machine learning to federated learning and emphasized its role in addressing privacy concerns in data. It also described federated learning as a decentralized mechanism enabling multiple participants to collaboratively train machine learning models without sharing any raw data while proving its significance in healthcare, finance, and telecommunications. The research emphasized that federated learning not only alleviates privacy concerns by maintaining data locality but also enhances model generalization with diverse datasets. However, the risks for potential privacy, such as inferring sensitive information, persist with model updates (57).
The Contribution of the proposed work is the creation of a sophisticated, privacy-preserving federated learning framework specifically designed for the early identification of thalassemia carriers. The proposed model represents a significant advancement in medical diagnosis and privacy-focused AI applications, decentralized training across multiple data sources without compromising sensitive patient information.
The key contributions and achievements of this model as:
• Proposed federated learning architecture ensures the privacy of patients’ data is maintained accordingly to international laws like HIPAA/GDPR-compliant privacy. It allows several clinical sites to work together to train SVM models on local CBC data without exchanging raw patient records.
• Multi-Kernel SVM Optimization at the Client Side: We apply and contrast three SVM kernels (Linear, Polynomial, and RBF) on decentralized data, showing how each identifies unique hematological patterns and offers supplementary information.
• Proposed Global Model Aggregation attain 98.4% accuracy and strong generalization across diverse patient demographics by combining client-side kernel models into a global RBF-based classifier.
• Explainable AI Integration (SHAP and LIME): To promote openness and clinical trust, we use SHAP summary plots and LIME local explanations to clarify feature significance and individual predictions.
3 Methodology
This section discusses the details of datasets, preprocessing techniques, and machine learning utilized in developing a predictive model for the identification of β-thalassemia carriers to ensure a rigorous training and validation process.
3.1 Dataset description
The dataset employed in this research was gained from the Punjab Thalassemia Prevention Programme PTPP, which is a reliable and comprehensive source of thalassemia records (21, 58). PTPP conducts screening of over 300,000 patients annually for identifying thalassemia carriers and also provides medical care to thalassemia major patients. This project prioritizes screening in extended family members to eliminate any risk of carrier prevalence, thereby emphasizing the strategies for preventive measures. It also recommends evaluations when someone in the family is diagnosed with thalassemia to ensure early detection.
The attributes of the dataset used in this research are shown in Figure 1. It contains the data of 5,066 patients, with 3,051 classified as non-carriers and 2,015 identified as carriers of thalassemia in 2024. The categories of the data are based on gender and age, where 53% of the patients were male and 47% were female. Moreover, it shows that 54% were adults and 46% were children.
3.2 Features of the dataset
Combination of complete blood count (CBC) parameters is incorporated as the key features in this study where CBC parameters include red blood cell count (RBC), white blood cell count (WBC), hematocrit (HCT), red cell distribution width (RDW), mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), platelet count (PLT) and hemoglobin levels (HB) and further required details are also mentioned in Table 1. It also includes demographic factors, i.e., age and gender.
3.3 Data preprocessing
The data was preprocessed to remove any inconsistencies or errors, as in this way the data has a significant impact on results. To certify clean data, missing records and an unrated. The healthiness system of measurement was eradicated. Non-essential identifiers such as the name of the patients, interaction details, and test dates were also removed. The data are classifying patient ages obsessed by two groups: children (0–18 years) and grown persons (above 18 years), characterized by binary values 0 and 1, correspondingly, and biological genders were prearranged with female as 0 and male as 1. Further, the target class is also represented as 0 and 1; β-thalassemia carriers were categorized as 1 and non-carriers as 0. This preprocessing approach minimized the noise and enhanced data readiness, and well-looked-after analytical consistency, as long as a robust foundation for model training and assessment.
3.4 Exploratory data analysis
To better understand the relationship of different features with each other, various graphical techniques are used. These visuals help in identifying the trends, clusters, and unusual data points in the dataset. A correlation heatmap is utilized in the proposed research as depicted in Figure 1. For the analysis of features, Correlation Heatmap, the histograms are shown in Figure 2, and the boxplots are shown in Figure 3. These figures illustrate the identification of correlations between variables and detecting potential outliers, and offer a more comprehensive understanding of the dataset.
Figure 2 shows the overview of the feature values ‘distribution with data points spread across different ranges. Most of the observations accumulate within a central range and some of the observations extend toward the higher values to form a right-skewed distribution. Potential outliners are indicated by a few distant bars on the extreme right. Thus, for exploratory data analysis, histograms are considered as a fundamental component as they enhance the reliability and accuracy of the prepared ML models.
4 Proposed framework
This study introduces a framework for classifying β-thalassemia carriers (Jane classification) by integrating clinical biomarkers with a multi-kernel Support Vector Machine (SVM) architecture. The methodological workflow is structured into a systematic three-phase design to ensure accuracy and efficiency in classification, as shown in Figure 3.
Exploratory analysis involves data profiling and visualization to identify trends, outliers, and correlations among features.
Preprocessing: applies robust normalization and imputation techniques to address inconsistencies such as missing values and skewed distributions.
Model development: benchmarks four SVM kernels (linear, radial basis function, polynomial, and Deep Kernel) to capture both linear and nonlinear feature interactions, with hyperparameter optimization performed using grid search and 10-fold cross-validation to enhance model accuracy and performance.
To tackle scalability and privacy challenges in clinical data sharing, this framework adopts a federated learning approach, where locally trained SVM models (linear, poly, and RBF kernels) at healthcare facilities periodically merge insights into a global polynomial kernel model. This collaborative method reduces raw data exchange while improving diagnostic generalizability across diverse populations. Model performance is systematically assessed using accuracy, precision, recall, and F1 scores, and specificity, with SHAP values providing interpretability by identifying fundamental biomarkers. The incorporation of federated learning ensures compliance with GDPR while facilitating practical implementation in resource-constrained clinical environments.
4.1 Model development using SVM with multiple kernels
Support Vector Machines (SVMs) are perfect for β-thalassemia carrier detection because they maximize the margin between classes, which optimizes the decision boundary and encourages strong generalization even when there are a lot of features compared to the sample size, as in our case. This study also employed SVMs with three distinct kernel functions: Linear, Radial Basis Function (RBF), and Polynomial for recognizing specific data patterns. The efficiency of these kernels is assessed individually at the client side, which guarantees evaluation of their classification performance. Cross-validation is utilized to validate generalization capabilities, and performance metrics such as accuracy, F1-score, and confusion matrices are analyzed. Moreover, diverse analytical techniques, i.e., Receiver Operating Characteristic (ROC) curves, Principal Component Analysis (PCA) based decision boundary visualizations, and learning curve analyses, were applied for the enhancement in the model’s interpretability and comparison using three different kernels.
The line Equation can be expressed as
is the slope and s is the intercept. is the output given the input
Let and , and then Equation 1. Can be written as
Equation 2 is obtained from the vectors in two dimensions only, and this equation is also valid for higher dimensions.
The direction the vector is defined in Equation 3 as
Where is the magnitude of vector P and is written for the double vectors detailed in Equation 4 as
For n vectors, Equation 4, may also be rewritten as:
And in terms of angle, it will be written in Equation 5 as:
By putting the values of Equation 5 into Equation 5 and will be rewritten as:
Where , so , By applying the formula , we get.
. By applying this in Equation 5 we have
After cancelation the above equation will be written as
Let
Here u is the functional margin of the dataset and is defined as
The hyperplane with the largest u will then be further chosen that depicts the margin of the dataset geometrically. Once the primary objective of finding the optimal hyperplane has been achieved then it is easy to find out the corresponding values of r and s. The Lagrange function is expressed in Equation 6.
By expanding Equation 6 we get
So the Lagrange problem of SVM with L regularization is depicted in Equation 7, as
Where l is the number of training data points and while taking the derivative w and s the Lagrange in Equation 7 we get
So
Now we have to solve this optimization problem for the dual of the original problem. In the dual problem instead of minimizing over w, b subject to the constraints involves a’s. we can minimize over a (the dual variable) also subject to the same variables relations that are obtained previously for r as in Equation 8, and s. The minimization of the loss function over a is given in Equation 9, as
The Lagrange multiplier for the S. V. M can be further extended to the Kuhn-tucker method. And will be explained as
Here is the optimal point and the term shows the positive behavior. So, we have
Equation10, depicted the closest support vectors to the hyperplane
The value of r in terms of y is detailed in Equation 11, To compute for the value of y, we have
Multiply on both sides we get and after simplification the final form is given in Equation 12.
The hypothesis function represents the support vectors for the optimal hyperplane, is written in Equation 13 as
4.1.1 Client models
In this study, we evaluated three SVM kernel functions, Linear, Polynomial, and RBF, using a uniformly preprocessed, standardized, and class-balanced dataset.
4.1.1.1 Linear
The Linear kernel, valued for its simplicity and its effectiveness in separating data with a linear decision boundary (59), yielded steady performance with aggregated metrics (precision, recall, and F1 score) averaging around 0.957, and a mean cross-validation accuracy of approximately 0.948.
Linear kernel function, Equation 14 written as
Weight update using Linear kernel is detailed in Equation 15 as
4.1.1.2 Polynomial
The Polynomial kernel, designed to capture higher-order feature interactions in data that lack clear linear separability (60), achieved slightly improved outcomes, with overall metrics reaching near 0.985 and a mean CV accuracy of about 0.988.
Describe the Poly kernel function in Equation 16 written as.
Equation 17 specifies the weight update using a Poly kernel.
4.1.1.3 Radial basis function
Notably, the RBF kernel, renowned for mapping data into an infinite-dimensional space to uncover complex nonlinear patterns (61, 62) delivered the highest performance, with aggregated scores close to 0.989 and a mean cross-validation accuracy near 0.991. Table 2 encapsulates these fundamental results, clearly demonstrating the incremental benefits of each kernel.
Equation 18 defines the RBF kernel.
Equation 19 shows the weight update procedure utilizing the RBF kernel.
4.1.1.4 Deep kernel
DK techniques create adaptable similarity functions that are tailored for particular data tasks by combining kernel learning with deep neural network architectures. In order to extract representations before calculating similarity, learnable Deep Kernels map input features through hidden layers. A Deep Kernel can enhance classification in β-thalassemia carrier detection by automatically identifying the most important features and managing feature interactions more skillfully.
Equation 20 illustration the weight update procedure applying the DK kernel.
These results in Table 3 highlight not only the importance of choosing an appropriate kernel based on dataset characteristics but also underscore the critical impact of preprocessing steps such as standardization and class balancing in enhancing model robustness and generalization.
The algorithms below show the description of training the models on the client side. Client-side SVM training algorithm.
4.2 Federated learning framework: global models
In the final stage of our Experiment, the insights from the three SVM kernels are integrated within a federated learning framework, enabling multiple client models to train locally on their respective datasets while a global model consolidates their insights. In the proposed framework, the linear and RBF kernels function as client models, each specializing in capturing distinct data patterns. The linear kernel provides a clear and interpretable decision boundary, whereas the RBF kernel effectively models nonlinear relationships. Together, these models offer complementary perspectives on the classification task. The outputs and insights derived from these client models are subsequently aggregated into a global model, where the RBF kernel is employed, with its results presented in Table 4. This federated learning approach creates a strong and adaptable classification system by harnessing the strengths of individual client models and benefiting from the global model’s ability to aggregate insights. Server-side federated algorithm.
5 Results and discussion
Computer Hardware: Intel Xeon CPU @ 2.20 GHz (virtualized), NVIDIA Tesla T4 GPU with 16 GB HBM2 memory, and 12.7 GB of system RAM. Operating System: Ubuntu 18.04 LTS (via Google Colab’s managed backend). Software Frameworks: Python 3.8, TensorFlow 2.13.0 for model implementation and training, scikit-learn 1.2 for preprocessing and SVM kernels, Pandas 1.5, and NumPy 1.24 for data manipulation.
We determined a number of evaluation metrics, including Precision, Recall, F1 Score, Accuracy, and Specificity, in order to thoroughly evaluate the performance of our model. Particularly when there is a class imbalance or variable costs of misclassification, these metrics offer a comprehensive picture of the model’s efficacy. These metrics have the following definitions and formulas:
The percentage of positive identifications that were truly accurate is known as precision (positive predictive value).
Recall (also known as sensitivity or true positive rate): Shows the percentage of real positives that the model correctly detected.
F1 Score: A balance between precision and recall, calculated as the harmonic mean of the two metrics. When there is an unequal distribution of classes, it is especially helpful.
Accuracy: By calculating the percentage of true results both true positives and true negatives—among all the cases analyzed, accuracy indicates the model’s overall correctness.
Specificity: The percentage of true negatives that were accurately identified is known as specificity (or true negative rate). In situations where false positives are especially undesirable, it is essential.
We hope to ensure the dependability and efficacy of our model in real-world applications by using these metrics to provide an open and comprehensive assessment of its performance.
At the client side, the kernels of the RBF variant achieved the highest test accuracy of 0.99%, followed closely by the polynomial kernel at 0.98 and the linear kernel at 0.95. The federated ensemble SVM framework produced exceptional classification performance across all kernels. These findings were validated by cross-validation, which revealed strong generalization under 5-fold splits with mean accuracies of 0.991 for RBF, 0.988 for polynomial, 0.948 for linear kernels, and 0.961 for the DK. Accuracy was matched by precision, recall, and F1-scores, all of which exceeded 98.4% for nonlinear kernels and showed balanced sensitivity and specificity, important for medical diagnostics. By combining these local kernels, the global federated model obtained a consolidated accuracy of 98.4%, confirming that federated averaging maintains privacy without sacrificing predictive ability. Near-perfect class separation and little overlap between carrier and non-carrier distributions were highlighted by ROC AUC values of 1.00 for polynomial and RBF kernels and 0.98 for linear. The clinical reliability of the model in identifying β-thalassemia carriers from CBC indices was confirmed by confusion matrices, which showed less than 1% misclassification errors. All things considered, these findings demonstrate how well multi-kernel SVMs work in a federated environment, providing a high-accuracy, privacy-preserving option for extensive carrier screening. Table 5 shows the detailed results of client-side kernels.
5.1 Results of client-side kernels
The in-depth comparison of the classification performance and generalization ability of the three SVM and DK, i.e., Linear, Polynomial, RBF, and DK, is depicted by the Receiver Operating Characteristic (ROC) curve analysis of all kernels such as Figure 4 that the Linear kernel has an Area Under Curve (AUC) rate of 0.98 and DK achieved the 0.98, demonstrating its efficient classification. However, the other nonlinear kernels surpass it to indicate that it struggles with complex and high-dimensional feature relationship that leads to inefficiencies in minor classifications. The Polynomial and RBF Kernels achieved an AUC of 1.00 and indicated that no misclassification errors were observed, thereby showcasing their potential in modeling nonlinear relationships. This near-perfect classification makes them ideal for datasets with intricate class separations. Therefore, the results of these area under curve values show the potential of these nonlinear kernels with superior classification reliability, but they are limited to dataset complexity and the optimal generalizability of the data, as there is a need for balanced bias.
Figure 5 illustrates the projection of different kernel classification decision boundaries onto a two-dimensional space using principal component analysis (PCA). The filled circles (●) represent the training data used to train the model, and the crosses (×) represent the test data used specifically to test the model. The Support Vector Machines (SVMs) work by utilizing various kernel functions to shape decision boundaries that can influence model adaptability, performance, and generalization. A linear kernel forms a straightforward boundary to make it computationally efficient with linearly separable data, but it still struggles with more intricate patterns. The polynomial kernel introduces curvature that allows it to handle moderate non-linearity, but it is also limited to require careful selection of its degree for the prevention of overfitting. By learning task-specific feature representations and capturing intricate interactions, deep kernels enhance tabular data classification and are useful for β-thalassemia carrier detection. However, scalability on large datasets may be limited due to the high computational resources and careful tuning required.
Moreover, the Radial Basis Function (RBF) kernel maps data in a space with higher dimensions to create highly flexible boundaries for capturing complex relationships effectively. RBF is often the most accurate for complex datasets as it requires meticulous hyperparameter tuning and greater computational resources. A solid understanding of these kernels can help in choosing the best approach for optimal classification and model generalization.
The learning curve in Figure 6 shows that the x-axis represents the size of the training data, and the y-axis represents the classification performance metric. The solid blue line labeled “Training Score” shows the model’s performance on the training dataset, and the solid orange line labeled “Cross-Validation Score” represents the model’s average performance on unknown data and serves as a measure of its generalizability. A small difference between the two curves indicates good generalizability and low overfitting, while a large difference indicates high model variance. The Linear Kernel (a) shows noticeable fluctuations in both training and validation scores and suggests moderate variance as it is sensitive to data changes. Although its performance improves with more data, the inconsistency in validation results indicates that it struggles with capturing complex patterns. The Polynomial Kernel (b) and Deep Kernel (d) exhibits a steady improvement with training and validation scores that gradually align with large datasets. This attribute shows increased model stability and better classification accuracy. The RBF (c) Kernel stands out as the most effective in achieving the highest accuracy to display a smooth learning curve. The small gap between training and validation scores proves its generalizability with less overfitting.
Comparing in Figure 7, the Mean Cross-Validation Accuracy of different SVM and Deep kernels helps in understanding how they perform across various decision boundaries. The bar chart shows that all three kernels achieve strong accuracy; the RBF kernel consistently performs the best among all the kernels. As the higher accuracy often indicates better performance, it is also important to consider its limitations, i.e., the risk of overfitting with more complex kernels. Thus, choosing the right kernel requires balancing accuracy and generalization based on data characteristics.
Table 4 shows the results of the proposed client-side models. It is clearly shown that RBF achieved the maximum accuracy, precision, recall, F1 Score, and specificity as compared to the Linear and Poly kernels. Both the accuracy value of 0.984 and the specificity value of 0.989 show the model’s performance efficiency. This federated learning setup comes with several key benefits, i.e., it uses distributed learning that allows local models to focus on specific patterns for their data, which is especially useful for diverse datasets. Moreover, the RBF kernel acts as a global average that effectively combines decision boundaries from the linear and Poly kernels. The confusion matrix confirms that strong performance leads to better classification accuracy and generalization, as depicted in Figure 8. At last, this approach enhances both data privacy and scalability that enabling models to train on local data without sharing raw information with a central server.
5.2 Results of global model
Table 6 shows the low results proposed federated learning model with two different agnostic models, i.e., Artificial Neural Network (ANN), three different activation functions Linear, Poly, and RBF, along with Kernel Logistic Regression (KLR), has also been trained with the proposed model and the results are shown.
5.3 Explainable AI
SHAP analysis was applied to the baseline RBF-kernel SVM to quantify the contribution of each input variable to the predicted class probabilities. The resulting importance profiles identify hemoglobin, RBC count, MCV, RDI, and hematocrit as the dominant determinants of classification, with higher values increasing the likelihood of the positive target class and lower values favoring the alternative class. These methods offer reliable ways to explain AI decisions at both local and global levels with no modifications in the model (63–65). This study incorporates advanced explainable AI techniques for the improvement in the interpretability of the SVM model with an RBF kernel.
In Figure 9, the SHAP summary plot, features on the y-axis are ordered by their mean absolute SHAP value, while the x-axis encodes the SHAP value, with positive (right) and negative (left) contributions, respectively, increasing or decreasing the predicted probability of the target class; points are color-mapped from blue (low feature values) to red (high feature values), jointly conveying the magnitude and direction of each feature’s effect on the model output.
Figure 10, LIME was applied to the RBF-SVM to obtain instance-specific local surrogate models, revealing that red blood cell indices (MCV, RDI, Hb, RBC) consistently exert the strongest positive contributions to individual classifications, whereas PLT and WBC frequently attenuate the predicted probability, acting as negative or moderating factors. In the illustrated example, the explanation bar plot shows a high predicted probability for class 0 (0.94 vs. 0.06), with left-extending bars decreasing, and right-extending bars increasing, the likelihood of class 1 in proportion to their length. The feature importance ranking confirms the model prioritizes relevant hematological parameters, validating its alignment with medical expertise (44).
By rigorously optimizing our global model through techniques like RandomizedSearchCV and validating it on a comprehensive dataset of 5,066 patients, we achieved an outstanding accuracy of 98.4% (see Table 7). Additional performance metrics further highlight the model’s reliability. ROC curves with AUC values consistently exceeding 0.98, with detailed confusion matrices and error histograms, indicate a well-balanced sensitivity and specificity. Visual tools, i.e., PCA plots, depict distinct decision boundaries, and learning curves confirm stable convergence with minimal overfitting risk. Furthermore, combining outputs from multiple kernels through majority voting significantly improved the model’s ability to generalize across diverse datasets.
In Table 7. The performance of models used in (15, 20) in terms of accuracy is higher than our proposed model, but they did not undergo federated learning, and also did not apply interpretable artificial intelligence. In AlAgha et al. (15), the dataset used is obtained from the Gaza Strip, Palestine. Whereas the dataset used in Das et al. (20) is the antenatal dataset that is gathered from pregnant ladies. In a decentralized federated learning approach, there might be some loss due to the transfer of only the model parameters from the client side to the server node, and there will be no biased approach in federated learning, but in a traditional machine learning model, compromises a biasness. So, the novelty of the proposed work is the use of the SVM algorithm in federated learning for the prediction of β-Thalassemia Carrier, and different kernels of SVM have been used in the client side for model training.
6 Conclusion
In this work, we showed that a federated SVM and DK can accurately screen for β-thalassemia carriers using complete blood count (CBC) data. By combining linear, polynomial, and RBF and DK, our model captured complex patterns in red-cell indices (MCV, MCH, RBC count, RDW, etc.) and achieved a classification accuracy of 98.4%, on par with the best prior SVM-based formulas (which reported 98.4% Accuracy). Importantly, training was done in a federated manner, meaning that only model parameters were exchanged, protecting patient privacy. We also used explainable Using LIME and SHAP, feature importance analyses demonstrate that the model relies heavily on blood indices that are already established as differentiating carriers from non-carriers.
Further research should focus on increasing the model’s sensitivity with additional biomarkers, validating performance across diverse areas in the biomedical industry, exploring other federated learning types for robustness, and conducting clinical deployment studies to measure usability and trust. Federated learning technique, based on Visual data, will be developed to help medical practitioners review particular cases with the same latest prescription with the help of a global model generated by the FL.
While promising, the study has limitations that the dataset used in this study is region-specific and primarily represents the population of Pakistan; this may limit the model’s ability to be applied to larger, more diverse populations. The safe transfer of encrypted trained models from client devices to the central server presents difficulties when federated learning is put into practice. To increase the suggested model’s generalizability and robustness across a range of demographics, we plan to expand the dataset in subsequent research by adding multi-regional and multi-ethnic samples. To further ensure more secure and effective model updates between clients and the central server, we also intend to integrate sophisticated privacy-preserving techniques like differential privacy, secure aggregation, and homomorphic encryption into the federated learning framework.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Ethics statement
The studies involving humans were approved by the Green International University. The studies were conducted in accordance with the local legislation and institutional requirements. The human samples used in this study were acquired from previously published articles with the following DOI’s “10.1038/s41598-022-22011-8” and “10.1109/ACCESS.2021.3066782.” Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements.
Author contributions
HY: Writing – review & editing, Writing – original draft, Project administration, Formal analysis, Supervision, Investigation, Methodology, Conceptualization, Software, Data curation, Validation, Visualization. BS: Supervision, Validation, Methodology, Investigation, Writing – review & editing, Visualization, Writing – original draft. AK: Visualization, Methodology, Investigation, Conceptualization, Software, Supervision, Writing – review & editing, Validation, Project administration. AB: Supervision, Project administration, Writing – review & editing, Investigation, Visualization, Resources, Software, Formal analysis. AA: Project administration, Writing – review & editing, Data curation, Investigation, Visualization, Resources. RS: Supervision, Writing – review & editing, Conceptualization, Funding acquisition, Formal analysis, Visualization, Methodology, Resources, Project administration, Validation, Investigation. SM: Writing – review & editing, Project administration, Funding acquisition, Resources.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that Generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Shah, S. Assessment of serum calcium and phosphorus levels among transfusion-dependent beta-thalassemia major patients on chelation therapy. J Postgrad Med Inst. (2015) 29:1–4.
2. Taher, AT, Weatherall, DJ, and Cappellini, MD. Thalassaemia. Lancet. (2018) 391:155–67. doi: 10.1016/S0140-6736(17)31822-6,
4. Rao, E, Chandraker, SK, Singh, MM, and Kumar, R. Global distribution of β-thalassemia mutations: an update. Gene. (2024) 896:148022. doi: 10.1016/j.gene.2023.148022,
5. Forget, BG, and Bunn, HF. Classification of the disorders of hemoglobin. Cold Spring Harb Perspect Med. (2013) 3:a011684. doi: 10.1101/cshperspect.a011684,
6. Nabi, AT, Muttu, J, Chhaparwal, A, Mukhopadhyay, A, Pattnaik, SJ, and Choudhary, P. Implications of β-thalassemia on oral health status in patients: a cross-sectional study. J Family Med Prim Care. (2022) 11:1174–8. doi: 10.4103/jfmpc.jfmpc_1215_21,
7. deDreuzy, E, Bhukhai, K, Leboulch, P, and Payen, E. Current and future alternative therapies for beta-thalassemia major. Biom J. (2016) 39:24–38. doi: 10.1016/j.bj.2015.10.001,
8. Waheed, U, Saba, N, Wazeer, A, and Ahmed, S. A systematic review and meta-analysis on the epidemiology of hepatitis B and hepatitis C virus among beta-thalassemia major patients in Pakistan. J Lab Physicians. (2021) 13:270–6. doi: 10.1055/s-0041-1731110,
9. Baird, DC, Batten, SH, and Sparks, SK. Alpha-and beta-thalassemia: rapid evidence review. Am Fam Physician. (2022) 105:272–80.
10. Chokchaipermpoonphol, P, Lamtanthong, S, and Nokkaew, S. Digital imaging of peripheral blood smear with MC-80 as a screening tool for thalassemia. J Clin Lab Anal. (2025) 39:e25135. doi: 10.1002/jcla.25135,
11. Muneer, S, and Rasool, MA. AA systematic review: explainable artificial intelligence (XAI) based disease prediction. Int J Adv Sci Comput. (2022) 1:1–6.
12. Mohammed, MQ, and Al-Tuwaijari, JM. A survey on diverse machine learning approaches for thalassemia detection and classification. Turkish J Comput Math Educ. (2021) 12:7866–71.
13. Amin, J, Sharif, M, Raza, M, Saba, T, and Anjum, MA. Brain tumor detection using statistical and machine learning method. Comput Methods Prog Biomed. (2019) 177:69–79. doi: 10.1016/j.cmpb.2019.05.015,
14. Palaniappan, R, Sundaraj, K, and Ahamed, NU. Machine learning in lung sound analysis: a systematic review. Biocybern Biomed Eng. (2013) 33:129–35. doi: 10.1016/j.bbe.2013.07.001
15. AlAgha, AS, Faris, H, Hammo, BH, and Ala’M, AZ. Identifying β-thalassemia carriers using a data mining approach: the case of the Gaza strip, Palestine. Artif Intell Med. (2018) 88:70–83. doi: 10.1016/j.artmed.2018.04.009
16. Patgiri, C, and Ganguly, A. Adaptive thresholding technique based classification of red blood cell and sickle cell using naïve bayes classifier and K-nearest neighbor classifier. Biomed Signal Process Control. (2021) 68:102745. doi: 10.1016/j.bspc.2021.102745
17. Paokanta, P, Ceccarelli, M, and Srichairatanakool, S. The efficiency of data types for classification performance of machine learning techniques for screening β-thalassemia In: 2010 3rd International Symposium on Applied Sciences in Biomedical and Communication Technologies (ISABEL 2010). (Rome, Italy: IEEE) (2010) 1–4.
18. Mo, D, Zheng, Q, Xiao, B, and Li, L. Predicting thalassemia using deep neural network based on red blood cell indices. Clin Chim Acta. (2023) 543:117329. doi: 10.1016/j.cca.2023.117329
19. Ismaeel, AG. Diagnose mutations causes Β-thalassemia: biomining method using an optimal neural learning algorithm. Int J Eng Technol. (2019) 8:1–8. doi: 10.14419/ijet.v8i1.11.28082
20. Das, R, Saleh, S, Nielsen, I, Kaviraj, A, Sharma, P, Dey, K, et al. Performance analysis of machine learning algorithms and screening formulae for β–thalassemia trait screening of Indian antenatal women. Int J Med Inform. (2022) 167:104866. doi: 10.1016/j.ijmedinf.2022.104866,
21. Sadiq, S, Khalid, MU, Ullah, S, Aslam, W, Mehmood, A, Choi, GS, et al. Classification of β-thalassemia carriers from red blood cell indices using ensemble classifier. IEEE Access. (2021) 9:45528–38. doi: 10.1109/access.2021.3066782
22. U.S. Department of Health and H. S. (HHS) Updates to HIPAA regulations for enhanced data security. U.S.: Health and Human Services (HHS) (2023).
23. Uçucu, S, and Azik, F. Artificial intelligence-driven diagnosis of β-thalassemia minor & iron deficiency anemia using machine learning models. J Med Biochem. (2024) 43:11–8. doi: 10.5937/jomb0-38779,
24. Cohen, AR, Galanello, R, Pennell, DJ, Cunningham, MJ, and Vichinsky, E. Thalassemia. Hematology. (2004) 2004:14–34. doi: 10.1182/asheducation-2004.1.14,
25. Tao, X, Wang, J, Yan, Y, Cheng, P, Liu, B, du, H, et al. Optimal Sca-1-based procedure for purifying mouse adipose-derived mesenchymal stem cells with enhanced proliferative and differentiation potential. Front Cell Dev Biol. (2025) 13:1566670. doi: 10.3389/fcell.2025.1566670,
26. Li, S, Ling, S, Wang, D, Wang, X, Hao, F, Yin, L, et al. Modified lentiviral globin gene therapy for pediatric $β_0/β_0$ transfusion-dependent $β$-thalassemia: a single-center, single-arm pilot trial. Cell Stem Cell. (2024) 31:961–973.e8. doi: 10.1016/j.stem.2024.04.021,
27. Tang, L, Wang, Y, Gong, X, Xiang, J, Zhang, Y, Xiang, Q, et al. Integrated transcriptome and metabolome analysis to investigate the mechanism of intranasal insulin treatment in a rat model of vascular dementia. Front Pharmacol. (2023) 14:1182803. doi: 10.3389/fphar.2023.1182803,
28. Yang, H, He, C, Bi, Y, Zhu, X, Deng, D, Ran, T, et al. Synergistic effect of VEGF and SDF-1$α$ in endothelial progenitor cells and vascular smooth muscle cells. Front Pharmacol. (2022) 13:914347. doi: 10.3389/fphar.2022.914347,
29. Sun, D, Hu, Y, Li, X, Peng, J, Dai, Z, and Wang, S. Unlocking the full potential of memory T cells in adoptive T cell therapy for hematologic malignancies. Int Immunopharmacol. (2025) 144:113392. doi: 10.1016/j.intimp.2024.113392,
30. Li, B, Cai, L, Liao, B, Fu, X, Bing, P, and Yang, J. Prediction of protein subcellular localization based on fusion of multi-view features. Molecules. (2019) 24:919. doi: 10.3390/molecules24050919,
31. Xiang, J, Meng, X, Zhao, Y, Wu, F, and Li, M. HyMM: hybrid method for disease-gene prediction by integrating multiscale module structure. Brief Bioinform. (2022) 23:bbac72. doi: 10.1093/bib/bbac072,
32. Xu, Z, Li, EH, Liu, J, Zhang, YJ, Xiao, R, Chen, XZ, et al. Postpartum hemorrhage emerges as a key outcome of maternal SARS-CoV-2 omicron variant infection surge across pregnancy trimesters. J Infect Public Health. (2025) 18:102733. doi: 10.1016/j.jiph.2025.102733,
33. Zhang, HQ, Cao, BZ, Cao, QT, Hun, M, Cao, L, and Zhao, MY. An analysis of reported cases of hemophagocytic lymphohistiocytosis (HLH) after COVID-19 vaccination. Hum Vaccin Immunother. (2023) 19:2263229. doi: 10.1080/21645515.2023.2263229,
34. Yilin, Z, Haiquan, F, Chen, H, and Juan, S. Hemodynamics of asymmetrically stenotic vertebral arteries based on fluid–solid coupling. J Biol Phys. (2025) 51:10. doi: 10.1007/s10867-025-09673-x,
35. Bilal, A, Alarfaj, FK, Khan, RA, Suleman, MT, and Long, H. m5c-iEnsem: 5-methylcytosine sites identification through ensemble models. Bioinformatics. (2025) 41:btae722. doi: 10.1093/bioinformatics/btae722
36. Feng, X, Xiu, YH, Long, HX, Wang, ZT, Bilal, A, and Yang, LM. Advancing single-cell RNA-seq data analysis through the fusion of multi-layer perceptron and graph neural network. Brief Bioinform. (2023) 25:1–13. doi: 10.1093/bib/bbad481,
37. Bilal, A, Suleman, MT, Almohammadi, K, Alzahrani, A, and Liu, X. 2OM-Pred: prediction of 2-O-methylation sites in ribonucleic acid using diverse classifiers. Brief Bioinform. (2025) 26:bbaf282. doi: 10.1093/bib/bbaf282,
38. Yu, X, Ren, J, Long, H, Zeng, R, Zhang, G, Bilal, A, et al. iDNA-OpenPrompt: openprompt learning model for identifying DNA methylation. Front Genet. (2024) 15:1377285. doi: 10.3389/fgene.2024.1377285,
39. Tran, DC, Dang, AL, Hoang, TNL, Nguyen, CT, Dinh, TNM, Tran, VA, et al. Prevalence of thalassemia in the Vietnamese population and building a clinical decision support system for prenatal screening for thalassemia. Mediterr J Hematol Infect Dis. (2023) 15:e2023026. doi: 10.4084/mjhid.2023.026,
40. Das, R, Datta, S, Kaviraj, A, Sanyal, SN, Nielsen, P, Nielsen, I, et al. A decision support scheme for beta thalassemia and HbE carrier screening. J Adv Res. (2020) 24:32368356:183–90. doi: 10.1016/j.jare.2020.04.005
41. Ferih, K, Elsayed, B, Elshoeibi, AM, Elsabagh, AA, Elhadary, M, Soliman, A, et al. Applications of artificial intelligence in thalassemia: a comprehensive review. Diagnostics. (2023) 13:1551. doi: 10.3390/diagnostics13091551,
42. Zhang, F, Zhan, J, Wang, Y, Cheng, J, Wang, M, Chen, P, et al. Enhancing thalassemia gene carrier identification in non-anemic populations using artificial intelligence erythrocyte morphology analysis and machine learning. Eur J Haematol. (2024) 112:692–700. doi: 10.1111/ejh.14160,
43. Jahan, A, Singh, G, Gupta, R, Sarin, N, and Singh, S. Role of red cell indices in screening for beta thalassemia trait: an assessment of the individual indices and application of machine learning algorithm. Indian J Hematol Blood Transfus. (2021) 37:453–7. doi: 10.1007/s12288-020-01373-x,
44. Bharath, M, Gowtham, S, Kodipalli, A, and Rao, T. Enhancing alpha thalassemia screening: a comparative study of multiple machine learning classifiers and interpretation using explainable AI In:. 2023 4th International Conference on Intelligent Technologies (CONIT). (Bangalore, India: IEEE) (2024). 1–7.
45. Jarujareet, U, Wiratchawa, K, Petiwathayakorn, T, Koonyosying, P, Hantrakool, S, Srichairatanakool, S, et al. Classification of beta-thalassemia major and HbE/beta-thalassemia via deep learning of image structure function image. Biomed Signal Process Control. (2025) 102:107265. doi: 10.1016/j.bspc.2024.107265
46. Moore, W, and Frye, S. Review of HIPAA, part 1: history, protected health information, and privacy and security rules. J Nucl Med Technol. (2019) 47:269–72. doi: 10.2967/jnmt.119.227819,
47. Tovino, SA. Artificial intelligence and the HIPAA privacy rule: a primer. Houst J Health Law Policy. (2025) 24:77–126.
48. Bakare, SS, Adeniyi, AO, Akpuokwe, CU, and Eneh, NE. Data privacy laws and compliance: a comparative review of the EU GDPR and USA regulations. Comput Sci IT Res J. (2024) 5:528–43. doi: 10.51594/csitrj.v5i3.859
49. Pesapane, F, Volonté, C, Codari, M, and Sardanelli, F. Artificial intelligence as a medical device in radiology: ethical and regulatory issues in Europe and the United States. Insights Imaging. (2018) 9:745–53. doi: 10.1007/s13244-018-0645-y,
50. Mennella, C, Maniscalco, U, De Pietro, G, and Esposito, M. Ethical and regulatory challenges of AI technologies in healthcare: a narrative review. Heliyon. (2024) 10:1–20. doi: 10.1016/j.heliyon.2024.e26297,
51. Panigrahi, A, Pati, A, Sahu, B, Das, MN, Nayak, DSK, Sahoo, G, et al. En-MinWhale: an ensemble approach based on MRMR and whale optimization for Cancer diagnosis. IEEE Access. (2023) 11:113526–42. doi: 10.1109/ACCESS.2023.3318261
52. Sahu, B, Panigrahi, A, Pati, A, Das, MN, Jain, P, Sahoo, G, et al. Novel hybrid feature selection using binary Portia spider optimization algorithm and fast mRMR. Bioengineering. (2025) 12:291. doi: 10.3390/bioengineering12030291,
53. Sahoo, G, Nayak, AK, Tripathy, PK, Panigrahi, A, Pati, A, Sahu, B, et al. Predicting breast cancer relapse from histopathological images with ensemble machine learning models. Curr Oncol. (2024) 31:6577–97. doi: 10.3390/curroncol31110486,
54. Panda, P, Bisoy, SK, Panigrahi, A, Pati, A, Sahu, B, Guo, Z, et al. BIMSSA: enhancing cancer prediction with salp swarm optimization and ensemble machine learning approaches. Front Genet. (2025) 15:1491602. doi: 10.3389/fgene.2024.1491602,
55. Li, Q, Wen, Z, Wu, Z, Hu, S, Wang, N, Li, Y, et al. A survey on federated learning systems: vision, hype and reality for data privacy and protection. IEEE Trans Knowl Data Eng. (2021) 35:3347–66. doi: 10.1109/TKDE.2021.3124599
56. Zhang, C, Xie, Y, Bai, H, Yu, B, Li, W, and Gao, Y. A survey on federated learning. Knowl Based Syst. (2021) 216:106775. doi: 10.1016/j.knosys.2021.106775
57. Shen, S, Zhu, T, Wu, D, Wang, W, and Zhou, W. From distributed machine learning to federated learning: in the view of data privacy and security. Concurr Comput. (2022) 34:e6002. doi: 10.1002/cpe.6002
58. Rustam, F, Ashraf, I, Jabbar, S, Tutusaus, K, Mazas, C, Barrera, AEP, et al. Prediction of β-thalassemia carriers using complete blood count features. Sci Rep. (2022) 12:19999. doi: 10.1038/s41598-022-22011-8
59. Daemen, A., and De Moor, B., “Development of a kernel function for clinical data,” in 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Minneapolis, MN, USA: IEEE, 5913–5917.
60. Samosir, RS, Gaol, FL, Abbas, BS, Sabarguna, BS, and Heryadi, Y. Comparation between linear and polynomial kernel function for ovarium cancer classification. J Phys Conf Series. (2019) 1253:12038. doi: 10.1088/1742-6596/1235/1/012038
61. Rajkumar, N, and Jaganathan, P. A new RBF kernel based learning method applied to multiclass dermatology diseases classification In: 2013 IEEE conference on Information & Communication Technologies. (Thuckalay, India: IEEE) (2013). 551–6.
62. Nawaz, MS, Shoaib, B, and Ashraf, MA. Intelligent cardiovascular disease prediction empowered with gradient descent optimization. Heliyon. (2021) 7:1–8. doi: 10.1016/j.heliyon.2021.e06948,
63. Minh, D, Wang, HX, Li, YF, and Nguyen, TN. Explainable artificial intelligence: a comprehensive review. Artif Intell Rev. (2022) 55:1–66. doi: 10.1007/s10462-021-10088-y
64. Angelov, PP, Soares, EA, Jiang, R, Arnold, NI, and Atkinson, PM. Explainable artificial intelligence: an analytical review. WIREs Data Min Knowl Discov. (2021) 11:e1424. doi: 10.1002/widm.1424
Keywords: explainable AI, federated learning, LIME, SVM, β-thalassemia carrier
Citation: Younas HA, Shoaib Khan B, Khan AH, Bilal A, Algarni A, Sarwar R and Mousavirad SJ (2026) Prediction of β-thalassemia carrier using federated learning and explainable AI. Front. Med. 13:1687773. doi: 10.3389/fmed.2026.1687773
Edited by:
Haigang Gong, University of Electronic Science and Technology of China, ChinaReviewed by:
Prashanth N. Suravajhala, Manipal University Jaipur, IndiaMuhammad Iqbal Hossain, BRAC University, Bangladesh
Copyright © 2026 Younas, Shoaib Khan, Khan, Bilal, Algarni, Sarwar and Mousavirad. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Seyed Jalaleddin Mousavirads, c2V5ZWRqYWxhbGVkZGluLm1vdXNhdmlyYWRAbWl1bi5zZQ==
Seyed Jalaleddin Mousavirad6*