Prediction of Alzheimer's disease stages based on ResNet-Self-attention architecture with Bayesian optimization and best features selection

Alzheimer's disease (AD) is a neurodegenerative illness that impairs cognition, function, and behavior by causing irreversible damage to multiple brain areas, including the hippocampus. The suffering of the patients and their family members will be lessened with an early diagnosis of AD. The automatic diagnosis technique is widely required due to the shortage of medical experts and eases the burden of medical staff. The automatic artificial intelligence (AI)-based computerized method can help experts achieve better diagnosis accuracy and precision rates. This study proposes a new automated framework for AD stage prediction based on the ResNet-Self architecture and Fuzzy Entropy-controlled Path-Finding Algorithm (FEcPFA). A data augmentation technique has been utilized to resolve the dataset imbalance issue. In the next step, we proposed a new deep-learning model based on the self-attention module. A ResNet-50 architecture is modified and connected with a self-attention block for important information extraction. The hyperparameters were optimized using Bayesian optimization (BO) and then utilized to train the model, which was subsequently employed for feature extraction. The self-attention extracted features were optimized using the proposed FEcPFA. The best features were selected using FEcPFA and passed to the machine learning classifiers for the final classification. The experimental process utilized a publicly available MRI dataset and achieved an improved accuracy of 99.9%. The results were compared with state-of-the-art (SOTA) techniques, demonstrating the improvement of the proposed framework in terms of accuracy and time efficiency.


Introduction
Dementia is the seventh leading cause of mortality and a main cause of impairment and vulnerability in elderly individuals (Koul et al., 2023). It is a rapidly spreading disorder among the elderly population and has become increasingly common over the last decade (Sisodia et al., 2023). Dementia greatly impairs intellectual performance, interfering with daily tasks and interpersonal interactions (Nagdee, 2011). Alzheimer's disease (AD) is a subclass of dementia that can cause memory loss (Mahmud et al., 2024). An individual affected by AD may struggle to recognize family members and have difficulty remembering daily activities. Moreover, it can ultimately lead to the death of the patient (Mohammad and Al Ahmadi, 2023). Because the condition worsens over time, it is also referred to as a progressive neurodegenerative disease. It affects behavioral functions, thinking abilities, decision-making, and language skills, often leading to memory loss in older people (Kellar and Craft, 2020).
Brain cell alterations may occur a decade or more before clinical signs appear. In the beginning, patients with AD experience unnoticed changes in their brains (Jansi et al., 2023). Throughout the early stage of AD, the brain undergoes destructive transformations, including ectopic protein deposition that produces amyloid plaques and tau tangles. Neurons that were once fully functional cease to function properly, lose their connections to other neurons, and eventually die (Hoozemans et al., 2006). Several additional intricate alterations in the brain can also lead to Alzheimer's (Kasula, 2023). The hippocampus and entorhinal cortex, which are critical for cognitive control, seem to be the first regions to be impaired (Shrager et al., 2008). Furthermore, the signs of AD begin to manifest when nerve cells (neurons) in certain areas of the brain gradually shrink and are eventually damaged or destroyed (Khalid et al., 2023). In the final phase of AD, damage becomes widespread, and a large amount of brain tissue is destroyed (Bloniecki Kallio, 2002; Carle, 2022).
AD is such a serious brain disease that it can result in a patient's death if not effectively managed (Gómez-Isla and Frosch, 2022). Because there is currently no specific medication for AD, patients need good care, regular exercise, and memory-sharpening activities (Shamrat et al., 2023). In recent years, a significant increase has been observed in AD (Mirzaei and Adeli, 2022; Stevenson-Hoare et al., 2023). The number of deaths from Alzheimer's disease in 2020 increased by 15,925 compared to the 5 years before 2023, and 44,729 more deaths were recorded for all dementias, including Alzheimer's disease (Chua, 2023). Traditional machine learning (ML) techniques such as pre-processing (Wen et al., 2020), feature extraction (Rathore et al., 2017), feature selection (Balaji et al., 2023), feature fusion (Jia and Lao, 2022), and classification (Tanveer et al., 2020) have been employed by researchers as a multi-step pipeline in the past few years. Classification is the final step, in which each object is assigned a label using either a supervised or an unsupervised ML technique (Bondi et al., 2017). Deep learning (DL) (Shaukat et al., 2022) is a subtype of machine learning that falls under the umbrella of artificial intelligence, but DL is considerably more robust and flexible than traditional ML (Fabrizio et al., 2021). Techniques such as shallow CNN (Marwa et al., 2023), DNN (Hazarika et al., 2023), MultiAz-Net (Ismail et al., 2023), a hybridized DL method (Hashmi, 2024), and RVFL (Goel et al., 2023) have been used in recent years, but these techniques yield lower accuracy than our proposed model (Shamrat et al., 2023).

. Major challenges and gaps
Recent advances in ML and DL have opened up new avenues for assessing AD, but researchers are still grappling with the diagnosis of the disease (Shamrat et al., 2023). Some of the remaining challenges relate to insufficient and imbalanced datasets. Further difficulties arise from the complexity, diversity, and complicated neurobiology underlying AD (Dhakhinamoorthy et al., 2023). Structural variation across scans is another main challenge in diagnosing and detecting AD. However, the influence of these challenges may vary from patient to patient. This research focuses on classifying AD stages using deep learning and feature optimization techniques.

. Major contributions
The main contributions of this study are as follows:
• A fine-tuned ResNet-50 architecture has been modified by adding a self-attention layer and trained from scratch for feature extraction.
• The hyperparameters of the model are initialized using Bayesian optimization.
• The extracted self-attention features are refined using an improved pathfinder optimization named the fuzzy entropy-controlled path-finding algorithm (FEcPFA). The optimization algorithm selects the best features and improves efficiency.
• The selected features are finally passed to machine learning classifiers to classify the stages of AD.
This article is organized as follows: Section 2 reviews ML and DL techniques that have been applied to Alzheimer's disease, and Section 3 provides a comprehensive description of the datasets. The testing outcomes are shown in detail in Section 4. Section 5 summarizes our findings, and Section 6 discusses future work.

Related work
Due to the brain's intricacy, classifying AD is difficult (Dhakhinamoorthy et al., 2023). Thus, researchers are improving medical image processing to identify AD correctly. This section presents relevant literature on AD detection and diagnosis, focusing primarily on deep learning-based classification techniques for MRI tissue structure analysis (Mohi et al., 2023). The deep belief network (DBN) was utilized by AI-Atroshi et al. (2022) to extract feature vectors from detected speech samples, achieving an output accuracy of 90.2%. Shankar et al. (2022) Albright (2019) presented the all-pairs pre-processing algorithm to train the model. For this experiment, data were taken from ADNI and divided into three datasets, i.e., LB1, LB2, and LB3, yielding an mAUC of 0.866. Soliman et al. (2022) used 3D-CNN networks to predict AD. The model learned basic features that capture AD indicators in order to distinguish brains with Alzheimer's disease from healthy brains using MRI scans. ADNI provided 3,013 images, on which the model reached 96.5% training accuracy and 80.6% test accuracy. Samhan et al. (2022) adopted a CNN based on VGG16 with the Adam optimizer and softmax activation. The Kaggle dataset of 10,432 images yielded 100% training accuracy, 0.0012 training loss, 97% validation accuracy, and 0.0832 validation loss. Jo et al. (2022) proposed a deep learning-based genome-wide approach called SWAT-CNN that identified SNPs associated with AD and produced a classification model for AD. It may be useful for a variety of biomedical applications and was tested on the GWAS dataset from the AD Neuroimaging Initiative (ADNI).
Zhang et al. (2022) adopted CNN models of various designs and capacities and assessed them thoroughly. The most appropriate model was then applied for AD diagnosis. To increase the transparency of the model, an explanation heatmap was produced for the AD vs. cognitive normal (CN) and pMCI vs. sMCI classification tasks using two publicly available datasets. Interestingly, the study found that a moderately sized model could outperform the one with the largest capacity. Ghazal et al. (2022) proposed a system named ADDLTA, in which transfer learning (TL) was used in conjunction with brain magnetic resonance imaging (MRI) to classify images into four categories: mildly demented (MD), moderately demented (MOD), non-demented (ND), and very mildly demented (VMD). The system gave 91.70% accuracy in simulation results on a publicly accessible dataset from the Kaggle repository. Shanmugam et al. (2022) focused on detecting different phases of cognitive impairment and AD in the early stages by utilizing TL in neuroimaging. Three pre-trained models, GoogLeNet, AlexNet, and ResNet-18, were adopted for classification, giving accuracies of 96.39, 94.08, and 97.51%, respectively, on the ADNI dataset. Prasath and Sumathi (2024) suggested a compact architecture that merges two models, LeNet and AlexNet, and outperforms DenseNet. Three parallel small filters (1 × 1, 3 × 3, and 5 × 5) replaced the convolution layers to recover key features, achieving 93.58% accuracy on a dataset taken from ADNI.
Sorour et al. (2024) proposed a system for the automated diagnosis of Alzheimer's disease that integrates multiple customized deep-learning models to provide an objective evaluation. The first approach addresses AD using SVM and KNN. The second approach combines rs-fMRI datasets from the ADNI repository with modified AlexNet and Inception blocks. This architecture gave 96.61% accuracy. A new optimized ensemble-based DNN learning model called MultiAz-Net was used by Ismail et al. (2023) with diverse PET and MRI data to identify AD. The Multi-Objective Grasshopper Optimization Algorithm (MOGOA) optimizes the MultiAz-Net layers, which produced 92.3% accuracy on the ADNI dataset. Balaji et al. (2023) suggested a DL approach to detect AD in its initial stages using multimodal imaging and the LSTM algorithm, combining MRI, PET, and traditional neuropsychological examination results. The suggested technique adjusted the learning weights to improve accuracy and employed the Adam optimizer. The proposed architecture achieved 98.5% accuracy on 512 MRI and 112 PET scans.

Materials and methods
This section describes the experimental dataset and the methods employed in this study.

. Dataset
A well-characterized repository plays a significant role in the performance evaluation of a diagnosis system. In this experiment, the dataset was obtained from Kaggle. This dataset, known as Alzheimer's disease, consists of anonymized MRI scans of affected individuals together with their class labels. This multiclass dataset contains four distinct classes and offers many different views, comprising over 5,000 MRI images. The four classes are shown in Figure 1: (1) mildly demented, (2) moderately demented, (3) non-demented, and (4) very mildly demented. A brief explanation of the four classes of AD is given in Table 1 for testing and training purposes. The data were imbalanced, with each class containing a different number of images.
This dataset is among the most prominent publicly available datasets in this domain. The major aim of this study is to achieve high accuracy. Both the original MRI scans and the augmented images were utilized in the training and testing of the experiment. The mildly demented class contained 896 images, the moderately demented class 64 images, the non-demented class 3,200 images, and the very mildly demented class 2,240 images. After augmentation, we took 2,000 images from each class for further processing. Figure 2 illustrates the AD stages with a brief description. Moreover, Table 2 lists the classes and the number of augmented images utilized in this study.
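The class counts reported above imply a strong imbalance. As a rough illustration only (this is not the authors' code), the following Python sketch computes how many additional augmented images each class would need to reach the 2,000-image target. The dictionary keys and the function name are hypothetical, and the assumption that classes already above the target are simply subsampled is ours.

```python
# Original per-class image counts as reported in the text.
original_counts = {
    "mild_demented": 896,
    "moderate_demented": 64,
    "non_demented": 3200,
    "very_mild_demented": 2240,
}

TARGET_PER_CLASS = 2000  # images taken from each class after augmentation


def augmentation_plan(counts, target):
    """Return, per class, how many extra images are needed to reach `target`.

    Classes that already exceed the target need no new images
    (assumption: they are subsampled instead).
    """
    return {name: max(0, target - n) for name, n in counts.items()}


plan = augmentation_plan(original_counts, TARGET_PER_CLASS)
total_original = sum(original_counts.values())

for name, extra in plan.items():
    print(f"{name}: {original_counts[name]} original, {extra} augmented needed")
print("total original images:", total_original)
```

Under these assumptions, the smallest class needs far more synthetic images than originals, which is why the choice of augmentation operations matters most for that class.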

. Proposed methodology
Our proposed study presents a deep learning-based methodology for classifying AD grades. First, the dataset was taken from Kaggle, a public repository. Because the data were imbalanced across classes, different augmentation techniques were applied. The images were then improved using different enhancement methods. After the enhancement, we fine-tuned the ResNet-50


. Data augmentation
Augmentation creates modified variants of the images in an existing dataset to artificially increase its variety. To a computer, an image is simply a 2D array of numbers. These numbers represent intensity values, which can be modified to produce new images. The primary goal of augmentation in this study is to balance the number of images across classes, which improves the outcomes and makes them more precise. Augmentation is most useful for small datasets. Images may be flipped horizontally, flipped vertically, or rotated. These techniques expand the dataset by producing flipped and rotated copies of the images.

. . Horizontal flip
In a horizontal flip, the image is mirrored from left to right. If the image on the right is flipped, the outcome will be on the left. The mathematical representation of the horizontal flip is given by Equation 1.
The formula describes the horizontal flip of an image scan. HF denotes the flipping function, while H O represents the original image. The original image occupies the first quadrant (x, y), while the flipped image occupies the second quadrant (−x, y). In other words, the unedited image lies on the right side, and after horizontal flipping it is mirrored to the left side.

. . . Vertical flip
In a vertical flip, the image is mirrored from top to bottom. An image that is displayed facing upward will, after flipping, be displayed facing downward. The mathematical representation of the vertical flip is given by Equation 2. The formula describes the vertical flip of an image scan. HV denotes the flipping function, while H O represents the original image.
The original image occupies the first quadrant (x, y), while the flipped image occupies the fourth quadrant (x, −y). Therefore, Equation (2) shows that the initial image lies in the first quadrant, and when the vertical flip is applied, the image moves to the fourth quadrant, which is in the downward direction. In short, the image is flipped about the x-axis.

. . . Rotate flip
In this operation, the 2D image is rotated about its center. The mathematical representation is given by Equation 3. The image (I) is rotated counterclockwise around its center by the specified angle in degrees. To rotate the image clockwise, a negative angle value is supplied. By default, imrotate enlarges the resulting image (J) so that it encompasses the entire rotated image. The methods for data augmentation used in the experiment are shown in Figure 4.
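The three augmentation operations can be sketched on a tiny image stored as a list of pixel rows. This is a minimal illustration, not the MATLAB pipeline used in the study: the function names are hypothetical, and rotation is restricted to 90-degree steps because arbitrary angles (as supported by imrotate) require interpolation.

```python
def horizontal_flip(image):
    """Mirror left-right: reverse the pixel order within every row."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror top-bottom: reverse the order of the rows."""
    return [row[:] for row in image[::-1]]


def rotate90_ccw(image):
    """Rotate by 90 degrees counterclockwise.

    Transposing turns columns into rows; reversing the row order
    then moves the original right-hand column to the top.
    """
    return [list(col) for col in zip(*image)][::-1]


def augment(image):
    """Return the original image together with its three variants."""
    return [image, horizontal_flip(image), vertical_flip(image), rotate90_ccw(image)]


sample = [[1, 2],
          [3, 4]]
for variant in augment(sample):
    print(variant)
```

Each call returns a new list, so the original image is left untouched and one source scan yields several training samples.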

. . Contrast enhancement
Increasing contrast is one of the most important and useful techniques for improving the essential elements of an image. Raw images normally contain noise, distortion, and low contrast that lower the image quality, which sometimes causes the loss of useful information (Perumal and Velmurugan, 2018). Contrast enhancement improves the image quality for further processing. More relevant characteristics can be extracted for the classification stage from the improved images than from the input images. The dataset chosen in this study contains poor-quality images with low contrast levels, which could lead to incorrect classification. Contrast enhancement methods fall into two main groups, spatial-domain and frequency-domain methods; examples include morphological enhancement, histogram equalization, contrast stretching, and contrast slicing. In our proposed experiment, two types of contrast enhancement were applied sequentially. First, a fast local Laplacian filter is applied to the augmented dataset. After this, a top-bottom hat filter is applied to the enhanced dataset to obtain better-quality MRI scans.

. . . Fast local Laplacian filter
The fast local Laplacian filter (FLLF) serves two main functions. First, it is applied to the raw images to boost boundary detail and reduce noise artifacts. Second, the images are transformed from the RGB color system to the YUV color space to isolate the Y component. Multiscale adjustments are crucial to photo editing but are especially vulnerable to halos. Advanced edge-aware algorithms and careful parameter adjustments are needed to obtain results without artifacts. These deficiencies were subsequently remedied by local Laplacian filters, which use standard Laplacian pyramids to generate a wide variety of effects. However, these filters are time-consuming, and their link to other methods is obscure.

. . . Top-bottom hat filter
In this filter, the top-hat part is employed for light objects on a darker background, whereas the bottom-hat part is utilized for dark objects on a light background. A key purpose of the top-hat transform is to correct the effects of non-uniform illumination. When shading is evident in an image, this filtering technique can effectively highlight the information in the image. The methods for contrast enhancement are visually presented in Figure 5.
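A top-bottom hat filter can be sketched with grayscale morphology on a small image. The text does not state how the two parts are combined, so the sketch below assumes a commonly used combination (original + top-hat − bottom-hat) with a flat square structuring element; the function names, the `radius` parameter, and the clipping range are illustrative assumptions.

```python
def _window_op(image, radius, op):
    """Apply `op` (min or max) over a square window clipped at the borders."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            values = [image[rr][cc]
                      for rr in range(max(0, r - radius), min(h, r + radius + 1))
                      for cc in range(max(0, c - radius), min(w, c + radius + 1))]
            row.append(op(values))
        out.append(row)
    return out


def erode(image, radius=1):
    return _window_op(image, radius, min)


def dilate(image, radius=1):
    return _window_op(image, radius, max)


def opening(image, radius=1):
    """Erosion followed by dilation: removes small bright details."""
    return dilate(erode(image, radius), radius)


def closing(image, radius=1):
    """Dilation followed by erosion: removes small dark details."""
    return erode(dilate(image, radius), radius)


def top_bottom_hat(image, radius=1, max_value=255):
    """Enhance contrast as image + top-hat - bottom-hat (assumed combination)."""
    opened = opening(image, radius)
    closed = closing(image, radius)
    out = []
    for r, row in enumerate(image):
        new_row = []
        for c, v in enumerate(row):
            top = v - opened[r][c]      # bright detail on a darker background
            bottom = closed[r][c] - v   # dark detail on a lighter background
            new_row.append(min(max_value, max(0, v + top - bottom)))
        out.append(new_row)
    return out


# A flat dark image with one bright pixel in the center.
image = [[10] * 5 for _ in range(5)]
image[2][2] = 50
enhanced = top_bottom_hat(image)
print(enhanced[2][2], enhanced[0][0])
```

In this toy case the bright pixel is pushed further from its background while the flat region is unchanged, which is the behavior the filter is meant to provide.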

. . Bayesian optimization
Bayesian optimization (BO) uses Bayes' theorem to guide the search for the minimum or maximum of an objective function. This method is particularly helpful for objective functions that are difficult to evaluate due to their complexity, noise, or cost. BO differs from other methods in that it takes previous parameter evaluations into account by updating a prior modeled as a Gaussian process (GP). Additionally, BO requires few iterations and converges rapidly, and it may avoid local optima in non-convex optimization problems. These properties make BO well suited to optimizing hyperparameters (HPs). All hyperparameters must be tuned to gain classification precision when utilizing DL architectures, because the choice of hyperparameters substantially affects the accuracy and precision of the prediction. When optimizing hyperparameters, the objective is to choose the values that provide the best validation results. Hyperparameter optimization is written mathematically as Equation 4:
$$x^{*} = \arg\min_{x \in X} f(x) \tag{4}$$
where f(x) is the objective score to be minimized, evaluated on the validation set, x is a hyperparameter configuration from the search space X, and x* is the configuration that minimizes f. Optimizing several parameters at once is faster with this approach.
The stopping condition of the BO algorithm is based on MaxTime: the algorithm stops when it reaches the MaxTime of 54,000 s, which is equivalent to 15 h. In this study, we utilized BO along with a DCNN, fine-tuning the hyperparameters to obtain the lowest error rate for the architecture. The optimized hyperparameters (L2Regularization, Section Depth, Momentum, and Learning Rate) are listed with their ranges in Table 3. Figure 6 illustrates the BO workflow.

. . Deep transfer learning
Transfer learning applies a model learned on one task to a different task. It has recently become popular in deep learning because it makes it possible to train deep neural networks with very little training data. Deep transfer learning is becoming more prominent in image classification because CNNs pre-trained on publicly available datasets such as ImageNet can be reused to achieve top classification accuracy in several application domains. After transfer learning (TL), the framework is fine-tuned (FT) to relearn all feature extraction (FE) and classification (C) parameters. FT is performed by initializing the feature extraction parameters with ImageNet weights, and the classification parameters are updated along with the TL weights. Figure 7 illustrates the deep transfer learning workflow.

. . Proposed ResNet-Self architecture
In this study, we proposed a modified ResNet-50 architecture based on the self-attention module, named ResNet-Self. Initially, we consider the ResNet-50 architecture, which is built from residual blocks. The original network contains 48 convolutional layers, along with one max-pooling layer, one average pooling layer, and one fully connected layer. The residual blocks in this network contain skip connections. Bottleneck filters of size 1 × 1 are applied in this network, which reduces the number of parameters.
The channel depth of this model originally ranges from 64 to 2,048, with filter sizes of 3 × 3. Moreover, a stride of 2 is used outside the residual blocks, and a stride of 1 is used within them. An average pooling layer at the end of the model is used for feature extraction and is followed by the fully connected and softmax layers. The initial performance of this model for AD stage classification was insufficient; therefore, we modified it with a self-attention module.
The proposed ResNet-Self architecture is illustrated in Figure 8. This figure shows that the self-attention layer was added after the global average pooling layer. A flattening layer that converts the input into 1D has been added before the self-attention layer. The first channel is combined with the second channel, and the softmax function is applied to create the attention map. After that, the generated attention map is combined with the third channel to produce the final attention features, which are further utilized to classify AD stages.

. . . Self-attention
The internal attention approach, sometimes called the self-attention (SA) strategy, uses internal information to automatically identify and highlight relevant information without needing external information. SA has low computational complexity and allows parallel computing. It operates on three feature matrices, X, Y, and V, which are defined by Equations 5-9.

$$\{Y, X\} \in \mathbb{R}^{T \times T} \tag{5}$$
Initially, the correlation score is computed among all rows of $Y$ and $X$ as follows:
$$P = X Y^{\tau}$$
where $Y^{\tau}$ denotes the transpose of $Y$ and $P \in \mathbb{R}^{T \times T}$. The softmax function is applied in the next step, which converts the correlation scores into probability values. Mathematically, it is formulated as follows:
$$SM(P)_{i,j} = \frac{e^{P(i,j)}}{\sum_{j'=0}^{T-1} e^{P(i,j')}}$$
Hence, the final attention map is obtained as follows:
$$\mathrm{Att} = SM(P)\, V$$
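The self-attention computation can be sketched in a few lines. This is a minimal illustration under the assumption that the correlation score is the product of X with the transpose of Y and that the softmax output is multiplied by V to form the attention features; the helper names are hypothetical, and plain nested lists are used in place of tensors.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(p):
    """Turn every row of correlation scores into probabilities."""
    out = []
    for row in p:
        shift = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, y, v):
    """Return (attention features, attention map) for matrices x, y, v."""
    scores = matmul(x, transpose(y))      # correlation between rows of x and y
    attention_map = softmax_rows(scores)  # each row sums to 1
    return matmul(attention_map, v), attention_map


x = [[1.0, 0.0], [0.0, 1.0]]
y = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
features, attention_map = self_attention(x, y, v)
print(attention_map)
print(features)
```

Each output row is a weighted average of the rows of V, with weights given by the corresponding row of the attention map.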

. . Proposed network training
After the design of the proposed model, the next step is to train it using the deep transfer learning concept. All layers of the model are trained; no layer is frozen. The hyperparameters of this network are presented in Table 3. Based on the hyperparameters selected using BO, the proposed model is trained on the augmented dataset. The best learning rate returned by BO is 0.00032, and the momentum value is 0.773. After the training process, the test data are employed for feature extraction.

. . . Deep features extraction
Typically, CNNs return three levels of feature maps: low-level, mid-level, and high-level. Each level contains different information. Low-level feature maps contain simple patterns such as edges, corners, and textures, and they have high dimensions. Mid-level feature maps contain more abstract and structured patterns, such as specific regions of objects or textures, whereas high-level features contain discriminative and semantically significant information. High-level features are extracted from the last stage of a CNN because of their lower dimensions. The low dimensionality reduces the memory requirement and computational complexity.
This study extracts deep features from the self-attention layer instead of the global average pooling layer. The self-attention layer returns the prominent and relevant global information within the images. The trained model is applied to the testing images, with a batch size of 128 during deep feature extraction. The self-attention layer features contain deeper information about the AD stages. The size of the extracted feature vector is N × 2,048. The extracted features are analyzed and optimized using an improved pathfinder optimization algorithm.

. . Best features selection
In this study, we utilized an evolutionary optimization algorithm named Entropy Path Finder Optimization (EPFO) for the best feature selection. Features are first selected through the original pathfinder optimization and later refined using an entropy approach that handles the uncertainty.

. . . Path finder optimization
In contrast to previously proposed swarm intelligence methods, the Pathfinder approach is not tied to a particular species. For instance, the seagull optimization algorithm is restricted to seagulls, whereas the gray wolf optimization technique is restricted to gray wolves. The Pathfinder algorithm is based on many survival traits and rules of animal groups. Based on fitness, the algorithm divides the animals of a group into two roles: the leader (the individual with the lowest fitness value) and the followers. The leader must find the best food and mark it for the followers. The followers use the markings left behind by the pathfinder as a reference point and then follow it. Neither role is fixed: the two roles may switch depending on each individual's search capability as the number of iterations rises; that is, a pathfinder may become a follower, and a follower may take over the role of the pathfinder. To optimize a task, the PFA is split into two phases. The first phase is exploration, in which the PFA updates the pathfinder's location using Equation (10), which involves the former position vector of the pathfinder. The current iteration count is denoted by the variable k. R3 is a random vector generated uniformly in the range [0,1], whereas A is produced in every iteration by applying Equation (11). The second phase is exploitation, which immediately follows the location change. The followers are updated using Equation (11), where $x_i^{k+1}$ indicates the updated location vector of the i-th individual after the location change, $x_i^{k}$ is the location vector of the i-th individual, $x_j^{k}$ is the neighboring individual, and $x_p^{k}$ is the pathfinder. $R_1$ and $R_2$ are random vectors, with $R_1 = \alpha r_1$ and $R_2 = \beta r_2$, where $r_1$ and $r_2$ are generated uniformly in the range [0,1]. α determines the degree to which each member moves relative to its neighbors and is hence called the coefficient of interaction. β sets a random distance that keeps the herd roughly together with the leader and is hence called the coefficient of attraction. Mathematically, it is formulated by Equation (12).
Here, $\mu_1$ and $\mu_2$ are two randomly generated vectors in the interval [−1,1], $D_{ij}$ is the distance between two individuals, k denotes the current iteration, and $k_{max}$ is the maximum number of iterations. A and ε can provide random-walk steps for all individuals when the second part of Equations (10) and (11) and the third part of Equation (12) are equal to zero. Therefore, to ensure that the motion is multidirectional and random, the values of A and ε should lie within a proper range.
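The two update rules can be sketched as follows. The equations themselves are not reproduced in this text, so the sketch assumes the commonly cited pathfinder formulation: the pathfinder moves by $2 r_3 (x_p^{k} - x_p^{k-1}) + A$ with $A = \mu_2 e^{-2k/k_{max}}$, and each follower moves by $R_1 (x_j^{k} - x_i^{k}) + R_2 (x_p^{k} - x_i^{k}) + \varepsilon$ with $\varepsilon = (1 - k/k_{max}) \mu_1 D_{ij}$. The function names and the choice of drawing random numbers per dimension are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def pathfinder_step(x_p, x_p_prev, k, k_max, rng):
    """Exploration: move the pathfinder using its previous displacement plus A."""
    new_position = []
    for current, previous in zip(x_p, x_p_prev):
        r3 = rng.random()                       # uniform in [0, 1]
        mu2 = rng.uniform(-1.0, 1.0)            # uniform in [-1, 1]
        a = mu2 * math.exp(-2.0 * k / k_max)    # shrinks as iterations proceed
        new_position.append(current + 2.0 * r3 * (current - previous) + a)
    return new_position


def follower_step(x_i, x_j, x_p, k, k_max, alpha, beta, rng):
    """Exploitation: pull follower i toward neighbor j and the pathfinder."""
    d_ij = math.dist(x_i, x_j)                  # distance between the two members
    new_position = []
    for xi, xj, xp in zip(x_i, x_j, x_p):
        r1 = alpha * rng.random()               # interaction with the neighbor
        r2 = beta * rng.random()                # attraction to the pathfinder
        mu1 = rng.uniform(-1.0, 1.0)
        eps = (1.0 - k / k_max) * mu1 * d_ij    # random-walk term
        new_position.append(xi + r1 * (xj - xi) + r2 * (xp - xi) + eps)
    return new_position


rng = random.Random(0)
leader = pathfinder_step([0.5, 0.5], [0.4, 0.6], k=1, k_max=100, rng=rng)
member = follower_step([0.1, 0.9], [0.3, 0.7], leader, k=1, k_max=100,
                       alpha=1.5, beta=1.5, rng=rng)
print(leader, member)
```

For feature selection, each position vector would still have to be mapped to a subset of the 2,048 features (for example by thresholding); the text does not specify that mapping, so it is left out here.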
After every position update, the KNN classifier is employed to measure the fitness value. The cost function of KNN is given by Equation (13), where the coefficients α and β have values of 0.94 and 0.014, respectively, and ε err is the error value, which is calculated using Equation (14).

. . . Entropy selection
Assume U is a discrete random variable represented as $u = \{u_1, u_2, \ldots, u_n\}$. If an element $u_i$ occurs with probability $p(u_i)$, the entropy H(U) of U is formulated by Equation (15):
$$H(U) = -\sum_{i=1}^{n} p(u_i) \log p(u_i) \tag{15}$$
where n denotes the total number of features. Fuzzy C-means clustering is utilized to construct the membership function of all features. The fuzzy membership method is defined in the following five steps.
In the first step, we assume the number of clusters (C), where 2 ≤ C ≤ N. In the next step, the jth cluster center is computed using Equation (16), where e ≥ 1 is a fuzziness coefficient and $\mu_{ij}$ is the degree of membership (DOM) of the ith data point $u_i$ in the jth cluster. The Euclidean distance is computed in the third step using Equation (17).
In the fourth step, the value of the fuzzy membership function is updated by Equation (18). In the final step, steps 2-4 are repeated until the change in µ relative to its previous values becomes sufficiently small. The fuzzy entropy function is then formulated as in Equation (19), where $\lambda_c$ is the class degree of the membership function (Khushaba et al., 2007). The fuzzy entropy process is applied to the selected features of Equation (12). The dimensions of the selected features are N × 1,467. The final features are employed for the classification. The pseudo-code of the proposed fuzzy entropy-controlled pathfinder algorithm is given in Algorithm 1.

    ... (11) and check the bound
      end
      calculate the new fitness of members
      find the best fitness using Equations (13) and (14)
      if best fitness < fitness of Pathfinder
        pathfinder = best member
        fitness = best fitness
      end
      for i = 2 to maximum number of populations
        if new fitness of member (i) < fitness of member (i)
          update member
          - Find entropy of updated member using Eq. 15.
          - Compute the fuzziness using Eqs. 16-18.
          - Find the fuzzy entropy value using Eq. 19.
        end
      end
      generate new A and ε
    end
Algorithm 1. The pathfinder algorithm.
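The entropy and fuzzy membership steps can be sketched as below. Equations (16)-(18) are not reproduced in this text, so the sketch assumes the standard fuzzy C-means update rules (membership-weighted cluster centers and inverse-distance memberships); the function names are hypothetical, the natural logarithm is assumed for the entropy, and the sketch requires a fuzziness coefficient strictly greater than 1.

```python
import math


def shannon_entropy(probabilities):
    """Entropy of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def fcm_centers(points, memberships, e=2.0):
    """Cluster centers as membership-weighted means of the data points."""
    n_clusters = len(memberships[0])
    dim = len(points[0])
    centers = []
    for j in range(n_clusters):
        weights = [m[j] ** e for m in memberships]
        total = sum(weights)
        centers.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                        for d in range(dim)])
    return centers


def fcm_memberships(points, centers, e=2.0):
    """Degree of membership of every point in every cluster (rows sum to 1)."""
    power = 2.0 / (e - 1.0)  # requires e > 1
    memberships = []
    for point in points:
        dists = [math.dist(point, c) for c in centers]
        if min(dists) == 0.0:
            # The point coincides with a center: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        memberships.append([1.0 / sum((d_j / d_k) ** power for d_k in dists)
                            for d_j in dists])
    return memberships


points = [[0.0], [0.2], [0.8], [1.0]]
centers = [[0.1], [0.9]]
for _ in range(10):  # alternate the two updates for a fixed number of rounds
    memberships = fcm_memberships(points, centers)
    centers = fcm_centers(points, memberships)
print(centers)
print([round(shannon_entropy(m), 3) for m in memberships])
```

A point that sits midway between two centers gets equal memberships and therefore the highest entropy, which is the kind of uncertainty the selection step is meant to penalize.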

Results and analysis
The proposed AD stage classification model is evaluated on a Kaggle dataset. This section presents all the experiments conducted and the corresponding results, offering insight into the efficacy of the proposed model in accurately diagnosing AD.

Experimental setup and evaluation measures
The experimental process of this study is discussed here. The proposed AD framework is evaluated on a publicly available dataset that includes four classes, as mentioned in Section 3.1. The dataset is divided using a 50:50 split, and augmentation is performed on the training data. Features are extracted from the training data and optimized for the best feature selection. The selected features are classified using machine learning classifiers, and the following measures are computed: recall rate, precision rate, F1-score, MCC, and Kappa. The entire experimental process was conducted in MATLAB 2023a on a personal computer with 128 GB RAM, a 512 GB SSD, and a 12 GB NVIDIA RTX 3060 graphics card.
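The evaluation measures listed above can all be derived from a confusion matrix. The following minimal sketch shows one common way to compute them; it is not the authors' MATLAB code. It assumes macro-averaging over classes for precision, recall, and F1 (the paper does not state the averaging scheme), and uses the standard multi-class definitions of Cohen's Kappa and MCC. The function name is hypothetical.

```python
import math


def evaluation_measures(cm):
    """Compute measures from a confusion matrix.

    cm[i][j] = number of samples of true class i predicted as class j.
    Precision, recall, and F1 are macro-averaged (an assumption).
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]                        # row sums
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums

    recalls = [cm[i][i] / true_counts[i] if true_counts[i] else 0.0 for i in range(k)]
    precisions = [cm[i][i] / pred_counts[i] if pred_counts[i] else 0.0 for i in range(k)]
    f1_scores = [
        2 * p * r / (p + r) if (p + r) else 0.0
        for p, r in zip(precisions, recalls)
    ]

    accuracy = correct / total

    # Cohen's Kappa: observed agreement corrected for chance agreement.
    overlap = sum(t * p for t, p in zip(true_counts, pred_counts))
    expected = overlap / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    # Multi-class Matthews correlation coefficient.
    numerator = correct * total - overlap
    denominator = math.sqrt(
        (total ** 2 - sum(t * t for t in true_counts))
        * (total ** 2 - sum(p * p for p in pred_counts))
    )
    mcc = numerator / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1_scores) / k,
        "kappa": kappa,
        "mcc": mcc,
    }


# Toy two-class example (made-up counts, not data from the paper).
toy_scores = evaluation_measures([[4, 1], [1, 4]])
```

The results tables report these measures as percentages, so each returned value would be multiplied by 100 for comparison.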

Proposed ResNet-Self results (random values)
The proposed ResNet-Self CNN architecture is tested on 1,600 images in this experiment. The hyperparameters of this experiment are randomly initialized, guided by related work (e.g., a learning rate of 0.0001 and a momentum of 0.70), and training is then performed. Features are extracted from the testing data, and the maximum accuracy of 99.93% is obtained by the NNN classifier (results are given in Table 4). The precision of this classifier is 99.93%, and the Kappa value is 99.80%. The computational time taken by the NNN classifier is 33.532 s, whereas the minimum noted time is 28.071 s for the bilayered NN. The rest of the classifiers obtained accuracies of 98.7, 99.90, 99.65, 98.97, 73.86, 98.75, 99.88, and 99.72%, respectively.

Bayesian optimization results
This section presents the results obtained from Bayesian optimization (BO). We executed the BO algorithm 100 times and obtained a learning rate of 0.00010195, a momentum of 0.81079, an L2Regularization of 2.8724e−10, and a section depth of 3. These are the best feasible points. Based on these points, the classification was performed, and the results are noted in Table 5. In this table, the MNN classifier achieved a maximum accuracy of 99.95%. The precision rate of this classifier is 99.95%, the Kappa value is 99.87%, and the MCC value is 99.95%. In addition, the computation time of this classifier is 17.78 s. Compared to the results in Table 4, this experiment shows improved accuracy, precision, Kappa, and MCC values. Moreover, the computation time of this experiment is less than that of the results in Table 4. The results show that selecting hyperparameters using BO can improve the accuracy and reduce the computational cost.
Proposed feature selection
Compared to Tables 4 and 5, the selection results are better. Moreover, the computation time of each classifier is also noted, and the minimum noted time for this experiment is 12.338 s, which is less than in Tables 4 and 5. Overall, the time is decreased after employing the optimization method.
To further reduce the computational time, we improved the PFA using a fuzzy entropy formulation in this study. The results of the proposed Fuzzy Entropy PFA (FEPFA) are given in the second half of Table 6. The maximum obtained accuracy for this technique is 99.90%, whereas the precision rate is 99.93%. The Kappa and MCC values of this experiment are 99.80 and 99.90%, respectively. In addition, the computation time of this classifier is 10.231 s, less than that of the original PFA (12.338 s).
Overall, the performance of this technique is improved and the time is minimized. The performance of the NNN classifier can be further verified using the confusion matrix illustrated in Figure 9. In this figure, the diagonal values represent the correct prediction rates of each class.

Discussion and comparison
In this section, a detailed analysis of the proposed study is presented in the form of visual graphs and a comparison with recent state-of-the-art (SOTA) techniques. The proposed framework for AD stage classification was discussed in Section 3.1, and its visual illustration is shown in Figure 1. The MRI dataset was used for the experimental process (a few sample images are shown in Figures 2, 3). The augmentation process was performed to increase the number of images for a better training process. After that, a new model named ResNet-Self, shown in Figure 8, was proposed for the accurate classification of AD stages. The performance of AD stage classification is improved by the proposed FEPFA technique, which selects the best features. The results are presented in Tables 4-6. Table 4 presents the results of the proposed ResNet-Self architecture using random initialization of hyperparameters. Table 5 presents the results of the proposed ResNet-Self after employing BO for hyperparameter selection. Table 5 shows better accuracy, precision rate, MCC, and Kappa performance than Table 4. The computational time and precision rate are further improved using the proposed FEPFA feature selection algorithm, and the results are presented in Table 6. In this table, accuracy is also improved and time is significantly decreased. In addition, the proposed FEPFA is compared with the original PFA, showing an improvement in accuracy, precision, MCC, and computational time. The overall time comparison is illustrated in Figure 10. This figure clearly shows that the proposed selection method consumed less time than the other steps.
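The time savings discussed above can be made concrete with a small worked calculation. The sketch below uses only the computation times quoted in the text for the highlighted result of each experiment (33.532 s, 17.78 s, 12.338 s, and 10.231 s). These figures belong to different classifiers across experiments, so the percentages are indicative only; the dictionary keys and the helper name are hypothetical labels introduced for illustration.

```python
# Computation times (seconds) quoted in the text for each experiment.
reported_times = {
    "random": 33.532,   # Table 4, NNN classifier, random hyperparameters
    "bo": 17.78,        # Table 5, MNN classifier, Bayesian optimization
    "pfa": 12.338,      # Table 6, minimum time with the original PFA
    "fepfa": 10.231,    # Table 6, proposed FEPFA
}


def percent_reduction(before, after):
    """Relative time saving, as a percentage of the earlier time."""
    return 100.0 * (before - after) / before


baseline = reported_times["random"]
reduction_vs_random = {
    name: percent_reduction(baseline, seconds)
    for name, seconds in reported_times.items()
    if name != "random"
}
fepfa_vs_pfa = percent_reduction(reported_times["pfa"], reported_times["fepfa"])

for name, value in reduction_vs_random.items():
    print(f"{name}: {value:.1f}% less time than the Table 4 figure")
print(f"fepfa vs pfa: {fepfa_vs_pfa:.1f}% less time")
```

Under these assumptions, the FEPFA figure is roughly two thirds lower than the Table 4 figure and about one sixth lower than the original PFA figure.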
Table 7 compares the methods currently utilized for predicting AD. To enhance the categorization of early AD phases while reducing parameters and computational costs, a novel detection network named DAD-Net was introduced by Mohi et al. (2023). This network classified the initial AD stages and depicted class activation characteristics as a heat map of the brain, achieving 99.2% accuracy on a Kaggle dataset. Additionally, AI-Atroshi et al. (2022) utilized convolutional layers with freeze

Conclusion and future study
It is challenging to promptly diagnose and predict Alzheimer's disease using multiclass datasets. A computerized technique is widely required for early AD prediction from MRI images. This study proposes a computerized framework based on deep-learning and optimization algorithms. The dataset imbalance issue is resolved at the initial stage using mathematical formulations, which improves the training capability of the proposed ResNet-Self deep model. The proposed ResNet-Self model is a ResNet-50 architecture modified by adding a self-attention module. The self-attention module shows improved accuracy; however, the random initialization of hyperparameters impacts the accuracy and computational time. Therefore, we implemented a BO technique that automatically selects the hyperparameters for the training process. Moreover, we proposed a feature selection algorithm named FEcPFA that selects the best features and shows improved accuracy (99.90%), precision rate, and Kappa value. In addition, the computational time is significantly reduced, which is the strength of FEcPFA. The limitation of the proposed framework is that the optimized hyperparameters make the model less generalized and can lead to overfitting. In the future, a new custom model will be proposed based on the fire module, and the output of that module will be employed with self-attention and cross-validation to overcome overfitting. In addition, more MRI datasets will be utilized for the experimental process.

FIGURE Classes of Alzheimer's disease for the classification.

FIGURE Description of AD stages.

FIGURE Proposed model of classification of AD stages.

FIGURE Visual illustration of image augmentation using mathematical techniques.

FIGURE Visual illustration of contrast enhancement.

FIGURE Bayesian optimization workflow for hyperparameters selection.

FIGURE Deep transfer learning architecture for classification of AD stages.

FIGURE Proposed ResNet-Self architecture for classification of AD stages.
(10), where $x_p^{k+1}$ denotes the updated position vector in PFA, $x_p^{k}$ indicates the present position vector in PFA, while $x_p^{k-1}$

While k < maximum number of iterations
    α and β = random number in [1, 2]
    update the position of Pathfinder using Equation (10) and check the bound
    if the new Pathfinder is better than old
        update the Pathfinder
    end
    for i = 2 to maximum number of populations
        update the position of members using the Equation (

FIGURE Confusion matrix of the NNN classifier using proposed feature selection algorithm.

FIGURE Comparison among middle steps of the proposed method in terms of computational time.
used HAAR-based object identification techniques because they are more suitable with discriminant attributes and generated

TABLE Description of AD classes dataset.
set of hyperparameters whose values lie in that range. Training takes longer and is extremely difficult to do by hand with DNN models with numerous hyperparameters. ML and simulations employ BO. FFNN designs alter hyperparameters in CV-based techniques to enhance the performance of the mode network.
TABLE Proposed prediction results of AD stages using initialization of random hyperparameters.
TABLE Proposed classification results after employing Bayesian optimization-based selection of hyperparameters.
Table 6 presents the AD stage classification results using the proposed selection of BO-extracted features. In the first half of this table, results are presented for the original pathfinder algorithm. The PFA was applied to the BO-based deep features, and classification was then performed. The maximum obtained accuracy for this experiment is 99.82%. The precision and recall values are 99.83 and 99.83%. In addition, the Kappa and MCC values are 99.80 and 99.80%, respectively. Compared to Tables

Table 6. The maximum obtained accuracy for
TABLE Proposed classification results after employing Bayesian optimization and proposed feature selection algorithm.
TABLE Comparison of proposed method results with existing techniques.