
ORIGINAL RESEARCH article

Front. Comput. Sci., 21 January 2026

Sec. Computer Security

Volume 7 - 2025 | https://doi.org/10.3389/fcomp.2025.1687867

Identifying key features for phishing website detection through feature selection techniques


Raed Alazaidah1, Mohammad BaniSalman1, Khaled E. Alqawasmi1, Ali Abu Zaid1, Yousuf Hazaimeh1, Fuad Sameh Alshraiedeh1, Emma Qumsiyeh2*
  • 1Faculty of Information Technology, Zarqa University, Zarqa, Jordan
  • 2Faculty of Engineering and Information Technology, Palestine Ahliya University, Bethlehem, Palestine

Over the past few years, phishing has evolved into an increasingly prevalent form of cybercrime, as more people use the Internet and its applications. Phishing is a type of social engineering that targets users' sensitive or personal information. This paper seeks to achieve two main objectives: first, to identify the most effective classifier for detecting phishing among 40 classifiers representing six learning strategies; second, to determine which feature selection method performs best on phishing website datasets. By analyzing three unique phishing datasets and evaluating eight metrics, this study found that Random Forest and Random Tree were superior at identifying phishing websites compared with other approaches. Similarly, GainRatioAttributeEval and InfoGainAttributeEval performed better than the five alternative feature selection methods considered in this study.

1 Introduction

Due to the widespread use of online services like e-commerce and social media and the increased access afforded by the Internet, users are increasingly susceptible to cyberattacks targeting sensitive information, such as usernames or credit card details. One popular method used by attackers is called phishing, which uses fraudulent websites that appear authentic and trick individuals into divulging their private data (Athulya and Praveen, 2020). This can be accomplished using email or text messages designed solely for this purpose; even communication between clients and companies may contain such deceptive links. Most phishing attempts are motivated by financial gain, the installation of malware on user machines, or identity theft.

Recent findings indicate a dramatic increase in unique reported instances, exceeding 199 thousand detections in December 2020 alone—an alarming statistic compared with the Anti-Phishing Working Group's results from previous years (APWG, 2021). Moreover, since the early days of the COVID-19 pandemic in March 2020, when global fears were high, scammers have frequently issued phony certificates containing the words “COVID” or “corona.” These scammers have increasingly relied on digital certification policies and HTTPS protocols rather than on traditional tactics (Warburton, 2020).

Broadly, there are two ways to identify phishing: through user knowledge or anti-phishing software. Due to the realism of phishing emails and websites, many users find it challenging to detect them. Consequently, accurate software solutions for detecting these threats have become increasingly necessary. Software-based detection strategies include blocklisting, heuristics, and machine learning (Athulya and Praveen, 2020). Previous studies using machine learning often relied on numerous features to achieve high accuracy; however, extracting these features is not always possible in real-time scenarios, requiring more resilient solutions.

The purpose of this paper is to support the worldwide effort to combat phishing scams by leveraging advanced machine learning techniques to predict fraudulent websites accurately.

Numerous classification models have been proposed and employed to identify phishing websites, each claiming superiority over other approaches (Alazaidah et al., 2018). This study therefore aims to determine the most suitable classification method (classifier) for phishing datasets. To obtain a comprehensive overview of the findings, more than 40 classifiers across six learning strategies are evaluated using several metrics, including accuracy, precision, recall, and F1-measure.

Feature selection is a necessary preprocessing step when creating any machine learning (ML)-based model. Its purpose is to identify relevant features that aid in constructing the intended models by selecting non-redundant, consistent attributes (Alluwaici M. et al., 2020). The feature selection procedure prioritizes the characteristics that align most closely with the dataset's target attribute (Alluwaici M. et al., 2020).

To achieve the first goal, 40 classifiers from six well-known learning strategies were selected for assessment. The evaluation phase encompasses eight diverse, commonly used metrics, including accuracy, precision, recall, and AUC. Besides, it aims to implicitly identify the best learning strategy among those considered using seven distinct evaluation indicators: accuracy, precision, recall, F-measure, MCC, PRC area, and ROC area (receiver operating characteristic).

The second objective of this study is to determine the optimal feature selection technique for predicting phishing websites. To achieve this objective, five commonly used feature selection methods were assessed and compared, using the same classifiers as in the first objective, across three evaluation metrics: accuracy, precision, and recall.

The remaining sections of the paper are structured as follows: Section 2 reviews the current literature on implementing ML techniques for phishing. In Section 3, we present our methodology, results, and discussion. Finally, concluding remarks and future directions are proposed in Section 4.

2 Related research

In this section, we examine prior research that has used machine learning techniques to detect phishing. In their study on fuzzy rough set feature selection, Zabihimayvan and Doran (2019) used multiple features to construct a model intended to detect fraudulent activity attempts by criminals intentionally sidestepping existing anti-phishing measures on Iranian banking websites. They trained and tested their system using fuzzy experts, achieving an accuracy of around 88%. Still, they acknowledged that there is scope for optimizing feature selection during the training/testing phases, which could increase predictive power while reducing prediction time.

A different approach was taken by Cui (2019), who mined data from multiple search engines to identify idle URLs previously exposed through popular searches, internal links shared between related sites, and frequently visited pages. Twelve (12) distinct characteristics describing the intra-relatedness and popularity of URL structures and components were used to build classifiers that achieved overall classification rates of nearly 95%, with about 1.5 false positives per classification session. However, the approach may overlook obfuscated content in linked materials, such as algorithmically generated domain-name variations, pages hosted entirely on malicious domains, or shortened links commonly employed to evade detection.

Gandotra and Gupta (2021) compared various ML techniques using a 30-feature set comprising approximately 5,000 phishing websites and over 6,000 authentic webpages. This study found that incorporating feature selection enables faster creation of effective phishing detection models while maintaining accuracy. Notably, their results highlight that random forest classification (RF) achieves superior accuracy regardless of whether feature selection is used.

Detecting phishing attempts using ML often involves analyzing lexical features of URLs. This method, pioneered by Abutaha et al. (2021), was intended for use as a browser plug-in that scrutinizes a webpage's URL to alert users before they visit it. To test the efficacy of this technique, over one million legitimate and fraudulent URLs were used in experiments that extracted 22 variables, which were reduced to 10 key ones.

Findings revealed an accuracy rate of 99.89% when combined with SVM classification, surpassing the RF classifier, gradient boosting classifier (GBC), and neural network approaches trialed alongside it.

Chapla et al. (2019) proposed a fuzzy-logic-based framework for detecting phishing websites, using a dataset containing both legitimate and fraudulent URLs. The model achieved 91.4% accuracy but was limited by a small sample of 1,000 instances, with features focused solely on URL-related attributes; as a result, it is less effective at identifying other bypass techniques.

Tan (2018) improved the performance of a phishing URL detection system by using lexical features. A model proposed in Chiew et al. (2019) achieved high accuracy while being independent of third-party services and source code analysis, thereby requiring less processing time. Meanwhile, the authors of Abdelhamid et al. (2014) sought to enhance the accuracy of phishing detection systems through feature selection and an ensemble learning approach, achieving 95% accuracy in their experiments.

In yet another effort (Su et al., 2023), an innovative approach used seven distinct machine learning algorithms to detect risks posed by various unwanted attacks, including those utilizing zero-day exploits; the selected security features overcame issues such as language dependency and reliance on external parties during real-time monitoring.

Rahman et al.'s research also explored the ability of machine learning classifiers on various phishing-related datasets (Gandotra and Gupta, 2021). This initiative likewise demonstrated comparable results, with gradient boosting trees (GBT) outperforming other methods, such as random forest (RF), across all metrics.

OFS-NN, proposed in Sahingoz et al. (2019), combines optimal feature selection with a neural network to mitigate overfitting by using a new metric, the feature validity value (FVV). Experimental results on two datasets demonstrated that FVV outperformed information gain and optimal feature selection across various feature categories, including abnormal, domain, HTML/JavaScript, and address-bar features. The OFS-NN model achieved an overall accuracy of 0.945; among the feature types used for detection, the highest accuracy, 0.903, was observed with address-bar features, while the lowest, 0.562, was observed with HTML/JavaScript features.

Another phishing detection system was introduced by Sahingoz et al. (2019), which comprises 40 NLP-based traits, along with additional hybrid characteristics derived from word vectorization, totaling about 1,700 more relevant aspects.

In their study, the authors compared seven distinct algorithms and ultimately determined that a random forest implementation using solely natural language processing features delivered the best performance, correctly identifying fraudulent websites nearly 98% of the time, the highest efficacy among all tested methodologies.

In Alazaidah et al. (2024), the authors conducted a comparative analysis of 24 classifiers across two datasets using several evaluation metrics. The results revealed the superiority of the random forest, filtered classifier, and J48 classifiers. The authors suggest considering additional classification models with different learning strategies, as well as more datasets and evaluation metrics.

The research in Aljofey et al. (2025) proposed a hybrid methodology that combines URL character embeddings with several handcrafted features. Three datasets were used in this work: two are benchmarks, and the third was collected and preprocessed by the authors. The results showed excellent performance across accuracy and other evaluation metrics.

Several deep learning optimization techniques were used in Barik et al. (2025) to improve phishing prediction on websites. The authors used standardization and variational autoencoder techniques in the preprocessing step, and an enhanced grid search optimizer to improve accuracy. The results showed superior performance across accuracy, precision, and F1-score metrics. Unfortunately, utilizing only one dataset does not help in generalizing the findings of the conducted research. Several other related research works can be found in Ganjei and Boostani (2022), Gareth et al. (2023), Ni et al. (2022), Nti et al. (2022), Rashid et al. (2020), Srivastava (2014), Ubing et al. (2019).

Throughout this literature review, random forests perform comparatively better than their counterparts in detecting phishing using machine learning. However, gradient boosting machines (GBM) were frequently absent from comparisons, and approaches such as minimal-input or noise-filtered data preparation are still in their early phases, indicating that extensive future research remains vital.

3 Research methodology

The methodology employed in this paper is depicted in Figure 1. The first phase in Figure 1 involves collecting the datasets. Afterward, the datasets are cleaned and preprocessed. Then, several feature selection techniques are trained on the pre-processed datasets and evaluated. Next, 40 classification models are trained on the datasets using the selected features from the previous step. These classifiers are compared using several well-known evaluation metrics.


Figure 1. Research methodology workflow diagram.

A description of the three website phishing datasets used in this research is provided in Section 3.1, while the subsequent subsections evaluate the performance of the feature selection and machine learning algorithms on these datasets.

Section 4 then considers which classification model is most appropriate for phishing website datasets; three datasets are examined for this purpose.

In addition, Section 5 evaluates five renowned feature selection methods to identify the best one, and Section 6 outlines the most efficient classifiers before the primary results of these analyses are discussed at length.

In addition, 40 classifiers from six learning strategies are evaluated and contrasted in terms of their predictive efficacy across the three datasets under consideration. These examined classifiers encompass:

RandomTree, RandomForest, REPTree, DecisionStump, HoeffdingTree, LMT, and J48 from the Trees learning strategy; BayesNet, NaiveBayesUpdateable, and NaiveBayes from the Bayes learning strategy; Logistic, MultilayerPerceptron, SimpleLogistic, VotedPerceptron, and SMO from the Functions strategy; IBK, KStar, and LWL from the Lazy learning strategy; AdaBoostM1, AttributeSelectedClassifier, Bagging, ClassificationViaRegression, FilteredClassifier, IterativeClassifierOptimizer, LogitBoost, MultiClassClassifier, MultiClassClassifierUpdateable, RandomCommittee, RandomizableFilteredClassifier, RandomSubSpace, Stacking, WeightedInstancesHandlerWrapper, Vote, and CVParameterSelection from the Meta learning strategy; DecisionTable, JRip, OneR, PART, and ZeroR from the Rules learning strategy; and finally, InputMappedClassifier from the Misc learning strategy.

The WEKA software's default settings are used for all classification models. WEKA (Waikato Environment for Knowledge Analysis) is a renowned, frequently used data analysis tool (Rao et al., 2020). The outcome validation process uses 10-fold cross-validation to ensure reliable results.
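The same validation protocol can be sketched in Python, assuming scikit-learn's RandomForestClassifier as a rough stand-in for WEKA's RandomForest; the synthetic data below is a placeholder for the phishing datasets, not the paper's actual data.

```python
# Sketch: 10-fold cross-validation of a random forest, mirroring the
# WEKA setup described above. Synthetic stand-in data: 30 features in
# {-1, 0, 1}, as in DS1, with a toy "phishing" label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.integers(-1, 2, size=(500, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label depends on two features

clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Each of the 40 WEKA classifiers would be evaluated in the same loop, with only `clf` swapped out.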

To compare the considered classification models, seven performance metrics were analyzed: accuracy, precision, recall, F-measure, MCC (Matthews correlation coefficient), ROC area, and PRC area. The equations needed to calculate these metrics are given below.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

TP rate = TP / (TP + FN)

FP rate = FP / (FP + TN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Accuracy is a metric that indicates how frequently a machine learning model predicts the correct outcome. The number of right guesses divided by the total number of forecasts yields accuracy (Alzyoud et al., 2024; Alazaidah et al., 2023a,b).

Precision is a metric that indicates how often a machine learning model correctly predicts the positive class. Precision can be calculated as the number of correct positive predictions (true positives) divided by the total number of positive predictions made by the model (including true and false positives).

Recall is a metric that indicates how often a machine learning model accurately detects positive examples (true positives) from all actual positive samples in the dataset. Divide the number of true positives by the number of positive cases to determine recall. The latter includes true positives (correctly identified cases) and false negatives (missed cases) (Al-Batah et al., 2023; Pei et al., 2022).

MCC is the best single-value classification metric for summarizing a confusion or error matrix. A confusion matrix has four entities:

• True positives (TP)

• True negatives (TN)

• False positives (FP)

• False negatives (FN)

And is calculated by the formula:

MCC = (TN × TP − FN × FP) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

F-measure is an alternative machine learning evaluation metric that assesses the predictive skill of a model by elaborating on its class-wise performance rather than its overall performance, as done by accuracy. The F1 score combines two competing metrics—precision and recall—of a model, making it widely used in recent literature.

F-measure = 2 × (Recall × Precision) / (Recall + Precision)

ROC Area: a metric that graphically assesses classifier performance across varying thresholds by plotting the false positive rate on the x-axis and the true positive rate on the y-axis.

True Positives (TPs): instances in which the model correctly identifies positive examples.

True Negatives (TNs): represent cases where the model correctly recognizes and labels negative examples.

False Positives (FPs): occur when the model mistakenly identifies examples as positive. In other words, these are instances where negative examples are mistakenly labeled as “positive.”

False Negatives (FNs): arise when positive examples are incorrectly classified as negative. These are cases in which positive examples are incorrectly labeled as “negative.”
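The formulas above can be checked on a small worked example. The confusion-matrix counts below are illustrative only, not taken from the paper's experiments.

```python
# Sketch: computing the evaluation metrics above from one confusion matrix.
import math

TP, TN, FP, FN = 90, 85, 15, 10  # illustrative counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)            # also the TP rate
fp_rate   = FP / (FP + TN)
f_measure = 2 * recall * precision / (recall + precision)
mcc = (TN * TP - FN * FP) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f_measure:.3f} MCC={mcc:.3f}")
```

For these counts, accuracy is 175/200 = 0.875 and recall is 90/100 = 0.9, matching the definitions given above.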

3.1 Description of datasets

Three datasets, available for download from the UCI repository, were used in this study. The first dataset, a binary classification set, contains 11,055 instances with 30 integer features, most of which are binary. The second dataset comprises three class labels, supports multi-class classification, and provides nine integer-type features and 10,000 examples. The third dataset comprises two class labels, consists of 13 integer-type features, and provides 2,670 instances. Table 1 presents the distinguishing qualities of the three sets for quick reference. This research focuses primarily on the first two datasets, which are the largest, while the third dataset is relatively small, with only two classes.


Table 1. Dataset characteristics.

This step focused on collecting the datasets and understanding their attributes. Three datasets, denoted DS1 and DS2 (Su et al., 2023; Alluwaici M. A. et al., 2020) and DS3 (Mohammad et al., 2015), were selected, as they have different numbers of features, only some of which are common. Table 2 summarizes the feature categories across the three datasets. DS1 and DS3 contain both internal features (i.e., derived from webpage URLs and the HTML/JavaScript source code available on the webpage itself) and external features (i.e., obtained by querying third-party services such as DNS, search engines, and WHOIS records), whereas DS2 contains only internal features (Mohammad et al., 2015).


Table 2. Categories of features for the three datasets.

3.2 Data preparation

Data preprocessing involves operations such as handling missing values, removing outliers, and eliminating redundant information. As stated in reference (Alazaidah et al., 2023a), the DS1, DS2, and DS3 datasets were free of missing data but required cleaning before use. For instance, the HttpsInHostname attribute in DS3 had all values set to 0, making it unnecessary for analysis.

To identify common attributes across these datasets (DS1-DS2-DS3), the authors checked their descriptions available in references (Mohammad et al., 2015) and (Alzyoud et al., 2024). The authors' citations for each dataset feature significantly simplified this preprocessing step.

It was noted that some feature pairs captured similar information in different formats, such as the numeric UrlLength and its categorical counterpart, “UrlLengthRT.” Where such a feature occurred in only one form, it was mapped to a single variable (e.g., URL_Length, found solely in dataset DS1); otherwise, the variants remained separate. Ultimately, after scrutinizing these details across the variables, a match was found between 18 key attributes among the three aforementioned sources (as shown in Table 3).
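The mapping step can be sketched with pandas; the column names and values below are illustrative placeholders, while the real correspondence follows Table 3.

```python
# Sketch: mapping differently named but equivalent features to a shared
# name before combining datasets. Values are illustrative only.
import pandas as pd

ds1 = pd.DataFrame({"URL_Length": [54, 23], "Result": [1, 0]})
ds3 = pd.DataFrame({"UrlLength": [75, 18], "Result": [1, 0]})

# Map DS3's variant onto the shared name used in DS1
ds3 = ds3.rename(columns={"UrlLength": "URL_Length"})

# Keep only the attributes common to both datasets, then stack them
common = ds1.columns.intersection(ds3.columns)
merged = pd.concat([ds1[common], ds3[common]], ignore_index=True)
print(merged)
```

The same rename-then-intersect pattern extends to all 18 matched attributes across DS1, DS2, and DS3.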


Table 3. The matched features between DS1, DS2, and DS3, alongside the features retained after feature selection.

3.3 Feature selection

The significance of independent features was assessed using P-values, with a threshold of 0.05 to identify statistically significant features.

To begin with, the Spearman rank-order correlation method was used to assess collinearity between feature pairs. Figure 2 shows the correlation matrix for the features matched across DS1, DS2, and DS3; the pop-up window and on-mouse-over pair had the highest observed value at 0.73, followed by the pop-up window and favicon pair at 0.66. Most pairs showed small or negligible correlations.
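A minimal sketch of the correlation check, analogous to Figure 2; the feature names and values below are synthetic stand-ins, with one pair deliberately made dependent.

```python
# Sketch: Spearman rank-order correlation matrix over a few features.
# Names and data are illustrative; real values come from DS1-DS3.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
popup = rng.integers(0, 2, n)
df = pd.DataFrame({
    "PopUpWindow": popup,
    # Agrees with PopUpWindow ~80% of the time, so they correlate
    "OnMouseOver": np.where(rng.random(n) < 0.8, popup, 1 - popup),
    "Favicon":     rng.integers(0, 2, n),  # independent feature
})

corr = df.corr(method="spearman")
print(corr.round(2))
```

Highly correlated pairs flagged here would then be candidates for removal or merging before model training.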


Figure 2. Spearman correlation heatmap based on the merged DS1, DS2, and DS3 datasets, showing some collinearity between the different features (note that Result is the class attribute).

To identify multicollinearity (where three or more variables are jointly correlated even when no pair shows high individual similarity), variance inflation factor (VIF) scores were used (Ubing et al., 2019).

Each feature's VIF score is calculated as follows:

VIF_i = 1 / (1 − R_i²)

where R_i² is the unadjusted coefficient of determination for regressing the ith independent variable on the remaining ones.

Based on VIF analysis, in addition to p-values, the combined DS1-2-3 data identified 15 features as noteworthy and independent.

This process used various Python packages, including statsmodels to calculate VIF scores and p-values, scikit-learn to build logistic regression models, and Matplotlib and Seaborn to generate visualizations.
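The VIF computation can be sketched with statsmodels, as in the pipeline described above; the two collinear columns below are synthetic, and the feature names are illustrative rather than taken from the datasets.

```python
# Sketch: VIF scores via statsmodels. "NumDots" is constructed to be
# collinear with "UrlLength", so its VIF should be noticeably inflated.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
X = pd.DataFrame({
    "UrlLength":   x1,
    "NumDots":     x1 * 0.9 + rng.normal(scale=0.3, size=n),  # collinear
    "AgeOfDomain": rng.normal(size=n),                        # independent
})

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns)
print(vif.round(2))
```

Features whose VIF exceeds a chosen threshold (commonly 5 or 10) would be dropped before the p-value screening described above.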

For the feature selection and ranking step, four techniques were considered and evaluated. The first technique is the Correlation Attribute Evaluator (CAE). CAE measures the linear correlation between each input feature and the output feature (class) and is usually implemented using Pearson's correlation coefficient. The second technique is the Gain Ratio Attribute Evaluator (GRAE), which assesses feature significance by measuring each feature's gain ratio relative to the class label. The third technique is the Information Gain Attribute Evaluator (IGAE). IGAE measures a feature's worth based on its information gain with respect to the class label. The last technique is Principal Components Analysis (PCA), which aims to reduce data dimensionality by transforming a large dataset into a smaller one with low-correlated features.
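Three of the four evaluators have close scikit-learn analogues, sketched below on synthetic data: Pearson correlation for CAE, mutual information as an approximation of IGAE, and PCA directly. GRAE has no direct scikit-learn counterpart and is omitted; the feature layout is an assumption for illustration.

```python
# Sketch: scikit-learn analogues of the WEKA attribute evaluators.
# Synthetic data: feature 0 fully determines the class label.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(400, 5)).astype(float)
y = X[:, 0].astype(int)

# CAE analogue: Pearson correlation of each feature with the class
cae = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# IGAE analogue: information gain, approximated by mutual information
igae = mutual_info_classif(X, y, discrete_features=True, random_state=7)

# PCA: project onto three low-correlated components
X_reduced = PCA(n_components=3).fit_transform(X)

print("CAE :", cae.round(2))
print("IGAE:", igae.round(2))
print("PCA shape:", X_reduced.shape)
```

Both rankings should place feature 0 first, since it carries all of the class information in this toy setup.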

4 Comparative analysis amongst the classification models in the domain of website phishing

This section describes the process of determining the ideal classification model for phishing datasets. To attain this objective, three distinct phishing-related datasets have been analyzed in detail. Table 4 outlines the key attributes associated with these datasets, all of which can easily be obtained from the UCI repository.


Table 4. Comparative analysis of 40 classifiers utilizing feature selection via CAE, on dataset DS1.

The results of using 40 classifiers on the phishing website dataset 1 (DS1) are presented in Table 4 and analyzed with respect to the accuracy and precision metrics. The data reveal that IBK achieves the highest accuracy, whereas RandomCommittee achieves outstanding accuracy and precision.

Evaluating learning strategies indicates that Lazy achieves optimal accuracy, while RandomCommittee yields superior precision.

The Recall and MCC metric results for the phishing website dataset after applying 40 classifiers are outlined in Table 5. The table shows that random forest classification models have produced superior results when evaluated against these criteria.


Table 5. Comparative analysis of 40 classifiers utilizing feature selection via CAE, on dataset DS1.

Additionally, Trees outperforms the other learning strategies on both the recall and MCC metrics for this dataset (DS1).

A comparative analysis of 40 classifiers on the phishing dataset, in terms of accuracy and precision, is presented in Table 5.

Random forest outperforms the other considered classifiers in accuracy and precision on the phishing dataset (DS1), as shown in the table.

Moreover, among the six learning strategies assessed on these two measures, the Trees strategy yields better outcomes than its counterparts.

The precision metrics obtained from applying the 40 classifiers to the phishing dataset are shown in Table 6. According to the table, the RandomCommittee classifier achieves the highest precision among all classification models. Similarly, within the Trees learning strategy, Table 6 shows that the Random Forest classification model delivers superior outcomes.


Table 6. Comparative analysis of 40 classifiers utilizing feature selection via GRAE, on dataset DS1.

In conclusion, regarding optimizing the precision metrics shown in Table 6, function learning is our preferred approach, yielding the best results compared to other available strategies.

In Table 7, the random forest classification models achieve the best recall and MCC results on the phishing dataset (DS1). The random forest classifier belongs to the Tree learning strategy.


Table 7. Comparative analysis of 40 classifiers utilizing feature selection via IGAE, on dataset DS1.

Moreover, regarding the best learning strategy, Table 7 shows that the tree learning strategy achieves the best results for the recall and MCC metrics.

According to Table 8, the random forest classifier, from the tree learning strategy, achieves the highest precision. Additionally, this same classifier performs best again on the accuracy metric compared with the other classifiers. Furthermore, among the six considered learning strategies, Trees stands out as achieving superior results across comparisons.


Table 8. Comparative analysis of 40 classifiers utilizing feature selection via PC, on dataset DS1.

The outcomes of the 40 classifiers applied to the phishing website dataset, with respect to recall and MCC, are shown in Table 9.


Table 9. Comparative analysis of 40 classifiers utilizing feature selection via CAE, on dataset DS2.

Analysis of Table 9 indicates that, among all classification models, the random tree classifier achieved the highest accuracy and precision on the given dataset (DS2). Additionally, compared with the learning strategies exhibited by the remaining classification algorithms in Table 9, the Trees strategy was found to outperform the others.

The results from implementing 40 classifiers on the phishing website dataset (DS2), including the accuracy and precision metrics, are shown in Table 10.

The Random Forest model, a tree-based learning strategy, achieves higher accuracy and precision than other classification models, as shown in Table 10.


Table 10. Comparative analysis of 40 classifiers utilizing feature selection via CAE on dataset DS2.

Besides, when focusing solely on optimizing the precision metric through a strategic approach perspective, adopting the tree learning strategy can be highly effective.

Table 11 presents the results of applying 40 classifiers to the phishing website dataset (DS2), focusing on recall and MCC.


Table 11. Comparative analysis of 40 classifiers utilizing feature selection via GRAE, on dataset DS2.

According to Table 11, the Random Tree classifier performs exceptionally well on the Recall metric. At the same time, the Random Forest model achieves the best MCC among all considered classification models.

Furthermore, Trees proves to be an exceptional learning strategy, producing superior output compared with the five alternative strategies from both the recall and MCC perspectives.

The results obtained from the 40 classifiers applied to the phishing website dataset (DS2) for the recall and MCC metrics are presented in Table 12. The random tree classifier demonstrates superior recall, while the random forest and random tree stand out with exceptional performance on MCC among the classification models considered. Also, compared with the other learning strategies under review, Trees shows better results on both the recall and MCC measures.


Table 12. Comparative analysis of 40 classifiers utilizing feature selection via IGAE, on dataset DS2.

Additionally, these two classifiers have been most effective on this dataset, as indicated by their respective evaluation scores in Table 12.

The accuracy and precision metrics for the phishing dataset (DS2) were evaluated using 40 classifiers, and the results are presented in Table 13.


Table 13. Comparative analysis of 40 classifiers utilizing feature selection via PC, on dataset DS2.

From the table, it is evident that the IBK model under the lazy learning strategy, along with the random tree model under the tree learning approach, achieved the highest accuracy and precision values.

Furthermore, based on the findings in Table 13 regarding optimizing the precision metric for the Learning Strategy factor, Tree Learning should be selected for its superior performance.

Table 14 displays the results of the forty classifiers applied to a dataset (DS3) containing phishing websites, evaluated on the F-measure and ROC area metrics. Among the classifiers, random forest showed exceptional performance on both measures, and among the six learning strategies under scrutiny, Trees displayed better outcomes than the others.


Table 14. Comparative analysis of 40 classifiers utilizing feature selection via CAE, on dataset DS3.

From Table 14, the evaluation scores indicate that the random forest classifier was the most effective on this dataset compared with the other methods employed herein.

The results of running 40 classifiers on the phishing website dataset (DS3) are shown in Table 15, including F-measure and ROC metrics. According to the table, the random forest classifier outperforms other classification models on both F-measure and ROC for this dataset.


Table 15. Comparative analysis of 40 classifiers utilizing feature selection via CAE, on dataset DS3.

Additionally, Trees is the most effective learning strategy on both evaluation measures among the seven strategies considered here.

Table 16 shows the results of applying the 40 classifiers to the phishing website dataset (DS3) for both the F-measure and ROC metrics. According to this table, the random forest classifier again achieves the best results on both metrics among the considered classification models, and Trees performs best on both evaluation criteria when compared with the seven other learning strategies.


Table 16. Comparative analysis of 40 classifiers utilizing feature selection via GRAE, on dataset DS3.

The results of applying 40 classifiers to the phishing website dataset, with respect to F-measure and ROC metrics, are shown in Table 17. The random forest classifier outperforms the other considered classification models on both measures for this dataset, as shown in Table 17.


Table 17. Comparative analysis of 40 classifiers utilizing feature selection via IGAE, on dataset DS3.

Notably, Trees proves superior as a learning strategy, based on its performance across all evaluation criteria among the seven strategies compared here, particularly on F-measure and ROC metrics.

The results of using the 40 classifiers on the phishing website detection dataset (DS3) are depicted in Table 18 and analyzed using the F-measure and ROC metrics. The data reveal that random forest achieves the best F-measure and ROC scores, while the other top-performing methods, Random Committee and J48, also achieve strong scores on both measures.


Table 18. Comparative analysis of 40 classifiers utilizing feature selection via PC, on dataset DS3.

Evaluating the learning strategies indicates that Trees attains the best results on both the F-measure and ROC metrics.

The summary of the comparative analysis of the 40 classifiers across the three datasets, as presented in Tables 4–18, is shown in Table 19. In this table, “RC” refers to Random Committee, “RF” denotes Random Forest, and “RT” stands for Random Tree.


Table 19. Best classifier with respect to the evaluation metric and the dataset.

The study revealed that tree-based classifiers delivered superior results across the considered metrics and datasets. According to Table 19, the random tree classifier achieved the best results on 13 occasions, and the random forest classifier was best seven times. The Random Committee and IBK classifiers each performed best twice.

This indicates that, for phishing datasets, tree-based classifiers such as random forest and random tree are the preferred option, with committee-based classifiers ranking second. The random forest classifier excelled on the phishing dataset, in which all attributes are of integer type, and it also performed very well on the phishing website detection dataset and the phishing websites dataset. Its ability to perform well regardless of the number and types of attributes makes it evident why random forest remains a preferred choice among classification techniques.

5 Best feature selection method to use with phishing website datasets

The objective of this section is to determine the optimal feature selection approach suitable for phishing datasets. To achieve this, five popular methods are assessed and compared: ClassifierAttributeEval (CAE), CorrelationAttributeEval (CAE), GainRatioAttributeEval (GRAE), InfoGainAttributeEval (IGAE), and PrincipalComponents (PC). The default settings and parameters in WEKA were utilized throughout all evaluations.

These feature selection methods were applied to the phishing dataset, after which 40 classification models were trained using only the top-ranked 15 features, i.e., 50% of the 30 available attributes. The evaluation metrics used earlier, namely accuracy, precision, and MCC, are analyzed again here following the setup of Section 4. Table 20 presents the considered feature selection techniques applied to the phishing dataset and evaluated with respect to accuracy.
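The selection step itself, keeping the 15 top-ranked of the 30 attributes, can be sketched with scikit-learn, where `mutual_info_classif` serves only as a rough analogue of WEKA's InfoGainAttributeEval ranking (synthetic data; the actual study used WEKA with default parameters).

```python
# Sketch only: keep the 15 highest-ranked of 30 features (50%).
# mutual_info_classif is a rough analogue of WEKA's InfoGainAttributeEval.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           random_state=0)
selector = SelectKBest(mutual_info_classif, k=15).fit(X, y)
X_top = selector.transform(X)  # reduced feature matrix with 15 columns
print("kept feature indices:", sorted(selector.get_support(indices=True)))
print("shape after selection:", X_top.shape)  # (600, 15)
```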


Table 20. Evaluation of the considered feature selection methods on the phishing dataset-(DS1) using the accuracy metric.

As shown in Table 20, the random forest and IBK classifiers achieved their highest accuracies with the CAE method, and the Functions strategy performed best in terms of accuracy in this setting.

In addition, the Trees strategy performed optimally when combined with the CAE attribute selection method.

Table 20 clearly shows that using only 50% of the features generally improves accuracy.

For instance, when all features were used, the random forest classifier achieved the best accuracy result (96.2), while the next best classifier achieved 96.1. However, the top classifiers attained their highest accuracy scores on the phishing dataset by utilizing the GRAE and IGAE feature selection methods on just 50% of the features. This overall improvement, evident in Table 20, suggests that employing a preprocessing step such as feature selection may enhance the predictive performance of various classification models.
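The all-features vs. selected-features comparison discussed above can be sketched as follows (scikit-learn on synthetic data, as an illustration of the protocol rather than a reproduction of Table 20). Placing the selector inside the pipeline ensures each cross-validation fold ranks features on its own training split only.

```python
# Sketch only: 10-fold accuracy with all 30 features vs. the top 15.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

acc_all = cross_val_score(rf, X, y, cv=10, scoring="accuracy").mean()
# Selection inside the pipeline: each fold ranks features on its own
# training split, avoiding information leakage into the test fold.
pipe = make_pipeline(SelectKBest(mutual_info_classif, k=15), rf)
acc_sel = cross_val_score(pipe, X, y, cv=10, scoring="accuracy").mean()
print(f"all 30 features: {acc_all:.3f}, top 15 features: {acc_sel:.3f}")
```

Whether the reduced set wins depends on the data; on the phishing datasets studied here, the paper reports a general improvement.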

The evaluation results for the phishing dataset using the five feature selection methods are shown in Table 21, with emphasis on the precision metric. The random forest classifier achieved the highest precision, with its best result under the GRAE method, and its precision remained strong regardless of the feature selection method used.


Table 21. Evaluation of the considered feature selection methods on the phishing dataset-(DS1) using the precision metric.

Moreover, the Functions and Trees strategies proved to be efficient learning approaches for the precision metric on this dataset, with Trees reaching its maximum precision under GRAE.

Comparing Table 21 with the earlier results obtained using all features confirms a general improvement in precision when only 50% of the attributes are utilized. For instance, the best score with all attributes was 0.938, whereas the reduced feature set yielded better overall performance across the methods examined.

Hence, according to Table 21, using feature selection as a preprocessing step may improve the overall predictive performance of most classification models.

Table 22 displays the assessment results for the five feature selection techniques applied to the phishing dataset, using MCC as the metric. The random forest classifier achieved the highest MCC values with the IGAE technique, and the Functions strategy produced the best learning results on this dataset. The Trees strategy also achieved favorable results when IGAE was used for feature selection.


Table 22. Evaluation of the considered feature selection methods on the phishing dataset-(DS1) using the MCC metric.

Moreover, comparing the use of all features against only 50% of them shows an improvement in overall performance on the MCC metric; in the best case with all available features, the random forest classifier yielded a score of 0.887.

According to Table 22, considerable improvement in predictive performance can be expected across various classification models if appropriate feature selection is conducted during preprocessing, particularly when leveraging methods such as IGAE.

This comparison is illustrated in Figure 5, which shows that the IGAE feature selection method performs best.

Figures 3–5 reveal that the IGAE and GRAE feature selection methods, together with the Trees learning strategy, exhibit superior performance compared with the other strategies in terms of accuracy, precision, recall, MCC, F-measure, and ROC area across the three datasets. Moreover, the Rules and Misc learning strategies demonstrate subpar results on almost all metrics for the same three datasets.


Figure 3. Evaluation of the best feature selection method to use with phishing website datasets-(DS1).


Figure 4. Evaluation of the best feature selection method to use with phishing website datasets-(DS2).


Figure 5. Evaluation of the best feature selection method to use with phishing website datasets-(DS3).

Consequently, it is strongly advised against using the Rules and Misc learning strategies for phishing detection.

6 Conclusion and future research

This research aimed to identify optimal characteristics for creating a stronger machine learning model for detecting phishing websites. Over the past three decades, machine learning has made significant strides and has been implemented in many practical applications, including identifying malicious web pages used in scams or identity theft.

The paper investigated the best classification model for detecting these site types. While exploring which classification methods best handle phishing website detection datasets, the authors found that the Random Forest, Random Tree, and IBK classifiers were the most effective. In conclusion, after evaluating several feature selection methods for detecting fraudulent websites, InfoGainAttributeEval and GainRatioAttributeEval were deemed the most reliable options. However, further evaluations considering additional classification models and other metrics should follow; comparing their performance will provide additional insight into refining detection accuracy for tracing illicit online activity.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://data.mendeley.com/datasets/h3cgnj8hft/1.

Author contributions

RA: Writing – original draft, Writing – review & editing. MB: Writing – review & editing, Writing – original draft, Data curation. KA: Writing – original draft, Methodology, Writing – review & editing. AA: Writing – review & editing, Software, Writing – original draft. YH: Writing – review & editing, Writing – original draft, Project administration. FA: Writing – review & editing, Writing – original draft, Visualization. EQ: Writing – review & editing, Writing – original draft.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abdelhamid, N., Ayesh, A., and Thabtah, F. (2014). Phishing detection based associative classification data mining. Expert Syst. Appl. 41, 5948–5959. doi: 10.1016/j.eswa.2014.03.019


Abutaha, M., Ababneh, M., Mahmoud, K., and Baddar, S. A. H. (2021). “URL phishing detection using machine learning techniques based on URLs lexical analysis,” in 2021 12th International Conference on Information and Communication Systems (ICICS) (Valencia: IEEE), 147–152. doi: 10.1109/ICICS52457.2021.9464539


Alazaidah, R., Ahmad, F. K., Mohsen, M. F. M., and Junoh, A. K. (2018). Evaluating conditional and unconditional correlations capturing strategies in multi label classification. J. Telecommun. Electr. Comput. Eng. 10, 47–51.


Alazaidah, R., Al-Shaikh, A., Al-Mousa M, R., Khafajah, H., Samara, G., Alzyoud, M., et al. (2024). Website phishing detection using machine learning techniques. J. Stat. Applic. Probab. 13, 119–129. doi: 10.18576/jsap/130108


Alazaidah, R., Alzyoud, M., Al-Shanableh, N., and Alzoubi, H. (2023b). “The significance of capturing the correlations among labels in multi-label classification: an investigative study,” in AIP Conference Proceedings, Vol. 2979 (Jordan: AIP Publishing). doi: 10.1063/5.0177340


Alazaidah, R., Samara, G., Aljaidi, M., Haj Qasem, M., Alsarhan, A., and Alshammari, M. (2023a). Potential of machine learning for predicting sleep disorders: a comprehensive analysis of regression and classification models. Diagnostics 14:27. doi: 10.3390/diagnostics14010027


Al-Batah, M. S., Alzboon, M. S., and Alazaidah, R. (2023). Intelligent heart disease prediction system with applications in Jordanian hospitals. Int. J. Adv. Comput. Sci. Applic. 14, 1151–1159. doi: 10.14569/IJACSA.2023.0140954


Aljofey, A., Bello S, A., Lu, J., and Xu, C. (2025). Comprehensive phishing detection: a multi-channel approach with variants TCN fusion leveraging URL and HTML features. J. Netw. Comput. Applic. 238:104170. doi: 10.1016/j.jnca.2025.104170


Alluwaici, M., Junoh, A. K., AlZoubi, W. A., Alazaidah, R., and Al-luwaici, W. (2020). New features selection method for multi-label classification based on the positive dependencies among labels. Solid State Technol. 63.


Alluwaici, M. A., Junoh, K., and Alazaidah, R. (2020). New problem transformation method based on the local positive pairwise dependencies among labels. J. Inform. Knowl. Manag. 19:2040017. doi: 10.1142/S0219649220400171


Alzyoud, M., Alazaidah, R., Aljaidi, M., Samara, G., Qasem, M., Khalid, M., et al. (2024). Diagnosing diabetes mellitus using machine learning techniques. Int. J. Data Netw. Sci. 8, 179–188. doi: 10.5267/j.ijdns.2023.10.006


APWG (2021). Phishing Activity Trends Reports, 4th Quarter 2020. Anti-Phishing Working Group. Available online at: https://apwg.org/trendsreports/ (Accessed May 09, 2021).


Athulya, A. A., and Praveen, K. (2020). “Towards the detection of phishing attacks,” in 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI) (Tirunelveli, India: IEEE), 337–343. doi: 10.1109/ICOEI48184.2020.9142967


Barik, K., Misra, S., and Mohan, R. (2025). Web-based phishing URL detection model using deep learning optimization techniques. Int. J. Data Sci. Anal. 20, 1–23. doi: 10.1007/s41060-025-00728-9


Chapla, H., Kotak, R., and Joiser, M. (2019). “A machine learning approach for URL based web phishing using fuzzy logic as classifier,” in 2019 International Conference on Communication and Electronics Systems (ICCES) (Coimbatore: IEEE), 383–388. doi: 10.1109/ICCES45898.2019.9002145


Chiew, K. L., Tan, C. L., Wong, K., Yong, K. S. C., and Tiong, W. K. (2019). A new hybrid ensemble feature selection framework for machine learning-based phishing detection system. Inform. Sci. 484, 153–166. doi: 10.1016/j.ins.2019.01.064


Cui, Q. (2019). Detection and Analysis of Phishing Attacks. Diss. Université d'Ottawa/University of Ottawa.


Gandotra, E., and Gupta, D. (2021). An efficient approach for phishing detection using machine learning, multimedia security: algorithm development. Anal. Applic. 239–253. doi: 10.1007/978-981-15-8711-5_12


Ganjei, M. A., and Boostani, R. (2022). A hybrid feature selection scheme for high-dimensional data. Eng. Appl. Artif. Intell. 113:104894. doi: 10.1016/j.engappai.2022.104894


Gareth, J., Witten, D., Hastie, T., Tibshirani, R., and Taylor, J. (2023). “Statistical learning,” in An Introduction to Statistical Learning: With Applications in Python (Cham: Springer International Publishing), 15–67.


Mohammad, R. M., Thabtah, F., and McCluskey, L. (2015). Phishing Websites Features. School of Computing and Engineering, University of Huddersfield.


Ni, J., Shen, K., Chen, Y., Cao, W., and Yang, S. X. (2022). An improved deep network-based scene classification method for self-driving cars. IEEE Trans. Instrument. Measur. 71, 1–14. doi: 10.1109/TIM.2022.3146923


Nti, I. N., Narko-Boateng, O., Adekoya, A. F., and Somanathan, A. R. (2022). Stacknet based decision fusion classifier for network intrusion detection. Int. Arab J. Inform. Technol. 19, 478–490. doi: 10.34028/iajit/19/3A/8


Pei, M., Feng, Y., Changlong, Z., and Minghua, J. (2022). Smoke detection algorithm based on negative sample mining. Int. Arab J. Inform. Technol. 19, 1–9. doi: 10.34028/iajit/19/4/15


Rao, R. S., Vaishnavi, T., and Pais, A. R. (2020). CatchPhish: detection of phishing Websites by inspecting URLs. J. Ambient Intell. Hum. Comput. 11, 813–825. doi: 10.1007/s12652-019-01311-4


Rashid, J., Mahmood, T. M., Nisar, W., and Nazir, T. (2020). “Phishing detection using machine learning technique,” in 2020 First International Conference of Smart Systems and Emerging Technologies (SMARTTECH) (Riyadh: IEEE), 43–46. doi: 10.1109/SMART-TECH49988.2020.00026


Sahingoz, O. K., Buber, E., Demir, O., and Diri, B. (2019). Machine learning based phishing detection from URLs. Expert Syst. Applic. 117, 345–357. doi: 10.1016/j.eswa.2018.09.029


Srivastava, S. (2014). Weka: a tool for data preprocessing, classification, ensemble, clustering and association rule mining. Int. J. Comput. Applic. 88:10. doi: 10.5120/15389-3809


Su, J.-M., Chang, J., Indrayani, N. L. D., and Wang, C. (2023). Machine learning approach to determine the decision rules in ergonomic assessment of working posture in sewing machine operators. J. Saf. Res. 87, 15–26. doi: 10.1016/j.jsr.2023.08.008


Tan, C. L. (2018). Phishing Dataset for Machine Learning: Feature Evaluation. Mendeley Data. Available online at: https://data.mendeley.com/datasets/h3cgnj8hft/1 (Accessed May 10, 2021).


Ubing, A. A., Kamilia, S., Abdullah, A., Jhanjhi, N., and Supramaniam, M. (2019). Phishing Website detection: an improved accuracy through feature selection and ensemble learning. Int. J. Adv. Comput. Sci. Appl. 10, 252–257. doi: 10.14569/IJACSA.2019.0100133


Vigneswari, T., Vijaya, N., and Kalaiselvi, N. (2021). Early prediction of cervical cancer using machine learning techniques. Turkish J. Physiother. Rehabil. 32, 262–269.


Warburton, D. (2020). 2020 Phishing and Fraud Report. F5 Labs.


Zabihimayvan, M., and Doran, D. (2019). “Fuzzy rough set feature selection to enhance phishing attack detection,” in 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (New Orleans, LA: IEEE). doi: 10.1109/FUZZ-IEEE.2019.8858884


Keywords: classification, phishing websites, machine learning, feature selection, URL analysis

Citation: Alazaidah R, BaniSalman M, Alqawasmi KE, Abu Zaid A, Hazaimeh Y, Alshraiedeh FS and Qumsiyeh E (2026) Identifying key features for phishing website detection through feature selection techniques. Front. Comput. Sci. 7:1687867. doi: 10.3389/fcomp.2025.1687867

Received: 18 August 2025; Revised: 23 November 2025;
Accepted: 26 November 2025; Published: 21 January 2026.

Edited by:

Zainab Loukil, University of Gloucestershire, United Kingdom

Reviewed by:

Faisal Ahmad, Workday Inc., United States
Abdul Karim, Hallym University, Republic of Korea

Copyright © 2026 Alazaidah, BaniSalman, Alqawasmi, Abu Zaid, Hazaimeh, Alshraiedeh and Qumsiyeh. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Emma Qumsiyeh, e.qumsiyeh@paluniv.edu.ps
