Deep learning-based meta-learner strategy for electricity theft detection

Electricity theft damages power grid infrastructure and is also responsible for huge revenue losses for electric utilities. Integrating smart meters in traditional power grids enables real-time monitoring and collection of consumers’ electricity consumption (EC) data. Based on the collected data, it is possible to identify the normal and malicious behavior of consumers by analyzing the data using machine learning (ML) and deep learning methods. This paper proposes a deep learning-based meta-learner model to distinguish between normal and malicious patterns in EC data. The proposed model consists of two stages. In Fold-0, the ML classifiers extract diverse knowledge and learns based on EC data. In Fold-1, a multilayer perceptron is used as a meta-learner, which takes the prediction results of Fold-0 classifiers as input, automatically learns non-linear relationships among them, and extracts hidden complicated features to classify normal and malicious behaviors. Therefore, the proposed model controls the overfitting problem and achieves high accuracy. Moreover, extensive experiments are conducted to compare its performance with boosting, bagging, standalone conventional ML classifiers, and baseline models published in top-tier outlets. The proposed model is evaluated using a real EC dataset, which is provided by the Energy Informatics Group in Pakistan. The model achieves 0.910 ROC-AUC and 0.988 PR-AUC values on the test dataset, which are higher than those of the compared models.


Introduction
Non-technical losses (NTLs) are a major concern nowadays, defined as the energy consumed by clients without paying bills to electric utilities.These losses occur due to physical tempering of smart meters, energy fraud, installation error, inappropriate parameter settings of smart meters, etc., (Aslam et al., 2021;León et al., 2011;Glauner et al., 2016;Chen et al., 2023;Shehzad et al., 2022).Moreover, these losses are only not accountable for high revenue losses for power providers but also negatively impact smart grid reliability.Electricity theft is a primary reason for NTLs that are performed through meter tempering, bypassing smart meters, direct connections to distribution lines, changing meter readings over communication links, double tapping attacks, etc., (Shehzad et al., 2022;McLaughlin et al., 2013;Khattak et al., 2022).
A recent report indicates that NTLs cause a revenue loss of 96 billion Canadian dollars yearly; this amount is enough to supply energy to 77,000 homes for 1 year (Zheng et al., 2017;Haq et al., 2023).According to the latest World Bank report, India, China, and Brazil bear losses of 25%, 16%, and 6%, respectively, on their total electric supply (Aryanezhad, 2019).These losses are not only limited to developing countries; it is estimated that developed countries such as the United Kingdom bear £173 million, while the United States has to pay $6 billion every year (Jokar et al., 2015;Saraiva et al., 2015).Another report estimated that power companies pay more than 20 billion dollars annually for electricity theft.In addition, electricity theft creates many issues in smart grids like a high load on a grid that may create fire shocks and affects public safety.From the aforestated discussion, it is concluded that an accurate electricity theft detection (ETD) model is required to reduce financial loss and save the infrastructure of electric utilities.
Substantial research work has been performed for accurate ETD.Conventional theft detection methods include manually checking the unauthorized line diversions, comparing the tempered meters with benign ones, and human verification of faulty and problematic meters.However, these methods have low detection accuracy, are expensive, time-consuming, and lead to unreliable results.The introduction of the smart grid brings new opportunities for detecting electricity theft.It consists of a traditional grid system, communications technologies, smart meters, and sensors for collecting data and computing facilities to manage the self-healing mechanism (Jiang et al., 2016;Ahir and Chakraborty, 2022).So, smart meters and sensors collect information about voltage level, power factor, electricity price at the current moment, status of grid infrastructure, and consumption behavior of consumers.Moreover, they help maintain energy and information flow between consumers and electric utilities (Batalla-Bejerano et al., 2020;Zidi et al., 2023).These collected data help design advanced algorithms, which control demand and response (Langendahl et al., 2019;Ahmed et al., 2019) and predict the electricity prices (Chang et al., 2019;Peng et al., 2018;Qiao and Yang, 2020).Some studies (Jokar et al., 2015;Punmiya and Choe, 2019;Buzau et al., 2018;Zheng et al., 2017;Ramos et al., 2016;Huang and Xu, 2021;Gunturi and Sarkar, 2021;Li S. et al. 2019;Ibrahim et al., 2021) show that it is possible to detect malicious patterns in electricity consumption (EC) data by performing an analysis using machine learning (ML)/deep learning (DL) methods.However, most of the studies which are scrutinized for ETD have the following drawbacks: • The false-positive rate is very high, which is not bearable by electric utilities, • Most of the existing methods are based on specific hardware devices or sensors, which are very expensive, • Individual classifiers like support vector machine (SVM), decision tree (DT), and naive Bayes (NB) have low accuracy, • The class imbalance problem is tackled using under-and over-sampling methods, which create information loss and duplication of data, leading to underfitting and overfitting issues,

Serial no
Mathematical form • In some cases, class imbalance problems or uneven distribution of samples is not tackled in a proper way, which pushes the ML/DL toward the majority class, which leads to biased results, • Individual classifiers suffer in overfitting, which decreases their generalization property, and • Most of the researchers focus on bagging and boosting methods.They ignore the efficacy of stacking methods for detecting malicious patterns in EC data.
Therefore, this article aims to develop a new model that solves the aforementioned concerns.So, a DL-based meta-learner is proposed, which has two folds.At fold-0, conventional ML classifiers learn and extract diverse knowledge from EC data.These multiple classifiers learns based on the same dataset, but differently.After this, a multilayer perceptron (MLP) is employed, which is trained and tested using prediction results of fold-0 classifiers.An MLP model gives different weightage to each classifier on its prediction capability and establishes a non-linear relationship among them.This process improves its performance and helps overcome the overfitting problem of individual classifiers.Furthermore, the class imbalance problem is a severe issue in ETD.Due to this, ML/DL models tend toward the majority class and ignore the minority class samples.Consequently, these classifiers lead to false alarms and produce biased results.This study developed a hybrid approach based on Adaptive Synthetic (oversampling) and TomekLinks (undersampling) to tackle the class imbalance problem.The main contributions of this study are mentioned here.
• A deep learning-based meta-learner model is developed to detect malicious patterns in the EC data.According to the analyzed literature, this is the first study that utilizes the metalearner concept in smart grids.• Uneven distribution of class samples is a severe issue, handled through a hybrid approach.
• Extensive experiments are conducted on the real dataset to evaluate the efficacy of the proposed meta-learner model and compare its results with those of other boosting, bagging, and standalone models.
The remaining article is organized as follows.Related work and description of the proposed model are explained in Section 2 and Section 3, respectively.The experimental and results discussion is given in Section 4. The conclusive note on this study is presented in Section 5.

Related work
In this section, existing ETD methods that are already proposed in the literature are comprehensively discussed.These methods analyze the EC data and differentiate between normal and malicious patterns.Jokar et al. (2015) proposed a consumption pattern-based electricity theft detector based on the SVM classifier.However, they do not utilize any feature engineering technique to select and extract optimal attributes from EC data.Moreover, the SVM classifier has a large execution time and performs poorly on noisy datasets.Punmiya and Choe (2019) worked on optimal feature selection using stochastic and built-in modules of gradient boosting classifiers (GBCs).They proved that fewer features reduce ML models' complexity and overfitting issue while increasing their performance in terms of the detection rate.However, GBCs have a large number of hyperparameters, which are difficult to be tuned.
In Buzau et al. (2018), a solution for NTL detection is presented.The different features are derived from EC data and historical information of smart meters using z-score, K-means clustering, and local outlier factors.The derived features are fed as input into several ML algorithms like SVM, LR, K-nearest neighbor, and extreme gradient boosting (XGB).The work performed in this study to tackle the curse of the dimensionality issue is mostly based on manual feature engineering, which is a very time-consuming and expensive task.Buzau et al. (2019) proposed a hybrid DL model with a selflearning ability to extract abstract or latent features from EC data.Moreover, this is the first study that integrates both sequential and non-sequential information.However, this study does not tackle the class imbalance problem, which affects the model's performance in real-world scenarios (Qin et al., 2020).Zheng et al. (2017) developed a hybrid DL model based on a convolutional neural network and MLP.The former is utilized to extract latent or abstract patterns, while the latter is employed to retrieve global knowledge in EC data.Extensive experiments are conducted to evaluate the performance of the proposed model, and the authors argue that a hybrid model performs better than individual ones.However, a DL model requires a huge volume of data and computation power, which is the main hurdle for its realtime applications in the smart grid.Another DL-based model is proposed to detect the losses at the generation and consumption sides (Nabil et al., 2019;Takiddin et al., 2020).Li W. et al. (2019) developed a multimodel based on RNN, MLP, and LSTM layers.Moreover, its performance is measured by employing the simple moving average.The model is trained on real data collected using a VeraEdge controller.However, the accuracy is only utilized to check the model performance, which is not a good measure to evaluate DL or ML models, where data are unevenly distributed.The high dimensionality of data is another concern that badly affects the efficacy of ML or DL models.Ramos et al. (2016) utilized metaheuristic techniques to select optimal features from EC data.Huang and Xu (2021) extracted features using a stacked denoising autoencoder, while Kong et al. (2021) employed the kernel fuzzy C-means algorithm and a simple autoencoder for this purpose.
Gunturi and Sarkar (2021) proposed an energy theft detector that is based on ensemble approaches.They argue that bagging models perform better as compared to boosting ones.Moreover, near-miss and synthetic minority oversampling techniques are employed to tackle uneven distribution of data.However, undersampling methods lose valuable information from data, while oversampling techniques create an overfitting problem.The convolutional neural network (Li S. et al., 2019) is employed to extract the features, which are fed as input to the RF to differentiate between normal and malicious samples.However, their proposed model yields a low detection rate and a high false-positive rate (FPR).

Overview of the framework
The diagram of the proposed framework is shown in Figure 1.The EC dataset is taken from an electric utility and reduced from 1 min to half-hourly because a high sampling rate affects consumers' privacy and intensive computation resources are required to train ML and DL models.The acquired dataset only contains normal samples.So, malicious samples are generated by applying theft cases and concatenating them with normal samples.The description of the theft cases is given as follows.After applying theft cases, the dataset is deemed suitable for applying ML/DL classifiers.However, the dataset has an imbalanced nature, which tends the ML/DL toward the majority class and ignores the minority class samples.Consequently, their performance is affected.So, a newly developed sampling approach is applied, which generates the samples of minority classes to solve the class imbalance problem.To avoid data leakage issues, the sampling approach is only applied to training data.After performing the data preprocessing steps, it is time to choose/develop an appropriate classification strategy for the identification of malicious and normal samples."Unity is strength" is a well-known quotation that is best utilized by ensemble approaches.In the current era, stacked ensembles have received considerable attention, where they have been combined with ML frameworks to win many Kaggle and Netflix competitions (Olasehinde et al., 2020).In this article, a stacked-based meta-learner strategy is adopted, which has two-fold benefits.In the first fold, multiple wellperforming classifiers get diverse knowledge in different ways on the EC dataset, and in the second fold, an MLP is chosen as a metalearner, which takes the diverse knowledge of these classifiers as input and generates accurate and robust results.The description of each step is given as follows.

Dataset description
The PRECON1 dataset is utilized in this study to evaluate the performance of the proposed meta-learner strategy.It is collected by students of Lahore University of Management Sciences from Lahore, Pakistan's second largest city.The dataset includes EC usage reports of 42 houses for 1 year, which belonged to people from different financial and demographic backgrounds.The consumers who participated in this research agreed to share their EC patterns.So, it is a reasonable assumption that all consumers are honest (Nadeem et al., 2019).Moreover, the large number of electricity usage patterns, long period of measurements, and public availability of this dataset make it an excellent source for research in the smart grid domain.Moreover, the sampling rate is reduced from 1 min to half-hourly by simply taking the mean because a high data sampling rate requires intensive computational resources for training of ML/DL models and also affects consumer privacy.Now, a vector, X t = {x t1 , x t2 , x t3 … x t48 }, contains 48 values to represent 1 day.
Theft attacks to generate malicious samples: For supervised learning methods, a labeled dataset containing both benign and malicious samples is required.However, the PRECON dataset only contains normal samples because theft cases rarely happen in real-world scenarios.The first approach is to apply clustering algorithms to detect malicious patterns in the EC data.However, these clustering algorithms do not give optimal results because the minor changes affect the overall distribution of data.So, a good approach is to convert the data into a supervised learning problem where each record has an associated target label as a malicious or normal sample.
where [x 11 , x 12 , … x 1n ] is the record of a consumer and y i is an associated label.In the index, x mn , m indicates the record number and n is the associated value.To generate the malicious samples for conversion of data into a supervised learning problem, six theft attacks are applied to the normal records, and their descriptions are given as follows in Table 1.e1(⋅) multiplies meter readings with the same random number and reports less consumption as compared to actual usage.e2(⋅) sends zero or complete consumption to an electric utility.In the case of e3(⋅), a meter reading is multiplied with a different random number after each interval.e4(⋅) and e5(⋅) report mean consumption or random factor of mean consumption to the electric utility.e6(⋅) sends a reverse order of meter readings.e5(⋅) and e6(⋅) launch attacks against voltage control mechanisms, shift the EC order from on-peak hours to off-peak hours, and reduce the electricity bill (Jokar et al., 2015).Figure 2 shows the EC of a consumer before and after applying the theft attacks.

Data preprocessing
Data preprocessing is an important step in the analysis of big data.In the literature, garbage in and garbage out is a well-known quotation about this step, which indicates that the performance of ML algorithms completely depends upon the quality of data.So, analyzing data that are not carefully screened generates poor results and misleads the decision process.So, in this article, a data preprocessing module is developed to enhance result quantification in terms of true positive (TP), true negative (TN), false positive (FP), and false negative (FN).The data preprocessing steps start with handling outliers, progresses with data standardization, and finally, the class imbalance problem is tackled using a newly developed hybrid sampling approach.

Outlier detection
During data analysis, some erroneous values (i.e., outliers) are found in the PRECON data that mislead the training process of a model.Because of this, it takes longer training time, and ultimately generates mediocre results.So, outliers are restored using the following Eq. 1.
Here, O shows μ(x i ) + 2 ⋆ σ(x i ), while μ and σ indicate the average and standard deviation, respectively.

Data normalization
ML or DL models learn the mapping among the input features (with different scales and values) and a target variable.If a model is trained on the features with larger values, it will learn the weights of larger values and ignore the smaller ones.Consequently, it will produce unstable results with high generalization errors.So, after removing the outliers, min-max normalization is applied to handle the scaling issue, and its mathematical form is elaborated in Eq. 2.
where min (X i ) and max(X j ) are minimum and maximum values in the record of consumer i, respectively.

Class imbalance problem
Class imbalance is a major issue in ETD in which the sample of honest consumers (majority class samples) is higher than the sample of malicious consumers (minority class samples).It creates an uneven distribution of target variables, and data are skewed toward the majority class.So, ML or DL models are biased toward majority class samples and ignore minority ones.If they are trained on imbalanced data, they give biased results, which are unacceptable in the ETD domain because electric utilities have a limited budget for onsite inspection (Batista et al., 2003).Furthermore, for the efficient and unbiased performance of supervised learning models, it is necessary to extract and learn valuable patterns from balanced data.Although the considered dataset in this study has an imbalanced nature, and this case occurs in more ETD scenarios, it is necessary to balance the EC data before extracting features and patterns.Hence, to solve the declared issue, a new sampling approach is proposed that holds the characteristics of undersampling and oversampling approaches to reduce the misclassification rate.This approach is based on Adaptive Synthetic (ADASYN) and TomekLinks, is named AST, and is used for the first time in the smart grid domain to handle uneven data distribution.
In AST, ADASYN is a sampling approach that generates the minority class samples by leveraging the concept of adaptive learning, which overcomes the bias issue and focuses on those samples that are hard to learn.The generation of minority class samples starts by counting the number of majority class samples around the minority samples and then calculating the density distribution r i…m .Here, m is the total number of samples of the minority class.r decides how many new samples will be generated around each minority class sample x i .After that, a random x zi sample is chosen from K-nearest neighbors of x i , and a new sample is created by using a i = x i + (x i − x zi ) × λ and added into the data.λ is a random number whose value lies between 0 and 1.This process continues until the desired distribution of both classes is not achieved.More information about the working mechanism of ADASYN can be found here (He et al., 2008).
Although oversampling approaches solve the class imbalance problem, they still have some issues like blindness of neighbor selection, sample overlapping, and noise inference (Jiang et al., 2021;Khan et al., 2021).As in the case of sample overlapping, some newly generated data points may be invaded into majority class data due to undefined clusters.Another concern is that noise inference may introduce some noisy data points near the decision boundary and confuse the learning algorithm.To address the mentioned issues, TomekLinks is introduced, which magically removes those samples that are hard to learn.
It creates a pair of samples and measures the Euclidean distance d(x i , x j ) between them.x i belongs to the majority class, while x j is part of the minority class.The generated pair is declared as a Tomeklink if no sample x k follows the following conditions d(x i , x k ) < d(x i , x j ) or d(x j , x k ) < d(x i , x j ).So, in this way, the unwanted samples from the majority class are removed to equalize the class distribution.Further information about TomekLinks can be found here (Tomek, 1976).
To develop the AST method, ADASYN and TomekLinks methods are combined by exploiting the Python imbalanced library.This library has been developed by experts to perform sampling It has built-in sampling approaches (Lemaître et al., 2017).First, TomekLinks is applied to remove the unwanted samples near the decision boundary, which are hard to learn.After that, ADASYN is utilized to equalize the distribution of both classes.It is possible that excellent results can be achieved with undersampling and oversampling approaches.However, the former may discard valuable information that is critical for making a robust decision boundary, while the latter synthesizes or duplicates the samples that increase the generalization error (bias) on the test data.The literature work indicates that interesting results can be achieved by combining both strategies, where a modest amount of oversampling of the minority class increases biases toward the minority class while a modest amount of undersampling of the majority class reduces bias toward majority samples.In the case of the PRECON dataset, it contains 1 (minority): 8 (majority) class samples.First, TomekLinks is applied, which removes redundant samples from PRECON data by utilizing link concepts.Next, ADASYN is exploited, which produces a ratio for both classes of 1:1 by synthesizing or generating the samples of minority classes.This phenomenon can be observed in Figures 3A, B, where green and red samples represent majority and minority class samples before and after handling uneven distribution of data, respectively.

First stage (fold-0) machine learning models
Support vector machine: The SVM is a supervised learning method, which is mostly used for classification and regression tasks.It was developed in the well-known AT&T Bell Laboratories by Vladimir Vapnik.As mentioned in the literature, it is mostly used for anomaly detection in electric utilities (Jokar et al., 2015).The SVM algorithm moves the input features into high-dimensional data space and draws a decision boundary between samples of both classes with the help of support vectors.The support vectors are data points, which are taken from the training dataset.
In this study, C and radial basis function (RBF) hyperparameters are utilized to achieve good results.The C controls the misclassification rate, while the RBF converts the complex and nonlinear patterns into linear ones.However, the noise factor cannot be discarded in real-world scenarios.So, a small change in testing data suddenly decreases its performance.That is why it is very difficult to find the right kernel function.Moreover, its computation and memory complexity is increased in the case of large datasets because it draws a large number of support vectors in highdimensional data space, which are stored in memory (Jokar et al., 2015).
Gradient boosting classifiers: They are well-known in the research and data science community.They are utilized for both classification and regression tasks.These methods are based on ensemble trees, which learn specific rules to classify between normal and anomalous data points, and these are called gradient boosting methods because they use gradient loss function to reduce the wrong predictions of weak learners and convert them into strong ones.XGB and categorical boosting (CatB) are the most regularized versions of gradient boosting methods.XGB is a popular algorithm in the Kaggle community, and it won 17 competitions out of 29 in 2015.It utilizes the Gini index, similarity score, and regularization parameters to improve the predictive performance of weak learners.
On the other hand, CatB utilizes order boosting and oblivious decision trees to learn decision rules from EC data.These trees have equal splitting criteria on each node, which makes CatBoost less prone to overfitting and increases its prediction performance on testing data.Although experimental results indicate that gradient boosting classifiers give good results, they continuously minimize the loss function value, which overemphasizes outliers and has a high chance of overfitting.In most cases, they require more than 1,000 trees, which increases computational and memory complexity.Moreover, they have a large number of hyperparameters that are difficult to tune (Punmiya and Choe, 2019).
Random forest (RF) is an ensemble algorithm that first builds a forest of decision trees and then uses the wisdom of majority to decide about anomalous or non-anomalous EC patterns.RF eradicates the limitations of the decision tree algorithm and controls its overfitting problem.It also gives better prediction results than a decision tree algorithm.However, it gives lower performance than gradient-boosting algorithms and is computationally expensive because it builds a large number of decision trees to overcome the overfitting issue (Qiao and Yang, 2020).
Decision tree: It is a simple supervised machine learning algorithm that is utilized for both classification and regression problems.It draws a simple tree-like structure where decision rules are learned from root nodes to leaf nodes to differentiate between normal and malicious theft patterns.It is simple, easy to understand, and easy to implement.However, it has a high chance of overfitting because, in the case of large datasets, it creates a complex decision structure.Moreover, it has an unstable nature because a small change in data leads to a completely different tree structure.DT follows a greedy search strategy, which cannot guarantee to return an optimal solution.
Logistic regression: It is a basic model in binary classification, where input features are multiplied with a weight matrix, which is learned during the training process.It is also known as a one-layer neural network.However, it is unable to learn complex and nonlinear patterns in EC data because it considers the linear relationship between input and output variables.

Meta-learner (level-1 learning)
In previous studies, a large number of multiple classifiers, either DL or ML, were applied to smart grid data to detect the anomalies; however, none generated fully accurate results, and mistakes were noted in different aspects.In the current era, ensemble approaches, especially stacking, have won many Kaggle competitions and proven their own worth.Stacking (stacked) is an ensemble approach that is mostly adopted to reduce the generalization error by combining the predictions of multiple classifiers.It is based on a meta-learner which is responsible for merging the predictions of ML classifiers and getting skills or knowledge from EC data in a diverse way.In the first phase, multiple ML classifiers C i(1−6) are applied to EC data, which get independently diverse knowledge in a parallel fashion and provide a vector of their predictions.After that, the prediction of the three best classifiers C 1−3 is utilized to build a new dataset X = [c 1 , c 2 , c 3 ].This new dataset is fed into a meta-learner classifier for its training.Once the meta-learner is trained, its performance is analyzed on the test data.The complete working of the framework can be seen in the form of a flowchart in Figure 4.In this article, the MLP model is used as a meta-learner to stack prediction of multiple classifiers.An MLP is inspired by the working mechanism of the human brain and used in diverse applications.From input to output prediction, it learns the non-linear approximation functions between input and output layers.There are many layers among them, which are called hidden layers.These layers are stacked in a hierarchical manner, which endows characteristics to learn or extract non-linear relationships from high-dimensional EC data with little guidance.
The architecture of the MLP is shown in Figure 1, where the leftmost layer is the input layer and the rightmost one is the output layer with one neuron.The middle layers are known as hidden layers that are made up of neurons, activation functions, and dropout layers, which are used to establish mapping between input and output layers, control overfitting issues, and help generate accurate results.To classify the samples, an objective function is defined that measures the error between actual scores and predicted scores.In addition, the MLP model modifies the hyperparameters to adjust the weights and biases of hidden layers and utilizes optimizers, especially stochastic gradient descent (SGD), to reduce the objective function error.However, in this study, the Adam optimizer is adopted to tune the weights and biases of middle layers because it adopts the characteristics of both SGD and RMSprop (Khozeimeh et al., 2022;Kingma and Ba, 2014).Additionally, it leverages the concepts of moving the average to update the hyperparameters.
Furthermore, each layer consists of multiple neurons, which have activation functions, which are a key part of the MLP.In literature, multiple activation functions are studied to enhance the performance of DL models like tanh, sigmoid, and softmax; however, recent work shows that rectified linear unit (ReLU) outperforms other ones and prevents the model from falling into vanishing and exploding gradient problems.The experimental results that ReLU performs better at hidden layers, while softmax and sigmoid give good results at the output layer.More details about the MLP can be found at Xiao et al. (2018), where the authors use it for binary classification in the medical domain.The DL models are built-in TensorFlow, while ML models are deployed with the help of the Scikit-learn library.Data preprocessing and exploration steps are performed using NumPy, Pandas, Matplotlib, and Scikit-learn libraries.

Evaluation
To evaluate the performance of the proposed framework, the PRECON dataset is reduced into training data, validation data, and testing data.Training and validation parts are exploited to tune the hyperparameters, while the testing part is exploited to check the ML models' performance.In the literature, it is seen that the error obtained on the validation part is reported as the final error (Krstajic et al., 2014).This procedure leads to a poor generalization in which the ML models suffer from overfitting problems (Buzau et al., 2018;Varma and Simon, 2006).

Performance metrics
One of the most commonly used metrics for evaluating the performance of ML/DL models is accuracy.However, it does not give a real assessment of their prediction capability.A naïve classifier would achieve 90% accuracy on imbalanced data because it becomes biased toward majority class samples.The literature work shows that ROC-AUC has proved to be an excellent performance measure for imbalanced datasets.It is a ratio between the TPR and FPR, the value of which can be changed by varying decision thresholds on the curve.However, this metric is only focused on positive class samples and ignores negative ones.In the real-world scenario, both classes have equal importance, which is why the PR curve and F1-score are also considered.Moreover, electric utilities have a low budget for onsite inspections.So, the FPR is also taken into consideration because it tells how many positive samples are misclassified by the predictor.

Results
In this section, we compare the performance of the proposed meta-learner strategy with baseline models and existing state-ofthe-art models in the smart grid domain.
Comparison with the state-of-the-art models: the performance of the proposed meta-learner strategy is compared with the WD-CNN (Zheng et al., 2017) and MLP-LSTM (Buzau et al., 2019) models.Both models were recently published in top-tier journals.The WD-CNN model is developed to classify the normal and malicious samples in electricity consumption.The model contains MLP and CNN layers.The MLP layers are utilized to capture global knowledge, while CNN layers are considered to retrieve temporal patterns from data.In the MLP-LSTM model, the authors developed a hybrid algorithm, which leverages the advantages of MLP and LSTM layers.The EC data are fed into the LSTM model to capture periodic patterns, while the metadata of the smart meter is passed into MLP models to learn local features.In this article, the authors conducted extensive experiments and proved that the hybrid model gives better results than simple models.In addition, in this study, the manual procedure is considered to tune the hyperparameters of WD-CNN and MLP-LSTM models.
In case study 1, the six ML classifiers, XGB, CatB, RF, DT, LR, and SVM, are introduced.These classifiers are trained on the PRECON dataset that is provided by the Energy Informatics group in Pakistan.After this, the meta-ensemble learning strategy is proposed for integrating the predictions of base models into the MLP model.This process significantly reduces the number of features, which overcomes the overfitting issue and provides better data for accurate predictions.As shown in Figure 5A, EC data are directly fed into the MLP model, where it can be observed that the model faces the overfitting issue and gives a high loss value on the testing data as compared to the training data.Now, training data are prepared from predictions of base models and fed as input into the MLP model.The process reduces the dimensionality of data and overcomes the overfitting issue, which can be seen in Figure 5B.In the first case, data have an imbalanced nature, which is why models give biased results and are attracted toward the majority class and ignore the minority class.So, the ROC curve is employed to evaluate the true effectiveness of all implemented models.The meta-ensemble strategy achieves a 0.919 ROC-AUC value that is more than its counterparts, which is shown in Figure 6A.Although it is a good performance indicator, it focuses only on positive class samples and ignores the negative ones.So, the PR curve is taken into consideration because it gives equal weightage to both samples.The meta-learner strategy obtains 0.987 PR-AUC values.It is more than its base models or counterparts, which can be viewed in Figure 6B.Other important performance metrics are also considered, such as accuracy, precision, recall, F1-score, and FPR.The models attain different values on base and meta-ensemble strategy, which indicates that each base model has different skills on the trained data.So, this is an ideal situation, where stacking strategies achieve optimal results.In addition, the performance of the proposed meta-learner strategy is compared with that of existing state-of-the-art models: MLP-LSTM and WDCNN.In Table 2, it can be seen that the proposed meta-learner strategy gives better results than both models.
In case study 2, the class imbalance problem is solved using AST because ML/DL models give biased results on imbalanced datasets, where they tend toward majority class samples and ignore the minority ones.Balanced data with all features and prediction of base models (fold-0 learners) are fed into the MLP.As shown in Figure 7A, it can be observed that MLP suffers from the overfitting problem with all 48 features.Figure 7B proves that MLP gives good results with the ensemble strategy and does not suffer from overfitting problems, which indicates that ensemble strategies are a good option for real-world applications.
As shown in Figure 8A, the performance of base models and meta-learner ensemble strategy on balanced data is evaluated through PR curves, which indicate that the meta-learner performs better as compared to base models.Moreover, the DT, LR, and SVM achieve 0.507, 0.481, and 0.486 ROC-AUC on balanced data, respectively, which are low as compared to values that they attain on imbalanced data.DT is a fundamental classifier.However, its performance decreased as the number of samples increased.It also suffered from overfitting problems.LR is a one-layer neural network, which is unable to learn optimal patterns from complex data like EC data, while the SVM does not give good performance on noisy datasets.Due to the aforementioned reasons, these three base models do not give good performance even after solving the class imbalance problem.Furthermore, boosting and bagging models perform well on both balanced and imbalanced data.RF attains 0.856, while XGB and CatB achieve 0.896 and 0.901 ROC-AUC values on balanced data, respectively.RF is a bagging strategy classifier, which gives good results on large datasets.However, it is computationally expensive because it requires high resources to train multiple DTs.XGB and CatB are boosting classifiers whose performance will be improved with an increasing number of learners.The deep learning-based meta-learner strategy achieves a value of 0.910 ROC-AUC, which is more than the ROC-AUC values of all base models.The PR curves of all classifiers are shown in Figure 8B, which indicates that the meta-learner ensemble strategy achieves a 0.988 PR-AUC value, which is more than that of all base models.As is indicated in the literature, the FPR is an important performance measure for electric utilities because its high value directly relates to onsite inspection costs.The proposed strategy achieves a 0.014 FPR value, which is lower than that of all base models.So, this strategy has good efficacy for real-world applications in the smart grid domain because it reduces onsite inspection costs for electric utilities by generating fewer false alarms.Finally, the performance of the meta-learner strategy is compared with that of two deep learning models, which were recently published in top-tier journals, and several other studies used them as baseline models.Both baseline models are trained on the balanced dataset.
In Table 3, it can be seen that the proposed meta-learner strategy gives better results than both models and also achieves a lower FPR value.However, the performance of deep learning models largely depends on the selection of hyperparameter ranges.Due to the limited availability of computational resources, the hyperparameters are manually tuned, which may be another valid reason for the lower performance of WD-CNN and MLP-LSTM models.Table 3 shows the f1-score, accuracy, precision, and recall values of all baseline models and the proposed ensemble strategy on balanced data.Scalability: here, we measure the running time of each ML and DL model for the selected dataset, which is mentioned in Table 3.The running time is measured for a single train-test split where it can be seen that, as usual, the running time of the DL models is more than that of the ML models.However, the running time of the proposed meta-learner strategy is less than that of baselines and other compared algorithms because the proposed strategy takes the

Discussion
Based on the aforementioned results, it is concluded that the proposed DL-based meta-learner approach yields satisfactory results and outperforms all individual models, including bagging and boosting approaches, for the detection of NTLs.Electricity theft is a primary reason for NTLs.The accurate prediction of these losses will not only reduce electricity theft cases but also provide direct benefits to electric utilities and consumers.
In this article, the comparison is conducted between the metalearner ensemble approach that is developed in this study and the traditional six ML classifiers.These traditional classifiers are advanced ones and are mostly used for ETD in the literature.In Punmiya and Choe (2019); Jokar et al. (2015), the authors employ CatB, XGB, LightBoost, and SVM to differentiate between benign and normal samples.All these classifiers give different results on the same dataset.So, we believe that the same type of situation can be seen in other datasets.This is the main motivation to develop an ensemble model that integrates the advantages of all classifiers, ignores their shortcomings, and gives accurate and unbiased classification results.That is why the meta-learner ensemble approach is proposed, which gives a superior performance as compared to standalone classifiers.In addition, results in the literature indicate that single classifiers give an unstable performance on diverse datasets due to different data distributions, sample sizes, and redundant features (Xiao et al., 2018).By going deeper into the output of conventional ML classifiers, the proposed ensemble strategy gives more weightage to classifiers with high accuracy and ignores those with lower ones.So, in this way, the meta-learner strategy fully utilizes the advantages of standalone models and gives better results with a high accuracy rate.
Moreover, the performance of the proposed meta-learner ensemble strategy is compared with that of the majority voting classifier, which has applications in diverse fields and has achieved tremendous results.In Li S. et al. (2019), the authors utilize an advanced version of the majority voting classifier to differentiate between normal and theft samples.Furthermore, it is observed that the DL-based meta-learner approach gives superior values of performance indicators than the majority voting algorithm, which is also proved in Xiao et al. (2018).The results can be formulated as the majority voting algorithm gives equal weightage to each classifier and establishes the linear relationship among them.In addition, in the proposed strategy, MLP is utilized on fold-1, which gives different weightage to each participating classifier on its prediction ability and automatically learns hidden intricate structures, including non-linear relationships.Furthermore, MLP is a DL model, which has complex non-linear functions like hyperbolic tangent, sigmoid, and rectified linear unit.These functions have the ability to learn non-linear relationships among predictions of fold-0 classifiers and class labels.
Furthermore, false alarms are a limiting factor that reduces the efficacy of an ETD system.Once a false alarm is generated, an on-site inspection is performed for its final verification.However, electric utilities have a limited budget for this purpose.So, in realworld applications, a classifier which has a low FPR is preferred (Jokar et al., 2015).The proposed meta-learner strategy achieves a low FPR on both balanced and imbalanced datasets as compared to bagging, boosting, and individual classifiers that are shown in Tables 2, 3.

Conclusion
In this article, the DL-based meta-learner model is proposed for the detection of non-technical losses.According to the reviewed literature, this is the first model in the smart grid domain that utilizes the diverse knowledge of multiple classifiers and automatically learns hidden intricate features, including non-linear relationships, among them.It controls the overfitting issue, gives high accuracy, and surpasses state-of-the-art methods, including bagging and boosting classifiers.Moreover, the proposed model is evaluated on a real EC dataset, which is provided by PRECON, the largest energy informatics group in Pakistan.Furthermore, class imbalance is a severe issue in ETD, which affects ML/DL models' performance.So, a hybrid sampling approach is developed to handle the class imbalance problem.The meta-learner model achieves 0.914, 0.987, 0.999, 0.923, 0.929, 0.960, and 0.139 ROC-AUC, PR-AUC, recall, precision, accuracy, F1-score, and FPR values on imbalanced data, while it attains 0.910, 0.988, 0.9994, 0.985, 0.9768, 0.978, and 0.014 ROC-AUC, PR-AUC, recall, precision, accuracy, F1-score, and FPR values on balanced data, respectively.In both cases, the proposed model outperforms other baseline models, which indicates that the meta-learner is a good option for the classification of malicious and normal samples.In the future, we will evaluate the performance of the proposed model in other domains.
FIGURE 1 Proposed model.
FIGURE 2 (A, B) EC before and after applying theft attacks.

FIGURE 4
FIGURE 4Flowchart of the proposed framework.
environmentDue to the large size of the PRECON dataset, all preprocessing and model-building steps are performed in Google Colaboratory and Kaggle platforms by taking the power of distributed computing.