ORIGINAL RESEARCH article

Front. Energy Res., 28 March 2023

Sec. Sustainable Energy Systems

Volume 11 - 2023 | https://doi.org/10.3389/fenrg.2023.1136914

Prediction of return on equity of the energy industry based on equity characteristics

  • School of Management, Heilongjiang University of Science and Technology, Harbin, China

Abstract

We take the return on equity of energy enterprises as the research object to predict it. Our research adopts a new framework to solve multivariable time series problems. Compared to a single regression model, this model focuses more on the results of the regression equation rather than the coefficients of each indicator. Compared to the single machine learning regression method, this model can use the two-way encoder representation of the Transformers model to embed text data into the data, and then use the XGBoost model for regression model processing after PCA dimensionality reduction processing, thereby improving the accuracy of model prediction. Comparative experiments have verified that the method we use has advantages in terms of prediction accuracy.

1 Introduction

The energy industry, which belongs to the basic industry of the national economy, is an important field of enterprise reform. At this stage, the equity operation matters faced by energy enterprises are gradually increasing, which puts forward higher requirements for the ability and level of equity management of such enterprises. In the actual operation process of energy enterprises, equity factors are at the forefront of several factors affecting the return on equity (ROE).

Energy enterprises are very interested in predicting the performance of the next year based on the previous annual performance and taking it as the basis for the prediction and decision making. As an important indicator of enterprises, it is very important for energy enterprises to be able to predict and correct. As an indicator to measure the operating efficiency of energy enterprises, the ROE reflects not only a simple number but also the problems reflected behind it through the superficial meaning of the number. Therefore, this article proposes a prediction model to achieve the aforementioned situation.

Judging whether a company’s operation is good or bad depends on its return on net assets, not on the growth of earnings per share. At the moment, there are several aspects in the study of enterprise performance. With the growth of enterprises and the complexity of ownership structure, there is a gradual study of performance from the aspect of equity.

When we talk about time series modeling, we usually refer to ARIMA, VAR, LSTM, and other models. The aforementioned models can usually achieve good results in dealing with single-variable time series problems, but they often perform poorly in the face of multivariable modeling. From the data analysis, we can see that this task is not a typical time series problem, but a multivariable time series problem. In order to make eXtreme Gradient Boosting (XGBoost) () available for time series prediction, the time series dataset should be first transformed into a supervised problem so that the time series dataset can also be used for supervised learning.

In the process of model construction, this model uses the XGBoost model as the main support and combines the main body with Bidirectional Encoder Representations of Transformers (BERT) to solve the following three problems: 1) Realizing accurate prediction of the future economic data trend of the enterprise. Of course, this part of prediction is based on the sustainable operation of energy enterprises, excluding the performance fluctuations of energy enterprises caused by force majeure. 2) Solving the problem of text variables being used in the model. To improve the prediction accuracy of this model, we select as many variables as possible, including text variables. 3) This model is not only applicable to ROE prediction of energy enterprises but also universal and can be reasonably used with other types of enterprises.

2 Literature review

The relationship between equity and performance has always been the focus of research. At present, in terms of equity research, there is a close relationship between equity and performance. ) examined the internationalization strategy of hybrid state-owned enterprises (SOEs). examined the relationship between corporate governance and dividend policy of Nepalese enterprises. The contribution of ) was to examine the relationship between firm strategy and sustainability of financial performance of Nepalese Enterprises. ) aimed to analyze SOEs in 11 post-socialist Central–Eastern European (CEE) countries. ) analyzed the structure of enterprises and organizations by forms of ownership in Russia. This study was conducted to determine empirical evidence of the influence of company characteristics and ownership structure on tax avoidance in SOEs listed on the Indonesia Stock Exchange (BEI) in 2013–2016 (). Taking the mixed ownership reform of Chinese SOEs as the research object, analyzed the impact of management ownership on the performance of mixed ownership enterprises. The objective of the work of was to identify the relationship between different variables affecting profitability of the firms in the oil and gas sector in Vietnam. The objective of the work of ) was to provide new evidence on the relationship between family involvement and financial performance in small- and medium-sized enterprises (SMEs). Other influential studies include the study of ).

As far as the ROE of enterprises is concerned, the research shows diversification. The linear regression method is often used in the study of quantitative relationship. Using performance analysis, examined the cooperation between SOEs and their suppliers. ) used multiple regression analysis which is limited to the use of data taken from the selected financial statement. ) examined the SMEs in the Czech Republic from the perspective what makes them to adopt telework using the financial indicators. The aim of the work of ) was to investigate the effect of inventory turnover on firm profitability. The subject of the work of was to determine the financial effect proxy through current ratio, debt equity ratio, ROE, return on investment, and net profit margin against stock price of the company and the like mentioned in BEI. Decomposition of ROE after return on assets (ROA), return on sales (ROS), total assets turnover (TAT), and equity multiplier (EM) provides an analytical framework appropriate for observing factors that make and influence profitability (). The subject of the work of ) was to empirically investigate the relationship between capital structure and firm performance using a sample of Vietnam material enterprises. The research by ) considered the influence of the capital structure on the efficiency of communal enterprises of passenger land transport, and also, they () identified employment in social enterprises in terms of its quantity and quality. Taking the mixed ownership reform of Chinese SOEs as the research object, analyzed the impact of management ownership on the performance of mixed ownership enterprises.

In recent years, the XGBoost model has been widely used in various fields and has performed well. ) aimed to explore the influence of road and environmental factors on the severity of freeway traffic crash and establish a prediction model toward freeway traffic crash severity. XGBoost, AdaBoost, and Bagging were the employed soft computing techniques (). The research by ) used four different ensemble machine learning (EML) algorithms: random forest, XGBoost, categorical boosting, and light gradient boosting machine, for predicting EVs’ charging time. ) developed a stacked generalization (stacking)-based incipient fault diagnosis scheme for the traction system of high-speed trains. In order to improve the problem of inaccurate results in non-contact heart rate detection due to a series of movements of the subject such as breathing, blinking, facial expressions, and noise generated by changes in ambient light, the signal is processed in advance using normalization and wavelet denoising, and then, an XGBoost algorithm based on a Gaussian process (GP)-based Bayesian optimization method is introduced (). ) presented a novel hybrid ensemble framework consisting of multiple fine-tuned convolutional neural network (CNN) architectures as supervised feature extractors and XGBoost trees as a top-level classifier, for patch-wise classification of high-resolution breast histopathology images. Other influential works include those of ), ), ), and ).

In the study of performance, this article mainly uses the ROE as a measurement index. In terms of model, the regression model is usually used for analysis. Regression analysis is usually used to study the relationship between equity and performance. The advantages of the regression model are obvious: it can show the significant relationship between independent variables and different dependent variables and show the influence intensity of multiple independent variables on a dependent variable. Regression analysis also allows comparing and measuring the interaction between variables of different scales, which is convenient for constructing the regression model.

This article will try to use XGBoost to model time series. XGBoost is an effective implementation of gradient lifting for classification and regression problems. It is fast and efficient. It can get good performance in a wide range of prediction modeling tasks. It can be used to solve time series problems. In order to make XGBoost available for time series prediction, the time series dataset needs to be first converted into a supervised learning problem. Here, the problem is transformed into a supervised problem by using the sliding time window so that the time series dataset can also be applicable to supervised learning. The specific idea of constructing the dataset is as follows: using the data of the first n years of the enterprise as the feature to predict the ROE in the N + 1 year. In this way, we can transform the time series problem into a traditional regression problem. The text data in the dataset are embedded with the BERT model, processed with PCA dimensionality reduction, and then, processed with the XGBoost regression model.

3 Model building

3.1 Data processing

The purpose of missing value supplement processing is to retain the existing data as much as possible in the case of insufficient data samples. However, when making up the data, it should be noted that it is conditional for the existing data to be insufficient for the missing value. Equity ratio, separation rate of two rights, net profit growth rate, and other indicators are continuous in time, and various indicators will not have major changes under the normal operation of the enterprise, so they can be supplemented. Therefore, the K-means clustering method is used to fill in the adjacent missing data. Because the selected data have strong consistency in time arrangement, the missing values can be processed to a certain extent. The missing value processing method selected in this article is the k-nearest distance method.

The specific method is as follows: first, the Euclidean distance method or correlation analysis method is used to collect and calculate the K sample values closest to the missing points, and then, the weighted average of K numbers is used to estimate the missing data. The same mean interpolation is a single-value interpolation. The difference is that the hierarchical clustering model is used to predict the missing variables, and then, the average value is used for interpolation.

By setting the training sample , there are numbers in the dataset and in and , and represent the th and th numbers. Each dimension contains defined aswhen p = 2, and it represents the Euclidean distance.

In fact, the selection of value has a great impact on making up the missing items. Generally, the value of is small. The classification decision rule in the nearest neighbor method is often majority voting, so the misclassification rate should be minimized.

By setting the classification function as: : . Then, the misclassification probability is

Then, the misclassification rate I iswhere, A is the set of k-nearest neighbor training instance points and is the area category. Then, the missing data are supplemented by constructing a KD tree and searching the KD tree, and the supplemented dataset is recorded.

The maximum age length of the data selected in this article is 18; that is, the time span of the annual limit is 18 years. All empty data should be filled after statistics to ensure the maximum rational use of data. In this data sample, it is classified according to the securities code and filled with the data of the same company’s adjacent years according to the aforementioned method. For information similar to equity, if the company does not publish it, it will be treated as unchanged.

3.2 Word embedding

When dealing with text non-data information, word embedding is usually needed. Word embedding is a numerical representation of text information. Generally, the text will be mapped to a high-dimensional vector (word vector) to represent the text. Generally speaking, the text will be transformed into data form. The mapped label encoding uses a dictionary to associate each category label with an increasing integer, that is, to generate a label named class the index of the instance array of.

In all data, industry types have great text characteristics. In this article, the BERT model is used to obtain the feature expression of the corresponding text.

With a highly pragmatic method and higher performance, BERT has attracted much attention because of its most advanced performance in many natural language processing (NLP) tasks. BERT has advantages over models such as Word2Vec because although each word has a fixed representation under Word2vec, regardless of the context in which the word appears, the word representation generated by BERT is dynamically represented by the surrounding words.

The BERT model can be expressed as follows:

The language model of Markov hypothesis (n-gram language model) is introduced. If n = 1, there are

The conditional probability is obtained by maximum likelihood estimation.

We set the objective function aswhere represents the corpus and represents the context of word W. Let each text generate three vectors Q, K, and V, and then, the initialization formula is

The S function means Soft-max.

In , , and , X is the formal length of text data and is the weight matrix. Recalculate score is given by

We normalize

By increasing the attention of text recognition through matrix multiplication,

Since the Common Language Specification (CLS) of BERT is not very effective as the expression of phrase, and the next sentence task is canceled in the later unofficial version, the mean pooling strategy is adopted as the expression of phrase.

  • First, the output vector of CLS position is directly used to represent the vector representation of the whole sentence

  • Second, the mean strategy calculates the average value of each token output vector to represent the sentence vector

  • Third, the max strategy takes the maximum value of each dimension of all output vectors to represent the sentence vector

The data are quoted here to highlight the advantages of using the BERT model. This time, the pre-training model open source by the Tencent UER-py team is used to code short sentences to obtain embedded information.

The following is the performance of the model in several different tasks (Table 1). It should be noted that RoBERTa here is an optimized version of the BERT model, which is basically a replica version of BERT, so it is usually classified as BERT. Because BERT’s CLS is not very good as a phrase and the later unofficial versions cancel the task of listing the next sentence, the mean pooling strategy is used as a phrase.

TABLE 1

ModelScoreDoubanChnSentiCorpLCQMCTNEWS (CLUE)iFLYTEK (CLUE)OCNLI (CLUE)
RoBERTa-Tiny72.38391.481.8625560.3
RoBERTa-Mini75.784.893.786.163.958.367.4
RoBERTa-Small76.886.593.486.565.159.469.7
RoBERTa-Medium77.887.694.888.165.659.571.2
RoBERTa-Basel79.589.195.289.26760.975.5

Scores of the BERT model.

Pooling strategy: SBERT adds a pooling operation to the output of BERT/RoBERTa to generate a fixed-size sentence embedding vector. Three pooling strategies were adopted in the experiment for comparison.

  • 1. Directly using the output vector of CLS position to represent the vector of the whole sentence.

  • 2. MEAN strategy: The average value of each token output vector is calculated to represent the sentence vector.

  • 3. MAX strategy: The maximum value of each dimension of all output vectors is taken to represent the sentence vector. Specific values are given in Table 2.

TABLE 2

Pooling strategyNLISTSb
MIN80.7887.44
MAX79.0769.92
CLS79.886.62

Pooling strategy score.

Then, principal component analysis (PCA) was used for analysis. PCA is a common data analysis method, which is often used to reduce the dimension of high-dimensional data, and can be used to extract the main feature components of data. For industry types, the pre-trained BERT model is used to embed the text, and then, PCA is used to reduce the dimension.

3.3 PCA dimensionality reduction

Dimensionality reduction of high-latitude data can avoid the high complexity of the model caused by excessive data dimensions. Especially for some cases of insufficient sample data, the final trained model will have poor generalization. Removing the collinearity between data attributes can optimize the model, reduce the complexity of the model, reduce the training time of the model, and improve the robustness and generalization of the model. For PCA, this process is essentially a lossy feature compression process, but it is expected to lose as little accuracy as possible and retain the most original information in the compression process.

By setting text dataset ω, the feature after dimensionality reduction is A. , and the larger the value, the better. is the mean value of characteristic A. By setting the data sample as , the covariance matrix is ; by setting Γ for raw data, the data after PCA meets the requirements Γ= P Γc and c corresponding matrix:

Because this task contains many influencing factors and the sample data are time series data, this task does not belong to a typical time series problem but should be classified as a multivariable time series problem. In this article, the existing problems will be transformed into supervised problems by using sliding time window so that the time series dataset can be suitable for supervised learning. The text data in the data are embedded with the BERT model, processed with PCA dimensionality reduction, and then, processed with the XGBoost regression model.

In the process of dimensionality reduction, the data after dimensionality reduction do not represent the advantages and disadvantages of the data after dimensionality reduction but reflect the distance of relevant industries. The reason is that this step extracts the more important components or key parts of the original data and maps them to another space. Currently, the performance of the data is not directly related to the original data, so there are positive and negative situations. The original data are compressed and transformed in the original space and mapped to a new space. In essence, it is the original data, but the form of expression is also different.

At this time, the dataset after defect filling, word embedding, and dimension reduction is recorded as X; then, .

3.4 Data scaling

To prevent the model accuracy from being affected by abnormal data in the data, this article scales the data and limits the ROE to - 1–1.where Q is the rate of ROE , is a new feature of ROE, and and is the minimum and maximum value before being scaled by the feature, respectively. is the original eigenvalue. Minmaxscaler can convert each element (feature) into a given range of values. The estimator scales and transforms each feature separately so that it is within a given range of the training set such as in the interval [0, 1].

The conversion method is

This transformation is often used as an alternative to zero mean, unit variance scaling. The task is not a typical time series problem, but a multivariable time series problem. By using sliding time window representation, time series datasets can be suitable for supervised learning. The idea of building a dataset is as follows:

First, the “securities code” column is used as the basis for grouping.

Second, the grouped data are sorted using “deadline.”

Third, the data of the previous n years are used to predict the “ROE in the N + 1 year.”

3.5 Feature selection

By selecting K Best, all features except the K features with the highest score are removed, and the function returns the variable score and p value. The calculation formula of p value is as follows:

In inferential statistics, hypothesis testing is a very important link. Through the statistical analysis of SAS and SPSS, it is found that p value (p value, probability, PR) is another important basis for testing.

value is used to reflect the possibility of an event in the actual situation. The value is obtained through the significance test. Generally, is the statistical difference, is the significant statistical difference, and is the extremely significant statistical difference. It means that the probability of the difference between the samples caused by sampling error is less than 0.05, 0.01, and 0.001. However, the value cannot assign any importance to the data and can only indicate the probability of an event. The statistical results show , which can also be written as .

3.6 ROE model

The ROE model is essentially the XGBoost model. XGBoost is a tree-based model. It can stack as many trees as possible, and each additional tree tries to reduce the errors of the previous tree set. The general idea is to combine many simple and weak predictors to build a powerful predictor.

XGBoost is an addition formula composed of multiple decision trees, which is expressed as follows:where the estimated is the predicted value, is the eigenvector, is the value calculated for each tree, and K is the total number of trees. is the tth lifting tree, n is the number of decision trees, and the initial value .

Therefore, for each tree, the XGBoost model is essentially an additive model. We observe to know how to calculate the tree score to determine which function to use. The loss function represents the loss value of each tree, and the loss function is expressed as

The objective function of a given XGBoost iswhere is the regular term used to express the complexity of the model and is the penalty coefficient in the penalty term. The loss function is proved to be convex and differentiable, and the latter term is the regular term of system complexity, that is, the penalty coefficient. The significance of this coefficient is to prevent overfitting of the model. In the regular term of the latter term, it actually includes the sum of the square of the number of leaf nodes of each decision tree and its node score, which is used to judge the quality of decision making and optimize the calculation.

Since the objective function is difficult to solve at this time, the Taylor formula can be used for an approximate solution.where is the first partial derivative of , is the second-order partial derivative of , and the expression is

After removing the constant term in the objective function, the simplified approximate function is obtained.

The previous formula is to calculate the loss for each sample and then calculate the sum of all sample losses. The sample accumulation operation is converted into leaf nodes, and the sample set on each leaf node j is defined as .where is the optimal solution of leaf weight. The optimal solution is brought into the objective function to obtain the optimal objective function.

Since the loss function is convex, the partial derivative of the objective function can be obtained to find the minimum value point in the interval, which represents the case of minimum loss. After calculation, the expression of min(L) of the minimum loss value is

The formula is used to measure the quality of decision tree. Based on the given loss function, after pruning the decision tree, we observe whether the value of the loss function decreases to judge whether pruning should be performed.

The forward step-by-step algorithm is used for model optimization, and the objective function is

According to the XGBoost document, the equation is as follows: .

attributes the feature to a specific leaf of the current tree . is the current feature of the tree which is the score of t and the current feature of x.

In determining where each decision tree forks, the exact greedy algorithm is adopted.

Here, means gain and represents split income .

Since the exact greedy algorithm affects the calculation efficiency when the sample size is large, the approximate algorithm can be used. Global and local methods can be used to determine the split point.

Global indicates that the candidate splits are calculated before the spanning tree. This method does only one operation in the whole calculation process. The candidate segmentation points that have been calculated in advance are used in the subsequent node division; local calculates the candidate segmentation points only when each node is divided. Experiments show that if the two methods want to achieve the accuracy of approaching the exact greedy algorithm, we need to take more candidate segmentation points to improve the accuracy. Because local needs to be calculated every time the node is divided, in some cases, the amount of calculation is very close. The use of the two methods is also different, which is suitable for taking a large number of segmentation points; local is more suitable for deep tree structures.

In order to avoid the unreasonable definition of candidate cut points by simple statistical indicators, the weighted quantile sketch is introduced.

Dataset represents the set of the kth eigenvalue and the second derivative of each sample.

Ranking function indicates the proportion of the kth eigenvalue less than z in the dataset.

Then, candidate points are selected according to the following formula:

In terms of sparse values, the model can automatically learn the default division direction for the missing data. In each segmentation, the missing value is segmented to the left node and the right node. By calculating the score value and comparing which of the two segmentation methods is better, an optimal default segmentation direction will be learned for the missing value of each feature.

In a word, once the model is trained, the prediction simply boils down to identifying the correct leaves of each tree according to the characteristics and summarizing the value of each leaf, which becomes the most difficult part of the problem.

XGBoost is an optimal allocation gradient lifting program for efficiency, flexibility, and convenience. Based on gradient boosting, a machine learning method based on gradient boosting is completed. XGBoost provides a parallel tree structure (also known as GBDT, GBM) that can quickly and accurately deal with a large number of data science problems. XGBoost is an implementation of gradient lifting integration method for classification and regression problems. In this task, the sliding time window representation is used to make the time series dataset suitable for supervised learning.

3.7 Effect evaluation

The goodness of fit test of this model adopts the external data verification method; that is, the simulation test is carried out with the data not participating in the training in the sample data to verify the accuracy and stability of the model prediction. The data used to evaluate the link account for about 50% the total sample data.

After the XGBoost model is constructed, the model is scored with the data of the training set. Here, we need to pay attention to the scoring standard R2. XGBregressor is used for the score. Since it is a regression model, the scoring standard of the regression model should be selected.

Under this scoring standard, when the model is better, R2 → 1 and when the model is worse, R2 → 0.

4 Experimental process

4.1 Data source

The data in this article are from the Guotai’an database. A total of 12,072 corresponding indicators of 1,084 Chinese energy enterprises from 2003 to 2021 are selected for analysis. The selected variables are shown in Table 3.

TABLE 3

Variable classificationVariable name
Ownership structureState-owned equity ratio
Shareholding ratio of top ten shareholders
Separation rate of two rights
Proportion of actual controller with control right of listed company
Total remuneration of the top three management
Shareholding ratio of management
ProfitabilityProportion of minority shareholders’ equity
Return on equity
Operating gross profit margin
Return on assets (ROAs)
Total asset net profit margin
OperationTotal asset turnover
Growth rate of administrative expenses
Financial situationAsset liability ratio
Long-term debt ratio
Business performanceProportion of profits from financial activities
Retained earnings ratio
Growth rate of main revenue
SolvencyLong-term capital liability ratio
Ratio of long-term loans to total assets
Development capacityCapital accumulation rate
Growth rate of administrative expenses
Net profit growth rate
Ratio structureCash asset ratio
Ratio of working capital to current assets
Proportion of net profit to comprehensive income
Other variablesIndustry type
Total assets

Variable classification table.

According to the industry type, energy companies are divided into traditional energy, new energy, and energy reserve according to the energy type. The specific classification is shown in the table as follows.

The data are preliminarily screened, and descriptive statistics are carried out. The software is SPSS. The descriptive statistics of the data is given in Table 4.

TABLE 4

ClassificationIndustry
Traditional energyCoal chemical industry
Coal concept
Fuel ethanol
Natural gas
Shale gas
Energy conservation and environmental protection
Energy saving lighting
New energyGreen power
Wind power generation
Photovoltaic concept
Nuclear power
Alkylene oxide
Power exchange concept
Energy reservesFuel cell
Lithium battery
Sodium ion battery
Graphite electrode
Graphene
Pumped storage
Energy savings

Classification of energy enterprises.

It can be seen from Table 5 that the minimum value of the core explanatory variable enterprise performance ROE is −28.2046, maximum value is 204.6869, mean value is 0.04, and standard deviation is 2.11727, indicating that there are individual outliers, but the overall volatility is small and the data are relatively concentrated. The difference between the maximum and minimum values of the equity ratio of SOEs, long-term debt ratio, total asset turnover rate, retained earnings ratio, and the equity ratio of the top ten shareholders is less than 10, and the standard deviation is less than 1, indicating that its numerical distribution is uniform and concentrated, with little fluctuation. For the normal operation of the model data, this article first removes the enterprises with state-owned equity ratio of 0 before regression and removes the extreme values of other control scalars.

TABLE 5

NMinMaxMean valueVariance
StatisticsStatistical valueStatistical valueStatistical valueStandard error valueStatistical value
Two-weight separation rate (%)11571−4.5442.93114.9054350.07243160.704951
Return on equity11965−28.204643204.6895940.0541390.0195834.588664
Total assets1206902.73E+121.87E+109.00E+089.77E+21
Turnover rate of total assets120580.0000018.6010210.6118930.0041580.20844
Asset liability ratio120680.001725178.3454730.5114750.0157632.998402
Long-term debt ratio1206600.9624610.1957230.0017530.037075
Growth rate of main business income11585−28.589163107.4321821.1119960.2999911,042.588701
Retained earnings ratio10786−78.27272410.6853350.0098881.054563
Return on assets A12068−29.28803911.0061620.0419770.0040770.200606
Operating gross margin12061−4.0300420.9911650.235680.0013280.021281
Ratio of long-term borrowings to total assets1206800.7296970.0618880.0009070.009933
Long-term capital liability ratio12068−63.470481107.4168450.1698990.0110321.468617
Capital accumulation rate11977−140.02259556.2569590.2340040.0156042.916331
Net profit growth rate9897−4541.7276845174.356093.1807944.649705213970.6914
Growth rate of administrative expenses11278−3.85294117.7887490.2170240.0142452.288687
Return on assets A12068−29.28803911.0061620.0419770.0040770.200606
Return on total assets (ROAs) A12068−30.95869710.4009230.0231350.0041580.208691
Return on equity A11965−28.204643204.6895940.0541680.0195834.58866
Operating gross margin12061−4.0300420.9911650.235680.0013280.021281
Proportion of control right of the listed company owned by the actual controller11609099.00540.3855720.150803264.006052
Separation rate of two rights of the actual controller11604−49.424960.32314.8982070.07358362.828658
Total remuneration of the top three management1199306.61E+071.97E+062.01E+044.83E+12
Management shareholding ratio11644010010.879220.173118348.967633
Cash asset ratio12051−0.05982671.5453130.1504070.0060280.43792
Working capital to current assets ratio12067−491.8393010.991409−0.0486360.06855656.714714
Proportion of minority shareholders’ equity12068−48.6894842.2541160.0658770.0041870.211597
Proportion of profits from financial activities12069−168.68566169.0661390.259770.03240812.67588
Proportion of net profit and comprehensive income9734−186.28428130.2813120.9539850.031289.523818
Number of effective cases (listed)6842

Descriptive statistics.

4.2 Missing value supplement

The purpose of making up the missing values is to retain as much sample data as possible for the training of datasets. Once the missing values are not handled well, the analysis results may be unreliable and fail to achieve the purpose of analysis. However, not all data can be supplemented, and the supplemented data are not the real value of the data sample, but based on other existing fields, the missing field is predicted as the target variable, so as to obtain the most possible complement value.

For example, in Figure 1, in the missing value statistics, there are 1,283 missing data values of the variable “retained earnings rate,” and the missing values of the variable are in the middle of the 18-year time span, which has a strong regularity. This depends on the nature of the variable itself. In terms of profit distribution, many enterprises distribute profits due to established strategies, in the development period or quota. Therefore, the index data will show the same value for many years, with strong regularity and in line with the continuity of time. Therefore, the k-value complement method can be used to supplement the data.

FIGURE 1

Thus, the sample data can be retained as much as possible to realize the full application of the sample data, so as to form a complete data record for subsequent data processing and analysis.

4.3 Word embedding

Word embedding for text types: The main idea of word embedding is to transform the text into a vector representation of lower-dimensional space. There are two important requirements for this transformed vector: lower-dimensional space and minimizing the sparsity of the encoded word vector. This article uses the BERT model for word embedding. The BERT model used in this article is shown in Figure 2.

FIGURE 2

4.4 PCA dimensionality reduction

The data obtained after word embedding are high-latitude data, so PCA is used for analysis. PCA is a common data analysis method, which is often used to reduce the dimension of high-dimensional data, and can be used to extract the main feature components of data. For industry types, the pre trained BERT model is used to embed the text, and then, PCA is used to reduce the dimension.

PCA is essentially a lossy feature compression process, but it is expected to lose as little accuracy as possible and retain the most original information in the compression process. In this model, the variable “industry type” belongs to text variable, which cannot be quantified intuitively. Therefore, we build the BERT model to embed text words into the classified variable “industry type,” so as to obtain a large number of high-dimensional data. PCA is used to reduce the dimension of the obtained high-dimensional data. Treatment results are given in Table 6. The corresponding scatter diagram is shown in Figure 3.

TABLE 6

ClassificationNumberNamePCA dimensionality reduction
Traditional energy1Coal chemical industry−0.6997
2Coal concept−0.7345
3Fuel ethanol2.5851
4Natural gas−0.0543
5Shale gas−2.5140
6Energy conservation and environmental protection1.5466
7Energy saving lighting−2.9614
New energy8Green power−0.9272
9Wind power generation−0.1071
10PV concept−1.2262
11Nuclear power−0.5171
12Alkylene oxide1.4380
13Power exchange concept−1.1772
Energy reserves14Fuel cell2.6068
15Lithium battery−0.3903
16Sodium ion battery1.4123
17Graphite electrode−1.1713
18Graphene−1.2262
19Pumped storage−1.5915
20Energy savings3.5715

PCA data dimensionality reduction result.

FIGURE 3

The data in the aforementioned chart represent the distribution of industries. The sign does not represent the advantages and disadvantages of the data, but reflects the distance of relevant industries. The reason is that after extracting the more important components or key parts of the original data in this step, they are mapped to another space. At this time, the performance of the data is not directly related to the original data, so there are positive and negative situations. The original data are compressed and transformed in the original space and mapped to a new space. In essence, they are the original data, but the form of expression is different.

4.5 Data zoom

The multi-index comprehensive evaluation method is scientific and reasonable to evaluate things. It combines multiple indexes describing different aspects of a thing to get a comprehensive index and evaluates and compares the thing through it. Due to different properties, different evaluation indexes usually have different dimensions and orders of magnitude. When the indexes differ greatly, if the original index value is directly used to calculate the comprehensive index, the role of the index with larger value in the analysis will be highlighted and the role of the index with smaller value in the analysis will be weakened.

Data scaling, in statistics, means that the original data are converted according to a certain proportion through a certain mathematical transformation method, and the data are placed in a small specific interval. The purpose is to eliminate the differences of characteristic attributes such as characteristics and quantity between different samples and convert them into a dimensionless reactive value. The characteristic quantity values of each sample are in the same order of magnitude.

As shown in Figure 4, after excluding the extreme value, this article scales the ROE to . This step can not only further remove the extreme value and realize the prediction accuracy of the model but also eliminate the difference of dimension and order of magnitude between the evaluation indexes and ensure the reliability of the results.

FIGURE 4

4.6 Eigenvalue selection

Eigenvalue selection is very important. Some scholars believe that in most machine learning tasks, features determine the upper limit of model effect, and the selection and combination of models are only infinitely close to this upper limit.

Feature selection can reduce the number of features, prevent dimension disaster, and reduce training time. The generalization ability of the model is enhanced, and overfitting is reduced; the understanding of features and eigenvalues is enhanced. Specific data are given in Table 7. In this article, the eigenvalues are selected according to the p value, and the results are as shown in Figure 5.

TABLE 7

FieldScorep value
Separation rate of two rights1.32767.97E-06
Long-term debt ratio1.29054.86E-05
Growth rate of main revenue7.74991.80E-136
Retained earnings ratio7.73801.08E-08
Industry type1.45071.08E-08
Growth rate of administrative expenses1.36462.67E-04
Separation rate of two rights of actual controller1.25332.67E-04
Total remuneration of the top three management1.79825.14E-18
Shareholding ratio of management1.77881.81E-17

Statistical table of p value score.

FIGURE 5

The smaller the score in the aforementioned table, the better it can reflect the correlation between independent variables and dependent variables. Among the top ten of the aforementioned eigenvalues, the impact of equity specific factors is the first, which means that equity factors play an important role in the prediction of ROE.

It can be seen from the aforementioned table that the score of the two rights separation rate is the lowest, so the cash capital ratio has the lowest impact on the ROE, and the two rights separation rate has the greatest impact on the ROE.

It can be seen from the aforementioned pie chart that the p value of the separation rate of the two rights is the highest, and the equity factors stand out among several other factors, which has the greatest impact on the prediction rate of the ROE.

4.7 Construction of the equity net asset return model

The ROE equity model is essentially the XGBoost model. XGBoost is a tree-based model. It can stack as many trees as possible, and each additional tree tries to reduce the errors of the previous tree set. The general idea is to combine many simple and weak predictors to build a powerful predictor.

After construction, the results of the equity net asset income model are as shown in Figure 6.

FIGURE 6

The model is the final result of the collection of multiple decision trees. The red line is the branch reduction direction of the decision tree, and the blue line is the decision direction of the decision tree.

4.8 Model evaluation

The goodness-of-fit test of this model adopts the external data verification method; that is, the simulation test is carried out with the data not participating in the training in the sample data to verify the accuracy and stability of the model prediction. The data used to evaluate the link account for about 50 percent of the total sample data.

After the XGBoost model is constructed, the model is scored with the data of the training set. Here, we need to pay attention to the scoring standard R2. XGBregressor is used for score. Since it is a regression model, the scoring standard of the regression model should be selected. Under this scoring standard, when the model is better, R2 → 1 and when the model is worse, R2 → 0.

Finally, under this scoring standard, through the evaluation of the model, the final score of the model is

The score shows that the independent variable of this model has a good explanation for the dependent variable, so the ROE predicted 435 by this model is reliable.

4.9 Model comparison

After the previous experiments and conclusions, this section will now conduct comparative experiments and compare R2 scores of different machine learning model methods.

The datasets that have been processed are brought into the decision tree regression model, the BP neural network model, and the random forest regression model, and the R2 scores of each model operation are obtained as shown in Table 8, with the scores of 0.921, −0.0071, and 0.861. This value is less than 0.946 of the model R2 in our model. Therefore, among these models, the model built in this article has the highest accuracy.

TABLE 8

ModelR2 score
XGBoost0.9460
Decision tree regression model0.9210
BP neural network model−0.0071
Stochastic forest model0.8610

Model score comparison.

5 Conclusion

Using the XGBoost model, this article constructs the net asset income model of energy enterprises based on the characteristics of ownership structure through missing value supplement, word embedding, PCA dimensionality reduction, and model evaluation. After evaluation, the data accuracy of the N + 1 year predicted by the model based on the data of the previous n years is about 95%, and the prediction effect is good.

Under the background of carbon neutralization and carbon peak, the operation status of energy enterprises is gradually remarkable. With the rapid development of energy enterprises, the ownership structure is becoming more and more complex. Taking energy enterprises as the research object and based on the XGBoost model, this article constructs the ROE equity model to realize the accurate, scientific, and timely prediction of the ROE and can reasonably modify and allocate the equity structure of energy enterprises through the predicted ROE, so as to play the role of prediction and early warning for the next business cycle of enterprises. The timeliness of the prediction of the ROE of this model can reasonably avoid the lag of the ROE and play a positive role in the equity planning and future development of energy enterprises.

At present, the relationship between equity and ROE generally tends to be inter-industry research, which is not subdivided into different types in a specific industry. In addition, the existing research is more inclined to analyze the existing results, that is, to analyze the connection in the existing data at a certain time node, which has a certain lag effect, which is caused by the lag of the ROE itself.

The aforementioned research ideas are not scientific in predicting the future business performance level, and the test cycle of the conclusions is long. To sum up, this article abstracts the problem into a multivariable time series problem by using the thought and method of deep learning, analyzes it under various indicators, scientifically and effectively predicts the operation effect of the next cycle in this business cycle, and modifies the equity proportion and rights by predicting the ROE, so as to achieve a more ideal and reasonable level.

ROE is the most important financial indicator in the business process of an enterprise, and it is also the benchmark to judge whether the business is healthy or not. Many enterprises hope to achieve the performance indicators, so the research on ROE has always been the favorite of different scientists. Based on XGBoost, this article uses missing value supplement, PCA dimensionality reduction, and the BERT model to process text information, uses the R2 score method to evaluate the model, takes energy enterprises in multiple industries as samples to process and predict multivariate time series data, and adds comparative experiments. The experiment shows that the prediction accuracy of this model is higher than that of other machine learning models, and the dataset can be directly applied to this model after adding text variables, which increases the range of variables compared with other machine learning models, and can better predict the ROE of energy enterprises. In addition, this model is not only limited to predicting the ROE of energy enterprises but also applicable to other types of enterprises. Compared with the traditional regression model, this model has certain advantages, mainly focusing on two points: first, the focus is different. Traditional regression models pay more attention to the coefficient relationship between various indicators and ROE, that is, the change trend; the model in this article is more direct to get the predicted value, and the data are more secure because they come from a lot of learning. Second, the traditional regression model is not good at dealing with text information. This model takes this factor into account and is not bound by reasonable text variables. Third, the model proposed in this article has more advantages for processing large amounts of data and seeing the general changes of the whole industry.

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Materials, further inquiries can be directed to the corresponding author.

Author contributions

YY: data analysis, writing, and formal analysis. ZW: validation and methodology.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1

    ArviyantiA.MuizE. (2020). Pengaruh karakeristik perusahaan dan struktur kepemilikan terhadap penghindaran pajak/tax avoidance pada perusahaan bumn yang terdaftar pada bei tahun 2013-2016. J. Akunt.7 (1), 2846. 10.37932/ja.v7i1.22

  • 2

    BhattaraiD. (2018). Generic strategies and sustainability of financial performance of Nepalese enterprises. PRAVAHA24 (1), 3949. 10.3126/pravaha.v24i1.20224

  • 3

    BіelіenkovaO. (2020). Factor analysis of profitability (losses) construction enterprises in 1999-2019.

  • 4

    ChazovaI. Y.MukhinaI. A., Effectiveness of administration of economic entities in state and municipal ownership, In Proceedings Of The International Science And Technology Conference "Fareastсon" (Iscfec 2019), 2019.

  • 5

    ChenT.GuestrinC. (2016). XGBoost: A scalable tree boosting system[J]. New York, NY: ACM.

  • 6

    FarooqU. (2019). Impact of inventory turnover on the profitability of non-financial sector firms in Pakistan. J. Of Finance And Account. Res.01, 3451. 10.32350/jfar.0101.03

  • 7

    GaoL.SongS. (2021). Determining the problems of management shareholding and the mixed ownership, modern perspectives in economics. Bus. And Manag.7.

  • 8

    GaoT.YangX.RenZ.ZhaoJ. (2022). Research on non-contact heart rate detection method based on GP-XGBoost. OTHER Conf.

  • 9

    Irfan SauqiM.EndahT. W.HeniA. (2019). Analisis kinerja keuangan terhadap harga saham pada industri loga yang terdaftar di bei. EQUITY22, 3746. 10.34209/equ.v22i1.899

  • 10

    JiH. P.KimC. Y. (2020). Social enterprises, job creation, and social open innovation. J. Open Innovation Technol. Mark. Complex.6 (4), 120. 10.3390/joitmc6040120

  • 11

    MaoZ.XiaM.JiangB.XuD.ShiP. (2022). Incipient Fault diagnosis for high-speed train traction systems via stacked generalization. Ieee Trans. Cybern.52, 76247633. 10.1109/tcyb.2020.3034929

  • 12

    MatuszakP.SzarzecK. (2019). The scale and financial performance of state-owned enterprises in the CEE region. ACTA OECONOMICA69, 549570. 10.1556/032.2019.69.4.4

  • 13

    MenT. B.HieuM. N. (2021). Determinants affecting profitability of firms: A study of oil and gas industry in Vietnam. J. Of Asian Finance, Econ. And Bus.

  • 14

    NarB. B.NiteshR. B.OmS.PoojaG.PoshanL.PratikshaP.et al (2018). Impact of corporate governance on dividend policy of Nepalese enterprises. Bus. Gov. And Soc., 377397. 10.1007/978-3-319-94613-9_21

  • 15

    NguyenN. H.Abellán-GarcíaJ.LeeS.García-CastañoE.VoT. (2022). Efficient estimating compressive strength of ultra-high performance concrete using XGBoost model. J. Of Build. Eng.52, 104302. 10.1016/j.jobe.2022.104302

  • 16

    OtekunrinA. O.NwanjiT. I.JohnsonOlowookereK.EgbideB-C.FakileS. A.LawalA. I.et al (2018). Adebanjo joseph falaye, damilola felix eluyela, financial ratio analysis and market Price of share of selected quoted agriculture and agro-allied firms in Nigeria after adoption of international financial reporting standard. J. Of Soc. Sci. Res.

  • 17

    PetrukO.TrusovaN.PolchanovA.DovgaliukV. (2020). The influence of the capital structure on the efficiency of communal enterprises of passenger transport. Mod. Econ.24 (1), 132137. 10.31521/modecon.V24(2020)-21

  • 18

    RoffiaP. (2021). Family involvement and financial performance in SMEs: Evidence from Italy. Int. J. Of Entrepreneursh. And Small Bus.43, 39. 10.1504/ijesb.2021.115313

  • 19

    SanyalR.KarD.SarkarR. (2022). Carcinoma type classification from high-resolution breast microscopy images using A hybrid ensemble of deep convolutional features and gradient boosting trees classifiers. Ieee/Acm Trans. Comput. Biol.

  • 20

    ShenZ.AhmedDeifallaF.KamińskiP.DyczkoA. (2022). Compressive strength evaluation of ultra-high-strength concrete by machine learning. MATERIALS15 (10), 3523. 10.3390/ma15103523

  • 21

    SoY. K.ShinH-H.YuS. (2018). Do state-owned enterprises cooperate with suppliers? Performance analysis in the Korean case. Emerg. Mark. Finance And Trade15.

  • 22

    SrinivasP.KataryaR. (2022). HyOPTXg: OPTUNA hyper-parameter optimization framework for predicting cardiovascular disease using XGBoost. Biomed. SIGNAL Process. CONTROL.

  • 23

    Tho DoT. (2020). The relationship between capital structure and firm performance: The case of Vietnam material enterprises. Res. J. Of Finance And Account.

  • 24

    UllahI.LiuK.YamamotoT.ZahidM.JamalA. (2022). Prediction of electric vehicle charging duration time using ensemble machine learning algorithm and shapley additive explanations. Int. J. ENERGY Res.46, 1521115230. 10.1002/er.8219

  • 25

    VlčkováM.FrantíkováZ.VrchotaJ. (2019). Relationship between the financial indicators and the implementation of telework, DANUBE: Law. Econ. And Soc. Issues Rev.

  • 26

    WangY.ZhaoD. (2021). “Research on the evaluation index system of trust, innovation and M&A value,” in Proceeding of the Journal Of Physics: Conference Series.

  • 27

    YangY.WangK.ZhenZ. Y.LiuD. (2022). Predicting freeway traffic crash severity using XGBoost-bayesian network model with consideration of features interaction. J. Of Adv. Transp.2022, 116. 10.1155/2022/4257865

  • 28

    ZhangC.HuD.YangT. (2022). Anomaly detection and diagnosis for wind turbines using long short-term memory-based stacked denoising autoencoders and XGBoost. Reliab. Eng. Syst. Saf.

  • 29

    ZhouN. (2018). Hybrid state-owned enterprises and internationalization: Evidence from emerging market multinationals. Manag. Int. Rev.58, 605631. 10.1007/s11575-018-0357-z

  • 30

    ZhouX.WenH.LiZ.ZhangH.ZhangW. (2022). An interpretable model for the susceptibility of rainfall-induced shallow landslides based on SHAP and XGBoost. Geocarto Int.37, 1341913450. 10.1080/10106049.2022.2076928

Summary

Keywords

energy industry, equity, return on equity, XGBoost, predict

Citation

Yang Y and Wang Z (2023) Prediction of return on equity of the energy industry based on equity characteristics. Front. Energy Res. 11:1136914. doi: 10.3389/fenrg.2023.1136914

Received

03 January 2023

Accepted

15 March 2023

Published

28 March 2023

Volume

11 - 2023

Edited by

Zbigniew M. Leonowicz, Wrocław University of Technology, Poland

Reviewed by

Najabat Ali, Hamdard University, Pakistan

Mengshi Li, South China University of Technology, China

Updates

Copyright

*Correspondence: Zhenqing Wang,

This article was submitted to Sustainable Energy Systems, a section of the journal Frontiers in Energy Research

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics