Despite advances in smart grids over the last decades, energy consumption forecasting that utilizes meteorological features remains challenging. This paper proposes a genetic algorithm-based adaptive error curve learning ensemble (GA-ECLE) model. The proposed technique copes with the stochastic variations in energy consumption using a machine learning-based ensembled approach. A modified ensemble model that uses the model’s error as a feature is employed to improve forecast accuracy. This approach combines three models, namely CatBoost (CB), Gradient Boost (GB), and Multilayer Perceptron (MLP). The ensembled CB-GB-MLP model’s inner mechanism consists of generating meta-data from the Gradient Boosting and CatBoost models and computing the final predictions using the Multilayer Perceptron network. A genetic algorithm is used to obtain the optimal features for the model. To prove the proposed model’s effectiveness, we have used a four-phase technique on Jeju Island’s real energy consumption data. In the first phase, we obtained results by applying the CB-GB-MLP model. In the second phase, we utilized the GA-ensembled model with optimal features. The third phase compares the energy forecasting results with the proposed ECL-based model. In the fourth and final phase, we applied the GA-ECLE model and obtained a mean absolute error of 3.05 and a root mean square error of 5.05. Extensive experimental results are provided, demonstrating the superiority of the proposed GA-ECLE model over traditional ensemble models.

Predicting energy consumption remains a difficult and mathematically demanding task for energy grid operators. Current prediction methods are typically based on a statistical analysis of the load temperature observed in various channels, generating a warning if a critical threshold is reached. However, the latest advances in computer science have shown that machine learning can be successfully applied in many scientific research fields, especially those that manipulate large data sets [

We have proposed a modified ensemble model that improves the forecast accuracy by utilizing the model’s error as a feature. This approach combines three models, namely CatBoost (CB), Gradient Boost (GB), and Multilayer Perceptron (MLP). The ensembled CB-GB-MLP model’s inner mechanism consists of generating meta-data from the Gradient Boosting and CatBoost models and computing the final predictions using the Multilayer Perceptron network. A genetic algorithm is used to obtain the optimal features for the model. To prove the proposed model’s effectiveness, we have used a four-phase technique on actual energy consumption data from Jeju province, South Korea. Jeju Island is located off the southernmost side of the Korean peninsula. The solar altitude remains high throughout the year, and in summer the island enters the zone of influence of tropical air masses. It is situated in the Northwest Pacific Ocean, at the Pacific’s widest edge, far from the Asian continent, and is affected by the humid ocean [

combine three machine learning models, namely CatBoost, Gradient Boost, and Multilayer Perceptron,

utilizing a genetic algorithm for the feature selection,

using the model’s error as a feature to improve the forecast accuracy.

The remainder of the article is arranged as follows. Section 2 introduces preliminaries about the machine learning techniques used in this publication. Section 3 presents the proposed four-stage methodology. Section 4 introduces the data collection, data analysis, pre-processing, and training process. Section 5 presents the performance results of the proposed model evaluated on the Jeju energy consumption data and compares them with existing models. Lastly, the final section concludes this article.

Artificial neural networks and machine learning provide better results than traditional statistical prediction models in various fields. Li [

The Gradient Boosting model is a robust machine learning algorithm that has been applied to various kinds of data, for example in computer vision [

In the study, Touzani et al. [

CatBoost is an open-source machine learning library based on the gradient boosting algorithm. It can successfully handle categorical features and uses optimization during training rather than costly parameter-adjustment time [

The multi-layer perceptron (MLP) consists of simple units connected as neurons or nodes. Each node computes its output by applying a simple nonlinear activation function to the weighted sum of the signals arriving at that node [

The MLP algorithm is trained to compute and learn the weights of the synapses between the neurons of each layer [

where σ is the sigmoid activation function, y^{k} is the output of the MLP, and the sum runs over the weighted inputs w_{i}x_{i} of each node.
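As a concrete illustration of this output computation, a minimal one-hidden-layer forward pass with sigmoid activations can be written as follows; the weights here are arbitrary random values, not a trained model.

```python
# Minimal MLP forward pass: y = sigmoid(W2 · sigmoid(W1 x + b1) + b2).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
x = rng.normal(size=3)                         # a single input vector
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # input -> hidden weights/bias
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # hidden -> output weights/bias

h = sigmoid(W1 @ x + b1)   # hidden-layer activations
y = sigmoid(W2 @ h + b2)   # MLP output, squashed into (0, 1)
print(float(y[0]))
```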

The genetic algorithm (GA) was inspired by Darwin’s theory of evolution, in which the survival of the fittest organisms and genes is simulated. GA is a population-based algorithm: each solution corresponds to a chromosome, and each parameter represents a gene [
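A toy sketch of GA-based feature selection along these lines is given below. It is illustrative only: the fitness function (least-squares fit quality with a per-feature penalty), population size, and mutation rate are arbitrary choices, not the authors’ settings.

```python
# Toy genetic algorithm for feature selection.
# Chromosome = binary mask over candidate features.
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
# Only columns 2 and 5 actually drive the synthetic target.
y = X[:, 2] - 2.0 * X[:, 5] + rng.normal(scale=0.1, size=n_samples)

def fitness(mask):
    """Higher is better: least-squares fit quality on the selected
    columns, with a small penalty per selected feature."""
    if not mask.any():
        return -np.inf
    Xs = X[:, mask.astype(bool)]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return -np.mean(resid ** 2) - 0.01 * mask.sum()

pop = rng.integers(0, 2, size=(20, n_features))   # initial population
for _ in range(30):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]  # truncation selection
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, n_features)         # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.05      # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.vstack([parents, np.array(children)])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(best)  # binary mask over the 10 candidate features
```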

Ensemble refers to the integration of prediction models to perform a single prediction. Ensemble models can combine different weak predictive models to create more robust predictive models [

In the article by Park et al. [

We have designed a four-phase strategy to better understand the impact of the error curve learning technique.

A genetic algorithm is used to obtain the optimal features. The CB-GB-MLP model is then used for forecasting with the same data division scheme as in phase 1. By applying the GA-ensembled model, we obtained better performance results.

The third phase compares the energy forecasting results with the proposed ECL-based model. In this stage, we have used the testing dataset of phase 2 as the complete dataset and split it into 70% for training and 30% for testing.

The fourth and final stage applies the genetic algorithm-based error curve learning ensemble (GA-ECLE) model. For this phase, we have again used the testing dataset of phase 2 as the complete dataset and split it into 70% for training and 30% for testing. However, this time we have used the error data obtained in phase 2 as an input feature along with the weather, time, and holiday features. By utilizing the GA-ECLE model, we obtained comparatively good results.

Time series data is used to evaluate the performance of the proposed model.

There are four different weather stations on Jeju Island, named Jeju-Si, Gosan, Sungsan, and Seogwipo. The set of weather features from each weather station consists of the temperature, dew point temperature, humidity, wind speed, wind direction, atmospheric pressure, discomfort index, sensible temperature, and solar irradiation recorded at each time step.

The date features consist of the day, month, year, quarter, and day of the week. Holidays also greatly impact energy consumption; we have therefore collected the holidays and built a set of holiday features.

It contains the holiday code, special day code, and special day name. The holiday code distinguishes holidays such as solar holidays, lunar holidays, election days, holidays interspersed with workdays, alternative holidays, changes in demand, and special days. Special days consist of several holidays, including New Year’s Day, Korean Armed Forces Day, Korean New Year’s Day, Christmas, Workers’ Day, Children’s Day, Constitution Day, and Liberation Day.

We have proposed to use the error of the model as a feature, that is, the difference between the actual and predicted load at each time step, together with its absolute value.
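The construction of these error features can be illustrated as follows; the load values are hypothetical, not real Jeju data.

```python
# Error-curve features: the previous phase's forecast error e_t and its
# absolute value |e_t| become extra inputs in the next phase.
import numpy as np

actual = np.array([120.0, 135.0, 128.0, 150.0])     # hypothetical loads (MW)
predicted = np.array([118.0, 138.0, 126.0, 149.0])  # phase-2 predictions

error = actual - predicted   # e_t, the signed forecast error
abs_error = np.abs(error)    # |e_t|

# Appended to the remaining features (weather/date/holiday placeholders here).
other_features = np.zeros((4, 3))
X_next_phase = np.column_stack([other_features, error, abs_error])
print(X_next_phase.shape)  # (4, 5)
```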

The target feature is the total hourly load consumption.

The proposed hybrid model consists of three sub-models: CatBoost, Gradient Boost, and Multilayer Perceptron. The output of the proposed hybrid model is the final prediction computed by the Multilayer Perceptron from the outputs of the two boosting models.

This section explains the complete data acquisition, data analysis, feature engineering, and proposed model training.

Pre-processing this data involves different functions such as data cleaning, one-hot encoding, imputation, and feature construction. We have also assigned different weights to the weather features according to the impact of each weather station. This pre-processed data is used as input for the genetic algorithm, which helps obtain the optimal features according to their importance for the prediction. The initial number of features was 64, and it was reduced to 32 after applying the genetic algorithm. We provided the error and absolute error as features along with the other holiday, meteorological, and date features. These features served as input to the ECLE model. This ensembled model consists of three models, namely CatBoost (CB), Gradient Boost (GB), and Multilayer Perceptron (MLP). The ensembled CB-GB-MLP model generates meta-data from the Gradient Boosting and CatBoost models and computes the final predictions using the Multilayer Perceptron. We have used different evaluation metrics such as the root mean square error (RMSE) and mean absolute percentage error (MAPE) to evaluate our proposed model.
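A minimal sketch of the imputation and one-hot-encoding steps is shown below, using pandas. The column names (TA, SPCL_NM) mirror the data description, but the values are made up, and mean imputation is one simple choice among several.

```python
# Imputation plus one-hot encoding on a tiny hypothetical frame.
import pandas as pd

df = pd.DataFrame({
    "TA": [21.5, None, 23.1, 22.0],   # temperature with a missing value
    "SPCL_NM": ["none", "christmas", "none", "liberation_day"],
})

df["TA"] = df["TA"].fillna(df["TA"].mean())   # mean imputation
df = pd.get_dummies(df, columns=["SPCL_NM"])  # textual -> numeric indicators
print(df.shape)  # (4, 4): TA plus three SPCL_NM indicator columns
```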

The pseudo-code for the genetic algorithm-based error curve learning model is expressed stepwise in Algorithm 1. It initializes by importing the actual data files and libraries such as NumPy, pandas, and matplotlib. The data is then pre-processed using imputation, textual data is converted into numeric data using one-hot encoding, and weights are assigned to the weather features.

This section provides the exploratory data analysis to better understand the patterns in the data. The data provided by the Jeju energy corporation consists of different features. These features include the energy consumption of different energy sources, weather information from four weather stations, and holiday information. The total dataset consists of hourly-based energy consumption from 2012 to mid-2019.

Each line shows a different year from 2012 to 2019. The X-axis represents the month, and the Y-axis represents the average energy consumption in MW. Average energy consumption is low in November and high during August and January.

Sr# | Variable | Name | Unit |
---|---|---|---|
1 | DFK_CD | Code of Day of the Week | Sun (0)–Sat (6) |
2 | HOLDY_CD | Holiday Code | 0 = weekday, 1 = holiday |
3 | SPCL_CD | Special Day Code | – |
4 | SPCL_NM | Special Day Name | – |
5 | TOTAL_LOAD | Total Load | MW |
6 | TA | Temperature | °C |
7 | TD | Dew Point Temperature | °C |
8 | HM | Humidity | % |
9 | WS | Wind Speed | m/s |
10 | WD_DEG | Wind Direction Degree | deg |
11 | PA | Atmospheric Pressure on the Ground | hPa |
12 | DI | Discomfort Index | – |
13 | ST | Sensible Temperature | °C |
14 | SI | Solar Irradiation Quantity | MJ/m^{2} |

Weekdays and holidays are assigned binary values, where a weekday is represented as 0 and a holiday as 1. The total load represents the hourly energy consumption in MW. Temperature, dew point temperature, and sensible temperature are measured in degrees Celsius. Other features used are humidity in percent, wind speed in m/s, wind direction in degrees, atmospheric pressure on the ground in hPa, and solar irradiation quantity in MJ/m^{2}.

The total dataset consists of hourly-based energy consumption from 2012 to 2019 and contains 64,999 rows.

Phase | Total samples | Training samples | Testing samples | Training duration | Testing duration |
---|---|---|---|---|---|
Phase 1 | 64999 | 45476 | 19523 | 2012-01-01 to 2017-03-09 | 2017-03-10 to 2019-05-31 |
Phase 2 | 64999 | 45476 | 19523 | 2012-01-01 to 2017-03-09 | 2017-03-10 to 2019-05-31 |
Phase 3 | 19543 | 13680 | 5863 | 2017-03-10 to 2018-09-29 | 2018-09-30 to 2019-05-31 |
Phase 4 | 19543 | 13680 | 5863 | 2017-03-10 to 2018-09-29 | 2018-09-30 to 2019-05-31 |
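The chronological 70/30 split used in phases 3 and 4 can be reproduced from the phase 3/4 sample counts; since the data is a time series, earlier samples train and later samples test, with no shuffling.

```python
# Chronological 70/30 split for phases 3 and 4.
import numpy as np

n_total = 19543               # phase 3/4 sample count from the split table
n_train = int(n_total * 0.7)  # first 70% of the time-ordered samples
idx = np.arange(n_total)      # indices in temporal order (no shuffling)
train_idx, test_idx = idx[:n_train], idx[n_train:]
print(len(train_idx), len(test_idx))  # 13680 5863
```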

The graphical representation of testing and training data split for phase 3 and phase 4 is shown in

The accuracy of machine learning techniques must be verified before implementation in real-world scenarios [

The difference between prediction and actual values can be expressed as the mean absolute error (MAE), calculated as MAE = (1/n) Σ_{t=1}^{n} |y_{t} − y_{p,t}|, where y_{t} is the actual value and y_{p,t} is the predicted value.

The root mean squared logarithmic error (RMSLE) is obtained by RMSLE = sqrt((1/n) Σ_{t=1}^{n} (log(y_{p,t} + 1) − log(y_{t} + 1))²).
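These metrics can be computed in a few lines of NumPy; the actual and predicted values below are a toy example, not the paper’s results.

```python
# MAE, RMSE, and RMSLE on a small toy example.
import numpy as np

y_true = np.array([100.0, 120.0, 110.0])  # hypothetical actual loads
y_pred = np.array([98.0, 123.0, 110.0])   # hypothetical predictions

mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean square error
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
print(round(float(mae), 4), round(float(rmse), 4))  # 1.6667 2.0817
```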

Phase | Max % | Min % | Mean % |
---|---|---|---|
Phase 1 | 25.4532 | 0.00062 | 5.7576 |
Phase 2 (With GA) | 23.8014 | 0.00027 | 4.9397 |
Phase 3 (Without ECL) | 21.3264 | 0.00130 | 4.0231 |
Phase 4 (With ECL) | 10.8014 | 0.00027 | 1.2297 |

Sr | Name | MAE | MSE | RMSE | RMSLE |
---|---|---|---|---|---|
1 | GradientBoost | 7.86 | 295.95 | 13.004 | 0.22 |
2 | CatBoost | 13.75 | 517.92 | 22.75 | 0.038 |
3 | Multilayer Perceptron | 40.50 | 1525.27 | 67.02 | 0.115 |
4 | LSTM | 37.33 | 1405 | 63.59 | 0.108 |
5 | Xgboost | 19.94 | 750.98 | 32.99 | 0.056 |
6 | Support Vector Regressor | 30.88 | 1162.73 | 51.09 | 0.086 |
7 | Proposed | 3.05 | 115.09 | 5.05 | 0.008 |

The prediction result of testing data is also illustrated using graphs. In Phase 4, data from the 30th of September 2018 to the 31st of May 2019 is used as testing data.

Green lines show the actual load values, orange lines show the prediction results after applying ECL, and cyan lines represent the predictions of the same model without ECL.

To better visualize the results, we have selected two different time frames. One is 48 h, and the other is one week or 168 h.

This research presents a novel genetic algorithm-based adaptive error curve learning ensemble model. The proposed technique copes with stochastic variations to improve energy consumption predictions using a machine learning-based ensembled approach. The modified ensemble model uses the model’s error as a feature to improve prediction accuracy. This approach combines three models: CatBoost, Gradient Boost, and Multilayer Perceptron. A genetic algorithm is used to obtain the optimal features for the model. To prove the proposed model’s effectiveness, we used a four-phase technique that utilized Jeju Island’s actual energy consumption data. In the first phase, the CB-GB-MLP model was applied and the results were obtained. In the second phase, we used the GA-ensembled model with optimal features. The third phase compares the energy prediction results with the proposed ECL model. In the fourth and final phase, the GA-ECLE model was applied. Extensive experimental results have been presented to show that the proposed GA-ECLE model is superior to existing machine learning models such as GradientBoost, CatBoost, Multilayer Perceptron, LSTM, Xgboost, and Support Vector Regressor. The results of our approach are very promising: we obtained a mean error of 1.22%, compared to 5.75% without the proposed approach. We have presented the results graphically along with a statistical comparison. The empirical results are largely favorable for the applicability of the proposed model in industry. In the future, other features can be added, such as the impact of the population using electricity and the number of electric vehicles.