Mehdi Jamei1, Nadjem Bailek2,*, Kada Bouchouicha3, Muhammed A. Hassan4, Ahmed Elbeltagi5, Alban Kuriqi6, Nadhir Al-Ansar7, Javier Almorox8, El-Sayed M. El-kenawy9,10
1 Engineering Faculty, Shohadaye Hoveizeh Campus of Technology, Shahid Chamran University of Ahvaz, Dashte Azadegan, Iran
2 Energies and Materials Research Laboratory, Department of Matter Sciences, Faculty of Sciences and Technology, University of Tamanghasset, Tamanghasset, Algeria
3 Unité de Recherche en Energies Renouvelables en Milieu Saharien (URERMS), Centre de Développement des Energies Renouvelables (CDER), 01000, Adrar, Algeria
4 Mechanical Power Engineering Department, Faculty of Engineering, Cairo University, Giza, 12613, Giza, Egypt
5 Agricultural Engineering Department, Faculty of Agriculture, Mansoura University, Mansoura, 35516, Egypt
6 CERIS, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
7 Department of Civil, Environmental and Natural Resources Engineering, Lulea University of Technology, 97187, Lulea, Sweden
8 Universidad Politécnica de Madrid (UPM), Avd. Puerta de Hierro, 28040, Madrid, Spain
9 Department of Communications and Electronics, Delta Higher Institute of Engineering and Technology, Mansoura, 35111, Egypt
10 Faculty of Artificial Intelligence, Delta University for Science and Technology, Mansoura, 35712, Egypt
* Corresponding Author: Nadjem Bailek. Email:
Computers, Materials & Continua 2023, 74(1), 1625-1640. https://doi.org/10.32604/cmc.2023.031406
Received 17 April 2022; Accepted 12 June 2022; Issue published 22 September 2022
Although solar energy currently accounts for only a tiny fraction of the world’s energy supply, it is the most abundant and feasible renewable energy source. Indeed, solar energy, radiant light, and heat from the sun have been harnessed by humans using a range of ever-evolving technologies since ancient times. Thus, solar energy utilization is a promising prospect for solving the energy crisis, fighting climate change, and improving overall quality of life. The most accurate way to measure the global solar radiation at any site is by installing dedicated instruments, such as pyranometers and pyrheliometers, and conducting continuous monitoring at different time resolutions [2,3].
Nevertheless, despite the vast range of solar energy applications, solar radiation sensors are not installed at all meteorological stations, and they tend to experience frequent technical issues. Hence, direct solar radiation measurements are unavailable in many countries, especially developing ones [4,5]. In such circumstances, the standard practice for assessing solar radiation is to use physical, empirical, or data-driven techniques, which have been established based on other meteorological parameters [6–8]. Most empirical models are based on correlations between meteorological variables and daily global solar radiation (GSR). For decades, different empirical models, such as temperature- and sunshine-based ones, have been used to estimate GSR [9,10]. In contrast to the empirical equations, data-driven techniques do not make any prior assumptions about the correlations between input and output parameters. Instead, those correlations are learned from the data during the training process [11–16].
Therefore, many researchers have widely used data-driven techniques for predicting solar radiation over the last decade. Vakili et al. developed an Artificial Neural Network (ANN) model to predict the daily GSR and concluded that the proposed ANN model provides relatively higher accuracy and reliability than models tested by other researchers. Yeom et al. applied four data-driven models, namely ANN, random forest (RF), support vector regression (SVR), and deep neural network (DNN), to assess the spatial distribution of solar radiation on Earth considering different meteorological data sources. They found that the data-driven models accurately simulate the observed cloud patterns spatially.
In contrast, the physical model failed because of cloud mask errors. They concluded that the deeper-structured approaches (RF and DNN) could best simulate the challenging spatial pattern of thin clouds when using multispectral satellite data. Kosovic et al. also applied Machine Learning (ML) to estimate solar radiation and concluded that data-driven models provide higher accuracy when using different time resolutions and meteorological inputs. Al-Rousan et al. applied several data-driven techniques to predict solar radiation over Jordan. They concluded that RF algorithms perform better than other algorithms in predicting global solar radiation, and their results revealed that the accuracy of the predictions depends on the used category, training algorithm, and variable combinations. Taki et al. utilized ANN, SVR, Adaptive Network-Based Fuzzy Inference System (ANFIS), Radial Basis Function (RBF), and Multiple Linear Regression (MLR) models for estimating GSR at different time scales. They concluded that the RBF-based model has the lowest error when estimating the monthly and daily solar irradiations. Hassan et al. performed a comparative study of four different data-driven models of daily global solar irradiation based on SVR, feedforward backpropagation ANN, ANFIS, and decision trees. They showed that ANN models, followed by ANFIS- and SVR-based models, outperform classic regression-based ones by reducing the root mean square errors by up to 31.7%, depending on the type of the inputs. In another study, the authors explored the potential of relatively simpler decision tree ensembles in predicting the hourly average and daily total global, diffuse, and direct normal solar radiation at different locations in the Middle East. Those ensembles were benchmarked against commonly used ANN, SVR, RF, bagging, and gradient boosting models.
The study’s outcomes indicated that RF models performed competitively with the most accurate ones (ANN models), while boosting models showed higher stability; SVR-based models, however, showed the best combination of the two criteria. Thus, many researchers concluded that, because measuring the global solar radiation at all locations on Earth is not possible and requires specific and expensive equipment and systems, data-driven techniques can be considered a replacement for experimental measurements and empirical methods.
Most studies dealing with data-driven techniques to predict solar radiation are conducted in arid regions, whereas only a few are conducted in semi-arid climate zones. Therefore, the main objective of this study is to evaluate the performance of different hybrid data-driven techniques in predicting daily GSR in semi-arid regions, such as the majority of the Spanish territory. Namely, we tested models based on RF, Locally Weighted Linear Regression (LWLR), Random Subspace (RS), and M5P, each hybridized with Additive Regression (AR), to estimate the solar radiation at six stations over Spain characterized by semi-arid climate conditions. The models tested in this study are novel, and most of the base algorithms have scarcely been applied in previous studies to estimate solar radiation. Therefore, this study recommends effective new methods for estimating solar radiation in semi-arid regions.
The rest of the paper is organized as follows: Section 2 briefly describes the study cases and data-driven models applied in this study. Section 3 presents the main findings from this study and discusses the applied models’ relevance. Finally, the main concluding remarks drawn from this study are presented in Section 4.
Additive regression (AR) is a gradient boosting ensemble learning method, implemented in the WEKA open-source package, that can enhance the predictive performance of a single regression learner through an iterative process. The AR mechanism iteratively enhances the regression-based learner by fitting, in each iteration, a new model to the residuals remaining from the previous iteration. The final prediction is obtained by accumulating the outputs of all the individual learners. In this approach, the main hyperparameter is a shrinkage coefficient (learning rate), with a default value of 1.0. Reducing the shrinkage coefficient helps avoid overfitting, and the resulting smoothing improves the predictive accuracy. However, selecting a very low shrinkage coefficient may increase the computational cost and training time of the model.
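The residual-fitting loop described above can be sketched as follows. This is an illustrative Python analogue, not the WEKA implementation itself; the mean-predicting base learner is a hypothetical stand-in for whichever regressive learner is boosted.

```python
import numpy as np

class MeanLearner:
    """Toy base learner (hypothetical): predicts the mean of its targets."""
    def fit(self, X, y):
        self.mu = float(np.mean(y))
    def predict(self, X):
        return np.full(len(X), self.mu)

def additive_regression_fit(X, y, make_learner, n_iter=20, shrinkage=0.5):
    """Fit an additive-regression ensemble: each round, a fresh learner is
    fitted to the residuals left by the shrunken predictions so far."""
    models = []
    residuals = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        m = make_learner()
        m.fit(X, residuals)
        residuals = residuals - shrinkage * m.predict(X)  # residuals shrink
        models.append(m)
    return models

def additive_regression_predict(models, X, shrinkage=0.5):
    # Final prediction accumulates the shrunken outputs of every learner
    return shrinkage * sum(m.predict(X) for m in models)
```

With a shrinkage below 1.0, each iteration corrects only part of the remaining error, which is exactly the smoothing/overfitting trade-off described above.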
Multilinear regression (MLR), as a supervised learning method, assumes a linear response between the input variables ($x_1, \ldots, x_M$) and the output ($y$) in the form of

$y = \beta_0 + \sum_{j=1}^{M} \beta_j x_j + \varepsilon$

where $\beta_j$ denote the regression coefficients, determined using the least squares (LS) method, $\varepsilon$ represents the residuals, and $M$ denotes the number of predictors. The fitness function $J(\beta)$ of MLR minimizes the squared error between the observed outputs ($z_i$) and the estimated ones ($\hat{y}_i$), which is expressed as follows:

$J(\beta) = \sum_{i=1}^{N} (z_i - \hat{y}_i)^2$
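As a minimal illustration of the least-squares fit (hypothetical data, noise-free so the coefficients are recovered exactly):

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 - x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1]

# Augment with an intercept column and solve the least-squares problem
A = np.hstack([np.ones((X.shape[0], 1)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
# beta ≈ [1.0, 2.0, -1.0]  (intercept, then the two slopes)
```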
In the case of nonlinear relationships between the input variables and the target, MLR becomes less efficient. The locally weighted linear regression (LWLR) method introduced here is a non-parametric lazy learning method. LWLR is an extension of the MLR approach, proposed for the first time by Atkinson et al. to overcome the drawbacks of conventional MLR. In this approach, a weight function captures the nonlinear relationship between the training data set and the target variable. For this purpose, the fitness function is described as follows [28,29]:

$J(\beta) = \sum_{i=1}^{N} W_{ii} (z_i - \hat{y}_i)^2$
where $W$ is the weight function. This fitness function can be written in matrix form as $J(\beta) = (X\beta - Z)^T W (X\beta - Z)$, where $X$ and $Z$ denote the input matrix and the output vector, respectively. In order to minimize the fitness function, it is differentiated with respect to $\beta$ and the resulting expression is set to zero. Thus, $\beta$ can be obtained as

$\beta = (X^T W X)^{-1} X^T W Z$
Here, the radial basis (Gaussian) kernel function is used as the weight matrix. This function is expressed as

$W_{ii} = \exp\!\left(-\frac{\|x_i - x\|^2}{2\sigma^2}\right)$

in which $\sigma$ denotes a constant bandwidth parameter, whereas $\|x_i - x\|$ stands for the distance between the pair of data points ($x_i$ and $x$).
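The two formulas above, the Gaussian weighting and the weighted normal equations, can be combined into a short prediction routine. This is a generic LWLR sketch (function name and bandwidth value are illustrative, not from the paper):

```python
import numpy as np

def lwlr_predict(x_query, X, z, sigma=1.0):
    """Predict at one query point with locally weighted linear regression.

    Weights W_ii = exp(-||x_i - x_query||^2 / (2 sigma^2)) emphasize
    nearby training points; beta = (A^T W A)^{-1} A^T W z, where A is
    the input matrix augmented with an intercept column.
    """
    A = np.hstack([np.ones((X.shape[0], 1)), X])      # intercept column
    a_q = np.concatenate([[1.0], np.atleast_1d(x_query)])
    d2 = np.sum((X - x_query) ** 2, axis=1)           # squared distances
    W = np.diag(np.exp(-d2 / (2.0 * sigma ** 2)))     # Gaussian kernel
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ z)  # weighted LS solve
    return float(a_q @ beta)
```

Because the weights depend on the query point, a new $\beta$ is solved for every prediction, which is why LWLR is called a lazy learning method.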
The M5P model extends the M5 decision tree model, first developed by Wang et al. to solve regression problems. In the M5P model, traditional decision trees are combined with linear regression functions at the leaf nodes of the trees. The most crucial advantage of decision tree models is their ability to solve nonlinear problems with extensive data and features in regression tasks. Herein, the data space is split into smaller sub-spaces, and the decision tree is generated for each particular sub-space using the elements comprising nodes, roots, leaves, and branches. In the M5P process, the linear regression functions provided for the various sub-spaces construct a set of linear models (a committee machine) to capture the existing non-linearity. Generally, the training process of the M5P algorithm comprises three steps: generating, pruning, and smoothing decision trees. The M5P model is trained by maximizing a criterion called the standard deviation reduction (SDR), as follows:

$SDR = sd(T) - \sum_{i} \frac{|T_i|}{|T|}\, sd(T_i)$
in which $T$ denotes the set of data samples reaching a node, $T_i$ denotes the subset associated with the ith sub-space (leaf), and $sd$ denotes the standard deviation. It is noteworthy that a prescribed threshold value terminates the training phase. Besides, useless leaves are pruned during the training phase to avoid overfitting. In the last stage, the unavoidable discontinuities between adjacent leaves are compensated for by a smoothing procedure. The flowchart of the M5P model is demonstrated in Fig. 1.
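The SDR criterion is straightforward to compute for a candidate split; a minimal sketch (function name is illustrative):

```python
import numpy as np

def sdr(parent, subsets):
    """Standard deviation reduction achieved by splitting the samples
    `parent` into the candidate `subsets` T_i (larger is better)."""
    n = len(parent)
    return float(np.std(parent)
                 - sum(len(s) / n * np.std(s) for s in subsets))
```

A split that separates the targets into internally homogeneous groups maximizes this quantity, which is how the tree-generation step chooses its splits.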
Random forest (RF) is a flexible bootstrap-aggregated (bagged) ensemble machine learning approach, widely implemented for many regression and classification tasks. RF incorporates a forest of random binary trees based on the Classification and Regression Trees (CART) algorithm and a bagging strategy to enhance standalone CART accuracy, as shown in Fig. 2. The simple mechanism of this method has recently been widely adopted for solving significant nonlinear problems, owing to its ability to mitigate bias and variance errors. CART models, in many cases, suffer from overfitting and instability during training. In the RF approach, to overcome these drawbacks and reduce the error of each tree model, the bagging strategy is applied through two random operators. In the first operator, a subset of the training data is randomly selected to construct each independent decision tree $T_i$, which can be expressed as [32,33]

$\{T_1(X), T_2(X), \ldots, T_m(X)\}$
where $X$ and $m$ are the training data and the number of decision trees, respectively. During the training stage, the values at the leaf nodes are iteratively updated through a weighted average of the training targets reaching each leaf, i.e.,

$\hat{y}_{\text{leaf}} = \frac{1}{n_{\text{leaf}}} \sum_{i \in \text{leaf}} y_i$
Each input variable introduced to a decision tree in the second operator leads to an independent decision. Consequently, the final response of RF is generated by majority voting in classification tasks or by averaging the individual predictions of the decision trees in regression. For regression tasks, this is expressed as

$\hat{y}(X) = \frac{1}{m} \sum_{i=1}^{m} T_i(X)$
Finally, the rest of the data, which did not participate in the training phase of the decision trees, is used to compute the out-of-bag (OOB) error and examine the accuracy of the RF model. Here, two tuning hyperparameters were considered, namely the number of trees in the forest and the number of predictors considered in each tree, which should be carefully selected to avoid overfitting. The reader is referred to [26,27] for more details about the RF approach.
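The paper's models were built in WEKA; as an illustrative analogue under that caveat, scikit-learn's `RandomForestRegressor` exposes the same two hyperparameters (number of trees and predictors per split) and the OOB error estimate (the dataset here is synthetic):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem standing in for the meteo-solar data
X, y = make_regression(n_samples=300, n_features=7, n_informative=5,
                       noise=0.1, random_state=0)

# n_estimators = trees in the forest; max_features = predictors per split
rf = RandomForestRegressor(n_estimators=200, max_features=3,
                           oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # out-of-bag R^2, an internal generalization estimate
```

The OOB score is computed exactly as described above: each tree is evaluated only on the samples left out of its bootstrap draw.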
Random subspace (RS), or “attribute bagging”, is an ensemble machine learning approach similar to the bagging algorithm. The primary strategy of RS is to reduce the correlation between estimators: during the training stage, some features are randomly selected instead of the whole feature set. The RS model attempts to combine the single models obtained from the individual learners into one model with superior performance compared to any of the individual learners. Whereas in bagging each learner randomly selects a subset of the training instances with replacement, in RS each learner is trained on a random subset of the features, which is the main difference from the bagging method. This method is a suitable option for solving high-dimensional problems, where the number of attributes (features) is large compared to the number of training data points. To construct an ensemble model based on the RS strategy, the following steps are followed:
1: Let N be the training data points, and D be the number of features in the training data sets.
2: Let L be the number of standalone learners for constructing the ensemble model.
3: Choose a number of features d, with d < D, for each (standalone) learner l_i (i = 1, …, L).
4: Allocate a training subset to each learner by randomly selecting d of the D features.
5: Combine the predictions of the L standalone learners using majority voting techniques for developing an ensemble model.
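The steps above can be sketched as follows (illustrative Python; the function names are hypothetical, and averaging replaces majority voting for the regression case of step 5):

```python
import numpy as np

def random_subspace_fit(X, y, make_learner, L=10, d=None, seed=0):
    """Steps 1-4: train L standalone learners, each on a random
    subset of d out of the D feature columns."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    d = d if d is not None else max(1, D // 2)
    ensemble = []
    for _ in range(L):
        cols = rng.choice(D, size=d, replace=False)  # random feature subset
        learner = make_learner()
        learner.fit(X[:, cols], y)
        ensemble.append((learner, cols))
    return ensemble

def random_subspace_predict(ensemble, X):
    # Step 5, regression analogue: average the individual predictions
    return np.mean([m.predict(X[:, cols]) for m, cols in ensemble], axis=0)
```

Because each learner sees a different projection of the feature space, their errors are less correlated, which is the decorrelation strategy described above.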
The developed models are based on six different combinations (scenarios) of input variables, as shown in Tab. 1. Besides, the extraterrestrial solar radiation and the theoretical sunshine duration in Tab. 1 are calculated mathematically from the date and time stamps of the data. For each set of inputs, four hybrid models were developed by integrating additive regression with RF, RS, LWLR, and M5P. Tab. 2 shows the optimized hyperparameters of these models. The dataset for each location was divided into two subsets for training and testing: the training stage corresponds to the recorded data from 2007 to 2012, while the testing stage employed the data from 2013 to 2015.
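These two astronomical inputs have standard closed-form expressions; the sketch below assumes the common FAO-56 formulation (the paper only states that they are computed from the date stamps, so this particular form is an assumption):

```python
import numpy as np

GSC = 0.0820  # solar constant, MJ m^-2 min^-1 (FAO-56 convention)

def extraterrestrial_radiation(lat_deg, doy):
    """Daily extraterrestrial radiation (MJ m^-2 day^-1) and theoretical
    sunshine duration (hours) for a latitude and day of year."""
    phi = np.radians(lat_deg)
    dr = 1 + 0.033 * np.cos(2 * np.pi * doy / 365)        # rel. Earth-Sun distance
    delta = 0.409 * np.sin(2 * np.pi * doy / 365 - 1.39)  # solar declination
    ws = np.arccos(-np.tan(phi) * np.tan(delta))          # sunset hour angle
    ra = (24 * 60 / np.pi) * GSC * dr * (
        ws * np.sin(phi) * np.sin(delta)
        + np.cos(phi) * np.cos(delta) * np.sin(ws))
    n_hours = 24 / np.pi * ws                             # theoretical sunshine
    return ra, n_hours
```

For an Andalusian latitude (about 37°N) near the June solstice, this yields roughly 42 MJ/m2⋅day and 14.5 daylight hours.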
The accuracy and suitability of the models were assessed using the root mean square error (RMSE), the relative RMSE (RRMSE), the mean absolute error (MAE), and the Pearson correlation coefficient (R) of the linear regression forced through the origin of the n pairs of observed and predicted GSR values [40–45].
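For reference, these four criteria can be computed as follows (note: this sketch uses the ordinary Pearson R, whereas the paper forces the regression through the origin):

```python
import numpy as np

def evaluate(obs, pred):
    """Return RMSE, relative RMSE (%), MAE, and Pearson correlation R
    for paired observed and predicted values."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    rrmse = 100.0 * rmse / obs.mean()   # RMSE relative to the observed mean
    mae = np.mean(np.abs(pred - obs))
    r = np.corrcoef(obs, pred)[0, 1]
    return rmse, rrmse, mae, r
```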
This study is carried out for the region of Andalusia, located in the south of the Iberian Peninsula. Andalusia covers a land area of about 87268 km2 with a complex climate. The total annual rainfall ranges from <300 mm (southeast coast) to >1000 mm (Betic mountains, Sierra Morena). The region has mild temperatures, with average annual temperatures above 17°C and 13°C in Sierra Morena and the Betic mountains, respectively, and 300 days of sunshine over most of the territory. The stations with semi-arid climates were selected for this work, as shown in Tab. 3. A semi-arid climate, also known as a semi-desert or steppe climate, occurs when precipitation is less than potential evapotranspiration but not as low as in an arid desert climate. Semi-arid climates come in various forms, giving rise to various biomes depending on air temperature. The Köppen Climate Classification subtype for this climate is BS.
The measured meteo-solar radiation datasets for the six locations in Tab. 3 were obtained from the Spain Agroclimatic Information Network (SIAR; Sistema de Información Agroclimática para el Regadío, available at: https://eportal.mapa.gob.es) for nine years (2007–2015). The quality of the meteorological values of each station is tested through different checks, including range, step, internal consistency, persistence, and spatial consistency . The correlograms in Fig. 3 demonstrate the strength of linear correlations between these variables and the predicted one (GSR).
In this Figure, the symbols denote the daily extraterrestrial solar radiation and the theoretical sunshine duration, respectively.
This section provides the detailed results of the afore-described models in both the training and testing stages. The six models (S1 to S6) are first developed for the first four locations (TIJ, ICH, ALM, and PAD). The most promising models and base algorithms are highlighted. Those selected models are further verified at the other locations (JER and JOD).
As shown in Fig. 4a, four hybrid models were developed by integrating the additive regression algorithm with the four machine learning algorithms overviewed in Section 2.1. The Figure shows the results of the sixth scenario/model in the training phase, with all seven variables considered as inputs. It reveals appreciable differences in the performances of the models, which justifies the objective of this study, i.e., exploring new base algorithms for considerable improvement in prediction accuracy. Fig. 4a also highlights considerable differences between the models’ performances at different locations, with TIJ and ICH showing better error estimates than ALM and PAD. This adds to the significance of this study, as such discrepancies in models’ performances occur within the same climatic zone, let alone across different zones. Compared to all other models, the AR-RF model shows exceptional performance in terms of the RRMSE measure, with values as low as 10.950% (at ICH), exceeding the performances of the other models by up to 50.5%. These substantial improvements emphasize the importance of the model hybridization concept and of selecting the base algorithms of the hybrid model [49,50]. For instance, Fig. 4a shows that the AR-M5P models always yield the least accurate predictions, even though M5P and RF (the best-performing base algorithm) share the same foundation of decision trees. The AR-RS model performs close to the AR-RF model, whereas the AR-LWLR and AR-M5P models show significantly poorer performances. Hence, the latter two are excluded from the following discussions.
So far, the models have been discussed considering their performance in the training phase. However, their performance in the testing phase is more important, as it shows whether they can handle new observations with different statistics. Tab. 4 summarizes the error estimates of the two best-performing models (i.e., AR-RF and AR-RS) when tested using the three-year (2013 to 2015) datasets at the four locations. Once again, the AR-RF models show superior performances compared to the AR-RS models, with improvements of up to 52.9% and 54.2% in terms of MAE and RMSE, respectively. Most importantly, the AR-RF model shows a highly stable performance when processing the three years of testing data. Typically, data-driven models tend to have higher error estimates in the test phase since they are not necessarily prepared for new data points that they have not been trained to handle. However, since the model is well-developed and the training and validation datasets are large enough, the model shows comparable error estimates at ICH (e.g., training and testing RMSEs of 1.902 and 2.188 MJ/m2⋅day, respectively) and even lower test error estimates at the three other locations. For instance, at PAD, the RMSE of the testing phase (1.603 MJ/m2⋅day) is 20.1% lower than that of the training phase (2.024 MJ/m2⋅day). The AR-RF model demonstrates excellent performance at this location (PAD), with an RRMSE of only 8.5%. Overall, it is concluded that the AR-RF models are the most suitable for such climatic conditions, followed by AR-RS, which is also a bagging technique.
So far, the different hybrid models have been assessed based on the input variables of Scenario #6, where wind speed and relative humidity records are also required. Also, the previous section demonstrated that AR-RF yields consistently better predictions of GSR. Hence, this section evaluates the sensitivity of the AR-RF model’s predictions to different input combinations. As shown in Fig. 4b, the six combinations of model inputs result in different but comparable density distributions of residuals. As aforementioned, all models tend to overestimate the actual GSR values. However, the more complex models (S4 to S6) tend to show less frequent significant overestimations, as is clear from the shifted peaks to the left in Fig. 4b.
To better summarize the performance of the models with different inputs, Fig. 5 shows the Taylor diagrams of the six models at TIJ, ICH, ALM, and PAD. The model’s performance improves as the number of inputs increases, especially when the new inputs represent a brand-new domain/type of measurements. This is expected since data-driven models tend to be greedy regarding dataset size and the number of inputs. S1 and S2 show lower performances for most stations, whereas S6 performs considerably better. However, the differences between models S3 to S6 are not that significant. Hence, all are recommended depending on the anticipated level of accuracy in GSR predictions, with S6 being the most accurate (but requires additional measurements of wind speed and relative humidity) and S3 being the most cost-effective (only air temperature measurements are required). The comparable performances of S3 and S6 are even clearer in Fig. 6. In this Figure, while all models fail to capture extreme GSR values, they succeed in capturing the actual GSR records’ main distribution characteristics (e.g., mean and quartile values).
As a final verification of the selected models (AR-RF models based on S3 and S6 input combinations), the error estimates of the test phase are summarized in Tab. 5. Compared to the results in Tab. 4, it can be seen that even better error estimates are obtained at the two new locations (JER and JOD), with RMSEs as low as 1.274 MJ/m2⋅day (the lowest RMSE in Tab. 4 was 1.603 MJ/m2⋅day for PAD). The RRMSE values are also substantially lower than those in Tab. 4. This proves that the suggested models are generalizable to distant locations with similar climatic features. Once again, S6 shows the best performances, with RMSEs of 1.274 and 1.403 MJ/m2⋅day at JER and JOD, respectively. However, these values are in the same order of magnitude as those offered by S3, which is drastically simpler. Hence, the AR-RF model based on air temperature measurements (S3) is recommended as the best trade-off between complexity and accuracy.
In this research, four novel hybrid ensemble models based on RS, RF, M5P, and LWLR, each integrated with AR (i.e., AR-RS, AR-RF, AR-M5P, and AR-LWLR), were developed to estimate the solar radiation for six semi-arid regions of Spain. Extraterrestrial solar radiation, theoretical sunshine duration, mean ambient temperature, maximum daily temperature, minimum daily temperature, relative humidity, and wind speed were employed as the input variables. First, the four hybrid ensemble models were examined using all input parameters at four stations (TIJ, ICH, ALM, and PAD) during the training phase. The results show that the AR-M5P models have the lowest accuracy, with correlation coefficients of only 0.874, 0.909, 0.870, and 0.869 at TIJ, ICH, ALM, and PAD, respectively. In contrast, the testing phase outcomes demonstrated that the AR-RF models have the best performances among all the hybrid models, with correlation coefficients of 0.97, 0.962, 0.966, and 0.982 at the same locations, respectively. In the next stage, six scenarios (input combinations S1–S6) based on a correlation analysis were adopted to examine AR-RF, as the superior ensemble model, at the two remaining stations (i.e., JER and JOD) to verify the model’s performance and assess the impact of candidate inputs on the prediction accuracy. The testing phase results proved that Scenario #6 leads to the highest accuracy, compared to the other scenarios, with RMSEs of 1.274 and 1.403 MJ/m2⋅day and correlation coefficients of 0.988 and 0.968 at JER and JOD, respectively. Scenario #3 ranked next in accuracy for estimating the solar radiation at both validation stations. Overall, AR-RF was the best predictive model, followed by AR-RS, and it outperformed all other models at most stations. Therefore, AR-RF is recommended for predicting global solar radiation in regions with similar climatic characteristics.
Acknowledgement: The authors acknowledge the Agroclimatic-Information-Network of Andalusia (RIA) for providing most of the used data. This work was supported by the Portuguese Foundation for Science and Technology (FCT) through the project PTDC/CTA-OHR/30561/2017 (WinTherface).
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.