This paper develops an algorithm for extracting association rules, called the Context-Based Association Rule Mining algorithm (CARM), which can be regarded as an extension of the Context-Based Positive and Negative Association Rule Mining algorithm (CBPNARM). CBPNARM was developed to extract positive and negative association rules from spatio-temporal (space-time) data only, while the proposed algorithm can be applied to both spatial and non-spatial data. The proposed algorithm is applied to an energy dataset to classify a country’s energy development by uncovering interdependencies among the variables in the form of positive and negative associations. The algorithm extracts many association rules related to sustainable energy development, which then need to be pruned. In this paper, the context serves as a pruning measure for extracting pertinent association rules from non-spatial data. The Conditional Probability Increment Ratio (CPIR), which was not used in CBPNARM, is also added to the proposed algorithm. The inclusion of the context variable and CPIR resulted in fewer rules and improved robustness and ease of use. The extraction of common negative frequent itemsets in CARM also differs from that in CBPNARM. The rules created by the proposed algorithm are more meaningful, significant, relevant and insightful. The accuracy of the proposed algorithm is compared with that of the Apriori, PNARM and CBPNARM algorithms. The results demonstrated enhanced accuracy, relevance and timeliness.
We live in an information age in which nearly everything is stored on computers, and the use of information systems has become a necessity of life. Knowledge is extracted from data through the data mining process. Data mining is a step-by-step process that begins with data analysis and proceeds through classification/prediction and the discovery of trends and patterns [
Data mining prepares the data for processing by repairing erroneous and blank data fields, storing the cleaned data in the warehouse and finally applying algorithms to it [
Shaheen et al. [
The CBPNARM algorithm was developed for the extraction of spatio-temporal association rules and was applied only to spatial data. Spatial data differs from conventional data in that it relates directly or indirectly to a location on earth. Spatial data attributes combine to represent an image that is drawn on a geographic information system (GIS) or other similar information systems [
The Apriori algorithm proposed by Agrawal et al. [
The proposed algorithm is implemented for sustainable energy development indicators. Sustainability in the energy sector is the primary need of almost every country in the world. The commission on sustainable development has provided a list of indicators [
This paper develops an algorithm for exploring positive and negative context-based association rules in conventional/characteristic data as an extension of the CBPNARM algorithm. The accuracy of the proposed methodology is compared with Apriori and CBPNARM at the methodological level and is also assessed against the categorization of sustainable energy development at the application level. The contributions of this study are given below:
The CBPNARM algorithm was designed for spatial data only. CARM, the algorithm proposed in this paper, can be applied to non-spatial or conventional numeric and ordinal data. The algorithm is applied to energy datasets to mine rules for energy sustainability. CPIR was not used in CBPNARM because it increased the complexity of the algorithm while the results were not remarkable; CPIR is added to the proposed algorithm. The extraction of negative frequent items in CARM differs from that in CBPNARM. The four cases of the CARM algorithm given in the pseudo-code differ from those of CBPNARM.
Energy is vital for eliminating scarcity and elevating the standard of human life [
| S. No. | Name of indicator | Data required |
|---|---|---|
| Social domain | | |
| 1. | Share of households (or population) without electricity or commercial energy | Population (no energy), total population |
| 2. | Share of household income spent on fuel and electricity | Income spent on energy, total income |
| | Domestic use of energy classified with respect to the income group and fuel mix | Energy use per household, Household income, Corresponding fuel mix |
| | Accident fatalities per energy produced by fuel chain | Annual fatalities by fuel chain, Annual energy produced |
| Economic domain | | |
| 3. | Energy use per capita | Energy use, Total population |
| 4. | Energy use per unit of GDP | Energy use, GDP |
| 5. | Efficiency of energy conversion and distribution | Losses in electricity generation, transmission and distribution |
| 6. | Reserves-to-production ratio | Proven recoverable reserves, Total energy production |
| 7. | Resources-to-production ratio | Total estimated resources, Total energy production |
| 8. | Value added by energy in industrial sector | Use of energy in industry, Value added |
| 9. | Value added by energy in agriculture | Use of energy in agriculture, Value added |
| 10. | Value added by energy in service sector | Use of energy, Value added |
| 11. | Value added by energy in household | Use of energy in household, Value added |
| | Value added by energy in transport | Use of energy in transport, Value added |
| | Fuel shares in energy and electricity | Primary energy supply and final consumption by fuel type, Total primary energy supply and final consumption |
| 12. | Non-carbon energy share in energy and electricity | Non-carbon energy supply and final consumption, Total primary energy supply and final consumption |
| 13. | Renewable energy share in energy and electricity | Renewable energy supply and final consumption, Total primary energy supply and final consumption |
| 14. | End-use energy prices by fuel and by sector | Energy prices with and without tax |
| 15. | Net energy import dependency | Energy imports, Total primary energy supply |
| 16. | Stocks of critical fuels per corresponding fuel | Stocks of critical fuel, Critical fuel consumption |
| Ecological domain | | |
| 17. | Greenhouse-gas emissions by energy products | Greenhouse-gas emissions resulting from energy products and its use, Total population, GDP |
| | Ambient concentrations of air pollutants in urban areas | Concentration of pollutants in air |
| | Air pollutant emissions from energy systems | Air pollutant emissions |
| 18. | Pollutant expulsions in liquid wastes from energy systems | Pollutant expulsions in energy liquid wastes |
| | Soil area where acidification exceeds critical load | Affected soil area, Critical load |
| 19. | Rate of deforestation attributed to energy use | Forest area at two different times, Biomass utilization |
| 20. | Ratio of waste generated in energy production to energy obtained | Amount of generated waste from the source, Total energy production from the source |
| 21. | Ratio of waste properly disposed of total generated solid waste | Disposed solid waste, Total solid waste |
| 22. | Ratio of solid radioactive waste to units of energy produced | Units of solid radioactive waste, Energy produced |
| 23. | Ratio of solid radioactive waste awaiting disposal to total generated solid radioactive waste | Solid radioactive waste awaiting disposal, Total solid waste |
The economic domain of sustainability indicators can be divided into consumption, production patterns and security of supply. The indicators related to the consumption and production of energy include energy use per GDP per capita, energy supply efficiency, energy production, etc. The ecological domain covers the impacts of energy-related indicators of atmosphere, water and land [
The basis for the selection of energy sustainability indicators for this study is identical to that proposed by Shaheen et al. [
Support is a measure of finding the frequency of an itemset in the database [
Confidence is an indication of how often a rule is true [
Lift is used to measure the correlation value of the antecedent and consequent of an association rule [
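The three measures above can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions (support as the fraction of transactions containing an itemset, confidence as supp(A ∪ C) / supp(A), and lift as confidence divided by supp(C)), since the formulas themselves are not reproduced in this text. The function names and the toy transactions of discretized indicator labels are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Set


def support(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Set[str]]) -> float:
    """Estimate of P(consequent | antecedent) = supp(A u C) / supp(A)."""
    antecedent, consequent = set(antecedent), set(consequent)
    supp_a = support(antecedent, transactions)
    if supp_a == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / supp_a


def lift(antecedent: Iterable[str], consequent: Iterable[str],
         transactions: List[Set[str]]) -> float:
    """Confidence divided by the support of the consequent.

    Values above 1 suggest a positive correlation, values below 1 a
    negative one.
    """
    consequent = set(consequent)
    supp_c = support(consequent, transactions)
    if supp_c == 0:
        return 0.0
    return confidence(antecedent, consequent, transactions) / supp_c


# Hypothetical toy data: one "transaction" per country, holding
# discretized indicator labels (not taken from the paper's dataset).
transactions = [
    {"energy_use=high", "gdp=high"},
    {"energy_use=high", "gdp=high", "emissions=high"},
    {"energy_use=high", "emissions=high"},
    {"energy_use=low", "gdp=low"},
]

supp_a = support(["energy_use=high"], transactions)
conf_rule = confidence(["energy_use=high"], ["gdp=high"], transactions)
lift_rule = lift(["energy_use=high"], ["gdp=high"], transactions)
```

In this toy data the antecedent appears in three of four transactions and the full rule in two of four, so the confidence is 2/3 and the lift is 4/3, which points to a mild positive association.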
Interestingness is a measure used to find potentially positive and potentially negative item sets from a dataset. A rule
The conditional-probability increment ratio (CPIR) of a rule is computed based on the dependence of the antecedent and consequent. In an association rule
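As a hedged illustration of the two measures, the sketch below assumes the definitions commonly used in positive and negative association rule mining: interest(X, Y) = |supp(X ∪ Y) − supp(X)·supp(Y)|, and a CPIR that normalizes P(Y|X) − P(Y) by 1 − P(Y) when the dependence is positive and by P(Y) otherwise. The formulas of this paper are not reproduced in the text, so these definitions, as well as the function and parameter names, are assumptions.

```python
def interest(supp_xy: float, supp_x: float, supp_y: float) -> float:
    """Absolute deviation of the joint support from independence."""
    return abs(supp_xy - supp_x * supp_y)


def cpir(supp_xy: float, supp_x: float, supp_y: float) -> float:
    """Conditional-probability increment ratio of Y given X.

    Returns a value in [-1, 1]: positive when X raises the probability
    of Y, negative when X lowers it, and 0 when X and Y are independent.
    """
    if supp_x <= 0 or not 0 < supp_y < 1:
        raise ValueError("need supp_x > 0 and 0 < supp_y < 1")
    p_y_given_x = supp_xy / supp_x
    if p_y_given_x >= supp_y:
        # Positive dependence: scale by the largest possible increase.
        return (p_y_given_x - supp_y) / (1 - supp_y)
    # Negative dependence: scale by the largest possible decrease.
    return (p_y_given_x - supp_y) / supp_y


# Hypothetical support values, for illustration only.
positive_case = cpir(supp_xy=0.5, supp_x=0.75, supp_y=0.5)
negative_case = cpir(supp_xy=0.1, supp_x=0.5, supp_y=0.4)
independent_case = cpir(supp_xy=0.25, supp_x=0.5, supp_y=0.5)
interest_value = interest(supp_xy=0.5, supp_x=0.75, supp_y=0.5)
```

Under these assumptions the first case gives a CPIR of about 1/3 (positive dependence), the second gives −0.5 (a candidate for a negative rule), and the third gives 0 (independence).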
Context is the state of the entity, environment or action that can affect the results of association rule mining. The value of the context variable must be within the normal range for a matching rule to be valid. For example, a change in vegetation color in the surrounding area may indicate an emergency below the earth’s surface. If the value of the “waterflood” context variable is outside its normal range, then the change in vegetation color may not indicate the presence of a volcano. In this example, the color changed because of the waterflood, which is the context variable here and whose value for this rule was above the normal range [
The method proposed for extracting positive and negative association rules in conventional data sets is named CARM and is dependent on support, confidence, interestingness, CPIR and the value of the context variable. This method fetches the rules from the non-spatial datasets. CBPNARM [
Positive rules | |||
---|---|---|---|
Negative rules | |||
The aforementioned mathematical procedures generate a large number of positive and negative association rules. The interestingness measure proposed by [
The proposed algorithm for context-based association rule mining is given in the section below:
Name: CARM ( )
1:
a. SI: Database of Indicators for sustainable energy development given in
b. BaseValSupp: Threshold value for support variable
c. BaseValSuppNeg: Threshold value for support variable of negative association
d. BaseValConf: Threshold value for confidence variable
e. BaseValInterest: Threshold value for interestingness variable
f. ULC: Upper limit for context variable range
g. LLC: Lower limit for context variable range
2:
a. List of association rules
3.
/* The sustainability-indicator data have three dimensions (year, country and sustainability indicator). This loop reduces them to two dimensions by averaging each sustainability indicator over all years. */
4:
5:
6: Year-
7: Year-
8: Store value of Year-AvgSI in the database for SIs of a country
9: Update database SI
10:
11:
12:
/* Extract positive and negative frequent itemsets on the basis of frequency of each itemset in the database SI */
13:
14:
/* The itemsets which do not qualify the criteria of minimum support are removed from the database */
15: While (frequent itemsets remain in LPos and LNeg)
16:
17:
18: End While
/* The itemsets which do not qualify the criteria of minimum confidence are removed from the database */
19: While (frequent itemsets remain in PFI and NFI)
20:
21:
22: End While
/* The itemsets which do not qualify the criteria of minimum interestingness are removed from the database */
23: While (frequent itemsets remain in PFI and NFI)
24:
25:
26: End While
/* The itemsets which do not qualify the criteria of CPIR are removed from the database */
27: While (frequent itemsets remain in PFI and NFI)
28: if CPIR
29: Omit the rule from PFI
30: End if
31: if CPIR
32: Omit the rule from NFI
33: End if
34: End While
/* Four possible cases of context variable, two for positive association rules and two for negative association rules are applied in this loop */
35: While (PFI and NFI are not empty)
36: if (
37:
38:
39: Elseif (
40:
41:
42: End if
43: If
44: Omit the rule from PFI
45: else
46: Add the rule in PFI
47: End if
48: if (
49:
50:
51: Elseif (
52:
53:
54: End if
55: If
56: Omit the rule from NFI
57: else
58: Add the rule in NFI
59: End if
60: End While
61:
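The pruning stages of the listing above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' implementation: several steps of the listing are not recoverable from this text, so the sketch assumes that each measure has already been computed per rule, that the CPIR test compares the absolute CPIR against a hypothetical threshold `base_val_cpir`, and that the four context cases collapse into a single check that the context value lies within [LLC, ULC]. The `Rule` class and the example values are hypothetical; the threshold names mirror the inputs of the listing.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    antecedent: str
    consequent: str
    negative: bool        # True for a negative association rule
    support: float
    confidence: float
    interest: float
    cpir: float
    context_value: float  # observed value of the rule's context variable


def prune(rules: List[Rule], base_val_supp: float, base_val_supp_neg: float,
          base_val_conf: float, base_val_interest: float,
          base_val_cpir: float, llc: float, ulc: float
          ) -> Tuple[List[Rule], List[Rule]]:
    """Return (PFI, NFI): positive and negative rules that survive pruning."""
    pfi: List[Rule] = []
    nfi: List[Rule] = []
    for rule in rules:
        # Negative rules use their own support threshold.
        min_supp = base_val_supp_neg if rule.negative else base_val_supp
        if rule.support < min_supp:
            continue
        if rule.confidence < base_val_conf:
            continue
        if rule.interest < base_val_interest:
            continue
        # Assumption: weak dependence in either direction is pruned.
        if abs(rule.cpir) < base_val_cpir:
            continue
        # Assumption: a rule is valid only while its context is in range.
        if not llc <= rule.context_value <= ulc:
            continue
        (nfi if rule.negative else pfi).append(rule)
    return pfi, nfi


# Hypothetical rules; the context is a number of power distributors,
# whose normal range in this example is 1 to 4.
rules = [
    Rule("energy_use=high", "gdp=high", False, 0.5, 0.8, 0.12, 0.4, 2),
    Rule("energy_use=high", "renewables=high", True, 0.3, 0.7, 0.10, -0.5, 3),
    Rule("gdp=high", "emissions=high", False, 0.5, 0.8, 0.12, 0.4, 6),
    Rule("gdp=low", "imports=high", False, 0.05, 0.9, 0.20, 0.6, 2),
]

pfi, nfi = prune(rules, base_val_supp=0.2, base_val_supp_neg=0.2,
                 base_val_conf=0.6, base_val_interest=0.05,
                 base_val_cpir=0.1, llc=1, ulc=4)
```

Here the third rule is dropped because its context value lies outside the range and the fourth because its support is too low, leaving one positive and one negative rule.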
The time complexity of the proposed algorithm is O(N²) when both the number of years and the number of countries are taken into account. However, if the number of countries is fixed at its maximum, the time complexity is O(N), where N represents the number of years. The working of the proposed method is given in
The algorithm proposed in this paper is implemented in Python, an open-source programming language, using a Jupyter notebook. The experiment is performed on a machine with a 2.11 GHz i7 processor, 16 GB RAM and a 500 GB hard disk, running the Windows 10 operating system with all necessary network conditions. Data for 23 sustainable energy development indicators are collected from 28 countries over 25 years from 1990 to 2015. All data are collected from online energy data portals. Energy sustainability indicators contain quantifiable and unquantifiable attributes, of which only the quantifiable attributes are used in this study. Data were not available in the online sources for all 30 attributes, so 23 of the 30 attributes are included in the final database. For some attributes, data were not available through online sources but could be derived from the available attributes. The context variables taken into consideration for the study of sustainable energy development are presented in
S. No. | Name of context variable | Data type | Category of indicators | Range |
---|---|---|---|---|
1. | Economic recession | Boolean | Economic domain | Normal-not normal |
2. | No of power distributors | Numeric | Social domain | 1–4 |
3. | Index of pollution | Numeric | Ecological domain | 0– |
The data from the first phase of the experiment are averaged and discretized to produce significant associations. The data had three dimensions (sustainability indicator value, country and year), so they had to be converted to two dimensions before discretization. The values of each indicator were averaged over 25 years to obtain one value. The process of discretization was straightforward: range values were determined for every data attribute, and the data were then converted from values to ranges. An example of three indicators can be found in
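The averaging and discretization steps can be sketched as below. This is a minimal illustration that assumes records of the form (country, year, indicator, value) and user-supplied cut points; the record layout, the indicator name, the cut points and the labels are all hypothetical and do not come from the paper's dataset.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, int, str, float]  # (country, year, indicator, value)


def average_over_years(records: List[Record]) -> Dict[Tuple[str, str], float]:
    """Collapse the year dimension: one mean value per (country, indicator)."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for country, _year, indicator, value in records:
        grouped[(country, indicator)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


def discretize(value: float, cut_points: Sequence[float],
               labels: Sequence[str]) -> str:
    """Map a value to the label of the first range whose upper cut exceeds it."""
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


# Hypothetical records for two countries and one indicator.
records = [
    ("A", 1990, "energy_use_per_capita", 1.0),
    ("A", 1991, "energy_use_per_capita", 3.0),
    ("B", 1990, "energy_use_per_capita", 5.0),
    ("B", 1991, "energy_use_per_capita", 7.0),
]

averaged = average_over_years(records)
ranges = {
    key: discretize(value, cut_points=[2.5, 5.0],
                    labels=["low", "medium", "high"])
    for key, value in averaged.items()
}
```

With these values, country A averages 2.0 and falls in the "low" range, while country B averages 6.0 and falls in the "high" range. The resulting labels are the kind of items from which itemsets can then be built.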
In
A significant number of positive and negative association rules were extracted from the dataset using the CARM algorithm. It was nearly impossible to learn from so many rules, so the different levels of pruning strategies described in the proposed method were applied. Some of the final rules extracted after pruning are given in
In
Algo | Apriori | Apriori | PNARM | PNARM | PNARM | PNARM | CBPNARM | CBPNARM |
---|---|---|---|---|---|---|---|---|
Pruning | N | C | N | I | I, C | I, C, CP | CN | I, C, CP, CN |
Rules | 204 | 106 | 438 | 312 | 266 | 186 | 298 | 101 |
The average confidence graphs for the unpruned and pruned rules extracted through all algorithms are given in
With the additional pruning measure, CARM took less execution time than CBPNARM. When no pruning technique is used for association rule mining, the execution time of CARM ranks second to last. The algorithm was designed to improve the quality of the association rules extracted from the datasets, so comparing the algorithms on precision, recall and F-measure gives a clearer picture. The comparison of the algorithms based on average values over multiple energy datasets is shown in
TP rate | FP rate | Precision | Recall | F-Measure | ROC area | PRC area | |
---|---|---|---|---|---|---|---|
Apriori | 0.52 | 0.16 | 0.764 | 0.912 | 0.831 | 0.433 | 0.837 |
PNARM | 0.54 | 0.3 | 0.642 | 0.984 | 0.777 | 0.337 | 0.652 |
CBPNARM | 0.74 | 0.014 | 0.981 | 0.922 | 0.950 | 0.925 | 0.998 |
TP rate: rate of true positives, FP rate: rate of false positives (instances falsely extracted as a rule), ROC/PRC area: trade-off between true positive and false positive rates. Whereas;
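For reference, precision, recall and F-measure can be computed as in the following sketch, which assumes their standard definitions because the formulas are not reproduced in this text. The function names and the counts in the example are hypothetical; the last line only checks that the F-measure formula is consistent with the Apriori row of the table above.

```python
def precision(tp: int, fp: int) -> float:
    """Share of extracted rules that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of correct rules that were extracted: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts, for illustration only.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=2)

# Precision and recall reported for Apriori in the table above.
f_apriori = f_measure(0.764, 0.912)
```

The harmonic mean of the reported Apriori precision and recall comes out at roughly 0.831, which matches the F-measure column.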
The CARM algorithm for mining context-based association rules is proposed in this paper as an extension of the CBPNARM algorithm. Several association rule pruning techniques, including confidence, interestingness and CPIR, are incorporated into the CARM algorithm to improve insights by decreasing the number of rules extracted. The context is used in the algorithm to eliminate certain rules from the final rule set, and/or to add previously excluded rules to it, depending on whether the value of the context variable is out of range. The algorithm is applied to sustainable energy indicators to find co-varying sustainability indicators and countries for sustainable energy development. The rules produced by CARM are more robust, relevant and insightful in terms of average confidence, dependence and relevance.
The proposed method outperformed the previous methods in terms of the number of rules generated, confidence and dependency. The inclusion of the context variable and CPIR reduced the number of rules and increased the robustness and usability of the rules. The confidence and dependency values show that having fewer rules does not imply a loss of useful patterns. The execution time of the algorithm is higher than that of a few other algorithms, which is expected given the additional functions added for the context variable and CPIR. The complexity of the algorithm can be improved in future by using object-oriented approaches for the context variable and CPIR.
The results obtained for the application domain of sustainable energy development are also insightful: they reveal interesting covariances among the indicators and underline the criticality of some countries for their energy development. The energy sector in a country can use the associations derived from the proposed method to construct an optimal plan to ensure sustainable energy development. The associations among sustainability indicators can lead the energy sector to devise a plan according to the individual deficiencies of energy development and their relation to other developmental factors. Thus, the study can help an energy sector achieve optimal energy development without compromising the economy, ecology and social justice that are essential ingredients of sustainability. The work can be extended to automate the selection of context variables, because selecting them manually can add some bias to the results. An automated mechanism for interpreting negative association rules can also be added to the algorithm in future work. Different classification algorithms and learning approaches can be added to the system to reduce the complexity arising from the data structure.