Artificial intelligence (AI) and machine learning (ML) help businesses make predictions and take key decisions that benefit them. In online shopping, it is very important to find trends in the data and to identify the features that drive the success of the business. In this research, we analyze a dataset of 12,330 records of customers who visited an online shopping website over a period of one year. The main objective is to find the features that are relevant for correctly predicting the purchasing decisions of visiting customers and to build ML models that make correct predictions on unseen data. The permutation feature importance approach is used to measure the importance of each feature with respect to the output variable (Revenue). Five ML models, i.e., decision tree (DT), random forest (RF), extra tree (ET) classifier, neural network (NN), and logistic regression (LR), are used to make predictions on unseen data. The performance of each model is discussed in detail using accuracy, precision, recall, F1 score, and the ROC-AUC score. The RF model is the best of the five, with an accuracy of 90% and an F1 score of 79%, followed by the ET classifier. Hence, our study indicates that the RF model can be used by online retailing businesses to predict consumer buying behaviour. Our research also reveals the importance of page value as a key feature for capturing online purchasing trends. This may guide businesses to focus on this specific feature and to identify the key factors behind page value, which in turn will help the online shopping business.
Artificial intelligence (AI) and machine learning (ML) are being used by many organizations for several purposes, for instance, making predictions related to the stock market, forecasting the weather, and many other applications including web scraping and making sense of text data and images.
ML has been widely used to predict the future based on the past behavior of data. Recent years have seen tremendous growth in online retailing around the world. While it has a low internet penetration rate of 34.5 percent, India still has the fourth-largest number of internet users after China. Given the growing importance of India’s online market, its complicated sensitivity collection, and its area-wise sociopsychological barriers, retail stores need to consider consumer shopping inclinations. The objectives of this paper are: (i) to find and explain the most important features for making correct predictions, that is, the features that contribute to the success of an online shopping business; (ii) to understand customer behavior and the financial spending of customers; and (iii) to build a robust model that can make predictions on unseen data.
The dataset comes from an online business and contains information on 12,330 individual users who visited the website to search for the products and offerings of the business. Five ML models, i.e., decision tree (DT), random forest (RF), extra tree (ET), neural network (NN), and logistic regression (LR), are used in this research. The data is divided in an 80/20 ratio: 80% is used for training and the remaining 20% for testing. All the models are trained on the training data and tested on the test data. Permutation importance, one of the most important techniques for feature importance, is used to identify the features that are most useful for making correct predictions.
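As a minimal sketch of the 80/20 division described above, the snippet below shuffles the row indices once and holds out 20% of them. The function name, the seed, and the use of a plain (non-stratified) shuffle are assumptions made for illustration; the text does not say how the split was drawn.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) after shuffling the row indices once."""
    indices = list(range(n_rows))
    # A seeded generator keeps the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(12330)
print(len(train_idx), len(test_idx))  # 9864 2466
```

With 12,330 records, a 20% hold-out gives 2,466 test rows, which matches the row totals of the confusion matrices reported later.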
References | Method | Results | Suggestions for future work |
---|---|---|---|
Hafez et al. | KNN, FKNN, XGBoost, and MLP models were used to classify product catalogs into a three-level food taxonomy. | FKNN algorithm outperformed. | Currently, researchers work on multidimensional categories. |
Sharma et al. | Build ensemble learning models which include RF, CNN, XGBoost, and voting mechanism techniques. | This ensemble technique outperformed. | --- |
Islam et al. | Develop a shopping support robot that offers shopping services and develop a GRU to analyze the actions of customers. | Accuracy: 82% achieved in recognizing the client’s actions. | Develop a robot that supports elderly people during shopping. |
Joshi et al. | Data collection: 124 Indian people. RF model is used for prediction and feature importance. | RF model represents the percentage of each category towards online | Increase the number of observations. |
Effendy et al. | Dataset of client behavior. Simple under-sampling, SMOTE, and WRF used for customer attrition prediction. | Combined sampling and WRF could produce a better performing predictive model than before. | Identify the most important features that cause clients to churn. |
Xie et al. | Build IBRF with ANN, DT, and CWC-SVM on a dataset of real bank customers. | IBRF outperformed with 93.2% accuracy. | Improve the efficiency and simplification ability. |
Hadden et al. | NN, regression trees, DT used for churn prediction. | DT model outperformed. Important variables: | Perform a deeper analysis. |
The online shoppers intention (OSI) dataset is collected from Kaggle. This dataset was launched as a Kaggle competition in the year 2019.
The OSI dataset contains numerical as well as categorical variables. As a pre-processing step, the categorical variables are converted into numerical variables using one-hot encoding and categorical encoding. Detailed information about the missing values and the data type of each variable is given in the following table.
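One-hot encoding, mentioned above, can be illustrated with a small standard-library sketch. Applying it to the visitor type column is an assumption made for this example; the text does not state which categorical variables received one-hot encoding and which received categorical (integer) encoding.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator row, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Toy column using the visitor types listed in the dataset description.
visitor_type = ["Returning Visitor", "New Visitor", "Other", "Returning Visitor"]
categories, encoded = one_hot(visitor_type)
print(categories)  # ['New Visitor', 'Other', 'Returning Visitor']
print(encoded[0])  # [0, 0, 1]
```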
Variables of dataset | Data type | Variable’s description | Missing values |
---|---|---|---|
Administrative | Float | The number of pages about account administration that the visitor has visited. | 14 |
Administrative duration | Float | Time duration spent on account management-related pages by the visitor. | 14 |
Informational | Float | The number of pages the visitor has visited about the website, communication, and address information of the shopping site. | 14 |
Informational duration | Float | Time duration spent on informational pages by the visitor. | 14 |
Product-related | Float | The number of product-related pages the visitor has visited. | 14 |
Product-related duration | Float | Time duration spent on product-related pages by the visitor. | 14 |
Bounce rates | Float | The visitor’s average bounce rate for the pages he or she viewed. | 14 |
Exit rates | Float | The visitor’s average exit rate for the pages he or she viewed. | 14 |
Page values | Float | The visitor’s average page value of the pages he or she visited. | 0 |
Special day | Float | The proximity of the site’s visit time to a significant day. | 0 |
Month | Object | The visit date’s month value. | 0 |
Operating systems | Int | The visitor’s operating system. | 0 |
Browser | Int | The visitor’s browser. | 0 |
Region | Int | The geographic region from which the visitor’s session began. | 0 |
Traffic type | Int | The source of traffic that sent the visitor to the website (e.g., banner, SMS, direct). | 0 |
Visitor type | Object | The visitor type: “New Visitor,” “Returning Visitor,” or “Other.” | 0 |
Weekend | Bool | Boolean value indicating whether the visit falls on a weekend. | 0 |
Revenue | Bool | Whether or not the visit was completed with a transaction is indicated by the class label. | 0 |
The analysis of the dataset is carried out in two phases. In the first phase, univariate analysis is carried out on the dataset. The univariate analysis describes each variable individually without considering any relationships among different variables. Specifically, we compute the mean, median, maximum, minimum, standard deviation, and variance of each of the variables present in the dataset. The OSI dataset contains numerical as well as categorical variables. To analyse the numerical variables, descriptive statistical analyses are performed, whereas for the categorical variables, we plot frequency distributions. The dataset is highly imbalanced: eighty-five percent of the data samples belong to class 0 and fifteen percent belong to class 1. The frequency distribution plots give an understanding of the class imbalance present in the dataset. In the second phase, bivariate analysis is carried out to check the relationship between pairs of variables in the dataset. Statistical relations between the variables are measured using quantitative approaches such as correlation, regression, and grouping by category.
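The univariate step can be sketched with the standard library as follows. The helper names and the toy values are hypothetical. The use of the sample (n − 1) standard deviation and variance is also an assumption, since the text does not say which estimator was used.

```python
import statistics
from collections import Counter


def describe(values):
    """Descriptive statistics for one numerical variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.stdev(values),     # sample standard deviation (n - 1)
        "var": statistics.variance(values),  # sample variance (n - 1)
    }


def class_shares(labels):
    """Relative frequency of each class label, used to inspect class imbalance."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Toy data only: five page values and twenty labels with an 85/15 class split.
summary = describe([0.0, 0.0, 12.5, 0.0, 30.0])
shares = class_shares([0] * 17 + [1] * 3)
print(summary["mean"], summary["median"])  # 8.5 0.0
print(shares)                              # {0: 0.85, 1: 0.15}
```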
The data distribution for the variables i.e., Revenue, Visitor Type, Weekend, and Month are shown in
The minimum, maximum, mean, and percentile information for the numerical variables is given in the following table.
Variables of dataset | Count | Mean | Standard deviation | Minimum | Maximum | 25th percentile | 50th percentile | 75th percentile |
---|---|---|---|---|---|---|---|---|
Administrative | 12330 | 2.318 | 3.321 | 0.0 | 27 | 0.0 | 1.0 | 4.0 |
Administrative duration | 12330 | 80.91 | 176.76 | −1.0 | 3399 | 0.0 | 8.0 | 93.26 |
Informational | 12330 | 0.504 | 1.27 | 0.0 | 24 | 0.0 | 0.0 | 0.0 |
Informational duration | 12330 | 34.51 | 140.7 | −1.0 | 2549 | 0.0 | 0.0 | 0.0 |
Product related | 12330 | 31.76 | 44.46 | 0.0 | 705 | 7.0 | 18 | 38 |
Product related duration | 12330 | 1196 | 1913 | −1.0 | 63973 | 185 | 601 | 1464 |
Bounce rates | 12330 | 0.0221 | 0.048 | 0.0 | 0.2 | 0.0 | 0.003 | 0.017 |
Exit rates | 12330 | 0.043 | 0.048 | 0.0 | 0.2 | 0.014 | 0.025 | 0.05 |
Page values | 12330 | 5.889 | 18.56 | 0.0 | 361.8 | 0.0 | 0.0 | 0.0 |
Special day | 12330 | 0.061 | 0.199 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
Operating systems | 12330 | 2.124 | 0.911 | 1.0 | 8.0 | 2.0 | 2.0 | 3.0 |
Browser | 12330 | 2.36 | 1.717 | 1.0 | 13.0 | 2.0 | 2.0 | 2.0 |
Region | 12330 | 3.147 | 2.101 | 1.0 | 9.0 | 1.0 | 3.0 | 4.0 |
Traffic type | 12330 | 4.070 | 4.025 | 1.0 | 20 | 2.0 | 2.0 | 4.0 |
The correlation among the numerical variables is represented using the correlation plot as shown in
Grouping by the target label is one of the major analysis techniques in classification tasks. The descriptive statistics of the data are calculated after grouping the data based on the target label. There are two classes in the dataset, which are categorized as Label-0 (False) and Label-1 (True). The mean, maximum, minimum, standard deviation, and variance can be found for all the variables in each of the categories. The variation in the values across the categories gives an understanding of the discriminative power of each variable in distinguishing the two classes.
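A minimal sketch of the group-by-label computation is given below, restricted to the mean. The function name and the toy values are illustrative assumptions and are not taken from the dataset.

```python
from collections import defaultdict


def mean_by_label(values, labels):
    """Mean of one variable computed separately for each target label."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: sum(group) / len(group) for label, group in groups.items()}


# Toy page values and Revenue labels (0 = no purchase, 1 = purchase).
page_values = [0.0, 2.0, 4.0, 20.0, 30.0]
revenue = [0, 0, 0, 1, 1]
group_means = mean_by_label(page_values, revenue)
print(group_means)  # {0: 2.0, 1: 25.0}
```

A large gap between the two group means, as in this toy case, is the kind of variation that signals discriminative power.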
Variables of dataset | Revenue: Label-0 | Revenue: Label-1 | Variables of dataset | Revenue: Label-0 | Revenue: Label-1 |
---|---|---|---|---|---|
Administrative | 2.121 | 3.393 | Exit rates | 0.047 | 0.019 |
Region | 3.159 | 3.082 | Page values | 1.84 | 22.97 |
Informational | 0.424 | 0.739 | Special day | 0.068 | 0.023 |
Informational duration | 23.021 | 44.503 | Operating systems | 2.13 | 2.09 |
Product related | 27.59 | 44.26 | Browser | 2.34 | 2.453 |
Product related duration | 1016 | 1736 | Administrative duration | 66.327 | 108.44 |
Bounce rates | 0.025 | 0.005 | Traffic type | 4.078 | 4.021 |
ML and DL techniques [
The logistic regression (LR) model is a supervised ML model used for classification problems to find the relationship between the target variable Y and the independent variables X. The LR model predicts its output in the form of a probability. It uses a sigmoid activation function, which converts a numeric value into a probability score.
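The conversion from a numeric value to a probability can be sketched as follows. The weights, bias, and feature values are hypothetical placeholders and are not fitted to the OSI data.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # avoids overflow for large negative z
    return exp_z / (1.0 + exp_z)


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample under an LR model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for two features.
probability = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(probability)  # 0.5, because the linear score z is exactly 0
```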
The decision tree (DT) is an important ML model used for both classification and regression problems. The DT model uses a tree-shaped structure and explains its decision at each step. It places the variable with the highest information gain at the top of the tree and grows the tree by splitting on further variables downward.
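Information gain, the splitting criterion named above, is the reduction in label entropy achieved by a split. The sketch below computes it for a single binary split; the function names and toy labels are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [0, 0, 1, 1]
perfect_gain = information_gain(parent, [0, 0], [1, 1])  # children are pure
useless_gain = information_gain(parent, [0, 1], [0, 1])  # children mirror the parent
print(perfect_gain, useless_gain)  # 1.0 0.0
```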
The RF model is an ensemble of several DT models, where each DT predicts a class and the class assigned to the test sample is decided by majority voting across the trees.
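The majority-voting step can be sketched in a few lines. The list of per-tree predictions is a hypothetical input; training the individual trees is outside the scope of this example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    When two classes tie, Counter.most_common keeps the class seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Five hypothetical trees vote on one test sample.
ensemble_class = majority_vote([1, 0, 1, 1, 0])
print(ensemble_class)  # 1
```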
Similar to the RF model, there is another important ensemble model known as the extra tree (ET) classifier. The ET classifier aims for more variation within the ensemble, so its trees are built differently. Each decision stump in the ET is built using the following criteria: all the data available in the training set is used to build each stump; to form the root node, the best split is determined by searching within a subset of randomly selected features; and the split point of each selected feature is chosen at random. The maximum depth of a decision stump is 1.
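The random-split idea can be sketched as below: for each feature in a random subset, a threshold is drawn uniformly between the feature's minimum and maximum, and the candidate with the lowest weighted impurity is kept. The use of Gini impurity, the function names, and the toy data are assumptions made for illustration; the text does not name the impurity measure.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels; empty lists count as pure."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)


def random_split_stump(rows, labels, n_candidate_features, rng):
    """Pick the best of several randomly drawn (feature, threshold) splits.

    Returns (weighted_gini, feature_index, threshold), or None if every
    sampled feature is constant.
    """
    best = None
    for f in rng.sample(range(len(rows[0])), n_candidate_features):
        column = [row[f] for row in rows]
        low, high = min(column), max(column)
        if low == high:
            continue  # a constant feature cannot split the data
        threshold = rng.uniform(low, high)  # random cut point, not an optimised one
        left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[f] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f, threshold)
    return best


# Toy data: feature 0 separates the classes, feature 1 is constant.
rows = [[0.0, 5.0], [1.0, 5.0], [10.0, 5.0], [11.0, 5.0]]
labels = [0, 0, 1, 1]
stump = random_split_stump(rows, labels, n_candidate_features=2, rng=random.Random(0))
print(stump)
```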
The neural network (NN) is one of the most important DL models and is mainly used for problems of high complexity. An NN classifier is mainly used on high-dimensional data, where the algorithm is trained to extract the non-linear patterns present in the data. An NN uses hidden layers and activation functions. For a binary classification problem, the sigmoid activation is mostly used.
In this research, the AUC-ROC score, precision, recall, F1 score, accuracy, and the confusion matrix are used to measure the performance of each model. Precision is the ratio of true positives (TP) to the sum of TP and false positives (FP).
Recall is the ratio of TP to the sum of TP and false negatives (FN).
There is always a trade-off between recall and precision: sometimes precision is high while recall is low, and in other cases recall is high while precision is low. The main objective is to achieve both good precision and good recall. The F1 score, the harmonic mean of precision and recall, is a better measure for balancing the two, especially when the data is imbalanced.
Accuracy is another measure of performance. It is the ratio of correct predictions to all predictions, that is, the sum of TP and true negatives (TN) divided by the sum of TP, TN, FP, and FN.
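Written out with the symbols defined above, the four measures take their standard forms. The equation numbering of the original is not recoverable, so none is given here.

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}
\end{align*}
```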
Permutation importance (PI) [
The data is a binary classification dataset. The performance metrics such as Accuracy, Precision, Recall, and F1 Score are used as the evaluation metrics to check the performance of the classification models. Moreover, the study focused on the importance of variables in predicting the outcome that is measured by classification algorithms.
Classifier | Actual output | Predicted Label-0 | Predicted Label-1 |
---|---|---|---|
(a) LR classifier | Label-0 | 2025 | 59 |
(a) LR classifier | Label-1 | 237 | 145 |
(b) DT classifier | Label-0 | 1910 | 174 |
(b) DT classifier | Label-1 | 168 | 214 |
(c) RF classifier | Label-0 | 1997 | 87 |
(c) RF classifier | Label-1 | 162 | 220 |
(d) ET classifier | Label-0 | 2014 | 70 |
(d) ET classifier | Label-1 | 185 | 197 |

Classifiers | Precision | Recall | F1-Score | Accuracy |
---|---|---|---|---|
LR classifier | 0.80 | 0.68 | 0.71 | 0.88 |
DT classifier | 0.72 | 0.73 | 0.72 | 0.85 |
RF classifier | 0.82 | 0.77 | 0.79 | 0.90 |
ET classifier | 0.81 | 0.74 | 0.77 | 0.88 |
NN classifier | 0.72 | 0.80 | 0.74 | 0.87 |
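The sketch below shows how such scores can be derived from a confusion matrix, using the RF counts reported in panel (c). Averaging the per-class scores over both labels (macro averaging) is an assumption inferred from the numbers: it reproduces the RF row of the table, but the text does not state the averaging scheme, and other rows may not match exactly.

```python
def per_class_scores(tp, fp, fn):
    """Precision, recall and F1 for the class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_report(tn, fp, fn, tp):
    """Macro-averaged precision, recall and F1, plus accuracy, for a 2x2 matrix."""
    label_1 = per_class_scores(tp, fp, fn)  # Label-1 treated as the positive class
    label_0 = per_class_scores(tn, fn, fp)  # roles swap when Label-0 is positive
    precision, recall, f1 = ((a + b) / 2 for a, b in zip(label_1, label_0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


# RF confusion matrix: actual Label-0 -> (1997, 87), actual Label-1 -> (162, 220).
rf_scores = macro_report(tn=1997, fp=87, fn=162, tp=220)
print([round(score, 2) for score in rf_scores])  # [0.82, 0.77, 0.79, 0.9]
```

For Label-1 alone, the same matrix gives a precision of about 0.72 and a recall of about 0.58, so the averaged figures are noticeably higher than the purchase-class figures.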
The permutation importance (PI) method is used to find the variables that contribute most to correct predictions of the output variable.
(a) PI by LR classifier: variable | Importance score | (b) PI by DT classifier: variable | Importance score |
---|---|---|---|
Page values | 0.091 | Page values | 0.1224 |
Product-related duration | 0.0026 | Administrative | 0.0253 |
Administrative duration | 0.0018 | Administrative duration | 0.0238 |
Informational duration | 0.0014 | Exit rates | 0.0134 |
Operating systems | 0.0013 | Product-related duration | 0.0117 |
Visitor type | 0.001 | Bounce rates | 0.0098 |
Administrative | 0.0006 | Informational | 0.0075 |
Product related | 0.0004 | Product-related | 0.0074 |
Weekend | 0.0001 | Month | 0.0047 |
Exit rates | 0 | Traffic type | 0.0044 |
Bounce rates | 0 | Visitor type | 0.0041 |
Special day | 0 | Informational duration | 0.0030 |
Informational | −0.0001 | Region | 0.0009 |
Traffic type | −0.0002 | Operating systems | 0.0003 |
Month | −0.0008 | Special day | 0.0001 |
Browser | −0.0009 | Browser | −0.0002 |
Region | −0.0012 | Weekend | −0.0006 |

(c) PI by RF classifier: variable | Importance score | (d) PI by ET classifier: variable | Importance score |
---|---|---|---|
Page values | 0.1253 | Page values | 0.1159 |
Exit rates | 0.0093 | Bounce rates | 0.0039 |
Month | 0.0067 | Month | 0.0029 |
Product related | 0.0045 | Visitor type | 0.0025 |
Product-related duration | 0.0042 | Product-related duration | 0.0025 |
Bounce rates | 0.0013 | Product-related | 0.0021 |
Informational | 0.0012 | Exit rates | 0.0013 |
Operating systems | 0.0010 | Administrative | −0.0001 |
Administrative duration | 0.0009 | Operating systems | −0.0001 |
Browser | 0.0009 | Special day | −0.0001 |
Informational duration | 0.0007 | Browser | −0.0010 |
Special day | 0.0006 | Informational duration | −0.0010 |
Visitor type | 0.0004 | Traffic type | −0.0015 |
Region | 0.0001 | Region | −0.0016 |
Administrative | −0.0003 | Administrative duration | −0.0019 |
Weekend | −0.0005 | Informational | −0.0031 |
Traffic type | −0.0005 | Weekend | −0.0033 |
PI is an approach for finding the features that matter most to a classifier’s predictions. The importance of a variable is determined by the predictive power it contributes to the model: the values of the variable are randomly shuffled, and the resulting drop in model performance is measured. All four classifiers used to explain feature importance show that “Page Values” is the most important factor in deciding the output “Revenue”. This means that the predictions become poor when the information carried by the “Page Values” variable is destroyed. This is supported by the fact that Revenue has a correlation of 0.5 with “Page Values”, the highest among all the variables. The variation in the page values has a significant effect on the target variable.
Most users spend time on the website before they make a transaction, and the importance of the variable Page Values further strengthens the view that customers’ average time spent on the page has a strong impact on sales. On everyday e-commerce websites like Amazon and eBay, an average user spends time browsing the website before choosing a product. The product that the user would like to buy will have a higher page value to the user. The user takes extra time and care before making a transaction, and the page value increases because of the extra time the user spends reviewing the product before making the purchase. The process of reviewing the product by reading the reviews and checking product specifications adds value to the page. Although “Page Values” turned out to be an important factor, it cannot be considered the only important factor in deciding the purchase. There could be additional driving factors that make the user visit the page and make the purchase.
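The shuffling procedure behind PI can be sketched with the standard library as follows. The toy model, the data, the number of repeats, and the use of accuracy as the score are assumptions made for illustration; the text does not state which settings were used.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for f in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[f] for row in rows]
            rng.shuffle(column)  # break the link between this feature and the label
            shuffled = [row[:f] + [value] + row[f + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier that looks only at feature 0."""
    return 1 if row[0] > 5.0 else 0


rows = [[0.0, 3.0], [1.0, 9.0], [2.0, 1.0], [10.0, 4.0], [20.0, 7.0], [30.0, 2.0]]
labels = [toy_model(row) for row in rows]
scores = permutation_importance(toy_model, rows, labels)
print(scores)  # feature 0 gets a positive score, feature 1 exactly 0.0
```

Because each score is a difference between two accuracy estimates, shuffling an uninformative feature can by chance improve accuracy slightly, which is how small negative scores such as those in the tables above arise.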
The next most important factors in making the sale are “Product Related Duration”, “Administrative”, “Bounce Rates”, and “Exit Rates”. The importance of the bounce rate can be explained by the fact that customers who leave the product site immediately affect the sale. This could be due to ad links that appear on websites like Facebook and Instagram: users tend to click the ads and close the product pages instantly. Product-related duration is the average time spent by the user across all the related products. Most e-commerce shoppers spend time on the product and on related products to compare price and quality, which makes product-related duration one of the important variables. For instance, a user who intends to purchase a mobile phone from the website compares all the phones with a similar price range and specifications.
The primary objective of this research is to provide eXplainable artificial intelligence (XAI) that drives the business not just by predicting outcomes, but also by explaining the importance of each variable and identifying key features for future predictions. The variable importance approach was chosen to further explore the variables that contribute to the revenue. Although business decisions depend on several other factors, when intelligence is combined with statistics, numbers, and explanations, businesses can be more confident about the outcomes than when depending on predictions alone. This study focuses on traditional approaches for determining feature importance. Permutation feature importance is an approach where the importance of a variable is determined by randomly shuffling the values of that variable across all samples and observing the results of the experiments with the shuffled values. This study also explains the use of an NN for making predictions on unseen data. Although the NN has been observed to be good at making predictions, it cannot explain its results.
Demand for XAI has increased in recent years because the majority of organizations are interested in knowing the reasons behind the outcomes rather than only the results themselves. If the AI used in a project can explain its results, this helps businesses make crucial decisions and allows them to focus on the areas where attention is necessary. This research shows the impact of page values on the target label “Revenue”. This information could be useful for businesses, which can now focus on the areas that contribute to the page value. As there is also a positive correlation between “page value” and “revenue”, businesses can focus on the factors that help increase “page values”. This study shows the importance of the variables and draws inferences about the reasons behind their importance, and thus could help businesses grow and become popular in the e-commerce market.
In the future, the models can be deployed to the cloud to make predictions in real time, and they can also be used to build XAI solutions so that businesses can make timely decisions on the attributes that contribute to profits.
Thanks to the supervisor and co-authors for their valuable guidance and support.