With the spread of internet access, the number of people transacting online has grown rapidly in recent years, because online transaction systems remove the need to visit a bank in person for every transaction. However, fraud cases have also increased, causing financial losses to consumers. An effective fraud detection system that can detect fraudulent transactions automatically in real time is therefore needed. Genuine transactions generally far outnumber fraudulent ones, which leads to the class imbalance problem. In this research work, an online transaction fraud detection system using deep learning is proposed; it handles class imbalance by applying algorithm-level methods, which modify the learning of the model to focus more on the minority class, i.e., fraud transactions. A novel loss function named Weighted Hard-Reduced Focal Loss (WH-RFL) is proposed. It achieves the maximum fraud detection rate, i.e., True Positive Rate (TPR), at the cost of misclassifying a few genuine transactions, since a high TPR is preferred over a high True Negative Rate (TNR) in a fraud detection system; this is demonstrated on three publicly available imbalanced transactional datasets. In addition, thresholding is applied to optimize the decision threshold using cross-validation so as to detect the maximum number of frauds, and the experimental results show that selecting the right thresholding method in combination with deep learning yields better results.
Transactions are generally of two types: genuine transactions performed by the actual users and fraudulent transactions performed by fraudsters. To verify that a transaction was performed by the actual user, it is matched against the history of transactions performed by that user. If it does not match, it may be categorized as fraud. Genuine transactions far outnumber fraudulent ones. Thus, a fraud detection system should be able to detect frauds in imbalanced transaction data arriving in real time. Rule-based systems detect fraudulent transactions using rules derived from existing experience, so they capture only fraudulent patterns and behaviors that have already occurred. In such systems, the rules are pre-programmed to identify changes in patterns. These rule-based expert systems are not capable of detecting online fraudulent transactions effectively. The application of deep learning seems promising in the fraud detection domain due to the favorable results produced [
Class imbalance is inherent in various real-life applications such as credit card fraud detection and medical diagnosis, e.g., determining whether a patient has cancer. Feeding such imbalanced data into a learning system biases it towards the majority class, because classic learning algorithms are trained to maximize overall accuracy, and a high accuracy score may then be misleading about the performance of the model. Class imbalance is one of the main issues in transactional datasets. However, it is still understudied, and research on the use of deep learning to handle class imbalance in non-image data such as historical transactional data or insurance and medical claims is limited [
In this research work, a deep learning-based model is proposed for handling the class imbalance problem in online transaction fraud detection. Algorithm-level approaches, i.e., loss functions, have been explored for handling class imbalance by altering the learning of the model. A novel loss function is also proposed to maximize the fraud detection rate, i.e., TPR. Thresholding has been explored in conjunction with deep learning to optimize the decision threshold applied to the output of the model. The decision threshold of the proposed model has been optimized on validation data to achieve the maximum fraud detection rate. The experimental results demonstrate that choosing the right thresholding method yields better results. They also show that the decision threshold shifts when the learning of the model is altered, and a relationship between the class imbalance level and the decision threshold has been identified.
Class imbalance arises naturally in many real applications in which one class dominates the others in frequency. Methods for addressing the class imbalance problem fall into data-level, algorithm-level, and hybrid methods. In algorithm-level methods, the importance of the minority class is increased by adjusting the learning of the model. The approaches for modifying the backpropagation learning of a neural network include cost-sensitive classification, adaptive learning rates, new loss functions, and output threshold moving, or thresholding [
Loss functions play an important role in the learning of a neural network (NN). The loss can be regarded as the prediction error of the NN. In this research work, the following loss functions have been used to alter the learning of the NN.
A novel loss function proposed by Lin et al. [
Here,
Weighted cross-entropy loss (W-CEL) [
Sergievskiy et al. [
Here, the value of
The objective of this research work is to maximize the fraud detection rate, i.e., TPR, while maintaining the overall performance of the model. Hard positive examples, i.e., fraud transactions assigned a probability of less than 0.5, therefore need more attention in the loss. To achieve the maximum TPR, the reduced focal loss (RFL) function has been tuned by giving more weight to hard positive examples than to hard negative examples. The modified RFL is named Weighted Hard-Reduced Focal Loss (WH-RFL) because weights are applied only to the hard examples rather than to all examples, i.e., by multiplying the loss calculated for hard-positive examples by a flat weight value
Different flat weight values have been tested as multipliers for the loss of hard positive and hard negative examples, and the combination of 2 and 0.5 for weight1 and weight2 achieved the best result in this research work. The proposed loss function is compared with the other loss functions in the following example, which shows that WH-RFL gives more priority to the less frequent fraud transactions and performs better than CEL, FL, and RFL.
Loss value | CEL | FL (without α) | RFL | WH-RFL (2, 0.5) |
---|---|---|---|---|
k | CEL(k) = −log(0.3) − log(0.7) − log(0.6) − log(0.3) = 1.422 | FL(k) = −0.7^2 ∗ log(0.3) − 0.3^2 ∗ log(0.7) − 0.4^2 ∗ log(0.6) − 0.7^2 ∗ log(0.3) = 0.562 | RFL(k) = −log(0.3) − 0.3^2 ∗ log(0.7) − 0.4^2 ∗ log(0.6) − log(0.3) = 1.095 | WH-RFL(k) = −1/2 ∗ log(0.3) − 0.3^2 ∗ log(0.7) − 0.4^2 ∗ log(0.6) − 2 ∗ log(0.3) = 1.357 |
k+1 (first term changes from log(0.3) to log(0.9)) | CEL(k+1) = −log(0.9) − log(0.7) − log(0.6) − log(0.3) = 0.945 | FL(k+1) = −0.1^2 ∗ log(0.9) − 0.3^2 ∗ log(0.7) − 0.4^2 ∗ log(0.6) − 0.7^2 ∗ log(0.3) = 0.306 | RFL(k+1) = −0.1^2 ∗ log(0.9) − 0.3^2 ∗ log(0.7) − 0.4^2 ∗ log(0.6) − log(0.3) = 0.573 | WH-RFL(k+1) = −0.1^2 ∗ log(0.9) − 0.3^2 ∗ log(0.7) − 0.4^2 ∗ log(0.6) − 2 ∗ log(0.3) = 1.096 |
k+1 (fourth term changes from log(0.3) to log(0.9)) | CEL(k+1) = −log(0.3) − log(0.7) − log(0.6) − log(0.9) = 0.945 | FL(k+1) = −0.7^2 ∗ log(0.3) − 0.3^2 ∗ log(0.7) − 0.4^2 ∗ log(0.6) − 0.1^2 ∗ log(0.9) = 0.306 | RFL(k+1) = −log(0.3) − 0.3^2 ∗ log(0.7) − 0.4^2 ∗ log(0.6) − 0.1^2 ∗ log(0.9) = 0.573 | WH-RFL(k+1) = −1/2 ∗ log(0.3) − 0.3^2 ∗ log(0.7) − 0.4^2 ∗ log(0.6) − 0.1^2 ∗ log(0.9) = 0.311 |
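The arithmetic in the table above can be reproduced with a short script. This is a minimal sketch under stated assumptions rather than the authors' implementation: the per-example terms are inferred from the table itself (base-10 logarithms, γ = 2, a hardness cut-off of 0.5), the first example is taken to be a hard negative and the fourth a hard positive because of the weights 1/2 and 2 they receive in the WH-RFL column, and the function names are hypothetical.

```python
import math

GAMMA = 2.0        # focusing parameter used in the worked example
HARD_CUTOFF = 0.5  # true-class probability below this marks a "hard" example


def cel(p_t):
    """Cross-entropy term for one example (base-10 log, as in the table)."""
    return -math.log10(p_t)


def fl(p_t, gamma=GAMMA):
    """Focal-loss term without alpha: scaled by (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma * cel(p_t)


def rfl(p_t, gamma=GAMMA):
    """Reduced focal loss term: hard examples keep the full cross-entropy."""
    return cel(p_t) if p_t < HARD_CUTOFF else fl(p_t, gamma)


def wh_rfl(p_t, is_fraud, w_pos=2.0, w_neg=0.5, gamma=GAMMA):
    """WH-RFL term: only hard examples are re-weighted, frauds more heavily."""
    if p_t < HARD_CUTOFF:
        return (w_pos if is_fraud else w_neg) * cel(p_t)
    return fl(p_t, gamma)


def batch_losses(p_true, is_fraud):
    """Sum each loss over a batch of true-class probabilities."""
    return {
        "CEL": sum(cel(p) for p in p_true),
        "FL": sum(fl(p) for p in p_true),
        "RFL": sum(rfl(p) for p in p_true),
        "WH-RFL": sum(wh_rfl(p, y) for p, y in zip(p_true, is_fraud)),
    }


# Assumed labels: only the fourth example is a fraud. The labels of the two
# easy examples do not affect any of the four losses.
labels = [False, False, False, True]
at_k = batch_losses([0.3, 0.7, 0.6, 0.3], labels)
neg_fixed = batch_losses([0.9, 0.7, 0.6, 0.3], labels)  # hard negative resolved
pos_fixed = batch_losses([0.3, 0.7, 0.6, 0.9], labels)  # hard positive resolved

for name, row in [("k", at_k), ("k+1, negative", neg_fixed), ("k+1, positive", pos_fixed)]:
    print(name, {loss: round(value, 3) for loss, value in row.items()})
```

CEL, FL, and RFL give the same loss whichever hard example is resolved, whereas WH-RFL drops much further when the hard fraud example is resolved (0.311 against 1.096), which is the priority the text describes. Values agree with the table to within 0.001; the table appears to truncate in places where the script rounds.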
Thresholding adjusts the decision process to increase the importance of the minority positive class, i.e., frauds in our case, and to reduce the bias of the model towards the majority negative class, i.e., genuine transactions. The Receiver Operating Characteristic (ROC) curve has been used for thresholding. The ROC curve is generated by plotting the TPR against the False Positive Rate (FPR) over a range of decision thresholds. The decision threshold is optimized on the validation data and then applied to the probabilities predicted for unseen test data. The level of class imbalance in the training data affects the range of probabilities generated by the neural network. Thus, selecting an optimal decision threshold using validation data is a crucial component of learning from class-imbalanced data [
For each point on the ROC curve, the value of distance ‘D’ is calculated from point (0,1) as per the following formula [
Youden [
In this criterion, the value of G-Mean is maximized. Thus, the G-Mean value is checked over a range of decision thresholds to obtain maximum G-Mean.
The formula of G-Mean is $\text{G-Mean} = \sqrt{\text{TPR} \times \text{TNR}}$.
Among these three criteria, the closest-to-(0,1) criterion has been selected since it yields a higher TPR than the other two methods, as demonstrated by the experimental results.
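The three thresholding criteria can be sketched in a few lines. The sketch assumes the standard definitions, since the formulas are not reproduced here: the Euclidean distance to the point (0,1) in ROC space, Youden's J = TPR − FPR, and G-Mean = sqrt(TPR × TNR). The function names and the toy scores are hypothetical and are not taken from the experiments.

```python
import math


def roc_points(y_true, y_score):
    """Return (threshold, TPR, FPR) for every distinct score used as a cut-off."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= thr)
        points.append((thr, tp / n_pos, fp / n_neg))
    return points


def pick_threshold(points, criterion):
    """Select the threshold that optimizes the named criterion."""
    if criterion == "closest":    # smallest distance to the corner (FPR=0, TPR=1)
        score = lambda p: -math.hypot(p[2], 1.0 - p[1])
    elif criterion == "youden":   # largest J = TPR - FPR
        score = lambda p: p[1] - p[2]
    elif criterion == "gmean":    # largest sqrt(TPR * TNR)
        score = lambda p: math.sqrt(p[1] * (1.0 - p[2]))
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return max(points, key=score)[0]


# Toy validation scores: 10 frauds (label 1) and 25 genuine transactions (label 0).
y_true = [1] * 6 + [1] * 3 + [0] * 8 + [1] * 1 + [0] * 17
y_score = [0.9] * 6 + [0.5] * 11 + [0.1] * 18

points = roc_points(y_true, y_score)
chosen = {c: pick_threshold(points, c) for c in ("closest", "youden", "gmean")}
print(chosen)
```

On this toy data the closest-to-(0,1) and max-G-Mean criteria both choose the threshold 0.5 (TPR 0.9, FPR 0.32), whereas Youden's J chooses 0.9 (TPR 0.6, FPR 0). The example is constructed to show that the criteria can disagree and that Youden's J may accept a lower TPR in exchange for fewer false positives.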
Data-level class imbalance methods have been used by many researchers, who modify the data to handle class imbalance. Fu et al. [
Limited research has been done using algorithm-level class imbalance handling methods. Ghobadi et al. [
The proposed methodology aims to build a model that can train well even on imbalanced data. Deep learning has been selected since it can learn effectively even from imbalanced data. The structure of the proposed methodology is shown in
Three transactional datasets have been used, each containing genuine and fraudulent transactions in imbalanced proportions.
Name of dataset | Total features | Total transactions | Genuine transactions | Fraud transactions | Class imbalance level |
---|---|---|---|---|---|
IEEE CIS [ | 434 | 5,90,540 | 5,69,877 | 20,663 | 27.58:1 |
Banksim [ | 10 | 5,94,643 | 5,87,443 | 7,200 | 81.59:1 |
Credit Card [ | 31 | 2,84,807 | 2,84,315 | 492 | 577.88:1 |
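The class imbalance level in the last column is the ratio of genuine to fraud transactions. A minimal sketch of that calculation, using the counts from the table above (the helper name is hypothetical):

```python
# Genuine and fraud counts per dataset, copied from the dataset table.
DATASETS = {
    "IEEE CIS": (569_877, 20_663),
    "Banksim": (587_443, 7_200),
    "Credit Card": (284_315, 492),
}


def imbalance_level(genuine, fraud):
    """Genuine transactions per fraud transaction, rounded to two decimals."""
    return round(genuine / fraud, 2)


levels = {name: imbalance_level(g, f) for name, (g, f) in DATASETS.items()}
for name, level in levels.items():
    print(f"{name}: {level}:1")
```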
The preprocessing of all three datasets is explained below.
The IEEE CIS dataset is provided as two separate tables, a transaction table and an identity table, which share the feature TransactionID. The two tables have been merged on TransactionID. There is one binary target feature named isFraud, which is 0 for a legitimate transaction and 1 for a fraud transaction. The dataset has 434 features, including the transaction ID and the isFraud target, and most of them contain missing values; features with a large share of missing values have therefore been excluded. After excluding features based on their missing-value percentage, 54 features have been selected, of which 32 are numerical and 22 are categorical. Some of the continuous features are right-skewed, and a log transformation has been applied to them. After the log transformation, all numerical features have been standardized. Missing values in numerical features have been imputed with 0, and missing indicators have been added to mark the numerical values that were missing and imputed. After handling missing values in this way, the 32 numerical features grew to 49. The cardinality of the categorical features has been reduced by keeping only the most frequent categories. Missing values are treated as a separate ‘nan’ category, and the remaining less frequent categories are grouped as ‘Other’. All categorical features have then been converted into numerical features using one-hot encoding [
This dataset consists of synthetic data provided by a bank in Spain, containing transactional data for use in fraud detection research. It contains a total of 10 features. There are no duplicate transactions and no missing values in the dataset. The features zipcodeOri and zipMerchant each contain a single constant zip code and have therefore been removed. The step column is also not important and has been removed. Thus, only six features have been used as input, since the remaining fraud feature is the target that labels the type of transaction. Categorical features have been converted into numerical features by ordinal encoding [
This dataset contains transactions made with credit cards in September 2013 by European cardholders. There are no duplicate transactions and no missing values in the dataset. There are 31 columns in the dataset, and all of them have already been transformed using PCA except ‘Amount’, ‘Time’, and ‘Class’. The ‘Class’ column indicates whether a transaction is genuine or fraudulent. It has only two values, i.e., 1 (fraud) and 0 (genuine transaction). The columns ‘Amount’ and ‘Time’ have therefore been scaled. All features of the dataset have been used.
All datasets have been split into training and test sets in a stratified manner in the ratio 80:20, as shown in
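The IEEE CIS preprocessing steps can be illustrated on single columns in plain Python. This is a minimal sketch under stated assumptions and not the authors' pipeline: `log1p` is assumed as the log transformation, the number of categories kept (`top_k`) is a hypothetical parameter, and the function names and example values are invented for illustration.

```python
import math
from collections import Counter


def prepare_numeric(values):
    """Log-transform, standardize, then impute missing entries with 0.

    `values` holds non-negative numbers or None. Returns the scaled column
    and a 0/1 missing-indicator column of the same length.
    """
    logged = [None if v is None else math.log1p(v) for v in values]
    present = [x for x in logged if x is not None]
    mean = sum(present) / len(present)
    std = math.sqrt(sum((x - mean) ** 2 for x in present) / len(present)) or 1.0
    scaled = [0.0 if x is None else (x - mean) / std for x in logged]
    missing = [1 if x is None else 0 for x in logged]
    return scaled, missing


def one_hot_top_k(values, top_k):
    """Keep the top_k most frequent categories, map None to 'nan' and all
    remaining categories to 'Other', then one-hot encode the column."""
    counts = Counter(v for v in values if v is not None)
    keep = [category for category, _ in counts.most_common(top_k)]
    columns = keep + ["nan", "Other"]
    rows = []
    for v in values:
        label = "nan" if v is None else (v if v in keep else "Other")
        rows.append([1 if label == c else 0 for c in columns])
    return columns, rows


scaled, missing = prepare_numeric([0.0, None, math.e - 1])
columns, rows = one_hot_top_k(
    ["visa", "visa", "mastercard", None, "amex", "visa", "mastercard", "discover"],
    top_k=2,
)
print(scaled, missing)
print(columns)
```

Because imputation happens after standardization, the imputed 0 coincides with the column mean, and the indicator column preserves the information that the value was missing.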
Dataset | Data | Total transactions | Genuine transactions | Fraud transactions | % of frauds |
---|---|---|---|---|---|
IEEE CIS | Training | 4,72,432 | 4,55,902 | 16,530 | 3.499 |
IEEE CIS | Test | 1,18,108 | 1,13,975 | 4,133 | 3.499 |
Banksim | Training | 4,75,714 | 4,69,954 | 5,760 | 1.211 |
Banksim | Test | 1,18,929 | 1,17,489 | 1,440 | 1.211 |
Credit Card | Training | 2,26,980 | 2,26,602 | 378 | 0.167 |
Credit Card | Test | 56,746 | 56,651 | 95 | 0.167 |
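A stratified 80:20 split keeps the fraud percentage the same in the training and test sets, as the table shows. The following is a minimal sketch of such a split in plain Python; it is one straightforward way to stratify, not necessarily the routine used in the experiments, and the function name and seed are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so that each class contributes test_fraction to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # randomize within each class
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Toy data: 990 genuine (0) and 10 fraud (1) transactions.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Applying the same per-class rounding to the IEEE CIS and Banksim class totals gives the test-set counts listed in the table (for example, 20% of 20,663 frauds rounds to 4,133).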
The random search approach has been used to finalize the baseline Deep Neural Network (DNN) architecture and its hyperparameters for all three datasets. A 20% validation split has been taken from the training data to evaluate model performance and to select the best hyperparameters, i.e., the number of hidden layers, the number of neurons per hidden layer, the learning rate, batch normalization, and the dropout rate. Each set of hyperparameters was evaluated on five random, stratified 20% validation splits drawn from the training data, i.e., by training five models. The TPR was averaged over the five models, and this average TPR was compared across all sets of hyperparameters to select the set with the maximum average TPR. Once the DNN architectures for all three datasets had been finalized, the models were fit to the whole training data and then evaluated on the test data. This approach ensures that hyperparameter tuning is not influenced by the test data.
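The selection loop described above can be sketched as follows. The sketch assumes a caller-supplied function `evaluate_tpr(config, split_seed)` that trains one model on a stratified 80% of the training data and returns its validation TPR; that function, the search space, and the number of trials are hypothetical placeholders, and a stand-in scorer is used so that the example runs on its own.

```python
import random


def random_search(search_space, evaluate_tpr, n_trials=20, n_repeats=5, seed=0):
    """Sample configurations at random and keep the one with the best mean TPR."""
    rng = random.Random(seed)
    best_config, best_mean_tpr = None, -1.0
    for _ in range(n_trials):
        config = {name: rng.choice(options) for name, options in search_space.items()}
        # One model per validation split; the TPR is averaged over the splits.
        tprs = [evaluate_tpr(config, split_seed) for split_seed in range(n_repeats)]
        mean_tpr = sum(tprs) / len(tprs)
        if mean_tpr > best_mean_tpr:
            best_config, best_mean_tpr = config, mean_tpr
    return best_config, best_mean_tpr


# Hypothetical search space mirroring the hyperparameters listed in the text.
SEARCH_SPACE = {
    "hidden_layers": [1, 2, 3],
    "neurons": [16, 32, 256, 512],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_norm": [True, False],
    "dropout": [0.1, 0.3, 0.5],
}


def fake_evaluate_tpr(config, split_seed):
    """Stand-in scorer so the sketch runs without training a network."""
    return 0.75 if config["dropout"] == 0.3 else 0.5


best_config, best_mean_tpr = random_search(SEARCH_SPACE, fake_evaluate_tpr)
print(best_config, best_mean_tpr)
```

The test data never enters this loop, which is what keeps the hyperparameter choice independent of the final evaluation.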
Layer type | Number of neurons (IEEE CIS) | Number of neurons (Banksim) | Number of neurons (European) |
---|---|---|---|
Input layer | 558 | 6 | 30 |
Dense layer | 512 | 16 | 32 |
Batch normalization | | | |
ReLU activation function | | | |
Dropout (0.3) | | | |
Dense layer | 256 | 8 | 16 |
Batch normalization | | | |
ReLU activation function | | | |
Dropout (0.3) | | | |
Dense layer | 1 | 1 | 1 |
Sigmoid function | | | |
The output bias weights have been initialized in the last layer of the model with prior probability
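The exact initialization formula is not reproduced here. As an illustration only, a common choice, and an assumption on our part, is to set the output bias to the log-odds of the fraud prior so that the untrained sigmoid output equals the fraud rate of the training data. The function names are hypothetical; the counts are those of the IEEE CIS training split.

```python
import math


def initial_output_bias(n_fraud, n_total):
    """Log-odds of the fraud prior (assumed formula, not confirmed by the text)."""
    prior = n_fraud / n_total
    return math.log(prior / (1.0 - prior))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# IEEE CIS training split: 16,530 frauds among 472,432 transactions.
bias = initial_output_bias(16_530, 472_432)
print(bias, sigmoid(bias))
```

With this bias the network starts by predicting roughly the base fraud rate for every transaction instead of 0.5, which avoids spending the first epochs learning the prior.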
By default, the decision threshold of a DNN model is 0.5. Under class imbalance, the threshold needs to be adjusted. For demonstration purposes, the CEL function has been used to select the best thresholding criterion. Results have been obtained after the first epoch for all three datasets using the three thresholding criteria, i.e., closest to (0,1), Youden Index J, and max-G-Mean, as well as the default threshold, and are summarized in
Dataset | Thresholding criteria | Threshold | TPR | TNR | G-Mean | AUC-ROC | Accuracy |
---|---|---|---|---|---|---|---|
Closest to (0,1) | 0.04488 | 0.84226 | 0.81731 | 0.89153 | 0.84054 | ||
Youden Index J | 0.04991 | 0.77677 | 0.85982 | 0.81724 | 0.85691 | ||
max-G-Mean | 0.04685 | 0.78645 | 0.84948 | 0.84728 | |||
Default threshold | 0.5 | 0.32517 | 0.56937 | ||||
Closest to (0,1) | 0.13115 | 0.73687 | 0.64132 | 0.66304 | 0.73470 | ||
Youden Index J | 0.14428 | 0.41840 | 0.95874 | 0.63336 | 0.95220 | ||
max-G-Mean | 0.13967 | 0.47309 | 0.88416 | 0.87918 | |||
Default threshold | 0.5 | 0.00000 | 0.00000 | ||||
Closest to (0,1) | 0.07041 | 0.85307 | 0.81379 | 0.81085 | 0.85294 | ||
Youden Index J | 0.10494 | 0.69737 | 0.96860 | 0.82187 | 0.96815 | ||
max-G-Mean | 0.07980 | 0.75000 | 0.90426 | 0.90400 | |||
Default threshold | 0.5 | 0.22368 | 0.47294 |
As per
The range of probabilities generated by the neural network is affected by the class imbalance level of the dataset. Hence, selecting an optimal decision threshold using validation data is important when learning from class-imbalanced data. Optimal decision thresholds have been calculated using repeated stratified k-fold cross-validation with k = 5, repeated 2 times with a different randomization in each repetition. Thus, a total of 10 folds of validation data have been obtained. With 5-fold cross-validation, each fold holds out 20% of the data for validation. For each of the ten folds, thresholding has been performed to optimize the threshold using the validation probabilities predicted by the DNN model. For each model, the optimized decision threshold and the number of epochs for which the model was trained are saved, so that these values can be used to train the model on the entire training data and then to evaluate it.
Early stopping has been used to halt training before the model overfits and to save the results of the best model. During cross-validation, training of the DNN model is stopped once the maximum validation TPR, obtained by optimizing the decision threshold, has been reached. If the TPR has stopped improving but the TNR is still improving, training continues until the TNR also stops improving, i.e., the results are saved for the best TPR and TNR together.
The DNN model with the same architecture used during cross-validation has then been trained on the entire training dataset for the same number of epochs that was recorded when optimizing the decision threshold on each validation fold. The same procedure has been repeated for the results of all ten folds.
For the implementation of the neural networks, Keras, an open-source library written in Python, has been used. The proposed methodology has been carried out with all loss functions. The test results have been generated using the optimal thresholds calculated by cross-validation, and the best test result among all folds has been selected. For FL, W-FL, and WH-RFL, the value of γ is fixed at 2, as it gives the best results [
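The fold structure described above (k = 5, two repetitions, ten validation folds in total) can be sketched in plain Python. This is an illustrative generator written for this purpose, not the routine used in the experiments; the function name and seed handling are hypothetical.

```python
import random


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=2, seed=0):
    """Yield (train_idx, val_idx) pairs, n_splits * n_repeats in total.

    Within each repetition, the examples of every class are shuffled and
    dealt round-robin into the folds, so each fold keeps the class ratio.
    """
    for repeat in range(n_repeats):
        rng = random.Random(seed + repeat)      # different shuffle per repetition
        by_class = {}
        for idx, y in enumerate(labels):
            by_class.setdefault(y, []).append(idx)
        fold_of = [0] * len(labels)
        for idxs in by_class.values():
            rng.shuffle(idxs)
            for position, idx in enumerate(idxs):
                fold_of[idx] = position % n_splits
        for fold in range(n_splits):
            val_idx = [i for i, f in enumerate(fold_of) if f == fold]
            train_idx = [i for i, f in enumerate(fold_of) if f != fold]
            yield train_idx, val_idx


# Toy data: 95 genuine (0) and 5 fraud (1) transactions.
labels = [0] * 95 + [1] * 5
folds = list(repeated_stratified_kfold(labels))
print(len(folds), [len(val) for _, val in folds])
```

Each of the ten validation folds holds 20% of the data and the same share of frauds, so a threshold optimized on one fold is comparable with the thresholds from the others.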
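The stopping rule amounts to ranking epochs first by validation TPR and then by validation TNR. A minimal sketch of that rule follows; the `patience` parameter, which controls how long training waits for an improvement, is an assumption that is not specified in the text, and the function names and the validation history are hypothetical.

```python
def best_epoch(history):
    """Return the 1-based epoch with the highest TPR, ties broken by TNR.

    `history` is a list of (tpr, tnr) pairs, one per epoch, measured on the
    validation fold at the optimized decision threshold.
    """
    best = 0
    for idx, (tpr, tnr) in enumerate(history):
        best_tpr, best_tnr = history[best]
        if tpr > best_tpr or (tpr == best_tpr and tnr > best_tnr):
            best = idx
    return best + 1


def should_stop(history, patience=2):
    """Stop once neither TPR nor (at equal TPR) TNR improved for `patience` epochs."""
    return len(history) - best_epoch(history) >= patience


# Hypothetical validation history: TPR plateaus at 0.85 while TNR still improves.
history = [(0.80, 0.90), (0.85, 0.88), (0.85, 0.91), (0.84, 0.95), (0.85, 0.91)]
print(best_epoch(history), should_stop(history))
```

In this history the TPR peaks at epoch 2, but epoch 3 is kept because it matches that TPR with a higher TNR; the later epochs bring no further gain, so training stops.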
Loss function | Threshold | No of epochs | TPR | TNR | G-Mean | AUC-ROC | Accuracy | |
---|---|---|---|---|---|---|---|---|
0.01412 | 60 | 0.91144 | 0.96510 | |||||
0.29216 | 38 | 0.91289 | 0.87255 | 0.89250 | 0.95965 | 0.87396 | ||
without α | 0.15961 | 44 | 0.91483 | 0.88288 | 0.89871 | 0.96494 | 0.88400 | |
α = 0.10 | 0.08690 | 39 | 0.91023 | 0.87809 | 0.89402 | 0.96250 | 0.87921 | |
α = 0.25 | 0.12964 | 30 | 0.90636 | 0.87413 | 0.89010 | 0.96047 | 0.87526 | |
α = 0.50 | 0.14887 | 53 | 0.91628 | 0.88475 | 0.90038 | 0.96658 | 0.88585 | |
α = 0.75 | 0.21887 | 36 | 0.91241 | 0.88713 | 0.89968 | 0.96414 | 0.88801 | |
α = 0.90 | 0.30214 | 30 | 0.91289 | 0.86482 | 0.88853 | 0.96022 | 0.86650 | |
0.39576 | 41 | 0.91604 | 0.86350 | 0.88938 | 0.96112 | 0.86534 | ||
0.15375 | 42 | 0.91217 | 0.88806 | 0.90003 | 0.96476 | 0.88890 | ||
0.17909 | 53 | 0.88824 | 0.90239 | 0.88924 |
Loss function | Threshold | No of epochs | TPR | TNR | G-Mean | AUC-ROC | Accuracy | |
---|---|---|---|---|---|---|---|---|
0.00358 | 22 | 0.90625 | 0.91714 | 0.91168 | 0.95602 | 0.91496 | ||
0.27689 | 10 | 0.91180 | 0.90352 | 0.90765 | 0.95293 | 0.90518 | ||
without α | 0.11795 | 119 | 0.92777 | 0.91255 | 0.92013 | 0.91559 | ||
α = 0.10 | 0.05973 | 27 | 0.90902 | 0.91327 | 0.91114 | 0.96454 | 0.91242 | |
α = 0.25 | 0.08144 | 49 | 0.91250 | 0.90863 | 0.91056 | 0.96531 | 0.90940 | |
α = 0.50 | 0.12115 | 35 | 0.90833 | 0.91360 | 0.96099 | |||
α = 0.75 | 0.16604 | 38 | 0.91736 | 0.90182 | 0.90955 | 0.95926 | 0.90493 | |
α = 0.90 | 0.24950 | 10 | 0.90972 | 0.90116 | 0.90543 | 0.95182 | 0.90287 | |
0.39133 | 13 | 0.91111 | 0.90572 | 0.90841 | 0.95380 | 0.90680 | ||
0.09578 | 49 | 0.92708 | 0.88460 | 0.90559 | 0.96389 | 0.89310 | ||
0.12836 | 88 | 0.90594 | 0.97325 | 0.91253 |
Loss function | Threshold | No of epochs | TPR | TNR | G-Mean | AUC-ROC | Accuracy | |
---|---|---|---|---|---|---|---|---|
0.00028 | 27 | 0.90526 | 0.92150 | 0.91334 | 0.97412 | 0.92147 | ||
0.08353 | 24 | 0.91578 | 0.90669 | 0.91122 | 0.97836 | 0.90671 | ||
without α | 0.06129 | 21 | 0.91578 | 0.90406 | 0.90990 | 0.96696 | 0.90408 | |
α = 0.10 | 0.01918 | 205 | 0.90526 | 0.94859 | 0.92667 | 0.96878 | 0.94852 | |
α = 0.25 | 0.04957 | 36 | 0.89473 | 0.92532 | 0.97328 | |||
α = 0.50 | 0.06129 | 21 | 0.91578 | 0.90407 | 0.90991 | 0.96696 | 0.90409 | |
α = 0.75 | 0.08673 | 74 | 0.93470 | 0.97786 | 0.93469 | |||
α = 0.90 | 0.12120 | 49 | 0.93218 | 0.92924 | 0.97462 | 0.93217 | ||
0.24121 | 50 | 0.88768 | 0.90679 | 0.97401 | 0.88774 | |||
0.05947 | 67 | 0.90526 | 0.94413 | 0.92449 | 0.94406 | |||
0.07262 | 52 | 0.93119 | 0.92875 | 0.97764 | 0.93118 |
It has been found that, for all three datasets, the decision thresholds optimized with every loss function depend on the class imbalance level of the dataset. As per
For all three datasets, the proposed loss function, i.e., WH-RFL, has achieved the maximum TPR at the cost of a small decrease in TNR compared with the other loss functions. For the Credit Card dataset, W-FL and α-balanced FL have also achieved an equivalent TPR. CEL has achieved the maximum TNR for the IEEE CIS dataset, and α-balanced FL has achieved the maximum TNR for the Banksim and European datasets. CEL has achieved the maximum G-Mean for the IEEE CIS dataset, WH-RFL for the Banksim dataset, and α-balanced FL for the European dataset. Thus, the maximum G-Mean does not ensure the maximum TPR. For the IEEE CIS dataset, WH-RFL has achieved the maximum AUC-ROC score. For the Banksim dataset, FL without the α parameter has achieved the maximum AUC-ROC score, and for the Credit Card dataset, RFL has achieved the maximum AUC-ROC score. Hence, a large AUC-ROC score does not ensure a large TPR. Also, accuracy is high whenever TNR is high for all three datasets. Hence, a highly accurate model cannot be considered better when the data is imbalanced.
From the experimental results, it is evident that the proposed loss function detects the maximum number of fraud transactions at the cost of misclassifying a few genuine transactions, since a high TPR is preferred over a high TNR in a fraud detection system. It has also been demonstrated that a large value of the threshold-independent performance metric AUC-ROC does not necessarily mean a high TPR, as the ROC curve is sensitive to the class imbalance problem. It should also be mentioned that this is the first study to optimize the decision threshold for maximizing the TPR, as previous research works have tried to maximize the threshold-independent AUC-ROC score to evaluate the performance of their models.
In this research work, a DNN-based methodology has been proposed to detect frauds in online transactions by applying algorithm-level class imbalance techniques and by further improving the fraud detection rate through optimization of the decision threshold on the validation data. A focal loss variant, the Reduced Focal Loss (RFL), has also been used, and it has been demonstrated that the TPR achieved by modifying RFL, i.e., by the proposed Weighted Hard-Reduced Focal Loss (WH-RFL), is superior to that of the CEL, FL and RFL loss functions. It has been demonstrated that selecting the optimal decision threshold yields better results with deep learning. It is also evident that altering the learning of the model shifts the decision threshold automatically towards the desired results. Thus, the proposed methodology can perform well with a large amount of data and is able to address the class imbalance problem without modifying the data.
The proposed methodology, combined with the proposed loss function, can be applied in other domains such as disease detection in healthcare and anomaly detection, since class imbalance is an inherent problem in these domains. The proposed methodology can also be explored with other algorithm-level methods that handle the class imbalance problem by giving more priority to minority class examples.
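Why accuracy follows TNR under imbalance can be checked numerically. The sketch below takes the TPR and TNR of the Youden Index J row in the first block of the thresholding table (0.77677 and 0.85982) together with the IEEE CIS test split (4,133 frauds, 1,13,975 genuine transactions). That this block belongs to IEEE CIS is an assumption, because the dataset labels are missing from that table; the reproduced G-Mean and accuracy are consistent with it. The function name is hypothetical.

```python
import math


def gmean_and_accuracy(tpr, tnr, n_fraud, n_genuine):
    """G-Mean and accuracy implied by the two rates and the class counts."""
    gmean = math.sqrt(tpr * tnr)
    accuracy = (tpr * n_fraud + tnr * n_genuine) / (n_fraud + n_genuine)
    return gmean, accuracy


N_FRAUD, N_GENUINE = 4_133, 113_975  # IEEE CIS test split

gmean, accuracy = gmean_and_accuracy(0.77677, 0.85982, N_FRAUD, N_GENUINE)
print(round(gmean, 5), round(accuracy, 5))

# A model that flags nothing: TPR = 0, TNR = 1.
blind_gmean, blind_accuracy = gmean_and_accuracy(0.0, 1.0, N_FRAUD, N_GENUINE)
print(blind_gmean, round(blind_accuracy, 5))
```

The first call reproduces the table's G-Mean of 0.81724 and accuracy of 0.85691. The second shows that a model which detects no fraud at all still reaches an accuracy of about 0.965, because genuine transactions make up about 96.5% of the test data, while its G-Mean is 0.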