Outlier detection is a key research area in data mining technologies, as outlier detection can identify data inconsistent within a data set. Outlier detection aims to find an abnormal data size from a large data size and has been applied in many fields including fraud detection, network intrusion detection, disaster prediction, medical diagnosis, public security, and image processing. While outlier detection has been widely applied in real systems, its effectiveness is challenged by higher dimensions and redundant data attributes, leading to detection errors and complicated calculations. The prevalence of mixed data is a current issue for outlier detection algorithms. An outlier detection method of mixed data based on neighborhood combinatorial entropy is studied to improve outlier detection performance by reducing data dimension using an attribute reduction algorithm. The significance of attributes is determined, and fewer influencing attributes are removed based on neighborhood combinatorial entropy. Outlier detection is conducted using the algorithm of local outlier factor. The proposed outlier detection method can be applied effectively in numerical and mixed multidimensional data using neighborhood combinatorial entropy. In the experimental part of this paper, we give a comparison on outlier detection before and after attribute reduction. In a comparative analysis, we give results of the enhanced outlier detection accuracy by removing the fewer influencing attributes in numerical and mixed multidimensional data.

Outlier detection is frequently researched in the field of data mining technologies and is aimed at identifying abnormal data from a data set [

Statistical methods require the presence of a single feature and predicable data distribution. In real environment, many data distributions are unpredictable and multi-featured. In addition, their applications are restricted [

Many studies [

This study aims to use a novel approach by combining attribute reduction with outlier detection technology to improve the outlier detection accuracy, reduce calculation complexity, remove fewer influencing attributes and reduce data dimensions. Data-processing plays a significant role in attribute reduction by using experimental multi-type data to achieve an accuracy boost for outlier detection. Using the neighborhood combinatorial entropy model, we construct mixed data outlier detection algorithms after defining the local outlier factor (LOF) algorithm, neighborhood combinatorial entropy algorithm and relevant concepts. Comparative data analysis is conducted to verify the detection accuracy after attribute reduction.

The rest of the paper is structured as follows. Section 2 provides the LOF algorithm, providing related definitions. Section 3 discusses the neighborhood combinatorial entropy algorithm, providing related definitions. Section 4 constructs the mixed data outlier detection algorithms based on neighborhood combinatorial entropy. Section 5 carries out the experimental analysis on the advantages of attribute reduction prior to outlier detection. Finally, Section 6 concludes the paper.

The LOF algorithm is a representative outlier detection algorithm that computes local density deviation [

where

where ^{.}

To determine the value of

The

As indicated in

If the object

The LOF algorithm examines whether the point is an outlier, which is carried out by comparing the local density of each object

The neighborhood combinatorial entropy calculation follows two aspects: neighborhood approximate accuracy and neighborhood conditional entropy. The neighborhood approximate accuracy is the ability to divide the system from the set perspective [

The information system is an aspect of data mining, expressed in a four-dimensional form

In addition, attributes are categorized into conditional and decision attributes. Therefore, attribute set

When neighborhood range is set manually,

In

where

The calculations on the distance between two objects are different, which depend on attribute types [

where

By eliminating the influence of dimension and value range between the attributes, numerical attributes are pre-processed into dimensionless data where each attribute value falls within the range of [0, 1]

where

Categorical data are converted into numerical data, and their values are determined per their categories. From

where

When processing different data categories, corresponding formulas are selected. Subsequently, the distance of the mixed data is obtained by applying

Their relationship is expressed in

As shown in

Since

From

Though

From the above analysis,

First, we set an attribute subset as an empty set. Then, we compare all conditional attributes

From an information theory perspective, an entropy value reflects the importance of a condition attribute in an information system in determining whether the condition attribute is reduced. Entropy has the following properties: non-negativity, i.e.,

In Algorithm 1, the LOF algorithm focuses on estimating Euclidean distance. Higher data dimension implies more squares and longer calculation time. In the LOF algorithm, a lower data dimension saves significant calculation time.

Algorithm 1 has limited capability for mixed data outlier detection. Algorithm 2 adopts different calculation methods for different attribute. Categorical and mixed data participate in data operation. After data treatment with Algorithm 3, attribute-reduced data are obtained and processed with LOF algorithm (Algorithm 1).

Attribute reduction filters out redundant data and shortens the calculation time without altering system applicability as some attributes are redundant and cause errors in the system. Filtering out errors improves the system’s accuracy. In Algorithm 2, neighborhood combinatorial entropy determines the significance of a conditional attribute in the attribute subset of the entire system. Applying Algorithms 2, 3 and the LOF algorithm for data processing reduces the data size processed by outlier detection, improve the outlier detection accuracy, and reduce misjudgment proportion.

First, 100 2-dimensional data of normal distribution are first randomly generated. To verify the LOF algorithm detection ability, the 100 random data are translated separately to form two clusters of data widely apart from one another. Hence, 200 data are generated. As presented in

A total of 200 8-dimensional dense data and 20 8-dimensional outlier data are randomly generated. The method generating the data is as same as that generating the randomly 2-dimensional data.

A total of 200 16-dimensional dense data and 20 16-dimensional outlier data are generated as the outlined in

The Wisconsin breast cancer data set contains 699 objects with 10 attributes. From the 10 attributes, 9 attributes are numerical conditional attributes, and 1 is a decision attribute. All cases are in two categories: benign (458 cases) and malignant (241 cases). Some malignant examples are removed to form an unbalanced distribution of all the data, assisting the experimental data in accordance with the cases of few outliers in practical applications. The data set contains 444 benign instances (91.93%) and 39 malignant instances (8.07%).

The annealing data set contains 798 objects with 38 attributes. Among the 38 attributes, 30 are classified condition attributes, 7 are numerical condition attributes, and 1 is decision attribute. The 798 objects are in 5 categories. The first category contains 608 objects (76.19%), and the remaining 4 categories contain 190 objects (23.81%). Classes 2–5 are all regarded as outliers. The distribution is visualized in

Data set | Number of objects | Number of attributes | Attribute type |
---|---|---|---|

Randomly generated 8-dimensional data | 220 | 8 | Numerical |

Randomly generated 16-dimensional data | 220 | 16 | Numerical |

Wisconsin breast cancer data set | 483 | 9 | Numerical |

Annealing data set | 798 | 37 | Mixed |

There are 20 outliers in the randomly generated 2-dimensional data set. Given the number of outliers as 20, the objects corresponding to the first 20 largest values of local outlier factor in

For the 8-dimensional data, among the 20 detected outlier objects, only 18 are artificially outliers, and the others are misjudged. The experimental results are not as promising as the 2-dimensional data. In the 16-dimensional data, only 17 of the 20 artificially set objects are identified as outliers, and the others are misjudged.

Data set | Number of data | Number of attributes | Accuracy of detection (%) |
---|---|---|---|

Randomly 2-dimensional data | 220 | 2 | 100 |

Randomly 8-dimensional data | 220 | 8 | 90 |

Randomly 16-dimensional data | 220 | 16 | 85 |

As shown in

The Wisconsin breast cancer data set comprises 39 presupposed outliers. Among the first 39 objects with the largest local outlier factor calculated by the LOF algorithm, 34 are real outliers, and the other 5 are misjudged dense points. The annealing data set has 190 outliers, of which 72 outliers are real outliers, and the others are misjudged.

Data set | Real outliers | Detected outliers | False outliers detected | Accuracy of detection (%) |
---|---|---|---|---|

Wisconsin breast cancer data set | 39 | 34 | 5 | 87.18 |

Annealing data set | 190 | 72 | 118 | 37.89 |

Four multidimensional data sets (excluding 2-dimensional data) separately underwent data attribute reduction in Algorithm 2. In the process, the attribute subset is determined by calculating neighborhood combinatorial entropy. After attribute reduction, the attribute number of the randomly generated 8-dimensional data is still 8. By comparing the number of attributes, randomly generated 16-dimensional data reduced to 15, and 6 for the Wisconsin breast cancer data set, and 21 for the annealing data set. The variations of their attributes before and after reduction are illustrated in

From

Algorithm 1 is adopted to detect outliers from the data sets with reduced attributes. The results are listed in

Data set | Real outliers | Detected outliers | False outliers detected | Accuracy after reduction (%) |
---|---|---|---|---|

Randomly 16-dimensional data | 20 | 17 | 3 | 85.00 |

Wisconsin breast cancer data set | 39 | 36 | 3 | 92.31 |

Annealing data set | 190 | 88 | 102 | 46.32 |

As revealed in

The accuracy of the randomly generated 16-dimensional data remains stable before and after attribute reduction, revealing that the attribute filtering rate is minimal and close to zero, which influences the system. Thus, detection accuracy is intact through attribute reduction.

This study combined an attribute reduction algorithm with the LOF algorithm to improve the accuracy of outlier detection for highly dimension numerical and mixed data. Following the algorithms of attribute reduction and the LOF outlier detection, this study analyzed the monotonicity of the attribute reduction algorithm and explained the application of the data processed in outlier detection. We performed the experiments to apply the LOF algorithm on data sets of different dimensions, to assess its performance by calculating its detection accuracy. We also inserted significance determination and attribute reduction algorithms before the LOF algorithm for the five data sets and calculated new detection accuracy. Finally, we compared the feasibility and effectiveness of the two groups of data to assess their detection accuracy before and after data attribute reduction. In comparison, the proposed method reduced data dimensions, calculation time, and improved the outlier detection accuracy when attribute reduction algorithm was combined with outlier detection technology.