Qianqian Li, Meng Li, Lei Guo*, Zhen Zhang
Journal of Information Hiding and Privacy Protection, Vol.2, No.4, pp. 199-205, 2020, DOI:10.32604/jihpp.2020.016299
Published: 07 January 2021
Abstract On-site programming big data refers to the massive data generated during software development, characterized by real-time arrival, complexity, and high processing difficulty. Data cleaning is therefore essential for on-site
programming big data. Duplicate data detection is an important step in data
cleaning, which saves storage resources and improves data consistency. To
address the shortcomings of the traditional Sorted Neighborhood Method (SNM)
and the difficulty of detecting duplicates in high-dimensional data, an
optimized algorithm based on random forests with a dynamic, adaptive window
size is proposed.
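To make the baseline concrete, the following is a minimal sketch of the classic Sorted Neighborhood Method with a naive adaptive window: records are sorted by a key, each record is compared only to neighbors inside a sliding window, and the window grows when a match appears at its edge. This is an illustration only; the `similarity` function, thresholds, and window-adaptation rule here are assumptions, and the paper's random-forest classifier is not reproduced.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Simple string similarity in [0, 1]; stands in for the paper's
    # random-forest match classifier (an assumption of this sketch).
    return SequenceMatcher(None, a, b).ratio()

def snm_detect(records, key=lambda r: r, init_window=3,
               max_window=10, threshold=0.85):
    """Basic SNM with a crude adaptive window (illustrative only).

    Sort records by a key, then compare each record only to the
    neighbors inside a sliding window. The window expands when a match
    sits at its far edge (suggesting duplicates may lie just outside)
    and contracts back toward init_window otherwise.
    """
    sorted_recs = sorted(records, key=key)
    duplicates = set()
    w = init_window
    for i in range(len(sorted_recs)):
        matched_at_edge = False
        for j in range(i + 1, min(i + w, len(sorted_recs))):
            if similarity(key(sorted_recs[i]), key(sorted_recs[j])) >= threshold:
                duplicates.add((sorted_recs[i], sorted_recs[j]))
                if j - i >= w - 1:
                    matched_at_edge = True
        # Adapt the window: grow if a duplicate was found at the edge,
        # otherwise shrink back toward the initial size.
        w = min(w + 1, max_window) if matched_at_edge else max(init_window, w - 1)
    return duplicates
```

Sorting brings likely duplicates near each other, so the pairwise-comparison cost drops from O(n^2) to roughly O(n * w); a window that is too small misses duplicates that sort far apart, which is the weakness an adaptive window size targets.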