Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering for Noisy Data
1 Graduate University of Science and Technology, Vietnam Academy of Science and Technology, Hanoi, 100000, Vietnam
2 Institute of Information Technology, Vietnam Academy of Science and Technology, Hanoi, 100000, Vietnam
3 VNU Information Technology Institute, Vietnam National University, Hanoi, 100000, Vietnam
4 Department of Mathematics, University of New Mexico, Gallup, 87301, New Mexico, USA
5 University of Information and Communication Technology, Thai Nguyen University, Thai Nguyen, 250000, Vietnam
6 Faculty of Computer Science and Engineering, Thuyloi University, Hanoi, 100000, Vietnam
* Corresponding Author: Tran Thi Ngan. Email:
Computer Systems Science and Engineering 2023, 46(2), 1981-1997. https://doi.org/10.32604/csse.2023.035692
Received 31 August 2022; Accepted 14 December 2022; Issue published 09 February 2023
Abstract
Clustering is a crucial method for deciphering data structure and producing new information. Because it helps reveal fundamental connections between the human brain and events, clustering is essential for cognitive research. Dealing with noisy data caused by inaccurate synthesis from several sources or misleading data production processes is one of the most intriguing clustering difficulties. Noisy data can lead to incorrect object recognition and inference. This research proposes a novel clustering approach, named Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering (PNTS3FCM), to solve the clustering problem with noisy data using the neutral and refusal degrees in the definitions of the Picture Fuzzy Set (PFS) and Neutrosophic Set (NS). Our contribution is a new optimization model with four essential components: clustering, outlier removal, safe semi-supervised fuzzy clustering and partitioning with labeled and unlabeled data. The effectiveness and flexibility of the proposed technique are evaluated and compared with the state-of-the-art methods, standard Picture fuzzy clustering (FC-PFS) and Confidence-weighted safe semi-supervised clustering (CS3FCM), on benchmark UCI datasets. The experimental results show that our method outperforms the compared methods on at least 10 of 15 datasets in terms of clustering quality and computational time.
The discovery of underlying connections between the human brain and events has made the development of sophisticated clustering algorithms popular in cognitive research [1,2]. Dealing with noisy data is one of the most intriguing clustering difficulties. Incorrect, noisy data that degrade the quality of results appear in many applications, such as satellite images [3], medical image processing [4,5] and control systems [6].
Semi-supervised fuzzy clustering techniques were introduced with additional information provided by users [7–9] to enhance the range of applications and the quality of clusters. The differences in incorporating various forms of supplementary information were demonstrated in [10], which provided a summary of semi-supervised fuzzy clustering techniques. Accordingly, object segmentation using semi-supervised fuzzy clustering is effective as long as proper supplementary information, also known as “safe information”, and clean data are supplied. However, real-world data are frequently unreliable, noisy and inaccurate. These situations require more effective clustering methods.
The safe semi-supervised fuzzy clustering approach introduced in [11–13] is the typical method to deal with safe information in semi-supervised fuzzy clustering. The strategy has two primary phases. The confidence weights for labeled data are calculated in the first phase. Then, in the second phase, the high confidence weights are used to generate and identify the cluster centers and fuzzy membership values under the labeled data. The Safe semi-supervised Fuzzy C-Means clustering (S3FCM) approach was first presented in [11]. By balancing semi-supervised and unsupervised clustering, this technique investigated the incorrectly classified data. A local homogeneous graph was employed in the first phase in [12]. By utilizing this graph, the Local Homogeneous Consistent Safe SSFCM (LHC-S3FCM) method performed effectively on datasets with a large percentage of incorrectly categorized data. CS3FCM, an enhanced safe semi-supervised clustering model based on confidence weights, was proposed in [13]. Assuming each data sample has its own safe confidence weight, this approach provides good results in minimizing the negative impact of incorrectly labeled samples on the clustering process.
To establish the safe level of each sample in the data set, Guo et al. [14] have recently suggested a safe semi-supervised clustering with a safe degree. Based on the safe degree value, the model provides the essential procedures to reduce the adverse effects of risk in both labeled and unlabeled samples. Despite performing better than other approaches when dealing with “safe information”, safe semi-supervised fuzzy clustering algorithms still cannot solve the challenge of clustering inaccurate, noisy data. Partitioning noisy data can lead to incorrect object detection and inference. Data points that are isolated or lie at the edge of some clusters are considered noisy. It is therefore necessary to improve safe semi-supervised fuzzy clustering algorithms so that they can deal with noisy data.
This research aims to develop a new clustering method that removes noise from data and improves clustering performance. The method integrates semi-supervised clustering with the picture fuzzy set [15]. The PFS [3], together with the Neutrosophic set [16], provides four membership degrees: the positive degree, neutral degree, negative degree and refusal degree. Noisy data typically have a high refusal degree. Additionally, the neutral degree is used to determine the data points on the boundary of clusters. PFS can therefore be used to identify noisy data in datasets.
Based on the original Fuzzy C-Means (FCM) model, the picture fuzzy clustering algorithm (FC-PFS) introduced in [17] outperforms other fuzzy clustering techniques in terms of average clustering indices such as mean accuracy and computational time. As an extension of collaborative distributed fuzzy clustering (CDFCM) [18] on PFS, a form of FC-PFS for distributed computing known as DPFCM was demonstrated in [19]. As stated in that paper, semi-supervised clustering using distributed and cloud computing is a strategy to reduce computational time and increase clustering quality. Wu and Chen presented an adaptive picture fuzzy clustering technique based on entropy weight [20]. This approach improved accuracy, addressed noisy data in image segmentation and overcame the time-consuming limitation of existing picture fuzzy clustering algorithms. Two practical, robust picture fuzzy clustering techniques for decreasing computational time were also introduced [21,22]. Nonetheless, those fuzzy clustering algorithms struggle to manage both “safe information” and “noisy data”, because noise in the labeled data seriously degrades the clustering quality.
To enhance “safe information” and reduce the effect of “noisy data”, Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering (PNTS3FCM) is introduced. This is a new technique to address the issue of partitioning data with noisy information. The PNTS3FCM approach brings picture fuzzy and neutrosophic set concepts into semi-supervised fuzzy clustering with a safe information procedure. The research proposes a new optimization model consisting of four essential components: a clustering component, an outlier-handling component, and two safe semi-supervised fuzzy clustering components, one for labeled data and one for unlabeled data. The first two parts are taken from FC-PFS, and the last two are new parts that enhance safe information and reduce the effect of noisy data. An iterative technique derived from the formulation is also provided to construct the cluster centers and memberships. In fact, the survey has revealed a new field of study: safe semi-supervised clustering on the picture fuzzy set. To compare PNTS3FCM with other available methods on benchmark datasets, two similar algorithms, FC-PFS [17] and CS3FCM [13], are chosen.
The remainder of the paper is structured as follows: Section 2 offers the essential information underpinning our study. The proposed approach is introduced in Section 3 and the experimental results are presented in Section 4. Some conclusions are given in the last section.
In this section, some fundamental concepts and methods of semi-supervised clustering are presented, including safe semi-supervised clustering, the picture fuzzy set and picture fuzzy clustering.
2.1 Safe Semi-Supervised Clustering
Safe semi-supervised fuzzy clustering approaches, including S3FCM [11], LHC-S3FCM [12] and CS3FCM [13], were proposed by Gan et al. Herein, we present the fundamentals of these approaches.
For S3FCM, consider the dataset
with:
where
The below function calculates the center
On the other hand, the LHC-S3FCM [12] is expected to deal with wrong labels from additional information. The objective function is defined as follows:
with the constraints:
Therefore, the cluster centers
Another approach of FCM is Confidence-weighted Safe Semi-supervised Clustering (CS3FCM) [13] by using confidence weights. The confidence weights show various effects of samples on performance degradation. The following is the goal:
with
The methods of Gan et al. [11–13] (S3FCM, LHC-S3FCM, CS3FCM) achieved good clustering accuracy. However, outliers in the data can still distort the determination of the final clusters.
2.2 Picture Fuzzy Set and Picture Fuzzy Clustering
By generalizing the fuzzy set in [9] and the intuitionistic fuzzy set [23], Cuong et al. introduced the picture fuzzy set [15] in 2014, which has the following form:
where
Then, the refusal degree is computed by function:
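For reference, the standard definition from [15] can be written as below. The symbols $\mu_A$, $\eta_A$, $\gamma_A$ and $\xi_A$ (positive, neutral, negative and refusal degrees) follow the usual convention in the picture fuzzy set literature and are assumed here.

```latex
A = \bigl\{ \langle x, \mu_A(x), \eta_A(x), \gamma_A(x) \rangle \mid x \in X \bigr\},
\qquad \mu_A(x),\ \eta_A(x),\ \gamma_A(x) \in [0, 1],
\qquad \mu_A(x) + \eta_A(x) + \gamma_A(x) \le 1,

\xi_A(x) = 1 - \bigl( \mu_A(x) + \eta_A(x) + \gamma_A(x) \bigr).
```

With $\eta_A(x) = 0$ for all $x$, the definition reduces to the intuitionistic fuzzy set [23].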
FC-PFS [17] aims to group the data into clusters and reduce the effect of outliers through the concept of entropy, using the following objective function:
with the constraints:
The values of
For the above objective function, the cluster centers
In [15], the refusal degree
where
The detailed steps for the FC-PFS algorithm are shown below.
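For orientation, the alternating update loop of plain Fuzzy C-Means, the base model that FC-PFS extends [17], can be sketched as follows. This is not the FC-PFS algorithm itself: it has no neutral or refusal degrees and no entropy term. The function name `fcm`, its parameters and the stopping rule are illustrative assumptions.

```python
import math
import random


def fcm(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Plain Fuzzy C-Means sketch; returns (memberships, centers)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])

    centers = []
    for _ in range(max_iter):
        # Center update: mean of the data weighted by u_kj ** m.
        centers = []
        for j in range(c):
            w = [u[k][j] ** m for k in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[k] * points[k][d] for k in range(n)) / total
                 for d in range(dim)]
            )

        # Membership update: normalised inverse-distance weights.
        change = 0.0
        for k in range(n):
            dist = [max(math.dist(points[k], centers[j]), 1e-12)
                    for j in range(c)]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            s = sum(inv)
            new_row = [v / s for v in inv]
            change = max(change,
                         max(abs(a - b) for a, b in zip(new_row, u[k])))
            u[k] = new_row

        # Stop once memberships no longer move appreciably.
        if change < tol:
            break
    return u, centers
```

Picture fuzzy clustering keeps the same alternating structure but updates several degrees per data point and cluster in each pass instead of a single membership value.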
3 The Proposed Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering
The idea behind the proposed method (PNTS3FCM) is to combine PFS with safe semi-supervised fuzzy clustering by introducing a novel objective function with four primary components. The first and second components are adopted from the original picture fuzzy clustering method [17]. The last two are the semi-supervised components, which guide the clustering process using labeled and unlabeled data. The main idea is represented in Fig. 1 and the detailed steps are described in Fig. 2.
Fig. 1 illustrates the method and concept, in which the input data are provided to the PNTS3FCM block. Through the use of picture fuzzy degrees, the first step of PNTS3FCM reduces the distance between data elements and cluster centers. The second step, also from the picture fuzzy set model, processes the “noisy data” by integrating the entropy quantity between the neutral and refusal degrees. The refusal degree plays an important role in reducing the effect of noisy data in the objective function because, following [17], noisy data have higher refusal values.
To deal with “safe information”, the last two stages coordinate the safe semi-supervised fuzzy clustering using both labeled and unlabeled data. PNTS3FCM has two phases. First, FC-PFS is used to partition all data and obtain a clustering result with positive, neutral and refusal values. The second phase uses all data with these values to partition the data again, achieving better clustering quality by enhancing safe information and reducing the influence of noisy data.
The technique produces final clusters that are reliable and confident. We will discuss the formulation and algorithm for this concept in the next section.
As illustrated by the main idea above, this section will describe the details of the proposed model. The objective function is stated by the following formula:
With the constraints
where data set
• The first part represents fuzzy clustering on the PFS.
• The second part represents entropy information which helps to reduce noisy data through the neutral and refusal degrees of a data point.
• The third part is the component for labeled data elements, where
The denominator
• Finally, the fourth part is the component of the unlabeled data elements, where the numerator is the same as the first part and the denominator
• The additional information for semi-supervised fuzzy clustering is the prior picture membership degrees. We use the original FC-PFS algorithm to cluster all data, including labeled and unlabeled data. From that, we calculate four values
Using the Lagrangian method, the optimal solutions to the stated problem are presented in Eqs. (20)–(24) below.
The positive degree u of the labeled data elements is
The positive degree u of the unlabeled data elements is
Other degrees are shown below:
Details of the PNTS3FCM algorithm are below.
Advantages of the PNTS3FCM algorithm:
a) PNTS3FCM has better clustering quality than related methods such as FC-PFS and CS3FCM because it can handle noisy data.
b) PNTS3FCM produces more information about the clusters, such as the cluster centers and the picture fuzzy degrees (positive, neutral, negative, refusal). It deals with both “safe information” and “noisy data”.
c) PNTS3FCM combines three major concepts: safe clustering, semi-supervised clustering and the picture fuzzy set. This combination is the first attempt in the literature aimed at practical problems.
Disadvantages of the PNTS3FCM algorithm:
a) PNTS3FCM takes more computational time than the other algorithms due to the calculation of two additional parts in the objective function (24).
b) The model contains many parameters which need to be tuned in some real-world applications.
4.1 Environmental Configuration
The experiments are performed on a Core i5-powered HP laptop using the C programming language. The selected benchmark UCI datasets [24] are described in Table 1. Outlier Detection DataSets (ODDS) [25] are given in Table 2.
Experiments are executed to compare the proposed PNTS3FCM approach with the state-of-the-art methods CS3FCM [13] and FC-PFS [17]. The evaluation criteria are the classification accuracy (CA), the computing time (CT) and the clustering quality indicators DB, PBM and ASWC [26]. The CT is the amount of time needed to complete the computation. Value CT is computed as in (25).
where
where
The value of ASWC is computed by Eq. (27).
where
The value of PBM [26] is determined by:
where
The DB [27] is determined by (29)
where
The average value and standard deviation value in experimental results are denoted as Ave and STD Dev, respectively.
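As a concrete illustration of one of the quality indicators above, the DB (Davies-Bouldin) index can be computed from hard cluster labels as in the following minimal sketch. It uses the standard textbook definition (the mean, over clusters, of the worst ratio of within-cluster scatter to centroid separation), which is assumed to match Eq. (29). The function name `davies_bouldin` and the pure-Python data layout are illustrative choices, not the authors' C implementation.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index for hard labels; lower values are better."""
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    keys = sorted(clusters)
    dim = len(points[0])

    # Centroid of each cluster.
    centroids = {
        k: [sum(p[d] for p in clusters[k]) / len(clusters[k])
            for d in range(dim)]
        for k in keys
    }
    # Scatter: mean distance from the members to their centroid.
    scatter = {
        k: sum(math.dist(p, centroids[k]) for p in clusters[k])
           / len(clusters[k])
        for k in keys
    }

    # For each cluster, take the worst (largest) similarity ratio
    # against any other cluster, then average over all clusters.
    total = 0.0
    for i in keys:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys if j != i
        )
    return total / len(keys)
```

The sketch needs at least two clusters with distinct centroids. For fuzzy results such as those of PNTS3FCM, one simple option is to assign each point to the cluster with the largest positive degree before calling it.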
Herein, the proposed method is assessed by classification accuracy in two situations: on all data and on labeled data only. The experimental results are presented for each case in turn.
Evaluation by classification accuracy on all data
Using all the data elements of the 15 datasets, the classification accuracy of PNTS3FCM, FC-PFS and CS3FCM is calculated and presented as follows. Table 3 shows the classification accuracy on all data for the datasets without outliers.
As shown in Table 3, PNTS3FCM gets the best CA on 7/9 datasets (all except Australian and WDBC). FC-PFS does not achieve the highest CA on any dataset. CS3FCM is the best model on 2/9 datasets (Australian, WDBC).
From the results in Table 4, PNTS3FCM gives the best classification results on 4 out of 6 datasets (Glass, Yeast, Vertebral, Ionosphere). FC-PFS is better only on the Wine dataset, and CS3FCM is better only on the Ecoli dataset.
Summary: In the evaluation by classification accuracy on all data, covering both outlier and non-outlier datasets (15 datasets), PNTS3FCM is the best on 11 datasets (Balance-scale, Dermatology, Heart, Iris, Spambase, Tae, Waveform, Glass, Yeast, Vertebral, Ionosphere). FC-PFS is the best model on the Wine dataset. CS3FCM is the best model on three datasets (Australian, WDBC, Ecoli).
Evaluation by classification accuracy on labeled data
Using the labeled data elements of the 15 datasets, the classification accuracy (CA) of PNTS3FCM, FC-PFS and CS3FCM is calculated and presented as follows. Table 5 shows the classification accuracy on labeled data for the datasets without outliers.
In Table 5, PNTS3FCM gets the best CA on 7/9 datasets (all except Iris and WDBC). FC-PFS does not achieve the highest value on any dataset. CS3FCM is the best model on 2/9 datasets (Iris, WDBC). As shown in Table 6, PNTS3FCM shows the highest values on 4/6 datasets (Ecoli, Glass, Yeast, Vertebral). FC-PFS does not achieve the highest CA on any dataset. CS3FCM is the best model on 2/6 datasets (Wine, Ionosphere).
Summary: In the evaluation by classification accuracy on labeled data, covering both outlier and non-outlier datasets (15 datasets), PNTS3FCM has the best results on 11 datasets (Australian, Balance-scale, Dermatology, Heart, Spambase, Tae, Waveform, Ecoli, Glass, Yeast, Vertebral). FC-PFS does not achieve the highest CA on any dataset. CS3FCM is the best model on four datasets (Iris, WDBC, Wine, Ionosphere).
4.2.2 Evaluation by Clustering Quality
Summary: As shown in Table 7, in the evaluation of clustering quality by the DB index on all data, covering both outlier and non-outlier datasets (15 datasets), PNTS3FCM gets the best results on ten datasets (Australian, Dermatology, Heart, Tae, Waveform, WDBC, Ecoli, Glass, Yeast, Wine). FC-PFS is the best model on three datasets (Iris, Vertebral, Ionosphere). CS3FCM is the best model on two datasets (Balance-scale, Spambase). This indicates that the proposed method gives better clustering quality than the others on both outlier and non-outlier data.
4.2.3 Evaluation by Computational Time (in seconds)
We compare PNTS3FCM and CS3FCM on 15 datasets using computational time. Table 8 shows the computational time on the datasets without outliers.
Summary: In the evaluation of computational time on all data, covering both outlier and non-outlier datasets (15 datasets), PNTS3FCM has better results on nine datasets. CS3FCM is the best model on six datasets. The proposed method seems to perform better when the number of clusters is larger. Two factors contribute to the better results. First, the proposed method is based on the picture fuzzy set, which carries more information for reducing noise or hesitation when partitioning data. Second, PNTS3FCM has a safe semi-supervised part for labeled and unlabeled data that can cope with doubtfully labeled data and reduce their influence on the clustering process.
This research proposed a novel technique called Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering (PNTS3FCM) to address the issue of clustering data with high confidence in the presence of noisy information. PNTS3FCM is constructed by combining Picture Fuzzy Sets (PFS), Neutrosophic Sets and safe semi-supervised fuzzy clustering. The method consists of four critical parts: the clustering part, the outlier-handling part, and the safe semi-supervised fuzzy clustering parts for labeled and for unlabeled data. Through the use of PFS, the first stage of PNTS3FCM aims to reduce the distance between data elements and cluster centers. The model’s second stage processes the “noisy data” by integrating the entropy quantity between the neutral and refusal degrees. The third and fourth stages coordinate the safe semi-supervised fuzzy clustering using both labeled and unlabeled data to handle the safe information. We also provide an iterative technique derived from the formulation to construct the cluster centers and memberships. The method produces final clusters that are reliable and confident.
PNTS3FCM has demonstrated its effectiveness in a comparison with two related methods, FC-PFS and CS3FCM. The experimental results show that PNTS3FCM is better than the others in terms of computational time and clustering quality. Even though the proposed PNTS3FCM mainly focuses on eliminating or reducing noisy data elements, the method still has some limitations. First, PNTS3FCM takes a long time to compute. Second, it requires a larger number of parameters. In the future, an effective optimization algorithm will be studied and introduced to overcome these limitations.
Acknowledgement: We are grateful for the support from the staff of the Institute of Information Technology, Vietnam Academy of Science and Technology.
Funding Statement: This research is funded by Graduate University of Science and Technology under grant number GUST.STS.ĐT2020-TT01.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
References
1. X. Ji, S. Liu, P. Zhao, X. Li and Q. Liu, “Clustering ensemble based on sample’s certainty,” Cognitive Computation, vol. 13, no. 4, pp. 1034–1046, 2021.
2. J. Zhang, H. Wang, S. Huang, T. Li, P. Jin et al., “Co-adjustment learning for co-clustering,” Cognitive Computation, vol. 13, no. 2, pp. 504–517, 2021.
3. P. H. Thong, “Some novel hybrid forecast methods based on picture fuzzy clustering for weather nowcasting from satellite image sequences,” Applied Intelligence, vol. 46, no. 1, pp. 1–15, 2017.
4. A. Khosravanian, M. Rahmanimanesh, P. Keshavarzi and S. Mozaffari, “Fuzzy local intensity clustering (FLIC) model for automatic medical image segmentation,” The Visual Computer, vol. 37, no. 5, pp. 1185–1206, 2021.
5. S. A. Kumar, B. S. Harish and V. M. Aradhya, “A picture fuzzy clustering approach for brain tumor segmentation,” in 2016 Second Int. Conf. on Cognitive Computing and Information Processing (CCIP), Mysuru, India, pp. 1–6, 2016.
6. G. Bode, T. Schreiber, M. Baranski and D. Müller, “A time series clustering approach for building automation and control systems,” Applied Energy, vol. 238, no. 11, pp. 1337–1345, 2019.
7. J. C. Bezdek, Pattern recognition with fuzzy objective function algorithms. New York: Plenum Press. http://dx.doi.org/10.1007/978-1-4757-0450-1.
8. N. Grira, M. Crucianu and N. Boujemaa, “Active semi-supervised fuzzy clustering,” Pattern Recognition, vol. 41, no. 5, pp. 1834–1844, 2008.
9. L. A. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, no. 3, pp. 338–353, 1965.
10. P. H. Thong and L. H. Son, “An overview of semi-supervised fuzzy clustering algorithms,” International Journal of Engineering and Technology, vol. 8, no. 4, pp. 301–306, 2016.
11. H. Gan, “Safe semi-supervised fuzzy C-means clustering,” IEEE Access, vol. 7, pp. 95659–95664, 2019.
12. H. Gan, Y. Fan, Z. Luo and Q. Zhang, “Local homogeneous consistent safe semi-supervised clustering,” Expert Systems with Applications, vol. 97, pp. 384–393, 2018.
13. H. Gan, Y. Fan, Z. Luo, R. Huang and Z. Yang, “Confidence-weighted safe semi-supervised clustering,” Engineering Applications of Artificial Intelligence, vol. 81, pp. 107–116, 2019.
14. L. Guo, H. Gan, S. Xia, X. Xu and T. Zhou, “Joint exploring of risky labeled and unlabeled samples for safe semi-supervised clustering,” Expert Systems with Applications, vol. 176, pp. 114796–114803, 2021.
15. B. C. Cuong and V. Kreinovich, “Picture fuzzy sets,” Journal of Computer Science and Cybernetics, vol. 30, no. 4, pp. 409–420, 2014.
16. F. Smarandache, Neutrosophy: Neutrosophic probability, set and logic: analytic synthesis & synthetic analysis. Santa Fe, Rehoboth, MA, USA: American Research Press, 1998.
17. P. H. Thong and L. H. Son, “Picture fuzzy clustering: A new computational intelligence method,” Soft Computing, vol. 20, no. 9, pp. 3549–3562, 2016.
18. J. Zhou, C. P. Chen, L. Chen and H. X. Li, “A collaborative fuzzy clustering algorithm in distributed network environments,” IEEE Transactions on Fuzzy Systems, vol. 22, no. 6, pp. 1443–1456, 2013.
19. L. H. Son, “DPFCM: A novel distributed picture fuzzy clustering method on picture fuzzy sets,” Expert Systems with Applications, vol. 42, no. 1, pp. 51–66, 2015.
20. C. Wu and Y. Chen, “Adaptive entropy weighted picture fuzzy clustering algorithm with spatial information for image segmentation,” Applied Soft Computing, vol. 86, no. 4, pp. 105888–105927, 2020.
21. C. Wu and Z. Kang, “Robust entropy-based symmetric regularized picture fuzzy clustering for image segmentation,” Digital Signal Processing, vol. 110, no. 1, pp. 102905–102933, 2021.
22. C. Wu and N. Liu, “Suppressed robust picture fuzzy clustering for image segmentation,” Soft Computing, vol. 25, no. 5, pp. 3751–3774, 2021.
23. K. Atanassov, “Intuitionistic fuzzy sets,” International Journal Bioautomation, vol. 20, pp. S1–S6, 2016.
24. UCI Machine learning repository, “Data,” 2021. [Online]. Available: https://archive.ics.uci.edu/ml/index.php.
25. Outlier detection datasets (ODDS), “Data,” 2016. [Online]. Available: http://odds.cs.stonybrook.edu/.
26. L. Vendramin, R. J. Campello and E. R. Hruschka, “Relative clustering validity criteria: A comparative overview,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 3, no. 4, pp. 209–235, 2010.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.