With the widespread adoption of collaborative software development, processing the code data generated in programming scenarios has become a research hotspot. In collaborative programming, different users can submit code in a distributed way. The syntactic consistency of the code can be enforced through syntax constraints. However, when different users work on the same code, differences in their development practices inevitably lead to data semantic conflicts. This paper considers the characteristics of code segment data in a programming scenario. A code sequence is obtained by decomposing a code segment using lexical analysis. Following the traditional formulation of the data conflict problem, the code sequence is taken as the declared value object in data conflict resolution. Through similarity analysis of code sequence objects, the concept of the deviation degree between the declared value object and the truth value object is proposed. On this basis, a multi-truth discovery algorithm, called the multiple truth discovery algorithm based on deviation (MTDD), is proposed. MTDD is compared with baseline methods, namely Conflict Resolution on Heterogeneous Data (CRH), Voting-K, and MTRuths_Greedy, to verify its performance and precision.
With the increase in complexity and scale in software development, a conflict between high demand and low efficiency arises. The application of real-time collaborative programming technology and various collaborative programming technologies can enable multiple users to develop and upload software based on their respective collaborative sites [
This paper mainly studies the data semantic conflicts that arise in a programming scenario when multiple users implement the function of the same code segment.
Recently, numerous research works have appeared that are focused on the truth discovery technology of the data cleaning field in both industry and academia [
In summary, most existing research is not suited to data conflicts in the programming field. Current research on multi-truth discovery considers only the reliability of the data source and of the declared value, and ignores the support of the declared value. In practice, different users have different amounts of coding experience, so the quality of the submitted code varies. In addition, the programming habits of different users lead to differences in code length and code fragments. Therefore, the support of the declared value cannot be ignored in multi-truth discovery. This paper attempts to solve the multi-truth discovery problem in a programming scenario by taking the support of the declared value into account. The main contributions of this paper are as follows.
The characteristics of multi-source code data are combined to construct a multi-truth discovery problem model, and the corresponding optimization problems are proposed.
The deviation degree between claims is defined based on the support of the claim and the quality of the data source, and the convergence rate of the function is optimized.
The rest of this paper is organized as follows. Section 2 formulates the multi-truth discovery problem and presents the MTDD algorithm. Experiments and results are presented in Section 3. Conclusions are presented in Section 4.
First, the concepts involved in the multi-truth discovery problem are defined.
Definition 1:
Definition 2:
Definition 3:
Definition 4:
Definition 5:
All notations used in this paper are listed in
Notation | Meaning |
---|---|
Collection of objects | |
Collection of code data sources | |
Collection of data source quality | |
Collection of declared values | |
Quality of | |
Collection of declared values of | |
Collection of truth of | |
Claim of | |
The problem of multi-truth discovery through the definition of the multi-truth situation can be formulated as follows:
The objective function is the weighted sum of the deviation between the declared value of the data source and the standard true value. When the deviation between the obtained true value and standard value of the conflicting dataset reaches the minimum, the obtained truth vector is closest to the standard true value.
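The weighted-sum structure of this objective can be illustrated with a minimal sketch. The function names, the dictionary layout, and the use of a Jaccard distance between token sets are assumptions made for illustration only; they stand in for the deviation measure defined later and are not the formulation of this paper.

```python
def jaccard_distance(a, b):
    """Stand-in distance (an assumption): 1 - |A ∩ B| / |A ∪ B| on token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def weighted_objective(weights, claims, truths, distance=jaccard_distance):
    """Sum over sources s and objects o of w_s * distance(claim of s on o, truth of o).

    weights: {source: quality weight}
    claims:  {source: {object: declared value as a token sequence}}
    truths:  {object: candidate truth as a token sequence}
    """
    total = 0.0
    for source, per_object in claims.items():
        for obj, claim in per_object.items():
            total += weights[source] * distance(claim, truths[obj])
    return total


# Hypothetical toy data: two sources declare a value for one object.
claims = {"s1": {"o1": ["a", "b"]}, "s2": {"o1": ["a", "c"]}}
weights = {"s1": 1.0, "s2": 0.5}
truths = {"o1": ["a", "b"]}
objective_value = weighted_objective(weights, claims, truths)
```

A candidate truth vector with a smaller objective value agrees better with the claims of the highly weighted sources.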
In the process of truth discovery, it is generally assumed that if the quality of the data source is high, the probability that the provided claim is true would be high [
If the distance between the truth and the claim that a data source provides for an object is large, the quality of that data source is low; conversely, a small distance indicates a high-quality data source. The following formula is used to calculate the data source quality:
Thus, the weight of a data source decreases as the distance between its claims and the truth increases, and its value is calculated by the logarithmic function above.
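As a rough illustration of such a logarithmic weighting, the sketch below uses the negative logarithm of each source's share of the total distance, a form commonly used in CRH-style frameworks. This specific form, the function name `source_weights`, and the smoothing constant `eps` are assumptions and may differ from the formula used in this paper.

```python
import math


def source_weights(total_distance, eps=1e-9):
    """Map each source's accumulated distance to a weight via -log(share).

    total_distance: {source: sum of distances between its claims and the truths}
    eps: small smoothing constant (an assumption) that avoids log(0).
    """
    denom = sum(total_distance.values()) + eps * len(total_distance)
    return {s: -math.log((d + eps) / denom) for s, d in total_distance.items()}


# Hypothetical accumulated distances for two sources.
w = source_weights({"a": 1.0, "b": 3.0})
```

The source with the smaller accumulated distance (`"a"`) receives the larger weight, which matches the qualitative behaviour described above.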
(a) Loss function
In the multi-truth discovery problem for a programming site, the data characteristics of the code block are considered first, and the loss function is then determined. The declared values of the data sources are collected, and the difference in length between the declared values provided by different data sources is taken into account. The offset distance is calculated as follows:
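To make the idea of a length-aware distance between code sequences concrete, the following sketch tokenizes Python source with the standard `tokenize` module and computes a Levenshtein distance normalized by the longer sequence. Both the choice of Python as the submitted language and the normalized edit distance are assumptions for illustration; they are not the offset distance defined in this paper, and the names `code_sequence` and `offset_distance` are hypothetical.

```python
import io
import tokenize

# Token types that carry layout only and are dropped from the code sequence.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
        tokenize.ENDMARKER, tokenize.COMMENT}


def code_sequence(source):
    """Split a Python code segment into a sequence of token strings."""
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type not in SKIP]


def offset_distance(seq_a, seq_b):
    """Levenshtein distance between token sequences, divided by the longer length."""
    if not seq_a and not seq_b:
        return 0.0
    prev = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, start=1):
        curr = [i]
        for j, b in enumerate(seq_b, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a token
                            curr[j - 1] + 1,      # insert a token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1] / max(len(seq_a), len(seq_b))


seq_1 = code_sequence("x = a + b\n")
seq_2 = code_sequence("x = b + a  # swapped operands\n")
d = offset_distance(seq_1, seq_2)
```

Dividing by the longer sequence keeps the distance in the range 0 to 1, so that long and short submissions remain comparable.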
(b) Claim support
In collaborative programming, the code submitted by different users differs in both quantity and quality. It is therefore necessary to use an asymmetric support calculation to compute the support of a declared value in the multi-truth case, as given in the following equation:
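The asymmetry can be illustrated with a simple directed coverage measure: the support that one claim gives to another is the fraction of the latter's distinct tokens that the former also contains. This measure and the names `directed_support` and `claim_support` are assumptions chosen for illustration, not the support equation of this paper.

```python
def directed_support(supporter, claim):
    """Fraction of the distinct tokens in `claim` that also occur in `supporter`."""
    claim_tokens = set(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & set(supporter)) / len(claim_tokens)


def claim_support(claim, other_claims):
    """Mean directed support that `claim` receives from the other claims."""
    if not other_claims:
        return 0.0
    return sum(directed_support(o, claim) for o in other_claims) / len(other_claims)


# Hypothetical claims of different lengths for the same object.
short_claim = ["a", "b"]
long_claim = ["a", "b", "c", "d"]
```

Here the long claim fully supports the short one, while the short claim covers only half of the long one, so the two directions yield different values.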
(c) Claim deviation
In the collaborative programming environment, the claim support and the loss function are combined to calculate the deviation as follows:
Assuming that the concept of high cohesion and low coupling is strictly followed in the process of software collaborative programming, the different code segments are independent of each other. Then, the objective function corresponding to each object can be converted as follows:
In the multi-truth discovery problem, the quality of a data source is determined by the deviation of the claims it provides. The degree of support between claims, which appears in the definition of the deviation degree, is fixed. The key to solving the optimization model is therefore the choice of the reference truth. In this paper, the reference truth is selected by enumeration. When the set of candidates exceeds a certain threshold, enumeration can no longer meet the needs of real-time truth discovery. Therefore, in the iterative process, the declared value that minimizes the objective value is selected as the reference true value for subsequent iterations.
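The selection step described above can be sketched as follows for a single object: each declared value is tried as the reference truth, and the one with the smallest weighted total distance to all claims is kept. The function name, the data layout, and the stand-in distance (size of the symmetric difference of token sets) are assumptions for illustration only.

```python
def token_set_distance(a, b):
    """Stand-in distance (an assumption): size of the symmetric difference."""
    return len(set(a) ^ set(b))


def select_reference_truth(claims_by_source, weights, distance=token_set_distance):
    """Return the declared value that minimizes the weighted total distance.

    claims_by_source: {source: declared value for one object}
    weights:          {source: current quality weight}
    """
    best_claim, best_cost = None, float("inf")
    for candidate in claims_by_source.values():
        cost = sum(weights[s] * distance(claim, candidate)
                   for s, claim in claims_by_source.items())
        if cost < best_cost:
            best_claim, best_cost = candidate, cost
    return best_claim


# Hypothetical claims from three sources for one object.
claims = {"s1": ("a", "b"), "s2": ("a", "b"), "s3": ("x", "y")}
equal = select_reference_truth(claims, {"s1": 1.0, "s2": 1.0, "s3": 1.0})
skewed = select_reference_truth(claims, {"s1": 1.0, "s2": 1.0, "s3": 10.0})
```

With equal weights the majority claim is selected; when one source carries a much larger weight, its claim becomes the reference truth instead. Restricting the candidates to the declared values keeps each iteration linear in the number of claims per candidate, in contrast to enumerating all possible truth sets.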
In this section, the proposed method is compared with the existing multi-truth methods from the following three aspects [
Precision: Proportion of the truth values returned by the algorithm that belong to the standard set:
Recall rate: Proportion of the truth values in the standard set that are returned by the algorithm:
F-score: Harmonic mean of the precision and recall rate:
where
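As an illustration of the three metrics, the following minimal sketch computes them using the standard set-based definitions. The function name `precision_recall_f1` and the sample values are hypothetical.

```python
def precision_recall_f1(returned, standard):
    """Set-based precision, recall, and F1 of returned truth values."""
    returned, standard = set(returned), set(standard)
    correct = len(returned & standard)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(standard) if standard else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 2 of 3 returned values are correct, 2 of 4 standard values are found.
p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"a", "b", "d", "e"})
```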
Voting-K: For the multi-truth case, Voting-K selects the declared value as the true value when the voting proportion exceeds the K value.
CRH: For the multi-truth case, the CRH algorithm is an algorithm based on the probability distribution.
MTRuths_Greedy: The MTRuths algorithm is a truth discovery algorithm that uses a greedy strategy.
MTDD: A multi-truth discovery based on the deviation degree; see Section 2 for details.
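The Voting-K baseline is simple enough to sketch directly. The version below is a minimal illustration under the assumption that each source contributes at most one vote per value; the function name `voting_k` and the sample data are hypothetical.

```python
from collections import Counter


def voting_k(claims_by_source, k):
    """Return the values claimed by more than a fraction k of the sources.

    claims_by_source: {source: iterable of declared values for one object}
    k: voting threshold between 0 and 1, e.g. 0.5 for Voting-50%.
    """
    n_sources = len(claims_by_source)
    if n_sources == 0:
        return set()
    # Count each value once per source, even if a source repeats it.
    votes = Counter(v for values in claims_by_source.values() for v in set(values))
    return {v for v, count in votes.items() if count / n_sources > k}


# Hypothetical claims from four sources for one object.
claims = {"s1": ["A", "B"], "s2": ["A"], "s3": ["A", "C"], "s4": ["B"]}
```

Raising K keeps only widely agreed values, which tends to raise precision while lowering recall.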
BOOK: Taking the data characteristics in the programming site into account, the BOOK dataset [
MOVIE: Data on 2,000 movies, collected from 10 different video sites including Tencent Video, iQiyi Video, and Douban, are used as a dataset. The MOVIE dataset contains 23,968 different director names and 11,365 movie entities. The dataset is processed in the same way as the BOOK dataset, and the processed dataset is used as the test set; 100 sample instances are randomly selected and labeled as the standard set.
All the experiments are implemented in an environment with an Intel
For the multi-truth problem, the corresponding adjustments are made to the baseline methods. For the Voting-K method, the threshold K is set, and all attribute values in which the voting proportion exceeds the K value are considered the true values. In this experiment, the precision, recall rate, and F1-score of Voting-K, MTRuths_Greedy, CRH, and MTDD are compared for the virtual dataset. The analysis results are shown in
Methods | Precision | Recall | F1 |
---|---|---|---|
MTRuths_Greedy | 0.8548 | 0.8333 | 0.8439 |
CRH | 0.8441 | 0.8172 | 0.8304 |
Voting-50% | 0.8602 | 0.6452 | 0.7373 |
Voting-70% | 0.9247 | 0.4032 | 0.5615 |
MTDD | 0.8565 | 0.8217 | 0.8387 |
The running times of the Voting-50%, MTRuths_Greedy, CRH, and MTDD algorithms were compared at the same dataset scale, as given in
Method | Runtime (s) |
---|---|
MTRuths_Greedy | 5.83 |
CRH | 6.12 |
Voting-50% | 1.47 |
MTDD | 13.54 |
As shown in
The convergence condition of the algorithms is as follows: starting from the second iteration, the cosine similarity between the data source quality vectors of two successive iterations is computed to measure how much the results have changed. The higher the similarity, the smaller the change. When the change falls below a given threshold, the iteration stops.
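This stopping rule can be sketched as follows. The function names and the default tolerance `tol` are assumptions for illustration; the threshold used in the experiments is not stated here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def has_converged(prev_weights, curr_weights, tol=1e-4):
    """Stop when the change (1 - cosine similarity) drops below `tol`."""
    return 1.0 - cosine_similarity(prev_weights, curr_weights) < tol


# Hypothetical quality vectors from two successive iterations.
stable = has_converged([0.9, 0.5, 0.1], [0.9, 0.5, 0.1])
changed = has_converged([1.0, 0.0], [0.0, 1.0])
```

Because cosine similarity ignores the overall scale of the weight vector, this check reacts only to changes in the relative quality of the sources.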
It can be seen from
In online collaborative software development, the large-scale code data of a programming site raises several challenges that must be solved. The code submitted by different users can be semantically inconsistent; that is, data semantic conflicts arise. According to the data characteristics of the code segment, the problem is defined as a multi-truth discovery problem. The MTDD algorithm is then proposed to convert the multi-truth discovery problem into an optimization problem, in which the truth value set obtained should minimize the weighted deviation from different object sets. The support between different declared values and data is considered when calculating the truth value, and the optimal truth value is obtained through an optimization method. This method is slightly better than the existing multi-truth discovery methods in terms of accuracy and converges well.