In recent years, graph representation learning has played a major role in node clustering, node classification, link prediction, and related tasks, and many strong models and methods have emerged. These methods achieve good results when a model is trained and evaluated on data from a single network space. In real scenarios, however, solving cross-domain problems across multiple information networks is both practical and important, and existing methods do not transfer to cross-domain settings, so we study cross-domain representation based on the fusion of multiple network spaces. This paper conducts representation learning research for cross-domain scenarios. First, we use different network representation learning methods to learn representations in a single network space. Second, we use an attention mechanism to fuse the representations from different spaces into a joint representation of multiple network spaces. Finally, the model is validated through cross-domain experiments. The experimental results show that the fusion model proposed in this paper improves performance in cross-domain scenarios.
Today, information networks are widely used in academia and industry; examples include social and communication networks, media relation networks, and publication networks. These networks range in scale from hundreds of nodes to millions or even billions of nodes. The analysis and aggregation of the information in these networks, and their applications in scenarios such as node classification, node clustering, link prediction, intelligent transportation [
Under normal circumstances, the models we build are limited to a single space, so each model is tied to that space. The models built in that space, and the conclusions and metrics drawn from their predictions, are generally reliable, but only within that space. At present, many network representation learning methods are optimized only for the features they focus on. For example, GCN focuses on aggregating neighbor information, and metapath2vec focuses on aggregating information across different relation types. When the content a method relies on is missing, or its characteristic features are not pronounced enough, its performance falls short of expectations.
To solve the cross-domain representation problem, we test a multi-network space fusion approach: the representations produced by the selected models and methods on different representative datasets are fused using a neural network.
We applied several models and methods that are representative in these scenarios, conducted experiments with them, and performed fusion operations on their outputs. We fused GCN [
For cross-domain problems, we construct different domain graphs by removing a certain proportion of edges from the original graph, and then use the model trained on the complete original graph to perform network embedding on the newly constructed domain graphs. The K-Nearest Neighbor (KNN) classifier method [
In general, the models and network spaces we test can be framed as an information network embedding problem, so the overall definition starts from information network embedding [
Definition 1: A homogeneous information network is defined as
Because we use homogeneous graphs for experimentation and testing, a conversion step is involved. When the dataset we use is constructed into a graph, the result is a heterogeneous graph containing all node types and all edge types. We need to use metapath [
Definition 2: Based on the description in Definition 1, the heterogeneous information network refers to the type of object
At the same time, to better define metapath, we introduce a concept of Network Schema, which is defined as Definition 3:
Definition 3: Network schema is defined as
Based on Definitions 2 and 3, we give the definition of metapath as Definition 4.
Definition 4: metapath P is defined on the network schema
The schematic diagram of metapath is shown by
As is well known, different network embedding models applied to the same network space produce different results. Each network embedding model has its own characteristics and performs well in the scenarios it was designed for. In real applications, however, as the sensing environment and application scenario change, the performance of a trained model degrades sharply: its generalization ability is insufficient, and retraining is slow and costly. This is the cross-domain problem. To solve it, we introduce a neural network model with an attention mechanism: the result vectors produced by the different network embedding models are first stacked horizontally and then fused.
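A minimal sketch of this stack-then-fuse idea is shown below. This is not the exact MCR architecture; the one-layer scoring parameters `w`, `b`, `q` are illustrative stand-ins for weights that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_embeddings(embeddings, w, b, q):
    """Fuse per-model node embeddings with a simple attention score.

    embeddings: list of (n_nodes, dim) arrays, one per base model.
    w, b, q: parameters of a one-layer scoring network (learned in
    practice; fixed here only for illustration).
    """
    stacked = np.stack(embeddings, axis=1)           # (n_nodes, n_models, dim)
    scores = np.tanh(stacked @ w + b) @ q            # (n_nodes, n_models)
    alpha = softmax(scores, axis=1)                  # weight per base model
    return (alpha[..., None] * stacked).sum(axis=1)  # (n_nodes, dim)

# three hypothetical base embeddings (e.g. from GCN, DeepWalk, metapath2vec)
embs = [rng.normal(size=(5, 8)) for _ in range(3)]
w = rng.normal(size=(8, 8)); b = np.zeros(8); q = rng.normal(size=8)
fused = fuse_embeddings(embs, w, b, q)
print(fused.shape)  # (5, 8)
```

In training, the scoring parameters would be optimized end to end together with the downstream task loss, so the attention weights learn which base model to trust per node.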
First, we explain GCN [
Here,
The GCN is built by stacking multiple convolutional layers of the form above. The model alleviates overfitting to the local neighborhood structure of graphs with very wide node degree distributions, such as social networks, citation networks, and many other real-world graph datasets. However, its shortcomings are also prominent. Memory requirements grow linearly with the size of the dataset, because the K-th order neighborhood of a K-layer GCN must be held in memory for exact computation, so large and densely connected graphs may require further approximation. It implicitly assumes locality (through the K-th order neighborhood of a K-layer GCN) and that self-connections are equally important as edges to neighboring nodes. Finally, it can only handle undirected graphs.
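For reference, the propagation rule of a single GCN layer can be sketched in NumPy. This is a minimal version of the standard rule H' = ReLU(D^{-1/2}(A+I)D^{-1/2} H W); the toy graph and weights are illustrative only.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # degree normalization
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ features @ weight, 0.0)

# tiny undirected toy graph: edges 0-1 and 1-2
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
h = np.eye(3)                    # one-hot input features
w = np.full((3, 2), 0.5)         # toy weight matrix
out = gcn_layer(adj, h, w)
print(out.shape)  # (3, 2)
```

Stacking this layer K times gives each node access to its K-hop neighborhood, which is exactly the source of the memory and locality limitations noted above.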
Then we elaborate on metapath2vec [
Among them,
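The metapath-guided random walk underlying metapath2vec can be sketched as follows. This is a simplified version for symmetric metapaths such as APA; the toy DBLP-style graph (authors linked through papers) is illustrative.

```python
import random

def metapath_walk(adj, node_type, start, metapath, length, seed=0):
    """Random walk guided by a symmetric metapath such as 'APA' or 'APCPA'.

    adj: dict node -> list of neighbor nodes in the heterogeneous graph.
    node_type: dict node -> one-letter type ('A' author, 'P' paper, ...).
    At step i the walk may only move to a neighbor whose type matches
    metapath[i % (len(metapath) - 1)], so the pattern repeats cyclically.
    """
    rng = random.Random(seed)
    walk, cur = [start], start
    for i in range(1, length):
        want = metapath[i % (len(metapath) - 1)]
        candidates = [n for n in adj[cur] if node_type[n] == want]
        if not candidates:        # dead end: stop the walk early
            break
        cur = rng.choice(candidates)
        walk.append(cur)
    return walk

# toy DBLP-style graph: authors a1, a2 linked through papers p1, p2
adj = {"a1": ["p1"], "a2": ["p1", "p2"], "p1": ["a1", "a2"], "p2": ["a2"]}
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P"}
walk = metapath_walk(adj, node_type, "a1", "APA", length=5)
print(walk)  # alternates author/paper nodes, e.g. starting 'a1', 'p1', ...
```

The walks generated this way are then fed to skip-gram training, exactly as in DeepWalk, but the type constraint is what lets metapath2vec capture relation-specific semantics.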
Finally, we elaborate on DeepWalk. DeepWalk [
The DeepWalk [
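A minimal sketch of the DeepWalk procedure: uniform random walks are turned into skip-gram (center, context) training pairs. The toy path graph is illustrative; a real implementation would feed the pairs to word2vec-style training to produce the node vectors.

```python
import random

def deepwalk_walks(adj, num_walks, walk_length, seed=0):
    """Generate DeepWalk-style uniform random walks from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk, cur = [start], start
            for _ in range(walk_length - 1):
                cur = rng.choice(adj[cur])   # uniform neighbor choice
                walk.append(cur)
            walks.append(walk)
    return walks

def skipgram_pairs(walks, window):
    """Turn walks into (center, context) training pairs for skip-gram."""
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if j != i:
                    pairs.append((center, walk[j]))
    return pairs

adj = {0: [1], 1: [0, 2], 2: [1]}   # toy path graph 0-1-2
walks = deepwalk_walks(adj, num_walks=2, walk_length=4)
pairs = skipgram_pairs(walks, window=2)
print(len(walks), len(pairs))  # 6 60
```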
To establish a network embedding model that performs well on cross-domain problems in multi-network space while retaining the advantages of each existing model, we introduce a neural network with an attention mechanism. The attention mechanism is used in an encoder-decoder structure: the encoder embeds the input as a vector, and the decoder produces the output from this vector. Since this structure is differentiable everywhere, the parameters θ of the model can be learned from the training data by maximum likelihood estimation; maximizing the log-likelihood yields the optimal model parameters, namely:
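Written out, the maximum-likelihood objective described above takes the standard form (a reconstruction of the missing formula, using the paper's θ for the model parameters and $\mathcal{D}$ for the training pairs):

```latex
\theta^{*} \;=\; \arg\max_{\theta} \sum_{(x,\,y)\,\in\,\mathcal{D}} \log p_{\theta}(y \mid x)
```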
This is an end-to-end training method, and the MCR model we propose here is based on it. The model can handle various types of nodes and relations and integrate the rich semantics of heterogeneous networks. Information can be transmitted from one node to another through different relations, and the efficiency and quality of the contributed data can be maintained through user and task analysis. The MCR model also has potentially good interpretability: by learning the importance of nodes and meta-paths, the model can attend to meaningful nodes or meta-paths for a specific task and provide a more comprehensive description of a heterogeneous graph, which helps us analyze and interpret our results. In the following, we demonstrate the superiority of our model through experiments.
In this module, we first perform network embedding with the selected baseline models and our own model on different datasets, run KNN classification on the resulting vectors, and report the related metrics. Then, by removing a certain proportion of edges from each dataset, we establish cross-domain scenarios and re-evaluate and compare each model to demonstrate the superiority of ours.
The heterogeneous information network datasets we use are shown in
| Dataset | Relations (A-B) | Number of A | Number of B | Number of A-B | Feature | Training | Validation | Test | Meta-paths |
|---|---|---|---|---|---|---|---|---|---|
| DBLP | Paper-Author | 14328 | 4057 | 19645 | 334 | 800 | 400 | 2857 | APA |
| DBLP | Paper-Conf | 14328 | 20 | 14328 | | | | | APCPA |
| DBLP | Paper-Term | 14327 | 8789 | 88420 | | | | | APTPA |
| ACM | Paper-Author | 3028 | 5835 | 9744 | 1830 | 600 | 300 | 2125 | PAP |
| ACM | Paper-Sub | 3025 | 56 | 3025 | | | | | PSP |
| IMDB | Movie-Actor | 4780 | 5841 | 14340 | 1232 | 300 | 300 | 2687 | MAM |
| IMDB | Movie-Dir | 4780 | 2269 | 4780 | | | | | MDM |
DBLP: we extracted a subset of the DBLP dataset for our experiments, containing 14,328 papers from four fields (database, data mining, machine learning, information retrieval), 4057 authors, 8789 terms, and 20 conferences. To label each author's research field, we used the conferences the author submitted to. Author features are bag-of-words elements composed of keywords. In our experiments, we use the metapaths {APA, APCPA, APTPA}.
ACM: we extract papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB across three categories (database, wireless communication, data mining), constructing an information network with 3025 papers, 5835 authors, and 56 subjects. Paper features correspond to bag-of-words elements of keywords, and papers are labeled by the conference in which they were published. In our experiments, we use the metapaths {PAP, PSP}.
IMDB: we extracted a subset of the IMDB dataset for our experiments, containing 4780 movies (M) of three genres (action, comedy, drama), 5841 actors (A), and 2269 directors (D). Movie features correspond to bag-of-words elements composed of plot keywords. We use the metapaths {MAM, MDM}.
We compare the previously selected network models on the listed datasets, including network embedding methods and methods based on graph neural networks:
GCN [
Deepwalk [
Metapath2vec [
Here, we tested all meta-paths and report the best performance, in order to verify the effectiveness of our proposed neural network model MCR.
| Dataset | Metrics | Training | GCN | DeepWalk | Metapath2Vec | MCR |
|---|---|---|---|---|---|---|
| ACM | Macro-F1 | 10% | 87 | 57.27 | 64.01 | 89.57 |
| ACM | Macro-F1 | 20% | 87.21 | 61.9 | 65.12 | 90.43 |
| ACM | Macro-F1 | 30% | 87.39 | 65.16 | 67.22 | 90.60 |
| ACM | Macro-F1 | 40% | 87.51 | 67.11 | 69.95 | 91.01 |
| ACM | Macro-F1 | 50% | 87.8 | 69.36 | 70.84 | 91.35 |
| ACM | Macro-F1 | 60% | 88 | 71.3 | 71.5 | 91.65 |
| ACM | Micro-F1 | 10% | 87 | 59.72 | 63.98 | 89.68 |
| ACM | Micro-F1 | 20% | 87.22 | 63.68 | 65.01 | 90.51 |
| ACM | Micro-F1 | 30% | 87.38 | 66.46 | 67.53 | 90.67 |
| ACM | Micro-F1 | 40% | 87.5 | 68.28 | 69.73 | 91.08 |
| ACM | Micro-F1 | 50% | 87.81 | 70.35 | 70.23 | 91.39 |
| ACM | Micro-F1 | 60% | 88.01 | 72.12 | 71.32 | 91.71 |
| DBLP | Macro-F1 | 10% | 90.57 | 23.76 | 90.01 | 90.64 |
| DBLP | Macro-F1 | 20% | 90.91 | 24.1 | 90.16 | 90.98 |
| DBLP | Macro-F1 | 30% | 91.13 | 24.77 | 90.53 | 91.12 |
| DBLP | Macro-F1 | 40% | 91.09 | 24.3 | 90.82 | 91.15 |
| DBLP | Macro-F1 | 50% | 91.12 | 24.83 | 90.95 | 91.19 |
| DBLP | Macro-F1 | 60% | 90.95 | 25.13 | 91.33 | 91.37 |
| DBLP | Micro-F1 | 10% | 91.18 | 26.96 | 90.95 | 91.21 |
| DBLP | Micro-F1 | 20% | 91.46 | 26.99 | 91.52 | 91.47 |
| DBLP | Micro-F1 | 30% | 91.65 | 27.39 | 91.86 | 91.58 |
| DBLP | Micro-F1 | 40% | 91.63 | 26.87 | 92.01 | 91.63 |
| DBLP | Micro-F1 | 50% | 91.68 | 27.4 | 92.33 | 91.69 |
| DBLP | Micro-F1 | 60% | 91.54 | 27.7 | 92.5 | 91.59 |
| IMDB | Macro-F1 | 10% | 34.22 | 39.78 | 39.89 | 40.01 |
| IMDB | Macro-F1 | 20% | 35.47 | 40.73 | 41.12 | 41.21 |
| IMDB | Macro-F1 | 30% | 36.21 | 42.85 | 43.13 | 43.22 |
| IMDB | Macro-F1 | 40% | 36.8 | 45.21 | 44.24 | 45.31 |
| IMDB | Macro-F1 | 50% | 37.42 | 46.95 | 44.93 | 47.02 |
| IMDB | Macro-F1 | 60% | 38.13 | 48.13 | 45.13 | 48.21 |
| IMDB | Micro-F1 | 10% | 36.34 | 45.11 | 43.95 | 44.99 |
| IMDB | Micro-F1 | 20% | 37.49 | 46.38 | 45.65 | 46.39 |
| IMDB | Micro-F1 | 30% | 38.11 | 48.65 | 47.02 | 48.32 |
| IMDB | Micro-F1 | 40% | 38.66 | 50.01 | 48.25 | 50.55 |
| IMDB | Micro-F1 | 50% | 39.16 | 51.52 | 48.96 | 51.58 |
| IMDB | Micro-F1 | 60% | 39.82 | 52.19 | 49.13 | 52.30 |
It can be seen that the model proposed in this paper improves node classification by 1%~5% over the existing methods. In addition, all models perform better on the ACM and DBLP datasets than on the IMDB dataset. A likely reason is that the attributes of the IMDB film data are more scattered than those of the other datasets, their correlation with the labels is lower, and the graph structure of IMDB is more random and less stable than the networks of the other datasets.
| Dataset | Metrics | DeepWalk | Metapath2Vec | GCN | MCR |
|---|---|---|---|---|---|
| ACM | NMI | 0.0970 | 0.0714 | 0.6295 | 0.6345 |
| ACM | ARI | 0.0680 | 0.0739 | 0.6811 | 0.6842 |
| DBLP | NMI | 0.0020 | 0.0008 | 0.7251 | 0.7266 |
| DBLP | ARI | 0.0001 | 0.0005 | 0.7741 | 0.7743 |
| IMDB | NMI | 0.0180 | 0.0265 | 0.0636 | 0.0678 |
| IMDB | ARI | 0.0088 | 0.0205 | 0.0560 | 0.0575 |
It can be seen intuitively from the figure that, on the different datasets, the embedded nodes of the other models are distributed in a confused manner, while the result of our proposed MCR model is significantly better. This shows that the low-dimensional node vectors obtained by the method proposed in this paper provide strong support for node classification tasks.
For the selected methods, we initialize the parameters randomly and optimize with Adam. We set the learning rate to 0.005, the regularization parameter to 0.001, and the dimension of the generated vector q to 128. For GCN, we use a validation set to optimize its parameters. We keep the training, validation, and test sets separate to ensure the fairness of the experimental results. For the random walk-based methods DeepWalk [
After obtaining the vector results, we use a KNN classifier with k = 3 to classify the nodes. Since the variance of a single classification split can be high, we repeat the process 10 times and report the average Macro-F1 and Micro-F1 values in the table. From the comparison of information in
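The evaluation protocol above can be sketched with scikit-learn. The synthetic two-cluster embedding below stands in for the learned node vectors; k = 3 and the 10 repetitions follow the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def knn_f1(embeddings, labels, train_ratio, repeats=10, seed=0):
    """Evaluate node embeddings with a KNN (k=3) classifier, averaging
    Macro-F1 and Micro-F1 over several random train/test splits."""
    macro, micro = [], []
    for r in range(repeats):
        x_tr, x_te, y_tr, y_te = train_test_split(
            embeddings, labels, train_size=train_ratio, random_state=seed + r)
        clf = KNeighborsClassifier(n_neighbors=3).fit(x_tr, y_tr)
        pred = clf.predict(x_te)
        macro.append(f1_score(y_te, pred, average="macro"))
        micro.append(f1_score(y_te, pred, average="micro"))
    return float(np.mean(macro)), float(np.mean(micro))

# synthetic stand-in for learned embeddings: two well-separated clusters
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (50, 8)), rng.normal(3, 0.1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
ma, mi = knn_f1(emb, y, train_ratio=0.5)
print(round(ma, 2), round(mi, 2))  # 1.0 1.0
```

Varying `train_ratio` from 0.1 to 0.6 reproduces the rows of the results table above.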
To create cross-domain scenarios, we remove different proportions of edges from each dataset to create different domains for testing. The model trained above on the complete dataset is used to perform network embedding on each new domain, and the resulting vectors are then classified as described above. Taking the metrics at a training ratio of 0.5 to represent the effect on each domain, the comparison can be plotted as a line graph, as
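The domain-construction step can be sketched as follows; the edge list and removal fraction are illustrative.

```python
import random

def remove_edges(edges, fraction, seed=0):
    """Build a new 'domain' graph by dropping a given fraction of edges,
    as in the cross-domain setup described above."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)                       # randomize which edges go
    return edges[int(len(edges) * fraction):]  # keep the remainder

edges = [(i, i + 1) for i in range(100)]     # toy edge list
domain = remove_edges(edges, fraction=0.2)   # a domain with 20% edges removed
print(len(domain))  # 80
```

The model trained on the full graph is then applied to each such reduced graph, and the embeddings are evaluated with the same KNN protocol.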
Through the analysis of
As for our fused MCR model, its metrics are not the best on every dataset. For example, on the cross-domain problem its metrics on the ACM dataset are not as good as those of GCN, and on the DBLP dataset not as good as those of Deepwalk [
At present, many network representation learning methods are optimized only for the features they focus on. For example, GCN focuses on aggregating neighbor information, and metapath2vec focuses on aggregating information across different relation types. When the content a method relies on is missing, or its characteristic features are not pronounced enough, its performance falls short of expectations. This paper studies several network embedding methods for information networks, fuses them with a neural network with an attention mechanism, and proposes MCR, a new semi-supervised neural network model based on the attention mechanism. First, the model imitates metapath2vec [
The authors are grateful to the anonymous referees for having carefully read earlier versions of the manuscript. Their valuable suggestions substantially improved the quality of exposition, shape, and content of the article.