With the rapid development of personalized services, major websites have launched recommendation modules in recent years. Such a module recommends information a user is likely to be interested in based on viewing history and other data, thereby improving the economic benefits of the website and increasing the number of users. This paper introduces the content-based recommendation algorithm, the K-Nearest Neighbor (KNN)-based collaborative filtering (CF) algorithm, and the singular value decomposition (SVD)-based collaborative filtering algorithm. However, each of these algorithms addresses only one aspect of recommendation and cannot recommend with respect to a specific movie entered by a specific user, so the recommended content may deviate from what users need and degrade the user experience. To address this problem, this paper combines the above algorithms and proposes three ensemble recommendation algorithms: KNN + text, user KNN + movie KNN, and user KNN + singular value decomposition. Compared with the traditional collaborative filtering algorithm based on matrix factorization, the proposed method can recommend with respect to specific movies entered by specific users, makes more personalized recommendations, and alleviates the cold-start and sparse-matrix problems to a certain extent.
With the rapid development of the Internet, it has gradually become an important way for people to obtain information resources. Abundant movie resources are easy to find through major search engines. However, because of the dazzling array of movies available, users can hardly locate the movies they are most interested in within a short time. To help users find movies of interest efficiently, recommendation systems have attracted the attention of many researchers. As the core of a recommendation system, the recommendation algorithm directly determines the quality of the system.
The oldest type of recommendation algorithm is the content-based recommendation algorithm [
Collaborative filtering algorithm [
Aiming at this problem, this paper introduces the third type of collaborative filtering algorithm based on SVD [
The above recommendation algorithms each address only one aspect of recommendation and do not recommend with respect to a specific movie entered by a specific user. In response to this problem, this paper combines the above algorithms and proposes three ensemble recommendation algorithms, namely the ensemble recommendation of KNN + text, the ensemble recommendation of user KNN + movie KNN, and the ensemble recommendation of user KNN + singular value decomposition, to achieve better personalized recommendation [
This paper first introduces the principles, advantages, and disadvantages of three traditional recommendation algorithms: the content-based recommendation algorithm, the KNN-based collaborative filtering algorithm, and the singular value decomposition-based collaborative filtering algorithm.
The content-based recommendation algorithm builds a recommendation model from item-related information, user-related information, and the user's operating behavior on items in order to provide recommendation services. Item-related information may include the category of the movie purchased by the user, its tags, and user comments. User-related information refers to attributes such as the age, gender, preferences, and region of the ticket purchaser. The user's operating behavior on an item can be commenting, adding to favorites, liking, watching, browsing, clicking, buying tickets, and so on. The basic principle of the content-based recommendation algorithm is to infer the user's interest preferences from the user's historical behavior and to recommend objects similar to those preferences, which is intuitive, easy to understand, and highly interpretable. The algorithm first extracts the features of movies, then calculates the similarity between movie contents, and finally judges whether a user will like a movie with similar features. Since the algorithm only needs to calculate similarity from the features of the movies and does not involve other users' information, the cold-start problem does not arise. The formula for calculating cosine similarity is shown below.
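The feature-extraction and similarity steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the movie titles and descriptions are invented toy data, the helper names `bag_of_words`, `cosine_similarity`, and `most_similar` are hypothetical, and whitespace tokenization stands in for whatever feature extraction the experiments actually used.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens of a movie description."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(title, features, top_n=2):
    """Rank all other movies by content similarity to `title`."""
    query = bag_of_words(features[title])
    scores = [(other, cosine_similarity(query, bag_of_words(text)))
              for other, text in features.items() if other != title]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]


# Hypothetical toy catalogue: title -> keywords/genres joined into one string.
movies = {
    "Movie A": "space adventure robots",
    "Movie B": "space adventure aliens",
    "Movie C": "romantic comedy paris",
}
ranked = most_similar("Movie A", movies)
```

Here "Movie B" shares two of three tokens with "Movie A" and ranks first, while "Movie C" shares none and receives a similarity of zero.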
The movie-based collaborative filtering algorithm derives the relationship between movies from the ratings that different users give to different movies, and recommends similar movies to users based on that relationship. For example, if user A has watched both movie A and movie B, the two movies are considered highly correlated. When user B also watches movie A, it can be inferred that user B may also want to watch movie B. The formula for calculating the similarity between movies is:
Among them,
The user-based collaborative filtering algorithm discovers users' interests from historical behavior data, such as buying tickets, clicking, commenting, and watching, and quantifies them as scores. The relationship between users is calculated from the attitudes and preferences of different users towards the same movies, and movies are recommended among users with similar preferences. The formula for calculating the similarity between users is:
Among them,
SVD has obvious physical meaning [
The above-mentioned recommendation algorithms each address only one aspect of recommendation and do not recommend with respect to a specific movie entered by a specific user. This paper therefore combines them and proposes three ensemble recommendation algorithms.
input id ← userId
User ← KNN(id)
Mov ← find_movie(id)
Movie ← find_movie(User)
Result ← Movie
input txt ← text
A ← tensor(Result)
B ← tensor(txt)
Distance ← dist(A,B)
Prediction Result ← min(Distance)
The cosine similarity formula is as follows:
Among them, the cosine value between two vectors can be obtained by using the Euclidean dot product formula:
The cosine distance, defined as one minus the cosine similarity, takes values in [0, 2] and is therefore non-negative.
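For reference, the standard definitions behind the statements above can be written as follows. The symbols A and B denote the two feature vectors being compared, as in the pseudo code, and n is their dimension; the symbol d_cos for the cosine distance is introduced here only for illustration.

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}},
\qquad
d_{\cos}(A, B) = 1 - \cos\theta .
\]
```

Because the cosine lies in [-1, 1], the distance lies in [0, 2]. For non-negative feature vectors such as word counts, the cosine itself is non-negative, so the distance stays within [0, 1].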
This paper first uses the KNN algorithm to find the 10 users closest to the input user ID, and then takes the movies that these similar users have seen and the current user has not seen as candidate movies. KNN is then applied to the movie data, and the 10 movies closest to the movie entered by the user are selected for recommendation. The principle of the KNN algorithm is to find the K samples most similar to a given sample and to assign the sample to the category to which most of those K samples belong. The KNN procedure has three steps: first, compute the distances between the test sample and all training samples and sort them; then select the K points with the smallest distances; finally, use the majority category of those K points as the predicted class. Euclidean distance is commonly used to measure the distance between neighbors. The Euclidean distance formula between two n-dimensional vectors a (
It can also be expressed as a vector operation:
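In standard notation, the Euclidean distance between two n-dimensional vectors a and b is the square root of the sum of squared coordinate differences, or equivalently the square root of (a - b) · (a - b). The three KNN steps described above can be sketched as follows; the rating vectors, labels, and function names are hypothetical toy values chosen for illustration, not data from the experiments.

```python
import math
from collections import Counter


def euclidean(a, b):
    """sqrt(sum_i (a_i - b_i)^2) for two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, samples, k):
    """Steps 1 and 2: sort all samples by distance to `query`, keep the k closest."""
    ranked = sorted(samples, key=lambda sid: euclidean(query, samples[sid]))
    return ranked[:k]


def knn_classify(query, samples, labels, k):
    """Step 3: predict the majority label among the k nearest samples."""
    votes = Counter(labels[sid] for sid in k_nearest(query, samples, k))
    return votes.most_common(1)[0][0]


# Hypothetical toy data: sample id -> feature vector, sample id -> category.
samples = {"u1": [5, 4, 0], "u2": [5, 5, 0], "u3": [0, 1, 5], "u4": [1, 0, 4]}
labels = {"u1": "drama", "u2": "drama", "u3": "action", "u4": "action"}

neighbours = k_nearest([5, 4, 1], samples, k=2)
predicted = knn_classify([5, 4, 1], samples, labels, k=3)
```

In the recommendation setting only the neighbor search (`k_nearest`) is needed; the classification step is shown to match the general description of KNN.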
The pseudo code of the algorithm is shown in
input id ← userId
User ← KNN(id)
Mov ← find_movie(id)
Movies ← find_movie(User)
Result ← Movies
input m ← movie
Prediction Result ← KNN(m)
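A minimal Python sketch of the two-stage user KNN + movie KNN idea follows. It rests on several assumptions that are not taken from the paper: ratings are stored as one list per user aligned with a shared movie list, an unseen movie is encoded as 0, Euclidean distance is used in both stages, and the function names and toy data are hypothetical. The defaults of 10 neighbors and 10 recommendations follow the description in the text; the toy call uses smaller values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(target, vectors, k):
    """Ids of the k vectors closest to vectors[target], excluding target itself."""
    others = [key for key in vectors if key != target]
    others.sort(key=lambda key: euclidean(vectors[target], vectors[key]))
    return others[:k]


def recommend(user_id, movie_id, ratings, movies, k_users=10, top_n=10):
    """ratings maps user id -> list of ratings aligned with `movies` (0 = unseen)."""
    # Stage 1: user KNN yields candidate movies seen by neighbors but not by the user.
    neighbours = nearest(user_id, ratings, k_users)
    seen = {m for m, r in zip(movies, ratings[user_id]) if r > 0}
    candidates = {m for u in neighbours
                  for m, r in zip(movies, ratings[u]) if r > 0} - seen
    # Stage 2: movie KNN ranks candidates by closeness to the input movie,
    # using each movie's column of ratings from all users as its feature vector.
    users = sorted(ratings)
    movie_vec = {m: [ratings[u][j] for u in users] for j, m in enumerate(movies)}
    ranked = sorted(candidates,
                    key=lambda m: euclidean(movie_vec[movie_id], movie_vec[m]))
    return ranked[:top_n]


# Hypothetical toy rating matrix.
movies = ["M1", "M2", "M3", "M4"]
ratings = {
    "u1": [5, 0, 0, 1],
    "u2": [5, 4, 0, 1],
    "u3": [4, 5, 1, 0],
    "u4": [0, 0, 5, 5],
}
result = recommend("u1", "M1", ratings, movies, k_users=2, top_n=2)
```

For user u1 and input movie M1, the nearest users are u2 and u3, the candidates are M2 and M3, and M2 ranks first because its rating column is closer to that of M1.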
This paper first uses the KNN algorithm to find the 10 users closest to the input user ID, and then takes the movies that these similar users have watched and the current user has not watched as candidate movies. After that, matrix factorization is used to predict the current user's ratings of the candidate movies, and the ten movies with the highest predicted ratings are selected for recommendation. The pseudo code of the algorithm is shown in
input id ← userId
User ← KNN(id)
Mov ← find_movie(id)
Movies ← find_movie(User)
Result ← Movies
Prediction Result ← SVD(id, Result)
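The final step, SVD(id, Result), can be illustrated with a short sketch that scores each candidate movie by the inner product of a user factor vector and a movie factor vector and keeps the highest-scoring ones. The factor values below are invented for illustration, the function names are hypothetical, and the sketch assumes a plain latent-factor model without bias terms, which may differ from the model actually trained in the experiments.

```python
def predict_rating(user_factors, movie_factors, user_id, movie_id):
    """Predicted rating as the dot product of the two latent factor vectors."""
    p = user_factors[user_id]
    q = movie_factors[movie_id]
    return sum(a * b for a, b in zip(p, q))


def rank_candidates(user_id, candidates, user_factors, movie_factors, top_n=10):
    """Score every candidate movie for the user and return the top_n by score."""
    scored = [(m, predict_rating(user_factors, movie_factors, user_id, m))
              for m in candidates]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_n]


# Hypothetical latent factors, as if produced by a previously trained factorization.
user_factors = {"u1": [1.0, 2.0]}
movie_factors = {"M2": [1.0, 1.0], "M3": [2.0, 1.0], "M5": [0.5, 0.5]}

top = rank_candidates("u1", ["M2", "M3", "M5"], user_factors, movie_factors, top_n=2)
```

The candidate list passed to `rank_candidates` would come from the user KNN stage, so the factorization only acts as a secondary filter over that list.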
Singular value decomposition shows the internal structure of a matrix to a certain extent. A more complex matrix can be expressed as the product of several smaller and simpler sub-matrices. These sub-matrices describe important characteristics of the matrix. Singular value decomposition is a decomposition method that can be applied to any matrix. For any matrix A, there is always a singular value decomposition:
Assuming that A is an m*n matrix, the resulting U is an m*m square matrix, and the orthogonal column vectors of U are called the left singular vectors.
For movie recommendation, we use the TMDB5000 dataset collected from The Movie Database (TMDB) and some data collected from the movie dataset MovieLens [
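In standard notation, and assuming a real-valued matrix A, the decomposition described above is usually written as follows. The rank-k truncation in the second line is the common low-rank approximation and is included only as background; the symbol k is not taken from the text.

```latex
\[
A = U \Sigma V^{\mathsf{T}},
\qquad
U \in \mathbb{R}^{m \times m},\quad
\Sigma \in \mathbb{R}^{m \times n},\quad
V \in \mathbb{R}^{n \times n},
\]
\[
A \approx U_k \Sigma_k V_k^{\mathsf{T}},
\qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{n \times k}.
\]
```

Here U and V are orthogonal, Σ carries the non-negative singular values on its diagonal, and the columns of V are the right singular vectors.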
We use RMSE and Coverage as the evaluation criteria for the recommendation system. The root mean square error (RMSE) is the square root of the mean of the squared differences between the predicted ratings of all items in the test set and the users' actual ratings. The smaller the error, the better the optimization method and the better the recommendation quality of the resulting model. The corresponding formula is shown below.
The quality of a recommendation system can also be measured by the coverage of its recommended content. Coverage refers to the proportion of items recommended by the system out of all items, and describes the ability of a recommendation system to surface long-tail items. The formula for coverage is shown below.
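As a small worked example of the RMSE definition, the following sketch computes the metric over paired predicted and actual ratings. The function name and the sample numbers are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Square root of the mean squared difference between paired ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("rating lists must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


# Hypothetical predictions and ground-truth ratings for three test items.
error = rmse([4.0, 3.0, 5.0], [5.0, 3.0, 3.0])
```

For these three pairs the squared errors are 1, 0, and 4, so the result is the square root of 5/3, roughly 1.29.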
Among them,
In this experiment, firstly, some features of movie data are visualized and analyzed [
The experiment in this paper first plots the relationship between box office and budget, popularity, and drama, as shown in
It can be seen from the above diagram that box office is strongly related to budget, popularity, and drama. Therefore, this paper selects 7 features, namely box office, budget, popularity, drama, screening time, ID, and release year, and draws the correlation diagram between them, as shown in
In the diagram, a darker color indicates a stronger correlation. It can be seen that the box office has a strong correlation with the budget, the degree of popularity, and the degree of drama, and almost no correlation with the release time, ID, etc., which is in line with expectations. In addition, this paper also studies the relationship between box office and film language, as shown in
Finally, this paper studies the relationship between the movie budget and the movie release year, and mainly selects the movie budgets of 1983, 1984, 1985, 1991, and 2017 to estimate the density. The results are shown in
It can be seen from the figure that the release year has a considerable impact on the budget distribution. In earlier years the budgets are concentrated at the lower end. The closer to the present, the more even the budget distribution becomes, reflecting the growing number of large-budget productions.
This paper first takes the keywords, genres, directors, actors, and other information of each movie, uses CountVectorizer from sklearn to convert the text into a word-frequency matrix, and then uses the cosine distance to measure the similarity between movies. The experimental results show that the recommended movies are very similar in content to the input movie, which is in line with the design intent of the algorithm.
In this paper, the ratings of all users for a certain movie are used as the feature vector of the movie, and the KNN algorithm is used to select the ten most similar movies for the movie input by the user for recommendation. The experimental results show that the input movie is
In this paper, a user's ratings of all movies are used as that user's feature vector. For an input user ID, the KNN algorithm first selects the 10 users most similar to the current user, then collects the movies that these 10 users have watched and the current user has not, and finally recommends the ten movies with the highest average rating among them. The experimental results show that the recommendations differ across users, achieving personalized recommendation.
This experiment uses 85% of the user rating matrix for training and 15% for testing. The split is made as follows: a random number is generated for each user-movie rating; ratings whose number is greater than 0.85 go to the test set, and those whose number is less than or equal to 0.85 go to the training set. Then 10 movies are recommended for each user, a total of 6040 movies are recommended, and the coverage of each of the above three algorithms is calculated. The values are shown in
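The last step of this procedure, ranking unseen movies by their average rating among the similar users, can be sketched as follows. The nested-dictionary rating layout, the function name, and the toy values are assumptions made for illustration, and the neighbor list is taken as given rather than computed.

```python
def top_by_average(neighbours, target, ratings, top_n=10):
    """Rank movies unseen by `target` by their mean rating among `neighbours`.

    ratings maps user id -> {movie id: rating} and holds watched movies only.
    """
    seen = set(ratings[target])
    collected = {}
    for user in neighbours:
        for movie, rating in ratings[user].items():
            if movie not in seen:
                collected.setdefault(movie, []).append(rating)
    averages = {m: sum(rs) / len(rs) for m, rs in collected.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical ratings; u2 and u3 play the role of the most similar users to u1.
ratings = {
    "u1": {"M1": 5},
    "u2": {"M1": 5, "M2": 4, "M3": 2},
    "u3": {"M2": 5, "M3": 3, "M4": 1},
}
recommended = top_by_average(["u2", "u3"], "u1", ratings, top_n=2)
```

Averaging only over neighbors who actually rated a movie is one possible design choice; a movie rated by a single neighbor can therefore rank highly, which may or may not be desirable.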
Algorithm | Coverage |
---|---|
Content-based recommendation algorithm |
6.17% |
The experimental results show that among the above three algorithms, the recommendation algorithm based on user similarity has a higher coverage rate and a better model recommendation effect.
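Coverage as defined earlier, the share of catalog items that appear in at least one recommendation list, can be computed with a few lines of Python. The function name, the per-user recommendation lists, and the ten-movie catalog below are hypothetical.

```python
def coverage(recommendations, catalog):
    """Fraction of catalog items recommended to at least one user."""
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    return len(recommended & set(catalog)) / len(catalog)


# Hypothetical recommendation lists for two users over a ten-movie catalog.
catalog = ["M%d" % i for i in range(1, 11)]
recommendations = {"u1": ["M1", "M2"], "u2": ["M2", "M3"]}
share = coverage(recommendations, catalog)
```

Three distinct movies out of ten are recommended here, so the coverage is 0.3; recommending the same popular movies to every user keeps this value low even when each list looks reasonable.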
In this paper, SVD is applied to the user-movie rating matrix to predict ratings for movies that users have not rated. The algorithm serves as a secondary screening step: given the input user ID and a set of movie IDs, it predicts the user's current ratings for those movies, and the 10 movies with the highest predicted ratings are selected for recommendation. Since the user-movie rating matrix is a sparse matrix of very high dimension, different algorithms are used for the decomposition, including stochastic gradient descent (SGD), stochastic gradient Langevin dynamics (SGLD), and stochastic gradient Hamiltonian Monte Carlo (SGHMC). Among them, SGD is an optimization-based method, and SGLD and SGHMC are sampling-based Bayesian [
Method | RMSE |
---|---|
SGD | 0.8538 ± 0.0009 |
SGD (with momentum) | 0.8539 ± 0.0009 |
SGLD | 0.84118 ± 0.0008 |
SGHMC | 0.84117 ± 0.0008 |
Experiments show that SGHMC attains the lowest RMSE, 0.84117, essentially tied with SGLD (0.84118) given the reported deviation of ±0.0008, and that both sampling-based methods achieve a lower RMSE than SGD (0.8538). The sampling-based methods are therefore the better optimization choice here, and the resulting model also gives better recommendations.
In this experiment, 10 movies are recommended for each user, and a total of 6040 movies are recommended. Of these, 602 appear in the test set, accounting for 9.97%, which shows that the method can recommend with respect to specific movies entered by specific users.
This paper first introduces traditional recommendation algorithms, including a content-based recommendation algorithm, two KNN-based collaborative filtering algorithms, and a singular value decomposition-based recommendation algorithm, and summarizes their advantages and disadvantages. On this basis, it proposes three ensemble recommendation algorithms that can recommend with respect to specific movies entered by specific users. Experiments on multiple algorithms show that the ensemble algorithms achieve more personalized recommendations, although their results are slightly inferior to those of the algorithm that uses user KNN alone. The reason is that the user-movie rating matrix contains both high and low ratings: a user may have watched a movie and still rated it very low. The proposed method excludes such movies and recommends only movies with high predicted ratings, whereas the test does not take this into account, which explains the difference in results. The method proposed in this paper can also be adapted into similar ensemble recommendation algorithms for other datasets in the future. For example, an ensemble of user KNN + product KNN could be used for product recommendation, recommending with respect to specific products entered by specific users and finding products that meet user needs more quickly.
The author would like to thank the researchers in the field of recommendation algorithms and other related fields. This paper cites the research of several scholars, and it would have been difficult to complete without the inspiration of their results. Thanks are also due for all the help received in writing this article.
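To make the optimization-based variant concrete, the sketch below factorizes a tiny rating set with plain stochastic gradient descent. It is an illustration only: the hyperparameters, the toy ratings, the function names, and the absence of bias terms are assumptions rather than the configuration used in the experiments, and the sampling-based SGLD and SGHMC variants are not shown.

```python
import math
import random


def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.05, reg=0.02,
             epochs=500, seed=0):
    """Fit user factors P and item factors Q to (user, item, rating) triples by SGD."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)
        for u, i, r in samples:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def train_rmse(ratings, P, Q):
    """RMSE of the factor model on the given rating triples."""
    total = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
                for u, i, r in ratings)
    return math.sqrt(total / len(ratings))


# Hypothetical observed ratings as (user index, movie index, rating).
observed = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]

P0, Q0 = train_mf(observed, 3, 3, epochs=0)   # untrained factors, same seed
P, Q = train_mf(observed, 3, 3)               # trained factors
before, after = train_rmse(observed, P0, Q0), train_rmse(observed, P, Q)
```

Comparing `before` and `after` shows the training error dropping as the factors are fitted; a held-out test split, as in the experiments, would be needed to judge generalization.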