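[Abstract] Movie recommendation based on collaborative filtering. Objectives: learn how to build a movie recommendation system end to end with machine learning algorithms, and how to load, inspect, clean, and merge user data and compute an item similarity matrix. Case overview: in this case we analyze users' movie rating data and build a recommender system on top of it that, given one movie a user is interested in, recommends other movies the user may also like. The dataset used here is users' movie rating data, containing...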
- user_id: user ID
- age: user age
- sex: gender
- occupation: occupation
- zip_code: ZIP code
- user_id: user ID
- movie_id: movie ID
- rating: rating
- unix_timestamp: time of the rating
- movie_id: movie ID
- movie_title: movie title
- release_date: release date
- video_release_date: video release date
- IMDB_URL: URL of the movie on the IMDb website
- 19 other fields: flags for the movie's genres, such as unknown, action, adventure, animation
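1. Prepare the source code and data
This step prepares the source code and data the case needs. The resources are stored in OBS; we download them to the local machine with the ModelArts SDK and extract them into the current directory. After extraction, the current directory contains an ml-100k directory that holds the dataset.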
import os
from modelarts.session import Session
if not os.path.exists('ml-100k'):
    session = Session()
    session.download_data(bucket_path="modelarts-labs-bj4-v2/course/ai_in_action/2021/machine_learning/item_item_collaborative_filtering_for_movie_recommendation/movie_recommendation.tar.gz", path="./movie_recommendation.tar.gz")
    # Extract the resource archive with the tar command
    os.system('tar xf movie_recommendation.tar.gz')
Successfully download file modelarts-labs-bj4/course/ai_in_action/2021/machine_learning/item_item_collaborative_filtering_for_movie_recommendation/movie_recommendation.tar.gz from OBS to local ./movie_recommendation.tar.gz
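The extraction step above shells out to tar. As a minimal sketch, assuming the same archive name, the standard-library tarfile module can do the same job on systems without a tar binary. The helper name extract_archive is hypothetical and not part of the ModelArts SDK.

```python
import os
import tarfile


def extract_archive(archive_path, target_dir="."):
    """Extract a tar archive into target_dir and return its top-level entry names.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    with tarfile.open(archive_path, "r:*") as tar:  # "r:*" auto-detects compression
        names = tar.getnames()
        tar.extractall(path=target_dir)
    return sorted({name.split("/")[0] for name in names})


# Only runs when the downloaded archive is present and not yet extracted
if os.path.exists("movie_recommendation.tar.gz") and not os.path.exists("ml-100k"):
    print(extract_archive("movie_recommendation.tar.gz"))
```

Only extract archives from a source you trust, since tar members can carry arbitrary paths.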
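2. Import the basic libraries
numpy is a numerical computing library, pandas handles file loading and data processing, and scipy is a scientific computing library from which we import two distance functions, cosine and correlation. pairwise_distances from sklearn computes the distance between every pair of rows in a matrix.
# import some useful libraries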
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation
3. Load and display the sample data
# User information
users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('./ml-100k/u.user', sep='|', names=users_cols, parse_dates=True)
 | user_id | age | sex | occupation | zip_code |
0 | 1 | 24 | M | technician | 85711 |
1 | 2 | 53 | F | other | 94043 |
2 | 3 | 23 | M | writer | 32067 |
3 | 4 | 24 | M | technician | 43537 |
4 | 5 | 33 | F | other | 15213 |
Printing the shape of the table shows a 943x5 matrix: 943 is the number of users, and 5 is the number of fields stored for each user.
(943, 5)
# Ratings
rating_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('./ml-100k/u.data', sep='\t', names=rating_cols)
 | user_id | movie_id | rating | unix_timestamp |
0 | 196 | 242 | 3 | 881250949 |
1 | 186 | 302 | 3 | 891717742 |
2 | 22 | 377 | 1 | 878887116 |
3 | 244 | 51 | 2 | 880606923 |
4 | 166 | 346 | 1 | 886397596 |
Printing the shape of the table shows a 100000x4 matrix: 100000 is the number of ratings, and 4 is the number of fields stored for each rating.
(100000, 4)
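The unix_timestamp column stores the rating time as seconds since 1970-01-01 UTC, which is hard to read in the preview. A minimal sketch of converting it to a datetime with pandas follows. The two rows are copied from the preview above, and the column name rated_at is an assumption introduced only for this illustration.

```python
import pandas as pd

# Two rows copied from the ratings preview above
sample = pd.DataFrame({
    "user_id": [196, 186],
    "movie_id": [242, 302],
    "rating": [3, 3],
    "unix_timestamp": [881250949, 891717742],
})

# unit="s" tells pandas the integers are seconds since the Unix epoch
# "rated_at" is a hypothetical column name, not used elsewhere in this case
sample["rated_at"] = pd.to_datetime(sample["unix_timestamp"], unit="s")
print(sample[["user_id", "movie_id", "rating", "rated_at"]])
```

The rest of the case drops the timestamp, so this conversion is optional and only useful for inspecting the data.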
# Movies
movie_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('./ml-100k/u.item', sep='|', names=movie_cols, usecols=range(5), encoding='latin-1')
 | movie_id | title | release_date | video_release_date | imdb_url |
0 | 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... |
1 | 2 | GoldenEye (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?GoldenEye%20(... |
2 | 3 | Four Rooms (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Four%20Rooms%... |
3 | 4 | Get Shorty (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Get%20Shorty%... |
4 | 5 | Copycat (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Copycat%20(1995) |
Printing the shape of the table shows a 1682x5 matrix: 1682 is the number of movies, and 5 is the number of fields loaded for each movie.
(1682, 5)
4. Merge the data
# Merging movie data with their ratings
movie_ratings = pd.merge(movies, ratings)
# merging movie_ratings data with the User's dataframe
df = pd.merge(movie_ratings, users)
 | movie_id | title | release_date | video_release_date | imdb_url | user_id | rating | unix_timestamp | age | sex | occupation | zip_code |
0 | 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 308 | 4 | 887736532 | 60 | M | retired | 95076 |
1 | 4 | Get Shorty (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 308 | 5 | 887737890 | 60 | M | retired | 95076 |
2 | 5 | Copycat (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 308 | 4 | 887739608 | 60 | M | retired | 95076 |
3 | 7 | Twelve Monkeys (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Twelve%20Monk... | 308 | 4 | 887738847 | 60 | M | retired | 95076 |
4 | 8 | Babe (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Babe%20(1995) | 308 | 5 | 887736696 | 60 | M | retired | 95076 |
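Printing the shape of the merged table shows a 100000x12 matrix: 100000 is the number of ratings, and 12 is the number of attributes on each rating, including the movie ID, movie information, user ID, rating, and user information.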
(100000, 12)
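pd.merge without arguments joins on whatever column names the two frames share, which here happens to be movie_id and then user_id. The sketch below shows the same two joins with the keys spelled out and a cardinality check added. The *_demo frames are toy data made up for illustration, and the validate argument is an optional safeguard that the original code does not use.

```python
import pandas as pd

# Toy frames with the same key columns as the real tables (values are illustrative)
movies_demo = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story (1995)", "GoldenEye (1995)"],
})
ratings_demo = pd.DataFrame({
    "user_id": [308, 308, 5],
    "movie_id": [1, 2, 1],
    "rating": [4, 3, 5],
})
users_demo = pd.DataFrame({"user_id": [308, 5], "age": [60, 33]})

# Each movie has many ratings; validate raises MergeError if movie_id repeats in movies_demo
movie_ratings_demo = pd.merge(
    movies_demo, ratings_demo, on="movie_id", how="inner", validate="one_to_many"
)
# Each user has many ratings; validate raises MergeError if user_id repeats in users_demo
df_demo = pd.merge(
    movie_ratings_demo, users_demo, on="user_id", how="inner", validate="many_to_one"
)
print(df_demo)
```

Naming the key explicitly keeps the join stable if a later step adds another column that both frames happen to share.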
5. Clean the data
# pre-processing
# dropping columns that aren't needed
df.drop(df.columns[[3, 4, 7]], axis=1, inplace=True)
ratings.drop("unix_timestamp", inplace=True, axis=1)
movies.drop(movies.columns[[3, 4]], inplace=True, axis=1)
 | movie_id | title | release_date | user_id | rating | age | sex | occupation | zip_code |
0 | 1 | Toy Story (1995) | 01-Jan-1995 | 308 | 4 | 60 | M | retired | 95076 |
1 | 4 | Get Shorty (1995) | 01-Jan-1995 | 308 | 5 | 60 | M | retired | 95076 |
2 | 5 | Copycat (1995) | 01-Jan-1995 | 308 | 4 | 60 | M | retired | 95076 |
3 | 7 | Twelve Monkeys (1995) | 01-Jan-1995 | 308 | 4 | 60 | M | retired | 95076 |
4 | 8 | Babe (1995) | 01-Jan-1995 | 308 | 5 | 60 | M | retired | 95076 |
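The cleaning step above selects columns by position (3, 4, 7), which silently drops the wrong columns if the column order ever changes. A minimal sketch of the same cleanup by column name is shown below on a one-row toy frame. The imdb_url value is a placeholder, since the real URLs are truncated in the preview.

```python
import pandas as pd

# One-row toy frame with the merged table's first eight columns
df_demo = pd.DataFrame({
    "movie_id": [1],
    "title": ["Toy Story (1995)"],
    "release_date": ["01-Jan-1995"],
    "video_release_date": [float("nan")],
    "imdb_url": ["url-placeholder"],  # placeholder value, not a real URL
    "user_id": [308],
    "rating": [4],
    "unix_timestamp": [887736532],
})

# Same columns that positions 3, 4 and 7 refer to in the merged table
unused = ["video_release_date", "imdb_url", "unix_timestamp"]
cleaned = df_demo.drop(columns=unused)
print(cleaned.columns.tolist())
```

Returning a new frame instead of using inplace=True also leaves the original available for comparison.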
6. Create the user-movie rating matrix
# Pivot Table(This creates a matrix of users and movie_ratings)
ratings_matrix = ratings.pivot_table(index=['movie_id'], columns=['user_id'], values='rating').reset_index(drop=True)
ratings_matrix.fillna(0, inplace=True)
cmu = ratings_matrix
user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 934 | 935 | 936 | 937 | 938 | 939 | 940 | 941 | 942 | 943 |
0 | 5.0 | 4.0 | 0.0 | 0.0 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 2.0 | 3.0 | 4.0 | 0.0 | 4.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 |
1 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 |
2 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 4.0 | ... | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
4 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 943 columns
Printing the shape of the table shows a 1682x943 matrix: 1682 is the number of movies, and 943 is the number of users.
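To make the pivot step concrete, here is a minimal sketch on a toy ratings table with three users and three movies (all values made up for illustration). It builds the same movie-by-user layout and measures how many cells were filled with 0 because no rating exists.

```python
import pandas as pd

# Toy long-format ratings: one row per (user, movie) pair
ratings_demo = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [1, 2, 1, 3],
    "rating":   [5, 3, 4, 2],
})

# Rows are movies, columns are users; missing pairs become NaN, then 0
matrix_demo = ratings_demo.pivot_table(
    index="movie_id", columns="user_id", values="rating"
).fillna(0)
print(matrix_demo)

# Share of cells that hold 0, i.e. (user, movie) pairs without a rating
sparsity = (matrix_demo.values == 0).mean()
print(f"share of empty cells: {sparsity:.2f}")
```

Filling with 0 treats "not rated" the same as a very low score. That is a simplification this case accepts; it works with cosine similarity because a 0 contributes nothing to the dot product.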
7. Create the movie similarity matrix
# Cosine similarity: pairwise_distances returns the cosine distance between every pair of rows (movies),
# so 1 - distance gives the similarity
movie_similarity = 1 - pairwise_distances(ratings_matrix.values, metric="cosine")
np.fill_diagonal(movie_similarity, 0)
ratings_matrix = pd.DataFrame(movie_similarity)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1672 | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 |
0 | 0.000000 | 0.402382 | 0.330245 | 0.454938 | 0.286714 | 0.116344 | 0.620979 | 0.481114 | 0.496288 | 0.273935 | ... | 0.035387 | 0.0 | 0.000000 | 0.000000 | 0.035387 | 0.0 | 0.0 | 0.0 | 0.047183 | 0.047183 |
1 | 0.402382 | 0.000000 | 0.273069 | 0.502571 | 0.318836 | 0.083563 | 0.383403 | 0.337002 | 0.255252 | 0.171082 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.078299 | 0.078299 |
2 | 0.330245 | 0.273069 | 0.000000 | 0.324866 | 0.212957 | 0.106722 | 0.372921 | 0.200794 | 0.273669 | 0.158104 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.032292 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.096875 |
3 | 0.454938 | 0.502571 | 0.324866 | 0.000000 | 0.334239 | 0.090308 | 0.489283 | 0.490236 | 0.419044 | 0.252561 | ... | 0.000000 | 0.0 | 0.094022 | 0.094022 | 0.037609 | 0.0 | 0.0 | 0.0 | 0.056413 | 0.075218 |
4 | 0.286714 | 0.318836 | 0.212957 | 0.334239 | 0.000000 | 0.037299 | 0.334769 | 0.259161 | 0.272448 | 0.055453 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.094211 |
5 rows × 1682 columns
(1682, 1682)
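The similarity between two movies is the cosine of the angle between their rating rows: the dot product divided by the product of the row norms. The sketch below computes this directly with numpy on a toy 3x3 matrix so the numbers can be checked by hand. The function name cosine_similarity_matrix and the handling of all-zero rows are assumptions made for this illustration, not the sklearn implementation.

```python
import numpy as np


def cosine_similarity_matrix(item_user):
    """Pairwise cosine similarity between the rows of item_user (illustrative helper)."""
    item_user = np.asarray(item_user, dtype=float)
    norms = np.linalg.norm(item_user, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: an all-zero row gets similarity 0 with every row
    unit_rows = item_user / norms
    return unit_rows @ unit_rows.T


# Toy movie-by-user matrix: 3 movies rated by 3 users, 0 means "not rated"
demo = np.array([
    [5.0, 4.0, 0.0],
    [3.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])
sim = cosine_similarity_matrix(demo)
np.fill_diagonal(sim, 0)  # a movie should not count as similar to itself
print(np.round(sim, 3))
```

For movies 0 and 1 the dot product is 15 and the norms are sqrt(41) and 3, so the similarity is 5 / sqrt(41). Movies 0 and 2 share no raters and get 0.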
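8. Recommend movies from the similarity matrix
When a user has viewed Copycat (1995), we use the movie similarity matrix to recommend the movies whose similarity score with Copycat (1995) is highest.
First, look up the index of Copycat (1995) by its title in the movie table (movies).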
# user_inp=input('Enter the reference movie title based on which recommendations are to be made: ')
user_inp = "Copycat (1995)"
inp = movies[movies['title'] == user_inp].index.tolist()
inp = inp[0]
movies['similarity'] = ratings_matrix.iloc[inp]
movies.columns = ['movie_id', 'title', 'release_date', 'similarity']
 | movie_id | title | release_date | similarity |
0 | 1 | Toy Story (1995) | 01-Jan-1995 | 0.286714 |
1 | 2 | GoldenEye (1995) | 01-Jan-1995 | 0.318836 |
2 | 3 | Four Rooms (1995) | 01-Jan-1995 | 0.212957 |
3 | 4 | Get Shorty (1995) | 01-Jan-1995 | 0.334239 |
4 | 5 | Copycat (1995) | 01-Jan-1995 | 0.000000 |
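# Note: the diagonal was set to 0 in step 7, so the query movie sorts last, not first.
# The slice [1:6] therefore returns ranks 2 to 6 and skips the single most similar movie; use [:5] to include it.
recommended_movies = movies.sort_values(["similarity"], ascending=False)[1:6]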
print("Recommended movies based on your choice of ", user_inp, ": \n", recommended_movies)
Recommended movies based on your choice of Copycat (1995) :
movie_id title release_date similarity
218 219 Nightmare on Elm Street, A (1984) 01-Jan-1984 0.472725
53 54 Outbreak (1995) 01-Jan-1995 0.472399
233 234 Jaws (1975) 01-Jan-1975 0.450780
52 53 Natural Born Killers (1994) 01-Jan-1994 0.445242
97 98 Silence of the Lambs, The (1991) 01-Jan-1991 0.440996
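The lookup, scoring, and sorting steps of this section can be wrapped into one reusable function. The following is a minimal sketch under stated assumptions: the helper name recommend is hypothetical, the movie table is assumed to keep its default 0..n-1 index so that a row's index equals its position in the similarity matrix, and the demo titles and scores are made up.

```python
import numpy as np
import pandas as pd


def recommend(title, movies, similarity, k=5):
    """Return the k rows of movies most similar to title (hypothetical helper).

    Assumes movies has a default RangeIndex aligned with the rows of similarity.
    """
    matches = movies.index[movies["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"title not found: {title!r}")
    pos = matches[0]
    scores = pd.Series(np.asarray(similarity)[pos], index=movies.index, name="similarity")
    scores = scores.drop(index=pos)  # never recommend the query movie itself
    top = scores.nlargest(k)
    return movies.loc[top.index].assign(similarity=top)


# Toy data: three movies and a made-up symmetric similarity matrix
movies_demo = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Movie A", "Movie B", "Movie C"]})
sim_demo = np.array([
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.4],
    [0.1, 0.4, 0.0],
])
print(recommend("Movie A", movies_demo, sim_demo, k=2))
```

Dropping the query row explicitly, instead of relying on a slice offset, keeps the result correct whether or not the diagonal of the similarity matrix has been zeroed.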
