Movie Rating with Collaborative Filtering
version: 0.1
Author: Wente, EK
Introduction
This project demonstrates how to use Apache Spark [1] to recommend movies to a user. We use a subset of 500,000 ratings (ratings.dat.gz) [4] from the MovieLens 10M stable benchmark rating dataset. For convenience, the data has been provided here.
Project Description
Instructions
Modeling: We are going to use a technique called collaborative filtering [2]. Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption is that if person A has the same opinion as person B on one issue, A is more likely to share B's opinion on a different issue x than to share the opinion on x of a randomly chosen person.
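To make the underlying assumption concrete, here is a minimal pure-Python sketch of a neighborhood-style prediction. It is only an illustration of the idea: the toy ratings, the user names, and the helper functions are hypothetical, and this is not the ALS matrix-factorization approach the project itself uses.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {movie: rating}); not taken from MovieLens.
ratings = {
    "A": {"m1": 5.0, "m2": 1.0},
    "B": {"m1": 5.0, "m2": 1.0, "x": 4.0},
    "C": {"m1": 1.0, "m2": 5.0, "x": 1.0},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the movies both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = sqrt(sum(u[m] ** 2 for m in shared))
    norm_v = sqrt(sum(v[m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`.

    Returns None when no other user has rated the movie.
    """
    weighted, total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted += sim * other_ratings[movie]
        total += sim
    return weighted / total if total else None
```

Because A agrees with B on m1 and m2 and disagrees with C, predict("A", "x") lands closer to B's rating of x than a plain average of B and C would.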
Data preparation: Split the ratingsRDD dataset into three pieces:
A training set (RDD), which we will use to train models
A validation set (RDD), which we will use to choose the best model
A test set (RDD), which we will use for our experiments
For example (in Python): trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0)
Model training & model selection: Train candidate models on the training set, compute the Root Mean Square Error (RMSE) [3] of each model on the validation set, then select the model with the lowest error. (Hint: The most important parameter to ALS.train() is the rank. You could use a fixed value of 0.1 for the regularization parameter, which is named lambda_ in the PySpark API.)
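The selection step can be sketched as a small loop over candidate ranks. The helper below is hypothetical and deliberately independent of Spark: train_fn and validation_error_fn are assumed callables that you would supply, and the candidate ranks are yours to choose.

```python
def select_best_rank(ranks, train_fn, validation_error_fn):
    """Train one model per rank and keep the one with the lowest error.

    ranks: iterable of candidate rank values.
    train_fn: callable taking a rank and returning a trained model.
    validation_error_fn: callable taking a model and returning its
        error (for example RMSE) on the validation set.
    Returns (best_rank, best_error, best_model), or None if ranks is empty.
    """
    best = None
    for rank in ranks:
        model = train_fn(rank)
        error = validation_error_fn(model)
        if best is None or error < best[1]:
            best = (rank, error, model)
    return best


# With PySpark's RDD-based MLlib API, train_fn might look like this
# (an untested assumption; trainingRDD holds (user, movie, rating) tuples):
#
#   from pyspark.mllib.recommendation import ALS
#   train_fn = lambda rank: ALS.train(trainingRDD, rank, lambda_=0.1, seed=0)
```

Keeping the regularization value fixed, as the hint suggests, means rank is the only dimension searched, so the loop stays short. The test set is not touched at this stage.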
Model evaluation: Apply the best model to the testRDD dataset and compute its RMSE to measure how well the model predicts ratings it has not seen.
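As a rough illustration of the error computation, the sketch below matches predicted and actual ratings on their (user, movie) key and takes the root of the mean squared difference. It is plain Python operating on lists of tuples; the function name and input shape are assumptions made for this example, not part of any Spark API.

```python
from math import sqrt


def rmse(predicted, actual):
    """RMSE over the (user, movie) pairs present in both inputs.

    Both arguments are iterables of (user_id, movie_id, rating) tuples.
    Pairs that appear in only one of the inputs are ignored.
    """
    predicted_by_key = {(u, m): r for u, m, r in predicted}
    squared_errors = [
        (predicted_by_key[(u, m)] - r) ** 2
        for u, m, r in actual
        if (u, m) in predicted_by_key
    ]
    if not squared_errors:
        raise ValueError("no overlapping (user, movie) pairs")
    return sqrt(sum(squared_errors) / len(squared_errors))
```

On RDDs, the same matching would typically be done by keying both datasets on (user, movie) and joining them before averaging the squared differences.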
Data:
Each line in the ratings dataset (ratings.dat.gz) is formatted as: UserID::MovieID::Rating::Timestamp
Create the ratingsRDD: For each line in the ratings dataset, we create a tuple of (UserID, MovieID, Rating). We drop the timestamp because we do not need it here.
Submission Instructions
Please upload your final code to GitHub.
Please record a video explaining the design choices you made, including the overall structure of the code, how you chose your final model, and any other functionality you would like to highlight. Please keep the video under 5 minutes.
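Returning to the line format described under Data, a minimal sketch of the parsing step might look like the following. The function name parse_rating is hypothetical, and converting the IDs to int and the rating to float is an assumption about the types you want downstream.

```python
def parse_rating(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into (user_id, movie_id, rating).

    The timestamp field is read but discarded.
    """
    user_id, movie_id, rating, _timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating)


# Assuming an existing SparkContext named sc and the file in the working
# directory, the RDD could then be built along these lines (untested here):
#
#   ratingsRDD = sc.textFile("ratings.dat.gz").map(parse_rating)
```

A line with a missing or extra field makes the unpacking raise a ValueError, which surfaces malformed input early.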
References
[1] Spark: http://spark.apache.org
[2] Collaborative Filtering: https://en.wikipedia.org/wiki/Collaborative_filtering
[3] RMSE: https://en.wikipedia.org/wiki/Root-mean-square_deviation
[4] MovieLens: http://grouplens.org/datasets/movielens/