Movie Rating with Collaborative Filtering

version: 0.1


Author: Wente, EK

Introduction

This project demonstrates how to use Apache Spark [1] to recommend movies to a user. We will use a subset of 500,000 ratings (ratings.dat.gz) [4] from the MovieLens 10M stable benchmark rating dataset. For convenience, the data has been provided here.

Project Description

Instructions

Modeling: We are going to use a technique called collaborative filtering [2]. Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption is that if person A has the same opinion as person B on one issue, A is more likely to share B's opinion on a different issue x than to share the opinion on x of a randomly chosen person.
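To make the underlying assumption concrete, here is a toy sketch in plain Python. It is a nearest-neighbour illustration only, not the ALS matrix factorization used later in this project, and the users, movies, and ratings in it are invented for the example.

```python
# Hypothetical ratings: A and B agree on m1 and m2, C disagrees with both.
ratings = {
    "A": {"m1": 5.0, "m2": 1.0},
    "B": {"m1": 5.0, "m2": 1.0, "x": 4.0},
    "C": {"m1": 1.0, "m2": 5.0, "x": 2.0},
}


def mean_abs_disagreement(u, v):
    """Average absolute rating difference over the movies both users rated."""
    shared = set(ratings[u]) & set(ratings[v])
    return sum(abs(ratings[u][m] - ratings[v][m]) for m in shared) / len(shared)


def predict_from_nearest(user, movie):
    """Predict `user`'s rating of `movie` by copying the most similar user's rating."""
    candidates = [
        v for v in ratings
        if v != user and movie in ratings[v] and set(ratings[user]) & set(ratings[v])
    ]
    nearest = min(candidates, key=lambda v: mean_abs_disagreement(user, v))
    return ratings[nearest][movie]


print(predict_from_nearest("A", "x"))  # B is closest to A, so this prints 4.0
```

Because A agrees with B on both shared movies, the sketch borrows B's rating of x rather than C's.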

Data preparation: Split the ratingsRDD dataset into three pieces:

A training set (RDD), which we will use to train models
A validation set (RDD), which we will use to choose the best model
A test set (RDD), which we will use for our experiments

For example (in Python):

    trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0)

Model training & model selection: Train candidate models on the training set, use the Root Mean Square Error (RMSE) [3] to compute the error of each model on the validation set, then select the model with the lowest error. (Hint: The most important parameter to ALS.train() is the rank. You could use a fixed value of 0.1 for the regularizationParameter.)
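The training and selection step could be sketched as follows. This assumes the RDD-based pyspark.mllib API, where the regularization parameter is passed to ALS.train() as lambda_. The candidate ranks (4, 8, 12), the iteration count, and the helper names are illustrative assumptions, not values prescribed by this project.

```python
import math


def rmse(pairs):
    """Root mean square error over an iterable of (predicted, actual) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("rmse() needs at least one (predicted, actual) pair")
    squared_error = sum((predicted - actual) ** 2 for predicted, actual in pairs)
    return math.sqrt(squared_error / len(pairs))


def model_rmse(model, ratings_rdd):
    """RMSE of an MLlib ALS model on an RDD of (user, movie, rating) tuples."""
    # predictAll() takes (user, movie) pairs and returns Rating(user, product, rating).
    user_movie = ratings_rdd.map(lambda r: (r[0], r[1]))
    predicted = model.predictAll(user_movie).map(lambda p: ((p[0], p[1]), p[2]))
    actual = ratings_rdd.map(lambda r: ((r[0], r[1]), r[2]))
    # join() keeps only the (user, movie) pairs the model was able to score.
    squared = predicted.join(actual).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)
    return math.sqrt(squared.mean())


def select_best_rank(training_rdd, validation_rdd, ranks=(4, 8, 12)):
    """Train one ALS model per candidate rank; return (best_rank, best_model, errors)."""
    from pyspark.mllib.recommendation import ALS  # deferred: needs a Spark install

    errors = {}
    best_rank, best_model = None, None
    for rank in ranks:
        # iterations=5 and the candidate ranks are assumptions made for this sketch.
        model = ALS.train(training_rdd, rank, iterations=5, lambda_=0.1, seed=0)
        errors[rank] = model_rmse(model, validation_rdd)
        if best_rank is None or errors[rank] < errors[best_rank]:
            best_rank, best_model = rank, model
    return best_rank, best_model, errors
```

With a running SparkContext, a call such as select_best_rank(trainingRDD, validationRDD) would return the rank with the lowest validation RMSE together with its trained model. The plain rmse() helper is included only to show the formula on ordinary Python pairs.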

Model evaluation: Apply the best model to the testRDD dataset to assess how well it performs on ratings it has not seen.
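A test RMSE is easier to interpret next to a reference point. One possible reference, which is an assumption of this sketch and not something the project requires, is a baseline that predicts the mean training rating for every (user, movie) pair. The function below works on plain Python sequences of (UserID, MovieID, Rating) tuples.

```python
import math


def mean_rating_baseline_rmse(training_ratings, test_ratings):
    """RMSE on the test ratings when every prediction is the mean training rating."""
    training_ratings = list(training_ratings)
    test_ratings = list(test_ratings)
    # The baseline ignores users and movies entirely.
    mean_rating = sum(r[2] for r in training_ratings) / len(training_ratings)
    squared_error = sum((r[2] - mean_rating) ** 2 for r in test_ratings)
    return math.sqrt(squared_error / len(test_ratings))
```

On RDDs, the same mean could be obtained with trainingRDD.map(lambda r: r[2]).mean(). A selected model whose test RMSE is not clearly below this baseline has learned little beyond the average rating.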

Data:

Each line in the ratings dataset (ratings.dat.gz) is formatted as: UserID::MovieID::Rating::Timestamp

Create the ratingsRDD: For each line in the ratings dataset, we create a tuple of (UserID, MovieID, Rating). We drop the timestamp because we do not need it here.

Submission Instructions

Please upload your final code to GitHub.

Please record a video explaining the design choices you made, including the overall structure of the code, how you chose your final model, and any other functionality you would like to highlight. Please keep the video under 5 minutes.
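Returning to the line format described under Data, the parsing step might look like the sketch below. The function names and the relative file path are assumptions for illustration, and the sample line in the usage note uses made-up values.

```python
def parse_rating(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into (user_id, movie_id, rating)."""
    user_id, movie_id, rating, _timestamp = line.strip().split("::")
    # The timestamp is parsed only so that it can be dropped.
    return int(user_id), int(movie_id), float(rating)


def load_ratings(sc, path="ratings.dat.gz"):
    """Build ratingsRDD from the gzipped file using an existing SparkContext `sc`."""
    # textFile() decompresses .gz input transparently and yields one record per line.
    return sc.textFile(path).map(parse_rating)
```

For instance, parse_rating("1::122::5::838985046") yields (1, 122, 5.0). The rating is kept as a float because the dataset may contain half-star values.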

References

[1] Spark: http://spark.apache.org
[2] Collaborative Filtering: https://en.wikipedia.org/wiki/Collaborative_filtering
[3] RMSE: https://en.wikipedia.org/wiki/Root-mean-square_deviation
[4] MovieLens: http://grouplens.org/datasets/movielens/
