Ben Gorman

You're building a Netflix clone. You have a dataset of movie reviews, where each review is a (user_id, movie_id, rating) triplet.

  • movie_ids are integers in the range [0, Nmovies)
  • user_ids are integers in the range [0, Nusers)
  • ratings are integers in the range [1, 5]

import random

Nmovies = 10
Nusers = 10
Nreviews = 30

movie_ids = random.choices(range(Nmovies), k=Nreviews)
user_ids = random.choices(range(Nusers), k=Nreviews)
ratings = random.choices(range(1, 6), k=Nreviews)

print(movie_ids)
# [4, 9, 8, 7, 3, ... ]

print(user_ids)
# [1, 7, 4, 1, 2, ... ]

print(ratings)
# [1, 3, 2, 1, 1, ... ]

  1. Build a compressed sparse matrix where element (i, j) gives user i's rating of movie j.
  2. Normalize the movie vectors (column vectors) so that each of them has unit length.
  3. Calculate the Euclidean distance between normalized movie 2 and normalized movie 4.
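
One way the three steps above could be approached is sketched below. It assumes NumPy and SciPy are available, which the problem statement does not specify, and it makes two choices the problem leaves open: a repeated (user, movie) pair keeps only its later rating, and a movie with no reviews stays an all-zero column rather than being normalized. The names `latest`, `scale`, and `normalized` are illustrative only.

```python
import random

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import norm as sparse_norm

random.seed(0)  # assumption: fixed seed so repeated runs give the same data

Nmovies, Nusers, Nreviews = 10, 10, 30
movie_ids = random.choices(range(Nmovies), k=Nreviews)
user_ids = random.choices(range(Nusers), k=Nreviews)
ratings = random.choices(range(1, 6), k=Nreviews)

# Assumption: if a user reviewed the same movie twice, keep the later rating.
# (Left alone, the constructor below would sum duplicate entries.)
latest = {(u, m): r for u, m, r in zip(user_ids, movie_ids, ratings)}
rows = [u for u, _ in latest]
cols = [m for _, m in latest]
vals = [float(r) for r in latest.values()]

# Step 1: users on rows, movies on columns, stored column-compressed (CSC)
reviews = sparse.csc_matrix((vals, (rows, cols)), shape=(Nusers, Nmovies))

# Step 2: divide each column by its L2 norm; unreviewed movies keep scale 0
col_norms = sparse_norm(reviews, axis=0)
scale = np.divide(1.0, col_norms, out=np.zeros_like(col_norms), where=col_norms > 0)
normalized = (reviews @ sparse.diags(scale)).tocsc()

# Step 3: Euclidean distance between the normalized columns of movies 2 and 4
diff = normalized[:, 2] - normalized[:, 4]
distance = sparse_norm(diff)
print(round(distance, 2))
```

A column-compressed (CSC) layout is used here because steps 2 and 3 both work column by column; a row-compressed layout would also work but makes column slicing slower.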

For example, if our Netflix clone had three users and two movies with a review matrix like this

[[1 0]
 [0 1]
 [3 0]]

then the normalized movie vectors would be

[[0.32  0. ]
 [0.    1. ]
 [0.95  0. ]]

The Euclidean distance between these two normalized movie vectors is 1.41.
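
The numbers in this small example can be checked by hand with nothing but the standard library. The sketch below uses plain lists rather than a sparse matrix, purely to confirm the arithmetic; `unit` is a hypothetical helper name and assumes the vector is non-zero.

```python
import math

# Review matrix from the example: rows are users, columns are movies
reviews = [
    [1, 0],
    [0, 1],
    [3, 0],
]

# Pull out each movie (column) as its own vector
movies = [list(col) for col in zip(*reviews)]


def unit(vec):
    """Scale vec to length 1 (hypothetical helper; assumes vec is non-zero)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


movie0, movie1 = (unit(m) for m in movies)

print([round(x, 2) for x in movie0])  # [0.32, 0.0, 0.95]
print([round(x, 2) for x in movie1])  # [0.0, 1.0, 0.0]

# Euclidean distance between the two unit-length movie vectors
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(movie0, movie1)))
print(round(distance, 2))  # 1.41
```

The first movie's length is the square root of 1² + 3² = 10, so its entries become 1/√10 ≈ 0.32 and 3/√10 ≈ 0.95. The two normalized vectors have no overlapping non-zero entries, so the squared distance is 1 + 1 = 2.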

