Movie Recommender

This notebook demonstrates how Pinecone helps you build a simple Movie Recommender System. There are three parts to this recommender system:

  • A dataset containing movie ratings
  • Two neural network models for embedding movies and users
  • A vector index to perform similarity search on those embeddings

The architecture of our recommender system is shown below. We have two models, a user model and a movie model, which generate embeddings for users and movies. The two models are trained so that the proximity between a user and a movie in the shared vector space depends on the rating the user gave that movie: a highly rated movie sits close to the user, and a poorly rated movie sits farther away. As a result, users with similar movie preferences, and the movies they rated highly, end up close together in the vector space. A similarity search around a user's embedding therefore returns new recommendations based on preferences shared with other users.

Network Architecture Diagram
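
To make the geometry concrete, here is a minimal sketch using made-up 3-dimensional vectors (the real models produce 32-dimensional embeddings). The vectors and variable names are assumptions invented for illustration and are not output from the trained models. It only shows that a movie a user liked should score a higher cosine similarity against the user vector than a movie they disliked.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration only.
user_vec = [0.9, 0.1, 0.0]
liked_movie_vec = [0.8, 0.2, 0.1]      # a movie the user rated highly
disliked_movie_vec = [-0.7, 0.1, 0.6]  # a movie the user rated poorly

sim_liked = cosine_similarity(user_vec, liked_movie_vec)
sim_disliked = cosine_similarity(user_vec, disliked_movie_vec)
print(f"liked: {sim_liked:.3f}, disliked: {sim_disliked:.3f}")
```

The liked movie points in nearly the same direction as the user vector, so its similarity is close to 1, while the disliked movie's similarity is negative.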

Install Dependencies

!pip install datasets transformers pinecone-client tensorflow

Load the Dataset

We will use a subset of the MovieLens 25M Dataset in this project. The subset contains ~1M ratings from over 30k unique users for the ~10k most recent movies in MovieLens 25M. It is available on Hugging Face Datasets as pinecone/movielens-recent-ratings.

from datasets import load_dataset

# load the dataset into a pandas dataframe
movies = load_dataset("pinecone/movielens-recent-ratings", split="train").to_pandas()
# drop duplicates to return only unique movies
unique_movies = movies.drop_duplicates(subset="imdb_id")
unique_movies.head()
   imdb_id    movie_id  user_id  rating  title                                             poster
0  tt5027774  6705      4556     4.0     Three Billboards Outside Ebbing, Missouri (2017)  https://m.media-amazon.com/images/M/MV5BMjI0OD...
1  tt5463162  7966      20798    3.5     Deadpool 2 (2018)                                 https://m.media-amazon.com/images/M/MV5BMDkzNm...
2  tt4007502  1614      26543    4.5     Frozen Fever (2015)                               https://m.media-amazon.com/images/M/MV5BMjY3YT...
3  tt4209788  7022      4106     4.0     Molly's Game (2017)                               https://m.media-amazon.com/images/M/MV5BNTkzMz...
4  tt2948356  3571      15259    4.0     Zootopia (2016)                                   https://m.media-amazon.com/images/M/MV5BOTMyMj...

Initialize Embedding Models

The user_model and movie_model are trained using TensorFlow Keras. The user_model transforms a given user_id into a 32-dimensional embedding in the same vector space as the movies, representing the user’s movie preference. Movie recommendations are then fetched based on proximity to the user’s location in that space.

Similarly, the movie_model transforms a given movie_id into a 32-dimensional embedding in the same vector space, where similar movies lie close together. This makes it possible to find movies similar to a given movie.

from huggingface_hub import from_pretrained_keras

# load the user model and movie model from huggingface
user_model = from_pretrained_keras("pinecone/movie-recommender-user-model")
movie_model = from_pretrained_keras("pinecone/movie-recommender-movie-model")

Create Pinecone Index

To create our vector index, we first need to initialize our connection to Pinecone. For this we need a free API key and the name of our environment; both are shown in the Pinecone console under API Keys. Once we have those, we initialize the connection like so:

import pinecone

# connect to pinecone environment
pinecone.init(
    api_key="<<YOUR_API_KEY>>",
    environment="<<YOUR_ENVIRONMENT>>"
)
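
If you would rather not paste credentials into the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the helper name load_pinecone_settings and the variable names PINECONE_API_KEY and PINECONE_ENVIRONMENT are assumptions chosen here, not something this notebook or Pinecone requires.

```python
import os


def load_pinecone_settings(env=os.environ):
    """Read connection settings from a mapping of environment variables.

    The variable names below are assumptions chosen for this sketch.
    """
    api_key = env.get("PINECONE_API_KEY")
    environment = env.get("PINECONE_ENVIRONMENT")
    # Collect the names of any settings that are unset or empty.
    missing = [
        name
        for name, value in (
            ("PINECONE_API_KEY", api_key),
            ("PINECONE_ENVIRONMENT", environment),
        )
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"api_key": api_key, "environment": environment}
```

The returned dictionary has the same two keys that pinecone.init takes above, so it could be unpacked into that call as pinecone.init(**load_pinecone_settings()).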

Now we create a new index called "movie-emb". The name itself isn't important, but the dimension must match the 32-dimensional embeddings produced by our models.

index_name = 'movie-emb'

# check if the movie-emb index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=32,
        metric="cosine"
    )

# connect to movie-emb index we created
index = pinecone.Index(index_name)

Create Movie Embeddings

We will be creating movie embeddings using the pretrained movie_model. All of the movie embeddings will be upserted to the new "movie-emb" index in Pinecone.

from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(unique_movies), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(unique_movies))
    # extract batch
    batch = unique_movies.iloc[i:i_end]
    # generate embeddings for batch
    emb = movie_model.predict(batch['movie_id']).tolist()
    # get metadata
    meta = batch.to_dict(orient='records')
    # create IDs
    ids = batch["imdb_id"].values.tolist()
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()
{'dimension': 32,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 10269}},
 'total_vector_count': 10269}
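
The batching arithmetic in the loop above can be checked on its own. This sketch uses a hypothetical helper, batch_bounds, that reproduces the start and end indices the loop computes. It uses the 10,269 vectors reported by the index statistics and the batch size of 64, and needs neither the dataframe nor Pinecone.

```python
def batch_bounds(n_items, batch_size):
    """Yield (start, end) index pairs that cover n_items in order."""
    for start in range(0, n_items, batch_size):
        # The final batch is clipped so it never runs past n_items.
        yield start, min(start + batch_size, n_items)


bounds = list(batch_bounds(10269, 64))
print(len(bounds))   # number of upsert calls
print(bounds[0])     # first batch
print(bounds[-1])    # last, shorter batch
```

With these numbers the loop makes 161 upsert calls: 160 full batches of 64, plus a final batch covering rows 10240 to 10269 (29 movies).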

Get Recommendations

We now have movie embeddings stored in Pinecone. To get recommendations, we can do one of two things:

  • Get a user embedding via a user embedding model and our user_ids, and retrieve movie embeddings that are most similar from Pinecone.
  • Use an existing movie embedding to retrieve other similar movies.

Both of these options use the same approach; the only difference is the source of data (user vs. movie) and the embedding model (user vs. movie).
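
The sketch below illustrates why the two options share one approach. It is a brute-force, in-memory stand-in and not how Pinecone is implemented: toy_query, the 2-dimensional vectors, and the movie keys are all hypothetical. The only thing that changes between the two calls is where the query vector comes from.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_query(stored, top_k, vector=None, item_id=None):
    """Return the keys of the top_k stored vectors most similar to the query.

    Pass either `vector` (for example a user embedding) or `item_id`
    (the key of a vector that is already stored), but not both.
    """
    if (vector is None) == (item_id is None):
        raise ValueError("pass exactly one of vector or item_id")
    if vector is None:
        vector = stored[item_id]
    scored = ((cosine_similarity(vector, vec), key) for key, vec in stored.items())
    return [key for _, key in heapq.nlargest(top_k, scored)]


# Hypothetical 2-dimensional movie embeddings, invented for illustration.
toy_index = {
    "action_a": [1.0, 0.1],
    "action_b": [0.9, 0.2],
    "animation_a": [0.1, 1.0],
    "animation_b": [0.2, 0.9],
}

user_vec = [0.95, 0.15]  # hypothetical user who prefers action films
by_user = toy_query(toy_index, top_k=2, vector=user_vec)
by_movie = toy_query(toy_index, top_k=2, item_id="animation_a")
print(by_user, by_movie)
```

Querying with the user vector returns the two action entries, while querying by the stored "animation_a" key returns that entry itself followed by "animation_b".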

We will start with the strategy of getting recommendations for a user embedding.

Get recommendations for a user

# we do this to display movie posters in a jupyter notebook
from IPython.display import HTML

We will start by looking at a user's top rated movies. We can find this information inside the movies dataframe by filtering for movie ratings by a specific user (as per their user_id) and ordering these by the rating score.

def top_movies_user_rated(user):
    # get list of movies that the user has rated
    user_movies = movies[movies["user_id"] == user]
    # order by their top rated movies
    top_rated = user_movies.sort_values(by=['rating'], ascending=False)
    # return the top 14 movies
    return top_rated['poster'].tolist()[:14], top_rated['rating'].tolist()[:14]

After this, we can define a function called display_posters that will take a list of movie posters (like those returned by top_movies_user_rated) and display them in the notebook.

def display_posters(posters):
    figures = []
    for poster in posters:
        figures.append(f'''
            <figure style="margin: 5px !important;">
              <img src="{poster}" style="width: 120px; height: 150px" >
            </figure>
        ''')
    return HTML(data=f'''
        <div style="display: flex; flex-flow: row wrap; text-align: center;">
        {''.join(figures)}
        </div>
    ''')

Let's take a look at user 3's top rated movies:

user = 3
top_rated, scores = top_movies_user_rated(user)
display_posters(top_rated)
print(scores)
[4.5, 4.0, 4.0, 2.5, 2.5]

User 3 has rated these five movies, with Big Hero 6, Civil War, and Avengers being given good scores. They seem less enthusiastic about more sci-fi films like Arrival and The Martian.

Now let's see how to make some movie recommendations for this user.

Start by defining the get_recommendations function. Given a specific user_id, this uses the user_model to create a user embedding (xq). It then retrieves the most similar movie vectors from Pinecone (xc), and extracts the relevant movie posters so we can display them later.

def get_recommendations(user):
    # generate embeddings for the user
    xq = user_model([user]).numpy().tolist()
    # query Pinecone for the 14 movies closest to the user vector (cosine similarity)
    xc = index.query(xq, top_k=14,
                    include_metadata=True)
    result = []
    # iterate through results and extract movie posters
    for match in xc['matches']:
        poster = match['metadata']['poster']
        result.append(poster)
    return result

Now we can retrieve recommendations for the user.

urls = get_recommendations(user)
display_posters(urls)

That looks good: the top results match the user's three favorite movies. Following these, we see a lot of Marvel superhero films, which user 3 is probably going to enjoy, judging from their current ratings.

Let's look at another user. This time, we choose user 128.

user = 128
top_rated, scores = top_movies_user_rated(user)
display_posters(top_rated)
print(scores)
[4.5, 4.5, 4.5, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]

Because this user seems to like everything, they also get recommended a mix of different things:

urls = get_recommendations(user)
display_posters(urls)

Let's look at one more user, user 20000.

user = 20000
top_rated, scores = top_movies_user_rated(user)
display_posters(top_rated)
print(scores)
[5.0, 4.0, 3.5, 3.5, 3.5, 3.0, 1.0]

We can see more of a trend towards action films with this user, so we can expect to see similar action-focused recommendations.

urls = get_recommendations(user)
display_posters(urls)

Find Similar Movies

Now let's see how to find some similar movies.

Start by defining the get_similar_movies function. Given a specific imdb_id, we query directly using the pre-existing embedding for that ID stored in Pinecone.

# search for similar movies in pinecone index
def get_similar_movies(imdb_id):
    # query Pinecone by ID for the 14 vectors closest to this movie's stored embedding
    xc = index.query(id=imdb_id, top_k=14, include_metadata=True)
    result = []
    # iterate through results and extract movie posters
    for match in xc['matches']:
        poster = match['metadata']['poster']
        result.append(poster)
    return result

# imdbid of Avengers Infinity War
imdb_id = "tt4154756"
# filter the imdbid from the unique_movies
movie = unique_movies[unique_movies["imdb_id"] == imdb_id]
movie
    imdb_id    movie_id  user_id  rating  title                                   poster
11  tt4154756  1263      153      4.0     Avengers: Infinity War - Part I (2018)  https://m.media-amazon.com/images/M/MV5BMjMxNj...

# display the poster of the movie
display_posters(movie["poster"])

Now we have Avengers: Infinity War. Let's find movies that are similar to this movie.

similar_movies = get_similar_movies(imdb_id)
display_posters(similar_movies)

The top results closely match Avengers: Infinity War, the most similar movie being that movie itself. Following this, we see a lot of other Marvel superhero films.

Let's try another movie, this time an animated film.

# imdbid of Moana
imdb_id = "tt3521164"
# filter the imdbid from the unique_movies
movie = unique_movies[unique_movies["imdb_id"] == imdb_id]
movie
    imdb_id    movie_id  user_id  rating  title         poster
97  tt3521164  5138      24875    5.0     Moana (2016)  https://m.media-amazon.com/images/M/MV5BMjI4Mz...

# display the poster of the movie
display_posters(movie["poster"])

similar_movies = get_similar_movies(imdb_id)
display_posters(similar_movies)

The results look good again: the top matches include plenty of animated films.

With that, we have built a recommendation system that can recommend movies based on both a user's ratings and similarity to a given movie.