Movie Recommender
This notebook demonstrates how Pinecone helps you build a simple Movie Recommender System. There are three parts to this recommender system:
- A dataset containing movie ratings
- Two neural network models for embedding movies and users
- A vector index to perform similarity search on those embeddings
The architecture of our recommender system is shown below. We have two models, a user model and a movie model, which generate embeddings for users and movies. The two models are trained so that the proximity between a user and a movie in the shared vector space reflects the rating the user gave that movie: a highly rated movie sits close to the user, and a poorly rated one sits farther away. As a result, users with similar movie preferences, and the movies they rated highly, end up close together in the vector space. A similarity search around a user's vector therefore returns new recommendations based on preferences shared with other users.
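To make the idea of proximity concrete, here is a minimal sketch of ranking movies by cosine similarity to a user vector. The 3-dimensional vectors and movie names are invented for illustration; the real embeddings in this notebook are 32-dimensional and come from the trained models.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_movies(user_vec, movie_vecs, top_k=2):
    """Return the top_k movie names ordered by similarity to the user."""
    scored = sorted(
        movie_vecs.items(),
        key=lambda item: cosine_similarity(user_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]

# Toy vectors, invented for illustration only.
toy_user = [0.9, 0.1, 0.0]
toy_movies = {
    "superhero_a": [0.8, 0.2, 0.1],
    "superhero_b": [0.7, 0.0, 0.2],
    "drama_c": [0.0, 0.9, 0.4],
}
top = rank_movies(toy_user, toy_movies)
print(top)
```

The two superhero vectors point in nearly the same direction as the user vector, so they rank ahead of the drama. A vector index performs the same kind of ranking, but at scale and without comparing against every stored vector by hand.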
Install Dependencies
!pip install datasets transformers pinecone-client tensorflow
Load the Dataset
We will use a subset of the MovieLens 25M Dataset in this project. The subset contains ~1M ratings from over 30k unique users for the ~10k most recent movies in MovieLens 25M. It is available on Hugging Face Datasets.
from datasets import load_dataset
# load the dataset into a pandas dataframe
movies = load_dataset("pinecone/movielens-recent-ratings", split="train").to_pandas()
# drop duplicates to return only unique movies
unique_movies = movies.drop_duplicates(subset="imdb_id")
unique_movies.head()
  | imdb_id | movie_id | user_id | rating | title | poster |
---|---|---|---|---|---|---|
0 | tt5027774 | 6705 | 4556 | 4.0 | Three Billboards Outside Ebbing, Missouri (2017) | https://m.media-amazon.com/images/M/MV5BMjI0OD... |
1 | tt5463162 | 7966 | 20798 | 3.5 | Deadpool 2 (2018) | https://m.media-amazon.com/images/M/MV5BMDkzNm... |
2 | tt4007502 | 1614 | 26543 | 4.5 | Frozen Fever (2015) | https://m.media-amazon.com/images/M/MV5BMjY3YT... |
3 | tt4209788 | 7022 | 4106 | 4.0 | Molly's Game (2017) | https://m.media-amazon.com/images/M/MV5BNTkzMz... |
4 | tt2948356 | 3571 | 15259 | 4.0 | Zootopia (2016) | https://m.media-amazon.com/images/M/MV5BOTMyMj... |
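Note that drop_duplicates keeps the first row it encounters for each imdb_id by default, so the user_id and rating columns in unique_movies belong to whichever rating happened to come first, not to an aggregate. The pure-Python sketch below mimics that behavior on a few invented rows; first_row_per_key is a hypothetical helper, not part of pandas.

```python
def first_row_per_key(rows, key):
    """Keep the first row seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# Toy ratings, invented for illustration.
toy_ratings = [
    {"imdb_id": "tt001", "user_id": 1, "rating": 4.0},
    {"imdb_id": "tt002", "user_id": 1, "rating": 3.5},
    {"imdb_id": "tt001", "user_id": 2, "rating": 2.0},
]
unique_rows = first_row_per_key(toy_ratings, "imdb_id")
print(unique_rows)
```

Only the first rating for tt001 survives, which is fine here because we only need one row per movie for its ID, title, and poster.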
Initialize Embedding Models
The user_model and movie_model are trained using TensorFlow Keras. The user_model transforms a given user_id into a 32-dimensional embedding in the same vector space as the movies, representing the user’s movie preference. The movie recommendations are then fetched based on proximity to the user’s location in the multi-dimensional space.
Similarly, the movie_model transforms a given movie_id into a 32-dimensional embedding in the same vector space as other similar movies, making it possible to find movies similar to a given movie.
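This notebook does not describe how the two models were trained. As a rough intuition only, the sketch below shows one common formulation: nudging a user vector and a movie vector so that their dot product approaches a scaled rating. The squared-error objective, the learning rate, and the 2-dimensional starting vectors are all assumptions made for illustration, not the procedure used for the pretrained models.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sgd_step(user_vec, movie_vec, target, lr=0.1):
    """One gradient step on the squared error (dot(u, m) - target) ** 2."""
    error = dot(user_vec, movie_vec) - target
    # Both updates use the vectors from before the step.
    new_user = [u - lr * 2 * error * m for u, m in zip(user_vec, movie_vec)]
    new_movie = [m - lr * 2 * error * u for u, m in zip(user_vec, movie_vec)]
    return new_user, new_movie

# Toy starting vectors and target, invented for illustration.
user_vec = [0.1, 0.2]
movie_vec = [0.3, -0.1]
target = 0.9  # e.g. a 4.5 rating scaled to the range [0, 1]

initial_error = abs(dot(user_vec, movie_vec) - target)
for _ in range(200):
    user_vec, movie_vec = sgd_step(user_vec, movie_vec, target)
final_error = abs(dot(user_vec, movie_vec) - target)

print(initial_error, final_error)
```

Repeating steps like this over many (user, movie, rating) triples pulls users toward the movies they rated highly, which is the property the similarity search relies on.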
from huggingface_hub import from_pretrained_keras
# load the user model and movie model from huggingface
user_model = from_pretrained_keras("pinecone/movie-recommender-user-model")
movie_model = from_pretrained_keras("pinecone/movie-recommender-movie-model")
Create Pinecone Index
To create our vector index, we first need to initialize our connection to Pinecone. For this we need a free API key and the name of our environment; both can be found in the Pinecone console under API Keys. Once we have those, we initialize the connection like so:
import pinecone
# connect to pinecone environment
pinecone.init(
    api_key="<<YOUR_API_KEY>>",
    environment="YOUR_ENVIRONMENT"
)
Now we create a new index called "movie-emb". What we name this isn't important.
index_name = 'movie-emb'
# check if the movie-emb index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=32,
        metric="cosine"
    )
# connect to movie-emb index we created
index = pinecone.Index(index_name)
Create Movie Embeddings
We will be creating movie embeddings using the pretrained movie_model. All of the movie embeddings will be upserted to the new "movie-emb" index in Pinecone.
from tqdm.auto import tqdm
# we will use batches of 64
batch_size = 64
for i in tqdm(range(0, len(unique_movies), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(unique_movies))
    # extract batch
    batch = unique_movies.iloc[i:i_end]
    # generate embeddings for batch
    emb = movie_model.predict(batch['movie_id']).tolist()
    # get metadata
    meta = batch.to_dict(orient='records')
    # create IDs
    ids = batch["imdb_id"].values.tolist()
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)
# check that we have all vectors in index
index.describe_index_stats()
{'dimension': 32,
'index_fullness': 0.0,
'namespaces': {'': {'vector_count': 10269}},
'total_vector_count': 10269}
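The batching arithmetic in the loop above can be checked in isolation. The sketch below uses a hypothetical helper, batch_ranges, to reproduce the (i, i_end) pairs for the 10269 movies reported by the index stats.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

ranges = list(batch_ranges(10269, 64))
print(len(ranges))   # number of upsert calls
print(ranges[-1])    # bounds of the final, partial batch
```

With a batch size of 64 this gives 161 upsert calls, the last covering rows 10240 to 10269, so the final batch holds 29 records rather than 64. The min() call is what keeps that last slice from running past the end of the dataframe.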
Get Recommendations
We now have movie embeddings stored in Pinecone. To get recommendations, we can do one of two things:
- Get a user embedding via a user embedding model and our user_ids, and retrieve movie embeddings that are most similar from Pinecone.
- Use an existing movie embedding to retrieve other similar movies.
Both of these options use the same approach; the only difference is the source of data (user vs. movie) and the embedding model (user vs. movie).
We will start with the strategy of getting recommendations for a user embedding.
Get recommendations for a user
# we do this to display movie posters in a jupyter notebook
# (HTML is imported from IPython.display; the IPython.core.display path is deprecated)
from IPython.display import HTML
We will start by looking at a user's top rated movies. We can find this information inside the movies dataframe by filtering for movie ratings by a specific user (as per their user_id) and ordering these by the rating score.
def top_movies_user_rated(user):
    # get list of movies that the user has rated
    user_movies = movies[movies["user_id"] == user]
    # order by their top rated movies
    top_rated = user_movies.sort_values(by=['rating'], ascending=False)
    # return the top 14 movies
    return top_rated['poster'].tolist()[:14], top_rated['rating'].tolist()[:14]
After this, we can define a function called display_posters that will take a list of movie posters (like those returned by top_movies_user_rated) and display them in the notebook.
def display_posters(posters):
    figures = []
    for poster in posters:
        figures.append(f'''
            <figure style="margin: 5px !important;">
              <img src="{poster}" style="width: 120px; height: 150px" >
            </figure>
        ''')
    return HTML(data=f'''
        <div style="display: flex; flex-flow: row wrap; text-align: center;">
        {''.join(figures)}
        </div>
    ''')
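Because display_posters interpolates each poster URL directly into the markup, a URL containing characters such as & or " could produce malformed HTML. The sketch below is an optional variant that escapes each URL first and returns the markup as a plain string; build_poster_html is a hypothetical name, and the example URL is invented.

```python
from html import escape

def build_poster_html(posters):
    """Return the flexbox gallery markup as a plain string."""
    figures = []
    for poster in posters:
        # quote=True also escapes double quotes, so the URL
        # cannot terminate the src="..." attribute early.
        safe_url = escape(str(poster), quote=True)
        figures.append(
            '<figure style="margin: 5px !important;">'
            f'<img src="{safe_url}" style="width: 120px; height: 150px">'
            '</figure>'
        )
    return (
        '<div style="display: flex; flex-flow: row wrap; text-align: center;">'
        + ''.join(figures)
        + '</div>'
    )

# Invented URL containing characters that need escaping.
markup = build_poster_html(['https://example.com/a.jpg?x=1&y="2"'])
print(markup)
```

In a notebook, the returned string could be wrapped in HTML(data=markup) to render it in the same way as display_posters.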
Let's take a look at user 3's top rated movies:
user = 3
top_rated, scores = top_movies_user_rated(user)
display_posters(top_rated)





print(scores)
[4.5, 4.0, 4.0, 2.5, 2.5]
User 3 has rated these five movies, with Big Hero 6, Civil War, and Avengers being given good scores. They seem less enthusiastic about more sci-fi films like Arrival and The Martian.
Now let's see how to make some movie recommendations for this user.
Start by defining the get_recommendations function. Given a specific user_id, this uses the user_model to create a user embedding (xq). It then retrieves the most similar movie vectors from Pinecone (xc), and extracts the relevant movie posters so we can display them later.
def get_recommendations(user):
    # generate embeddings for the user
    xq = user_model([user]).numpy().tolist()
    # compute cosine similarity between user and movie vectors and return top k movies
    xc = index.query(xq, top_k=14,
                     include_metadata=True)
    result = []
    # iterate through results and extract movie posters
    for match in xc['matches']:
        poster = match['metadata']['poster']
        result.append(poster)
    return result
Now we can retrieve recommendations for the user.
urls = get_recommendations(user)
display_posters(urls)














That looks good: the top results actually match the user's three favorite movies. Following this, we see a lot of Marvel superhero films, which user 3 is probably going to enjoy, judging from their current ratings.
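Since the top results include movies the user has already rated, a production recommender would often filter those out. The sketch below shows one way this might be done on the list of matches. The helper drop_already_rated and the toy data are hypothetical, and it assumes each match carries an "id" field holding the imdb_id we upserted.

```python
def drop_already_rated(matches, rated_ids, top_k=14):
    """Return up to top_k matches whose id is not in rated_ids, keeping order."""
    fresh = [match for match in matches if match["id"] not in rated_ids]
    return fresh[:top_k]

# Toy matches shaped like the 'matches' list read in get_recommendations;
# the ids, scores, and poster names are invented.
toy_matches = [
    {"id": "tt0000001", "score": 0.99, "metadata": {"poster": "a.jpg"}},
    {"id": "tt0000002", "score": 0.97, "metadata": {"poster": "b.jpg"}},
    {"id": "tt0000003", "score": 0.95, "metadata": {"poster": "c.jpg"}},
    {"id": "tt0000004", "score": 0.90, "metadata": {"poster": "d.jpg"}},
]
already_rated = {"tt0000001", "tt0000003"}

fresh = drop_already_rated(toy_matches, already_rated, top_k=2)
fresh_posters = [match["metadata"]["poster"] for match in fresh]
print(fresh_posters)
```

In the notebook, the set of rated IDs could be built from the imdb_id column of the user's rows in the movies dataframe, and the query would need a larger top_k so that enough results remain after filtering.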
Let's see another user. This time, we choose user 128.
user = 128
top_rated, scores = top_movies_user_rated(user)
display_posters(top_rated)














print(scores)
[4.5, 4.5, 4.5, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]
Because this user seems to like everything, they also get recommended a mix of different things:
urls = get_recommendations(user)
display_posters(urls)














user = 20000
top_rated, scores = top_movies_user_rated(user)
display_posters(top_rated)







print(scores)
[5.0, 4.0, 3.5, 3.5, 3.5, 3.0, 1.0]
We can see more of a trend towards action films with this user, so we can expect to see similar action-focused recommendations.
urls = get_recommendations(user)
display_posters(urls)














Find Similar Movies
Now let's see how to find some similar movies.
Start by defining the get_similar_movies function. Given a specific imdb_id, we query directly using the pre-existing embedding for that ID stored in Pinecone.
# search for similar movies in pinecone index
def get_similar_movies(imdb_id):
    # compute cosine similarity between movie and embedding vectors and return top k movies
    xc = index.query(id=imdb_id, top_k=14, include_metadata=True)
    result = []
    # iterate through results and extract movie posters
    for match in xc['matches']:
        poster = match['metadata']['poster']
        result.append(poster)
    return result
# imdbid of Avengers Infinity War
imdb_id = "tt4154756"
# filter the imdbid from the unique_movies
movie = unique_movies[unique_movies["imdb_id"] == imdb_id]
movie
  | imdb_id | movie_id | user_id | rating | title | poster |
---|---|---|---|---|---|---|
11 | tt4154756 | 1263 | 153 | 4.0 | Avengers: Infinity War - Part I (2018) | https://m.media-amazon.com/images/M/MV5BMjMxNj... |
# display the poster of the movie
display_posters(movie["poster"])

Now we have Avengers: Infinity War. Let's find movies that are similar to this movie.
similar_movies = get_similar_movies(imdb_id)
display_posters(similar_movies)














The top results closely match Avengers: Infinity War, the most similar movie being that movie itself. Following this, we see a lot of other Marvel superhero films.
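The query movie comes back as its own best match because querying by ID amounts to looking up the stored vector and searching with it. The sketch below illustrates this with a small in-memory stand-in; ToyIndex, its method signatures, and the 2-dimensional vectors are invented for illustration and are not the Pinecone client.

```python
import math

class ToyIndex:
    """In-memory stand-in for a vector index; not the Pinecone client."""

    def __init__(self):
        self.vectors = {}

    def upsert(self, items):
        """Store (id, values) pairs, overwriting any existing id."""
        for item_id, values in items:
            self.vectors[item_id] = values

    def query(self, vector=None, id=None, top_k=3):
        """Return the top_k (id, cosine score) pairs for a vector or a stored id."""
        # Querying by id just looks up the stored vector first.
        if vector is None:
            vector = self.vectors[id]
        scored = [
            (other_id, self._cosine(vector, values))
            for other_id, values in self.vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

toy_index = ToyIndex()
toy_index.upsert([
    ("hero_1", [1.0, 0.0]),
    ("hero_2", [0.9, 0.1]),
    ("toon_1", [0.0, 1.0]),
])

results = toy_index.query(id="hero_1", top_k=3)
# Drop the query movie itself to keep only genuinely different titles.
others = [movie_id for movie_id, _ in results if movie_id != "hero_1"]
print(results[0][0], others)
```

If the movie itself is not wanted in the output, one option is to request one extra result and drop the entry whose ID equals the query ID, as the last step here does.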
Let's try another movie, this time a cartoon.
# imdbid of Moana
imdb_id = "tt3521164"
# filter the imdbid from the unique_movies
movie = unique_movies[unique_movies["imdb_id"] == imdb_id]
movie
  | imdb_id | movie_id | user_id | rating | title | poster |
---|---|---|---|---|---|---|
97 | tt3521164 | 5138 | 24875 | 5.0 | Moana (2016) | https://m.media-amazon.com/images/M/MV5BMjI4Mz... |
# display the poster of the movie
display_posters(movie["poster"])

similar_movies = get_similar_movies(imdb_id)
display_posters(similar_movies)














This result quality is good again. The top results include plenty of cartoons.
With that, we have built a recommendation system able to recommend movies based both on user movie ratings and similar movies.