Article Recommender


This notebook demonstrates how to use Pinecone's similarity search to create a simple personalized article or content recommender.

The goal is to create a recommendation engine that retrieves the best article recommendations for each user. When making recommendations with content-based filtering, we evaluate the user’s past behavior and the content items themselves. So in this example, users will be recommended articles that are similar to those they've already read.
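To make the idea concrete, here is a minimal sketch of content-based ranking in plain Python. The three-dimensional vectors, the article names, and the helper `cosine_similarity` are made up for illustration, and cosine similarity is assumed as the similarity measure. The notebook itself delegates this search to Pinecone.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy article embeddings (hypothetical values, 3 dimensions instead of 300)
articles = {
    "tennis-final": [0.9, 0.1, 0.0],
    "stock-rally": [0.0, 0.2, 0.9],
    "wimbledon-preview": [0.8, 0.3, 0.1],
}

# A user profile that sits close to the tennis articles
user_vector = [1.0, 0.2, 0.0]

# Rank the articles from most to least similar to the user profile
ranked = sorted(
    articles,
    key=lambda name: cosine_similarity(user_vector, articles[name]),
    reverse=True,
)
print(ranked)  # ['tennis-final', 'wimbledon-preview', 'stock-rally']
```

The two tennis articles point in nearly the same direction as the user profile, so they rank ahead of the finance article.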

Install and Import Python Packages

!pip install --quiet wordcloud pandas
!pip install --quiet sentence-transformers --no-cache-dir
import os
import pandas as pd
import numpy as np
import time
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from statistics import mean
%matplotlib inline

In the following sections, we will use Pinecone to easily build an article recommendation engine. Pinecone will be responsible for storing embeddings for articles, maintaining a live index of those vectors, and returning recommended articles on-demand.

Pinecone Setup

!pip install --quiet -U pinecone-client
import pinecone
# Load Pinecone API key
api_key = os.getenv('PINECONE_API_KEY') or 'YOUR_API_KEY'
# Set the Pinecone environment (listed in the Pinecone console under API Keys)
env = os.getenv('PINECONE_ENVIRONMENT') or 'YOUR_ENVIRONMENT'
pinecone.init(api_key=api_key, environment=env)

Get a Pinecone API key if you don’t have one already. You can find your environment in the Pinecone console under API Keys.
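The setup cell falls back to the placeholder strings when the environment variables are missing, so a forgotten key only surfaces later as a connection error. A small guard like the one below can fail earlier with a clearer message. The helper `read_setting` is hypothetical and not part of the Pinecone client.

```python
import os

def read_setting(name, placeholder):
    """Return the environment variable `name`, or raise if it is unset or still the placeholder."""
    value = os.getenv(name) or placeholder
    if value == placeholder:
        raise RuntimeError(f"Set the {name} environment variable before connecting to Pinecone.")
    return value

# Example usage (raises RuntimeError if the variable is not set):
# api_key = read_setting('PINECONE_API_KEY', 'YOUR_API_KEY')
# env = read_setting('PINECONE_ENVIRONMENT', 'YOUR_ENVIRONMENT')
```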

index_name = 'articles-recommendation'
# If index of the same name exists, then delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

Create an index.

pinecone.create_index(index_name, dimension=300)

Connect to the new index.

index = pinecone.Index(index_name)
index.describe_index_stats()
{'dimension': 300, 'namespaces': {}}

Upload Articles

Next, we will prepare data for the Pinecone vector index, and insert it in batches.

The dataset used throughout this example contains 2.7 million news articles and essays from 27 American publications.

Let's download the dataset.

# Remove files left over from a previous run (-f suppresses the error when they do not exist)
!rm -f all-the-news-2-1.zip
!rm -f all-the-news-2-1.csv
!wget https://www.dropbox.com/s/cn2utnr5ipathhh/all-the-news-2-1.zip -q --show-progress
!unzip -q all-the-news-2-1.zip
all-the-news-2-1.zi     [             <=>    ]   3.13G  82.7MB/s    in 39s

Create Vector Embeddings

The model used in this example is average_word_embeddings_komninos, one of the Average Word Embeddings models available through sentence-transformers. It lets us create a vector embedding for each article from its title and content.

import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('average_word_embeddings_komninos', device=device)
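As its name suggests, this kind of model represents a text by averaging the vectors of its words. The sketch below illustrates that idea with a made-up two-dimensional vocabulary. It assumes this is how the model behaves and leaves out tokenization and vocabulary handling; `word_vectors` and `embed` are hypothetical names.

```python
from statistics import mean

# Hypothetical 2-dimensional word vectors; the index in this notebook uses 300 dimensions
word_vectors = {
    "nadal": [1.0, 0.0],
    "wins": [0.25, 0.75],
    "wimbledon": [0.25, 0.75],
}

def embed(text):
    """Average the vectors of the known words in `text` (empty list if none are known)."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return [mean(dimension) for dimension in zip(*vectors)]

print(embed("Nadal wins Wimbledon"))  # [0.5, 0.5]
```

Because word order is ignored, two texts that use the same words in a different order receive the same embedding.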

Generating vector embeddings for the complete dataset takes considerably longer, so we will use only a sample. If you want to try uploading the whole dataset, set NROWS to None.

NROWS = 200000      # number of rows to be loaded from the csv, set to None for loading all rows, reduce if you have a low amount of RAM or want a faster execution
BATCH_SIZE = 500    # batch size for upserting
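The batch size controls how many vectors each upsert request carries. As a rough illustration of the slicing the upload loop relies on, here is a stand-alone sketch. The generator `batches` and the dummy items are hypothetical.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical (id, vector) pairs standing in for real embeddings
items = [(str(i), [0.0, 0.0]) for i in range(1200)]

sizes = [len(batch) for batch in batches(items, 500)]
print(sizes)  # [500, 500, 200]
```

The last batch is simply shorter when the number of items is not a multiple of the batch size.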

Let's prepare data for upload.

Uploading the data may take a while, depending on your network connection.

#%%time     

def prepare_data(data) -> pd.DataFrame:
    'Preprocesses data and prepares it for upsert.'
    
    # add an id column
    print("Preparing data...")
    data["id"] = range(len(data))

    # extract only first few sentences of each article for quicker vector calculations
    data['article'] = data['article'].fillna('')
    data['article'] = data.article.apply(lambda x: ' '.join(re.split(r'(?<=[.:;])\s', x)[:4]))
    # join title and article with a space; treat a missing title as empty
    data['title_article'] = data['title'].fillna('') + ' ' + data['article']
    
    # create a vector embedding based on title and article columns
    print('Encoding articles...')
    encoded_articles = model.encode(data['title_article'])
    data['article_vector'] = pd.Series(encoded_articles.tolist())
    
    return data


def upload_items(data):
    'Uploads data in batches.'
    print("Uploading items...")
    
    # create a list of items for upload
    items_to_upload = list(zip(data['id'].astype(str), data['article_vector']))
    
    # upsert
    for i in range(0, len(items_to_upload), BATCH_SIZE):
        index.upsert(vectors=items_to_upload[i:i+BATCH_SIZE])

    
def process_file(filename: str) -> pd.DataFrame:
    'Reads the first NROWS rows of a csv file, then prepares and uploads the data.'
    
    data = pd.read_csv(filename, nrows=NROWS)
    data = prepare_data(data)
    upload_items(data)
    return data
            
uploaded_data = process_file(filename='all-the-news-2-1.csv')
Preparing data...
Encoding articles...
Uploading items...
# Print index statistics
index.describe_index_stats()
{'dimension': 300, 'namespaces': {'': {'vector_count': 200000}}}
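To see what the truncation step in `prepare_data` keeps, the following sketch applies the same regular expression to a made-up string. The helper name `first_sentences` and the sample text are illustrative only.

```python
import re

def first_sentences(text, n=4):
    """Return the first `n` pieces of `text`, split after '.', ':' or ';'."""
    pieces = re.split(r'(?<=[.:;])\s', text)
    return ' '.join(pieces[:n])

sample = "Breaking: rain stops play. Nadal leads; Cilic trails. Play resumes Friday. More follows."
print(first_sentences(sample))
# Breaking: rain stops play. Nadal leads; Cilic trails.
```

Note that the pattern treats colons and semicolons as boundaries too, so the four pieces kept are not always four full sentences.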

Query the Pinecone Index

We will query the index on behalf of specific users. Each user is represented by the articles they have previously read: we pick 10 articles per user and average their embeddings to obtain a single embedding for that user.
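The query cells in this notebook build each user embedding as the element-wise mean of the article vectors, written as a compact `zip`/`map` expression. The same step can be sketched in isolation with made-up two-dimensional vectors:

```python
from statistics import mean

# Three hypothetical article embeddings (2 dimensions instead of 300)
read_articles = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
]

# zip(*read_articles) groups the values by dimension:
# (1.0, 0.5, 0.0) for the first dimension and (0.0, 0.5, 1.0) for the second
user_vector = [mean(dimension) for dimension in zip(*read_articles)]
print(user_vector)  # [0.5, 0.5]
```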

We will create three users and query Pinecone for each of them:

  • A user who likes to read sports news
  • A user who likes to read entertainment news
  • A user who likes to read business news

Let's define mappings for titles, sections, and publications for each article.

titles_mapped = dict(zip(uploaded_data.id, uploaded_data.title))
sections_mapped = dict(zip(uploaded_data.id, uploaded_data.section))
publications_mapped = dict(zip(uploaded_data.id, uploaded_data.publication))

Also, we will define a function that uses wordcloud to visualize results.

def get_wordcloud_for_user(recommendations):
    'Plots a word cloud built from the titles of the recommended articles.'

    # extend the default stopwords with tokens we do not want to display
    stopwords = set(STOPWORDS).union(['NaN', 'S'])

    wordcloud = WordCloud(
        max_words=50000,
        min_font_size=12,
        max_font_size=50,
        relative_scaling=0.9,
        stopwords=stopwords,
        normalize_plurals=True
    )

    # keep only real titles; missing titles are NaN floats and cannot be joined
    clean_titles = [title for title in recommendations.title.values if isinstance(title, str)]
    title_wordcloud = wordcloud.generate(' '.join(clean_titles))

    plt.imshow(title_wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

Let's query the Pinecone index using three users.

Query Sports User

# first create a user who likes to read sport news about tennis
sport_user = uploaded_data.loc[((uploaded_data['section'] == 'Sports News' ) | 
                                (uploaded_data['section'] == 'Sports')) &
                                (uploaded_data['article'].str.contains('Tennis'))][:10]

print('\nHere is the example of previously read articles by this user:\n')
display(sport_user[['title', 'article', 'section', 'publication']])

# then create a vector for this user
a = sport_user['article_vector']
sport_user_vector = [*map(mean, zip(*a))]

# query the Pinecone index
res = index.query(sport_user_vector, top_k=10)

# print results
ids = [match.id for match in res.matches]
scores = [match.score for match in res.matches]
df = pd.DataFrame({'id': ids, 
                   'score': scores,
                   'title': [titles_mapped[int(_id)] for _id in ids],
                   'section': [sections_mapped[int(_id)] for _id in ids],
                   'publication': [publications_mapped[int(_id)] for _id in ids]
                    })

print("\nThis table contains recommended articles for the user:\n")
display(df)
print("\nA word-cloud representing the results:\n")
get_wordcloud_for_user(df)
Here is the example of previously read articles by this user:
| | title | article | section | publication |
| 2261 | Son of Borg makes quiet debut on London grassc... | LONDON (Reuters) - A blonde-haired, blue-eyed ... | Sports News | Reuters |
| 12373 | Cilic offers Nadal a Wimbledon reality check | LONDON (Reuters) - Spaniard Rafael Nadal got a... | Sports News | Reuters |
| 17124 | Perth confirmed as host for Fed Cup final | (Reuters) - Perth has been named host city for... | Sports News | Reuters |
| 18411 | Fed Cup gets revamp with 12-nation Finals in B... | LONDON (Reuters) - The Fed Cup’s existing form... | Sports News | Reuters |
| 26574 | Nadal to prepare for Wimbledon at Hurlingham e... | (Reuters) - World number two Rafa Nadal has en... | Sports News | Reuters |
| 34957 | Tennis Legend Margaret Court Went Off the Rail... | Margaret Court, the most decorated tennis play... | Sports | Vice |
| 35508 | Puck City: The Enduring Success of Ice Hockey ... | This article originally appeared on VICE Sport... | Sports | Vice |
| 38393 | As if by royal command, seven Britons make it ... | LONDON (Reuters) - Tennis fan the Duchess of C... | Sports News | Reuters |
| 62445 | Williams fined $17,000 for U.S. Open code viol... | NEW YORK (Reuters) - Serena Williams has been ... | Sports News | Reuters |
| 84122 | Kyrgios still wrestling with his tennis soul a... | LONDON (Reuters) - Timothy Gallwey’s million-s... | Sports News | Reuters |


This table contains recommended articles for the user:
| | id | score | title | section | publication |
| 0 | 138865 | 0.966407 | Federer survives first-set wobble to down Wimb... | Sports News | Reuters |
| 1 | 26574 | 0.965867 | Nadal to prepare for Wimbledon at Hurlingham e... | Sports News | Reuters |
| 2 | 12373 | 0.965307 | Cilic offers Nadal a Wimbledon reality check | Sports News | Reuters |
| 3 | 155913 | 0.963684 | U.S. men likely to wander Wimbledon wilderness... | Sports News | Reuters |
| 4 | 60613 | 0.962414 | Auger-Aliassime powers past Tsitsipas into Que... | Sports News | Reuters |
| 5 | 22764 | 0.962373 | Serena headed to Wimbledon seeking return to form | Sports News | Reuters |
| 6 | 71768 | 0.962168 | Venus, Serena, and the Power of Believing | Sports | Vice |
| 7 | 2261 | 0.961590 | Son of Borg makes quiet debut on London grassc... | Sports News | Reuters |
| 8 | 45469 | 0.961451 | Tennis: Barty a win away from world number one | Sports News | Reuters |
| 9 | 55061 | 0.960677 | Warrior on court, diplomat off it, classy Bart... | Sports News | Reuters |


A word-cloud representing the results:

Wordcloud of recommended sports articles

Query Entertainment User

# first create a user who likes to read news about Xbox
entertainment_user = uploaded_data.loc[((uploaded_data['section'] == 'Entertainment') |
                                        (uploaded_data['section'] == 'Games') |
                                        (uploaded_data['section'] == 'Tech by VICE')) &
                                        (uploaded_data['article'].str.contains('Xbox'))][:10]

print('\nHere is the example of previously read articles by this user:\n')
display(entertainment_user[['title', 'article', 'section', 'publication']])

# then create a vector for this user
a = entertainment_user['article_vector']
entertainment_user_vector = [*map(mean, zip(*a))]

# query the Pinecone index
res = index.query(entertainment_user_vector, top_k=10)

# print results
ids = [match.id for match in res.matches]
scores = [match.score for match in res.matches]
df = pd.DataFrame({'id': ids, 
                   'score': scores,
                   'title': [titles_mapped[int(_id)] for _id in ids],
                   'section': [sections_mapped[int(_id)] for _id in ids],
                   'publication': [publications_mapped[int(_id)] for _id in ids]
                    })

print("\nThis table contains recommended articles for the user:\n")
display(df)
print("\nA word-cloud representing the results:\n")
get_wordcloud_for_user(df)
Here is the example of previously read articles by this user:
| | title | article | section | publication |
| 4977 | A Canadian Man Is Pissed That His Son Ran Up a... | A Pembroke, Ontario, gun shop owner is "mad as... | Games | Vice |
| 12016 | 'I Expect You to Die' is One of Virtual Realit... | The reason I bought a Vive over and Oculus ear... | Games | Vice |
| 16078 | Windows 10's Killer App? Xbox One Games | Microsoft's crusade to get the world to instal... | Tech by VICE | Vice |
| 20318 | Black Friday Not Your Thing? Play These Free G... | It's Black Friday, the oh-so-American shopping... | Games | Vice |
| 25785 | Nintendo’s Win at E3 Shows That It's a Console... | E3 has come and gone for 2016, the LA expo o... | Games | Vice |
| 29653 | You Can Smell Like a Gamer With Lynx’s New Xbo... | Gamers in Australia and New Zealand will soon ... | Games | Vice |
| 33234 | It’s Old and It’s Clunky, But You Really Must ... | When Dragon's Dogma first popped up in 2012, t... | Games | Vice |
| 34617 | Nintendo’s Win at E3 Shows That It's a Console... | E3 has come and gone for 2016, the LA expo of ... | Games | Vice |
| 38608 | PC Gaming Is Still Way Too Hard | Here's Motherboard's super simple guide to bui... | Tech by VICE | Vice |
| 41444 | Here’s Everything That Happened at the Xbox E3... | That's Xbox's Big Show for E3 2016 over and do... | Games | Vice |


This table contains recommended articles for the user:
| | id | score | title | section | publication |
| 0 | 34617 | 0.966389 | Nintendo’s Win at E3 Shows That It's a Console... | Games | Vice |
| 1 | 63293 | 0.965053 | A Title Card vs Six Teraflops: How Metroid Sto... | Games | Vice |
| 2 | 25785 | 0.964193 | Nintendo’s Win at E3 Shows That It's a Console... | Games | Vice |
| 3 | 16771 | 0.963487 | The Lo-Fi Flaws That Define Our Favorite Old G... | Games | Vice |
| 4 | 38608 | 0.960349 | PC Gaming Is Still Way Too Hard | Tech by VICE | Vice |
| 5 | 121140 | 0.960174 | Microsoft’s New Direction All Started With the... | Tech by VICE | Vice |
| 6 | 160409 | 0.959802 | Sometimes a David Bowie Song Gets Your Favorit... | Tech by VICE | Vice |
| 7 | 29653 | 0.959628 | You Can Smell Like a Gamer With Lynx’s New Xbo... | Games | Vice |
| 8 | 156585 | 0.959380 | Google Takes Aim at PlayStation, Xbox With Gam... | Games | Vice |
| 9 | 185864 | 0.958856 | The Switch Succeeds on Nintendo's Historic "To... | Games | Vice |


A word-cloud representing the results:

Wordcloud of recommended entertainment articles

Query Business User

# first create a user who likes to read about Wall Street business news
business_user = uploaded_data.loc[((uploaded_data['section'] == 'Business News')|
                                   (uploaded_data['section'] == 'business')) &
                                   (uploaded_data['article'].str.contains('Wall Street'))][:10]

print('\nHere is the example of previously read articles by this user:\n')
display(business_user[['title', 'article', 'section', 'publication']])

# then create a vector for this user
a = business_user['article_vector']
business_user_vector = [*map(mean, zip(*a))]

# query the Pinecone index
res = index.query(business_user_vector, top_k=10)

# print results
ids = [match.id for match in res.matches]
scores = [match.score for match in res.matches]
df = pd.DataFrame({'id': ids, 
                   'score': scores,
                   'title': [titles_mapped[int(_id)] for _id in ids],
                   'section': [sections_mapped[int(_id)] for _id in ids],
                   'publication': [publications_mapped[int(_id)] for _id in ids]
                    })

print("\nThis table contains recommended articles for the user:\n")
display(df)
print("\nA word-cloud representing the results:\n")
get_wordcloud_for_user(df)
Here is the example of previously read articles by this user:
| | title | article | section | publication |
| 370 | Wall St. falls as investors eye a united hawki... | NEW YORK (Reuters) - Wall Street’s major index... | Business News | Reuters |
| 809 | Oil surges on tanker attacks; stocks rise on F... | NEW YORK (Reuters) - Oil futures rose on Thurs... | Business News | Reuters |
| 885 | A look at Tesla's nine-member board | (Reuters) - Tesla Inc’s board has named a spec... | Business News | Reuters |
| 1049 | Home Depot posts rare sales miss as delayed sp... | (Reuters) - Home Depot Inc (HD.N) on Tuesday m... | Business News | Reuters |
| 1555 | PepsiCo's mini-sized sodas boost quarterly res... | (Reuters) - PepsiCo Inc’s (PEP.O) quarterly re... | Business News | Reuters |
| 1638 | Wall Street extends rally on U.S.-China trade ... | NEW YORK (Reuters) - U.S. stocks rallied on Fr... | Business News | Reuters |
| 1900 | U.S. plans limits on Chinese investment in U.S... | WASHINGTON (Reuters) - The U.S. Treasury Depar... | Business News | Reuters |
| 2109 | Exxon Mobil, Chevron dogged by refining, chemi... | HOUSTON (Reuters) - Exxon Mobil Corp and Chevr... | Business News | Reuters |
| 2286 | Wall Street soars on U.S. rate cut hopes | NEW YORK (Reuters) - Wall Street’s three major... | Business News | Reuters |
| 2563 | Apple shares drop on iPhone suppliers' warnings | (Reuters) - Apple Inc (AAPL.O) shares fell to ... | Business News | Reuters |


This table contains recommended articles for the user:
| | id | score | title | section | publication |
| 0 | 131603 | 0.970929 | US STOCKS-Wall Street muted as rate cut bets t... | Market News | Reuters |
| 1 | 93287 | 0.970408 | MONEY MARKETS-U.S. rate-cut bets in June slip ... | Bonds News | Reuters |
| 2 | 159587 | 0.970357 | Wall Street ekes out gain, Apple cuts revenue ... | Business News | Reuters |
| 3 | 53602 | 0.969963 | US STOCKS-Wall St drops on trade worries, Fed ... | Market News | Reuters |
| 4 | 45533 | 0.969199 | Wall Street wavers as tech gives ground and in... | Business News | Reuters |
| 5 | 147320 | 0.968576 | Dented Fed rate cut hopes drag on stocks; doll... | Davos | Reuters |
| 6 | 152313 | 0.968503 | MIDEAST - Factors to watch - July 9 | Earnings Season | Reuters |
| 7 | 34583 | 0.968178 | Global stocks rally after speech by Fed's Powe... | Business News | Reuters |
| 8 | 89976 | 0.968087 | Stocks, yields rise after deal announced to en... | Business News | Reuters |
| 9 | 96107 | 0.968018 | Wall Street surges on higher oil after U.S. qu... | Business News | Reuters |


A word-cloud representing the results:

Wordcloud of recommended business articles

Query Results

We can see that each user's recommendations closely match what that user actually reads. The user who likes tennis news gets plenty of tennis recommendations, the user who reads about Xbox gets gaming news, and the business user gets plenty of Wall Street news.

The word clouds show the words that appear most frequently in the titles of the recommended articles.
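If you want the underlying counts rather than a picture, a word frequency table gives a similar view. The sketch below uses only the standard library. The three titles are shortened from the sports results above, and the tiny stopword set is a stand-in for the one shipped with wordcloud.

```python
from collections import Counter
import re

# Shortened titles from the sports recommendations (illustrative sample)
titles = [
    "Nadal to prepare for Wimbledon",
    "Cilic offers Nadal a Wimbledon reality check",
    "Serena headed to Wimbledon seeking return to form",
]

# Minimal stand-in for a real stopword list
stopwords = {"a", "to", "for"}

words = [
    word
    for title in titles
    for word in re.findall(r"[a-z']+", title.lower())
    if word not in stopwords
]
counts = Counter(words)
print(counts.most_common(2))  # [('wimbledon', 3), ('nadal', 2)]
```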

Since we used only the title and the content of each article to define the embeddings, and did not take publications and sections into account, a user may get recommendations from a publication or section that they do not regularly read. You could try adding this information when creating the embeddings and compare the query results.
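One way to try this is to prepend the section and publication to the text before encoding it. The helper `build_text` below is hypothetical, and whether the extra fields actually improve the recommendations has not been tested here.

```python
def build_text(title, article, section="", publication=""):
    """Join the available fields into one string to embed, skipping empty ones."""
    parts = [publication, section, title, article]
    return " ".join(part.strip() for part in parts if part and part.strip())

text = build_text(
    title="Cilic offers Nadal a Wimbledon reality check",
    article="LONDON (Reuters) - ...",
    section="Sports News",
    publication="Reuters",
)
print(text)
# Reuters Sports News Cilic offers Nadal a Wimbledon reality check LONDON (Reuters) - ...
```

The resulting string would then be passed to the encoder in place of the combined title and article text.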

You may also notice that some articles appear in the recommendations even though the user has already read them. If you prefer not to see them, you can remove them when postprocessing the query results.
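A minimal sketch of such a postprocessing step follows. It uses plain dictionaries in place of the match objects returned by the query, and the function name `drop_already_read` is hypothetical. The ids and rounded scores are taken from the sports example.

```python
def drop_already_read(matches, read_ids, top_k):
    """Keep the best `top_k` matches whose id is not in `read_ids`."""
    unseen = [match for match in matches if match["id"] not in read_ids]
    return unseen[:top_k]

# Stand-in query results, ordered by descending score
matches = [
    {"id": "138865", "score": 0.966},
    {"id": "26574", "score": 0.966},
    {"id": "12373", "score": 0.965},
    {"id": "155913", "score": 0.964},
]

# Ids of articles the user has already read
read_ids = {"26574", "12373"}

recommended = drop_already_read(matches, read_ids, top_k=2)
print([match["id"] for match in recommended])  # ['138865', '155913']
```

Because filtering shrinks the result list, it helps to query for more matches than you need, for example top_k plus the number of articles the user has read.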

Delete the index

Delete the index once you are sure you no longer need it. Deletion is permanent: a deleted index cannot be used again.

pinecone.delete_index(index_name)