Document Deduplication
This notebook demonstrates how to use Pinecone's similarity search to create a simple application to identify duplicate documents.
The goal is to create a data deduplication application for eliminating near-duplicate copies of academic texts. In this example, we will perform deduplication of a given text in two steps. First, we will retrieve a small set of candidate texts using a similarity-search service. Then, we will apply a near-duplicate detector to these candidates.
The similarity search will use a vector representation of the texts, so that semantic similarity translates into proximity in a vector space. For detecting near-duplicates, we will employ a classification model that examines the raw text.
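As a rough sketch of the flow (not runnable as-is: model, index, and near_duplicates are placeholders for the embedding model, Pinecone index, and LSH-based classifier that we build step by step below), deduplicating a single text looks like this:
# Sketch of the two-step deduplication flow built in this notebook.
# `model`, `index`, and `near_duplicates` are placeholders defined later.
def find_duplicates(text, model, index, near_duplicates, top_k=100):
    # Step 1: embed the text and retrieve the most similar candidate articles
    query_vector = model.encode([text])[0].tolist()
    candidates = index.query(query_vector, top_k=top_k, include_metadata=True)
    # Step 2: run a near-duplicate classifier over the raw candidate texts
    return near_duplicates(text, candidates)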
Install Dependencies
!pip install -qU pinecone-client
!pip install -qU datasketch mmh3 ipywidgets
!pip install -qU gensim==4.0.1
!pip install -qU sentence-transformers --no-cache-dir
!pip install -qU datasets
Download and Process Dataset
This tutorial will use the Deduplication Dataset 2020, which consists of 100,000 scholarly documents. We will use Hugging Face Datasets to download the dataset found at pinecone/core-2020-05-10-deduplication.
from datasets import load_dataset
core = load_dataset("pinecone/core-2020-05-10-deduplication", split="train")
core
Dataset({
features: ['core_id', 'doi', 'original_abstract', 'original_title', 'processed_title', 'processed_abstract', 'cat', 'labelled_duplicates'],
num_rows: 100000
})
We convert the dataset into a Pandas DataFrame like so:
df = core.to_pandas()
df.head()
| | core_id | doi | original_abstract | original_title | processed_title | processed_abstract | cat | labelled_duplicates |
|---|---|---|---|---|---|---|---|---|
| 0 | 11251086 | 10.1016/j.ajhg.2007.12.013 | Unobstructed vision requires a particular refr... | Mutation of solute carrier SLC16A12 associates... | mutation of solute carrier slc16a12 associates... | unobstructed vision refractive lens differenti... | exact_dup | [82332306] |
| 1 | 11309751 | 10.1103/PhysRevLett.101.193002 | Two-color multiphoton ionization of atomic hel... | Polarization control in two-color above-thresh... | polarization control in two-color above-thresh... | multiphoton ionization helium combining extrem... | exact_dup | [147599753] |
| 2 | 11311385 | 10.1016/j.ab.2011.02.013 | Lectin’s are proteins capable of recognising a... | Optimisation of the enzyme-linked lectin assay... | optimisation of the enzyme-linked lectin assay... | lectin’s capable recognising oligosaccharide t... | exact_dup | [147603441] |
| 3 | 11992240 | 10.1016/j.jpcs.2007.07.063 | In this work, we present a detailed transmissi... | Vertical composition fluctuations in (Ga,In)(N... | vertical composition fluctuations in (ga,in)(n... | microscopy interfacial uniformity wells grown ... | exact_dup | [148653623] |
| 4 | 11994990 | 10.1016/S0169-5983(03)00013-3 | Three-dimensional (3D) oscillatory boundary la... | Three-dimensional streaming flows driven by os... | three-dimensional streaming flows driven by os... | oscillatory attached deformable walls boundari... | exact_dup | [148656283] |
We will use the following columns from the dataset for our task.
- core_id - Unique identifier for each article.
- processed_abstract - The article's abstract after text preprocessing (for example, lowercasing and removing punctuation and stopwords) has been applied to the original_abstract column; a rough sketch of this kind of preprocessing is shown after this list.
- processed_title - Same as the abstract but for the title of the article.
- cat - Every article falls into one of three possible categories: 'exact_dup', 'near_dup', 'non_dup'.
- labelled_duplicates - A list of core_ids of articles that are duplicates of the current article.
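The dataset already ships with the processed columns, so the exact preprocessing pipeline is not reproduced here; an approximate, illustrative sketch of this kind of cleaning might look like the following.
# Illustrative only: a rough text-cleaning step similar in spirit to the
# preprocessing applied to the dataset (the exact pipeline may differ)
import re

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "for"}

def simple_preprocess(text):
    text = text.lower()                        # lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

simple_preprocess("Unobstructed vision requires a particular refractive lens.")
# 'unobstructed vision requires particular refractive lens'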
Let's calculate the frequency of duplicates per article. Observe that half of the articles have no duplicates, and only a small fraction of the articles have more than ten duplicates.
lens = df.labelled_duplicates.apply(len)
lens.value_counts()
0 50000
1 36166
2 7620
3 3108
4 1370
5 756
6 441
7 216
8 108
10 66
9 60
11 48
13 28
12 13
Name: labelled_duplicates, dtype: int64
Next, we trim very long abstracts so that the metadata attached to each vector does not cause issues later when upserting to Pinecone.
# Make sure no processed abstracts are excessively long for upsert to Pinecone
df["processed_abstract"] = df["processed_abstract"].str[:8000]
We will make use of the text data to create vectors for every article. We combine the processed_abstract and processed_title of the article to create a new combined_text column.
# Define a new column for calculating embeddings
df["combined_text"] = df["processed_title"] + " " + df["processed_abstract"]
Initialize Pinecone Index
import pinecone
# Connect to pinecone environment
pinecone.init(
api_key="YOUR_API_KEY",
environment="YOUR_ENVIRONMENT"
)
# Pick a name for the new index
index_name = "deduplication"
# Check if the deduplication index exists
if index_name not in pinecone.list_indexes():
# Create the index if it does not exist
pinecone.create_index(
index_name,
dimension=300,
metadata_config={"indexed": ["processed_abstract"]}
)
# Connect to deduplication index we created
index = pinecone.Index(index_name)
Get a free Pinecone API key if you don’t have one already. You can find your environment in the Pinecone console under API Keys.
Initialize Embedding Model
We will use the Average Word Embedding GloVe model to transform text into vector embeddings. We then upload the embeddings into the Pinecone vector index.
import torch
from sentence_transformers import SentenceTransformer
# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer("average_word_embeddings_glove.6B.300d", device=device)
model
SentenceTransformer(
(0): WordEmbeddings(
(emb_layer): Embedding(400001, 300)
)
(1): Pooling({'word_embedding_dimension': 300, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
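As a quick optional sanity check, you can encode a single article and confirm that the embedding dimension matches the 300 dimensions specified when creating the index.
# Encode one article and verify the embedding dimension
sample_embedding = model.encode(df["combined_text"].iloc[0])
print(sample_embedding.shape)  # expected: (300,)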
Generate Embeddings and Upsert
from tqdm.auto import tqdm
# We will use batches of 256
batch_size = 256
for i in tqdm(range(0, len(df), batch_size)):
# Find end of batch
i_end = min(i+batch_size, len(df))
# Extract batch
batch = df.iloc[i:i_end]
# Generate embeddings for batch
emb = model.encode(batch["combined_text"].to_list()).tolist()
# extract the metadata to attach to each vector (here, only processed_abstract)
meta = batch[["processed_abstract"]].to_dict(orient="records")
# create IDs
ids = batch.core_id.astype(str)
# add all to upsert list
to_upsert = list(zip(ids, emb, meta))
# upsert/insert these records to pinecone
_ = index.upsert(vectors=to_upsert)
# check that we have all vectors in index
index.describe_index_stats()
100%|██████████| 391/391 [03:25<00:00, 2.47it/s]
{'dimension': 300,
'index_fullness': 0.1,
'namespaces': {'': {'vector_count': 100000}}}
Searching for Candidates
Now that we have created vectors for the articles and inserted them into the index, we will create a test set for querying. For each article in the test set, we will query the index for the most similar articles; these are the candidates on which we will perform the next classification step.
Below, we list statistics of the number of duplicates per article in the resulting test set.
import math
# Create a sample from the dataset
SAMPLE_FRACTION = 0.002
test_documents = (
df.groupby(df.labelled_duplicates.map(len))
.apply(lambda x: x.head(math.ceil(len(x) * SAMPLE_FRACTION)))
.reset_index(drop=True)
)
print("Number of documents with specified number of duplicates:")
lens = test_documents.labelled_duplicates.apply(len)
lens.value_counts()
Number of documents with specified number of duplicates:
0 100
1 73
2 16
3 7
4 3
5 2
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
Name: labelled_duplicates, dtype: int64
# Use the model to create embeddings for test articles, which will be the query vectors
query_vectors = model.encode(test_documents.combined_text.to_list()).tolist()
# Query the vector index
query_results = []
for xq in tqdm(query_vectors):
query_res = index.query(xq, top_k=100, include_metadata=True)
query_results.append(query_res)
100%|██████████| 209/209 [01:01<00:00, 3.54it/s]
# Save all retrieval recalls into a list
recalls = []
for id, res in tqdm(list(zip(test_documents.core_id.values, query_results))):
# Find document with id in labelled dataset
labeled_df = df[df.core_id.astype(str) == str(id)]
# Calculate the retrieval recall
top_k_list = set([match.id for match in res.matches])
labelled_duplicates = set(labeled_df.labelled_duplicates.values[0])
intersection = top_k_list.intersection(labelled_duplicates)
if len(labelled_duplicates) != 0:
recalls.append(len(intersection) / len(labelled_duplicates))
100%|██████████| 209/209 [00:02<00:00, 104.50it/s]
import statistics
print("Mean for the retrieval recall is " + str(statistics.mean(recalls)))
print("Standard Deviation is " + str(statistics.stdev(recalls)))
Mean for the retrieval recall is 0.9702529886016125
Standard Deviation is 0.16219287104729735
Running the Classifier
We mentioned earlier that we perform deduplication in two steps: searching to produce candidates, then performing classification on them.
We will use a deduplication classifier based on MinHash LSH to detect duplicates among the results from the previous step. We will run it on a sample of the query results we got in the previous step; feel free to try it on the entire set of query results.
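Before running the classifier on the query results, here is a minimal, self-contained example (with made-up sentences) of what MinHash and LSH do: MinHash estimates the Jaccard similarity between token sets, and the LSH index returns keys whose estimated similarity to the query exceeds a threshold.
# Minimal MinHash + LSH example on toy sentences (illustrative only)
from gensim.utils import tokenize
from datasketch import MinHash, MinHashLSH

def build_minhash(text, num_perm=128, seed=5):
    m = MinHash(num_perm=num_perm, seed=seed)
    for token in set(tokenize(text)):
        m.update(token.encode("utf8"))
    return m

a = build_minhash("mutation of solute carrier slc16a12 associates with cataract")
b = build_minhash("mutation of the solute carrier slc16a12 associates with cataract")
c = build_minhash("three dimensional streaming flows driven by oscillatory walls")

print(a.jaccard(b))  # high estimated Jaccard similarity -> likely near-duplicates
print(a.jaccard(c))  # low estimated similarity -> not duplicates

# The LSH index returns inserted keys whose estimated similarity exceeds the threshold
lsh = MinHashLSH(threshold=0.7, num_perm=128)
lsh.insert("b", b)
lsh.insert("c", c)
print(lsh.query(a))  # expected to contain 'b' but not 'c'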
import pandas as pd
from gensim.utils import tokenize
from datasketch.minhash import MinHash
from datasketch.lsh import MinHashLSH
# Counters for correct/false predictions
all_predictions = {"Correct": 0, "False": 0}
predictions_per_category = {}
# From the results in the previous step, we will take a subset to test our classifier
query_sample = query_results[::10]
ids_sample = test_documents.core_id.to_list()[::10]
for id, res in zip(ids_sample, query_sample):
# Find document with id from the labelled dataset
labeled_df = df[df.core_id.astype(str) == str(id)]
"""
For every article in the result set, we store the scores and abstract of the articles most similar
to it, according to search in the previous step.
"""
df_result = pd.DataFrame(
{
"id": [match.id for match in res.matches],
"document": [match["metadata"]["processed_abstract"] for match in res.matches],
"score": [match.score for match in res.matches],
}
)
print(df_result.head())
# We need content and labels for our classifier which we can get from the df_results
content = df_result.document.values
labels = list(df_result.id.values)
# Create MinHash for each of the documents in result set
min_hashes = {}
for label, text in zip(labels, content):
m = MinHash(num_perm=128, seed=5)
tokens = set(tokenize(text))
for d in tokens:
m.update(d.encode('utf8'))
min_hashes[label] = m
# Create LSH index
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for i, j in min_hashes.items():
lsh.insert(str(i), j)
query_minhash = min_hashes[str(id)]
duplicates = lsh.query(query_minhash)
duplicates.remove(str(id))
# Check whether the prediction matches the labeled duplicates. Here the ground truth is the set of duplicates from our original dataset
prediction = (
"Correct"
if set(labeled_df.labelled_duplicates.values[0]) == set(duplicates)
else "False"
)
# Add to all predictions
all_predictions[prediction] += 1
# Create and/or add to the specific category based on number of duplicates in original dataset
num_of_duplicates = len(labeled_df.labelled_duplicates.values[0])
if num_of_duplicates not in predictions_per_category:
predictions_per_category[num_of_duplicates] = [0, 0]
if prediction == "Correct":
predictions_per_category[num_of_duplicates][0] += 1
else:
predictions_per_category[num_of_duplicates][1] += 1
# Print the results for a document
print(
"{}: expected: {}, predicted: {}, prediction: {}".format(
id, labeled_df.labelled_duplicates.values[0], duplicates, prediction
)
)
id document score
0 15080768 analyse centred methodology. discretisation so... 1.000000
1 52682462 audiencethe tissues pulses modelled compartmen... 0.787797
2 52900859 audiencethe tissues pulses modelled compartmen... 0.787797
3 2553555 multilayered illuminated acoustic electromagne... 0.781398
4 50544308 heterostructure schr dinger poisson numericall... 0.778778
15080768: expected: [], predicted: [], prediction: Correct
id document score
0 55110306 latrepirdine orally administered molecule init... 1.000000
1 188404434 cysteamine potentially numerous huntington dis... 0.903964
2 81634102 deutetrabenazine molecule deuterium attenuates... 0.880078
3 42021224 comorbidities. safe drugs available. efficacy ... 0.857741
4 78271101 promising prevent onset ultrahigh psychosis di... 0.849158
55110306: expected: [], predicted: [], prediction: Correct
id document score
0 10914205 read objectives schoolchildren sunscreen morni... 1.000000
1 77409456 overeating harmful alcohol tobacco aetiology c... 0.669037
2 10896024 sunlight cutaneous vitamin production. highlig... 0.633516
3 15070865 drink heavily nonstudent peers unaware drinkin... 0.633497
4 52131855 dette siste tekst versjon artikkelen inneholde... 0.627933
10914205: expected: [], predicted: [], prediction: Correct
id document score
0 43096919 publishedcomparative studymulticenter tcontext... 1.000000
1 77165332 cerebral amyloid aggregation pathological alzh... 0.871247
2 70343569 neurodegenerative heterogeneous disorders prog... 0.867806
3 18448676 beta amyloid beta deposition hallmarks alzheim... 0.855655
4 46964510 alzheimer unexplained. sought loci detect robu... 0.855137
43096919: expected: [], predicted: [], prediction: Correct
id document score
0 12203626 hypernatremia recipients homografts postoperat... 1.000000
1 82542813 abstractobjectivesto intravenous maintenance f... 0.800283
2 81206306 uromodulin tamm–horsfall abundant excreted uri... 0.794892
3 36026525 drinking sodium bicarbonated mineral cardiovas... 0.793452
4 83567081 drinking sodium bicarbonated mineral cardiovas... 0.793252
12203626: expected: [], predicted: [], prediction: Correct
id document score
0 15070865 drink heavily nonstudent peers unaware drinkin... 1.000000
1 154671698 updated alcohol suicidal level. searches retri... 0.889408
2 52132897 updated alcohol suicidal level. searches retri... 0.889408
3 43606482 fulltext .pdf publisher effectiveness drinking... 0.883402
4 82484980 abstractthe effectiveness drinking motive tail... 0.883145
15070865: expected: [], predicted: [], prediction: Correct
id document score
0 80341690 potentially inappropriate medicines pims older... 1.000000
1 39320843 elderly receive medications adverse effects. e... 0.807533
2 82162292 abstractbackgroundrisk assessments widely pred... 0.780006
3 77027179 assessments widely predict opioid disorder unc... 0.779406
4 153514317 yesbackground challenging person dementia. beh... 0.757255
80341690: expected: [], predicted: [], prediction: Correct
id document score
0 9066821 commotio retinae opacification retina blunt oc... 1.000000
1 78051578 neovascular macular degeneration anti–vascular... 0.731147
2 86422032 automated lesions challenging diagnostic lesio... 0.703925
3 48174418 audiencewe propose voxelwise images. relies ge... 0.699708
4 52434306 audiencewe propose voxelwise images. relies ge... 0.699708
9066821: expected: [], predicted: [], prediction: Correct
id document score
0 15052827 indirect schizophrenia australia incidence cos... 1.000000
1 154860392 illness schizophrenia bipolar disorder depress... 0.795662
2 51964867 audiencebackground cholesterol lowering jupite... 0.791904
3 75913230 thesis characterize burden cardiovascular deme... 0.775635
4 154672015 aims depression anxiety myocardial infarction ... 0.765936
15052827: expected: [], predicted: [], prediction: Correct
id document score
0 12203661 glomerulonephritis serious hemoptysis. antiglo... 1.000000
1 12204810 twenty alagille syndrome underwent transplanta... 0.811871
2 52198725 audiencepatients autoimmune polyendocrine synd... 0.810457
3 47112592 audiencepatients autoimmune polyendocrine synd... 0.810457
4 52460385 audiencepatients autoimmune polyendocrine synd... 0.810457
12203661: expected: [], predicted: [], prediction: Correct
id document score
0 11251086 unobstructed vision refractive lens differenti... 1.000000
1 82332306 unobstructed vision refractive lens differenti... 1.000000
2 61371524 aims osmotic oxidative progression advancement... 0.839048
3 59036307 aims osmotic oxidative progression advancement... 0.839048
4 11249430 dysfunction cilia nearly ubiquitously solitary... 0.796622
11251086: expected: ['82332306'], predicted: ['82332306'], prediction: Correct
id document score
0 12001088 presents vision successfully discriminates wee... 1.000000
1 148662402 presents vision successfully discriminates wee... 1.000000
2 148666025 proposes oriented crop maize weed pressure. vi... 0.904243
3 18424329 proposes oriented crop maize weed pressure. vi... 0.904243
4 18424394 proposes oriented identifying crop rows maize ... 0.861464
12001088: expected: ['148662402'], predicted: ['148662402'], prediction: Correct
id document score
0 11307919 reflectance exciton–polariton film polycrystal... 1.000000
1 147595688 reflectance exciton–polariton film polycrystal... 1.000000
2 147595695 photoluminescence reflectance oriented polycry... 0.816958
3 11307922 photoluminescence reflectance oriented polycry... 0.816958
4 33106913 macroscopic dielectric polycrystalline commonl... 0.804686
147595688: expected: ['11307919'], predicted: ['11307919'], prediction: Correct
id document score
0 12002296 thanks inherent probabilistic graphical prime ... 1.000000
1 148663921 thanks inherent probabilistic graphical prime ... 1.000000
2 52634130 audienceobject oriented brms platform automati... 0.869993
3 52294731 audienceobject oriented brms platform automati... 0.869993
4 34403460 acceptance artificial intelligence aims learn ... 0.865814
148663921: expected: ['12002296'], predicted: ['12002296'], prediction: Correct
id document score
0 151641478 stabilised soems unstable aircraft presented. ... 1.000000
1 11874260 stabilised soems unstable aircraft presented. ... 1.000000
2 29528077 projection snapshot balanced truncation unstab... 0.724496
3 77005252 projection snapshot balanced truncation unstab... 0.724496
4 148663435 ideas robust computationally amenable industri... 0.722027
151641478: expected: ['11874260'], predicted: ['11874260'], prediction: Correct
id document score
0 188365084 installed rapidly decade deployments deeper wa... 1.000000
1 158351487 installed rapidly decade deployments deeper wa... 1.000000
2 158370190 offshore turbine reliability biggest paper. un... 0.853790
3 83926778 offshore turbine reliability biggest paper. un... 0.853790
4 74226591 investigates overruns underruns occurring onsh... 0.834363
188365084: expected: ['158351487'], predicted: ['158351487'], prediction: Correct
id document score
0 2097371 propose vulnerability network. analogy balls l... 1.000000
1 9030380 propose vulnerability network. analogy balls l... 1.000000
2 49270269 audiencethis introduces validates sensor propa... 0.754055
3 43094896 peer reviewed brownjohn displacement sensor co... 0.745553
4 49271868 audiencea predictive giving displacement digit... 0.734554
2097371: expected: ['9030380'], predicted: ['9030380'], prediction: Correct
id document score
0 148674298 race segments swimmers. analysed finals sessio... 1.000000
1 33176265 race segments swimmers. analysed finals sessio... 1.000000
2 148674300 swimming race parameters. hundred fifty eight ... 0.886608
3 33176267 swimming race parameters. hundred fifty eight ... 0.886608
4 143900637 swimmers swimmers coaches trainers. video sens... 0.736030
33176265: expected: ['148674298'], predicted: ['148674298'], prediction: Correct
id document score
0 52844591 audiencehere geochemical lopevi volcano volcan... 1.000000
1 52308905 audiencehere geochemical lopevi volcano volcan... 1.000000
2 52722823 audiencehere geochemical lopevi volcano volcan... 1.000000
3 52717537 audiencethe volcanism cameroon volcanic mantle... 0.893717
4 52840980 audiencethe volcanism cameroon volcanic mantle... 0.893717
52308905: expected: ['52722823' '52844591'], predicted: ['52722823', '52844591'], prediction: Correct
id document score
0 44119402 lagrangian formalism supermembrane supergravit... 1.000000
1 35093363 lagrangian formalism supermembrane supergravit... 1.000000
2 2531039 lagrangian formalism supermembrane supergravit... 1.000000
3 35078501 lagrangian formalism supermembrane supergravit... 1.000000
4 35089833 supergravity correlators worldsheet analogous ... 0.847565
44119402: expected: ['2531039' '35078501' '35093363'], predicted: ['2531039', '35078501', '35093363'], prediction: Correct
id document score
0 52739626 microlensing surveys tens millions stars. unpr... 1.0
1 52456923 microlensing surveys tens millions stars. unpr... 1.0
2 47110549 microlensing surveys tens millions stars. unpr... 1.0
3 52695218 microlensing surveys tens millions stars. unpr... 1.0
4 152091185 microlensing surveys tens millions stars. unpr... 1.0
47110549: expected: ['46770666' '52456923' '152091185' '52695218' '52739626'], predicted: ['52456923', '52695218', '52739626', '152091185', '46770666'], prediction: Correct
all_predictions
{'Correct': 21, 'False': 0}
# Overall accuracy on a test
accuracy = round(
all_predictions["Correct"]
/ (all_predictions["Correct"] + all_predictions["False"]),
4,
)
accuracy
1.0
# Print the prediction count for each class depending on the number of duplicates in labeled dataset
pd.DataFrame.from_dict(
predictions_per_category, orient="index", columns=["Correct", "False"]
)
| | Correct | False |
|---|---|---|
| 0 | 10 | 0 |
| 1 | 8 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 5 | 1 | 0 |
Delete the Index
Delete the index once you are sure that you do not want to use it anymore. Once the index is deleted, you cannot use it again.
# Delete the index if it's not going to be used anymore
pinecone.delete_index(index_name)
Summary
In this notebook we demonstrated how to perform deduplication across 100,000 articles using Pinecone. With articles embedded as vectors, you can use Pinecone's vector index to find similar articles. For each query article, we then use an LSH classifier on the similar articles to identify duplicates. Overall, we show that it is easy to incorporate Pinecone with article embedding models and duplication classifiers to build a deduplication service.