Document Deduplication


This notebook demonstrates how to use Pinecone's similarity search to create a simple application to identify duplicate documents.

The goal is to create a data deduplication application for eliminating near-duplicate copies of academic texts. In this example, we will perform the deduplication of a given text in two steps. First, we will retrieve a small set of candidate texts using a similarity-search service. Then, we will apply a near-duplicate detector to these candidates.

The similarity search will use a vector representation of the texts. With this, semantic similarity is translated to proximity in a vector space. For detecting near-duplicates, we will employ a classification model that examines the raw text.
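The overall flow can be summarized in a short sketch. The names below (embed, vector_index, near_dup_classifier) are placeholders for the embedding model, Pinecone index, and classifier we build later in this notebook, so treat this as an illustrative outline rather than final code.

# Illustrative sketch of the two-step pipeline (placeholder names, not the final code)
def find_duplicates(text, embed, vector_index, near_dup_classifier, top_k=100):
    # Step 1: similarity search narrows the full collection down to a few candidates
    candidates = vector_index.query(embed(text), top_k=top_k)
    # Step 2: a near-duplicate classifier examines the raw text of each candidate
    return [c for c in candidates if near_dup_classifier(text, c)]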

Install Dependencies

!pip install -qU pinecone-client
!pip install -qU datasketch mmh3 ipywidgets
!pip install -qU gensim==4.0.1
!pip install -qU sentence-transformers --no-cache-dir
!pip install -qU datasets

Download and Process Dataset

This tutorial will use the Deduplication Dataset 2020, which consists of 100,000 scholarly documents. We will use Hugging Face Datasets to download the dataset found at pinecone/core-2020-05-10-deduplication.

from datasets import load_dataset

core = load_dataset("pinecone/core-2020-05-10-deduplication", split="train")
core
Dataset({
    features: ['core_id', 'doi', 'original_abstract', 'original_title', 'processed_title', 'processed_abstract', 'cat', 'labelled_duplicates'],
    num_rows: 100000
})

We convert the dataset into a Pandas DataFrame like so:

df = core.to_pandas()
df.head()
core_id doi original_abstract original_title processed_title processed_abstract cat labelled_duplicates
0 11251086 10.1016/j.ajhg.2007.12.013 Unobstructed vision requires a particular refr... Mutation of solute carrier SLC16A12 associates... mutation of solute carrier slc16a12 associates... unobstructed vision refractive lens differenti... exact_dup [82332306]
1 11309751 10.1103/PhysRevLett.101.193002 Two-color multiphoton ionization of atomic hel... Polarization control in two-color above-thresh... polarization control in two-color above-thresh... multiphoton ionization helium combining extrem... exact_dup [147599753]
2 11311385 10.1016/j.ab.2011.02.013 Lectin’s are proteins capable of recognising a... Optimisation of the enzyme-linked lectin assay... optimisation of the enzyme-linked lectin assay... lectin’s capable recognising oligosaccharide t... exact_dup [147603441]
3 11992240 10.1016/j.jpcs.2007.07.063 In this work, we present a detailed transmissi... Vertical composition fluctuations in (Ga,In)(N... vertical composition fluctuations in (ga,in)(n... microscopy interfacial uniformity wells grown ... exact_dup [148653623]
4 11994990 10.1016/S0169-5983(03)00013-3 Three-dimensional (3D) oscillatory boundary la... Three-dimensional streaming flows driven by os... three-dimensional streaming flows driven by os... oscillatory attached deformable walls boundari... exact_dup [148656283]

We will use the following columns from the dataset for our task; a quick look at one example row follows the list.

  1. core_id - Unique identifier for each article.

  2. processed_abstract - Obtained by applying preprocessing steps (such as lowercasing and removing common stop words) to the article's original abstract from the original_abstract column.

  3. processed_title - The same preprocessing as the abstract, applied to the article's title.

  4. cat - Every article falls into one of three possible categories: 'exact_dup', 'near_dup', 'non_dup'.

  5. labelled_duplicates - A list of core_ids of articles that are duplicates of the current article.
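As a quick sanity check, we can inspect these columns for a single row (an illustrative snippet, not part of the original notebook):

# Show the columns we will rely on for the first article
df[["core_id", "processed_title", "processed_abstract", "cat", "labelled_duplicates"]].iloc[0]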

Let's calculate the frequency of duplicates per article. Observe that half of the articles have no duplicates, and only a small fraction of the articles have more than ten duplicates.

lens = df.labelled_duplicates.apply(len)
lens.value_counts()
0     50000
1     36166
2      7620
3      3108
4      1370
5       756
6       441
7       216
8       108
10       66
9        60
11       48
13       28
12       13
Name: labelled_duplicates, dtype: int64

Next, we truncate the processed abstracts so that the metadata we later upsert to Pinecone does not become excessively long.

# Make sure no processed abstracts are excessively long for upsert to Pinecone
df["processed_abstract"] = df["processed_abstract"].str[:8000]

We will make use of the text data to create vectors for every article. We combine the processed_abstract and processed_title of the article to create a new combined_text column.

# Define a new column for calculating embeddings
df["combined_text"] = df["processed_title"] + " " + df["processed_abstract"]

Initialize Pinecone Index

import pinecone

# Connect to pinecone environment
pinecone.init(
    api_key="YOUR_API_KEY",
    environment="YOUR_ENVIRONMENT"
)

# Pick a name for the new index
index_name = "deduplication"

# Check if the deduplication index exists
if index_name not in pinecone.list_indexes():
    # Create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=300,
        metadata_config={"indexed": ["processed_abstract"]}
    )

# Connect to deduplication index we created
index = pinecone.Index(index_name)

Get a free Pinecone API key if you don’t have one already. You can find your environment in the Pinecone console under API Keys.

Initialize Embedding Model

We will use the Average Word Embedding GloVe model to transform text into vector embeddings. We then upload the embeddings into the Pinecone vector index.

import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer("average_word_embeddings_glove.6B.300d", device=device)
model
SentenceTransformer(
  (0): WordEmbeddings(
    (emb_layer): Embedding(400001, 300)
  )
  (1): Pooling({'word_embedding_dimension': 300, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
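As an optional sanity check (our own addition, not strictly required), we can confirm that the model produces 300-dimensional vectors, matching the dimension we chose for the Pinecone index:

# Encode a single example sentence and inspect the embedding shape
sample_embedding = model.encode(["an example sentence for a dimension check"])
print(sample_embedding.shape)  # expected: (1, 300)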

Generate Embeddings and Upsert

from tqdm.auto import tqdm

# We will use batches of 256
batch_size = 256
for i in tqdm(range(0, len(df), batch_size)):
    # Find end of batch
    i_end = min(i+batch_size, len(df))
    # Extract batch
    batch = df.iloc[i:i_end]
    # Generate embeddings for batch
    emb = model.encode(batch["combined_text"].to_list()).tolist()
    # extract both indexed and not indexed metadata
    meta = batch[["processed_abstract"]].to_dict(orient="records")
    # create IDs
    ids = batch.core_id.astype(str)
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)
    
# check that we have all vectors in index
index.describe_index_stats()
100%|██████████| 391/391 [03:25<00:00, 2.47it/s]


{'dimension': 300,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 100000}}}

Searching for Candidates

Now that we have created vectors for the articles and inserted them into the index, we will create a test set for querying. For each article in the test set, we will query the index to retrieve the most similar articles; these are the candidates on which we will perform the next classification step.

Below, we list statistics of the number of duplicates per article in the resulting test set.

import math

# Create a sample from the dataset
SAMPLE_FRACTION = 0.002
test_documents = (
    df.groupby(df.labelled_duplicates.map(len))
    .apply(lambda x: x.head(math.ceil(len(x) * SAMPLE_FRACTION)))
    .reset_index(drop=True)
)

print("Number of documents with specified number of duplicates:")
lens = test_documents.labelled_duplicates.apply(len)
lens.value_counts()
Number of documents with specified number of duplicates:
0     100
1      73
2      16
3       7
4       3
5       2
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
Name: labelled_duplicates, dtype: int64
# Use the model to create embeddings for test articles, which will be the query vectors
query_vectors = model.encode(test_documents.combined_text.to_list()).tolist()
# Query the vector index
query_results = []
for xq in tqdm(query_vectors):
    query_res = index.query(xq, top_k=100, include_metadata=True)
    query_results.append(query_res)
100%|██████████| 209/209 [01:01<00:00, 3.54it/s]
# Save all retrieval recalls into a list
recalls = []

for id, res in tqdm(list(zip(test_documents.core_id.values, query_results))):
    # Find document with id in labelled dataset
    labeled_df = df[df.core_id.astype(str) == str(id)]
    # Calculate the retrieval recall
    top_k_list = set([match.id for match in res.matches])
    labelled_duplicates = set(labeled_df.labelled_duplicates.values[0])
    intersection = top_k_list.intersection(labelled_duplicates)
    if len(labelled_duplicates) != 0:
        recalls.append(len(intersection) / len(labelled_duplicates))
100%|██████████| 209/209 [00:02<00:00, 104.50it/s]
import statistics

print("Mean for the retrieval recall is " + str(statistics.mean(recalls)))
print("Standard Deviation is  " + str(statistics.stdev(recalls)))
Mean for the retrieval recall is 0.9702529886016125
Standard Deviation is  0.16219287104729735

Running the Classifier

We mentioned earlier that we perform deduplication in two steps: searching to produce candidates and performing classification on them.

We will use a deduplication classifier based on MinHash LSH (locality-sensitive hashing) to detect duplicates among the results from the previous step. We will run it on a sample of the query results we obtained above; feel free to try it on the entire set of query results.
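Before running the full loop, here is a minimal, self-contained illustration of the MinHash LSH idea on toy strings (an assumed example with made-up texts, separate from the classifier below): each text is reduced to a MinHash signature of its token set, and the LSH index returns the keys whose estimated Jaccard similarity with the query exceeds the threshold.

from datasketch import MinHash, MinHashLSH
from gensim.utils import tokenize

toy_texts = {
    "a": "unobstructed vision requires a particular refractive lens",
    "b": "unobstructed vision needs a particular refractive lens",  # near-duplicate of "a"
    "c": "three dimensional oscillatory boundary layer flows",      # unrelated
}

# Build one MinHash signature per text from its set of tokens
toy_signatures = {}
for key, text in toy_texts.items():
    m = MinHash(num_perm=128)
    for token in set(tokenize(text)):
        m.update(token.encode("utf8"))
    toy_signatures[key] = m

# Index the signatures; a query returns keys whose estimated Jaccard
# similarity with the query signature exceeds the threshold
toy_lsh = MinHashLSH(threshold=0.7, num_perm=128)
for key, m in toy_signatures.items():
    toy_lsh.insert(key, m)

print(toy_lsh.query(toy_signatures["a"]))  # likely ['a', 'b']; 'c' should not match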

import pandas as pd
from gensim.utils import tokenize
from datasketch.minhash import MinHash
from datasketch.lsh import MinHashLSH
# Counters for correct/false predictions
all_predictions = {"Correct": 0, "False": 0}
predictions_per_category = {}

# From the results in the previous step, we will take a subset to test our classifier
query_sample = query_results[::10]
ids_sample = test_documents.core_id.to_list()[::10]

for id, res in zip(ids_sample, query_sample):
    
    # Find document with id from the labelled dataset
    labeled_df = df[df.core_id.astype(str) == str(id)]

    """
    For every article in the result set, we store the scores and abstract of the articles most similar 
    to it, according to search in the previous step.
    """

    df_result = pd.DataFrame(
        {
            "id": [match.id for match in res.matches],
            "document": [match["metadata"]["processed_abstract"] for match in res.matches],
            "score": [match.score for match in res.matches],
        }
    )

    print(df_result.head())

    # We need the content and labels for our classifier, which we can get from df_result
    content = df_result.document.values
    labels = list(df_result.id.values)
    
    # Create MinHash for each of the documents in result set
    min_hashes = {}
    for label, text in zip(labels, content):
        m = MinHash(num_perm=128, seed=5)
        tokens = set(tokenize(text))
        for d in tokens:
            m.update(d.encode('utf8'))
        min_hashes[label] = m
    
    # Create LSH index
    lsh = MinHashLSH(threshold=0.7, num_perm=128)
    for i, j in min_hashes.items():
        lsh.insert(str(i), j)
    
    query_minhash = min_hashes[str(id)]
    duplicates = lsh.query(query_minhash)
    duplicates.remove(str(id))
    
    # Check whether the prediction matches the labelled duplicates. Here the ground truth is the set of duplicates from our original dataset
    prediction = (
        "Correct"
        if set(labeled_df.labelled_duplicates.values[0]) == set(duplicates)
        else "False"
    )
    
    # Add to all predictions
    all_predictions[prediction] += 1
    
    # Create and/or add to the specific category based on number of duplicates in original dataset
    num_of_duplicates = len(labeled_df.labelled_duplicates.values[0])
    if num_of_duplicates not in predictions_per_category:
        predictions_per_category[num_of_duplicates] = [0, 0]

    if prediction == "Correct":
        predictions_per_category[num_of_duplicates][0] += 1
    else:
        predictions_per_category[num_of_duplicates][1] += 1

    # Print the results for a document
    print(
        "{}: expected: {}, predicted: {}, prediction: {}".format(
            id, labeled_df.labelled_duplicates.values[0], duplicates, prediction
        )
    )
         id                                           document     score
0  15080768  analyse centred methodology. discretisation so...  1.000000
1  52682462  audiencethe tissues pulses modelled compartmen...  0.787797
2  52900859  audiencethe tissues pulses modelled compartmen...  0.787797
3   2553555  multilayered illuminated acoustic electromagne...  0.781398
4  50544308  heterostructure schr dinger poisson numericall...  0.778778
15080768: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   55110306  latrepirdine orally administered molecule init...  1.000000
1  188404434  cysteamine potentially numerous huntington dis...  0.903964
2   81634102  deutetrabenazine molecule deuterium attenuates...  0.880078
3   42021224  comorbidities. safe drugs available. efficacy ...  0.857741
4   78271101  promising prevent onset ultrahigh psychosis di...  0.849158
55110306: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  10914205  read objectives schoolchildren sunscreen morni...  1.000000
1  77409456  overeating harmful alcohol tobacco aetiology c...  0.669037
2  10896024  sunlight cutaneous vitamin production. highlig...  0.633516
3  15070865  drink heavily nonstudent peers unaware drinkin...  0.633497
4  52131855  dette siste tekst versjon artikkelen inneholde...  0.627933
10914205: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  43096919  publishedcomparative studymulticenter tcontext...  1.000000
1  77165332  cerebral amyloid aggregation pathological alzh...  0.871247
2  70343569  neurodegenerative heterogeneous disorders prog...  0.867806
3  18448676  beta amyloid beta deposition hallmarks alzheim...  0.855655
4  46964510  alzheimer unexplained. sought loci detect robu...  0.855137
43096919: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  12203626  hypernatremia recipients homografts postoperat...  1.000000
1  82542813  abstractobjectivesto intravenous maintenance f...  0.800283
2  81206306  uromodulin tamm–horsfall abundant excreted uri...  0.794892
3  36026525  drinking sodium bicarbonated mineral cardiovas...  0.793452
4  83567081  drinking sodium bicarbonated mineral cardiovas...  0.793252
12203626: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   15070865  drink heavily nonstudent peers unaware drinkin...  1.000000
1  154671698  updated alcohol suicidal level. searches retri...  0.889408
2   52132897  updated alcohol suicidal level. searches retri...  0.889408
3   43606482  fulltext .pdf publisher effectiveness drinking...  0.883402
4   82484980  abstractthe effectiveness drinking motive tail...  0.883145
15070865: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   80341690  potentially inappropriate medicines pims older...  1.000000
1   39320843  elderly receive medications adverse effects. e...  0.807533
2   82162292  abstractbackgroundrisk assessments widely pred...  0.780006
3   77027179  assessments widely predict opioid disorder unc...  0.779406
4  153514317  yesbackground challenging person dementia. beh...  0.757255
80341690: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0   9066821  commotio retinae opacification retina blunt oc...  1.000000
1  78051578  neovascular macular degeneration anti–vascular...  0.731147
2  86422032  automated lesions challenging diagnostic lesio...  0.703925
3  48174418  audiencewe propose voxelwise images. relies ge...  0.699708
4  52434306  audiencewe propose voxelwise images. relies ge...  0.699708
9066821: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   15052827  indirect schizophrenia australia incidence cos...  1.000000
1  154860392  illness schizophrenia bipolar disorder depress...  0.795662
2   51964867  audiencebackground cholesterol lowering jupite...  0.791904
3   75913230  thesis characterize burden cardiovascular deme...  0.775635
4  154672015  aims depression anxiety myocardial infarction ...  0.765936
15052827: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  12203661  glomerulonephritis serious hemoptysis. antiglo...  1.000000
1  12204810  twenty alagille syndrome underwent transplanta...  0.811871
2  52198725  audiencepatients autoimmune polyendocrine synd...  0.810457
3  47112592  audiencepatients autoimmune polyendocrine synd...  0.810457
4  52460385  audiencepatients autoimmune polyendocrine synd...  0.810457
12203661: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  11251086  unobstructed vision refractive lens differenti...  1.000000
1  82332306  unobstructed vision refractive lens differenti...  1.000000
2  61371524  aims osmotic oxidative progression advancement...  0.839048
3  59036307  aims osmotic oxidative progression advancement...  0.839048
4  11249430  dysfunction cilia nearly ubiquitously solitary...  0.796622
11251086: expected: ['82332306'], predicted: ['82332306'], prediction: Correct
          id                                           document     score
0   12001088  presents vision successfully discriminates wee...  1.000000
1  148662402  presents vision successfully discriminates wee...  1.000000
2  148666025  proposes oriented crop maize weed pressure. vi...  0.904243
3   18424329  proposes oriented crop maize weed pressure. vi...  0.904243
4   18424394  proposes oriented identifying crop rows maize ...  0.861464
12001088: expected: ['148662402'], predicted: ['148662402'], prediction: Correct
          id                                           document     score
0   11307919  reflectance exciton–polariton film polycrystal...  1.000000
1  147595688  reflectance exciton–polariton film polycrystal...  1.000000
2  147595695  photoluminescence reflectance oriented polycry...  0.816958
3   11307922  photoluminescence reflectance oriented polycry...  0.816958
4   33106913  macroscopic dielectric polycrystalline commonl...  0.804686
147595688: expected: ['11307919'], predicted: ['11307919'], prediction: Correct
          id                                           document     score
0   12002296  thanks inherent probabilistic graphical prime ...  1.000000
1  148663921  thanks inherent probabilistic graphical prime ...  1.000000
2   52634130  audienceobject oriented brms platform automati...  0.869993
3   52294731  audienceobject oriented brms platform automati...  0.869993
4   34403460  acceptance artificial intelligence aims learn ...  0.865814
148663921: expected: ['12002296'], predicted: ['12002296'], prediction: Correct
          id                                           document     score
0  151641478  stabilised soems unstable aircraft presented. ...  1.000000
1   11874260  stabilised soems unstable aircraft presented. ...  1.000000
2   29528077  projection snapshot balanced truncation unstab...  0.724496
3   77005252  projection snapshot balanced truncation unstab...  0.724496
4  148663435  ideas robust computationally amenable industri...  0.722027
151641478: expected: ['11874260'], predicted: ['11874260'], prediction: Correct
          id                                           document     score
0  188365084  installed rapidly decade deployments deeper wa...  1.000000
1  158351487  installed rapidly decade deployments deeper wa...  1.000000
2  158370190  offshore turbine reliability biggest paper. un...  0.853790
3   83926778  offshore turbine reliability biggest paper. un...  0.853790
4   74226591  investigates overruns underruns occurring onsh...  0.834363
188365084: expected: ['158351487'], predicted: ['158351487'], prediction: Correct
         id                                           document     score
0   2097371  propose vulnerability network. analogy balls l...  1.000000
1   9030380  propose vulnerability network. analogy balls l...  1.000000
2  49270269  audiencethis introduces validates sensor propa...  0.754055
3  43094896  peer reviewed brownjohn displacement sensor co...  0.745553
4  49271868  audiencea predictive giving displacement digit...  0.734554
2097371: expected: ['9030380'], predicted: ['9030380'], prediction: Correct
          id                                           document     score
0  148674298  race segments swimmers. analysed finals sessio...  1.000000
1   33176265  race segments swimmers. analysed finals sessio...  1.000000
2  148674300  swimming race parameters. hundred fifty eight ...  0.886608
3   33176267  swimming race parameters. hundred fifty eight ...  0.886608
4  143900637  swimmers swimmers coaches trainers. video sens...  0.736030
33176265: expected: ['148674298'], predicted: ['148674298'], prediction: Correct
         id                                           document     score
0  52844591  audiencehere geochemical lopevi volcano volcan...  1.000000
1  52308905  audiencehere geochemical lopevi volcano volcan...  1.000000
2  52722823  audiencehere geochemical lopevi volcano volcan...  1.000000
3  52717537  audiencethe volcanism cameroon volcanic mantle...  0.893717
4  52840980  audiencethe volcanism cameroon volcanic mantle...  0.893717
52308905: expected: ['52722823' '52844591'], predicted: ['52722823', '52844591'], prediction: Correct
         id                                           document     score
0  44119402  lagrangian formalism supermembrane supergravit...  1.000000
1  35093363  lagrangian formalism supermembrane supergravit...  1.000000
2   2531039  lagrangian formalism supermembrane supergravit...  1.000000
3  35078501  lagrangian formalism supermembrane supergravit...  1.000000
4  35089833  supergravity correlators worldsheet analogous ...  0.847565
44119402: expected: ['2531039' '35078501' '35093363'], predicted: ['2531039', '35078501', '35093363'], prediction: Correct
          id                                           document  score
0   52739626  microlensing surveys tens millions stars. unpr...    1.0
1   52456923  microlensing surveys tens millions stars. unpr...    1.0
2   47110549  microlensing surveys tens millions stars. unpr...    1.0
3   52695218  microlensing surveys tens millions stars. unpr...    1.0
4  152091185  microlensing surveys tens millions stars. unpr...    1.0
47110549: expected: ['46770666' '52456923' '152091185' '52695218' '52739626'], predicted: ['52456923', '52695218', '52739626', '152091185', '46770666'], prediction: Correct
all_predictions
{'Correct': 21, 'False': 0}
# Overall accuracy on a test
accuracy = round(
    all_predictions["Correct"]
    / (all_predictions["Correct"] + all_predictions["False"]),
    4,
)
accuracy
1.0
# Print the prediction count for each class depending on the number of duplicates in labeled dataset
pd.DataFrame.from_dict(
    predictions_per_category, orient="index", columns=["Correct", "False"]
)
Correct False
0 10 0
1 8 0
2 1 0
3 1 0
5 1 0

Delete the Index

Delete the index once you are sure that you do not want to use it anymore. Once the index is deleted, you cannot use it again.

# Delete the index if it's not going to be used anymore
pinecone.delete_index(index_name)

Summary

In this notebook we demonstrate how to perform a deduplication task across 100,000 articles using Pinecone. With articles embedded as vectors, you can use Pinecone's vector index to find similar articles. For each query article, we then use an LSH classifier on the similar articles to identify duplicates. Overall, we show that it is easy to incorporate Pinecone with article embedding models and duplicate classifiers to build a deduplication service.