Encode sparse vectors
This page shows you how to use the Pinecone Text Client to encode sparse vectors for use in hybrid search.
To convert your own text corpus to sparse vectors, you can either use BM25 or SPLADE. This guide uses BM25, which is more common.
This feature is in public preview.
Install the Pinecone Text Client
The Pinecone Text Client is a public Python package that provides text utilities designed for seamless integration with Pinecone’s sparse-dense (hybrid) search.
To install the Pinecone Text Client, use the following command:
pip install pinecone-text
Initialize the encoder and fit the corpus
Before encoding sparse vectors for use in hybrid search, you must initialize the BM25 encoder and fit it to your corpus of documents.
Example
The following example initializes a BM25Encoder
object and calls the fit()
function on the corpus, formatted as an array of strings.
from pinecone_text.sparse import BM25Encoder
corpus = ["The quick brown fox jumps over the lazy dog",
"The lazy dog is brown",
"The fox is brown"]
# Initialize BM25 and fit the corpus.
bm25 = BM25Encoder()
bm25.fit(corpus)
If you want to use the default parameters for BM25Encoder
, you can call the default
method. The default parameters were fitted on the MS MARCO passage ranking dataset.
Example
bm25 = BM25Encoder.default()
Encode documents for Pinecone
After the encoder is initialized and fit, you can encode documents as sparse vectors.
Example
The following example encodes a new document as a sparse vector for upsert to a Pinecone index.
doc_sparse_vector = bm25.encode_documents("The brown fox is quick")
Now, the contents of doc_sparse_vector
are like the following:
{"indices": [102, 18, 12, ...], "values": [0.21, 0.38, 0.15, ...]}
You can encode a string as a sparse vector for use in a hybrid search query in the same way.
Example
The following example encodes a string as a sparse vector for use in a hybrid search query.
query_sparse_vector = bm25.encode_queries("Which fox is brown?")
Now, the contents of query_sparse_vector
are like the following:
{"indices": [102, 16, 18, ...], "values": [0.21, 0.11, 0.15, ...]}
See also
Was this page helpful?