This feature is in public preview.
With multimodal context, an assistant can extract and understand images in PDF files. This is useful for tasks such as:

- Analyzing charts, graphs, and diagrams in financial reports
- Understanding infographics and visual data in research papers
- Interpreting visual layouts in technical documentation
## How it works
When you enable multimodal context for a PDF:

- Pinecone extracts text and images (raster or vector) from the file and analyzes their contents. For each image, the assistant generates a descriptive caption and a set of keywords. Additionally, when it makes sense, the assistant captures data points found in the image (for example, values from a table or chart).
- During chat or context queries, the assistant searches for relevant text and image context it captured when analyzing the PDF. Image context can include the original image data (base64-encoded).
- The assistant passes this context to the LLM, which uses it to generate responses.
For an overview of how Pinecone Assistant works, see Pinecone Assistant architecture.
## Try it out
The following steps demonstrate how to create an assistant, provide it with a PDF that contains images, and then query that assistant using the chat and context APIs.

All versions of Pinecone’s Assistant API allow you to upload multimodal PDFs.
### 1. Create an assistant
First, if you don’t have one, create an assistant. You don’t need to create a new assistant to use multimodal context: existing assistants can enable multimodal context for newly uploaded PDFs, as described in the next section. A minimal sketch of creating an assistant follows below.
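The sketch below uses Python and `requests`. The assistant name, instructions, and the control-plane endpoint shown are assumptions based on the Assistant API reference; check the reference for the exact endpoint, headers, and API version for your project.

```python
import os
import requests

API_KEY = os.environ["PINECONE_API_KEY"]

# Create an assistant via the control plane.
# Endpoint and request body are assumptions; see the Assistant API reference.
resp = requests.post(
    "https://api.pinecone.io/assistant/assistants",
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json={
        "name": "example-assistant",  # hypothetical assistant name
        "instructions": "Answer questions about the uploaded reports.",
    },
)
resp.raise_for_status()
print(resp.json())
```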
### 2. Upload a multimodal PDF
To enable multimodal context for a PDF, set the `multimodal` URL parameter to `true` when uploading the file (it defaults to `false`), as in the sketch after the notes below.
- The `multimodal` parameter is only available for PDF files.
- To check the status of a file, use the describe a file upload endpoint.
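As a sketch, the upload is a multipart POST with `multimodal=true` as a URL parameter, followed by a status check. The data-plane host, assistant name, file name, and status values below are assumptions; use the host shown for your assistant and see the API reference for the exact endpoints.

```python
import os
import requests

API_KEY = os.environ["PINECONE_API_KEY"]
ASSISTANT_NAME = "example-assistant"  # hypothetical assistant name
HOST = "https://prod-1-data.ke.pinecone.io"  # your assistant host may differ

# Upload a PDF with multimodal context enabled via the multimodal URL parameter.
with open("quarterly-report.pdf", "rb") as f:  # hypothetical file
    resp = requests.post(
        f"{HOST}/assistant/files/{ASSISTANT_NAME}",
        headers={"Api-Key": API_KEY},
        params={"multimodal": "true"},
        files={"file": ("quarterly-report.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
file_info = resp.json()
print(file_info)

# Processing is asynchronous; check the file's status with the
# describe-a-file-upload endpoint (poll until it reaches a terminal status).
status = requests.get(
    f"{HOST}/assistant/files/{ASSISTANT_NAME}/{file_info['id']}",
    headers={"Api-Key": API_KEY},
)
status.raise_for_status()
print(status.json().get("status"))
```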
### 3. Chat with the assistant
Now, chat with your assistant. To tell the assistant to provide image-related context to the LLM (see the sketch after this list):

- Set the `multimodal` request parameter to `true` (the default) in the `context_options` object. Setting `multimodal` to `false` means the LLM only receives text snippets.
- When `multimodal` is `true`, use `include_binary_content` to specify what image context the LLM should receive: base64 image data and captions (`true`) or captions only (`false`).
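A minimal sketch of a chat request with `context_options`, again using Python and `requests`. The host, assistant name, question, and the response fields accessed are assumptions; consult the chat endpoint reference for the exact request and response shapes.

```python
import os
import requests

API_KEY = os.environ["PINECONE_API_KEY"]
ASSISTANT_NAME = "example-assistant"  # hypothetical assistant name
HOST = "https://prod-1-data.ke.pinecone.io"  # your assistant host may differ

# Chat request that lets the LLM receive image-related context.
# context_options.multimodal defaults to true; include_binary_content controls
# whether base64 image data is sent in addition to captions.
resp = requests.post(
    f"{HOST}/assistant/chat/{ASSISTANT_NAME}",
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json={
        "messages": [
            {"role": "user", "content": "What does the revenue chart on page 3 show?"}
        ],
        "context_options": {
            "multimodal": True,
            "include_binary_content": True,
        },
    },
)
resp.raise_for_status()
# Response shape may vary; see the API reference.
print(resp.json().get("message", {}).get("content"))
```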
Sending image-related context to the LLM (whether captions, base64 data, or both) increases token usage. Learn about monitoring spend and usage.
If your assistant uses multimodal context snippets to generate a response, no highlights are returned, even when `include_highlights` is `true`.

### 4. Query for context
To query context for a custom RAG workflow, you can retrieve context snippets directly and then pass them to an LLM as context. To fetch image-related context snippets (as well as text snippets), set the `multimodal` request parameter to `true` (the default). When `multimodal` is `true`, use `include_binary_content` to specify what image context you’d like to receive: base64 image data and captions (`true`) or captions only (`false`).
If you set `multimodal` to `true` and `include_binary_content` to `false`, image objects are not returned in the snippets. If you set `multimodal` to `false`, only text snippets are returned.

Snippets are returned based on their semantic relevance to the provided query. When you set `multimodal` to `true`, you’ll receive the most relevant snippets regardless of the types of content they contain: you can receive text snippets, multimodal snippets, or both. A sketch of a direct context query follows below.
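The sketch below shows a direct context query. The host, assistant name, query, and the placement of `multimodal` and `include_binary_content` at the top level of the request body are assumptions; check the context endpoint reference for the exact request and response shapes.

```python
import os
import requests

API_KEY = os.environ["PINECONE_API_KEY"]
ASSISTANT_NAME = "example-assistant"  # hypothetical assistant name
HOST = "https://prod-1-data.ke.pinecone.io"  # your assistant host may differ

# Retrieve context snippets directly for a custom RAG workflow.
# With multimodal=true and include_binary_content=false, image-derived snippets
# include captions but no base64 image data.
resp = requests.post(
    f"{HOST}/assistant/chat/{ASSISTANT_NAME}/context",
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json={
        "query": "Summarize the trends in the revenue chart",
        "multimodal": True,
        "include_binary_content": False,
    },
)
resp.raise_for_status()
# Snippet field names are assumptions; see the API reference.
for snippet in resp.json().get("snippets", []):
    print(snippet.get("type"), snippet.get("score"))
```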
## Limits

Multimodal context for assistants is only available for PDF files. Additionally, the following limits apply:

| Metric | Starter plan | Standard plan | Enterprise plan |
|---|---|---|---|
| Max file size | 10 MB | 50 MB | 50 MB |
| Page limit | 100 | 100 | 100 |
| Multimodal PDFs per assistant | 1 | 20 | 20 |