This feature is in public preview.
With multimodal context, an assistant can extract and understand images in PDF files. This is useful for tasks such as:

- Analyzing charts, graphs, and diagrams in financial reports
- Understanding infographics and visual data in research papers
- Interpreting visual layouts in technical documentation
## How it works
When you enable multimodal context for a PDF:

- Pinecone extracts text and images (raster or vector) from the file and analyzes their contents. For each image, the assistant generates a descriptive caption and a set of keywords. Additionally, when it makes sense, the assistant captures data points found in the image (for example, values from a table or chart).
- During chat or context queries, the assistant searches for relevant text and image context it captured when analyzing the PDF. Image context can include the original image data (base64-encoded).
- The assistant passes this context to the LLM, which uses it to generate responses.
For an overview of how Pinecone Assistant works, see Pinecone Assistant architecture.
## Try it out
The following steps demonstrate how to create an assistant, provide it with a PDF that contains images, and then query that assistant using the chat and context APIs.

All versions of Pinecone’s Assistant API allow you to upload multimodal PDFs.
### 1. Create an assistant
First, if you don’t have one, create an assistant. You don’t need to create a new assistant to use multimodal context: existing assistants can enable multimodal context for newly uploaded PDFs, as described in the next section. A minimal sketch of creating an assistant follows below.
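The sketch below uses Python and `requests`. The assistant name, instructions, and the control-plane endpoint shown are assumptions based on the Assistant API reference; check the reference for the exact endpoint, headers, and API version for your project.

```python
import os
import requests

API_KEY = os.environ["PINECONE_API_KEY"]

# Create an assistant via the control plane.
# Endpoint and request body are assumptions; see the Assistant API reference.
resp = requests.post(
    "https://api.pinecone.io/assistant/assistants",
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json={
        "name": "example-assistant",  # hypothetical assistant name
        "instructions": "Answer questions about the uploaded reports.",
    },
)
resp.raise_for_status()
print(resp.json())
```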
### 2. Upload a multimodal PDF
To enable multimodal context for a PDF, set the `multimodal` URL parameter to `true` when uploading the file (it defaults to `false`), as in the sketch after the notes below.
- The `multimodal` parameter is only available for PDF files.
- To check the status of a file, use the describe a file upload endpoint.
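As a sketch, the upload is a multipart POST with `multimodal=true` as a URL parameter, followed by a status check. The data-plane host, assistant name, file name, and status values below are assumptions; use the host shown for your assistant and see the API reference for the exact endpoints.

```python
import os
import requests

API_KEY = os.environ["PINECONE_API_KEY"]
ASSISTANT_NAME = "example-assistant"  # hypothetical assistant name
HOST = "https://prod-1-data.ke.pinecone.io"  # your assistant host may differ

# Upload a PDF with multimodal context enabled via the multimodal URL parameter.
with open("quarterly-report.pdf", "rb") as f:  # hypothetical file
    resp = requests.post(
        f"{HOST}/assistant/files/{ASSISTANT_NAME}",
        headers={"Api-Key": API_KEY},
        params={"multimodal": "true"},
        files={"file": ("quarterly-report.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
file_info = resp.json()
print(file_info)

# Processing is asynchronous; check the file's status with the
# describe-a-file-upload endpoint (poll until it reaches a terminal status).
status = requests.get(
    f"{HOST}/assistant/files/{ASSISTANT_NAME}/{file_info['id']}",
    headers={"Api-Key": API_KEY},
)
status.raise_for_status()
print(status.json().get("status"))
```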
### 3. Chat with the assistant
Now, chat with your assistant. To tell the assistant to provide image-related context to the LLM (see the sketch after this list):

- Set the `multimodal` request parameter to `true` (the default) in the `context_options` object. Setting `multimodal` to `false` means the LLM only receives text snippets.
- When `multimodal` is `true`, use `include_binary_content` to specify what image context the LLM should receive: base64 image data and captions (`true`) or captions only (`false`).
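A minimal sketch of a chat request with `context_options`, again using Python and `requests`. The host, assistant name, question, and the response fields accessed are assumptions; consult the chat endpoint reference for the exact request and response shapes.

```python
import os
import requests

API_KEY = os.environ["PINECONE_API_KEY"]
ASSISTANT_NAME = "example-assistant"  # hypothetical assistant name
HOST = "https://prod-1-data.ke.pinecone.io"  # your assistant host may differ

# Chat request that lets the LLM receive image-related context.
# context_options.multimodal defaults to true; include_binary_content controls
# whether base64 image data is sent in addition to captions.
resp = requests.post(
    f"{HOST}/assistant/chat/{ASSISTANT_NAME}",
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json={
        "messages": [
            {"role": "user", "content": "What does the revenue chart on page 3 show?"}
        ],
        "context_options": {
            "multimodal": True,
            "include_binary_content": True,
        },
    },
)
resp.raise_for_status()
# Response shape may vary; see the API reference.
print(resp.json().get("message", {}).get("content"))
```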
Sending image-related context to the LLM (whether captions, base64 data, or both) increases token usage. Learn about monitoring spend and usage.
If your assistant uses multimodal context snippets to generate a response, no highlights are returned, even when `include_highlights` is `true`.

### 4. Query for context
To query context for a custom RAG workflow, you can retrieve context snippets directly and then pass them to an LLM as context. To fetch image-related context snippets (as well as text snippets), set the `multimodal` request parameter to `true` (the default). When `multimodal` is `true`, use `include_binary_content` to specify what image context you’d like to receive: base64 image data and captions (`true`) or captions only (`false`).
If you set `multimodal` to `true` and `include_binary_content` to `false`, image objects are not returned in the snippets. If you set `multimodal` to `false`, only text snippets are returned.

Snippets are returned based on their semantic relevance to the provided query. When you set `multimodal` to `true`, you’ll receive the most relevant snippets regardless of the types of content they contain: you can receive text snippets, multimodal snippets, or both. A sketch of a direct context query follows below.
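The sketch below shows a direct context query. The host, assistant name, query, and the placement of `multimodal` and `include_binary_content` at the top level of the request body are assumptions; check the context endpoint reference for the exact request and response shapes.

```python
import os
import requests

API_KEY = os.environ["PINECONE_API_KEY"]
ASSISTANT_NAME = "example-assistant"  # hypothetical assistant name
HOST = "https://prod-1-data.ke.pinecone.io"  # your assistant host may differ

# Retrieve context snippets directly for a custom RAG workflow.
# With multimodal=true and include_binary_content=false, image-derived snippets
# include captions but no base64 image data.
resp = requests.post(
    f"{HOST}/assistant/chat/{ASSISTANT_NAME}/context",
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json={
        "query": "Summarize the trends in the revenue chart",
        "multimodal": True,
        "include_binary_content": False,
    },
)
resp.raise_for_status()
# Snippet field names are assumptions; see the API reference.
for snippet in resp.json().get("snippets", []):
    print(snippet.get("type"), snippet.get("score"))
```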
## Limits

Multimodal context for assistants is only available for PDF files. Additionally, the following limits apply:

| Metric | Starter plan | Standard plan | Enterprise plan |
|---|---|---|---|
| Max file size | 10 MB | 50 MB | 50 MB |
| Page limit | 100 | 100 | 100 |
| Multimodal PDFs per assistant | 1 | 20 | 20 |