After uploading files to an assistant, you can chat with the assistant.

This page shows you how to chat with an assistant using the OpenAI-compatible chat interface. This interface is based on the OpenAI Chat Completion API, a widely adopted standard. It is useful if you need inline citations or OpenAI-compatible responses, but it offers limited functionality compared to the standard chat interface.

The standard chat interface is the recommended way to chat with an assistant, as it offers more functionality and control over the assistant’s responses and references.

Chat with an assistant

The OpenAI-compatible chat interface can return responses in two different formats:

  • Default response: The assistant returns a response in a single string field, which includes citation information.
  • Streaming response: The assistant returns the response as a text stream.

Default response

The following example sends a message and requests a response in the default format:

The content parameter in the request cannot be empty.

# To use the Python SDK, install the plugin:
# pip install --upgrade pinecone pinecone-plugin-assistant

from pinecone import Pinecone
from pinecone_plugins.assistant.models.chat import Message

pc = Pinecone(api_key="YOUR_API_KEY")

# Get your assistant.
assistant = pc.assistant.Assistant(
    assistant_name="example-assistant", 
)

# Chat with the assistant.
chat_context = [Message(role="user", content="What is the maximum height of a red pine?")]
response = assistant.chat_completions(messages=chat_context)

print(response)

The example above returns a result like the following:

{"chat_completion":
  {
    "id":"chatcmpl-9OtJCcR0SJQdgbCDc9JfRZy8g7VJR",
    "choices":[
      {
        "finish_reason":"stop",
        "index":0,
        "message":{
          "role":"assistant",
          "content":"The maximum height of a red pine (Pinus resinosa) is up to 25 meters."
        }
      }
    ],
    "model":"my_assistant"
  }
}

Streaming response

The following example sends a message and requests a streaming response:

The content parameter in the request cannot be empty.

# To use the Python SDK, install the plugin:
# pip install --upgrade pinecone pinecone-plugin-assistant

from pinecone import Pinecone
from pinecone_plugins.assistant.models.chat import Message

pc = Pinecone(api_key="YOUR_API_KEY")

# Get your assistant.
assistant = pc.assistant.Assistant(
    assistant_name="example-assistant" 
)

# Chat with the assistant and stream the response.
chat_context = [Message(role="user", content="What is the maximum height of a red pine?")]
response = assistant.chat_completions(messages=chat_context, stream=True)

for data in response:
    if data:
        print(data)

The example above returns a result like the following:

{
  'id': '000000000000000009de65aa87adbcf0',
  'choices': [
    {
      'index': 0,
      'delta': {
        'role': 'assistant',
        'content': 'The'
      },
      'finish_reason': None
    }
  ],
  'model': 'gpt-4o-2024-05-13'
}

...

{
  'id': '00000000000000007a927260910f5839',
  'choices': [
    {
      'index': 0,
      'delta': {
        'role': '',
        'content': 'The'
      },
      'finish_reason': None
    }
  ],
  'model': 'gpt-4o-2024-05-13'
}

...

{
  'id': '00000000000000007a927260910f5839',
  'choices': [
    {
      'index': 0,
      'delta': {
        'role': None,
        'content': None
      },
      'finish_reason': 'stop'
    }
  ],
  'model': 'gpt-4o-2024-05-13'
}

There are three types of messages in a chat completion response:

  • Message start: Includes "role":"assistant", which indicates that the assistant is responding to the user’s message.
  • Content: Includes a value in the content field (e.g., "content":"The"), which is part of the assistant’s streamed response to the user’s message.
  • Message end: Includes "finish_reason":"stop", which indicates that the assistant has finished responding to the user’s message.
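
As an illustration, you can branch on these fields while iterating over the stream. The following is a minimal sketch, reusing the response stream from the streaming example above and assuming each chunk exposes the choices[0].delta fields shown in the sample output:

# Minimal sketch: react to each message type in the stream.
# Assumes `response` is the stream returned by chat_completions(..., stream=True).
for data in response:
    if not data:
        continue
    choice = data.choices[0]
    if choice.delta.role == "assistant":
        # Message start: the assistant is beginning its response.
        pass
    if choice.delta.content:
        # Content: print each streamed fragment as it arrives.
        print(choice.delta.content, end="")
    if choice.finish_reason == "stop":
        # Message end: the assistant has finished responding.
        break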

Extract the response content

In the assistant’s response, the message string is contained in the following JSON object:

  • choices[0].message.content for the default chat response
  • choices[0].delta.content for the streaming chat response

You can extract the message content and print it to the console:

print(response.choices[0].message.content)

This creates output like the following:

A red pine, scientifically known as *Pinus resinosa*, is a medium-sized tree that can grow up to 25 meters high and 75 centimeters in diameter. [1, pp. 1]
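
For a streaming response, the content arrives in fragments, so you concatenate choices[0].delta.content across chunks instead. A minimal sketch, again assuming the response stream from the streaming example above:

# Join the streamed fragments into the full message string.
full_content = "".join(
    data.choices[0].delta.content
    for data in response
    if data and data.choices[0].delta.content
)
print(full_content)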

Choose a model

Pinecone Assistant supports the following models:

  • gpt-4o (default)
  • gpt-4.1
  • o4-mini
  • claude-3-5-sonnet
  • claude-3-7-sonnet
  • gemini-2.5-pro

To choose a non-default model for your assistant, set the model parameter in the request:

# To use the Python SDK, install the plugin:
# pip install --upgrade pinecone pinecone-plugin-assistant

from pinecone import Pinecone
from pinecone_plugins.assistant.models.chat import Message

pc = Pinecone(api_key="YOUR_API_KEY")

# Get your assistant.
assistant = pc.assistant.Assistant(
    assistant_name="example-assistant", 
)

# Chat with the assistant.
chat_context = [Message(role="user", content="What is the maximum height of a red pine?")]
response = assistant.chat_completions(
    messages=chat_context, 
    model="gpt-4.1"
)

Filter chat with metadata

You can filter which documents the assistant uses for chat completions. The following example restricts the response to documents that include the metadata "resource": "encyclopedia".

# To use the Python SDK, install the plugin:
# pip install --upgrade pinecone pinecone-plugin-assistant

from pinecone import Pinecone
from pinecone_plugins.assistant.models.chat import Message

pc = Pinecone(api_key="YOUR_API_KEY")

# Get your assistant.
assistant = pc.assistant.Assistant(
    assistant_name="example-assistant", 
)

# Chat with the assistant.
chat_context = [Message(role="user", content="What is the maximum height of a red pine?")]
response = assistant.chat_completions(
    messages=chat_context,
    stream=True,
    filter={"resource": "encyclopedia"}
)

# Print each streamed chunk as it arrives.
for data in response:
    if data:
        print(data)

Set the sampling temperature

This is available in API versions 2025-04 and later.

Temperature is a parameter that controls the randomness of a model’s predictions during text generation. Lower temperatures (~0.0) yield more consistent, predictable answers, while higher temperatures produce more varied output, which is generally better for creative tasks.

To control the sampling temperature for a model, set the temperature parameter in the request. If a model does not support a temperature parameter, the parameter is ignored.

# To use the Python SDK, install the plugin:
# pip install --upgrade pinecone pinecone-plugin-assistant

from pinecone import Pinecone
from pinecone_plugins.assistant.models.chat import Message

pc = Pinecone(api_key="YOUR_API_KEY")
assistant = pc.assistant.Assistant(assistant_name="example-assistant")

msg = Message(role="user", content="Who is the CFO of Netflix?")
response = assistant.chat_completions(
    messages=[msg], 
    temperature=0.8
)

print(response)