This page shows you how to chat with an assistant. There are two operations you can use:

  • chat_assistant: This is the recommended way to chat with an assistant, as it offers more functionality and control over the assistant’s responses and references than the chat_completion_assistant operation. For more information, see Chat with an assistant.

  • chat_completion_assistant: This operation is based on the OpenAI Chat Completion API, a commonly used and adopted API. It is useful if you need inline citations or OpenAI-compatible responses, but has limited functionality compared to the chat_assistant operation. For more information, see Chat through an OpenAI-compatible interface.

You can also chat with an assistant in the Pinecone console: select the assistant you want to chat with, and use the Assistant playground.

This feature is in public preview.

Chat with an assistant

To chat with a Pinecone assistant, use the chat_assistant endpoint. It returns either a JSON object or a text stream.

This is the recommended way to chat with an assistant, as it offers more functionality and control over the assistant’s responses and references. However, if you need your assistant to be OpenAI-compatible or need inline citations, use the chat_completion_assistant endpoint.

Request a JSON response

The following example requests a JSON response to the message, “What is the inciting incident in Pride and Prejudice?”:

The content parameter in the request cannot be empty.
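
The following minimal sketch sends the request with Python’s requests library. The data-plane host and the assistant name are placeholder assumptions; substitute the values for your own assistant:

Python
import requests

# Placeholder values: substitute your own API key, data-plane host, and assistant name.
API_KEY = "YOUR_API_KEY"
ASSISTANT_NAME = "example-assistant"
url = f"https://prod-1-data.ke.pinecone.io/assistant/chat/{ASSISTANT_NAME}"

payload = {
    # The content parameter cannot be empty.
    "messages": [
        {"role": "user", "content": "What is the inciting incident in Pride and Prejudice?"}
    ],
    "stream": False,  # request a single JSON object rather than a stream
}

response = requests.post(
    url,
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json=payload,
)
print(response.json())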

The example above returns a result like the following:

JSON
{
  "finish_reason": "stop",
  "message": {
    "role": "\"assistant\"",
    "content": "The inciting incident of \"Pride and Prejudice\" occurs when Mrs. Bennet informs Mr. Bennet that Netherfield Park has been let at last, and she is eager to share the news about the new tenant, Mr. Bingley, who is wealthy and single. This sets the stage for the subsequent events of the story, including the introduction of Mr. Bingley and Mr. Darcy to the Bennet family and the ensuing romantic entanglements."
  },
  "id": "00000000000000004ac3add5961aa757",
  "model": "gpt-4o-2024-05-13",
  "usage": {
    "prompt_tokens": 9736,
    "completion_tokens": 105,
    "total_tokens": 9841
  },
  "citations": [
    {
      "position": 406,
      "references": [
        {
          "file": {
            "status": "Available",
            "id": "ae79e447-b89e-4994-994b-3232ca52a654",
            "name": "Pride-and-Prejudice.pdf",
            "size": 2973077,
            "metadata": null,
            "updated_on": "2024-06-14T15:01:57.385425746Z",
            "created_on": "2024-06-14T15:01:02.910452398Z",
            "percent_done": 0,
            "signed_url": "https://storage.googleapis.com/..."
          },
          "pages": [
            1
          ]
        }
      ]
    }
  ]
}

Request a streaming response

The following example requests a streaming response to the message, “What is the inciting incident in Pride and Prejudice?”:

The content parameter in the request cannot be empty.
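
A streaming sketch with the same assumed host and assistant name as above; setting stream to true in the body and streaming the HTTP response yields the data: chunks shown below:

Python
import requests

# Same placeholder values as the JSON example above.
API_KEY = "YOUR_API_KEY"
ASSISTANT_NAME = "example-assistant"
url = f"https://prod-1-data.ke.pinecone.io/assistant/chat/{ASSISTANT_NAME}"

payload = {
    "messages": [
        {"role": "user", "content": "What is the inciting incident in Pride and Prejudice?"}
    ],
    "stream": True,  # request a stream of chunks instead of one JSON object
}

with requests.post(
    url,
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json=payload,
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:  # skip keep-alive blank lines
            print(line.decode("utf-8"))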

The example above returns a result like the following:

data:{"type":"message_start","id":"0000000000000000111b35de85e8a8f9","model":"gpt-4o-2024-05-13","role":"assistant"}

data:{"type":"content_chunk","id":"0000000000000000111b35de85e8a8f9","model":"gpt-4o-2024-05-13","delta":{"content":"The"}}

...

data:{"type":"citation","id":"0000000000000000111b35de85e8a8f9","model":"gpt-4o-2024-05-13","citation":{"position":406,"references":[{"file":{"status":"Available","id":"ae79e447-b89e-4994-994b-3232ca52a654","name":"Pride-and-Prejudice.pdf","size":2973077,"metadata":null,"updated_on":"2024-06-14T15:01:57.385425746Z","created_on":"2024-06-14T15:01:02.910452398Z","percent_done":0.0,"signed_url":"https://storage.googleapis.com/..."},"pages":[1]}]}}

data:{"type":"message_end","id":"0000000000000000111b35de85e8a8f9","model":"gpt-4o-2024-05-13","finish_reason":"stop","usage":{"prompt_tokens":9736,"completion_tokens":102,"total_tokens":9838}}

There are four types of chunks in a streaming chat response:

  • Starting chunk: Includes "role":"assistant", which indicates that the assistant is responding to the user’s message.
  • Content chunk: Includes a value in the content field (e.g., "content":"The"), which is part of the assistant’s streamed response to the user’s message.
  • Citation chunk: Includes a citation to the document that the assistant used to generate the response.
  • Ending chunk: Includes "finish_reason":"stop", which indicates that the assistant has finished responding to the user’s message.
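
As an illustration, a handler like the following sketch could dispatch on these chunk types; it assumes each streamed line carries the data: prefix shown in the example above:

Python
import json

def handle_chunk(raw_line: str) -> None:
    """Dispatch one streamed line based on its chunk type."""
    if not raw_line.startswith("data:"):
        return
    chunk = json.loads(raw_line[len("data:"):])
    chunk_type = chunk.get("type")
    if chunk_type == "message_start":
        print(f"assistant is responding (model: {chunk['model']})")
    elif chunk_type == "content_chunk":
        print(chunk["delta"]["content"], end="")  # print tokens as they arrive
    elif chunk_type == "citation":
        files = [ref["file"]["name"] for ref in chunk["citation"]["references"]]
        print(f"\n[citation: {', '.join(files)}]")
    elif chunk_type == "message_end":
        print(f"\nfinished ({chunk['finish_reason']}), usage: {chunk['usage']}")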

Chat through an OpenAI-compatible interface

The chat_completion_assistant endpoint is based on the OpenAI Chat Completion API, a commonly used and adopted API. It is useful if you need inline citations or OpenAI-compatible responses, but has limited functionality compared to the chat_assistant endpoint. It returns either a JSON object or a text stream.

If you do not need OpenAI-compatible responses or inline citations, use the chat_assistant endpoint instead.

Request a JSON response

The following example requests a JSON response to the message, “What is the maximum height of a red pine?”:

The content parameter in the request cannot be empty.
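
The following sketch sends the request with Python’s requests library. The host and assistant name are placeholders, and the OpenAI-style chat/completions path appended to the assistant URL is an assumption, so check the API reference for the exact route:

Python
import requests

# Placeholder values: substitute your own API key, data-plane host, and assistant name.
API_KEY = "YOUR_API_KEY"
ASSISTANT_NAME = "example-assistant"
# The OpenAI-compatible route below is an assumption; verify it in the API reference.
url = f"https://prod-1-data.ke.pinecone.io/assistant/chat/{ASSISTANT_NAME}/chat/completions"

payload = {
    # The content parameter cannot be empty.
    "messages": [
        {"role": "user", "content": "What is the maximum height of a red pine?"}
    ],
    "stream": False,
}

response = requests.post(
    url,
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json=payload,
)
print(response.json())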

The example above returns a result like the following:

{"chat_completion":
  {
    "id":"chatcmpl-9OtJCcR0SJQdgbCDc9JfRZy8g7VJR",
    "choices":[
      {
        "finish_reason":"stop",
        "index":0,
        "message":{
          "role":"assistant",
          "content":"The maximum height of a red pine (Pinus resinosa) is up to 25 meters."
        }
      }
    ],
    "model":"my_assistant"
  }
}

Request a streaming response

The following example requests a text streaming response to the message, “What is the maximum height of a red pine?”:

The content parameter in the request cannot be empty.
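
The streaming variant of the same sketch sets stream to true in the body and reads chunks as they arrive; the host, assistant name, and endpoint path are the same assumptions as above:

Python
import requests

# Same placeholder values and assumed endpoint path as the JSON example above.
API_KEY = "YOUR_API_KEY"
ASSISTANT_NAME = "example-assistant"
url = f"https://prod-1-data.ke.pinecone.io/assistant/chat/{ASSISTANT_NAME}/chat/completions"

payload = {
    "messages": [
        {"role": "user", "content": "What is the maximum height of a red pine?"}
    ],
    "stream": True,  # stream chunks instead of returning one JSON object
}

with requests.post(
    url,
    headers={"Api-Key": API_KEY, "Content-Type": "application/json"},
    json=payload,
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:
            print(line.decode("utf-8"))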

The example above returns a result like the following:

{
  'id': '000000000000000009de65aa87adbcf0', 
  'choices': [
      {
      'index': 0, 
      'delta': 
        {
        'role': 'assistant', 
        'content': 'The'
        }, 
      'finish_reason': None
      }
    ], 
  'model': 'gpt-4o-2024-05-13'
}

...

{
  'id': '00000000000000007a927260910f5839',
  'choices': [
      {
      'index': 0,
      'delta':
        {
          'role': '', 
          'content': 'The'
        }, 
      'finish_reason': None
      }
    ], 
  'model': 'gpt-4o-2024-05-13'
}

...

{
  'id': '00000000000000007a927260910f5839', 
  'choices': [
    {
      'index': 0, 
      'delta': 
        {
        'role': None, 
        'content': None
        }, 
      'finish_reason': 'stop'
      }
    ], 
  'model': 'gpt-4o-2024-05-13'
}

There are three types of chunks in a streaming chat completion response:

  • Starting chunk: Includes "role":"assistant", which indicates that the assistant is responding to the user’s message.
  • Content chunk: Includes a value in the content field (e.g., "content":"The"), which is part of the assistant’s streamed response to the user’s message.
  • Ending chunk: Includes "finish_reason":"stop", which indicates that the assistant has finished responding to the user’s message.

Provide conversation history in a chat request

Models lack memory of previous requests, so any relevant messages from earlier in the conversation must be present in the messages object.

In the following example, the messages object includes prior messages that are necessary for interpreting the newest message.
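
A sketch of such a request body follows; the earlier question-and-answer turns are illustrative assumptions, included so the model can resolve “its” in the newest message:

Python
# The messages list carries the whole conversation so far; without the earlier
# turns, the model could not tell what "its" refers to in the final question.
payload = {
    "messages": [
        {"role": "user", "content": "What is the maximum height of a red pine?"},
        {
            "role": "assistant",
            "content": "The maximum height of a red pine (Pinus resinosa) is up to 25 meters.",
        },
        {"role": "user", "content": "What is its maximum diameter?"},
    ],
    "stream": False,
}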

The above example request returns a response like the following:

{"chat_completion":
  {
    "id":"chatcmpl-9OtJCcR0SJQdgbCDc9JfRZy8g7VJR",
    "choices":[
      {
        "finish_reason":"stop",
        "index":0,
        "message":{
          "role":"assistant",
          "content":"The maximum diameter of a red pine (Pinus resinosa) is 75 centimeters [1, pp. 1]"
        }
      }
    ],
    "model":"my_assistant"
  }
}

Filter chat with metadata

You can filter which documents to use for chat completions. The following example filters the responses to use only documents that include the metadata "resource": "encyclopedia".
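
A sketch of a request body carrying such a filter; the filter field name follows the metadata filtering conventions linked below, so verify it against the API reference:

Python
payload = {
    "messages": [
        {"role": "user", "content": "What is the maximum height of a red pine?"}
    ],
    # Use only documents whose metadata includes "resource": "encyclopedia".
    "filter": {"resource": "encyclopedia"},
}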

For more information about filtering with metadata, see Filter with metadata.

Choose a model for your assistant

Pinecone Assistant uses the gpt-4o model by default. Alternatively, you can use the claude-3-5-sonnet model. Select the LLM to use by setting the model parameter in the request:
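
For example, a request body selecting claude-3-5-sonnet might look like the following sketch:

Python
payload = {
    "messages": [
        {"role": "user", "content": "What is the maximum height of a red pine?"}
    ],
    # gpt-4o is the default; claude-3-5-sonnet is the supported alternative.
    "model": "claude-3-5-sonnet",
}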

Extract the response content

Both the chat_assistant and chat_completion_assistant operations return a JSON response object containing the assistant’s chat response along with other information. For the chat_completion_assistant operation, the message string is contained in the following JSON object:

  • choices[0].message.content for a JSON chat response
  • choices[0].delta.content for a streaming chat response

For the chat_assistant operation, the equivalent fields are message.content in a JSON response and delta.content in a streaming content chunk, as shown in the examples above.

You can extract the message content and print it to the console:
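
A minimal sketch, assuming the chat completion response shapes shown in the examples above (if your client wraps the JSON response in a chat_completion key, unwrap it first):

Python
def extract_content(chunk: dict) -> str:
    """Pull the message text out of a chat completion object or stream chunk."""
    choice = chunk["choices"][0]
    if "message" in choice:  # full JSON response
        return choice["message"]["content"] or ""
    return choice["delta"].get("content") or ""  # streaming chunk

# With the JSON response from the earlier sketches:
# print(extract_content(response.json()))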

This creates output like the following:

JSON response
A red pine, scientifically known as *Pinus resinosa*, is a medium-sized tree that can grow up to 25 meters high and 75 centimeters in diameter. [1, pp. 1]