Unstructured builds ETL tools for LLMs, including an open-source Python library, a SaaS API, and an ETL platform. Unstructured extracts content and metadata from 25+ document types, including PDFs, Word documents, and PowerPoints. It then applies further preprocessing steps for LLMs, such as chunking. Unstructured maintains upstream connections to data sources such as SharePoint and Google Drive, and downstream connections to databases such as Pinecone.

Integrating Pinecone with Unstructured enables developers to load data from any source or document type into Pinecone with a single click, which speeds up building LLM apps that connect to organizational data.
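
To make the chunking step mentioned above more concrete, here is a minimal sketch in plain Python. It does not use the Unstructured library or the Pinecone client: the `Element` class, the `chunk_elements` function, the `max_chars` parameter, and the sample file names are all hypothetical, chosen only to illustrate how extracted pieces of content and their metadata might be merged into size-limited chunks before being loaded into a vector database.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for one extracted piece of a document."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_elements(elements, max_chars=200):
    """Greedily merge consecutive elements into chunks of at most max_chars.

    An element longer than max_chars becomes its own oversized chunk.
    """
    groups = []
    current = []
    current_len = 0
    for el in elements:
        # The extra 1 accounts for the newline used to join elements.
        added = len(el.text) + (1 if current else 0)
        if current and current_len + added > max_chars:
            # Close the current chunk and start a new one with this element.
            groups.append(current)
            current, current_len = [], 0
            added = len(el.text)
        current.append(el)
        current_len += added
    if current:
        groups.append(current)

    # Keep the source file names alongside the text so each chunk
    # carries metadata about where it came from.
    return [
        {
            "text": "\n".join(e.text for e in group),
            "sources": sorted(
                {e.metadata.get("filename", "unknown") for e in group}
            ),
        }
        for group in groups
    ]


# Illustrative input: elements as they might look after extraction.
elements = [
    Element("Quarterly report", {"filename": "report.pdf"}),
    Element("Revenue grew in the third quarter.", {"filename": "report.pdf"}),
    Element("Roadmap", {"filename": "slides.pptx"}),
]
chunks = chunk_elements(elements, max_chars=55)
for chunk in chunks:
    print(chunk["sources"], repr(chunk["text"]))
```

In a real pipeline, each chunk's text would typically be embedded and written to the database together with its metadata. The greedy size limit used here is only one possible strategy and is not a description of how Unstructured chunks documents.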