A sample app for semantic search over PDF documents
$ npx create-pinecone-app@latest --template legal-semantic-search
The Legal Semantic Search app demonstrates how to programmatically bootstrap a custom knowledge base based on a Pinecone vector database with arbitrary PDF files included in the codebase. This app is focused on semantic search over legal documents, but this exact same technique and code can be applied to any content stored locally.
The fastest way to get started is to use the create-pinecone-app CLI tool with the command shown above.
You need an API key to make API calls to your Pinecone project. Generate one in the Pinecone console, then copy the generated key.
Alternatively, follow these steps:
Create a Pinecone index for this project. The index should have the following properties:
Dimensions: 1024 (the Voyage voyage-law-2 embeddings model has 1024 dimensions)
Metric: cosine
Region: us-east-1
You can create the index in the console, or by following the instructions here.
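If you prefer to create the index from code, the properties listed above map onto a small options object. The sketch below only builds and validates that object. The helper name buildIndexOptions is hypothetical, and the cloud provider ("aws") is an assumption, since this page only names the region.

```typescript
// Options for a serverless index, shaped the way the Pinecone TypeScript
// client's createIndex call expects them.
interface IndexOptions {
  name: string;
  dimension: number;
  metric: "cosine" | "euclidean" | "dotproduct";
  spec: { serverless: { cloud: "aws" | "gcp" | "azure"; region: string } };
}

// Hypothetical helper: builds the options for the index described above.
function buildIndexOptions(name: string): IndexOptions {
  if (name.trim() === "") {
    throw new Error("Index name must not be empty (set PINECONE_INDEX)");
  }
  return {
    name,
    dimension: 1024, // voyage-law-2 produces 1024-dimensional vectors
    metric: "cosine",
    // Assumption: "aws" is the cloud provider for the us-east-1 region.
    spec: { serverless: { cloud: "aws", region: "us-east-1" } },
  };
}

// Usage with the client (not executed here, requires a real API key):
//   import { Pinecone } from "@pinecone-database/pinecone";
//   const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
//   await pc.createIndex(buildIndexOptions(process.env.PINECONE_INDEX!));
```

Keeping the dimension in one place makes it harder to create an index whose dimension does not match the embeddings model.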
Requires Node version 20+
From the project root directory, run the following command.
Make sure you have populated the client .env with relevant keys.
Start the app.
In this example we opted to use a standard Next.js application structure.
Frontend Client
The frontend uses Next.js, Tailwind CSS, and custom React components to power the search experience. It also uses API routes to call the server, both to initiate bootstrapping of the Pinecone vector database as a knowledge store and to fetch relevant document chunks for the UI.
Backend Server
This project uses Next.js API routes to handle file chunking, upserting vectors, and providing context. Learn more about the implementation details below.
This project uses a basic semantic search architecture that achieves low-latency natural language search across all embedded documents. When the app is loaded, it performs background checks to determine if the Pinecone vector database needs to be created and populated.
Componentized suggested search interface
To make it easier for you to clone this app as a starting point and quickly adapt it to your own purposes, we’ve built the search interface as a component that accepts a list of suggested searches and renders them as a dropdown to help users find what they need:
You can define your suggested searches in your parent component:
This means you can pass in any suggested searches you wish given your specific use case.
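As a rough illustration of that contract, here is a framework-free sketch of the data flow. The prop names (suggestedSearches, onSearch) and both helper functions are hypothetical, not taken from the app's source. The real component renders the dropdown with React.

```typescript
// Hypothetical props: the list of suggestions plus a callback that runs a search.
interface SearchFormProps {
  suggestedSearches: string[];
  onSearch: (query: string) => void;
}

// Decide which suggestions the dropdown shows for the current input.
// Empty input shows the first `limit` suggestions unfiltered.
function visibleSuggestions(props: SearchFormProps, input: string, limit = 5): string[] {
  const needle = input.trim().toLowerCase();
  const matches =
    needle === ""
      ? props.suggestedSearches
      : props.suggestedSearches.filter((s) => s.toLowerCase().includes(needle));
  return matches.slice(0, limit);
}

// Clicking a suggestion runs it as a search.
function selectSuggestion(props: SearchFormProps, suggestion: string): void {
  props.onSearch(suggestion);
}
```

Because the suggestions arrive as plain data, swapping in searches for a different domain only requires changing the array passed by the parent.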
The SearchForm component is exported from src/components/SearchForm.tsx. It handles:
Local document processing via a bootstrapping service
We store several landmark legal cases as PDFs in the codebase, so that developers cloning and running the app locally can immediately build off the same experience being demonstrated by the legal semantic search app running on our Docs site.
We use LangChain to parse the PDFs, split them into chunks, and embed them. We store the resulting vectors in the Pinecone vector database.
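To make the chunking step concrete, here is a simplified stand-in: fixed-size character windows with overlap. It is not LangChain's splitting algorithm. The function name and the default sizes are assumptions chosen for illustration.

```typescript
// Split text into overlapping fixed-size windows.
// chunkSize and overlap are measured in characters (illustrative defaults).
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in the range [0, chunkSize)");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far each window advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window reaches the end, so no trailing fragment is duplicated.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

The overlap keeps a sentence that straddles a boundary present in both neighboring chunks, at the cost of embedding some text twice.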
Knowledge base bootstrapping
This project demonstrates how to programmatically bootstrap a knowledge base backed by a Pinecone vector database using arbitrary PDF files that are included in the codebase.
The sample app use case is focused on semantic search over legal documents, but this exact same technique and code can be applied to any content stored locally.
When a user accesses the app, it runs a check to determine if the bootstrapping procedure needs to be run.
If the Pinecone index does not already exist, or if it exists but does not yet contain vectors, the bootstrapping procedure is run.
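The rule above reduces to a small predicate. In this sketch the IndexState shape and the function name are hypothetical. In a real app the two fields would likely come from the Pinecone client (for example, from listing indexes and reading index stats).

```typescript
// Hypothetical snapshot of what the startup check has learned about the index.
interface IndexState {
  exists: boolean;      // does the index named by PINECONE_INDEX exist?
  vectorCount: number;  // number of vectors currently stored (0 if it does not exist)
}

// Bootstrapping runs when the index is missing, or present but empty.
function needsBootstrapping(state: IndexState): boolean {
  return !state.exists || state.vectorCount === 0;
}
```

Treating an empty index the same as a missing one lets an interrupted first run recover on the next page load.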
The bootstrapping procedure relies on:
The PINECONE_INDEX environment variable
The docs/db.json file
The docs directory
Domain-specific embeddings model
This app uses Voyage AI’s embeddings model, voyage-law-2, which is purpose-built for use with legal text. This app includes a small handful of landmark U.S. cases from Justia.
During the bootstrapping phase, the case documents are chunked and passed to Voyage’s embeddings model for embedding:
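One way to picture this step is as a series of batched embedding requests. The sketch below only builds request bodies in the shape of Voyage's embeddings endpoint (input, model, input_type) and sends nothing. The helper name and the batch size are assumptions, and the app itself may call the model through a library wrapper instead.

```typescript
// Body of a single embeddings request.
interface EmbeddingRequest {
  input: string[];
  model: string;
  input_type: "document" | "query";
}

// Hypothetical helper: group document chunks into batched request bodies.
// batchSize is an illustrative default, not a documented limit.
function buildDocumentRequests(chunks: string[], batchSize = 64): EmbeddingRequest[] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const requests: EmbeddingRequest[] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    requests.push({
      input: chunks.slice(i, i + batchSize),
      model: "voyage-law-2",
      input_type: "document", // chunks are stored content, not search queries
    });
  }
  return requests;
}
```

Batching cuts the number of round trips while keeping each request body to a bounded size.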
When the user executes a search, their query is sent to the /api/search route, which also uses Voyage’s embeddings model to convert the user’s query into query vectors:
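Once the query is a vector, the index ranks stored vectors by the cosine metric it was configured with. The in-memory sketch below illustrates that ranking on toy data. It is not how Pinecone is implemented or called, and all names are hypothetical.

```typescript
interface StoredVector {
  id: string;
  values: number[];
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k stored vectors most similar to the query, best first.
function topK(query: number[], vectors: StoredVector[], k: number): { id: string; score: number }[] {
  return vectors
    .map((v) => ({ id: v.id, score: cosineSimilarity(query, v.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Because the query and the documents are embedded by the same model, the nearest vectors correspond to the chunks closest in meaning to the query.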
Experiencing any issues with the sample app? Submit an issue, create a PR, or post in our community forum!