Overview
nvidia/llama-text-embed-v2 is a state-of-the-art embedding model available natively in Pinecone Inference. Developed by NVIDIA Research, it is built on the Llama 3.2 1B architecture and optimized for high retrieval quality with low-latency inference. Also known as llama-3_2-nv-embedqa-1b-v2, the model distills techniques from NVIDIA’s industry-leading NV-2 (7B parameters) into an efficient, production-ready solution.
- Retrieval quality: The model surpasses OpenAI’s text-embedding-3-large across multiple benchmarks, in some cases improving accuracy by more than 20%
- Real-time queries: Predictable and consistent query speeds for responsive search, with p99 latencies 12x faster than OpenAI’s text-embedding-3-large
- Multilingual: Supports 26 languages, including English, Spanish, Chinese, Hindi, Japanese, Korean, French, and German
You can call the embed operation through Pinecone Inference to turn text into vectors without writing to an index. This differs from upsert_records on an index with integrated embedding, where each request embeds and stores records in one step. To see how embedding consumption appears in billing and usage reports, see Embedding tokens.
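As a rough illustration of a standalone embed call, the sketch below builds the HTTP request with the Python standard library only. Several details are assumptions that do not come from this page and should be checked against the Pinecone API reference: the endpoint URL, the `X-Pinecone-API-Version` value, the model identifier string `llama-text-embed-v2`, and the `input_type` and `truncate` parameter names and values. The helper names `build_embed_request` and `embed` are hypothetical.

```python
import json
import os
import urllib.request

# Assumptions (not stated on this page): endpoint URL, API version header
# value, model identifier string, and the request parameter names.
EMBED_URL = "https://api.pinecone.io/embed"
API_VERSION = "2025-01"
MODEL = "llama-text-embed-v2"


def build_embed_request(texts, input_type="passage", api_key="YOUR_API_KEY"):
    """Build, without sending, a POST request for the embed operation."""
    if input_type not in ("passage", "query"):
        raise ValueError("input_type must be 'passage' or 'query'")
    payload = {
        "model": MODEL,
        # "passage" for documents to be searched, "query" for search queries.
        "parameters": {"input_type": input_type, "truncate": "END"},
        "inputs": [{"text": t} for t in texts],
    }
    return urllib.request.Request(
        EMBED_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Api-Key": api_key,
            "Content-Type": "application/json",
            "X-Pinecone-API-Version": API_VERSION,
        },
        method="POST",
    )


def embed(texts, input_type="passage"):
    """Send the request and return the parsed JSON response as a dict."""
    req = build_embed_request(texts, input_type, os.environ["PINECONE_API_KEY"])
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Only attempts a network call when an API key is present in the environment.
if __name__ == "__main__" and os.environ.get("PINECONE_API_KEY"):
    result = embed(["Apples are a popular fruit."], input_type="passage")
    print(json.dumps(result, indent=2)[:500])
```

Nothing in this request names an index, which matches the distinction drawn above: the vectors come back in the response and are not stored anywhere unless you write them yourself in a separate step.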