Overview
nvidia/llama-text-embed-v2 is a state-of-the-art embedding model available natively in Pinecone Inference. Developed by NVIDIA Research and also known as llama-3_2-nv-embedqa-1b-v2, it is built on the Llama 3.2 1B architecture and optimized for high retrieval quality with low-latency inference. The model distills techniques from NVIDIA's industry-leading NV-Embed-v2 (7B parameters) into an efficient, production-ready solution.
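Because the model is served natively through Pinecone Inference, generating embeddings is a single SDK call. Below is a minimal sketch using the Pinecone Python SDK; the API key placeholder and sample passages are illustrative:

```python
from pinecone import Pinecone

# Illustrative setup; substitute your own API key.
pc = Pinecone(api_key="YOUR_API_KEY")

# Embed documents for indexing. input_type="passage" marks the inputs
# as stored text rather than search queries.
embeddings = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=[
        "Pinecone Inference hosts embedding models behind a single API.",
        "llama-text-embed-v2 is built on the Llama 3.2 1B architecture.",
    ],
    parameters={"input_type": "passage", "truncate": "END"},
)

for embedding in embeddings:
    print(len(embedding.values))  # dimensionality of each vector
```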
- Retrieval quality: The model surpasses OpenAI’s text-embedding-3-large across multiple benchmarks, in some cases improving accuracy by more than 20%
- Real-time queries: Predictable, consistent query speeds for responsive search, with p99 latencies 12x lower than those of OpenAI's text-embedding-3-large
- Multilingual: Supports 26 languages, including English, Spanish, Chinese, Hindi, Japanese, Korean, French, and German (see the query-side sketch after this list)
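As a sketch of query-side, multilingual usage, the example below embeds a Spanish-language question with input_type="query", which the Inference API uses to distinguish search queries from stored passages; the query text and variable names are illustrative:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# Search-time embedding: input_type="query" tells the model the text is
# a search query rather than a passage to be indexed.
query_embedding = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["¿Qué modelos de embedding ofrece Pinecone Inference?"],
    parameters={"input_type": "query", "truncate": "END"},
)

print(query_embedding[0].values[:5])  # first few dimensions of the vector
```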