Building Intelligent RAG Systems with Azure Cosmos DB

Many teams today want to build AI systems that answer questions accurately from their own data. This pattern is called Retrieval Augmented Generation, or RAG. Making RAG work well, however, can be difficult: common problems include unstructured or very large datasets, wrong or incomplete answers, and slow performance. Left unaddressed, these issues lead to a poor user experience.
One way to solve these problems is with Azure services such as Cosmos DB and Azure OpenAI, combined with LangChain. Cosmos DB stores both your data and numeric representations of that data called embeddings, and its vector search quickly finds the most relevant information. LangChain ties everything together and sends your query, along with the retrieved context, to GPT, which generates the answer. Used together, these tools let you build a fast, reliable AI system that gives better results.
Introduction to RAG with Azure Cosmos DB
Azure Cosmos DB is a globally distributed, fully managed NoSQL database service designed for high-performance AI applications. For Retrieval-Augmented Generation (RAG) architectures in particular, Cosmos DB is increasingly proving itself as a database for the AI era.
Cosmos DB for AI Workloads
Azure Cosmos DB is built for speed, high availability, and elastic scaling. It offers a 99.999% SLA for reads and writes across multiple regions with almost no downtime. It supports several data models, such as documents, key-value pairs, graphs, and columns, so developers can use it for many types of AI and data projects.
It also provides support for multiple APIs, such as:
- NoSQL API
- MongoDB API
- PostgreSQL API
- Cassandra API
- Gremlin (for graph databases)
This makes it a good fit for teams running both legacy systems and modern cloud-native applications.
Throughput Models:
Azure Cosmos DB offers two primary throughput models:
- Serverless: Best for prototypes and low-traffic apps. You only pay for what you use.
- Provisioned Throughput: Ideal for production workloads. This includes:
- Standard: Fixed throughput for predictable workloads.
- Autoscale: Dynamically scales up to 10x based on real-time demand, which is ideal for apps such as e-commerce platforms during seasonal sales.
Traditional databases scale up by making a single server more powerful. Cosmos DB scales out by spreading data across many servers using partition keys. These keys split the data into smaller parts, so the database can handle more users and requests. You don’t have to manage this yourself; Cosmos DB does it automatically.
Vector Search for RAG Architecture:
One of Cosmos DB’s standout features for RAG systems is its built-in support for storing both data and vector embeddings in a single database. This removes the need for complex ETL pipelines and eliminates data silos.
With vector search and flexible indexing, Cosmos DB enables you to run algorithms like:
- Cosine similarity
- Dot product
- Euclidean distance
This helps you find the most relevant documents for a user’s question, which is an important part of building a good RAG system.
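As a quick illustration, here is a minimal sketch in plain Python/NumPy (not Cosmos DB code) of how these three measures compare a query vector with a stored vector. The two-dimensional vectors are invented; real embeddings have far more dimensions (for example, 1536 for Ada 002).

```python
import numpy as np

# Hypothetical 2-D vectors standing in for real 1536-dimension embeddings.
query = np.array([0.12, 0.98])
doc = np.array([0.10, 0.95])

cosine_similarity = np.dot(query, doc) / (np.linalg.norm(query) * np.linalg.norm(doc))
dot_product = np.dot(query, doc)
euclidean_distance = np.linalg.norm(query - doc)

print(cosine_similarity, dot_product, euclidean_distance)
```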
DiskANN Indexing:
Cosmos DB uses Microsoft’s DiskANN algorithm to store and search vectors quickly and cheaply. It combines the speed of memory-based search with the low cost of storing data on disk.
- Compressed vectors are stored in RAM (for speed)
- Full vectors are stored on disk (for scale)
Real-World Use Cases:
Azure Cosmos DB is already used by big companies and AI systems:
- KPMG built Kim Chat, a private AI assistant, using Cosmos DB with the MongoDB API. It reached 90% accuracy and improved employee productivity by 50%.
- OpenAI’s ChatGPT uses Cosmos DB to store chat history, partition users by user ID, and absorb sudden traffic spikes, all with very low latency and no downtime.
RAG with Cosmos DB for NoSQL API: Architecture
When building a Retrieval-Augmented Generation (RAG) system with Azure Cosmos DB for NoSQL API, we follow a specific architecture to store and search data efficiently.
Overview of the Process
- Data Preparation
- The dataset contains JSON documents with fields such as title, description, and price.
- Each document is sent to an embedding model (via API) to generate vector embeddings.
- Both the original document and its vector embeddings are stored together in Azure Cosmos DB for NoSQL API.
- Handling a User Query
- When a user asks a question, we generate vector embeddings for that query.
- We search for the most relevant documents using a similarity algorithm.
- The matched content is sent to the GPT model along with the query.
- The GPT model generates the final answer for the user.
Vector Indexing
To make searching faster, vector embeddings need to be arranged and indexed.
- Indexing Algorithm Used: DiskANN (developed by Microsoft)
- Stores most vectors on disk (cost-efficient)
- Keeps reduced-size vectors in RAM (fast search)
- Indexing helps group similar vectors together in high-dimensional space.
Similarity Search
- Algorithm Used: Cosine Similarity
- Purpose: Compare the query embedding with stored embeddings to find the closest matches.
Vector Storage Settings in Cosmos DB
When storing vectors in Cosmos DB:
- Data type: `float32`
- Vector dimensions: depend on the embedding model (e.g., Ada 002 produces 1536 dimensions)
- You must define:
  - The path for storing embeddings
  - The distance function for searches (e.g., cosine similarity)
  - The indexing policy (DiskANN, Flat, or Quantized Flat)
Types of Vector Indexes
- Flat: Works with vectors up to 505 dimensions
- Quantized Flat: Works with vectors up to 4096 dimensions
- DiskANN: Works with vectors up to 4096 dimensions (used in this setup for best performance and cost balance)
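For reference, here is a rough sketch of what those settings look like when expressed as the policy documents the Cosmos DB NoSQL API expects. The `/vector` path and 1536 dimensions are assumptions matching the Ada 002 example above, and the exact schema can vary slightly by SDK and API version.

```python
# Assumed path /vector and 1536 dimensions (text-embedding-ada-002); adjust to your data.
vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": "/vector",
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": 1536,
        }
    ]
}

indexing_policy = {
    "includedPaths": [{"path": "/*"}],
    # Exclude the raw vector from the regular index; it is covered by the vector index below.
    "excludedPaths": [{"path": "/vector/*"}],
    "vectorIndexes": [{"path": "/vector", "type": "diskANN"}],
}
```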
Step-by-Step Guide:
Here is a step-by-step guide to building a Retrieval-Augmented Generation (RAG) system using Azure Cosmos DB (NoSQL API) for vector storage and Azure OpenAI for embeddings and text generation.
1. Prepare the GitHub Repository
- Download the repository from GitHub.
- After extraction, open it in VS Code or another editor of your choice.
- Open the NoSQL_rag folder.
- Note the contents:
  - `.env` (environment variable file)
  - Dataset JSON file (contains food item data)
  - Python notebook file (main execution flow)

2. Deploy Azure Cosmos DB for NoSQL API
- Sign in to portal.azure.com.
- Search for Azure Cosmos DB and click Create.
- Select Azure Cosmos DB for NoSQL API.
- Create a new resource group, set the account name, choose a location, and disable availability zones (development only).
- Set Capacity Mode to Serverless (development/testing).
- Click Review + Create, then Create.
- Wait a few minutes for the deployment to complete.

3. Deploy Azure OpenAI Resources
- In the Azure Portal, locate your existing or create your Azure OpenAI resource (e.g., hybridOpenAI in Sweden Central).
- Open Azure OpenAI Studio.
- Deploy a GPT model (GPT-4 or GPT-3.5):
- Click Deploy Model → Select base model → Confirm.
- Set deployment type: Standard.
- Rate limit: 10K tokens/minute.
- Deploy a Text Embedding model (text-embedding-ada-002):
- Deployment type: Standard.
- Capacity: 120K tokens/minute.

4. Collect Keys and Endpoints
- Go to the Azure OpenAI resource → Keys and Endpoint.
- Copy the Primary Key and paste it into `.env`.
- Copy the Endpoint URL and paste it into `.env`.
- Add the GPT engine name and embedding engine name to `.env`.

5. Get Cosmos DB Connection
- Open your Cosmos DB for NoSQL API resource.
- Go to Keys (under Settings).
- Copy the Primary Connection String.
- Paste it into the `.env` file (enclose it in double quotes if it contains special characters).
- Set your database name (e.g., `DB`) and container name (e.g., `container`) in `.env`.
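As a rough sketch, the notebook can read these values with python-dotenv. The variable names below are placeholders (the original does not specify the `.env` keys), so match them to whatever keys your file actually uses.

```python
import os
from dotenv import load_dotenv  # requires: pip install python-dotenv

load_dotenv()  # reads the .env file in the notebook's folder

# Placeholder variable names; align them with the keys you put in .env.
openai_key = os.getenv("AZURE_OPENAI_API_KEY")
openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
gpt_deployment = os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT")
embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")

cosmos_connection_string = os.getenv("COSMOS_CONNECTION_STRING")
cosmos_database = os.getenv("COSMOS_DATABASE_NAME", "DB")
cosmos_container = os.getenv("COSMOS_CONTAINER_NAME", "container")
```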

6. Enable Vector Search in Cosmos DB
- In the Cosmos DB resource, go to Features (under Settings).
- Enable Vector Search for NoSQL API (Preview).
- Wait up to 15 minutes for the feature to be fully applied.

7. Understand the Dataset
- Open the dataset JSON file.
- Review the structure: `category`, `name`, `description`, `price`.

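A record in that shape might look like the following. The values are invented for illustration, and the `id` field is an assumption added because Cosmos DB requires one per item.

```python
# Hypothetical food item; only the field names come from the dataset description above.
sample_item = {
    "id": "1",
    "category": "pizza",
    "name": "Margherita Pizza",
    "description": "Classic pizza with tomato sauce, mozzarella, and basil.",
    "price": 9.99,
}
```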
8. Data Processing and Storage Flow
- Connect to Cosmos DB using the connection string.
- Create the database (if it does not exist) with the specified name.
- Define the vector embedding policy:
  - Path: `/vector`
  - Data type: `float32`
  - Distance function: `cosine`
  - Dimensions: `1536`
- Set the partition key: `/category`.
- Define the vector indexing policy:
  - Path: `/vector`
  - Algorithm: DiskANN
- Create the container in Cosmos DB with the above configurations (see the sketch below).
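A minimal sketch of this flow with the azure-cosmos Python SDK is shown below. It assumes a recent SDK version that accepts a vector embedding policy (still preview-gated in some releases) and reuses the connection variables from the `.env` sketch earlier.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Connection values come from the .env sketch shown earlier.
client = CosmosClient.from_connection_string(cosmos_connection_string)
database = client.create_database_if_not_exists(id=cosmos_database)

vector_embedding_policy = {
    "vectorEmbeddings": [
        {"path": "/vector", "dataType": "float32", "distanceFunction": "cosine", "dimensions": 1536}
    ]
}
indexing_policy = {
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/vector/*"}],  # keep the raw vector out of the regular index
    "vectorIndexes": [{"path": "/vector", "type": "diskANN"}],
}

container = database.create_container_if_not_exists(
    id=cosmos_container,
    partition_key=PartitionKey(path="/category"),
    indexing_policy=indexing_policy,
    vector_embedding_policy=vector_embedding_policy,
)
```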
9. Generate and Store Embeddings
- Connect to the Azure OpenAI embedding engine.
- For each dataset record:
  - Pass `description`, `title`, and `price` to the embedding engine.
  - Receive a 1536-dimension vector.
  - Store the record plus its vector in the Cosmos DB container.
- Confirm in Data Explorer that the container now contains all records with vectors.
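Here is a sketch of that loop using the openai v1 SDK against Azure OpenAI. The dataset file name and field names are assumptions based on the steps above, the API version is a placeholder, and the client continues from the earlier `.env` and container sketches.

```python
import json
from openai import AzureOpenAI  # openai>=1.x

aoai = AzureOpenAI(
    api_key=openai_key,
    azure_endpoint=openai_endpoint,
    api_version="2024-02-01",  # assumed; use the API version shown for your deployment
)

with open("food_items.json") as f:  # assumed file name for the dataset JSON
    items = json.load(f)

for item in items:
    # Each record is assumed to already contain an "id" field, as Cosmos DB requires one.
    text = f"{item['title']} {item['description']} {item['price']}"
    response = aoai.embeddings.create(input=text, model=embedding_deployment)
    item["vector"] = response.data[0].embedding  # 1536 floats for text-embedding-ada-002
    container.upsert_item(item)
```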
10. Run User Queries
- Accept a user query (e.g., “Are pizzas available? I am lactose intolerant”).
- Generate embeddings for the query.
- Perform a vector similarity search in Cosmos DB:
  - Use the `VectorDistance` function with cosine similarity.
  - Retrieve the top 5 most relevant items (see the sketch below).
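A sketch of the similarity query using the VectorDistance system function in a parameterized SQL query. The `/vector`, `name`, `description`, and `price` paths follow the configuration above; it continues from the previous sketches.

```python
query_text = "Are pizzas available? I am lactose intolerant"
query_vector = aoai.embeddings.create(
    input=query_text, model=embedding_deployment
).data[0].embedding

results = list(
    container.query_items(
        query="""
        SELECT TOP 5 c.name, c.description, c.price,
               VectorDistance(c.vector, @embedding) AS score
        FROM c
        ORDER BY VectorDistance(c.vector, @embedding)
        """,
        parameters=[{"name": "@embedding", "value": query_vector}],
        enable_cross_partition_query=True,
    )
)
```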

11. Pass Results to GPT Engine
- Construct a system prompt describing:
  - The chatbot’s purpose.
  - The structure of the provided context (a Python list of items).
  - Rules: answer only from the context, no extra information or links.
- Pass the user query plus the retrieved items as context in a user message.
- Set the temperature (e.g., `0.7`).
- Receive a GPT-generated answer that references only the provided context.
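And a sketch of the final chat call. The system prompt wording is illustrative, and `results`, `query_text`, and the `aoai` client come from the sketches above.

```python
system_prompt = (
    "You are a food menu assistant. Answer the user's question using only the items "
    "in the provided context (a Python list of menu items). Do not add extra "
    "information or links."
)

completion = aoai.chat.completions.create(
    model=gpt_deployment,
    temperature=0.7,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query_text}\nContext: {results}"},
    ],
)
print(completion.choices[0].message.content)
```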

Architecture for RAG with Azure Cosmos DB MongoDB vCore
When building Retrieval Augmented Generation (RAG) systems, Azure Cosmos DB for MongoDB API (vCore) plays a critical role: it stores both your raw data and the vector embeddings that make semantic search possible. This approach keeps your data and embeddings in a single place, avoiding the complexity of external ETL pipelines and preventing data silos.
Key Components
1. Azure Cosmos DB MongoDB Resource
The MongoDB vCore resource stores the application’s dataset, which in our example is a JSON collection of food items.
Each record contains:
- Basic attributes such as name, description, category, and price.
- A vector embedding generated for its text content.
This dual storage (data + embeddings) means that the same database can power both traditional lookups and advanced similarity searches.
2. LangChain as the Orchestration Engine
LangChain serves as the “glue” between different parts of the system. It:
- Loads and preprocesses documents.
- Connects directly to Azure Cosmos DB MongoDB for storing and retrieving vectors.
- Integrates with Azure OpenAI to:
- Generate embeddings for both documents and user queries.
- Perform cosine similarity searches.
- Abstracts away the lower-level algorithms like vector distance calculation, so you can focus on building the application’s logic.
3. Azure OpenAI Resource
Two Azure OpenAI models are central to this workflow:
- Text Embedding Model (`text-embedding-ada-002`): Converts text into 1,536-dimensional vectors for semantic matching.
- Chat Model (GPT-3.5 Turbo or GPT-4): Processes user queries and generates responses using the retrieved context.
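With LangChain, both models are typically wrapped by the langchain-openai integration. A minimal sketch, assuming that package is installed and using placeholder environment variable names for the deployments and credentials:

```python
import os
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI

embeddings = AzureOpenAIEmbeddings(
    azure_deployment=os.getenv("EMBEDDING_DEPLOYMENT_NAME"),  # placeholder key names
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

llm = AzureChatOpenAI(
    azure_deployment=os.getenv("GPT_MODEL_NAME"),  # e.g. a gpt-4 deployment
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    temperature=0.7,
)
```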
RAG Workflow with MongoDB vCore
The process follows a consistent sequence:
- Document Chunking: Large documents are split into smaller chunks to improve embedding accuracy.
- Vector Embedding Generation: Azure OpenAI converts each chunk into a high-dimensional vector.
- Data + Vector Storage: Both the original text and its vector are stored in Azure Cosmos DB MongoDB.
- User Query Embedding: A query is transformed into a vector using the same embedding model.
- Vector Search in Cosmos DB: LangChain runs a cosine similarity search to find the most relevant chunks.
- Contextual Prompting: The retrieved chunks, along with the query, are sent to a GPT model.
- Final Response: The model returns a natural-language answer enriched with context from the database.
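As a small illustration of the chunking step, here is a sketch using LangChain’s recursive character splitter. The chunk sizes are arbitrary and the document text is invented.

```python
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Invented document standing in for one dataset record's text content.
docs = [
    Document(
        page_content="Margherita Pizza. Classic pizza with tomato sauce, mozzarella, and basil. Price: 9.99 USD.",
        metadata={"category": "pizza"},
    )
]

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)  # each chunk is embedded and stored separately
```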
Comparison with the NoSQL API Demo
In the earlier NoSQL API version, most of the connection handling, embedding generation, and storage logic was written manually.
With MongoDB vCore, LangChain takes on these responsibilities, providing built-in connectors and simplifying the implementation.
Why Use LangChain?
Using an orchestration engine like LangChain allows developers to:
- Work with multiple vector stores and models without rewriting core logic.
- Easily switch to alternatives like Semantic Kernel SDK in the future.
- Focus on the application flow instead of low-level database and algorithm details.
Building a RAG System with Azure Cosmos DB for MongoDB vCore
Here, we are setting up and running a Retrieval Augmented Generation (RAG) workflow using Azure Cosmos DB MongoDB vCore as the vector store, LangChain as the orchestration layer, and Azure OpenAI for embeddings and responses.

Download the repository
Download the repository from GitHub.
- Navigate to the MongoDB Core folder; it contains:
  - `.env`: environment variable configuration
  - `food_items.json`: dataset
  - `rag_app.ipynb`: main Jupyter Notebook for the setup

Deploy Azure Cosmos DB MongoDB vCore
- Search for Azure Cosmos DB in the portal.
- Select Create → MongoDB API → vCore Cluster (supports vector storage).
- Configure:
- Resource Group
- Cluster Name
- Region
- MongoDB Version (8.0)
- Admin Credentials
- Networking: Allow public access or configure firewall rules, and add your current client IP.
- Review + Create → Create.

Configure .env Variables
Populate the following environment variables:
Azure Cosmos DB MongoDB vCore
- Connection String (from Settings → Connection Strings in the Azure Portal)
- Database Name (e.g., `food_db`)
- Collection Name (e.g., `menu_items`)
- Username & Password (set during cluster creation)
Azure OpenAI
- API Key (from Keys and Endpoint)
- Endpoint URL
- API Version (from View Code in Azure AI Studio)
- GPT Model Name (e.g., `gpt-4`)
- Embedding Model Name (`text-embedding-ada-002`)
- Embedding Deployment Name (matching your Azure deployment)
- Index Name (custom name for the vector index)

Install Dependencies
In the notebook, install:
pip install langchain openai pymongo
Generate Vector Embeddings
- Initialize the OpenAI embedding client with your model name and deployment.
- Create embeddings for all dataset entries.
- Store them in Azure Cosmos DB as a vector index with cosine similarity for distance calculation.
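A sketch of this step with LangChain’s Cosmos DB MongoDB vCore vector store from langchain_community. The pymongo connection uses a placeholder environment variable name, the database/collection names follow the `.env` examples above, `chunks` and `embeddings` come from the earlier sketches, and the IVF index parameters are illustrative.

```python
import os
from pymongo import MongoClient
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
    CosmosDBVectorSearchType,
)

# Placeholder environment variable name for the vCore cluster connection string.
mongo_client = MongoClient(os.getenv("MONGO_CONNECTION_STRING"))
collection = mongo_client["food_db"]["menu_items"]

# `chunks` and `embeddings` come from the earlier sketches.
vector_store = AzureCosmosDBVectorSearch.from_documents(
    chunks,
    embeddings,
    collection=collection,
    index_name="vector_index",  # should match the index name set in .env
)

# Create the vector index with cosine similarity; the IVF parameters are illustrative.
vector_store.create_index(
    num_lists=1,
    dimensions=1536,
    similarity=CosmosDBSimilarityType.COS,
    kind=CosmosDBVectorSearchType.VECTOR_IVF,
)
```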
Query the Vector Store
Use LangChain’s similarity_search() to:
- Embed the user’s query
- Compare with stored embeddings in Cosmos DB
- Return the most relevant matches
Example:
results = vector_store.similarity_search("beef bacon")
Inspect in MongoDB Shell
From the Azure Portal:
- Open Mongo Shell
- Connect to your database
- Run `show collections` and `db.<collection>.findOne()` to view documents
You’ll see the original text fields plus a `vector_content` array of 1536 values.
Send Results to GPT for Final Answer
Pass the retrieved context to your GPT deployment via the Chat Completions API with a system prompt like: You are a cooking assistant. Answer the user query using only the provided context.
Example:
query: "Is iced cappuccino available and at what price?"
Response: "Yes, iced cappuccino is available for 4.99 USD."
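A sketch of that last step using the LangChain chat model initialized earlier; the message construction is illustrative, and `vector_store` and `llm` come from the previous sketches.

```python
from langchain_core.messages import SystemMessage, HumanMessage

query = "Is iced cappuccino available and at what price?"
docs = vector_store.similarity_search(query, k=5)
context = "\n".join(doc.page_content for doc in docs)

answer = llm.invoke([
    SystemMessage(content="You are a cooking assistant. Answer the user query using only the provided context."),
    HumanMessage(content=f"Context:\n{context}\n\nQuestion: {query}"),
])
print(answer.content)
```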
Conclusion
By combining Azure Cosmos DB, Azure OpenAI, and LangChain, developers can build highly efficient Retrieval-Augmented Generation (RAG) systems that deliver accurate, context-rich answers from private datasets. Cosmos DB provides fast, scalable storage for both raw data and vector embeddings, removing the need for separate vector databases. Azure OpenAI enables powerful semantic search and natural-language responses, while LangChain streamlines orchestration and integration. Whether using the NoSQL API or MongoDB vCore, this architecture ensures speed, reliability, and scalability, making it ideal for enterprise-grade AI applications that require precise and relevant answers with minimal latency.