Building Intelligent RAG Systems with Azure Cosmos DB

Many teams today want to build AI systems that answer questions accurately from their own data. This pattern is called Retrieval Augmented Generation, or RAG. Making RAG work well, however, can be difficult: common problems include unstructured or very large datasets, wrong or incomplete answers, and slow performance. Left unaddressed, these issues lead to a poor user experience.
One way to solve these problems is with Azure services such as Cosmos DB and Azure OpenAI, combined with LangChain. Cosmos DB stores both your data and numeric representations of that data called embeddings, and its vector search quickly finds the most relevant information. LangChain ties everything together and sends your query, along with the retrieved context, to GPT, which generates the answer. Used together, these tools let you build a fast, reliable AI system that gives better results.
Introduction to RAG with Azure Cosmos DB
Azure Cosmos DB is a globally distributed, fully managed NoSQL database service designed for high-performance AI applications. For Retrieval-Augmented Generation (RAG) architectures in particular, Cosmos DB is increasingly proving itself as a database for the AI era.
Cosmos DB for AI Workloads
Azure Cosmos DB is built for speed, high availability, and elastic scaling. It offers a 99.999% SLA for reads and writes across multiple regions with almost no downtime. It supports several data models, such as documents, key-value pairs, graphs, and columns, so developers can use it for many types of AI and data projects.
It also provides support for multiple APIs, such as:
- NoSQL API
- MongoDB API
- PostgreSQL API
- Cassandra API
- Gremlin (for graph databases)
This makes it a good fit for teams running both legacy systems and modern cloud-native applications.
Throughput Models:
Azure Cosmos DB offers two primary throughput models:
- Serverless: Best for prototypes and low-traffic apps. You only pay for what you use.
- Provisioned Throughput: Ideal for production workloads. This includes:
- Standard: Fixed throughput for predictable workloads.
- Autoscale: Dynamically scales up to 10x based on real-time demand, which is ideal for apps such as e-commerce platforms during seasonal sales.
Traditional databases scale up by making a single server more powerful. Cosmos DB scales out by spreading data across many servers using partition keys. These keys split the data into smaller parts, so the database can handle more users and requests. You don’t have to manage this yourself; Cosmos DB does it automatically.
Vector Search for RAG Architecture:
One of Cosmos DB’s standout features for RAG systems is its built-in support for storing both data and vector embeddings in a single database. This removes the need for complex ETL pipelines and eliminates data silos.
With vector search and flexible indexing, Cosmos DB enables you to run algorithms like:
- Cosine similarity
- Dot product
- Euclidean distance
This helps you find the most relevant documents for a user’s question, which is an important part of building a good RAG system.
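As a quick illustration, here is a minimal sketch in plain Python/NumPy (not Cosmos DB code) of how these three measures compare a query vector with a stored vector. The two-dimensional vectors are invented; real embeddings have far more dimensions (for example, 1536 for Ada 002).

```python
import numpy as np

# Hypothetical 2-D vectors standing in for real 1536-dimension embeddings.
query = np.array([0.12, 0.98])
doc = np.array([0.10, 0.95])

cosine_similarity = np.dot(query, doc) / (np.linalg.norm(query) * np.linalg.norm(doc))
dot_product = np.dot(query, doc)
euclidean_distance = np.linalg.norm(query - doc)

print(cosine_similarity, dot_product, euclidean_distance)
```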
DiskANN Indexing:
Cosmos DB uses Microsoft’s DiskANN algorithm to store and search vectors quickly and cheaply. It combines the speed of memory-based search with the low cost of storing data on disk.
- Compressed vectors are stored in RAM (for speed)
- Full vectors are stored on disk (for scale)
Real-World Use Cases:
Azure Cosmos DB is already used by big companies and AI systems:
- KPMG built Kim Chat, a private AI assistant, using Cosmos DB with the MongoDB API. It reached 90% accuracy and improved employee productivity by 50%.
- OpenAI’s ChatGPT uses Cosmos DB to store chat history, partition users by user ID, and absorb sudden traffic spikes, all with very low latency and no downtime.
RAG with Cosmos DB for NoSQL API: Architecture
When building a Retrieval-Augmented Generation (RAG) system with Azure Cosmos DB for NoSQL API, we follow a specific architecture to store and search data efficiently.
Overview of the Process
- Data Preparation
- The dataset contains JSON documents with fields such as title, description, and price.
- Each document is sent to an embedding model (via API) to generate vector embeddings.
- Both the original document and its vector embeddings are stored together in Azure Cosmos DB for NoSQL API.
- Handling a User Query
- When a user asks a question, we generate vector embeddings for that query.
- We search for the most relevant documents using a similarity algorithm.
- The matched content is sent to the GPT model along with the query.
- The GPT model generates the final answer for the user.
Vector Indexing
To make searching faster, vector embeddings need to be arranged and indexed.
- Indexing Algorithm Used: DiskANN (developed by Microsoft)
- Stores most vectors on disk (cost-efficient)
- Keeps reduced-size vectors in RAM (fast search)
- Indexing helps group similar vectors together in high-dimensional space.
Similarity Search
- Algorithm Used: Cosine Similarity
- Purpose: Compare the query embedding with stored embeddings to find the closest matches.
Vector Storage Settings in Cosmos DB
When storing vectors in Cosmos DB:
- Data type: `float32`
- Vector dimensions: depend on the embedding model (e.g., Ada 002 produces 1536 dimensions)
- You must define:
  - The path for storing embeddings
  - The distance function for searches (e.g., cosine similarity)
  - The indexing policy (DiskANN, Flat, or Quantized Flat)
Types of Vector Indexes
- Flat: Works with vectors up to 505 dimensions
- Quantized Flat: Works with vectors up to 4096 dimensions
- DiskANN: Works with vectors up to 4096 dimensions (used in this setup for best performance and cost balance)
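For reference, here is a rough sketch of what those settings look like when expressed as the policy documents the Cosmos DB NoSQL API expects. The `/vector` path and 1536 dimensions are assumptions matching the Ada 002 example above, and the exact schema can vary slightly by SDK and API version.

```python
# Assumed path /vector and 1536 dimensions (text-embedding-ada-002); adjust to your data.
vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": "/vector",
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": 1536,
        }
    ]
}

indexing_policy = {
    "includedPaths": [{"path": "/*"}],
    # Exclude the raw vector from the regular index; it is covered by the vector index below.
    "excludedPaths": [{"path": "/vector/*"}],
    "vectorIndexes": [{"path": "/vector", "type": "diskANN"}],
}
```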
Step-by-Step Guide:
Here is a step-by-step guide to building a Retrieval-Augmented Generation (RAG) system using Azure Cosmos DB (NoSQL API) for vector storage and Azure OpenAI for embeddings and text generation.
1. Prepare the GitHub Repository
- Download the repository from GitHub.
- After extraction, open it in VS Code or another editor of your choice.
- Open the NoSQL_rag folder.
- Note the contents:
  - `.env` (environment variable file)
  - Dataset JSON file (contains food item data)
  - Python notebook file (main execution flow)

2. Deploy Azure Cosmos DB for NoSQL API
- Sign in to portal.azure.com.
- Search for Azure Cosmos DB and click Create.
- Select Azure Cosmos DB for NoSQL API.
- Create a new resource group, set the account name, choose a location, and disable availability zones (development only).
- Set Capacity Mode to Serverless (development/testing).
- Click Review + Create, then Create.
- Wait a few minutes for the deployment to complete.

3. Deploy Azure OpenAI Resources
- In the Azure Portal, locate your existing or create your Azure OpenAI resource (e.g., hybridOpenAI in Sweden Central).
- Open Azure OpenAI Studio.
- Deploy a GPT model (GPT-4 or GPT-3.5):
- Click Deploy Model → Select base model → Confirm.
- Set deployment type: Standard.
- Rate limit: 10K tokens/minute.
- Deploy a Text Embedding model (text-embedding-ada-002):
- Deployment type: Standard.
- Capacity: 120K tokens/minute.

4. Collect Keys and Endpoints
- Go to the Azure OpenAI resource → Keys and Endpoint.
- Copy the Primary Key and paste it into `.env`.
- Copy the Endpoint URL and paste it into `.env`.
- Add the GPT engine name and embedding engine name to `.env`.

5. Get Cosmos DB Connection
- Open your Cosmos DB for NoSQL API resource.
- Go to Keys (under Settings).
- Copy the Primary Connection String.
- Paste it into the `.env` file (enclose it in double quotes if it contains special characters).
- Set your database name (e.g., `DB`) and container name (e.g., `container`) in `.env`.
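As a rough sketch, the notebook can read these values with python-dotenv. The variable names below are placeholders (the original does not specify the `.env` keys), so match them to whatever keys your file actually uses.

```python
import os
from dotenv import load_dotenv  # requires: pip install python-dotenv

load_dotenv()  # reads the .env file in the notebook's folder

# Placeholder variable names; align them with the keys you put in .env.
openai_key = os.getenv("AZURE_OPENAI_API_KEY")
openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
gpt_deployment = os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT")
embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")

cosmos_connection_string = os.getenv("COSMOS_CONNECTION_STRING")
cosmos_database = os.getenv("COSMOS_DATABASE_NAME", "DB")
cosmos_container = os.getenv("COSMOS_CONTAINER_NAME", "container")
```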

6. Enable Vector Search in Cosmos DB
- In the Cosmos DB resource, go to Features (under Settings).
- Enable Vector Search for NoSQL API (Preview).
- Wait up to 15 minutes for the feature to be fully applied.

7. Understand the Dataset
- Open the dataset JSON file.
- Review the structure: `category`, `name`, `description`, `price`.

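A record in that shape might look like the following. The values are invented for illustration, and the `id` field is an assumption added because Cosmos DB requires one per item.

```python
# Hypothetical food item; only the field names come from the dataset description above.
sample_item = {
    "id": "1",
    "category": "pizza",
    "name": "Margherita Pizza",
    "description": "Classic pizza with tomato sauce, mozzarella, and basil.",
    "price": 9.99,
}
```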
8. Data Processing and Storage Flow
- Connect to Cosmos DB using the connection string.
- Create the database (if it does not exist) with the specified name.
- Define the vector embedding policy:
  - Path: `/vector`
  - Data type: `float32`
  - Distance function: `cosine`
  - Dimensions: `1536`
- Set the partition key: `/category`.
- Define the vector indexing policy:
  - Path: `/vector`
  - Algorithm: DiskANN
- Create the container in Cosmos DB with the above configurations (see the sketch below).
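A minimal sketch of this flow with the azure-cosmos Python SDK is shown below. It assumes a recent SDK version that accepts a vector embedding policy (still preview-gated in some releases) and reuses the connection variables from the `.env` sketch earlier.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Connection values come from the .env sketch shown earlier.
client = CosmosClient.from_connection_string(cosmos_connection_string)
database = client.create_database_if_not_exists(id=cosmos_database)

vector_embedding_policy = {
    "vectorEmbeddings": [
        {"path": "/vector", "dataType": "float32", "distanceFunction": "cosine", "dimensions": 1536}
    ]
}
indexing_policy = {
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/vector/*"}],  # keep the raw vector out of the regular index
    "vectorIndexes": [{"path": "/vector", "type": "diskANN"}],
}

container = database.create_container_if_not_exists(
    id=cosmos_container,
    partition_key=PartitionKey(path="/category"),
    indexing_policy=indexing_policy,
    vector_embedding_policy=vector_embedding_policy,
)
```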
9. Generate and Store Embeddings
- Connect to the Azure OpenAI embedding engine.
- For each dataset record:
  - Pass `description`, `title`, and `price` to the embedding engine.
  - Receive a 1536-dimension vector.
  - Store the record plus its vector in the Cosmos DB container.
- Confirm in Data Explorer that the container now contains all records with vectors.
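Here is a sketch of that loop using the openai v1 SDK against Azure OpenAI. The dataset file name and field names are assumptions based on the steps above, the API version is a placeholder, and the client continues from the earlier `.env` and container sketches.

```python
import json
from openai import AzureOpenAI  # openai>=1.x

aoai = AzureOpenAI(
    api_key=openai_key,
    azure_endpoint=openai_endpoint,
    api_version="2024-02-01",  # assumed; use the API version shown for your deployment
)

with open("food_items.json") as f:  # assumed file name for the dataset JSON
    items = json.load(f)

for item in items:
    # Each record is assumed to already contain an "id" field, as Cosmos DB requires one.
    text = f"{item['title']} {item['description']} {item['price']}"
    response = aoai.embeddings.create(input=text, model=embedding_deployment)
    item["vector"] = response.data[0].embedding  # 1536 floats for text-embedding-ada-002
    container.upsert_item(item)
```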
10. Run User Queries
- Accept a user query (e.g., “Are pizzas available? I am lactose intolerant”).
- Generate embeddings for the query.
- Perform a vector similarity search in Cosmos DB:
  - Use the `VectorDistance` function with cosine similarity.
  - Retrieve the top 5 most relevant items (see the sketch below).
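A sketch of the similarity query using the VectorDistance system function in a parameterized SQL query. The `/vector`, `name`, `description`, and `price` paths follow the configuration above; it continues from the previous sketches.

```python
query_text = "Are pizzas available? I am lactose intolerant"
query_vector = aoai.embeddings.create(
    input=query_text, model=embedding_deployment
).data[0].embedding

results = list(
    container.query_items(
        query="""
        SELECT TOP 5 c.name, c.description, c.price,
               VectorDistance(c.vector, @embedding) AS score
        FROM c
        ORDER BY VectorDistance(c.vector, @embedding)
        """,
        parameters=[{"name": "@embedding", "value": query_vector}],
        enable_cross_partition_query=True,
    )
)
```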

11. Pass Results to GPT Engine
- Construct a system prompt describing:
  - The chatbot’s purpose.
  - The structure of the provided context (a Python list of items).
  - Rules: answer only from the context, no extra information or links.
- Pass the user query plus the retrieved items as context in a user message.
- Set the temperature (e.g., `0.7`).
- Receive a GPT-generated answer that references only the provided context.
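And a sketch of the final chat call. The system prompt wording is illustrative, and `results`, `query_text`, and the `aoai` client come from the sketches above.

```python
system_prompt = (
    "You are a food menu assistant. Answer the user's question using only the items "
    "in the provided context (a Python list of menu items). Do not add extra "
    "information or links."
)

completion = aoai.chat.completions.create(
    model=gpt_deployment,
    temperature=0.7,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query_text}\nContext: {results}"},
    ],
)
print(completion.choices[0].message.content)
```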

Architecture for RAG with Azure Cosmos DB MongoDB vCore
When building Retrieval Augmented Generation (RAG) systems, Azure Cosmos DB for MongoDB API (vCore) plays a critical role: it stores both your raw data and the vector embeddings that make semantic search possible. This approach keeps your data and embeddings in a single place, avoiding the complexity of external ETL pipelines and preventing data silos.
Key Components
1. Azure Cosmos DB MongoDB Resource
The MongoDB vCore resource stores the application’s dataset, which in our example is a JSON collection of food items.
Each record contains:
- Basic attributes such as name, description, category, and price.
- A vector embedding generated for its text content.
This dual storage (data + embeddings) means that the same database can power both traditional lookups and advanced similarity searches.
2. LangChain as the Orchestration Engine
LangChain serves as the “glue” between different parts of the system. It:
- Loads and preprocesses documents.
- Connects directly to Azure Cosmos DB MongoDB for storing and retrieving vectors.
- Integrates with Azure OpenAI to:
- Generate embeddings for both documents and user queries.
- Perform cosine similarity searches.
- Abstracts away the lower-level algorithms like vector distance calculation, so you can focus on building the application’s logic.
3. Azure OpenAI Resource
Two Azure OpenAI models are central to this workflow:
- Text Embedding Model (`text-embedding-ada-002`): Converts text into 1,536-dimensional vectors for semantic matching.
- Chat Model (GPT-3.5 Turbo or GPT-4): Processes user queries and generates responses using the retrieved context.
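With LangChain, both models are typically wrapped by the langchain-openai integration. A minimal sketch, assuming that package is installed and using placeholder environment variable names for the deployments and credentials:

```python
import os
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI

embeddings = AzureOpenAIEmbeddings(
    azure_deployment=os.getenv("EMBEDDING_DEPLOYMENT_NAME"),  # placeholder key names
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

llm = AzureChatOpenAI(
    azure_deployment=os.getenv("GPT_MODEL_NAME"),  # e.g. a gpt-4 deployment
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    temperature=0.7,
)
```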
RAG Workflow with MongoDB vCore
The process follows a consistent sequence:
- Document Chunking: Large documents are split into smaller chunks to improve embedding accuracy.
- Vector Embedding Generation: Azure OpenAI converts each chunk into a high-dimensional vector.
- Data + Vector Storage: Both the original text and its vector are stored in Azure Cosmos DB MongoDB.
- User Query Embedding: A query is transformed into a vector using the same embedding model.
- Vector Search in Cosmos DB: LangChain runs a cosine similarity search to find the most relevant chunks.
- Contextual Prompting: The retrieved chunks, along with the query, are sent to a GPT model.
- Final Response: The model returns a natural-language answer enriched with context from the database.
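As a small illustration of the chunking step, here is a sketch using LangChain’s recursive character splitter. The chunk sizes are arbitrary and the document text is invented.

```python
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Invented document standing in for one dataset record's text content.
docs = [
    Document(
        page_content="Margherita Pizza. Classic pizza with tomato sauce, mozzarella, and basil. Price: 9.99 USD.",
        metadata={"category": "pizza"},
    )
]

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)  # each chunk is embedded and stored separately
```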
Comparison with the NoSQL API Demo
In the earlier NoSQL API version, most of the connection handling, embedding generation, and storage logic was written manually.
With MongoDB vCore, LangChain takes on these responsibilities, providing built-in connectors and simplifying the implementation.
Why Use LangChain?
Using an orchestration engine like LangChain allows developers to:
- Work with multiple vector stores and models without rewriting core logic.
- Easily switch to alternatives like Semantic Kernel SDK in the future.
- Focus on the application flow instead of low-level database and algorithm details.
Building a RAG System with Azure Cosmos DB for MongoDB vCore
Here, we are setting up and running a Retrieval Augmented Generation (RAG) workflow using Azure Cosmos DB MongoDB vCore as the vector store, LangChain as the orchestration layer, and Azure OpenAI for embeddings and responses.

Download the repository
Download the repository from GitHub.
- Navigate to the MongoDB Core folder; it contains:
  - `.env`: environment variable configuration
  - `food_items.json`: dataset
  - `rag_app.ipynb`: main Jupyter Notebook for the setup

Deploy Azure Cosmos DB MongoDB vCore
- Search for Azure Cosmos DB in the portal.
- Select Create → MongoDB API → vCore Cluster (supports vector storage).
- Configure:
- Resource Group
- Cluster Name
- Region
- MongoDB Version (8.0)
- Admin Credentials
- Networking: Allow public access or configure firewall rules, and add your current client IP.
- Review + Create → Create.

Configure .env Variables
Populate the following environment variables:
Azure Cosmos DB MongoDB vCore
- Connection String (from Settings → Connection Strings in the Azure Portal)
- Database Name (e.g., `food_db`)
- Collection Name (e.g., `menu_items`)
- Username & Password (set during cluster creation)
Azure OpenAI
- API Key (from Keys and Endpoint)
- Endpoint URL
- API Version (from View Code in Azure AI Studio)
- GPT Model Name (e.g., `gpt-4`)
- Embedding Model Name (`text-embedding-ada-002`)
- Embedding Deployment Name (matching your Azure deployment)
- Index Name (custom name for the vector index)

Install Dependencies
In the notebook, install:
pip install langchain openai pymongo
Generate Vector Embeddings
- Initialize the OpenAI embedding client with your model name and deployment.
- Create embeddings for all dataset entries.
- Store them in Azure Cosmos DB as a vector index with cosine similarity for distance calculation.
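A sketch of this step with LangChain’s Cosmos DB MongoDB vCore vector store from langchain_community. The pymongo connection uses a placeholder environment variable name, the database/collection names follow the `.env` examples above, `chunks` and `embeddings` come from the earlier sketches, and the IVF index parameters are illustrative.

```python
import os
from pymongo import MongoClient
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
    CosmosDBVectorSearchType,
)

# Placeholder environment variable name for the vCore cluster connection string.
mongo_client = MongoClient(os.getenv("MONGO_CONNECTION_STRING"))
collection = mongo_client["food_db"]["menu_items"]

# `chunks` and `embeddings` come from the earlier sketches.
vector_store = AzureCosmosDBVectorSearch.from_documents(
    chunks,
    embeddings,
    collection=collection,
    index_name="vector_index",  # should match the index name set in .env
)

# Create the vector index with cosine similarity; the IVF parameters are illustrative.
vector_store.create_index(
    num_lists=1,
    dimensions=1536,
    similarity=CosmosDBSimilarityType.COS,
    kind=CosmosDBVectorSearchType.VECTOR_IVF,
)
```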
Query the Vector Store
Use LangChain’s similarity_search() to:
- Embed the user’s query
- Compare with stored embeddings in Cosmos DB
- Return the most relevant matches
Example:
results = vector_store.similarity_search("beef bacon")
Inspect in MongoDB Shell
From the Azure Portal:
- Open Mongo Shell
- Connect to your database
- Run `show collections` and `db.<collection>.findOne()` to view documents
You’ll see the original text fields plus a `vector_content` array of 1536 values.
Send Results to GPT for Final Answer
Pass the retrieved context to your GPT deployment via the Chat Completions API with a system prompt like: You are a cooking assistant. Answer the user query using only the provided context.
Example:
query: "Is iced cappuccino available and at what price?"
Response: "Yes, iced cappuccino is available for 4.99 USD."
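A sketch of that last step using the LangChain chat model initialized earlier; the message construction is illustrative, and `vector_store` and `llm` come from the previous sketches.

```python
from langchain_core.messages import SystemMessage, HumanMessage

query = "Is iced cappuccino available and at what price?"
docs = vector_store.similarity_search(query, k=5)
context = "\n".join(doc.page_content for doc in docs)

answer = llm.invoke([
    SystemMessage(content="You are a cooking assistant. Answer the user query using only the provided context."),
    HumanMessage(content=f"Context:\n{context}\n\nQuestion: {query}"),
])
print(answer.content)
```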
Conclusion
By combining Azure Cosmos DB, Azure OpenAI, and LangChain, developers can build highly efficient Retrieval-Augmented Generation (RAG) systems that deliver accurate, context-rich answers from private datasets. Cosmos DB provides fast, scalable storage for both raw data and vector embeddings, removing the need for separate vector databases. Azure OpenAI enables powerful semantic search and natural-language responses, while LangChain streamlines orchestration and integration. Whether using the NoSQL API or MongoDB vCore, this architecture ensures speed, reliability, and scalability, making it ideal for enterprise-grade AI applications that require precise and relevant answers with minimal latency.