Introduction
Large Language Models (LLMs) are powerful, but they face inherent limitations such as outdated knowledge, hallucinations, and an inability to access private or real-time data. Retrieval-Augmented Generation (RAG) was introduced to solve these issues by combining retrieval from external sources with generation by the LLM.
While effective, traditional RAG systems are limited when dealing with complex, multi-domain, or dynamic tasks. This is where Agentic RAG comes in, an advanced evolution of RAG that incorporates intelligent agents to make retrieval and generation processes smarter, more adaptive, and more reliable.
Limitations of Traditional RAG
Traditional RAG works in three basic steps:
- Retrieval: Search relevant documents from an external knowledge base.
- Augmentation: Insert retrieved context into the prompt.
- Generation: LLM produces the final response.
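For contrast with what follows, here is a minimal sketch of this fixed three-step pipeline. The embed, vector_search, and llm callables are hypothetical stand-ins for whatever embedding model, vector store, and LLM you actually use.
code snippet
# Minimal sketch of a fixed RAG pipeline (embed/vector_search/llm are hypothetical stand-ins)
def traditional_rag(query: str, llm, embed, vector_search, k: int = 3) -> str:
    # 1. Retrieval: fetch the top-k chunks most similar to the query
    chunks = vector_search(embed(query), k=k)
    # 2. Augmentation: insert the retrieved context into the prompt
    context = "\n\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # 3. Generation: the LLM produces the final response in a single pass
    return llm(prompt)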
This approach works well for simple Q&A, but it has drawbacks:
- Single-path retrieval: Always fetches from the same knowledge source.
- Shallow responses: Lacks reasoning and context synthesis.
- Rigid pipeline: Cannot adapt to query type (e.g., code, charts, text).
- No decomposition: Struggles with complex, multi-part questions.
- Limited fallback: Often hallucinates when relevant data is missing.
What is Agentic RAG?
Traditional RAG (Retrieval-Augmented Generation) = LLM + Retriever (Vector DB).
Agentic RAG = RAG + agents with reasoning/planning abilities.
Instead of a simple “retrieve → answer”, the agent can plan, decide, use tools, call APIs, combine multiple DBs, and refine answers.
Think of it as:
- Simple RAG = “memory + brain”.
- Agentic RAG = “brain + memory + tools + reasoning”.
Agentic RAG enhances the traditional RAG framework by introducing autonomous agents that orchestrate the entire workflow. Instead of following a fixed pipeline, these agents dynamically decide:
- Which knowledge source to use.
- Whether to split queries into sub-queries.
- How to summarise and synthesise multiple sources.
- What response format (text, code, chart) is most appropriate.
- How to gracefully handle failures when data is not available.
In other words, Agentic RAG transforms a static retrieval process into an adaptive, multi-step reasoning system.
Agentic RAG Architecture
- User Query
The process begins when a user submits a query. Instead of directly passing this query to a Retrieval-Augmented Generation (RAG) pipeline, the system routes it through an LLM Agent, which acts as a reasoning engine.
- LLM Agent
- The agent interprets the query and decides the next step.
- Unlike traditional RAG, which performs a fixed retrieval step, the agent applies reasoning and planning to determine whether additional context is required.
- The agent can break the query into sub-questions, chain reasoning steps, or orchestrate external tools.
- Context Retrieval Decision
- If the query is answerable with internal knowledge → the agent proceeds without retrieval.
- If more context is required → the agent triggers a web search or vector database lookup.
- Web Search / External Knowledge Access
- The agent queries external sources (web search, APIs, or structured DBs) to gather missing information.
- Retrieved content is filtered, ranked, and structured before being fed back into the reasoning loop.
- Retrieved Context Integration
- When new context is retrieved, it is validated by the agent.
- If relevant → the agent incorporates it into the prompt.
- If irrelevant → the agent retries retrieval or reformulates the query.
- Answer Generation
- The agent synthesizes retrieved context and its own reasoning to generate a final structured response.
- This ensures answers are accurate, grounded, and dynamically adapted based on user needs.
How Agentic RAG Works
Agentic RAG Workflow:
- User Query → Input enters the system
  User submits a query to the system.
- Query Assessment & Dynamic Routing → Decide & choose source
  Determine if the query can be answered from internal knowledge or needs external retrieval, and select the appropriate tool(s) or data source (PDF retrievers, vector DBs, web search, etc.).
- Query Decomposition → Break down complex queries
  Split complex queries into smaller sub-queries for targeted retrieval.
- Targeted Retrieval → Fetch relevant information
  Retrieve top-k relevant chunks from each selected data source.
- Context Injection & Reasoning Loop → Merge & iterate
  Combine retrieved info with the query for context, let the agent reason, and decide if further retrieval is needed.
- Summarisation & Response Type Selection → Draft answer
  Merge retrieved content and reasoning into a coherent answer, selecting the appropriate output format (text, code, table, visualisation, etc.).
- Failsafe Handling & Final Response → Deliver answer
  Use fallback tools if needed to reduce hallucinations, then deliver a complete, context-aware response to the user.
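The same workflow, expressed as a rough control-flow sketch. Every helper on the agent object (needs_retrieval, decompose, pick_source, needs_more_context, reformulate, answer) is a hypothetical stand-in; the point is the extra decision-making wrapped around retrieval.
code snippet
# Rough sketch of the Agentic RAG loop; all agent.* helpers are hypothetical stand-ins
def agentic_rag(query: str, agent, max_rounds: int = 3) -> str:
    # Query assessment & dynamic routing: answer directly if internal knowledge suffices
    if not agent.needs_retrieval(query):
        return agent.answer(query, context=[])
    # Query decomposition: split complex questions into targeted sub-queries
    context = []
    for sub_query in agent.decompose(query):
        source = agent.pick_source(sub_query)             # PDF retriever, vector DB, web search, ...
        context.extend(source.retrieve(sub_query, k=3))   # targeted retrieval
    # Context injection & reasoning loop: keep retrieving until the agent is satisfied
    for _ in range(max_rounds):
        if not agent.needs_more_context(query, context):
            break
        follow_up = agent.reformulate(query, context)
        context.extend(agent.pick_source(follow_up).retrieve(follow_up, k=3))
    # Summarisation, response-type selection, and failsafe handling produce the final answer
    return agent.answer(query, context)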
Traditional RAG VS Agentic RAG
| Category | Traditional RAG | Agentic RAG |
|---|---|---|
| Retrieval | Fetches from a single DB | Routes queries to multiple specialized DBs |
| Query Handling | One-shot, no decomposition | Breaks into sub-queries and orchestrates |
| Output | Primarily text | Text, code, charts, or mixed outputs |
| Adaptability | Static pipeline | Dynamic, agent-driven decision-making |
| Accuracy | Limited by shallow retrieval | Improved via summarization + synthesis |
| Reliability | Prone to hallucinations | Failsafe fallback mechanisms |
| Scalability | Effective for narrow tasks | Handles multi-domain, complex workflows |
Types of Agents in Agentic RAG
There are multiple agent architectures (depending on the complexity of the queries):
1. ReAct Agent (Reasoning + Acting)
- Thinks step by step: “First retrieve from DB → then do math → then answer.”
- Keeps reasoning + tool calls interleaved.
- Best for: multi-step reasoning with tool usage.
Example: LLM says:
- Thought: I need info from DB.
- Action: Call Retriever.
- Observation: Got text.
- Thought: Need to calculate.
- Action: Call Math tool.
- … until final answer.
2. Tool-Calling Agent
- LLM directly calls tools via function calling (like OpenAI function calling or LangChain tools).
- No explicit reasoning visible, just structured function calls.
- Best for: API integrations, structured data pipelines.
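A minimal sketch of this pattern using the OpenAI-style function-calling interface mentioned above; the search_docs tool schema, model name, and query are illustrative assumptions, not part of the pipeline built later in this article.
code snippet
# Sketch of a tool-calling agent via OpenAI-style function calling (tool schema is illustrative)
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the internal knowledge base for relevant passages.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What does our refund policy say?"}],
    tools=tools,
)

# The model returns structured tool calls instead of visible reasoning steps
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))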
3. Query Planning Agent
- Breaks a complex query into sub-queries.
- Runs retrieval for each.
- Combines results in final synthesis.
- Best for: long, multi-part queries.
Example: “Summarize Paper A, then compare with Paper B, then extract FAQs.”
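A sketch of the idea under stated assumptions: llm is a hypothetical text-in/text-out callable and retriever is a hypothetical object with a retrieve(query, k) method.
code snippet
# Sketch of a query-planning agent; llm and retriever are hypothetical stand-ins
def plan_and_answer(query: str, llm, retriever) -> str:
    # 1. Plan: ask the LLM to break the request into sub-queries, one per line
    plan = llm(f"Break this request into independent sub-queries, one per line:\n{query}")
    sub_queries = [line.strip() for line in plan.splitlines() if line.strip()]
    # 2. Retrieve and answer each sub-query separately
    partial_answers = []
    for sub_query in sub_queries:
        context = retriever.retrieve(sub_query, k=3)
        partial_answers.append(llm(f"Context:\n{context}\n\nAnswer this: {sub_query}"))
    # 3. Synthesize: combine the partial answers into one final response
    return llm("Combine these partial answers into one coherent response:\n\n" + "\n\n".join(partial_answers))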
4. Multi-Agent System
- Multiple agents specialized in tasks:
- Research Agent (fetch papers).
- Summarizer Agent (make FAQs).
- Evaluator Agent (check answer quality).
- They collaborate and pass info.
- Best for: big pipelines with multiple DBs or tasks.
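A compact sketch of how such agents might hand work to each other; llm and fetch_papers are hypothetical stand-ins, and each "agent" here is simply an LLM call with a different role.
code snippet
# Sketch of a small multi-agent pipeline (llm and fetch_papers are hypothetical stand-ins)
def multi_agent_pipeline(topic: str, llm, fetch_papers) -> str:
    # Research Agent: gather raw source material
    papers = fetch_papers(topic)
    # Summarizer Agent: turn the material into FAQs
    faqs = llm(f"Write FAQs (question + short answer) from these papers:\n{papers}")
    # Evaluator Agent: check quality and request a revision if needed
    verdict = llm(f"Rate these FAQs for accuracy and clarity. Reply PASS or list the issues:\n{faqs}")
    if "PASS" not in verdict:
        faqs = llm(f"Revise these FAQs to fix the issues.\nFAQs:\n{faqs}\nIssues:\n{verdict}")
    return faqs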
5. Conversational Agent (Memory + RAG)
- Keeps track of conversation context (chat history).
- Uses RAG + memory.
- Best for: chatbots, assistants.
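A minimal sketch of memory plus RAG, assuming a hypothetical llm callable and a retriever with a retrieve(query, k) method; a real chatbot would also trim or summarize long histories.
code snippet
# Sketch of a conversational agent: chat history + retrieval feed every turn (hypothetical llm/retriever)
class ConversationalRAG:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever
        self.history = []  # list of (role, text) pairs kept across turns

    def ask(self, user_message: str) -> str:
        context = self.retriever.retrieve(user_message, k=3)
        transcript = "\n".join(f"{role}: {text}" for role, text in self.history)
        prompt = (f"Conversation so far:\n{transcript}\n\n"
                  f"Retrieved context:\n{context}\n\nUser: {user_message}\nAssistant:")
        answer = self.llm(prompt)
        self.history.append(("User", user_message))
        self.history.append(("Assistant", answer))
        return answer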
6. Self-Reflective Agent
- Generates an answer.
- Critiques or improves its own answer before finalizing.
- Best for: higher accuracy and quality checks.
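A sketch of the generate-critique-revise loop, again with a hypothetical llm callable.
code snippet
# Sketch of a self-reflective agent: draft, critique, then revise before answering (hypothetical llm)
def reflective_answer(query: str, context: str, llm, max_revisions: int = 2) -> str:
    draft = llm(f"Context:\n{context}\n\nAnswer the question: {query}")
    for _ in range(max_revisions):
        critique = llm(f"Critique this answer for errors or unsupported claims. "
                       f"Reply OK if it is fine.\nQuestion: {query}\nAnswer: {draft}")
        if critique.strip().startswith("OK"):
            break
        draft = llm(f"Rewrite the answer to address this critique.\n"
                    f"Question: {query}\nAnswer: {draft}\nCritique: {critique}")
    return draft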
Why Shift to Agentic RAG?
Organizations should prefer Agentic RAG because:
- It handles complex, multi-step queries more effectively.
- It improves accuracy and reliability through intelligent orchestration.
- It allows multi-modal outputs.
- It adapts to different domains and specialized datasets.
- It reduces hallucinations with failsafe fallback mechanisms.
In short, Agentic RAG transforms LLMs into adaptive problem solvers rather than static Q&A systems.
Here’s an example illustrating the full workflow of Agentic RAG, from retrieving relevant chunks from PDFs to generating a final answer using the agent.
Step 1. Imports + why
import os
from pathlib import Path
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_sambanova import ChatSambaNovaCloud, SambaNovaCloudEmbeddings
from langchain.vectorstores import DeepLake
from langchain.agents import initialize_agent, Tool
from langchain_community.tools.tavily_search.tool import TavilySearchResults
- import os — used to access and set environment variables (API keys) and other OS-level operations.
- from pathlib import Path — nicer, cross-platform file/path utilities (Path(...).exists(), .stem, etc.).
- PyPDFLoader — LangChain loader that reads a PDF and returns Document objects (text + metadata). We use it to extract raw text from each PDF.
- RecursiveCharacterTextSplitter — splits long documents into smaller chunks (by characters) with overlap. We do this before creating embeddings so each chunk fits in the retriever/LLM context and retrievals are more focused.
- ChatSambaNovaCloud — an LLM wrapper that sends prompts to SambaNova Cloud models (used here as the main LLM).
- SambaNovaCloudEmbeddings — wrapper to produce embeddings from SambaNova Cloud models (used to convert text chunks into vectors).
- DeepLake — vector store to persist embeddings + document chunks (from Activeloop / DeepLake). It provides retrieval APIs and persistence.
- initialize_agent, Tool — LangChain agent initializer and Tool object wrapper. Tools let the agent call retrievers or other functions.
- TavilySearchResults — a web search tool from langchain_community that uses the Tavily service as a fallback web search.
Step 2. API keys
The LLM/embedding tools and the Tavily tool will read these env vars to authenticate API calls.
code snippet
#API keys
os.environ["SAMBANOVA_API_KEY"] = "your_sambanova_key"
os.environ["TAVILY_API_KEY"] = "your_tavily_key"
Step 3. LLM + Embeddings
What it does
- llm: creates a wrapper object to call a named SambaNova Cloud model (used to generate natural-language outputs).
- embeddings: creates an embeddings client that will convert text chunks into vectors using the specified embeddings model.
Why
- You need an LLM to synthesize answers and a separate embeddings model to create vectors for retrieval. Some setups use the same model for both; here we choose a specific embeddings model (E5-Mistral-7B-Instruct), which is typically optimized for embeddings.
code snippet
llm = ChatSambaNovaCloud(
model="Meta-Llama-3.3-70B-Instruct",
max_tokens=500,
temperature=0.7,
top_p=0.01,
)
embeddings = SambaNovaCloudEmbeddings(
model="E5-Mistral-7B-Instruct",
sambanova_api_key=os.environ["SAMBANOVA_API_KEY"]
)
Step 4. Build or Load PDF Vector Database
- Why use Path.stem for DB naming?
  → Ensures every PDF has a unique, human-readable database name based on its filename.
- Why store under db_root?
  → Keeps vector stores organized in a structured directory, so each PDF has its own persistent DB.
- Why check if the DB exists first?
  → Saves time and cost — prevents recomputing embeddings every time the same PDF is used.
- Why use PyPDFLoader?
  → Specialized loader that extracts text + metadata from PDFs into Document objects, ready for further processing.
- Why chunk the text?
  - LLMs have context window limits (e.g., 4k–32k tokens).
  - Splitting ensures chunks are small enough to fit into prompts.
  - Overlap keeps semantic continuity between chunks, so queries don’t lose context when text is split mid-sentence.
  We instantiate a splitter that splits by characters into chunks of ~500 characters with 50 characters of overlap.
  - Why chunking? Breaks long text into manageable pieces so retrieval gives focused contexts that fit LLM context windows.
  - Why overlap? Keeps context between chunk boundaries so you don’t lose information split in the middle of a concept/sentence.
  - Note: chunk_size here is by characters; some apps prefer token-based splitters for better alignment with token budgets.
- Why DeepLake?
  - A vector database optimized for embeddings.
  - Stores text as vectors → enables semantic search / retrieval.
  - Persistent → once built, the DB doesn’t need to be recreated.
- Why read_only=True when loading an existing DB?
  → Prevents accidental overwrites, ensures integrity of already-built embeddings.
- Why return (name, db)?
  - name → lets you reference which PDF/DB is being used.
  - db → gives you direct access for queries (db.similarity_search(query)).
code snippet
#Load or build DB
def build_or_load_db(pdf_path, db_root="vector_stores"):
name = Path(pdf_path).stem
db_path = f"{db_root}/{name}"
if Path(db_path).exists():
db = DeepLake(dataset_path=db_path, embedding=embeddings, read_only=True)
else:
loader = PyPDFLoader(pdf_path)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
db = DeepLake(dataset_path=db_path, embedding=embeddings)
db.add_documents(chunks)
return name, db
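If you want the token-based splitting mentioned in the note above, LangChain's splitter can also be built from a tiktoken encoder. This is a sketch, not part of the original pipeline; it assumes the tiktoken package is installed, and the encoding name is an assumption.
code snippet
# Token-based alternative to the character splitter above (assumes tiktoken is installed)
from langchain.text_splitter import RecursiveCharacterTextSplitter

token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used to count tokens
    chunk_size=500,               # now measured in tokens, not characters
    chunk_overlap=50,
)
# chunks = token_splitter.split_documents(docs)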
Step 5. Load PDFs and create tools
This code prepares the tools that the agent will use to answer queries:
- PDF Files: Specifies the paths of PDFs to be used as knowledge sources.
- Retriever Tool: For each PDF, a retriever function is created that searches the vector store (db) for the top 3 semantically similar text chunks. This function is wrapped as a LangChain Tool so the agent can call it.
- Build Tools List: Loops through all PDFs, builds or loads their vector DBs, and appends the corresponding retriever tools to the tools list.
- Tavily Fallback Tool: Adds a web search tool for queries not covered in the PDFs, providing the agent with an external fallback source.
- Confirmation: Prints all loaded tools to verify that retrievers and fallback tools are ready for the agent.
- Purpose: Enables the agent to dynamically select between multiple PDF retrievers and a web search tool, supporting multi-source retrieval in an Agentic RAG workflow.
code snippet
# Update these paths
pdf_files = [
"/Users/shivanim/Desktop/pdf1.pdf",
"/Users/shivanim/Desktop/pdf2.pdf",
]
#Retriever wrapper
def make_retriever_tool(name, db):
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})
def retriever_func(query: str):
docs = retriever.get_relevant_documents(query)
if not docs:
return "No relevant information found."
return "\n".join(d.page_content for d in docs)
return Tool(
name=f"{name}_retriever",
func=retriever_func, # function only takes query
description=f"Use this tool to search inside the {name} PDF."
)
tools = []
for pdf in pdf_files:
name, db = build_or_load_db(pdf)
tools.append(make_retriever_tool(name, db))
#Tavily fallback tool
tavily_tool = TavilySearchResults(
name="tavily_search_engine",
description="Use this if the PDFs don’t have the answer.",
max_results=1,
include_answer=True,
)
tools.append(tavily_tool)
print("Tools loaded:", [t.name for t in tools])
Step 6. Agent Initialization for Multi-Tool Reasoning in Agentic RAG
This is a Zero-Shot ReAct Agent.
- agent="zero-shot-react-description" → tells LangChain to use a ReAct-style agent in zero-shot mode.
- ReAct = Reasoning + Acting:
- Reasoning: the agent thinks step by step (“I need to search pdf1 PDF for this info”).
- Acting: it chooses and calls a relevant tool (like a retriever or web search).
- Observation: gets results and continues reasoning if needed.
- Zero-Shot: it doesn’t require prior examples or fine-tuning; it interprets the tool descriptions to decide which one to use.
Why Use This Agent
- You have multiple tools (pdf1 retriever, pdf2 retriever, Tavily search).
- The agent must dynamically decide which tool to call based on the user query.
- ReAct agents are ideal because they reason, act, and iterate until they produce an answer.
- Zero-shot allows using this setup without creating a custom prompt or training data.
I have used the Zero-Shot ReAct Agent to enable multi-tool reasoning, letting the agent decide on the fly which PDF retriever or fallback search to use to answer user queries. You can use a different agent type according to your use case.
code snippet
#Initialize Agent
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=False)
Step 7. Query Execution and Answer Retrieval in Agentic RAG
This code sends a user query to the Zero-Shot ReAct agent. The agent evaluates the query, decides which tool(s) to call (pdf1 PDF retriever, pdf2 PDF retriever, or Tavily search), retrieves relevant information, and synthesizes a final answer using the LLM. The result returned is a structured dictionary, and result["output"]
contains the final, aggregated answer that the agent generates for the query.
code snippet
# Ask query
query = "Does SambaNova support streaming completions like OpenAI? How does handle streaming responses?"
result = agent.invoke({"input": query})
print("\n Final Answer:\n", result["output"])
Agentic RAG is like giving an AI assistant not just memory, but also reasoning and tools. Instead of just pulling info from a database, it can plan, fetch from multiple places, run calculations, and then answer in a structured way.
Thank you and happy learning!