From LLM to Agent: A Deep Dive into AI Agent Architecture Evolution
Introduction
When discussing AI Agents, the following questions are frequently raised:
- How does an Agent work?
- What's the difference between using an Agent and directly using ChatGPT/Qwen/DeepSeek?
- What does Manus do? Why do people prefer Manus over native ChatGPT?
- What's the relationship between Agent and RAG?
- Do you need to fine-tune models for building Agents?
The essence of these questions is: What exactly is an "Agent"? What's the fundamental difference between it and directly calling a large model? Is it just a rebranded concept?
This article will systematically answer all these questions with clear explanations and diagrams.
Three Generations of Architecture Evolution: LLM → Workflow → Agent
The technical architecture of intelligent applications has evolved through three generations: Bare LLM → Workflow → Agent
First Generation: Bare LLM (Conversation is the End)
Representative Products: ChatGPT, Claude, Qwen, DeepSeek native chat interfaces
Architecture Characteristics: single request-response, stateless, no execution capability.
Capability Boundaries:
| What it can do | What it cannot do |
|---|---|
| Answer questions, generate text | Execute actual operations (e.g., query database, call APIs) |
| Provide suggestions and solutions | Get real-time information (knowledge has cutoff date) |
| Code generation, copywriting | Cross-system collaboration for complex tasks |
| Single or simple multi-turn conversations | Long-term task tracking and state management |
Essential Positioning: A knowledgeable consultant who "can only talk but cannot act."
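The "stateless" point above is worth making concrete: a bare chat API has no memory between calls, and the illusion of a conversation is created by resending the full history every turn. A minimal sketch, using a stub function in place of a real chat API:

```python
def fake_llm(messages):
    """Stub chat model: it can only answer from the messages it is given."""
    joined = " ".join(m["content"] for m in messages)
    if "Beijing" in joined:
        return "You asked about Beijing."
    return "I have no idea what city you mean."

# Turn 1: the model sees the city name.
r1 = fake_llm([{"role": "user", "content": "Tell me about Beijing."}])

# Turn 2 sent WITHOUT history: the model is stateless, so context is gone.
r2 = fake_llm([{"role": "user", "content": "What is its population?"}])

# Turn 2 with history replayed: "memory" is just resent context.
r3 = fake_llm([
    {"role": "user", "content": "Tell me about Beijing."},
    {"role": "user", "content": "What is its population?"},
])
```

This is also why every later architecture (Workflow, Agent) must add an explicit memory layer on top of the model.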
Second Generation: Workflow (Orchestrating LLM)
Representative Products: Dify, Coze, Baidu Qianfan, LangChain (Chain mode), etc.
Architecture Characteristics: humans design the flow, and the LLM executes within preset nodes. Every possible execution path must be designed in advance.
Improvements over Bare LLM:
| Improvement | Description |
|---|---|
| Can execute operations | Call external systems through API nodes |
| Flow orchestration | Multi-step tasks can be executed in sequence |
| Conditional branching | Different paths for different situations |
| Knowledge enhancement | Can integrate RAG for knowledge retrieval |
Core Limitations:
| Problem | Manifestation |
|---|---|
| Branch Explosion | Business scenario combinations grow exponentially, impossible to enumerate |
| Preset Paths | All possible execution paths must be designed in advance |
| Long-tail Issues | Edge cases are difficult to cover |
| Poor Flexibility | New scenarios require flow modifications, slow response to changes |
Essential Positioning: the LLM is "orchestrated" within preset flows. Humans design the flow; the LLM is just one execution node inside it.
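The "preset paths" and "branch explosion" limitations can be seen in a few lines of code. The sketch below imitates a workflow tool's routing node with hypothetical intents and route names; in a real product (Dify, Coze) the classifier might itself be an LLM node, but the branches are still enumerated by a human at design time:

```python
def classify_intent(text: str) -> str:
    # Stands in for an intent-classification node; a keyword rule
    # replaces the LLM call for illustration.
    if "refund" in text:
        return "refund"
    if "weather" in text:
        return "weather"
    return "unknown"

def workflow(user_input: str) -> str:
    intent = classify_intent(user_input)
    # Every branch below was written by a human in advance.
    if intent == "refund":
        return "route: refund_form"       # preset node
    elif intent == "weather":
        return "route: weather_api"       # preset node
    else:
        # The long tail: anything not designed in advance falls through.
        return "route: human_handoff"
```

Adding a new scenario means editing the flow itself, which is exactly the "slow response to changes" problem in the table above.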
Third Generation: Agent (Autonomous Intelligent Agent)
Representative Products: Manus, OpenAI Operator, Claude Computer Use, AutoGPT, MetaGPT
Architecture Characteristics: the LLM decides what to do, how to do it, and which tools to use.
Core Capabilities:
| Capability | Description |
|---|---|
| Autonomous Planning | LLM decomposes tasks and decides execution steps |
| Tool Invocation | Autonomously selects appropriate tools as needed |
| Feedback Loop | Observes results after execution, dynamically adjusts strategy |
| Long-running Tasks | Can continuously execute complex multi-step tasks |
Essential Positioning: the LLM transforms from "being orchestrated" into "the orchestrator," deciding what to call, what to execute, and what to do next.
Summary: Three Generations Compared
| Dimension | Bare LLM | Workflow | Agent |
|---|---|---|---|
| Representative Products | ChatGPT/Qwen/DeepSeek | Dify/Coze | Manus/Operator |
| Execution Capability | None, text output only | Yes, but paths preset | Yes, autonomous decision |
| Flow Control | None | Human-designed flows | LLM autonomous planning |
| Tool Invocation | None | Preset node calls | Autonomous selection |
| Handling Long-tail | Poor | Average | Good |
| Maintenance Cost | Low | High (branch explosion) | Low (no need to enumerate) |
| Flexibility | High (but no execution) | Low (fixed paths) | High |
| Analogy | Consultant who only talks | Actor following script | Assistant who gets things done |
One-liner Summary:
- Bare LLM: Tells you "how to do it"
- Workflow: "Does it for you" following preset paths
- Agent: Figures out how to "get it done for you"
Core Working Principles of Agent
Agent Loop: Observe-Think-Act-Feedback
The core of an Agent is a continuously running loop called the Agent Loop:
Four Phases Explained:
| Phase | Core Actions |
|---|---|
| Observe | Receive user input, sense current state, get environment information |
| Think | Understand task goal, analyze current state, plan next step, select tools |
| Act | Invoke selected tools, execute specific operations, interact with external systems |
| Feedback | Observe execution results, judge if goal achieved, decide to continue/adjust/end |
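The four phases above can be sketched as one loop. This is a toy policy, not a real LLM: the `think` stub decides the next action from the current state, and the tool result is fed back into the state for the next iteration (the tool name and canned result are invented for illustration):

```python
def think(state):
    """Decide the next action from the current state (stands in for the LLM)."""
    if "weather" not in state:
        return ("call_tool", "get_weather")
    return ("finish", None)

def act(action, state):
    if action == "get_weather":
        state["weather"] = "sunny, 15-25°C"    # pretend tool result
    return state

def agent_loop(goal: str, max_steps: int = 5) -> str:
    state = {"goal": goal}                     # observe: initial input
    for _ in range(max_steps):
        decision, action = think(state)        # think
        if decision == "finish":
            return f"Answer: {state['weather']}"
        state = act(action, state)             # act + feedback into state
    return "Gave up after max_steps"
```

The `max_steps` cap is a common safeguard in real frameworks, since an autonomous loop can otherwise run indefinitely.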
Core Components of Agent
A complete Agent system contains four core components:
Four Components Explained:
| Component | Function | Description |
|---|---|---|
| LLM (Brain) | Understanding, reasoning, decision-making | The thinking center of Agent, can be GPT-4/Claude/Qwen, etc. |
| Tool Set (Hands & Feet) | Execute external operations | API calls, database queries, code execution, web browsing, etc. |
| Memory System | State management | Short-term memory (conversation context), long-term memory (user profile), working memory (task state) |
| Planner | Task decomposition | Decompose complex tasks into sub-steps, determine execution order, dynamically adjust plans |
Technical Implementation Paradigms
Current mainstream Agent implementation paradigms in the industry:
ReAct (Reasoning + Acting)
Core Idea: Let LLM alternate between "reasoning" and "acting," thinking about why before each action.
User: Check the weather in Beijing tomorrow
Agent Thinking: User wants to know Beijing's weather tomorrow, I need to call weather query tool
Agent Action: Call get_weather(city="Beijing", date="tomorrow")
Agent Observation: Result shows sunny, temperature 15-25°C
Agent Thinking: Got weather data, can now answer the user
Agent Output: Beijing will be sunny tomorrow, 15-25°C, great for outdoor activities.
Pros: Transparent reasoning process, easy to debug and audit
Cons: Every step requires LLM reasoning, lower efficiency
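A ReAct runtime is essentially a text-parsing loop: the model emits `Thought:` / `Action:` lines, the runtime executes the action and appends an `Observation:` line, and the cycle repeats until a `Final Answer:` appears. A minimal sketch, with a scripted stub standing in for the LLM and an invented weather tool:

```python
import re

def scripted_llm(transcript: str) -> str:
    # Canned replies replace a real model for illustration.
    if "Observation:" not in transcript:
        return ("Thought: I need the weather for Beijing.\n"
                "Action: get_weather[Beijing]")
    return "Thought: I have the data.\nFinal Answer: Sunny, 15-25°C."

TOOLS = {"get_weather": lambda city: "sunny, 15-25°C"}

def react(question: str, max_steps: int = 3) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = scripted_llm(transcript)
        transcript += "\n" + reply
        m = re.search(r"Action: (\w+)\[([^\]]*)\]", reply)
        if m:
            obs = TOOLS[m.group(1)](m.group(2))    # act
            transcript += f"\nObservation: {obs}"  # feed result back
        elif "Final Answer:" in reply:
            return reply.split("Final Answer:")[1].strip()
    return "no answer"
```

Because the whole transcript is plain text, the reasoning trace can be logged and audited directly, which is the "pros" noted above; the cost is one LLM call per step.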
Plan-and-Execute
Core Idea: First make a complete plan, then execute step by step.
User: Help me create an analysis report on the EV market
Agent Planning:
1. Search for latest EV market data
2. Collect sales information of major manufacturers
3. Analyze market trends and competitive landscape
4. Generate charts and visualizations
5. Compile into a complete report document
Agent Execution: Execute steps according to plan...
Pros: More organized handling of complex tasks
Cons: Once planned, less flexibility to adjust
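The pattern reduces to one planning call followed by a sequential executor. In the sketch below the planner returns a canned step list in place of an LLM call, and each step handler just records its output; the step names are invented:

```python
def plan(task: str) -> list:
    # A real planner would be a single LLM call producing this list.
    return ["search_data", "analyze", "write_report"]

def execute_step(step: str, context: dict) -> dict:
    context[step] = f"done:{step}"   # pretend each step produced output
    return context

def plan_and_execute(task: str) -> dict:
    context = {"task": task}
    # Note: the plan is fixed once made; there is no mid-course
    # replanning, which is exactly the "cons" noted above.
    for step in plan(task):
        context = execute_step(step, context)
    return context
```

Hybrid designs exist that re-invoke the planner when a step fails, trading some of the efficiency back for ReAct-style flexibility.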
Multi-Agent (Multi-Agent Collaboration)
Core Idea: Multiple specialized Agents work together, each handling their expertise.
Pros: Professional division of labor, strong capability for complex tasks
Cons: High system complexity, high coordination costs
Relationship Between Agent and RAG
What is RAG?
RAG (Retrieval Augmented Generation) is a technique that enables LLMs to access external knowledge:
Three Steps:
- Retrieval: Convert question to vector, retrieve relevant documents from knowledge base
- Augmentation: Inject the retrieved document snippets into the prompt as context
- Generation: LLM generates answer based on context
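The three steps can be sketched end to end. Real systems embed text with an embedding model and search a vector store; here word overlap stands in for vector similarity, and a stub stands in for the generator LLM (the documents are invented):

```python
DOCS = [
    "Annual leave: 1-5 years tenure gives 5 days.",
    "Expense reports must be filed within 30 days.",
]

def retrieve(question: str, k: int = 1) -> list:
    # Step 1: score documents (word overlap instead of cosine similarity).
    q = set(question.lower().split())
    scored = sorted(DOCS, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(prompt: str) -> str:
    # Step 3: stub LLM that just echoes the injected context.
    return "Based on the context: " + prompt.split("Context: ")[1]

def rag_answer(question: str) -> str:
    context = " ".join(retrieve(question))                # retrieval
    prompt = f"Question: {question}\nContext: {context}"  # augmentation
    return generate(prompt)                               # generation
```

Note that the LLM never "learns" the documents; it only sees whatever the retriever injects into that one prompt.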
RAG is One of Agent's "Tools"
Core Point: RAG and Agent are not concepts at the same level.
RAG is one of the many tools an Agent can invoke, used for knowledge retrieval
Detailed Comparison
| Dimension | RAG | Agent |
|---|---|---|
| Essence | A knowledge enhancement technique | A system architecture |
| Purpose | Enable LLM to access external knowledge | Enable LLM to autonomously complete tasks |
| Capability | Retrieve + Answer | Plan + Execute + Feedback |
| Execute Operations | Cannot | Can |
| Relationship | Is a tool for Agent | Can include RAG |
How Agent Uses RAG
In Agent architecture, RAG is typically invoked as a "knowledge retrieval tool":
User: "According to company policy, how many days of annual leave can I apply for?"
Agent Thinking: This is a question requiring internal policy lookup, I need to search knowledge base
Agent Action: Call rag_search(query="annual leave days company policy")
Agent Observation: Results show "1-5 years tenure: 5 days, 5-10 years: 10 days, over 10 years: 15 days"
Agent Thinking: Got policy info, but need user's tenure for accurate answer
Agent Action: Ask user "How many years have you been with the company?"
User: "3 years"
Agent Output: According to company policy, with 3 years tenure, you can apply for 5 days of annual leave.
RAG and Agent Combination Patterns
| Pattern | Description | Use Case |
|---|---|---|
| RAG Only | Pure knowledge Q&A, no execution capability | Simple FAQ, document Q&A |
| Agent + RAG | Agent calls RAG for knowledge, then executes operations | Task execution requiring knowledge support |
| Agentic RAG | RAG flow itself controlled by Agent, can dynamically decide retrieval strategy | Complex knowledge reasoning |
Do Agents Need Model Fine-tuning?
Core Conclusion: Usually Not Needed
Agent's capability mainly comes from system architecture, not model fine-tuning.
Key Insights:
- LLM Base Capability: General LLMs (like GPT-4/Claude/Qwen) are sufficient, usually no fine-tuning needed
- System Architecture Capability: Tool definitions, memory system, prompt engineering, Agent Loop — these are the main sources of Agent capability, unrelated to model fine-tuning
Why Don't Agents Usually Need Fine-tuning?
| Reason | Explanation |
|---|---|
| LLM Capability is Sufficient | Modern LLMs (GPT-4, Claude 3, Qwen2.5, etc.) have sufficient general capability to support Agent's reasoning and planning needs |
| Capability Comes from Architecture | Agent's "execution capability" comes from tool invocation, not the model itself |
| Prompt Engineering is More Efficient | Through well-designed System Prompts, general LLMs can behave like domain experts |
| Greater Flexibility | No fine-tuning means you can switch base models anytime, benefiting from model iterations |
| Lower Cost | Fine-tuning requires data, compute, and time, while prompt engineering costs almost nothing |
When Might Fine-tuning Be Needed?
While usually not needed, consider fine-tuning in these scenarios:
| Scenario | Description | Fine-tuning Goal |
|---|---|---|
| Domain-specific Terminology | Model doesn't accurately understand specific domain terms | Improve domain language understanding |
| Specific Output Format | Need model to stably output specific formats | Enhance format compliance |
| Cost Optimization | Replace large model with small model + fine-tuning | Reduce inference cost |
| Private Deployment | Cannot call cloud LLM APIs | Improve local small model capability |
Fine-tuning vs Prompt Engineering vs RAG
| Technique | Purpose | Use Cases | Cost |
|---|---|---|---|
| Prompt Engineering | Guide model behavior | Almost all scenarios | Low |
| RAG | Inject external knowledge | Scenarios requiring specific knowledge | Medium |
| Fine-tuning (LoRA, etc.) | Change model parameters | Special formats/domains/cost optimization | High |
Recommended Strategy:
- Priority: Prompt Engineering + RAG
- Secondary: If effects are insufficient, consider fine-tuning
- Principle: Avoid fine-tuning if possible, maintain architecture flexibility
The Right Path for Agent Capability Enhancement
| Priority | Approach | Specific Work |
|---|---|---|
| 1 | Prompt Engineering | Optimize System Prompt, design clear tool descriptions, Few-shot examples |
| 2 | Tool Ecosystem | Enrich tool set, optimize tool interface design, tool composition capability |
| 3 | Knowledge Enhancement | Build high-quality knowledge base, optimize retrieval strategy, knowledge update mechanism |
| 4 | Model Fine-tuning | Domain terminology understanding, stable output format, cost optimization (only when necessary) |
Industry Case Studies
Manus: Why is it Popular? What Does it Do?
Manus is an AI Agent product that went viral in early 2025, hailed as "the first AI that can actually help you do things."
Manus's Core Capabilities
| Capability | Description | Examples |
|---|---|---|
| Browser Automation | Operates web pages like a human | Auto search, fill forms, download files |
| Code Execution | Write and run code | Data processing, file conversion, automation scripts |
| File Operations | Create, edit, manage files | Generate reports, organize documents, batch processing |
| Task Orchestration | Break down and execute complex tasks | Multi-step task automation |
Manus vs Native ChatGPT Comparison
| Task | ChatGPT's Response | Manus's Approach |
|---|---|---|
| "Help me create a competitor analysis report" | Gives you a report template and framework suggestions | Auto-searches competitor info, collects data, generates complete analysis report |
| "Check Beijing's weather tomorrow" | Tells you which website to check | Opens weather site, queries data, directly tells you the result |
| "Convert this PDF to Word" | Recommends some online conversion tools | Directly executes conversion, generates Word file for you |
| "Help me analyze this Excel data" | Needs you to paste data in | Directly reads file, executes analysis, generates charts |
Why Do People Prefer Manus?
One-liner: Manus transforms LLM from a "conversation tool" into an "execution assistant."
Specific reasons:
- From "Suggestion" to "Execution": Not just telling you how to do it, but actually doing it for you
- Reduced Human Intervention: Users only need to state the goal, no step-by-step guidance needed
- Cross-system Capability: Can operate multiple websites and tools simultaneously to complete complex tasks
- Result Delivery: Final output is usable files and data, not just text suggestions
OpenAI Operator / Claude Computer Use
In 2024-2025, OpenAI and Anthropic successively launched Computer Use capabilities:
| Product | Release Date | Core Capability |
|---|---|---|
| Claude Computer Use | Oct 2024 | Claude can control computers, operating mouse and keyboard like a human |
| OpenAI Operator | Jan 2025 | GPT can automatically execute web tasks like shopping, booking, etc. |
This indicates: Agent-ification is an inevitable trend for LLM applications, with leading vendors all moving in this direction.
MCP Protocol: Industry Standard for Agent Tool Invocation
In November 2024, Anthropic released the MCP (Model Context Protocol), which is becoming the de facto standard for Agent tool invocation:
| Feature | Description |
|---|---|
| Standardization | Unified tool description and invocation protocol |
| Ecosystem Sharing | Develop once, reuse across multiple Agents |
| Industry Recognition | Alipay, WeChat, and others have launched MCP Servers |
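What "unified tool description" means in practice: an MCP tool is declared with a name, a human-readable description, and a JSON-Schema contract for its arguments. The field names below follow the published MCP tool schema (`name`, `description`, `inputSchema`); the weather tool itself is invented for illustration:

```python
# Sketch of an MCP-style tool declaration, expressed as a Python dict.
get_weather_tool = {
    "name": "get_weather",
    "description": "Query the weather forecast for a city",
    "inputSchema": {                 # JSON Schema for the arguments
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["city"],
    },
}
```

Because the description and argument schema are machine-readable, any MCP-compatible Agent can discover and call the tool without bespoke glue code, which is what enables the "develop once, reuse across multiple Agents" property in the table above.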
How to Build an Agent System
Technical Component Checklist
Building a complete Agent system requires the following technical components:
Layer Details:
| Layer | Component | Specifics |
|---|---|---|
| 1 | Agent Runtime Framework | Agent Loop implementation, plan-execute-feedback cycle, multi-Agent collaboration. Options: LangGraph/AutoGPT/CrewAI/Custom |
| 2 | Tool System | Tool definition (MCP/Function Calling), tool registry, invocation engine, error handling & retry |
| 3 | Memory System | Short-term memory (conversation context), long-term memory (user profile, task history), working memory (task intermediate state) |
| 4 | Perception Enhancement | Multi-modal input processing, RAG knowledge retrieval, web search |
| 5 | Security Control | Input security (injection detection), execution security (permission control), output security (hallucination detection) |
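Layers 2 and 5 often meet in one component: a tool registry that both dispatches calls and enforces permissions. A minimal sketch, with invented tool names and a simple role allowlist as the security policy:

```python
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, fn, allowed_roles=("admin",)):
        self._tools[name] = (fn, set(allowed_roles))

    def call(self, name, role, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        fn, roles = self._tools[name]
        if role not in roles:                  # execution security
            return "denied: insufficient permissions"
        try:
            return fn(**kwargs)
        except Exception as e:                 # error handling & retry hook
            return f"tool error: {e}"

registry = ToolRegistry()
registry.register("delete_file", lambda path: f"deleted {path}",
                  allowed_roles=("admin",))
registry.register("read_file", lambda path: f"contents of {path}",
                  allowed_roles=("admin", "user"))
```

Returning an error string instead of raising lets the Agent Loop feed the failure back to the LLM as an observation, so the model can adjust instead of crashing the run.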
Minimal Viable Agent
To quickly experience Agent development, start with a minimal viable approach:
```python
# Pseudocode: Minimal Viable Agent
def minimal_agent(user_input, tools, llm):
    # 1. Send user input and available tool descriptions to the LLM
    prompt = f"""
You are an AI assistant that can use the following tools:
{format_tools(tools)}

User request: {user_input}

Analyze the user's needs, decide whether to call a tool, and how to respond.
"""
    while True:
        # 2. The LLM thinks and decides the next step
        response = llm.chat(prompt)
        # 3. If the LLM decides to call a tool
        if response.has_tool_call:
            tool_name = response.tool_call.name
            tool_args = response.tool_call.arguments
            # 4. Execute the tool call
            result = tools[tool_name].execute(tool_args)
            # 5. Feed the result back to the LLM, continue the loop
            prompt += f"\nTool {tool_name} returned: {result}"
        else:
            # 6. The LLM gives the final answer, end the loop
            return response.content
```

Recommended Framework Choices
| Framework | Features | Use Cases |
|---|---|---|
| LangGraph | By LangChain team, supports complex Agent flow orchestration | Complex multi-step Agents |
| AutoGPT | Early Agent framework, active community | General autonomous Agents |
| CrewAI | Focused on multi-Agent collaboration | Multi-role collaboration scenarios |
| OpenAI Assistants API | Official OpenAI solution, high integration | Quick prototyping, GPT users |
| Custom | Full control, deep customization | Production systems with special requirements |
Summary
This article systematically explored the technical evolution from LLM to Agent:
- Three Generations: Bare LLM (can only talk) → Workflow (follows script) → Agent (acts autonomously)
- Agent Core: Observe-Think-Act-Feedback loop, composed of LLM + Tools + Memory + Planning
- Agent vs RAG: RAG is a tool for Agent, used for knowledge enhancement; they're not concepts at the same level
- Fine-tuning Needed?: Usually not, Agent capability mainly comes from system architecture rather than model parameters
- Industry Trend: Agent-ification is the inevitable direction for LLM applications, with leading vendors all investing in this area
One-liner to understand Agent: Agent = LLM (Brain) + Tools (Hands & Feet) + Memory + Planning Capability, enabling AI to evolve from "can talk" to "can act."
Appendix: Technical Terminology
| Term | Full Name | Explanation |
|---|---|---|
| Agent | Agent | Intelligent entity capable of perceiving environment, making autonomous decisions, and executing actions |
| LLM | Large Language Model | Large language models like GPT-4, Claude, Qwen, etc. |
| Workflow | Workflow | Predefined task execution flow |
| MCP | Model Context Protocol | A protocol proposed by Anthropic to standardize Agent tool invocation |
| RAG | Retrieval Augmented Generation | Enhancing LLM responses by retrieving external knowledge |
| ReAct | Reasoning + Acting | An Agent paradigm alternating between reasoning and acting |
| CoT | Chain of Thought | Prompting technique that makes LLM show reasoning process |
| Tool Calling | Tool Calling / Function Calling | Capability for LLM to invoke external tools |
| LoRA | Low-Rank Adaptation | An efficient model fine-tuning method |