From LLM to Agent: A Deep Dive into AI Agent Architecture Evolution
Introduction
When discussing AI Agents, the following questions are frequently raised:
- How does an Agent work?
- What's the difference between using an Agent and directly using ChatGPT/Qwen/DeepSeek?
- What does Manus do? Why do people prefer Manus over native ChatGPT?
- What's the relationship between Agent and RAG?
- Do you need to fine-tune models for building Agents?
The essence of these questions is: What exactly is an "Agent"? What's the fundamental difference between it and directly calling a large model? Is it just a rebranded concept?
This article will systematically answer all these questions with clear explanations and diagrams.
Three Generations of Architecture Evolution: LLM → Workflow → Agent
The technical architecture of intelligent applications has evolved through three generations: Bare LLM → Workflow → Agent
First Generation: Bare LLM (Conversation is the End)
Representative Products: ChatGPT, Claude, Qwen, DeepSeek native chat interfaces
Architecture Characteristics: single request-response, stateless, no execution capability.
Capability Boundaries:
| What it can do | What it cannot do |
|---|---|
| Answer questions, generate text | Execute actual operations (e.g., query database, call APIs) |
| Provide suggestions and solutions | Get real-time information (knowledge has cutoff date) |
| Code generation, copywriting | Cross-system collaboration for complex tasks |
| Single or simple multi-turn conversations | Long-term task tracking and state management |
Essential Positioning: A knowledgeable consultant who "can only talk but cannot act."
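The "stateless" point above is worth making concrete: a bare chat API has no memory between calls, and the illusion of a conversation is created by resending the full history every turn. A minimal sketch, using a stub function in place of a real chat API:

```python
def fake_llm(messages):
    """Stub chat model: it can only answer from the messages it is given."""
    joined = " ".join(m["content"] for m in messages)
    if "Beijing" in joined:
        return "You asked about Beijing."
    return "I have no idea what city you mean."

# Turn 1: the model sees the city name.
r1 = fake_llm([{"role": "user", "content": "Tell me about Beijing."}])

# Turn 2 sent WITHOUT history: the model is stateless, so context is gone.
r2 = fake_llm([{"role": "user", "content": "What is its population?"}])

# Turn 2 with history replayed: "memory" is just resent context.
r3 = fake_llm([
    {"role": "user", "content": "Tell me about Beijing."},
    {"role": "user", "content": "What is its population?"},
])
```

This is also why every later architecture (Workflow, Agent) must add an explicit memory layer on top of the model.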
Second Generation: Workflow (Orchestrating LLM)
Representative Products: Dify, Coze, Baidu Qianfan, LangChain (Chain mode), etc.
Architecture Characteristics: humans design the flow, and the LLM executes within preset nodes. Every possible execution path must be designed in advance.
Improvements over Bare LLM:
| Improvement | Description |
|---|---|
| Can execute operations | Call external systems through API nodes |
| Flow orchestration | Multi-step tasks can be executed in sequence |
| Conditional branching | Different paths for different situations |
| Knowledge enhancement | Can integrate RAG for knowledge retrieval |
Core Limitations:
| Problem | Manifestation |
|---|---|
| Branch Explosion | Business scenario combinations grow exponentially, impossible to enumerate |
| Preset Paths | All possible execution paths must be designed in advance |
| Long-tail Issues | Edge cases are difficult to cover |
| Poor Flexibility | New scenarios require flow modifications, slow response to changes |
Essential Positioning: the LLM is "orchestrated" within preset flows. Humans design the flow; the LLM is just one execution node inside it.
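The "preset paths" and "branch explosion" limitations can be seen in a few lines of code. The sketch below imitates a workflow tool's routing node with hypothetical intents and route names; in a real product (Dify, Coze) the classifier might itself be an LLM node, but the branches are still enumerated by a human at design time:

```python
def classify_intent(text: str) -> str:
    # Stands in for an intent-classification node; a keyword rule
    # replaces the LLM call for illustration.
    if "refund" in text:
        return "refund"
    if "weather" in text:
        return "weather"
    return "unknown"

def workflow(user_input: str) -> str:
    intent = classify_intent(user_input)
    # Every branch below was written by a human in advance.
    if intent == "refund":
        return "route: refund_form"       # preset node
    elif intent == "weather":
        return "route: weather_api"       # preset node
    else:
        # The long tail: anything not designed in advance falls through.
        return "route: human_handoff"
```

Adding a new scenario means editing the flow itself, which is exactly the "slow response to changes" problem in the table above.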
Third Generation: Agent (Autonomous Intelligent Agent)
Representative Products: Manus, OpenAI Operator, Claude Computer Use, AutoGPT, MetaGPT
Architecture Characteristics: the LLM decides what to do, how to do it, and which tools to use.
Core Capabilities:
| Capability | Description |
|---|---|
| Autonomous Planning | LLM decomposes tasks and decides execution steps |
| Tool Invocation | Autonomously selects appropriate tools as needed |
| Feedback Loop | Observes results after execution, dynamically adjusts strategy |
| Long-running Tasks | Can continuously execute complex multi-step tasks |
Essential Positioning: the LLM transforms from "being orchestrated" into "the orchestrator," deciding what to call, what to execute, and what to do next.
Summary: Three Generations Compared
| Dimension | Bare LLM | Workflow | Agent |
|---|---|---|---|
| Representative Products | ChatGPT/Qwen/DeepSeek | Dify/Coze | Manus/Operator |
| Execution Capability | None, text output only | Yes, but paths preset | Yes, autonomous decision |
| Flow Control | None | Human-designed flows | LLM autonomous planning |
| Tool Invocation | None | Preset node calls | Autonomous selection |
| Handling Long-tail | Poor | Average | Good |
| Maintenance Cost | Low | High (branch explosion) | Low (no need to enumerate) |
| Flexibility | High (but no execution) | Low (fixed paths) | High |
| Analogy | Consultant who only talks | Actor following script | Assistant who gets things done |
One-liner Summary:
- Bare LLM: Tells you "how to do it"
- Workflow: "Does it for you" following preset paths
- Agent: Figures out how to "get it done for you"
Core Working Principles of Agent
Agent Loop: Observe-Think-Act-Feedback
The core of an Agent is a continuously running loop called the Agent Loop:
Four Phases Explained:
| Phase | Core Actions |
|---|---|
| Observe | Receive user input, sense current state, get environment information |
| Think | Understand task goal, analyze current state, plan next step, select tools |
| Act | Invoke selected tools, execute specific operations, interact with external systems |
| Feedback | Observe execution results, judge if goal achieved, decide to continue/adjust/end |
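The four phases above can be sketched as one loop. This is a toy policy, not a real LLM: the `think` stub decides the next action from the current state, and the tool result is fed back into the state for the next iteration (the tool name and canned result are invented for illustration):

```python
def think(state):
    """Decide the next action from the current state (stands in for the LLM)."""
    if "weather" not in state:
        return ("call_tool", "get_weather")
    return ("finish", None)

def act(action, state):
    if action == "get_weather":
        state["weather"] = "sunny, 15-25°C"    # pretend tool result
    return state

def agent_loop(goal: str, max_steps: int = 5) -> str:
    state = {"goal": goal}                     # observe: initial input
    for _ in range(max_steps):
        decision, action = think(state)        # think
        if decision == "finish":
            return f"Answer: {state['weather']}"
        state = act(action, state)             # act + feedback into state
    return "Gave up after max_steps"
```

The `max_steps` cap is a common safeguard in real frameworks, since an autonomous loop can otherwise run indefinitely.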
Core Components of Agent
A complete Agent system contains four core components:
Four Components Explained:
| Component | Function | Description |
|---|---|---|
| LLM (Brain) | Understanding, reasoning, decision-making | The thinking center of Agent, can be GPT-4/Claude/Qwen, etc. |
| Tool Set (Hands & Feet) | Execute external operations | API calls, database queries, code execution, web browsing, etc. |
| Memory System | State management | Short-term memory (conversation context), long-term memory (user profile), working memory (task state) |
| Planner | Task decomposition | Decompose complex tasks into sub-steps, determine execution order, dynamically adjust plans |
Technical Implementation Paradigms
Current mainstream Agent implementation paradigms in the industry:
ReAct (Reasoning + Acting)
Core Idea: Let LLM alternate between "reasoning" and "acting," thinking about why before each action.
User: Check the weather in Beijing tomorrow
Agent Thinking: User wants to know Beijing's weather tomorrow, I need to call weather query tool
Agent Action: Call get_weather(city="Beijing", date="tomorrow")
Agent Observation: Result shows sunny, temperature 15-25°C
Agent Thinking: Got weather data, can now answer the user
Agent Output: Beijing will be sunny tomorrow, 15-25°C, great for outdoor activities.
Pros: Transparent reasoning process, easy to debug and audit
Cons: Every step requires LLM reasoning, lower efficiency
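A ReAct runtime is essentially a text-parsing loop: the model emits `Thought:` / `Action:` lines, the runtime executes the action and appends an `Observation:` line, and the cycle repeats until a `Final Answer:` appears. A minimal sketch, with a scripted stub standing in for the LLM and an invented weather tool:

```python
import re

def scripted_llm(transcript: str) -> str:
    # Canned replies replace a real model for illustration.
    if "Observation:" not in transcript:
        return ("Thought: I need the weather for Beijing.\n"
                "Action: get_weather[Beijing]")
    return "Thought: I have the data.\nFinal Answer: Sunny, 15-25°C."

TOOLS = {"get_weather": lambda city: "sunny, 15-25°C"}

def react(question: str, max_steps: int = 3) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = scripted_llm(transcript)
        transcript += "\n" + reply
        m = re.search(r"Action: (\w+)\[([^\]]*)\]", reply)
        if m:
            obs = TOOLS[m.group(1)](m.group(2))    # act
            transcript += f"\nObservation: {obs}"  # feed result back
        elif "Final Answer:" in reply:
            return reply.split("Final Answer:")[1].strip()
    return "no answer"
```

Because the whole transcript is plain text, the reasoning trace can be logged and audited directly, which is the "pros" noted above; the cost is one LLM call per step.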
Plan-and-Execute
Core Idea: First make a complete plan, then execute step by step.
User: Help me create an analysis report on the EV market
Agent Planning:
1. Search for latest EV market data
2. Collect sales information of major manufacturers
3. Analyze market trends and competitive landscape
4. Generate charts and visualizations
5. Compile into a complete report document
Agent Execution: Execute steps according to plan...
Pros: More organized handling of complex tasks
Cons: Once planned, less flexibility to adjust
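The pattern reduces to one planning call followed by a sequential executor. In the sketch below the planner returns a canned step list in place of an LLM call, and each step handler just records its output; the step names are invented:

```python
def plan(task: str) -> list:
    # A real planner would be a single LLM call producing this list.
    return ["search_data", "analyze", "write_report"]

def execute_step(step: str, context: dict) -> dict:
    context[step] = f"done:{step}"   # pretend each step produced output
    return context

def plan_and_execute(task: str) -> dict:
    context = {"task": task}
    # Note: the plan is fixed once made; there is no mid-course
    # replanning, which is exactly the "cons" noted above.
    for step in plan(task):
        context = execute_step(step, context)
    return context
```

Hybrid designs exist that re-invoke the planner when a step fails, trading some of the efficiency back for ReAct-style flexibility.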
Multi-Agent (Multi-Agent Collaboration)
Core Idea: Multiple specialized Agents work together, each handling their expertise.
Pros: Professional division of labor, strong capability for complex tasks
Cons: High system complexity, high coordination costs
Relationship Between Agent and RAG
What is RAG?
RAG (Retrieval Augmented Generation) is a technique that enables LLMs to access external knowledge:
Three Steps:
- Retrieval: Convert question to vector, retrieve relevant documents from knowledge base
- Augmentation: Inject the retrieved document snippets into the prompt as context
- Generation: LLM generates answer based on context
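The three steps can be sketched end to end. Real systems embed text with an embedding model and search a vector store; here word overlap stands in for vector similarity, and a stub stands in for the generator LLM (the documents are invented):

```python
DOCS = [
    "Annual leave: 1-5 years tenure gives 5 days.",
    "Expense reports must be filed within 30 days.",
]

def retrieve(question: str, k: int = 1) -> list:
    # Step 1: score documents (word overlap instead of cosine similarity).
    q = set(question.lower().split())
    scored = sorted(DOCS, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(prompt: str) -> str:
    # Step 3: stub LLM that just echoes the injected context.
    return "Based on the context: " + prompt.split("Context: ")[1]

def rag_answer(question: str) -> str:
    context = " ".join(retrieve(question))                # retrieval
    prompt = f"Question: {question}\nContext: {context}"  # augmentation
    return generate(prompt)                               # generation
```

Note that the LLM never "learns" the documents; it only sees whatever the retriever injects into that one prompt.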
RAG is One of Agent's "Tools"
Core Point: RAG and Agent are not concepts at the same level.
RAG is one of the many tools an Agent can invoke, used for knowledge retrieval
Detailed Comparison
| Dimension | RAG | Agent |
|---|---|---|
| Essence | A knowledge enhancement technique | A system architecture |
| Purpose | Enable LLM to access external knowledge | Enable LLM to autonomously complete tasks |
| Capability | Retrieve + Answer | Plan + Execute + Feedback |
| Execute Operations | Cannot | Can |
| Relationship | Is a tool for Agent | Can include RAG |
How Agent Uses RAG
In Agent architecture, RAG is typically invoked as a "knowledge retrieval tool":
User: "According to company policy, how many days of annual leave can I apply for?"
Agent Thinking: This is a question requiring internal policy lookup, I need to search knowledge base
Agent Action: Call rag_search(query="annual leave days company policy")
Agent Observation: Results show "1-5 years tenure: 5 days, 5-10 years: 10 days, over 10 years: 15 days"
Agent Thinking: Got policy info, but need user's tenure for accurate answer
Agent Action: Ask user "How many years have you been with the company?"
User: "3 years"
Agent Output: According to company policy, with 3 years tenure, you can apply for 5 days of annual leave.
RAG and Agent Combination Patterns
| Pattern | Description | Use Case |
|---|---|---|
| RAG Only | Pure knowledge Q&A, no execution capability | Simple FAQ, document Q&A |
| Agent + RAG | Agent calls RAG for knowledge, then executes operations | Task execution requiring knowledge support |
| Agentic RAG | RAG flow itself controlled by Agent, can dynamically decide retrieval strategy | Complex knowledge reasoning |
Do Agents Need Model Fine-tuning?
Core Conclusion: Usually Not Needed
Agent's capability mainly comes from system architecture, not model fine-tuning.
Key Insights:
- LLM Base Capability: General LLMs (like GPT-4/Claude/Qwen) are sufficient, usually no fine-tuning needed
- System Architecture Capability: Tool definitions, memory system, prompt engineering, Agent Loop — these are the main sources of Agent capability, unrelated to model fine-tuning
Why Don't Agents Usually Need Fine-tuning?
| Reason | Explanation |
|---|---|
| LLM Capability is Sufficient | Modern LLMs (GPT-4, Claude 3, Qwen2.5, etc.) have sufficient general capability to support Agent's reasoning and planning needs |
| Capability Comes from Architecture | Agent's "execution capability" comes from tool invocation, not the model itself |
| Prompt Engineering is More Efficient | Through well-designed System Prompts, general LLMs can behave like domain experts |
| Greater Flexibility | No fine-tuning means you can switch base models anytime, benefiting from model iterations |
| Lower Cost | Fine-tuning requires data, compute, and time, while prompt engineering costs almost nothing |
When Might Fine-tuning Be Needed?
While usually not needed, consider fine-tuning in these scenarios:
| Scenario | Description | Fine-tuning Goal |
|---|---|---|
| Domain-specific Terminology | Model doesn't accurately understand specific domain terms | Improve domain language understanding |
| Specific Output Format | Need model to stably output specific formats | Enhance format compliance |
| Cost Optimization | Replace large model with small model + fine-tuning | Reduce inference cost |
| Private Deployment | Cannot call cloud LLM APIs | Improve local small model capability |
Fine-tuning vs Prompt Engineering vs RAG
| Technique | Purpose | Use Cases | Cost |
|---|---|---|---|
| Prompt Engineering | Guide model behavior | Almost all scenarios | Low |
| RAG | Inject external knowledge | Scenarios requiring specific knowledge | Medium |
| Fine-tuning (LoRA, etc.) | Change model parameters | Special formats/domains/cost optimization | High |
Recommended Strategy:
- Priority: Prompt Engineering + RAG
- Secondary: If effects are insufficient, consider fine-tuning
- Principle: Avoid fine-tuning if possible, maintain architecture flexibility
The Right Path for Agent Capability Enhancement
| Priority | Approach | Specific Work |
|---|---|---|
| 1 | Prompt Engineering | Optimize System Prompt, design clear tool descriptions, Few-shot examples |
| 2 | Tool Ecosystem | Enrich tool set, optimize tool interface design, tool composition capability |
| 3 | Knowledge Enhancement | Build high-quality knowledge base, optimize retrieval strategy, knowledge update mechanism |
| 4 | Model Fine-tuning | Domain terminology understanding, stable output format, cost optimization (only when necessary) |
Industry Case Studies
Manus: Why is it Popular? What Does it Do?
Manus is an AI Agent product that went viral in early 2025, hailed as "the first AI that can actually help you do things."
Manus's Core Capabilities
| Capability | Description | Examples |
|---|---|---|
| Browser Automation | Operates web pages like a human | Auto search, fill forms, download files |
| Code Execution | Write and run code | Data processing, file conversion, automation scripts |
| File Operations | Create, edit, manage files | Generate reports, organize documents, batch processing |
| Task Orchestration | Break down and execute complex tasks | Multi-step task automation |
Manus vs Native ChatGPT Comparison
| Task | ChatGPT's Response | Manus's Approach |
|---|---|---|
| "Help me create a competitor analysis report" | Gives you a report template and framework suggestions | Auto-searches competitor info, collects data, generates complete analysis report |
| "Check Beijing's weather tomorrow" | Tells you which website to check | Opens weather site, queries data, directly tells you the result |
| "Convert this PDF to Word" | Recommends some online conversion tools | Directly executes conversion, generates Word file for you |
| "Help me analyze this Excel data" | Needs you to paste data in | Directly reads file, executes analysis, generates charts |
Why Do People Prefer Manus?
One-liner: Manus transforms LLM from a "conversation tool" into an "execution assistant."
Specific reasons:
- From "Suggestion" to "Execution": Not just telling you how to do it, but actually doing it for you
- Reduced Human Intervention: Users only need to state the goal, no step-by-step guidance needed
- Cross-system Capability: Can operate multiple websites and tools simultaneously to complete complex tasks
- Result Delivery: Final output is usable files and data, not just text suggestions
OpenAI Operator / Claude Computer Use
In 2024-2025, OpenAI and Anthropic successively launched Computer Use capabilities:
| Product | Release Date | Core Capability |
|---|---|---|
| Claude Computer Use | Oct 2024 | Claude can control computers, operating mouse and keyboard like a human |
| OpenAI Operator | Jan 2025 | GPT can automatically execute web tasks like shopping, booking, etc. |
This indicates: Agent-ification is an inevitable trend for LLM applications, with leading vendors all moving in this direction.
MCP Protocol: Industry Standard for Agent Tool Invocation
In November 2024, Anthropic released the MCP (Model Context Protocol), which is becoming the de facto standard for Agent tool invocation:
| Feature | Description |
|---|---|
| Standardization | Unified tool description and invocation protocol |
| Ecosystem Sharing | Develop once, reuse across multiple Agents |
| Industry Recognition | Alipay, WeChat, and others have launched MCP Servers |
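What "unified tool description" means in practice: an MCP tool is declared with a name, a human-readable description, and a JSON-Schema contract for its arguments. The field names below follow the published MCP tool schema (`name`, `description`, `inputSchema`); the weather tool itself is invented for illustration:

```python
# Sketch of an MCP-style tool declaration, expressed as a Python dict.
get_weather_tool = {
    "name": "get_weather",
    "description": "Query the weather forecast for a city",
    "inputSchema": {                 # JSON Schema for the arguments
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["city"],
    },
}
```

Because the description and argument schema are machine-readable, any MCP-compatible Agent can discover and call the tool without bespoke glue code, which is what enables the "develop once, reuse across multiple Agents" property in the table above.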
How to Build an Agent System
Technical Component Checklist
Building a complete Agent system requires the following technical components:
Layer Details:
| Layer | Component | Specifics |
|---|---|---|
| 1 | Agent Runtime Framework | Agent Loop implementation, plan-execute-feedback cycle, multi-Agent collaboration. Options: LangGraph/AutoGPT/CrewAI/Custom |
| 2 | Tool System | Tool definition (MCP/Function Calling), tool registry, invocation engine, error handling & retry |
| 3 | Memory System | Short-term memory (conversation context), long-term memory (user profile, task history), working memory (task intermediate state) |
| 4 | Perception Enhancement | Multi-modal input processing, RAG knowledge retrieval, web search |
| 5 | Security Control | Input security (injection detection), execution security (permission control), output security (hallucination detection) |
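Layers 2 and 5 often meet in one component: a tool registry that both dispatches calls and enforces permissions. A minimal sketch, with invented tool names and a simple role allowlist as the security policy:

```python
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, fn, allowed_roles=("admin",)):
        self._tools[name] = (fn, set(allowed_roles))

    def call(self, name, role, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        fn, roles = self._tools[name]
        if role not in roles:                  # execution security
            return "denied: insufficient permissions"
        try:
            return fn(**kwargs)
        except Exception as e:                 # error handling & retry hook
            return f"tool error: {e}"

registry = ToolRegistry()
registry.register("delete_file", lambda path: f"deleted {path}",
                  allowed_roles=("admin",))
registry.register("read_file", lambda path: f"contents of {path}",
                  allowed_roles=("admin", "user"))
```

Returning an error string instead of raising lets the Agent Loop feed the failure back to the LLM as an observation, so the model can adjust instead of crashing the run.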
Minimal Viable Agent
To quickly experience Agent development, start with a minimal viable approach:
```python
# Pseudocode: Minimal Viable Agent
def minimal_agent(user_input, tools, llm):
    # 1. Send user input and available tool descriptions to the LLM
    prompt = f"""
You are an AI assistant that can use the following tools:
{format_tools(tools)}

User request: {user_input}

Analyze the user's needs, decide whether to call a tool, and how to respond.
"""
    while True:
        # 2. The LLM thinks and decides the next step
        response = llm.chat(prompt)
        # 3. If the LLM decides to call a tool
        if response.has_tool_call:
            tool_name = response.tool_call.name
            tool_args = response.tool_call.arguments
            # 4. Execute the tool call
            result = tools[tool_name].execute(tool_args)
            # 5. Feed the result back to the LLM, continue the loop
            prompt += f"\nTool {tool_name} returned: {result}"
        else:
            # 6. The LLM gives the final answer, end the loop
            return response.content
```

Recommended Framework Choices
| Framework | Features | Use Cases |
|---|---|---|
| LangGraph | By LangChain team, supports complex Agent flow orchestration | Complex multi-step Agents |
| AutoGPT | Early Agent framework, active community | General autonomous Agents |
| CrewAI | Focused on multi-Agent collaboration | Multi-role collaboration scenarios |
| OpenAI Assistants API | Official OpenAI solution, high integration | Quick prototyping, GPT users |
| Custom | Full control, deep customization | Production systems with special requirements |
Summary
This article systematically explored the technical evolution from LLM to Agent:
- Three Generations: Bare LLM (can only talk) → Workflow (follows script) → Agent (acts autonomously)
- Agent Core: Observe-Think-Act-Feedback loop, composed of LLM + Tools + Memory + Planning
- Agent vs RAG: RAG is a tool for Agent, used for knowledge enhancement; they're not concepts at the same level
- Fine-tuning Needed?: Usually not, Agent capability mainly comes from system architecture rather than model parameters
- Industry Trend: Agent-ification is the inevitable direction for LLM applications, with leading vendors all investing in this area
One-liner to understand Agent: Agent = LLM (Brain) + Tools (Hands & Feet) + Memory + Planning Capability, enabling AI to evolve from "can talk" to "can act."
Appendix: Technical Terminology
| Term | Full Name | Explanation |
|---|---|---|
| Agent | Agent | Intelligent entity capable of perceiving environment, making autonomous decisions, and executing actions |
| LLM | Large Language Model | Large language models like GPT-4, Claude, Qwen, etc. |
| Workflow | Workflow | Predefined task execution flow |
| MCP | Model Context Protocol | A protocol proposed by Anthropic to standardize Agent tool invocation |
| RAG | Retrieval Augmented Generation | Enhancing LLM responses by retrieving external knowledge |
| ReAct | Reasoning + Acting | An Agent paradigm alternating between reasoning and acting |
| CoT | Chain of Thought | Prompting technique that makes LLM show reasoning process |
| Tool Calling | Tool Calling / Function Calling | Capability for LLM to invoke external tools |
| LoRA | Low-Rank Adaptation | An efficient model fine-tuning method |