
From LLM to Agent: A Deep Dive into AI Agent Architecture Evolution

Introduction

When discussing AI Agents, the following questions are frequently raised:

  • How does an Agent work?
  • What's the difference between using an Agent and directly using ChatGPT/Qwen/DeepSeek?
  • What does Manus do? Why do people prefer Manus over native ChatGPT?
  • What's the relationship between Agent and RAG?
  • Do you need to fine-tune models for building Agents?

The essence of these questions is: What exactly is an "Agent"? What's the fundamental difference between it and directly calling a large model? Is it just a rebranded concept?

This article will systematically answer all these questions with clear explanations and diagrams.

Three Generations of Architecture Evolution: LLM → Workflow → Agent

The technical architecture of intelligent applications has evolved through three generations: Bare LLM → Workflow → Agent

First Generation: Bare LLM (Conversation Is the Endpoint)

Representative Products: ChatGPT, Claude, Qwen, DeepSeek native chat interfaces

Architecture Characteristics: single request-response, stateless, no execution capability.

Capability Boundaries:

| What it can do | What it cannot do |
|---|---|
| Answer questions, generate text | Execute actual operations (e.g., query a database, call APIs) |
| Provide suggestions and solutions | Get real-time information (knowledge has a cutoff date) |
| Code generation, copywriting | Cross-system collaboration for complex tasks |
| Single or simple multi-turn conversations | Long-term task tracking and state management |

Essential Positioning: A knowledgeable consultant who "can only talk but cannot act."
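In code, "can only talk" is just a single stateless request-response. A minimal sketch using the OpenAI Python SDK as one example (the model name and question are illustrative):

```python
# A single, stateless request-response with a bare LLM (OpenAI SDK shown as an example).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "How do I export a PostgreSQL table to CSV?"}],
)

# The model can only return text -- it cannot connect to your database or run anything.
print(response.choices[0].message.content)
```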

Second Generation: Workflow (Orchestrating the LLM)

Representative Products: Dify, Coze, Baidu Qianfan, LangChain (Chain mode), etc.

Architecture Characteristics: humans design the flow, and the LLM executes within preset nodes. All possible execution paths must be designed in advance.

Improvements over Bare LLM:

| Improvement | Description |
|---|---|
| Can execute operations | Call external systems through API nodes |
| Flow orchestration | Multi-step tasks can be executed in sequence |
| Conditional branching | Different paths for different situations |
| Knowledge enhancement | Can integrate RAG for knowledge retrieval |

Core Limitations:

| Problem | Manifestation |
|---|---|
| Branch Explosion | Business scenario combinations grow exponentially, impossible to enumerate |
| Preset Paths | All possible execution paths must be designed in advance |
| Long-tail Issues | Edge cases are difficult to cover |
| Poor Flexibility | New scenarios require flow modifications, slow response to changes |

Essential Positioning: LLM is "orchestrated" within preset flows, flows are designed by humans, LLM is just an execution node in the flow.
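To make the contrast concrete, here is a sketch of a human-designed workflow in the spirit of this generation: every branch is enumerated in advance, and the LLM only fills in individual nodes. The `classify_intent`, `call_llm`, `query_order_api`, and `rag_search` helpers are hypothetical placeholders, not a specific product's API.

```python
# Hypothetical workflow sketch: the flow is fixed by a human; the LLM is just one node.
def handle_request(user_input: str) -> str:
    intent = classify_intent(user_input)          # node 1: preset classifier

    if intent == "order_status":                  # branch designed in advance
        order = query_order_api(user_input)       # node 2: fixed API call
        return call_llm(f"Summarize this order status for the user: {order}")
    elif intent == "faq":
        doc = rag_search(user_input)              # node 3: fixed knowledge lookup
        return call_llm(f"Answer using this document: {doc}\n\nQuestion: {user_input}")
    else:
        # Any scenario not enumerated above falls through -- the "long tail" problem.
        return call_llm(user_input)
```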

Third Generation: Agent (Autonomous Intelligent Agent)

Representative Products: Manus, OpenAI Operator, Claude Computer Use, AutoGPT, MetaGPT

Architecture Characteristics: the LLM decides what to do, how to do it, and which tools to use.

Core Capabilities:

| Capability | Description |
|---|---|
| Autonomous Planning | The LLM decomposes tasks and decides execution steps |
| Tool Invocation | Autonomously selects appropriate tools as needed |
| Feedback Loop | Observes results after execution, dynamically adjusts strategy |
| Long-running Tasks | Can continuously execute complex multi-step tasks |

Essential Positioning: LLM transforms from "being orchestrated" to "the orchestrator," deciding what to call, what to execute, and what to do next.

Summary: Three Generations Compared

| Dimension | Bare LLM | Workflow | Agent |
|---|---|---|---|
| Representative Products | ChatGPT/Qwen/DeepSeek | Dify/Coze | Manus/Operator |
| Execution Capability | None, text output only | Yes, but paths preset | Yes, autonomous decisions |
| Flow Control | None | Human-designed flows | LLM autonomous planning |
| Tool Invocation | None | Preset node calls | Autonomous selection |
| Handling Long-tail | Poor | Average | Good |
| Maintenance Cost | Low | High (branch explosion) | Low (no need to enumerate) |
| Flexibility | High (but no execution) | Low (fixed paths) | High |
| Analogy | Consultant who only talks | Actor following a script | Assistant who gets things done |

One-liner Summary:

  • Bare LLM: Tells you "how to do it"
  • Workflow: "Does it for you" following preset paths
  • Agent: Figures out how to "get it done for you"

Core Working Principles of Agent

Agent Loop: Observe-Think-Act-Feedback

The core of an Agent is a continuously running loop called the Agent Loop:

Four Phases Explained:

| Phase | Core Actions |
|---|---|
| Observe | Receive user input, sense current state, get environment information |
| Think | Understand the task goal, analyze current state, plan the next step, select tools |
| Act | Invoke selected tools, execute specific operations, interact with external systems |
| Feedback | Observe execution results, judge if the goal is achieved, decide to continue/adjust/end |
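The four phases above map naturally onto a loop. A schematic sketch only; every helper name here (`llm.decide`, `decision.tool`, etc.) is a placeholder rather than a real framework's API:

```python
# Schematic Agent Loop: Observe -> Think -> Act -> Feedback (placeholder helpers).
def agent_loop(task, llm, tools, max_steps=10):
    state = {"task": task, "history": []}              # Observe: initial input and state
    for _ in range(max_steps):
        decision = llm.decide(state)                   # Think: plan the next step, pick a tool
        if decision.is_final:
            return decision.answer                     # Feedback: goal reached, stop
        result = tools[decision.tool](**decision.args) # Act: invoke the chosen tool
        state["history"].append((decision, result))    # Feedback: observe the result, adjust
    return "Stopped: step limit reached."
```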

Core Components of Agent

A complete Agent system contains four core components:

Four Components Explained:

| Component | Function | Description |
|---|---|---|
| LLM (Brain) | Understanding, reasoning, decision-making | The thinking center of the Agent; can be GPT-4/Claude/Qwen, etc. |
| Tool Set (Hands & Feet) | Execute external operations | API calls, database queries, code execution, web browsing, etc. |
| Memory System | State management | Short-term memory (conversation context), long-term memory (user profile), working memory (task state) |
| Planner | Task decomposition | Decompose complex tasks into sub-steps, determine execution order, dynamically adjust plans |
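One way to picture how the four components compose is a plain container object. The class below is purely illustrative, not any framework's API:

```python
# Illustrative composition of the four components (not a real framework's API).
from dataclasses import dataclass, field

@dataclass
class Agent:
    llm: object                                   # Brain: any chat-capable model client
    tools: dict = field(default_factory=dict)     # Hands & feet: tool name -> callable
    memory: list = field(default_factory=list)    # Memory: conversation and task state
    plan: list = field(default_factory=list)      # Planner output: remaining sub-steps

    def register_tool(self, name, fn):
        self.tools[name] = fn

    def remember(self, item):
        self.memory.append(item)
```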

Technical Implementation Paradigms

Current mainstream Agent implementation paradigms in the industry:

ReAct (Reasoning + Acting)

Core Idea: Let LLM alternate between "reasoning" and "acting," thinking about why before each action.

User: Check the weather in Beijing tomorrow

Agent Thinking: User wants to know Beijing's weather tomorrow, I need to call weather query tool
Agent Action: Call get_weather(city="Beijing", date="tomorrow")
Agent Observation: Result shows sunny, temperature 15-25°C
Agent Thinking: Got weather data, can now answer the user
Agent Output: Beijing will be sunny tomorrow, 15-25°C, great for outdoor activities.

Pros: Transparent reasoning process, easy to debug and audit
Cons: Every step requires LLM reasoning, lower efficiency
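In practice, ReAct is often implemented by prompting the model to emit interleaved `Thought:` and `Action:` lines and parsing the `Action:` line before executing it. A small sketch of that parsing step; the exact line format and the `get_weather` tool are assumptions, not a fixed standard:

```python
import re

# Sketch: parse one ReAct step of the form: Action: get_weather(city="Beijing", date="tomorrow")
ACTION_RE = re.compile(r'Action:\s*(\w+)\((.*)\)')

def parse_react_action(llm_output: str):
    """Return (tool_name, raw_args) if the model chose an action, else None (final answer)."""
    match = ACTION_RE.search(llm_output)
    if not match:
        return None
    return match.group(1), match.group(2)

step = 'Thought: I need the forecast.\nAction: get_weather(city="Beijing", date="tomorrow")'
print(parse_react_action(step))   # ('get_weather', 'city="Beijing", date="tomorrow"')
```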

Plan-and-Execute

Core Idea: First make a complete plan, then execute step by step.

User: Help me create an analysis report on the EV market

Agent Planning:
  1. Search for latest EV market data
  2. Collect sales information of major manufacturers
  3. Analyze market trends and competitive landscape
  4. Generate charts and visualizations
  5. Compile into a complete report document

Agent Execution: Execute steps according to plan...

Pros: More organized handling of complex tasks
Cons: Once planned, less flexibility to adjust
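A minimal sketch of the plan-then-execute split, assuming a hypothetical `llm` client that can both draft a plan and carry out individual steps:

```python
# Hypothetical Plan-and-Execute sketch: plan once up front, then work through the steps.
def plan_and_execute(goal: str, llm, tools) -> str:
    # 1. Planning phase: ask the LLM for an ordered list of sub-steps.
    plan = llm.make_plan(goal)      # e.g. ["search market data", "collect sales info", ...]

    results = []
    # 2. Execution phase: run each step in order, passing earlier results as context.
    for step in plan:
        results.append(llm.execute_step(step, context=results, tools=tools))

    # 3. Compile the intermediate results into the final deliverable.
    return llm.summarize(goal, results)
```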

Multi-Agent Collaboration

Core Idea: Multiple specialized Agents work together, each handling their expertise.

Pros: Professional division of labor, strong capability for complex tasks
Cons: High system complexity, high coordination costs
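A stripped-down sketch of the idea: one "researcher" agent gathers material and a "writer" agent turns it into a report. Both `researcher` and `writer` are hypothetical objects wrapping an LLM with their own system prompts.

```python
# Hypothetical two-agent pipeline: specialized roles passing work to each other.
def write_report(topic: str, researcher, writer) -> str:
    findings = researcher.run(f"Collect key facts and figures about: {topic}")
    draft = writer.run(f"Write a structured report on '{topic}' using:\n{findings}")
    review = researcher.run(f"Fact-check this draft and list corrections:\n{draft}")
    return writer.run(f"Revise the draft based on these corrections:\n{review}\n\n{draft}")
```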

Relationship Between Agent and RAG

What is RAG?

RAG (Retrieval Augmented Generation) is a technique that enables LLMs to access external knowledge:

Three Steps:

  1. Retrieval: Convert question to vector, retrieve relevant documents from knowledge base
  2. Augmentation: Inject retrieved document snippets as context into Prompt
  3. Generation: LLM generates answer based on context
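The three steps translate almost directly into code. A hedged sketch, assuming a generic `embed` function, a vector store with a `search` method, and an `llm.chat` call; none of these are a specific library's API:

```python
# Sketch of the Retrieve -> Augment -> Generate pipeline (generic, library-agnostic).
def rag_answer(question: str, embed, vector_store, llm) -> str:
    # 1. Retrieval: embed the question and fetch the most relevant chunks.
    query_vec = embed(question)
    docs = vector_store.search(query_vec, top_k=3)

    # 2. Augmentation: inject the retrieved snippets into the prompt as context.
    context = "\n\n".join(d.text for d in docs)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

    # 3. Generation: the LLM answers grounded in the retrieved context.
    return llm.chat(prompt)
```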

RAG is One of Agent's "Tools"

Core Point: RAG and Agent are not concepts at the same level.

RAG is one of the many tools an Agent can invoke, used for knowledge retrieval

Detailed Comparison

| Dimension | RAG | Agent |
|---|---|---|
| Essence | A knowledge enhancement technique | A system architecture |
| Purpose | Enable the LLM to access external knowledge | Enable the LLM to autonomously complete tasks |
| Capability | Retrieve + Answer | Plan + Execute + Feedback |
| Execute Operations | Cannot | Can |
| Relationship | Is a tool for the Agent | Can include RAG |

How Agent Uses RAG

In Agent architecture, RAG is typically invoked as a "knowledge retrieval tool":

User: "According to company policy, how many days of annual leave can I apply for?"

Agent Thinking: This is a question requiring internal policy lookup, I need to search knowledge base
Agent Action: Call rag_search(query="annual leave days company policy")
Agent Observation: Results show "1-5 years tenure: 5 days, 5-10 years: 10 days, over 10 years: 15 days"
Agent Thinking: Got policy info, but need user's tenure for accurate answer
Agent Action: Ask user "How many years have you been with the company?"
User: "3 years"
Agent Output: According to company policy, with 3 years tenure, you can apply for 5 days of annual leave.
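In implementation terms, this just means exposing the retrieval step as one more tool the Agent can choose to call. A sketch; `embed`, `vector_store`, and `get_weather` are illustrative placeholders:

```python
# Sketch: exposing knowledge retrieval as an ordinary tool the Agent can choose to invoke.
def make_rag_tool(embed, vector_store, top_k: int = 3):
    """Wrap a RAG retrieval step as a plain callable tool (names here are illustrative)."""
    def rag_search(query: str) -> str:
        docs = vector_store.search(embed(query), top_k=top_k)
        return "\n\n".join(d.text for d in docs)   # raw snippets; the Agent decides the next step
    return rag_search

# Registration is just a dictionary entry alongside any other tools, e.g.:
# tools = {"rag_search": make_rag_tool(embed, vector_store), "get_weather": get_weather}
```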

RAG and Agent Combination Patterns

| Pattern | Description | Use Case |
|---|---|---|
| RAG Only | Pure knowledge Q&A, no execution capability | Simple FAQ, document Q&A |
| Agent + RAG | Agent calls RAG for knowledge, then executes operations | Task execution requiring knowledge support |
| Agentic RAG | The RAG flow itself is controlled by the Agent, which can dynamically decide the retrieval strategy | Complex knowledge reasoning |

Does Agent Need Model Fine-tuning?

Core Conclusion: Usually Not Needed

Agent's capability mainly comes from system architecture, not model fine-tuning.

Key Insights:

  • LLM Base Capability: General LLMs (like GPT-4/Claude/Qwen) are sufficient, usually no fine-tuning needed
  • System Architecture Capability: Tool definitions, memory system, prompt engineering, Agent Loop — these are the main sources of Agent capability, unrelated to model fine-tuning

Why Agent Usually Doesn't Need Fine-tuning?

| Reason | Explanation |
|---|---|
| LLM Capability is Sufficient | Modern LLMs (GPT-4, Claude 3, Qwen2.5, etc.) have sufficient general capability to support an Agent's reasoning and planning needs |
| Capability Comes from Architecture | An Agent's "execution capability" comes from tool invocation, not the model itself |
| Prompt Engineering is More Efficient | Through well-designed System Prompts, general LLMs can behave like domain experts |
| Greater Flexibility | No fine-tuning means you can switch base models anytime and benefit from model iterations |
| Lower Cost | Fine-tuning requires data, compute, and time, while prompt engineering costs almost nothing |
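As a concrete illustration of the "prompt engineering is more efficient" row: a carefully written System Prompt is often enough to make a general model behave like a domain specialist, with no training run at all. The company name and wording below are only an example:

```python
# Example of steering a general LLM with a System Prompt instead of fine-tuning.
SYSTEM_PROMPT = """You are an HR policy assistant for ACME Corp.
- Answer only from the retrieved policy documents provided in the context.
- Quote the policy section you relied on.
- If the policy does not cover the question, say so instead of guessing.
- Always reply in the user's language."""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "How many days of annual leave do I get after 3 years?"},
]
# `messages` can now be sent to any chat-completion style API; swapping the base model
# requires no retraining, only this prompt.
```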

When Might Fine-tuning Be Needed?

While usually not needed, consider fine-tuning in these scenarios:

| Scenario | Description | Fine-tuning Goal |
|---|---|---|
| Domain-specific Terminology | The model does not accurately understand specific domain terms | Improve domain language understanding |
| Specific Output Format | Need the model to reliably output specific formats | Enhance format compliance |
| Cost Optimization | Replace a large model with a small model + fine-tuning | Reduce inference cost |
| Private Deployment | Cannot call cloud LLM APIs | Improve local small-model capability |

Fine-tuning vs Prompt Engineering vs RAG

| Technique | Purpose | Use Cases | Cost |
|---|---|---|---|
| Prompt Engineering | Guide model behavior | Almost all scenarios | Low |
| RAG | Inject external knowledge | Scenarios requiring specific knowledge | Medium |
| Fine-tuning (LoRA, etc.) | Change model parameters | Special formats/domains/cost optimization | High |

Recommended Strategy:

  1. Priority: Prompt Engineering + RAG
  2. Secondary: If effects are insufficient, consider fine-tuning
  3. Principle: Avoid fine-tuning if possible, maintain architecture flexibility

The Right Path for Agent Capability Enhancement

| Priority | Approach | Specific Work |
|---|---|---|
| 1 | Prompt Engineering | Optimize the System Prompt, design clear tool descriptions, few-shot examples |
| 2 | Tool Ecosystem | Enrich the tool set, optimize tool interface design, tool composition capability |
| 3 | Knowledge Enhancement | Build a high-quality knowledge base, optimize retrieval strategy, knowledge update mechanism |
| 4 | Model Fine-tuning | Domain terminology understanding, stable output format, cost optimization (only when necessary) |

Industry Case Studies

Manus

Manus is an AI Agent product that went viral in early 2025, described as "the first AI that can actually help you do things."

Manus's Core Capabilities

| Capability | Description | Examples |
|---|---|---|
| Browser Automation | Operates web pages like a human | Auto search, fill forms, download files |
| Code Execution | Writes and runs code | Data processing, file conversion, automation scripts |
| File Operations | Creates, edits, and manages files | Generate reports, organize documents, batch processing |
| Task Orchestration | Breaks down and executes complex tasks | Multi-step task automation |

Manus vs Native ChatGPT Comparison

| Task | ChatGPT's Response | Manus's Approach |
|---|---|---|
| "Help me create a competitor analysis report" | Gives you a report template and framework suggestions | Auto-searches competitor info, collects data, generates a complete analysis report |
| "Check Beijing's weather tomorrow" | Tells you which website to check | Opens a weather site, queries the data, directly tells you the result |
| "Convert this PDF to Word" | Recommends some online conversion tools | Directly executes the conversion, generates the Word file for you |
| "Help me analyze this Excel data" | Needs you to paste the data in | Directly reads the file, runs the analysis, generates charts |

Why Do People Prefer Manus?

One-liner: Manus transforms LLM from a "conversation tool" into an "execution assistant."

Specific reasons:

  1. From "Suggestion" to "Execution": Not just telling you how to do it, but actually doing it for you
  2. Reduced Human Intervention: Users only need to state the goal, no step-by-step guidance needed
  3. Cross-system Capability: Can operate multiple websites and tools simultaneously to complete complex tasks
  4. Result Delivery: Final output is usable files and data, not just text suggestions

OpenAI Operator / Claude Computer Use

In 2024-2025, OpenAI and Anthropic successively launched Computer Use capabilities:

| Product | Release Date | Core Capability |
|---|---|---|
| Claude Computer Use | Oct 2024 | Claude can control a computer, operating mouse and keyboard like a human |
| OpenAI Operator | Jan 2025 | GPT can automatically execute web tasks like shopping, booking, etc. |

This indicates: Agent-ification is an inevitable trend for LLM applications, with leading vendors all moving in this direction.

MCP Protocol: Industry Standard for Agent Tool Invocation

In November 2024, Anthropic released the MCP (Model Context Protocol), which is becoming the de facto standard for Agent tool invocation:

| Feature | Description |
|---|---|
| Standardization | Unified tool description and invocation protocol |
| Ecosystem Sharing | Develop once, reuse across multiple Agents |
| Industry Recognition | Alipay, WeChat, and others have launched MCP Servers |
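At its core, an MCP Server simply advertises tools with a name, a description, and an input schema. A minimal sketch, assuming the official MCP Python SDK (`mcp` package) and its FastMCP helper; the weather tool itself is a made-up example:

```python
# Minimal MCP server sketch (assumes the official `mcp` Python SDK is installed).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-demo")

@mcp.tool()
def get_weather(city: str, date: str) -> str:
    """Return a short weather summary for a city on a given date."""
    # A real implementation would call a weather API here.
    return f"Weather for {city} on {date}: sunny, 15-25 degrees C"

if __name__ == "__main__":
    mcp.run()   # exposes the tool to any MCP-compatible Agent client
```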

How to Build an Agent System

Technical Component Checklist

Building a complete Agent system requires the following technical components:

Layer Details:

| Layer | Component | Specifics |
|---|---|---|
| 1 | Agent Runtime Framework | Agent Loop implementation, plan-execute-feedback cycle, multi-Agent collaboration. Options: LangGraph/AutoGPT/CrewAI/custom |
| 2 | Tool System | Tool definition (MCP/Function Calling), tool registry, invocation engine, error handling & retry |
| 3 | Memory System | Short-term memory (conversation context), long-term memory (user profile, task history), working memory (task intermediate state) |
| 4 | Perception Enhancement | Multi-modal input processing, RAG knowledge retrieval, web search |
| 5 | Security Control | Input security (injection detection), execution security (permission control), output security (hallucination detection) |

Minimal Viable Agent

To quickly experience Agent development, start with a minimal viable approach:

```python
# Pseudocode: Minimal Viable Agent
def minimal_agent(user_input, tools, llm):
    # 1. Send user input and available tool descriptions to the LLM
    prompt = f"""
    You are an AI assistant that can use the following tools:
    {format_tools(tools)}

    User request: {user_input}

    Analyze the user's needs, decide whether to call a tool, and how to respond.
    """

    while True:
        # 2. LLM thinks and decides the next step
        response = llm.chat(prompt)

        # 3. If the LLM decides to call a tool
        if response.has_tool_call:
            tool_name = response.tool_call.name
            tool_args = response.tool_call.arguments

            # 4. Execute the tool call
            result = tools[tool_name].execute(tool_args)

            # 5. Feed the result back to the LLM, continue the loop
            prompt += f"\nTool {tool_name} returned: {result}"
        else:
            # 6. LLM gives the final answer, end the loop
            return response.content
```

Framework Selection

| Framework | Features | Use Cases |
|---|---|---|
| LangGraph | By the LangChain team, supports complex Agent flow orchestration | Complex multi-step Agents |
| AutoGPT | Early Agent framework, active community | General autonomous Agents |
| CrewAI | Focused on multi-Agent collaboration | Multi-role collaboration scenarios |
| OpenAI Assistants API | Official OpenAI solution, high integration | Quick prototyping, GPT users |
| Custom | Full control, deep customization | Production systems with special requirements |

Summary

This article systematically explored the technical evolution from LLM to Agent:

  1. Three Generations: Bare LLM (can only talk) → Workflow (follows script) → Agent (acts autonomously)
  2. Agent Core: Observe-Think-Act-Feedback loop, composed of LLM + Tools + Memory + Planning
  3. Agent vs RAG: RAG is a tool for Agent, used for knowledge enhancement; they're not concepts at the same level
  4. Fine-tuning Needed?: Usually not; Agent capability comes mainly from system architecture rather than model parameters
  5. Industry Trend: Agent-ification is the inevitable direction for LLM applications, with leading vendors all investing in this area

One-liner to understand Agent: Agent = LLM (Brain) + Tools (Hands & Feet) + Memory + Planning Capability, enabling AI to evolve from "can talk" to "can act."

Appendix: Technical Terminology

| Term | Full Name | Explanation |
|---|---|---|
| Agent | Agent | An intelligent entity capable of perceiving its environment, making autonomous decisions, and executing actions |
| LLM | Large Language Model | Large language models such as GPT-4, Claude, Qwen, etc. |
| Workflow | Workflow | A predefined task execution flow |
| MCP | Model Context Protocol | A protocol proposed by Anthropic as a standard for Agent tool invocation |
| RAG | Retrieval Augmented Generation | Enhancing LLM responses by retrieving external knowledge |
| ReAct | Reasoning + Acting | An Agent paradigm that alternates between reasoning and acting |
| CoT | Chain of Thought | A prompting technique that makes the LLM show its reasoning process |
| Tool Calling | Tool Calling / Function Calling | The capability for an LLM to invoke external tools |
| LoRA | Low-Rank Adaptation | An efficient model fine-tuning method |

License

This article is licensed under CC BY-NC-SA 4.0. You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial — You may not use the material for commercial purposes.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
