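Agent Deployment and Optimization

Moving from prototype to production is what decides whether an Agent project succeeds. This chapter covers Agent deployment architectures, performance optimization strategies, and operations best practices.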

1. Deployment Architecture

1.1 Deployment Modes Compared

┌─────────────────────────────────────────────────────────────┐
│               Agent Deployment Modes Compared               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Mode 1: Cloud deployment                                  │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                                                     │   │
│   │  User → CDN → Load balancer → Agent service cluster │   │
│   │                                   ↓                 │   │
│   │                           LLM API / vector database │   │
│   │                                                     │   │
│   │  Pros:                                              │   │
│   │  • Scales elastically with traffic swings           │   │
│   │  • Low operations cost                              │   │
│   │  • High availability                                │   │
│   │                                                     │   │
│   │  Cons:                                              │   │
│   │  • Data must be uploaded to the cloud               │   │
│   │  • Heavy dependence on the network                  │   │
│   │                                                     │   │
│   │  Best for:                                          │   │
│   │  • SaaS products                                    │   │
│   │  • Public-facing applications                       │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Mode 2: Hybrid deployment                                 │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                                                     │   │
│   │  User → Intranet Agent service → Cloud LLM API      │   │
│   │                   ↓                                 │   │
│   │         Local vector database                       │   │
│   │                                                     │   │
│   │  Pros:                                              │   │
│   │  • Sensitive data stays on-premises                 │   │
│   │  • Can still use cloud LLM capabilities             │   │
│   │                                                     │   │
│   │  Cons:                                              │   │
│   │  • More complex architecture                        │   │
│   │  • Requires connectivity to the cloud               │   │
│   │                                                     │   │
│   │  Best for:                                          │   │
│   │  • Internal enterprise applications                 │   │
│   │  • Data-sensitive workloads                         │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Mode 3: Private (on-premises) deployment                  │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                                                     │   │
│   │  User → Intranet Agent service → Local LLM          │   │
│   │                   ↓                  ↓              │   │
│   │         Local vector database    GPU cluster        │   │
│   │                                                     │   │
│   │  Pros:                                              │   │
│   │  • Data is fully isolated                           │   │
│   │  • No external dependencies                         │   │
│   │  • Low response latency                             │   │
│   │                                                     │   │
│   │  Cons:                                              │   │
│   │  • High upfront cost (GPUs)                         │   │
│   │  • Complex operations                               │   │
│   │  • Limited model capability                         │   │
│   │                                                     │   │
│   │  Best for:                                          │   │
│   │  • High-security sectors (finance, government)      │   │
│   │  • Offline environments                             │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
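
The "best for" entries in the comparison can be read as a rough decision rule. The sketch below encodes one such reading; the `Mode` enum, the `suggest_mode` function, and its three flags are hypothetical names introduced for illustration, and a real decision would also weigh cost, team skills, and compliance requirements.

```python
from enum import Enum


class Mode(Enum):
    CLOUD = "cloud"
    HYBRID = "hybrid"
    PRIVATE = "private"


def suggest_mode(*, offline: bool, strict_isolation: bool, sensitive_data: bool) -> Mode:
    """Map coarse requirements to one of the three deployment modes.

    Assumption: the flags are simplifications of the scenarios listed above.
    """
    # No outside connectivity, or data may never leave the premises
    if offline or strict_isolation:
        return Mode.PRIVATE
    # Sensitive data stays local, but a cloud LLM API may still be called
    if sensitive_data:
        return Mode.HYBRID
    # Everything else: public-facing or SaaS-style workloads
    return Mode.CLOUD


if __name__ == "__main__":
    print(suggest_mode(offline=False, strict_isolation=False, sensitive_data=True))
```

The precedence (private before hybrid before cloud) is a design choice of this sketch: the strictest constraint wins.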

1.2 Microservice Architecture

┌─────────────────────────────────────────────────────────────┐
│               Agent Microservice Architecture               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                     API Gateway                     │   │
│   │  • Auth  • Rate limiting  • Routing  • Logging      │   │
│   └─────────────────────────────────────────────────────┘   │
│                           │                                 │
│          ┌────────────────┼────────────────┐                │
│          ↓                ↓                ↓                │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│   │   Session   │  │    Agent    │  │   Memory    │         │
│   │   Service   │  │   Service   │  │   Service   │         │
│   │             │  │             │  │             │         │
│   │ Session mgmt│  │ Agent core  │  │ Memory store│         │
│   └─────────────┘  └─────────────┘  └─────────────┘         │
│          │                │                │                │
│          │        ┌───────┴───────┐        │                │
│          │        ↓               ↓        │                │
│          │ ┌─────────────┐ ┌─────────────┐ │                │
│          │ │    Tool     │ │     LLM     │ │                │
│          │ │   Service   │ │   Gateway   │ │                │
│          │ │             │ │             │ │                │
│          │ │ Runs tools  │ │ Model calls │ │                │
│          │ └─────────────┘ └─────────────┘ │                │
│          │                        │        │                │
│          └────────────────────────┼────────┘                │
│                                   ↓                         │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                     Data layer                      │   │
│   │  ┌────────────┐ ┌────────────┐ ┌────────────┐       │   │
│   │  │   Redis    │ │  Postgres  │ │ Vector DB  │       │   │
│   │  │ (sessions) │ │(persistent)│ │(knowledge) │       │   │
│   │  └────────────┘ └────────────┘ └────────────┘       │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
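
Rate limiting is one of the gateway duties listed in the diagram. As an illustration only, a token bucket could look like the sketch below. The `TokenBucket` class and its parameters are hypothetical, and a production gateway would more likely use the limiter built into its proxy or API gateway product.

```python
import time


class TokenBucket:
    """Hypothetical per-client rate limiter (token bucket)."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable time source, handy in tests
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, capped at the bucket capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    print([bucket.allow() for _ in range(12)])
```

Because LLM requests vary widely in size, `cost` could be set from the estimated token count rather than fixed at one per request.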

2. Performance Optimization

2.1 Optimization Directions

┌─────────────────────────────────────────────────────────────┐
│          Agent Performance Optimization Directions          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   1. Model call optimization                                │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                                                     │   │
│   │  Strategy                    │ Impact   │ Complexity│   │
│   │  ────────────────────────────┼──────────┼───────────│   │
│   │  Model selection             │ High     │ Low       │   │
│   │  • Small model, simple task  │          │           │   │
│   │  • Large model, complex task │          │           │   │
│   │                              │          │           │   │
│   │  Prompt compression          │ Medium   │ Medium    │   │
│   │  • Remove redundant info     │          │           │   │
│   │  • Use templates             │          │           │   │
│   │                              │          │           │   │
│   │  Batched requests            │ High     │ Medium    │   │
│   │  • Merge multiple calls      │          │           │   │
│   │  • Process in parallel       │          │           │   │
│   │                              │          │           │   │
│   │  Caching                     │ High     │ Low       │   │
│   │  • Cache identical requests  │          │           │   │
│   │  • Cache vector embeddings   │          │           │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   2. Retrieval optimization                                 │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                                                     │   │
│   │  • Indexing: suitable vector index (HNSW, IVF)      │   │
│   │  • Chunking: sensible document chunk sizes          │   │
│   │  • Precomputation: precompute hot queries           │   │
│   │  • Caching: cache query results                     │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   3. Execution optimization                                 │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                                                     │   │
│   │  • Streaming output: lower time to first token      │   │
│   │  • Async processing: non-blocking execution         │   │
│   │  • Early termination: return once conditions met    │   │
│   │  • Parallel tool calls: for independent tools       │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
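
The diagram lists embedding caching under the caching strategy. A minimal sketch of the idea follows, assuming embeddings are deterministic for a given text and model. `EmbeddingCache` and `embed_fn` are hypothetical names; `embed_fn` stands in for whatever embedding client the project uses.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoizes embeddings by content hash (illustrative helper)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self.embed_fn = embed_fn                  # assumed: text -> vector
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> List[float]:
        # Hash the text so long documents do not become huge dict keys
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self.embed_fn(text)
        self._store[key] = vector
        return vector


if __name__ == "__main__":
    cache = EmbeddingCache(lambda text: [float(len(text))])  # toy embedder
    cache.get("refund policy")
    cache.get("refund policy")
    print(cache.hits, cache.misses)
```

The store here is unbounded; a long-running service would cap it or move it to an external cache, and would include the embedding model name in the key if several models are in use.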

2.2 Optimization Code

"""
Agent performance optimization
"""

import asyncio
import hashlib
from typing import List, Dict, Any, Optional
 
 
class OptimizedAgent:
    """Optimized Agent."""

    def __init__(
        self,
        llm_small=None,
        llm_large=None,
        cache=None,
        vector_store=None
    ):
        self.llm_small = llm_small      # fast model
        self.llm_large = llm_large      # strong model
        self.cache = cache              # result cache
        self.vector_store = vector_store

    async def run(self, query: str, complexity: str = "auto") -> Dict:
        """
        Run the Agent.

        Args:
            query: user query
            complexity: task complexity (simple/complex/auto)
        """
        # 1. Check the cache
        cache_key = self._get_cache_key(query)
        cached = await self._get_from_cache(cache_key)
        if cached is not None:
            return {**cached, "from_cache": True}

        # 2. Estimate complexity automatically
        if complexity == "auto":
            complexity = await self._estimate_complexity(query)

        # 3. Pick the model
        llm = self.llm_small if complexity == "simple" else self.llm_large

        # 4. Execute (streaming)
        result = await self._execute_with_streaming(query, llm)

        # 5. Cache the result
        await self._save_to_cache(cache_key, result)

        return result

    def _get_cache_key(self, query: str) -> str:
        """Build the cache key."""
        # MD5 serves only as a compact fingerprint here, not for security
        return hashlib.md5(query.encode()).hexdigest()

    async def _get_from_cache(self, key: str) -> Optional[Dict]:
        """Read from the cache."""
        if self.cache:
            return await self.cache.get(key)
        return None

    async def _save_to_cache(self, key: str, value: Dict):
        """Write to the cache."""
        if self.cache:
            await self.cache.set(key, value, ttl=3600)

    async def _estimate_complexity(self, query: str) -> str:
        """Estimate task complexity."""
        # Crude keyword rules; substring matching can misfire
        simple_keywords = ["what is", "look up", "get", "show"]
        complex_keywords = ["analyze", "compare", "reason", "plan", "design"]

        query_lower = query.lower()

        # Complex keywords win when both kinds appear
        for kw in complex_keywords:
            if kw in query_lower:
                return "complex"

        for kw in simple_keywords:
            if kw in query_lower:
                return "simple"

        # Default to complex
        return "complex"

    async def _execute_with_streaming(
        self,
        query: str,
        llm
    ) -> Dict:
        """Execute with streaming."""
        result_text = ""

        async for chunk in llm.astream(query):
            result_text += chunk.content
            # Each chunk could be forwarded to the caller here for streaming output

        return {
            "output": result_text,
            "model": getattr(llm, "model_name", "unknown")
        }

    async def parallel_tool_call(
        self,
        tools: List[str],
        inputs: List[Dict]
    ) -> List[Any]:
        """Call independent tools in parallel."""
        if len(tools) != len(inputs):
            raise ValueError("tools and inputs must have the same length")

        tasks = [
            self._call_tool(tool, input_data)
            for tool, input_data in zip(tools, inputs)
        ]

        # With return_exceptions=True a failed tool yields its exception
        # object in the result list instead of cancelling the others
        results = await asyncio.gather(*tasks, return_exceptions=True)

        return results

    async def _call_tool(self, tool_name: str, input_data: Dict) -> Any:
        """Call a tool."""
        # Placeholder: tool dispatch depends on the application
        raise NotImplementedError("tool dispatch is application-specific")
 
 
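# ========== Cache implementation ==========

import json


class AgentCache:
    """Two-level Agent cache: an in-process dict in front of optional Redis."""

    def __init__(self, redis_client=None, local_cache_size: int = 1000):
        self.redis = redis_client
        self.local_cache = {}   # insertion order doubles as recency order
        self.local_cache_size = local_cache_size

    async def get(self, key: str) -> Optional[Dict]:
        """Read from the cache."""
        # Check the local cache first
        if key in self.local_cache:
            # Re-insert so the key counts as most recently used
            value = self.local_cache.pop(key)
            self.local_cache[key] = value
            return value

        # Then check Redis
        if self.redis:
            value = await self.redis.get(key)
            if value:
                result = json.loads(value)
                # Populate the local cache
                self._set_local(key, result)
                return result

        return None

    async def set(self, key: str, value: Dict, ttl: int = 3600):
        """Write to the cache."""
        # Local cache (entries here never expire; ttl applies to Redis only)
        self._set_local(key, value)

        # Redis
        if self.redis:
            await self.redis.setex(key, ttl, json.dumps(value))

    def _set_local(self, key: str, value: Dict):
        """Write to the local cache, evicting the least recently used entry when full."""
        # Drop any existing entry so an update does not trigger an eviction
        self.local_cache.pop(key, None)
        if len(self.local_cache) >= self.local_cache_size:
            # Simple LRU: dicts keep insertion order, so the first key is the oldest
            oldest_key = next(iter(self.local_cache))
            del self.local_cache[oldest_key]

        self.local_cache[key] = value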
 
 
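# ========== Batch request optimization ==========

class BatchRequestManager:
    """Batch request manager."""

    def __init__(self, llm, batch_size: int = 10, timeout: float = 0.1):
        self.llm = llm
        self.batch_size = batch_size
        self.timeout = timeout      # max seconds a partial batch waits
        self.pending = []           # (request_id, prompt) pairs not yet sent
        self.results = {}           # request_id -> Future carrying the result

    async def add_request(self, request_id: str, prompt: str) -> str:
        """Queue a request and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        self.results[request_id] = future
        self.pending.append((request_id, prompt))

        # Await the future instead of polling
        try:
            return await future
        finally:
            self.results.pop(request_id, None)

    async def process_batch(self):
        """Background loop; run it as a task alongside add_request callers."""
        while True:
            if len(self.pending) < self.batch_size:
                # Give a partial batch up to `timeout` seconds to fill
                await asyncio.sleep(self.timeout)
            if self.pending:
                # Flush even when the batch is not full, so no request waits forever
                await self._execute_batch()

    async def _execute_batch(self):
        """Execute one batch."""
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]

        prompts = [p for _, p in batch]

        # Batched call
        results = await self._batch_call(prompts)

        # Hand each result (or error) back to its waiting caller
        for (request_id, _), result in zip(batch, results):
            future = self.results.get(request_id)
            if future is None or future.done():
                continue
            if isinstance(result, BaseException):
                future.set_exception(result)
            else:
                future.set_result(result)

    async def _batch_call(self, prompts: List[str]) -> List[Any]:
        """Call the LLM for a batch of prompts."""
        # Concurrent single calls; swap in a true batch API if the provider has one
        tasks = [self.llm.ainvoke(p) for p in prompts]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [r if isinstance(r, BaseException) else r.content for r in results]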
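
The listing collects streamed chunks into one string before returning, which does not reduce time to first token for the caller. To actually forward chunks as they arrive, the method can be written as an async generator. The sketch below assumes the same `llm.astream(...)` interface used in the listing, with chunks exposing `.content`; `FakeLLM` and `FakeChunk` are stand-ins so the example runs without a model.

```python
import asyncio
from typing import AsyncIterator


async def stream_answer(llm, query: str) -> AsyncIterator[str]:
    """Yield text pieces as they arrive instead of waiting for the full answer."""
    async for chunk in llm.astream(query):
        yield chunk.content


class FakeChunk:
    """Stand-in for a streamed chunk object."""

    def __init__(self, content: str):
        self.content = content


class FakeLLM:
    """Stand-in model that streams the query back word by word."""

    async def astream(self, query: str):
        for word in query.split():
            await asyncio.sleep(0)      # simulate waiting on the network
            yield FakeChunk(word + " ")


async def main():
    pieces = []
    async for piece in stream_answer(FakeLLM(), "hello streaming world"):
        pieces.append(piece)            # a web handler would send each piece here
    return pieces


if __name__ == "__main__":
    print(asyncio.run(main()))
```

A streamed response can still be cached: accumulate the pieces while forwarding them and store the joined text once the stream ends.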

3. Cost Control

3.1 Cost Analysis

┌─────────────────────────────────────────────────────────────┐
│                    Agent Cost Breakdown                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │              Cost share (approximate)               │   │
│   │                                                     │   │
│   │  LLM API calls   ████████████████████░░░░  60-80%   │   │
│   │  Vector storage  ████░░░░░░░░░░░░░░░░░░░░  10-15%   │   │
│   │  Compute/servers ████░░░░░░░░░░░░░░░░░░░░  10-15%   │   │
│   │  Other           ██░░░░░░░░░░░░░░░░░░░░░░  5-10%    │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Token consumption breakdown:                              │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                                                     │   │
│   │  System prompt         10-20%                       │   │
│   │  Conversation history  30-50%                       │   │
│   │  Retrieved context     20-40%                       │   │
│   │  User input            5-10%                        │   │
│   │  Output                10-20%                       │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
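
To make the breakdown concrete, the arithmetic can be run on a made-up workload. Every number in the sketch below (request volume, token counts, per-1k prices) is hypothetical and chosen only to show how token counts translate into spend; `monthly_llm_cost` is likewise an illustrative helper.

```python
def monthly_llm_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
    days: int = 30,
) -> float:
    """Estimated LLM spend for one month, in the currency of the prices."""
    per_request = (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
    return per_request * requests_per_day * days


# Hypothetical workload: 10,000 requests/day, 2,000 input + 400 output tokens each
baseline = monthly_llm_cost(10_000, 2_000, 400, 0.001, 0.002)

# Suppose history is 40% of the input (800 tokens) and trimming halves it
trimmed = monthly_llm_cost(10_000, 1_600, 400, 0.001, 0.002)

if __name__ == "__main__":
    saved = (baseline - trimmed) / baseline
    print(f"baseline={baseline:.2f} trimmed={trimmed:.2f} saved={saved:.1%}")
```

Under these assumed figures the monthly bill drops from about 840 to about 720, a saving of roughly 14%, which is why history and retrieved context are usually the first places to look.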

3.2 Cost Optimization Strategies

"""
Cost optimization
"""

from typing import List, Dict
import tiktoken


class CostOptimizer:
    """Cost optimizer."""

    def __init__(
        self,
        max_history_turns: int = 5,
        max_context_tokens: int = 4000,
        compression_ratio: float = 0.5
    ):
        self.max_history_turns = max_history_turns
        self.max_context_tokens = max_context_tokens
        self.compression_ratio = compression_ratio
        self.encoding = tiktoken.get_encoding("cl100k_base")

    def count_tokens(self, text: str) -> int:
        """Count tokens."""
        return len(self.encoding.encode(text))

    def optimize_prompt(
        self,
        system_prompt: str,
        history: List[Dict],
        context: str,
        user_input: str
    ) -> str:
        """Build a trimmed prompt."""
        # 1. Compress the conversation history
        compressed_history = self._compress_history(history)

        # 2. Compress the retrieved context
        compressed_context = self._compress_context(context)

        # 3. Assemble the final prompt
        parts = [
            system_prompt,
            "\n\n[Conversation history]",
            compressed_history,
            "\n\n[Relevant knowledge]",
            compressed_context,
            "\n\n[User input]",
            user_input
        ]

        return "\n".join(parts)

    def _compress_history(self, history: List[Dict]) -> str:
        """Compress the conversation history."""
        # Keep only the most recent entries (this counts messages, not
        # user/assistant pairs); guard against [-0:], which would keep everything
        if self.max_history_turns > 0:
            recent_history = history[-self.max_history_turns:]
        else:
            recent_history = []

        # Format
        lines = []
        for h in recent_history:
            role = h.get("role", "user")
            content = h.get("content", "")
            # Truncate overly long messages
            if len(content) > 500:
                content = content[:500] + "..."
            lines.append(f"{role}: {content}")

        return "\n".join(lines)

    def _compress_context(self, context: str) -> str:
        """Compress the retrieved context."""
        token_ids = self.encoding.encode(context)

        if len(token_ids) <= self.max_context_tokens:
            return context

        # Plain truncation; a summarization model could compress better.
        # Keep compression_ratio of the tokens, but never exceed the budget.
        target = min(
            int(len(token_ids) * self.compression_ratio),
            self.max_context_tokens,
        )
        return self.encoding.decode(token_ids[:target]) + "\n...(content truncated)"
 
 
class ModelRouter:
    """Model router: picks a suitable model for each task."""

    def __init__(self, models: Dict[str, Dict]):
        """
        Args:
            models: model configuration, for example
            {
                "fast": {"model": "gpt-3.5", "cost_per_1k": 0.001},
                "smart": {"model": "gpt-4", "cost_per_1k": 0.03}
            }
        """
        self.models = models

    def select_model(
        self,
        task_type: str,
        input_length: int,
        budget: float = None
    ) -> str:
        """
        Select a model.

        Args:
            task_type: task type
            input_length: input length
            budget: budget limit (accepted but not used by these simple rules)

        Returns:
            The key of the chosen entry in `models` ("fast" or "smart")
        """
        # Simple tasks go to the fast model
        if task_type in ["summarize", "translate", "simple_qa"]:
            return "fast"

        # Complex tasks go to the strong model
        if task_type in ["reasoning", "planning", "code_gen"]:
            return "smart"

        # Otherwise decide by input length
        if input_length < 1000:
            return "fast"

        return "smart"

    def estimate_cost(
        self,
        model_name: str,
        input_tokens: int,
        output_tokens: int
    ) -> float:
        """Estimate cost."""
        config = self.models.get(model_name, {})
        cost_per_1k = config.get("cost_per_1k", 0.01)

        # Simplification: input and output tokens share one price
        total_tokens = input_tokens + output_tokens
        return (total_tokens / 1000) * cost_per_1k

4. Monitoring and Operations

4.1 Monitoring Metrics

┌─────────────────────────────────────────────────────────────┐
│                  Agent Monitoring Metrics                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Business metrics:                                         │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • Task success rate (by type)                       │   │
│   │ • Average response time                             │   │
│   │ • User satisfaction                                 │   │
│   │ • Active users                                      │   │
│   │ • Feature usage distribution                        │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Technical metrics:                                        │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • API call latency (P50/P95/P99)                    │   │
│   │ • LLM call latency                                  │   │
│   │ • Token consumption rate                            │   │
│   │ • Error rate (by type)                              │   │
│   │ • Queue backlog                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Cost metrics:                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • Daily API cost                                    │   │
│   │ • Average cost per user                             │   │
│   │ • Cost trend                                        │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
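
The P50/P95/P99 latency figures are percentiles over a window of recorded request durations. The sketch below computes them with the nearest-rank method; `percentile` and `latency_summary` are illustrative helpers, and a real deployment would normally let its metrics system compute these from histograms.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so integer p keeps the arithmetic exact
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples: Sequence[float]) -> Dict[str, float]:
    """Return the three percentiles named in the metrics list."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Hypothetical latencies in seconds; one slow outlier dominates the tail
    print(latency_summary([0.8, 1.1, 0.9, 1.3, 12.0, 1.0, 0.7, 1.2, 0.95, 1.05]))
```

Tail percentiles matter more than the mean for Agents, because a single slow LLM or tool call can stretch one request far beyond the typical case.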

4.2 Alert Rules

┌─────────────────────────────────────────────────────────────┐
│                     Example Alert Rules                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   P0 - Critical (handle within 5 minutes)                   │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • Service unavailable                               │   │
│   │ • Error rate > 20%                                  │   │
│   │ • P95 response time > 30s                           │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   P1 - High priority (handle within 1 hour)                 │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • Error rate > 10%                                  │   │
│   │ • P95 response time > 15s                           │   │
│   │ • Abnormal growth in token consumption              │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   P2 - Medium priority (handle within 1 day)                │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • Task success rate drops by > 5%                   │   │
│   │ • Rise in user complaints                           │   │
│   │ • Cost over budget                                  │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
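
The numeric thresholds in these rules can be evaluated mechanically. The sketch below covers only the rules that reduce to a simple comparison; the `Snapshot` fields and the `classify` function are hypothetical, and the rules about abnormal token growth and user complaints are left out because they need trend data rather than a single reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """One reading of the service's health (illustrative fields)."""
    available: bool
    error_rate: float                 # fraction of failed requests, 0.0 to 1.0
    p95_latency_s: float              # P95 response time in seconds
    success_rate_drop: float = 0.0    # drop in task success rate, as a fraction
    over_budget: bool = False


def classify(s: Snapshot) -> Optional[str]:
    """Return the highest matching severity, or None when nothing fires."""
    if not s.available or s.error_rate > 0.20 or s.p95_latency_s > 30:
        return "P0"
    if s.error_rate > 0.10 or s.p95_latency_s > 15:
        return "P1"
    if s.success_rate_drop > 0.05 or s.over_budget:
        return "P2"
    return None


if __name__ == "__main__":
    print(classify(Snapshot(available=True, error_rate=0.12, p95_latency_s=4.0)))
```

Checking severities from highest to lowest means a snapshot that satisfies several rules is reported once, at its most urgent level.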

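5. Interview Q&A

Q1: How do you improve an Agent's response speed?

Key points:

• Streaming output: minimize time to first token
• Model selection: use a small model for simple tasks
• Caching: reuse results for similar requests
• Parallelism: run independent operations concurrently
• Prompt optimization: cut unnecessary tokens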

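Q2: How do you control an Agent's operating cost?

Key points:

Strategy                      │ Expected savings
──────────────────────────────┼─────────────────
Tiered model usage            │ 30-50%
Cache optimization            │ 20-30%
Prompt compression            │ 15-25%
Conversation history trimming │ 10-20%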

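Q3: What problems commonly come up when deploying an Agent?

Key points:

• Unstable LLM API: add retries and a fallback path
• Overlong context: compress and truncate the context
• Concurrency limits: use request queues and rate limiting
• Cost overruns: set budget alerts and circuit breakers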
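
The first point, retries plus a fallback, can be sketched as follows. `call_with_retry` and its parameters are hypothetical, `primary` and `fallback` stand for any async callables that take a prompt, and real code would catch only the provider's transient error types rather than every `Exception`.

```python
import asyncio
import random


async def call_with_retry(primary, fallback, prompt,
                          max_attempts: int = 3, base_delay: float = 0.5):
    """Try `primary` with exponential backoff, then degrade to `fallback`."""
    for attempt in range(max_attempts):
        try:
            return await primary(prompt)
        except Exception:               # assumption: every failure is retryable
            if attempt == max_attempts - 1:
                break
            # Exponential backoff with jitter so retries do not synchronize
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            await asyncio.sleep(delay)
    # All attempts failed: fall back, for example to a smaller or backup model
    return await fallback(prompt)


async def _demo():
    async def unstable(prompt):
        raise RuntimeError("upstream unavailable")

    async def backup(prompt):
        return f"[backup model] {prompt}"

    return await call_with_retry(unstable, backup, "hello", base_delay=0.01)


if __name__ == "__main__":
    print(asyncio.run(_demo()))
```

Capping `max_attempts` also bounds the extra cost: each retry is another billed call, so unbounded retries can turn an outage into a cost overrun.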

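6. Summary

Key takeaways for Agent deployment and optimization:

• Architecture: choose the deployment mode that fits the scenario
• Performance: optimize along three dimensions (model, retrieval, execution)
• Cost control: tokens are the main cost driver
• Monitoring and operations: build a thorough monitoring and alerting system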
