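Agent Evaluation Metrics

Evaluation is the foundation of continuous agent improvement. Only with a systematic evaluation framework can you measure an agent's performance objectively and steer each iteration.

1. Evaluation Dimensions

1.1 Metric Framework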

┌─────────────────────────────────────────────────────────────┐
│                  Agent Evaluation Metrics                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  Effectiveness metrics                              │   │
│   │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐  │   │
│   │ │ Task success │ │ Tool-call    │ │ Output       │  │   │
│   │ │ rate         │ │ accuracy     │ │ quality      │  │   │
│   │ └──────────────┘ └──────────────┘ └──────────────┘  │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  Efficiency metrics                                 │   │
│   │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐  │   │
│   │ │ Response     │ │ Step count   │ │ Token        │  │   │
│   │ │ time         │ │              │ │ usage        │  │   │
│   │ └──────────────┘ └──────────────┘ └──────────────┘  │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  User experience metrics                            │   │
│   │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐  │   │
│   │ │ User         │ │ Completion   │ │ Interaction  │  │   │
│   │ │ satisfaction │ │ time         │ │ fluency      │  │   │
│   │ └──────────────┘ └──────────────┘ └──────────────┘  │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  Reliability metrics                                │   │
│   │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐  │   │
│   │ │ Error rate   │ │ Recovery     │ │ Stability    │  │   │
│   │ │              │ │ capability   │ │              │  │   │
│   │ └──────────────┘ └──────────────┘ └──────────────┘  │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

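1.2 Core Metrics in Detail

Metric	Short name	Definition	Calculation	Target
Task success rate	Task Success Rate	Share of tasks completed correctly	Successful tasks / total tasks	> 80%
Tool-call accuracy	Tool Accuracy	Share of tool calls that are correct	Correct calls / total calls	> 90%
Average steps	Avg Steps	Average number of interaction steps needed to complete a task	Total steps / number of tasks	Scenario-dependent
First-try success rate	First Try Success	Share of tasks that succeed on the first attempt	First-try successes / total tasks	> 60%
Response latency	Latency	Time from input to output	P50/P95/P99	< 5s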
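
As a rough illustration of how the metrics in the table might be computed from logged runs, here is a minimal sketch. The `RunLog` record and all of its field names are assumptions made for this example, not part of any framework. Latency percentiles use the standard library's `statistics.quantiles`, which needs at least two samples.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, List


@dataclass
class RunLog:
    """Hypothetical per-task log; the field names are assumptions."""
    success: bool            # task was completed correctly
    first_try_success: bool  # task succeeded on the first attempt
    steps: int               # interaction steps used
    tool_calls: int          # total tool calls made
    correct_tool_calls: int  # tool calls judged correct
    latency_s: float         # end-to-end latency in seconds


def core_metrics(logs: List[RunLog]) -> Dict[str, float]:
    """Compute the table's metrics; expects at least two logs."""
    n = len(logs)
    total_calls = sum(r.tool_calls for r in logs)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    cuts = quantiles([r.latency_s for r in logs], n=100, method="inclusive")
    return {
        "task_success_rate": sum(r.success for r in logs) / n,
        # With no tool calls at all there is nothing to get wrong
        "tool_accuracy": (
            sum(r.correct_tool_calls for r in logs) / total_calls
            if total_calls else 1.0
        ),
        "avg_steps": sum(r.steps for r in logs) / n,
        "first_try_success": sum(r.first_try_success for r in logs) / n,
        "latency_p50": cuts[49],
        "latency_p95": cuts[94],
        "latency_p99": cuts[98],
    }
```

Reporting P95 and P99 next to P50 matters because a small share of slow runs can dominate the user experience while leaving the median untouched.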

2. Evaluation Methods

2.1 Comparison of Evaluation Methods

┌─────────────────────────────────────────────────────────────┐
│              Comparison of Evaluation Methods               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Method 1: Human evaluation                                │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ Pros:                                               │   │
│   │ • Most accurate evaluation method                   │   │
│   │ • Can surface subtle problems                       │   │
│   │ • Can assess user experience                        │   │
│   │                                                     │   │
│   │ Cons:                                               │   │
│   │ • Costly and slow                                   │   │
│   │ • Highly subjective                                 │   │
│   │ • Hard to scale                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Method 2: Automated test sets                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ Pros:                                               │   │
│   │ • Repeatable                                        │   │
│   │ • Low cost                                          │   │
│   │ • Fits CI/CD integration                            │   │
│   │                                                     │   │
│   │ Cons:                                               │   │
│   │ • Requires test data preparation                    │   │
│   │ • Limited coverage                                  │   │
│   │ • Ongoing maintenance cost                          │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Method 3: LLM-as-Judge                                    │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ Pros:                                               │   │
│   │ • Highly automated                                  │   │
│   │ • Can evaluate open-ended output                    │   │
│   │ • Good consistency                                  │   │
│   │                                                     │   │
│   │ Cons:                                               │   │
│   │ • Subject to model bias                             │   │
│   │ • Relatively costly (LLM calls)                     │   │
│   │ • Requires evaluation prompt design                 │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Method 4: User feedback                                   │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ Pros:                                               │   │
│   │ • Judgments from real users                         │   │
│   │ • Uncovers unknown problems                         │   │
│   │ • Collected continuously                            │   │
│   │                                                     │   │
│   │ Cons:                                               │   │
│   │ • Feedback can be sparse                            │   │
│   │ • Subject to bias                                   │   │
│   │ • Users need incentives                             │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
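
Because user feedback can be sparse, a raw approval ratio built from a handful of votes is easy to over-read. One common way to account for small samples is the Wilson score interval. It is offered here as an illustrative addition rather than something the comparison above prescribes, and the function name and the 95% default (`z = 1.96`) are assumptions of this sketch.

```python
import math
from typing import Tuple


def wilson_interval(positive: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Confidence interval for an approval rate (z = 1.96 is roughly 95%)."""
    if total == 0:
        # No feedback yet: the rate could be anything
        return (0.0, 1.0)
    p = positive / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

With the same 90% approval ratio, 9 of 10 votes gives a much wider interval than 90 of 100, which makes the uncertainty of sparse feedback explicit when comparing agent versions.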

2.2 Evaluation Framework Implementation

"""
Agent evaluation framework
"""

import asyncio
import json
import time
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List, Optional


class EvaluationResult(Enum):
    """Verdict for a single evaluation."""
    PASS = "pass"
    FAIL = "fail"
    PARTIAL = "partial"


@dataclass
class TestCase:
    """A single test case."""
    id: str
    input: str
    expected_output: Optional[str] = None
    expected_tools: Optional[List[str]] = None
    success_criteria: Optional[str] = None
    category: str = "general"
    difficulty: str = "medium"


@dataclass
class EvaluationRecord:
    """Evaluation record for one test case."""
    test_case: TestCase
    actual_output: str
    actual_tools: List[str]
    result: EvaluationResult
    score: float
    execution_time: float
    error: Optional[str] = None
    notes: str = ""


class AgentEvaluator:
    """Agent evaluator."""

    def __init__(
        self,
        agent,
        llm_judge=None,
        evaluators: Optional[Dict[str, Callable]] = None
    ):
        self.agent = agent
        self.llm_judge = llm_judge
        self.evaluators = evaluators or {}
        self.results: List[EvaluationRecord] = []

    async def evaluate(
        self,
        test_cases: List[TestCase],
        parallel: bool = False
    ) -> Dict[str, Any]:
        """
        Run the evaluation.

        Args:
            test_cases: List of test cases.
            parallel: Whether to run the cases concurrently.

        Returns:
            The evaluation report.
        """
        if parallel:
            # Run all cases concurrently; gather() keeps the input order
            self.results = list(await asyncio.gather(
                *(self._evaluate_single(tc) for tc in test_cases)
            ))
        else:
            self.results = []
            for tc in test_cases:
                result = await self._evaluate_single(tc)
                self.results.append(result)

        return self._generate_report()
    
    async def _evaluate_single(self, tc: TestCase) -> EvaluationRecord:
        """Evaluate a single test case."""
        start_time = time.time()

        try:
            # Run the agent (assumed to return a dict-like response)
            response = await self.agent.run(tc.input)

            actual_output = response.get("output", "")
            actual_tools = response.get("tools_used", [])

            # Score the result
            result, score = await self._compute_score(
                tc, 
                actual_output, 
                actual_tools
            )
            
            execution_time = time.time() - start_time
            
            return EvaluationRecord(
                test_case=tc,
                actual_output=actual_output,
                actual_tools=actual_tools,
                result=result,
                score=score,
                execution_time=execution_time
            )
            
        except Exception as e:
            execution_time = time.time() - start_time
            return EvaluationRecord(
                test_case=tc,
                actual_output="",
                actual_tools=[],
                result=EvaluationResult.FAIL,
                score=0.0,
                execution_time=execution_time,
                error=str(e)
            )
    
    async def _compute_score(
        self,
        tc: TestCase,
        actual_output: str,
        actual_tools: List[str]
    ) -> tuple[EvaluationResult, float]:
        """Compute the weighted score and the resulting verdict."""
        scores = []

        # 1. Tool-call score
        if tc.expected_tools:
            tool_score = self._evaluate_tools(tc.expected_tools, actual_tools)
            scores.append(("tools", tool_score, 0.3))

        # 2. Output score
        if tc.expected_output:
            output_score = self._evaluate_output(tc.expected_output, actual_output)
            scores.append(("output", output_score, 0.5))

        # 3. LLM-based score
        if self.llm_judge and tc.success_criteria:
            llm_score = await self._llm_evaluate(
                tc.input,
                actual_output,
                tc.success_criteria
            )
            scores.append(("llm", llm_score, 0.2))

        # Weighted average over the checks that applied
        if not scores:
            return EvaluationResult.PARTIAL, 0.5
        
        total_weight = sum(s[2] for s in scores)
        weighted_score = sum(s[1] * s[2] for s in scores) / total_weight
        
        if weighted_score >= 0.8:
            return EvaluationResult.PASS, weighted_score
        elif weighted_score >= 0.5:
            return EvaluationResult.PARTIAL, weighted_score
        else:
            return EvaluationResult.FAIL, weighted_score
    
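    def _evaluate_tools(
        self,
        expected: List[str],
        actual: List[str]
    ) -> float:
        """Score tool usage as the share of expected tools that were called."""
        if not expected:
            return 1.0

        expected_set = set(expected)
        actual_set = set(actual)

        # Fraction of expected tools found among the actual calls
        # (extra calls and call order are not penalized)
        intersection = expected_set & actual_set
        return len(intersection) / len(expected_set)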
    
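    def _evaluate_output(
        self,
        expected: str,
        actual: str
    ) -> float:
        """Score the output against the expected text."""
        # Simple case-insensitive string matching
        expected_lower = expected.lower().strip()
        actual_lower = actual.lower().strip()

        if expected_lower == actual_lower:
            return 1.0

        # The expected text appears inside the actual output
        if expected_lower in actual_lower:
            return 0.8

        # Word overlap (whitespace tokenization only, so unsegmented
        # text such as Chinese would need a real tokenizer)
        expected_words = set(expected_lower.split())
        actual_words = set(actual_lower.split())
        overlap = len(expected_words & actual_words) / len(expected_words)

        return overlap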
    
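    async def _llm_evaluate(
        self,
        input_text: str,
        output: str,
        criteria: str
    ) -> float:
        """Score the output with the LLM judge."""
        prompt = f"""
Evaluate whether the following agent output meets the success criteria.

Input: {input_text}
Output: {output}
Success criteria: {criteria}

Give a score between 0 and 1 and explain why.
Output only a JSON object: {{"score": 0.8, "reason": "..."}}
"""

        response = await self.llm_judge.ainvoke(prompt)

        try:
            result = json.loads(response.content)
            return float(result.get("score", 0))
        except (json.JSONDecodeError, AttributeError, TypeError, ValueError):
            # Unparseable judge reply: fall back to a neutral score
            return 0.5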
    
    def _generate_report(self) -> Dict[str, Any]:
        """Build the evaluation report."""
        total = len(self.results)

        if total == 0:
            return {"error": "no evaluation results"}

        # Overall counts
        pass_count = sum(1 for r in self.results if r.result == EvaluationResult.PASS)
        partial_count = sum(1 for r in self.results if r.result == EvaluationResult.PARTIAL)
        fail_count = sum(1 for r in self.results if r.result == EvaluationResult.FAIL)

        avg_score = sum(r.score for r in self.results) / total
        avg_time = sum(r.execution_time for r in self.results) / total

        # Per-category statistics
        category_stats = {}
        for r in self.results:
            cat = r.test_case.category
            if cat not in category_stats:
                category_stats[cat] = {"total": 0, "pass": 0, "avg_score": 0}
            category_stats[cat]["total"] += 1
            if r.result == EvaluationResult.PASS:
                category_stats[cat]["pass"] += 1
            category_stats[cat]["avg_score"] += r.score
        
        for cat in category_stats:
            stats = category_stats[cat]
            stats["pass_rate"] = stats["pass"] / stats["total"]
            stats["avg_score"] /= stats["total"]
        
        return {
            "summary": {
                "total": total,
                "pass": pass_count,
                "partial": partial_count,
                "fail": fail_count,
                "pass_rate": pass_count / total,
                "avg_score": avg_score,
                "avg_execution_time": avg_time
            },
            "by_category": category_stats,
            "details": [
                {
                    "id": r.test_case.id,
                    "result": r.result.value,
                    "score": r.score,
                    "time": r.execution_time,
                    "error": r.error
                }
                for r in self.results
            ]
        }
 
 
# ========== Sample test cases ==========

SAMPLE_TEST_CASES = [
    TestCase(
        id="math_001",
        input="Calculate 15 + 27",
        expected_output="42",
        expected_tools=["calculator"],
        category="math",
        difficulty="easy"
    ),
    TestCase(
        id="search_001",
        input="Search for today's news",
        expected_tools=["search"],
        success_criteria="Returns relevant news items",
        category="search",
        difficulty="medium"
    ),
    TestCase(
        id="code_001",
        input="Write a Python function that computes the Fibonacci sequence",
        success_criteria="The code is correct, runnable, and commented",
        category="code",
        difficulty="medium"
    )
]
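
The `_evaluate_tools` method above measures only how many expected tools were called, so an agent that calls every available tool would still score 1.0. A possible refinement, sketched below under the assumption that tools are compared by name as unordered sets, reports precision and F1 alongside recall. The function name `tool_call_prf` is hypothetical.

```python
from typing import Dict, List


def tool_call_prf(expected: List[str], actual: List[str]) -> Dict[str, float]:
    """Precision, recall, and F1 of tool usage, comparing tool names as sets."""
    expected_set, actual_set = set(expected), set(actual)
    hits = len(expected_set & actual_set)

    # Recall: share of expected tools that were actually called
    recall = hits / len(expected_set) if expected_set else 1.0
    # Precision: share of called tools that were expected
    if actual_set:
        precision = hits / len(actual_set)
    else:
        # Calling nothing is only "precise" when nothing was expected
        precision = 1.0 if not expected_set else 0.0

    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For a case such as `math_001`, an agent that calls both `calculator` and `search` keeps full recall but drops to a precision of 0.5, which makes unnecessary tool calls visible in the score.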

2.3 LLM-as-Judge Implementation

"""
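LLM-as-Judge evaluator
"""

import json
import re
from typing import Any, Dict, List, Optional


class LLMJudge:
    """LLM-based judge."""

    # Filled in with str.format(); literal JSON braces are doubled ({{ }})
    EVALUATION_PROMPT = """
You are a professional evaluator assessing the output quality of an AI agent.

## Evaluation criteria

1. **Accuracy**: does the output answer the user's question correctly?
2. **Completeness**: does the output cover everything the user asked for?
3. **Relevance**: is the output on topic, with no irrelevant content?
4. **Clarity**: is the output clear, easy to follow, and well structured?

## Item under evaluation

User input: {input}
Agent output: {output}
Expected output (if any): {expected}

## Output requirements

Return the evaluation result as JSON in the following format:
{{
    "accuracy": 0.9,
    "completeness": 0.8,
    "relevance": 1.0,
    "clarity": 0.9,
    "overall": 0.9,
    "reason": "brief justification for the scores"
}}

Begin the evaluation:
"""

    def __init__(self, llm):
        self.llm = llm

    async def evaluate(
        self,
        input_text: str,
        output: str,
        expected: Optional[str] = None
    ) -> Dict[str, Any]:
        """Run one evaluation and return the parsed scores."""
        prompt = self.EVALUATION_PROMPT.format(
            input=input_text,
            output=output,
            expected=expected or "none"
        )

        response = await self.llm.ainvoke(prompt)

        # Extract the first flat {...} object from the reply
        json_match = re.search(r'\{[^}]+\}', response.content, re.DOTALL)

        if json_match:
            try:
                return json.loads(json_match.group())
            except json.JSONDecodeError:
                pass

        # Neutral fallback when the reply cannot be parsed
        return {"overall": 0.5, "reason": "failed to parse judge output"}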
    
    async def batch_evaluate(
        self,
        items: List[Dict[str, str]]
    ) -> List[Dict[str, float]]:
        """Evaluate a batch of items one after another."""
        results = []
        for item in items:
            result = await self.evaluate(
                item["input"],
                item["output"],
                item.get("expected")
            )
            results.append(result)
        return results
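
The regular expression in `evaluate` stops at the first closing brace, so it cannot handle a reply whose JSON contains nested objects or braces inside strings. One possible alternative, sketched here with the standard library's `json.JSONDecoder.raw_decode`, tries each `{` in turn until a full object parses. The helper name `extract_first_json_object` is an assumption of this example.

```python
import json
from typing import Any, Dict, Optional


def extract_first_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None
```

This tolerates explanatory prose or code fences that a judge model may wrap around the JSON, and it returns `None` instead of raising, so the caller can apply the same neutral fallback as before.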

3. Evaluation Best Practices

3.1 Building the Evaluation Dataset

┌─────────────────────────────────────────────────────────────┐
│            Evaluation Dataset Construction Guide            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   1. Coverage requirements                                  │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • Functional: cover every core agent capability     │   │
│   │ • Difficulty: include easy, medium, and hard        │   │
│   │ • Boundary: normal, abnormal, and edge cases        │   │
│   │ • Users: typical questions from each user group     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   2. Data sources                                           │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • Real user questions (anonymized)                  │   │
│   │ • FAQs from product documentation                   │   │
│   │ • Competitor analysis                               │   │
│   │ • Expert-written cases                              │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   3. Annotation guidelines                                  │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • Expected outputs must be explicit                 │   │
│   │ • Success criteria must be quantifiable             │   │
│   │ • Annotations must be reviewed                      │   │
│   │ • Update and clean the data regularly               │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   4. Recommended size                                       │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ • MVP stage: 50-100 cases                           │   │
│   │ • Production stage: 500-1000 cases                  │   │
│   │ • Mature stage: 2000+ cases                         │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
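
The coverage requirements above can be checked mechanically before a dataset is used. The sketch below assumes each test case is reduced to a `(category, difficulty)` pair, mirroring the `category` and `difficulty` fields of `TestCase`. The function name and the `min_per_cell` parameter are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def coverage_gaps(
    cases: Iterable[Tuple[str, str]],
    categories: List[str],
    difficulties: List[str],
    min_per_cell: int = 1,
) -> List[Tuple[str, str]]:
    """List (category, difficulty) cells with fewer than `min_per_cell` cases."""
    counts = Counter(cases)
    return [
        (category, difficulty)
        for category in categories
        for difficulty in difficulties
        if counts[(category, difficulty)] < min_per_cell
    ]
```

Running this whenever the dataset changes gives a simple list of cells that still need cases, which is easier to act on than an overall case count.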

3.2 Continuous Evaluation Workflow

┌─────────────────────────────────────────────────────────────┐
│               Continuous Evaluation Workflow                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐   │
│   │ Development │ ──→ │ Automated   │ ──→ │ Result      │   │
│   │ iteration   │     │ eval (CI/CD)│     │ analysis    │   │
│   └─────────────┘     └─────────────┘     └──────┬──────┘   │
│          ↑                                       │          │
│          │                                       ↓          │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐   │
│   │ Model       │ ←── │ Problem     │ ←── │ Manual      │   │
│   │ optimization│     │ localization│     │ spot check  │   │
│   └─────────────┘     └─────────────┘     └─────────────┘   │
│                                                             │
│   Key checkpoints:                                          │
│   • Every code commit → automated evaluation                │
│   • Daily → regression tests                                │
│   • Weekly → manual spot checks                             │
│   • Monthly → full evaluation                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘
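
For the per-commit automated evaluation, a CI job needs a rule that turns a report into pass or fail. A minimal sketch follows, assuming higher-is-better metrics such as the `pass_rate` and `avg_score` values from the report summary and a stored baseline from a previous run. The function name and the `tolerance` default are assumptions, not recommended values.

```python
from typing import Dict, List


def regression_failures(
    current: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Describe every baseline metric that dropped by more than `tolerance`."""
    failures = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None:
            failures.append(f"{name}: missing from current report")
        elif value < base - tolerance:
            failures.append(
                f"{name}: {value:.3f} is below baseline {base:.3f} "
                f"(tolerance {tolerance})"
            )
    return failures
```

In a pipeline, a non-empty result could be printed and mapped to a non-zero exit code (for example with `sys.exit(1)`) so that the commit is blocked. The tolerance absorbs run-to-run noise from non-deterministic model output.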

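4. Interview Q&A

Q1: How would you design evaluation metrics for an agent?

Key points:

Define the goal: choose the core metrics based on the type of agent
Multiple dimensions: effectiveness, efficiency, experience, reliability
Quantifiable: metrics must be measurable and comparable
Benchmarked: set both target values and baseline values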

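Q2: What are the limitations of LLM-as-Judge?

Key points:

Limitation	Mitigation
Model bias	Cross-validate with multiple models
Inconsistent judgments	Use a standardized evaluation prompt
High cost	Evaluate a sample, with full coverage of critical cases
Cannot verify facts	Combine with rule-based checks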
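
Cross-validation with multiple models can be as simple as collecting one score per judge and looking at how far they disagree. The sketch below is one possible aggregation, assuming each judge returns an overall score in [0, 1]. The function name and the `max_spread` threshold are hypothetical.

```python
from statistics import median
from typing import Any, Dict


def aggregate_judges(scores: Dict[str, float], max_spread: float = 0.3) -> Dict[str, Any]:
    """Combine per-judge scores and flag cases where the judges disagree."""
    if not scores:
        raise ValueError("at least one judge score is required")
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        # The median is less sensitive to one outlying judge than the mean
        "score": median(values),
        "spread": spread,
        # Large disagreement is a signal to route the case to a human
        "needs_human_review": spread > max_spread,
    }
```

Routing only the high-disagreement cases to human reviewers keeps the manual workload small while still catching the cases where a single judge's bias would matter most.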

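Q3: How do you handle bias in evaluation results?

Key points:

Annotation quality: use multiple annotators and cross-validate their labels
Balanced samples: keep the number of cases balanced across categories
Outlier removal: identify and handle outliers
Regular calibration: compare against human evaluation and recalibrate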
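
Calibration against human evaluation can be made concrete by scoring the same cases both ways and comparing. A minimal sketch follows, assuming paired scores in [0, 1] and the same 0.8 pass threshold used by the evaluator earlier. The function name and the returned keys are assumptions of this example.

```python
from typing import Dict, List


def calibration_report(
    judge: List[float],
    human: List[float],
    pass_threshold: float = 0.8,
) -> Dict[str, float]:
    """Compare automated scores with human scores for the same cases."""
    if not judge or len(judge) != len(human):
        raise ValueError("judge and human must be non-empty and equally long")
    n = len(judge)
    pairs = list(zip(judge, human))
    return {
        # Average size of the disagreement
        "mean_abs_error": sum(abs(j - h) for j, h in pairs) / n,
        # Positive means the automated judge is more lenient than humans
        "mean_bias": sum(j - h for j, h in pairs) / n,
        # Share of cases where both sides reach the same pass/fail verdict
        "verdict_agreement": sum(
            (j >= pass_threshold) == (h >= pass_threshold) for j, h in pairs
        ) / n,
    }
```

A consistently positive or negative `mean_bias` suggests adjusting the judge prompt or the pass threshold, while low `verdict_agreement` suggests the automated scores should not yet gate releases on their own.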

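5. Summary

Key elements of agent evaluation:

Multi-dimensional metrics: effectiveness, efficiency, experience, reliability
Multiple methods: human, automated, LLM-based, user feedback
Continuous iteration: an evaluation-driven optimization loop
Data quality: a high-quality evaluation dataset is the foundation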
