Part 7 | Agent Engineering: Exception Handling / Retry and Circuit Breaking / Context Overflow / Loop Detection

敲代码的程序猿 · Published 2026-04-30 13:45:00

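This is Part 7 of the "Agent Core Architecture Engineer Interview" series.

Previous article: "RAG in Practice: Chunking Strategy / Embedding / Reranking / Evaluation"

Intended audience: Java engineers preparing for AI Agent interviews at major tech companies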

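Opening: a real case of an Agent falling over in production

From the retrospective meeting held in the second week after an e-commerce customer-service Agent went live:

"From day three, the Agent started talking nonsense and giving answers that had nothing to do with the questions."

"On day five it went straight into an infinite loop, the CPU was pegged, and the process crashed."

"On day seven we found that after a tool call timed out, the Agent blindly retried 47 times and took the downstream service down."

"The cause we eventually tracked down: the context window had overflowed, but the Agent had no detection mechanism and kept stuffing data into it."

This is not a model problem. Your engineering has not kept up.

Agent engineering rests on four core capabilities: an exception handling system, retry and circuit breaking, context overflow protection, and loop detection and termination. They determine how stable an Agent is in production, and they are also the "deep questions" interviewers like most.

This article goes from pitfalls to principles to code and gives you a complete engineering approach.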

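1. Pain Points: Four Deadly Traps in Agent Engineering

1.1 Deadly trap 1: no exception handling system

// ❌ Typical mistake: the exception is swallowed and the Agent carries on
public Response executeAgent(UserQuery query) {
    try {
        // Call the LLM
        Response response = llm.invoke(query);
        return response;
    } catch (Exception e) {
        // Wrong: log the error, then return an empty result
        log.error("LLM call failed", e);
        return Response.empty();  // The Agent never learns that the call failed!
    }
}

Problems:

- The Agent does not know the call failed and continues with the logic that follows
- The user gets an empty result, which is a very poor experience
- Errors are not classified, so nobody can tell a network problem from rate limiting or a model outage

Root cause: there is no layered exception handling system, and no error classification or recovery strategy.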

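1.2 Deadly trap 2: blind retries cause an avalanche

// ❌ Typical mistake: mindless retries with no backoff
public Response callLLMWithRetry(Query query) {
    int maxRetries = 10;
    for (int i = 0; i < maxRetries; i++) {
        try {
            return llm.invoke(query);
        } catch (Exception e) {
            log.warn("Call attempt {} failed: {}", i + 1, e.getMessage());
            // Fatal: no wait at all before the next attempt
        }
    }
    throw new AgentException("All 10 attempts failed");
}

Scenario: the LLM service is rate limited to 10 requests per second. Your Agent has 50 requests in flight and each one makes up to 10 attempts with no delay → up to 500 requests hit the LLM service in a burst → stricter rate limiting kicks in → more failures trigger more retries, a vicious cycle.

Root cause: no exponential backoff, no circuit breaker, no rate-limit protection.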

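1.3 Deadly trap 3: context window overflow

// ❌ Typical mistake: data is added to the context without limit
public void addToContext(List<Message> history, String newContent) {
    // Dangerous: no length check
    history.add(new Message("user", newContent));
    // Keeps accumulating until the LLM reports a context overflow
}

public Response chat(List<Message> history, String input) {
    history.add(new Message("user", input));
    // Fatal: the entire history is sent every time
    return llm.invoke(history);  // Overflow!
}

Real case:

The user asks: "Summarize all the requirement changes in this project for me"
The Agent starts reading every requirements document
1 entry → 100 → 1,000 → overflow → an error is returned

With early detection it could be handled like this:
1. Detect that context usage has passed the 80% threshold
2. Trigger the "summarize and compress" logic
3. Compress the 1,000 entries into 50 core summaries
4. Continue the task

Root cause: no context length management and no smart compression strategy.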

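1.4 Deadly trap 4: no loop detection, so the Agent loops forever

// ❌ Typical mistake: no loop detection, so the Agent can get stuck in an endless loop
public Response executeWithTools(UserQuery query) {
    int maxSteps = 50;  // This limit is far too generous
    for (int i = 0; i < maxSteps; i++) {
        // The Agent can fall into: call tool A → result triggers tool B → result triggers tool A → loop
        Response stepResult = agent.step(query);
        if (stepResult.isTerminal()) {
            return stepResult;
        }
    }
    return Response.error("Maximum number of steps reached");
}

Real case:

The Agent is given the task: "Analyze this code and generate documentation for it"
Step 1: call the code parsing tool
Step 2: call the documentation tool
Step 3: the documentation is not detailed enough → call the code parsing tool
Step 4: call the documentation tool again
Step 5: still not detailed enough...
...
Step 50: the step limit is finally reached, but all the user gets back is an error message

Root cause: no loop pattern detection and no "similar operation" detection.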

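2. Principles: The Four Core Mechanisms

2.1 Layered exception handling

Agent exception handling needs three layers:

┌──────────────────────────────────────────────────────────────────┐
│ Layer 1: Fast fail                                               │
│   - Validation failure, insufficient permission → return error   │
│   - No LLM call is made, so the cost is zero                     │
├──────────────────────────────────────────────────────────────────┤
│ Layer 2: Recoverable errors (Retryable)                          │
│   - Network timeout, service temporarily unavailable → retry     │
│   - Needs a retry policy and a backoff algorithm                 │
├──────────────────────────────────────────────────────────────────┤
│ Layer 3: Unrecoverable errors (Propagate)                        │
│   - Quota exhausted, model down → log, return a friendly error   │
│   - Needs a degradation strategy and user notification           │
└──────────────────────────────────────────────────────────────────┘

Key points:

- Classify exceptions; not every exception should be retried
- Distinguish transient faults from permanent faults
- Error messages should include an actionable suggestion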
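
The classification rule above can be sketched as a small lookup. This is only an illustration: `ErrorCategory`, `ErrorClassifier`, and the mapping from JDK exception types to layers are assumptions made for this example; a real project would map its own exception types instead.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

// Hypothetical categories that mirror the three layers.
enum ErrorCategory { FAIL_FAST, RETRYABLE, UNRECOVERABLE }

final class ErrorClassifier {
    private ErrorClassifier() {}

    // Maps an exception to a layer. The mapping is an assumption for illustration.
    static ErrorCategory classify(Throwable t) {
        if (t instanceof IllegalArgumentException || t instanceof SecurityException) {
            return ErrorCategory.FAIL_FAST;       // bad input or missing permission
        }
        if (t instanceof SocketTimeoutException
                || t instanceof ConnectException
                || t instanceof TimeoutException) {
            return ErrorCategory.RETRYABLE;       // transient network or timeout fault
        }
        return ErrorCategory.UNRECOVERABLE;       // unknown errors are not retried blindly
    }

    // Each category carries an actionable hint for the caller.
    static String suggestion(ErrorCategory category) {
        switch (category) {
            case FAIL_FAST:
                return "Check the request parameters and permissions.";
            case RETRYABLE:
                return "Temporary fault; retry with backoff.";
            default:
                return "Retrying will not help; degrade and notify the user.";
        }
    }
}
```

Treating unknown exceptions as unrecoverable is a deliberately cautious default in this sketch: it avoids retrying something that will never succeed.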

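2.2 Retry strategy and circuit breaking

Retries need three layers of protection: exponential backoff, circuit breaking, and rate limiting. The configuration below covers the first two:

// Retry strategy configuration
public class RetryConfig {
    // Exponential backoff parameters
    private int baseDelayMs = 1000;        // Base delay: 1 second
    private double backoffMultiplier = 2.0; // Delay doubles after each failure
    private int maxDelayMs = 60000;        // Maximum delay: 60 seconds
    private int maxRetries = 3;            // At most 3 retries
    
    // Circuit breaker parameters
    private int circuitBreakerThreshold = 5;  // 5 failures trip the breaker
    private int circuitBreakerTimeout = 30000; // Try to recover 30 seconds after tripping
}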
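
To make the backoff parameters concrete, here is a minimal sketch of the delay schedule they produce, with optional random jitter added. `BackoffSchedule` and its method names are hypothetical, and the "full jitter" variant (a uniform random delay between 0 and the computed ceiling) is one common choice rather than something the configuration above prescribes.

```java
import java.util.Random;

// Hypothetical helper: computes retry delays from base delay, multiplier, and cap.
final class BackoffSchedule {
    private final long baseDelayMs;
    private final double multiplier;
    private final long maxDelayMs;

    BackoffSchedule(long baseDelayMs, double multiplier, long maxDelayMs) {
        this.baseDelayMs = baseDelayMs;
        this.multiplier = multiplier;
        this.maxDelayMs = maxDelayMs;
    }

    // Deterministic ceiling for a zero-based attempt: base * multiplier^attempt, capped.
    // The cap is applied in double arithmetic so large attempts cannot overflow the cast.
    long ceilingMs(int attempt) {
        double raw = baseDelayMs * Math.pow(multiplier, attempt);
        return (long) Math.min(raw, (double) maxDelayMs);
    }

    // Full jitter: a uniform random delay in [0, ceiling], so that many clients
    // that failed at the same moment do not all retry at the same moment.
    long jitteredMs(int attempt, Random random) {
        long ceiling = ceilingMs(attempt);
        return (long) (random.nextDouble() * (ceiling + 1));
    }
}
```

With the values from the configuration (1000 ms base, multiplier 2.0, 60000 ms cap), the ceilings for attempts 0, 1, and 2 are 1000 ms, 2000 ms, and 4000 ms, and any attempt from 6 onward is capped at 60000 ms.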

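Circuit breaker state machine:

CLOSED (normal) → failure count reaches the threshold → OPEN (tripped) → timeout elapses → HALF_OPEN (probing) →
  probe succeeds → CLOSED (recovered)
  probe fails → OPEN (stays tripped)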
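
The transitions in this diagram can be written down as a pure function, which is a convenient way to reason about (and unit test) the state machine separately from counters and timers. The names `BreakerState`, `BreakerEvent`, and `BreakerTransitions` are hypothetical and exist only for this sketch.

```java
// Hypothetical names; the states match the diagram, the events label its arrows.
enum BreakerState { CLOSED, OPEN, HALF_OPEN }

enum BreakerEvent { FAILURE_THRESHOLD_REACHED, COOL_DOWN_ELAPSED, PROBE_SUCCEEDED, PROBE_FAILED }

final class BreakerTransitions {
    private BreakerTransitions() {}

    // Returns the next state; events that do not apply to the current state leave it unchanged.
    static BreakerState next(BreakerState state, BreakerEvent event) {
        switch (state) {
            case CLOSED:
                return event == BreakerEvent.FAILURE_THRESHOLD_REACHED
                        ? BreakerState.OPEN : state;
            case OPEN:
                return event == BreakerEvent.COOL_DOWN_ELAPSED
                        ? BreakerState.HALF_OPEN : state;
            case HALF_OPEN:
                if (event == BreakerEvent.PROBE_SUCCEEDED) {
                    return BreakerState.CLOSED;
                }
                if (event == BreakerEvent.PROBE_FAILED) {
                    return BreakerState.OPEN;
                }
                return state;
            default:
                throw new IllegalStateException("Unknown state: " + state);
        }
    }
}
```

Keeping the transition table free of side effects means the thresholds and timers only decide which event to raise, not what the event does.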

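2.3 Context management strategy

Context management needs three mechanisms:

- Length check: monitor the token count in real time
- Compression: summarize or trim once the threshold is exceeded
- Tiered storage: separate short-term memory, long-term memory, and the current context

public class ContextManager {
    // Trigger threshold (80% of the LLM context window)
    private static final double TRIGGER_THRESHOLD = 0.8;
    
    // Share of messages kept after compression
    private static final double COMPRESSION_RATIO = 0.5;
    
    public List<Message> manageContext(List<Message> history, int maxTokens) {
        int currentTokens = estimateTokenCount(history);
        int threshold = (int)(maxTokens * TRIGGER_THRESHOLD);
        
        if (currentTokens > threshold) {
            // Trigger compression
            return compressContext(history, COMPRESSION_RATIO);
        }
        return history;
    }
    
    private List<Message> compressContext(List<Message> history, double ratio) {
        if (history.size() <= 1) {
            return history;  // Nothing to drop
        }
        // Keep the system prompt at the head plus the most recent `ratio` share of messages.
        // This is plain truncation by message count; no summary is generated here.
        int keepCount = Math.min((int)(history.size() * ratio), history.size() - 1);
        int dropped = history.size() - 1 - keepCount;  // The system prompt is not counted
        List<Message> compressed = new ArrayList<>();
        
        // Keep the system prompt (assumed to be the first message)
        compressed.add(history.get(0));
        // Insert a marker stating how many early messages were dropped
        compressed.add(new Message("system", 
            String.format("[%d earlier messages omitted]", dropped)));
        // Keep the most recent messages
        compressed.addAll(history.subList(history.size() - keepCount, history.size()));
        
        return compressed;
    }
}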
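
The class above trims by message count, which does not by itself guarantee that the result fits the token budget. As a complement, here is a minimal sketch that trims by tokens instead: it keeps a leading system message and then as many of the newest messages as fit. `ChatMessage`, `TokenBudgetTrimmer`, and the injected token counter are hypothetical stand-ins for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical minimal message type for this sketch.
final class ChatMessage {
    final String role;
    final String content;

    ChatMessage(String role, String content) {
        this.role = role;
        this.content = content;
    }
}

final class TokenBudgetTrimmer {
    private TokenBudgetTrimmer() {}

    // Keeps the leading system message (if present) and the longest suffix of
    // recent messages whose total cost stays within budgetTokens.
    static List<ChatMessage> trim(List<ChatMessage> history, int budgetTokens,
                                  ToIntFunction<ChatMessage> tokenCounter) {
        List<ChatMessage> result = new ArrayList<>();
        if (history.isEmpty()) {
            return result;
        }
        int start = 0;
        int used = 0;
        if ("system".equals(history.get(0).role)) {
            result.add(history.get(0));
            used += tokenCounter.applyAsInt(history.get(0));
            start = 1;
        }
        // Walk backwards from the newest message and stop at the first one that does not fit.
        int firstKept = history.size();
        for (int i = history.size() - 1; i >= start; i--) {
            int cost = tokenCounter.applyAsInt(history.get(i));
            if (used + cost > budgetTokens) {
                break;
            }
            used += cost;
            firstKept = i;
        }
        result.addAll(history.subList(firstKept, history.size()));
        return result;
    }
}
```

Passing the token counter in as a function keeps the sketch independent of any particular tokenizer; a real tokenizer can be plugged in where the example would use a rough estimate.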

2.4 Loop detection

Loop detection has to recognize three patterns:

Pattern 1: tool-call loop
  tool A → tool B → tool A → tool B (alternating repetition)

Pattern 2: result non-convergence loop
  result 1 → parse → call again → result 2 → parse → call again (the result keeps changing)

Pattern 3: similar operation repeated
  operation X(param A) → operation X(param A) → operation X(param A) (exactly the same)

Detection algorithm:

public class LoopDetector {
    private static final int WINDOW_SIZE = 5;  // Detection window size
    private static final double SIMILARITY_THRESHOLD = 0.85;  // Similarity threshold
    
    public boolean isLooping(List<StepResult> recentSteps) {
        if (recentSteps.size() < WINDOW_SIZE) {
            return false;
        }
        
        // Pattern 1: alternating tool calls (detectAlternatingPattern is not shown here)
        if (detectAlternatingPattern(recentSteps)) {
            return true;
        }
        
        // Pattern 3: similar operation repeated
        // (pattern 2, non-convergence, is not checked in this simplified version)
        if (detectSimilarOperations(recentSteps)) {
            return true;
        }
        
        return false;
    }
    
    private boolean detectSimilarOperations(List<StepResult> steps) {
        List<StepResult> window = steps.subList(steps.size() - WINDOW_SIZE, steps.size());
        
        // A single highly similar adjacent pair is enough to flag a repeat
        for (int i = 0; i < window.size() - 1; i++) {
            double similarity = calculateSimilarity(window.get(i), window.get(i + 1));
            if (similarity > SIMILARITY_THRESHOLD) {
                return true;  // Repeated operation found
            }
        }
        return false;
    }
    
    private double calculateSimilarity(StepResult a, StepResult b) {
        // Similarity based on tool name, action type, and parameters
        // (calculateParamSimilarity is not shown here)
        int sameTool = a.getToolName().equals(b.getToolName()) ? 1 : 0;
        int sameAction = a.getActionType().equals(b.getActionType()) ? 1 : 0;
        double paramSimilarity = calculateParamSimilarity(a.getParams(), b.getParams());
        
        return (sameTool + sameAction + paramSimilarity) / 3.0;
    }
}
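
Pattern 3 in its strictest form (the exact same call repeated) can also be caught without a similarity score, by counting identical call fingerprints in a sliding window. The sketch below is an illustration under assumptions: `RepeatCallDetector` is a hypothetical name, parameters are assumed to be a non-null map with non-null keys, and parameter values are compared through their `toString()` output.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical detector: flags a call once its fingerprint has been seen
// maxRepeats times within the last windowSize calls.
final class RepeatCallDetector {
    private final int windowSize;
    private final int maxRepeats;
    private final Deque<String> window = new ArrayDeque<>();

    RepeatCallDetector(int windowSize, int maxRepeats) {
        this.windowSize = windowSize;
        this.maxRepeats = maxRepeats;
    }

    // Records one call and reports whether it now counts as a repeat loop.
    boolean record(String toolName, Map<String, ?> params) {
        String fingerprint = fingerprint(toolName, params);
        window.addLast(fingerprint);
        if (window.size() > windowSize) {
            window.removeFirst();  // slide the window
        }
        int count = 0;
        for (String seen : window) {
            if (seen.equals(fingerprint)) {
                count++;
            }
        }
        return count >= maxRepeats;
    }

    // A TreeMap sorts the keys, so parameter order does not change the fingerprint.
    static String fingerprint(String toolName, Map<String, ?> params) {
        return toolName + new TreeMap<String, Object>(params);
    }
}
```

An exact-match check like this is cheap and has few false positives, so it can run on every step, with the similarity-based check reserved for near-duplicates.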

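3. Solution: Implementing the Four Engineering Capabilities

3.1 Exception handling implementation

// `log` is assumed to be an SLF4J logger (for example via Lombok's @Slf4j)
@Service
public class AgentExceptionHandler {
    
    @Autowired
    private AlertService alertService;
    
    /**
     * Layered exception handling
     */
    public Response handleAgentException(Exception e, AgentContext context) {
        if (e instanceof ValidationException) {
            // Layer 1: fast fail on parameter validation errors
            return Response.error("Invalid parameter: " + e.getMessage())
                    .withSuggestion("Please check the input parameters")
                    .withErrorCode("VALIDATION_ERROR");
        }
        
        if (e instanceof RateLimitException) {
            // Layer 2: rate limited, retry after backing off
            return handleRateLimit(e, context);
        }
        
        if (e instanceof QuotaExceededException) {
            // Layer 3: quota exhausted, unrecoverable, degrade
            return handleQuotaExceeded(e, context);
        }
        
        if (e instanceof ModelUnavailableException) {
            // Layer 3: model unavailable, switch models
            return handleModelFailure(e, context);
        }
        
        // Unknown exception: log, alert, and return a friendly message
        log.error("Unexpected Agent exception", e);
        alertService.sendAlert("Agent exception", e.getMessage(), AlertLevel.HIGH);
        
        return Response.error("The system is busy, please try again later")
                .withSuggestion("If the problem persists, please contact technical support")
                .withErrorCode("UNKNOWN_ERROR");
    }
    
    private Response handleRateLimit(Exception e, AgentContext context) {
        RateLimitException rle = (RateLimitException) e;
        long retryAfter = rle.getRetryAfterMs();
        
        return Response.retry()
                .withDelay(retryAfter)
                .withSuggestion("Rate limited; retrying automatically in " + (retryAfter / 1000) + " seconds")
                .withErrorCode("RATE_LIMIT");
    }
    
    private Response handleQuotaExceeded(Exception e, AgentContext context) {
        // Degrade: use a lightweight model or return a cached result
        return Response.degraded()
                .withMessage("The service quota is used up; switched to degraded mode")
                .withSuggestion("A detailed answer will be provided once the quota is restored")
                .withErrorCode("QUOTA_EXCEEDED");
    }
    
    private Response handleModelFailure(Exception e, AgentContext context) {
        // Not shown in the original; assumed here to degrade in the same way
        // as the quota case, for example by switching to a backup model
        return Response.degraded()
                .withMessage("The primary model is unavailable; switched to a backup model")
                .withErrorCode("MODEL_UNAVAILABLE");
    }
}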

3.2 Retry and circuit breaker implementation

@Component
public class ResilientLLMClient {
    
    // The underlying LLM client; the field is assumed to be injected like this
    @Autowired
    private LLMClient llm;
    
    private CircuitBreaker circuitBreaker;
    private RetryStrategy retryStrategy;
    
    public ResilientLLMClient() {
        this.circuitBreaker = new CircuitBreaker(
            5,      // Failure threshold
            30000,  // Open duration in ms
            2       // Successes required in the half-open state
        );
        this.retryStrategy = new ExponentialBackoffRetry(1000, 2.0, 60000, 3);
    }
    
    public Response invokeWithResilience(Prompt prompt) {
        // Check the circuit breaker first
        if (circuitBreaker.isOpen()) {
            log.warn("Circuit breaker is open, using fallback");
            return invokeFallback(prompt);  // invokeFallback is assumed to be defined elsewhere
        }
        
        try {
            return retryStrategy.execute(() -> doInvoke(prompt));
        } catch (RetryExhaustedException e) {
            // Retries exhausted: count one failure toward tripping the breaker
            circuitBreaker.recordFailure();
            throw e;
        }
    }
    
    private Response doInvoke(Prompt prompt) {
        // A RetryableException propagates to the RetryStrategy, which retries it;
        // a NonRetryableException propagates straight to the caller.
        Response response = llm.invoke(prompt);
        circuitBreaker.recordSuccess();  // Success: lets a half-open breaker move toward CLOSED
        return response;
    }
}

import java.util.function.Supplier;

/**
 * Exponential backoff retry strategy
 */
public class ExponentialBackoffRetry implements RetryStrategy {
    
    private final int baseDelayMs;
    private final double multiplier;
    private final int maxDelayMs;
    private final int maxRetries;
    
    public ExponentialBackoffRetry(int baseDelayMs, double multiplier, 
                                   int maxDelayMs, int maxRetries) {
        this.baseDelayMs = baseDelayMs;
        this.multiplier = multiplier;
        this.maxDelayMs = maxDelayMs;
        this.maxRetries = maxRetries;
    }
    
    @Override
    public <T> T execute(Supplier<T> operation) throws RetryExhaustedException {
        Exception lastException = null;
        
        // One initial attempt plus up to maxRetries retries
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return operation.get();
            } catch (RetryableException e) {
                lastException = e;
                
                if (attempt == maxRetries) {
                    break;  // Last attempt failed, do not wait again
                }
                
                // Compute the backoff delay
                long delay = calculateBackoff(attempt);
                log.warn("Attempt {} failed, retrying in {} ms", attempt + 1, delay);
                
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RetryExhaustedException("Interrupted during backoff", lastException);
                }
            }
        }
        
        throw new RetryExhaustedException("Retry exhausted after " + (maxRetries + 1) + " attempts", lastException);
    }
    
    private long calculateBackoff(int attempt) {
        long delay = (long)(baseDelayMs * Math.pow(multiplier, attempt));
        return Math.min(delay, maxDelayMs);
    }
}

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Circuit breaker implementation
 */
public class CircuitBreaker {
    private final int failureThreshold;
    private final long timeout;
    private final int successThreshold;
    
    private final AtomicInteger failureCount = new AtomicInteger(0);
    // Consecutive successes observed while HALF_OPEN
    private final AtomicInteger successCount = new AtomicInteger(0);
    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private volatile long lastFailureTime = 0;
    
    public enum State { CLOSED, OPEN, HALF_OPEN }
    
    public CircuitBreaker(int failureThreshold, long timeout, int successThreshold) {
        this.failureThreshold = failureThreshold;
        this.timeout = timeout;
        this.successThreshold = successThreshold;
    }
    
    public boolean isOpen() {
        if (state.get() == State.OPEN) {
            // Once the timeout has elapsed, move to HALF_OPEN and let a probe through
            if (System.currentTimeMillis() - lastFailureTime > timeout) {
                // compareAndSet so that only one thread performs the transition
                if (state.compareAndSet(State.OPEN, State.HALF_OPEN)) {
                    successCount.set(0);
                }
                return false;
            }
            return true;
        }
        return false;
    }
    
    public void recordFailure() {
        int failures = failureCount.incrementAndGet();
        lastFailureTime = System.currentTimeMillis();
        
        // A failure while HALF_OPEN re-opens the breaker immediately
        if (state.get() == State.HALF_OPEN || failures >= failureThreshold) {
            successCount.set(0);
            state.set(State.OPEN);
            log.warn("Circuit breaker opened after {} failures", failures);
        }
    }
    
    public void recordSuccess() {
        if (state.get() == State.HALF_OPEN) {
            // While HALF_OPEN, several successes in a row are required before closing
            if (successCount.incrementAndGet() >= successThreshold) {
                reset();
            }
        } else {
            // In the normal state, each success decays the failure count
            failureCount.updateAndGet(c -> Math.max(0, c - 1));
        }
    }
    
    private void reset() {
        failureCount.set(0);
        successCount.set(0);
        state.set(State.CLOSED);
    }
}

3.3 Context overflow protection implementation

@Component
public class ContextOverflowProtection {
    
    private static final double WARNING_THRESHOLD = 0.7;   // Warn at 70%
    private static final double CRITICAL_THRESHOLD = 0.85;  // Critical at 85%
    private static final double COMPRESSION_THRESHOLD = 0.9; // Compress at 90%
    
    @Autowired
    private LLMClient llmClient;
    
    /**
     * Context health management
     */
    public ContextManagerResult manageContext(List<Message> history, int maxTokens) {
        int currentTokens = estimateTokens(history);
        double usageRatio = (double) currentTokens / maxTokens;
        
        ContextManagerResult result = new ContextManagerResult();
        result.setMessages(history);
        result.setUsageRatio(usageRatio);
        result.setCurrentTokens(currentTokens);
        
        if (usageRatio >= COMPRESSION_THRESHOLD) {
            // Compression needed
            List<Message> compressed = compressContext(history, 0.5);
            int compressedTokens = estimateTokens(compressed);
            result.setMessages(compressed);
            result.setUsageRatio((double) compressedTokens / maxTokens);
            result.setWasCompressed(true);
            result.setCompressionRatio((double) compressedTokens / currentTokens);
            log.info("Context compressed: {} tokens -> {} tokens ({}%)", 
                currentTokens, compressedTokens, 
                (int)(result.getCompressionRatio() * 100));
        } else if (usageRatio >= CRITICAL_THRESHOLD) {
            result.setWarning("Context usage is very high; some requests may be truncated");
        } else if (usageRatio >= WARNING_THRESHOLD) {
            result.setWarning("Context usage is high; consider trimming");
        }
        
        return result;
    }
    
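    /**
     * Smart compression strategy.
     * Instead of simply truncating, it:
     * 1. Identifies key information (system prompt, important decisions)
     * 2. Keeps the most recent messages
     * 3. Turns the middle part into a summary
     */
    public List<Message> compressContext(List<Message> history, double targetRatio) {
        // targetRatio is currently unused: a fixed number of recent messages is kept
        if (history.size() <= 3) {
            return history;  // Too short to be worth compressing
        }
        
        List<Message> compressed = new ArrayList<>();
        
        // 1. Keep the system prompt
        int start = 0;
        if (history.get(0).getRole().equals("system")) {
            compressed.add(history.get(0));
            start = 1;
        }
        
        // 2. Everything between the system prompt and the 5 most recent messages is the middle
        int recentStart = Math.max(start, history.size() - 5);
        List<Message> middle = history.subList(start, recentStart);
        
        if (!middle.isEmpty()) {
            // 3. Replace the middle with a summary...
            String summary = generateSummary(middle);
            compressed.add(new Message("system", "[Summary of earlier conversation] " + summary));
            // ...but keep the messages flagged as important verbatim
            for (Message msg : middle) {
                if (isImportant(msg)) {
                    compressed.add(msg);
                }
            }
        }
        
        // 4. Keep the 5 most recent messages
        compressed.addAll(history.subList(recentStart, history.size()));
        
        return compressed;
    }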
    
    private List<MessageSegment> extractImportantSegments(List<Message> history) {
        // Collect messages that contain key decisions, configuration changes, or other important information
        List<MessageSegment> segments = new ArrayList<>();
        
        for (Message msg : history) {
            if (isImportant(msg)) {
                segments.add(new MessageSegment(msg, Importance.HIGH));
            }
        }
        
        return segments;
    }
    
    private boolean isImportant(Message msg) {
        // A message counts as important if it mentions a decision, configuration, rule, or strategy.
        // The keywords are kept in Chinese because they are matched against Chinese conversations:
        // 决定 = decision, 配置 = configuration, 规则 = rule, 策略 = strategy
        String content = msg.getContent().toLowerCase();
        return content.contains("决定") || 
               content.contains("配置") || 
               content.contains("规则") ||
               content.contains("策略");
    }
    
    private String generateSummary(List<Message> messages) {
        // Ask the LLM for a summary (lightweight prompt)
        StringBuilder sb = new StringBuilder();
        for (Message msg : messages) {
            sb.append(msg.getRole()).append(": ").append(msg.getContent()).append("\n");
        }
        
        Prompt summaryPrompt = new Prompt(
            "Compress the following conversation into a summary of at most 30 words:\n" + sb.toString()
        );
        
        try {
            Response response = llmClient.invokeLight(summaryPrompt);
            return response.getContent();
        } catch (Exception e) {
            // Fallback: a fixed placeholder instead of a real summary
            return "Several rounds of interaction covering multiple questions took place here";
        }
    }
    
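    /**
     * Token estimate (simplified; use tiktoken or an equivalent tokenizer in practice)
     */
    private int estimateTokens(List<Message> messages) {
        int total = 0;
        for (Message msg : messages) {
            // Rough and deliberately conservative heuristic (an assumption, not a tokenizer):
            // about 2 tokens per Chinese character, about 1 token per 3 other characters
            String content = msg.getContent();
            int hanChars = 0;
            for (int i = 0; i < content.length(); ) {
                int cp = content.codePointAt(i);
                if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                    hanChars++;
                }
                i += Character.charCount(cp);
            }
            int otherChars = content.length() - hanChars;
            int tokens = hanChars * 2 + (otherChars + 2) / 3;
            tokens += 10;  // Fixed per-message overhead
            total += tokens;
        }
        return total;
    }
}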

3.4 Loop detection implementation

@Component
public class LoopDetectionService {
    
    private static final int DEFAULT_MAX_STEPS = 30;
    private static final int WINDOW_SIZE = 5;
    private static final double SIMILARITY_THRESHOLD = 0.8;
    
    @Autowired
    private MetricsService metricsService;
    
    /**
     * Detect whether the Agent is stuck in a loop
     */
    public LoopDetectionResult detectLoop(List<StepRecord> recentSteps) {
        LoopDetectionResult result = new LoopDetectionResult();
        
        if (recentSteps.size() < WINDOW_SIZE) {
            result.setLooping(false);
            return result;
        }
        
        // Pattern 1: alternating tool calls
        if (detectAlternatingTools(recentSteps)) {
            result.setLooping(true);
            result.setLoopType(LoopType.ALTERNATING_TOOLS);
            result.setReason("Alternating tool-call pattern detected");
            return result;
        }
        
        // Pattern 3: similar operations repeated
        if (detectSimilarOperations(recentSteps)) {
            result.setLooping(true);
            result.setLoopType(LoopType.SIMILAR_OPERATIONS);
            result.setReason("Similar operations executed repeatedly");
            return result;
        }
        
        // Pattern 2: results not converging
        if (detectNonConvergence(recentSteps)) {
            result.setLooping(true);
            result.setLoopType(LoopType.NON_CONVERGENCE);
            result.setReason("Results keep changing without converging");
            return result;
        }
        
        result.setLooping(false);
        return result;
    }
    
    // getWindow, calculateVariance, and calculateTrend are helpers assumed to be defined elsewhere
    private boolean detectAlternatingTools(List<StepRecord> steps) {
        List<StepRecord> window = getWindow(steps, WINDOW_SIZE);
        
        // Detect repeating patterns such as A-B-A-B or A-B-C-A-B
        List<String> toolSequence = window.stream()
            .map(StepRecord::getToolName)
            .collect(Collectors.toList());
        
        // Only consider windows in which just 2 or 3 distinct tools appear
        long distinctTools = toolSequence.stream().distinct().count();
        if (distinctTools <= 3 && distinctTools >= 2) {
            // Look for the same adjacent pair of tools occurring again later in the window
            for (int i = 0; i < window.size() - 3; i++) {
                for (int j = i + 2; j < window.size() - 1; j++) {
                    if (toolSequence.get(i).equals(toolSequence.get(j))
                            && toolSequence.get(i + 1).equals(toolSequence.get(j + 1))) {
                        return true;  // Loop pattern found
                    }
                }
            }
        }
        return false;
    }
    
    private boolean detectSimilarOperations(List<StepRecord> steps) {
        List<StepRecord> window = getWindow(steps, WINDOW_SIZE);
        
        // Compare adjacent steps
        for (int i = 0; i < window.size() - 2; i++) {
            double sim1 = calculateSimilarity(window.get(i), window.get(i + 1));
            double sim2 = calculateSimilarity(window.get(i + 1), window.get(i + 2));
            
            // Three consecutive steps are all highly similar
            if (sim1 > SIMILARITY_THRESHOLD && sim2 > SIMILARITY_THRESHOLD) {
                return true;
            }
        }
        
        return false;
    }
    
    private boolean detectNonConvergence(List<StepRecord> steps) {
        if (steps.size() < 10) {
            return false;
        }
        
        // Take the results of the last 5 steps
        List<Object> recentResults = steps.stream()
            .skip(steps.size() - 5)
            .map(StepRecord::getResult)
            .collect(Collectors.toList());
        
        // Measure how much the results vary
        double variance = calculateVariance(recentResults);
        
        // Results that keep changing a lot without any trend suggest an endless loop
        return variance > 0.5 && calculateTrend(recentResults) == 0;
    }
    
    private double calculateSimilarity(StepRecord a, StepRecord b) {
        double toolSimilarity = a.getToolName().equals(b.getToolName()) ? 1.0 : 0.0;
        double actionSimilarity = a.getActionType().equals(b.getActionType()) ? 1.0 : 0.0;
        double paramSimilarity = calculateParamSimilarity(a.getParams(), b.getParams());
        
        return (toolSimilarity * 0.5 + actionSimilarity * 0.3 + paramSimilarity * 0.2);
    }
    
    private double calculateParamSimilarity(Map<String, Object> params1, Map<String, Object> params2) {
        if (params1.isEmpty() && params2.isEmpty()) return 1.0;
        if (params1.isEmpty() || params2.isEmpty()) return 0.0;
        
        int sameKeys = 0;
        for (String key : params1.keySet()) {
            if (params2.containsKey(key)) {
                Object v1 = params1.get(key);
                Object v2 = params2.get(key);
                if (v1 != null && v1.equals(v2)) {
                    sameKeys++;
                }
            }
        }
        
        int maxSize = Math.max(params1.size(), params2.size());
        return (double) sameKeys / maxSize;
    }
    
    /**
     * Handle a detected loop
     */
    public LoopHandlingResult handleLoop(LoopDetectionResult detection, AgentContext context) {
        LoopHandlingResult result = new LoopHandlingResult();
        
        log.warn("Loop detected: {} - {}", detection.getLoopType(), detection.getReason());
        
        // Record metrics
        metricsService.recordLoopEvent(detection);
        
        switch (detection.getLoopType()) {
            case ALTERNATING_TOOLS:
                // Alternating loop: force termination and return the partial result
                result.setAction(LoopAction.TERMINATE_WITH_PARTIAL);
                result.setMessage("Alternating tool calls detected; execution was stopped. A partial result is available.");
                break;
                
            case SIMILAR_OPERATIONS:
                // Similar operations: try skipping the repeated step
                result.setAction(LoopAction.SKIP_AND_CONTINUE);
                result.setMessage("Repeated similar operations detected; the repeat was skipped and execution continues.");
                break;
                
            case NON_CONVERGENCE:
                // No convergence: return the state closest to success
                result.setAction(LoopAction.RETURN_BEST_STATE);
                result.setMessage("The task did not converge; the best result so far has been returned.");
                break;
                
            default:
                result.setAction(LoopAction.TERMINATE);
                result.setMessage("Abnormal execution detected; the task was terminated.");
        }
        
        return result;
    }
}

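4. Interview Guide

4.1 Frequently asked interview questions

Q1: How does exception handling in an Agent differ from an ordinary backend service?

Reference answer:

In an ordinary backend service, exception handling aims at a clear outcome: "success or failure".
In an Agent, exception handling aims to "let the Agent go on and finish the task".

Key differences:
1. Finer classification: transient faults (retry), insufficient resources (degrade), quota exhausted (queue)
2. Execution state must be preserved: an Agent has intermediate steps and cannot simply roll back
3. Degradation matters more: a lightweight model, cached results, or partial completion can all serve as a response
4. User experience matters more: never stop abruptly, always report progress

Q2: How do you design a sensible retry strategy?

Reference answer:

Key points of retry strategy design:
1. Classify exceptions:
   - Retryable: network timeout, server temporarily unavailable, rate limiting
   - Not retryable: invalid parameters, insufficient permission, 404

2. Backoff algorithm:
   - Exponential backoff: 1s → 2s → 4s → 8s, so a struggling service gets progressively more room to recover
   - Jitter: add random variation so that multiple instances do not retry at the same moment (the thundering herd effect)

3. Circuit breaker protection:
   - Failure count exceeds the threshold → open the circuit breaker
   - While open, fail fast without consuming resources
   - After the timeout, try to recover (half-open state)

4. Work together with rate limiting:
   - When rate limited, do not just retry; also control concurrency
   - A semaphore or a token bucket can cap the number of concurrent retries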
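
Point 4 can be illustrated with a semaphore that caps how many retries run at once. This is a minimal sketch, assuming retries that cannot get a permit should fall back immediately instead of queuing; `RetryBulkhead` and its methods are hypothetical names.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: at most maxConcurrentRetries retries may run at the same time.
final class RetryBulkhead {
    private final Semaphore permits;

    RetryBulkhead(int maxConcurrentRetries) {
        this.permits = new Semaphore(maxConcurrentRetries);
    }

    // Runs the retry if a permit is free; otherwise returns the fallback right away,
    // so a burst of failures cannot turn into a burst of retries.
    <T> T tryRetry(Supplier<T> retry, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return retry.get();
        } finally {
            permits.release();  // always give the permit back
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A token bucket would limit the retry rate over time instead of the number in flight; which one fits better depends on whether the downstream limit is expressed as concurrency or as requests per second.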

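Q3: How do you handle context overflow?

Reference answer:

Context overflow handling (in order of priority):
1. Prevention: monitor usage and warn early (warn at 80%)
2. Trigger compression: start smart compression above 90%
3. Compression strategy:
   - Keep the system prompt and important configuration
   - Keep the most recent N rounds of conversation
   - Have the LLM summarize the middle part
4. Truncation: if the context is still over the limit after compression, drop the oldest messages
5. Archiving: move the overflow to external storage so it can be retrieved on demand

Q4: How do you detect an Agent stuck in an infinite loop?

Reference answer:

Loop detection mechanisms:
1. Step limit: a hard limit; terminate as soon as it is exceeded
2. Tool-call loop detection: detect alternating patterns such as A→B→A→B
3. Similar operation detection: similar operations executed several times in a row
4. Result convergence detection: results keep changing without converging
5. Time limit: force termination when a single run exceeds the threshold

What to do once a loop is detected:
- Log the details for later analysis
- Return a partial result (never come back empty-handed)
- Record anomaly metrics for monitoring
- Optional: try an automatic repair strategy (such as skipping the repeated step)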
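
The two hard limits in this answer (the step limit and the time limit) can be combined into a single guard that the Agent loop consults before every step. The sketch below is an illustration under assumptions: `RunBudget` and `BudgetVerdict` are hypothetical names, and the clock is injected so that the time limit can be tested without waiting.

```java
import java.util.function.LongSupplier;

// Hypothetical verdicts returned by the guard.
enum BudgetVerdict { CONTINUE, STEP_LIMIT_REACHED, TIME_LIMIT_REACHED }

// Hypothetical guard combining a maximum step count with a wall-clock deadline.
final class RunBudget {
    private final int maxSteps;
    private final long deadlineMs;
    private final LongSupplier clock;  // returns the current time in milliseconds
    private int stepsTaken = 0;

    RunBudget(int maxSteps, long timeoutMs, LongSupplier clock) {
        this.maxSteps = maxSteps;
        this.clock = clock;
        this.deadlineMs = clock.getAsLong() + timeoutMs;
    }

    // Call before each step; a step is only counted when the verdict is CONTINUE.
    BudgetVerdict beforeStep() {
        if (stepsTaken >= maxSteps) {
            return BudgetVerdict.STEP_LIMIT_REACHED;
        }
        if (clock.getAsLong() >= deadlineMs) {
            return BudgetVerdict.TIME_LIMIT_REACHED;
        }
        stepsTaken++;
        return BudgetVerdict.CONTINUE;
    }

    int stepsTaken() {
        return stepsTaken;
    }
}
```

In production the clock would typically be `System::currentTimeMillis`, for example `new RunBudget(30, 60_000, System::currentTimeMillis)`; the loop stops and returns its partial result as soon as `beforeStep()` returns anything other than `CONTINUE`.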

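4.2 Salary negotiation tips

When the interviewer asks about high-availability systems you have built:

"In my last project I built an exception handling system for our Agent, with layered exception handling, exponential backoff retries, and circuit breaker protection. After launch, Agent availability rose from 92% to 99.5%, and the retry success rate for errors such as rate limiting rose from 60% to 95%. I also implemented context overflow protection and loop detection, which prevented the system crashes caused by the Agent entering infinite loops."

Key data points:

- Availability improvement
- Retry success rate
- Recovery time after a failure
- Reduction in the number of system crashes

When the interviewer asks about the challenges you faced:

"The biggest challenge was the Agent's infinite-loop problem. At first we set a hard limit of 30 steps, but runs often hit the limit without finishing the task. I then designed a loop detection mechanism that analyzes the tool-call sequence, operation similarity, and result convergence to catch loops early. Combined with smart skipping and returning partial results, it kept the system stable while still delivering as much value to the user as possible."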

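5. Summary: Core Points of Agent Engineering

┌────────────────────────────────────────────────────────────────────┐
│                  Core Points of Agent Engineering                  │
├────────────────────────────────────────────────────────────────────┤
│ 1. Layered exception handling                                      │
│    Fast fail → recoverable retry → unrecoverable degradation       │
├────────────────────────────────────────────────────────────────────┤
│ 2. Retry and circuit breaking                                      │
│    Exponential backoff + jitter + circuit breaker state machine    │
├────────────────────────────────────────────────────────────────────┤
│ 3. Context management                                              │
│    Length check → smart compression → tiered storage               │
├────────────────────────────────────────────────────────────────────┤
│ 4. Loop detection                                                  │
│    Tool alternation + similar ops + convergence + step/time limits │
└────────────────────────────────────────────────────────────────────┘

Interview bonus points:

- Can draw the state machine diagrams for exception handling and circuit breaking
- Can quote concrete threshold parameters (such as the backoff multiplier and the circuit breaker timeout)
- Has real experience with production incidents and their fixes
- Knows industry best practice (such as the design of Resilience4j and Hystrix)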

This article accompanies the "Java Agent Architect Bootcamp".