Design and Implementation of a Traceable Legal Document Summarization System Based on Hybrid Semantic Source Attribution

2301_81687591 · Published 2026-04-08 00:15:00

Core technique: LLM prompt annotation + OpenAI embedding cosine similarity = dual-channel fused source attribution
Inspired by: ALCE (EMNLP 2023) + RARR (ACL 2023)
Development environment: Python 3.11 + FastAPI + Vue 3 + OpenAI API

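1. Introduction: Why Do Legal Document Summaries Need Source Attribution?

In legal AI applications, hallucination is the most damaging failure mode. When an AI summarizer outputs a sentence such as "the court held the contract valid", lawyers and judges need to verify it immediately: where in the original text does that statement come from? A summary that cannot be traced back to its source has little value in legal practice.

Pain points of traditional summarization systems:

Trust gap: generated key points carry no citations, so users hesitate to rely on them
Low efficiency: users must search dozens of pages of source text by hand to find the matching passage
Liability risk: when a wrong citation leads to legal consequences, there is no trail showing where it came from

The Solution in This Article

We designed a hybrid attribution architecture: each summary key point is automatically linked to specific passages in the source text, and the user can click a key point to jump to the highlighted passage.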

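2. Comparing and Selecting an Approach

Approach	Paper	Core idea	Strengths	Weaknesses	Needs GPU
NLI attribution	ALCE, EMNLP 2023	Use an NLI model to check whether the source entails the summary	Highest precision	Requires an NLI classifier	Yes
Cross-Encoder	RARR, ACL 2023	Score summary-source pairs with a cross-encoder	High recall	Computationally expensive	Yes
Attention attribution	N/A	Locate sources through the Transformer attention matrix	Strong interpretability	White-box models only	Yes
Hybrid attribution (this article)	Combines ALCE + RARR	LLM annotation + embedding similarity	No GPU, API-friendly	Depends on embedding quality	No

Final choice: hybrid attribution (dual-channel fusion). It balances precision with engineering feasibility, and as a pure API solution it needs no GPU.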

3. System Architecture

                 ┌─────────────────────┐
                 │  User uploads PDF   │
                 └──────────┬──────────┘
                            ▼
                 ┌─────────────────────┐
                 │  Chunk and number   │
                 │ [Block 0] [Block 1] │
                 └──────────┬──────────┘
                            ▼
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
 ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
 │ RAG retrieval │  │  LLM summary  │  │   Embedding   │
 │  (ChromaDB)   │  │    (GPT-4)    │  │   (3-small)   │
 └───────┬───────┘  └───────┬───────┘  └───────┬───────┘
         │                  │                  │
         └──────────────────┼──────────────────┘
                            ▼
              ┌───────────────────────────┐
              │   LLM source annotation   │ ← Channel 1
              └─────────────┬─────────────┘
                            ▼
              ┌───────────────────────────┐
              │ Semantic cosine matching  │ ← Channel 2
              └─────────────┬─────────────┘
                            ▼
              ┌───────────────────────────┐
              │    Dual-channel fusion    │
              │     → source_mappings     │
              └─────────────┬─────────────┘
                            ▼
              ┌───────────────────────────┐
              │ Click-to-jump + highlight │
              └───────────────────────────┘

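4. Core Algorithms in Detail

4.1 Numbering the Document Blocks

The source text is split into numbered blocks, which are embedded in the LLM prompt:

def _build_numbered_blocks(doc_id: str, max_blocks: int = 60) -> str:
    """Build the numbered source-block text that is embedded in the prompt."""
    # get_blocks_by_doc is the project's storage helper (not shown here);
    # it is assumed to return blocks in document order.
    blocks = get_blocks_by_doc(doc_id)
    if not blocks:
        return ""
    lines = []
    for i, b in enumerate(blocks[:max_blocks]):
        # Blocks may arrive as dicts or as ORM-style objects.
        content = b['content'] if isinstance(b, dict) else b.content
        lines.append(f"[Block {i}] {content}")
    return "\n".join(lines)

Design considerations:

max_blocks=60 caps the prompt length so that it stays within the LLM context window
Numbering starts at 0 so that block numbers line up with the database indices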
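
The article does not show how the blocks themselves are produced. As a minimal sketch, assuming blocks are paragraphs separated by blank lines and that a plain character budget stands in for a token limit (`split_into_blocks`, `number_blocks`, and `max_chars` are hypothetical and not part of the system's code):

```python
def split_into_blocks(text: str) -> list[str]:
    """Split raw text into paragraph blocks on blank lines (assumed chunking rule)."""
    blocks = []
    for para in text.split("\n\n"):
        cleaned = " ".join(para.split())  # collapse internal whitespace and newlines
        if cleaned:
            blocks.append(cleaned)
    return blocks


def number_blocks(blocks: list[str], max_blocks: int = 60, max_chars: int = 12000) -> str:
    """Prefix each block with [Block i]; stop at max_blocks or the character budget."""
    lines = []
    used = 0
    for i, content in enumerate(blocks[:max_blocks]):
        line = f"[Block {i}] {content}"
        if used + len(line) > max_chars:
            break  # stop before exceeding the assumed prompt budget
        lines.append(line)
        used += len(line) + 1  # +1 for the joining newline
    return "\n".join(lines)


sample = (
    "The parties signed the contract.\n\n\n"
    "The defendant\nfailed to pay.\n\n"
    "The court held the contract valid."
)
numbered = number_blocks(split_into_blocks(sample))
print(numbered)
```

Because the numbering is positional, any block dropped by the budget is simply never offered to the LLM, so the indices it can cite stay valid for the blocks that remain.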

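4.2 Channel 1: LLM Prompt Annotation

A source-annotation instruction is added to the role prompt. The marker 来源 means "source"; it is kept verbatim here because the parser matches it literally:

_SOURCE_INSTRUCTION = """
## Source annotation requirements
In the "Key Elements" section, append the source block numbers in square brackets after every key point,
using the format [来源: X] or [来源: X, Y], where X and Y are source block numbers (starting from 0).
For example:
- The plaintiff claims RMB 500,000 in damages from the defendant [来源: 3, 5]
- The court held the contract valid [来源: 12]
"""

Parsing the source annotations out of the LLM output: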

import re


def _parse_llm_source_mappings(key_points: list[str]) -> tuple[list[str], list[dict]]:
    """Extract mappings from the [来源: X, Y] tags in the LLM output.

    Returns the key points with tags stripped, plus one mapping per tagged point.
    """
    clean_points = []
    mappings = []
    pattern = re.compile(r'\[来源:\s*([\d,\s]+)\]')
    for i, point in enumerate(key_points):
        m = pattern.search(point)
        block_indices = []
        if m:
            # "3, 5" -> [3, 5]; empty or non-numeric fragments are skipped
            nums = m.group(1).split(',')
            block_indices = [int(n.strip()) for n in nums if n.strip().isdigit()]
            clean_text = pattern.sub('', point).strip()
        else:
            clean_text = point.strip()
        clean_points.append(clean_text)
        if block_indices:
            mappings.append({"point_index": i, "block_indices": block_indices})
    return clean_points, mappings

Limitations: LLM annotations are not always accurate. The model sometimes cites the wrong block number, and sometimes omits the annotation altogether. That is why a second channel is needed.
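
One way to contain these failure modes before fusion is to normalise and validate whatever the model cites. The sketch below is an assumption rather than part of the original system: `extract_cited_blocks` is a hypothetical helper that also accepts full-width punctuation (`：`, `，`), which Chinese-language model output often contains, reads every tag in the line, and discards indices that do not exist in the document.

```python
import re

# Accept an ASCII or full-width colon/comma inside the [来源: ...] marker.
SOURCE_TAG = re.compile(r"\[来源\s*[:：]\s*([\d,，\s]+)\]")


def extract_cited_blocks(point: str, num_blocks: int) -> tuple[str, list[int]]:
    """Return (text without tags, valid de-duplicated block indices)."""
    indices: list[int] = []
    for match in SOURCE_TAG.finditer(point):
        for token in re.split(r"[,，\s]+", match.group(1)):
            if token.isdigit():
                idx = int(token)
                # Drop out-of-range citations and repeats.
                if idx < num_blocks and idx not in indices:
                    indices.append(idx)
    clean_text = SOURCE_TAG.sub("", point).strip()
    return clean_text, indices


print(extract_cited_blocks("The court held the contract valid [来源：12，99, 12]", num_blocks=60))
```

A point whose citations are all rejected ends up with an empty list, which leaves it to the semantic channel.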

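4.3 Channel 2: Semantic Similarity Matching

This is the core of the system: OpenAI embeddings are used to compute the cosine similarity between each summary key point and every source block:

import logging

import numpy as np
from openai import OpenAI

logger = logging.getLogger(__name__)

# In-memory cache: text -> embedding vector
_embedding_cache: dict[str, list[float]] = {}


def _get_embeddings(texts: list[str], client: OpenAI | None = None) -> list[list[float]]:
    """Fetch OpenAI embeddings in batches, with an in-memory cache."""
    if client is None:
        client = _get_client()  # project helper that builds the OpenAI client (not shown)
    # De-duplicate while preserving order so each uncached text is requested once.
    uncached = [t for t in dict.fromkeys(texts) if t not in _embedding_cache]
    for start in range(0, len(uncached), 100):
        batch = uncached[start:start + 100]
        try:
            resp = client.embeddings.create(
                input=batch,
                model="text-embedding-3-small",
            )
        except Exception:
            # Do not cache failures: a transient API error should not
            # permanently map these texts to an empty vector.
            logger.exception("Embedding request failed for %d texts", len(batch))
            continue
        for text, emb_data in zip(batch, resp.data):
            _embedding_cache[text] = emb_data.embedding
    # Texts whose request failed come back as [] and are skipped downstream.
    return [_embedding_cache.get(t, []) for t in texts]


def _cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity = dot(a, b) / (||a|| * ||b||); 0.0 for empty or zero vectors."""
    if not a or not b:
        return 0.0
    a_arr = np.array(a)
    b_arr = np.array(b)
    dot = np.dot(a_arr, b_arr)
    norm_a = np.linalg.norm(a_arr)
    norm_b = np.linalg.norm(b_arr)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(dot / (norm_a * norm_b))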

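Key parameters:

Model: text-embedding-3-small (1536 dimensions by default, one of OpenAI's third-generation embedding models)
Batching: 100 texts per request, to keep each API call small and stay clear of request limits
Caching: a module-level in-memory dict, so the same text is not embedded twice

Core semantic-matching logic: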
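
The batching and caching behaviour can be checked without calling the API. The following is a minimal sketch under stated assumptions: `embed_with_cache` is a hypothetical, simplified variant of the function above, and `fake_embed_batch` is a stand-in for the embeddings endpoint that only records how many texts each request carried.

```python
from typing import Callable

Vector = list[float]


def embed_with_cache(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    cache: dict[str, Vector],
    batch_size: int = 100,
) -> list[Vector]:
    """Embed texts, calling embed_batch only for texts missing from the cache."""
    # dict.fromkeys de-duplicates while keeping first-seen order.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    for start in range(0, len(missing), batch_size):
        batch = missing[start:start + batch_size]
        for text, vector in zip(batch, embed_batch(batch)):
            cache[text] = vector
    return [cache[t] for t in texts]


calls: list[int] = []


def fake_embed_batch(batch: list[str]) -> list[Vector]:
    """Stand-in for the embeddings API: records the batch size, returns toy vectors."""
    calls.append(len(batch))
    return [[float(len(t)), 1.0] for t in batch]


cache: dict[str, Vector] = {}
texts = [f"block {i}" for i in range(250)]
embed_with_cache(texts, fake_embed_batch, cache)
print(calls)  # three requests: 100, 100 and 50 texts
embed_with_cache(texts + ["new point"], fake_embed_batch, cache)
print(calls)  # only the single uncached text triggers a fourth request
```

Because the cache is keyed by the text itself, regenerating a summary for the same document re-embeds only the new key points, not the source blocks.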

def _semantic_source_mapping(
    key_points: list[str],
    doc_id: str,
    client: OpenAI | None = None,
    top_k: int = 3,
    threshold: float = 0.45,
) -> list[dict]:
    """
    Semantic attribution: compute the cosine similarity between each key
    point and every source block, then keep up to top_k blocks whose
    similarity is >= threshold as that point's sources.
    """
    blocks = get_blocks_by_doc(doc_id)
    if not blocks or not key_points:
        return []

    block_texts = [b['content'] if isinstance(b, dict) else b.content for b in blocks]

    # Embed everything (key points + source blocks) in one batched call
    all_embeddings = _get_embeddings(key_points + block_texts, client)
    point_embeddings = all_embeddings[:len(key_points)]
    block_embeddings = all_embeddings[len(key_points):]

    mappings = []
    for i, point_emb in enumerate(point_embeddings):
        if not point_emb:
            continue  # no embedding available for this key point
        # Similarity against every block that has an embedding
        similarities = [(j, _cosine_similarity(point_emb, block_emb))
                        for j, block_emb in enumerate(block_embeddings) if block_emb]

        # Sort descending, then keep the top_k entries that clear the threshold
        similarities.sort(key=lambda x: x[1], reverse=True)
        selected = [(idx, sim) for idx, sim in similarities[:top_k] if sim >= threshold]

        if selected:
            mappings.append({
                "point_index": i,
                "block_indices": [idx for idx, _ in selected],
                "scores": [round(sim, 4) for _, sim in selected],
            })
    return mappings

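Why threshold=0.45?

In legal text, a summary key point and its source block never overlap perfectly in meaning, because the summary paraphrases and condenses
In the author's testing, 0.45 preserved recall while filtering out clearly unrelated passages
Too high (>0.7) misses passages that are referenced only indirectly
Too low (<0.3) lets in noise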
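
To see how top_k and the threshold interact, consider a toy example. The similarity values are made up for illustration, and `select_sources` is a hypothetical helper that mirrors the selection rule described above: sort descending, take the first top_k, keep those at or above the threshold.

```python
def select_sources(scores: list[float], top_k: int = 3, threshold: float = 0.45) -> list[int]:
    """Return indices of the top_k highest scores that are also >= threshold."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [j for j in ranked[:top_k] if scores[j] >= threshold]


# Made-up similarities between one key point and six source blocks.
toy_scores = [0.21, 0.62, 0.38, 0.12, 0.74, 0.29]

for t in (0.30, 0.45, 0.70):
    print(t, select_sources(toy_scores, threshold=t))
# 0.3  -> [4, 1, 2]
# 0.45 -> [4, 1]
# 0.7  -> [4]
```

Block 2 (0.38) is the kind of borderline match the threshold decides on: a looser cut-off admits it, while 0.45 keeps only the two clearly related blocks.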

4.4 Dual-Channel Fusion

Finally, the results of the two channels are merged:

MAX_SOURCES_PER_POINT = 5


def _merge_source_mappings(llm_mappings: list[dict], semantic_mappings: list[dict]) -> list[dict]:
    """
    Fusion strategy:
    1. Start from the semantic-matching results
    2. Treat LLM annotations as the higher-confidence signal: their blocks
       are listed first, followed by any extra semantic matches
    3. De-duplicate and cap at 5 sources per key point
    4. Tag each result with its fusion method: hybrid / llm / semantic
    """
    result_map: dict[int, dict] = {}

    # Seed with the semantic-matching results
    for m in semantic_mappings:
        pi = m["point_index"]
        indices = list(dict.fromkeys(m["block_indices"]))  # order-preserving de-dup
        result_map[pi] = {
            "point_index": pi,
            "block_indices": indices[:MAX_SOURCES_PER_POINT],
            "method": "semantic",
        }

    # LLM annotations take priority; semantic matches fill the remaining slots
    for m in llm_mappings:
        pi = m["point_index"]
        semantic_indices = result_map[pi]["block_indices"] if pi in result_map else []
        combined = list(dict.fromkeys([*m["block_indices"], *semantic_indices]))
        result_map[pi] = {
            "point_index": pi,
            "block_indices": combined[:MAX_SOURCES_PER_POINT],
            "method": "hybrid" if pi in result_map else "llm",
        }

    return sorted(result_map.values(), key=lambda x: x["point_index"])
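
To make the fusion behaviour concrete, here is a compact, self-contained restatement of the same priority rule run on toy inputs. `merge_mappings` and the sample data are illustrative only; they are not the article's function or real output.

```python
def merge_mappings(llm: list[dict], semantic: list[dict], cap: int = 5) -> list[dict]:
    """Fuse per-point sources: LLM-cited blocks first, then semantic matches."""
    merged: dict[int, dict] = {}
    for m in semantic:
        merged[m["point_index"]] = {
            "point_index": m["point_index"],
            "block_indices": list(dict.fromkeys(m["block_indices"]))[:cap],
            "method": "semantic",
        }
    for m in llm:
        pi = m["point_index"]
        from_semantic = merged[pi]["block_indices"] if pi in merged else []
        combined = list(dict.fromkeys([*m["block_indices"], *from_semantic]))
        merged[pi] = {
            "point_index": pi,
            "block_indices": combined[:cap],
            # Evaluated before the assignment, so it reflects the semantic pass.
            "method": "hybrid" if pi in merged else "llm",
        }
    return sorted(merged.values(), key=lambda x: x["point_index"])


llm_toy = [
    {"point_index": 0, "block_indices": [3, 5]},
    {"point_index": 2, "block_indices": [12]},
]
semantic_toy = [
    {"point_index": 0, "block_indices": [5, 4, 3], "scores": [0.71, 0.58, 0.52]},
    {"point_index": 1, "block_indices": [7], "scores": [0.49]},
]

fused = merge_mappings(llm_toy, semantic_toy)
for row in fused:
    print(row)
```

Point 0 is cited by both channels and becomes `hybrid` with blocks `[3, 5, 4]`, point 1 is `semantic` only, and point 2 is `llm` only.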

5. Front-End Interaction

5.1 Key Vue Component Code

SummaryPanel.vue: source tag + click handler:

<li v-for="(point, i) in summary.key_points" :key="i"
    :class="{ 'has-source': !!getSourceMappingForPoint(i) }"
    @click="handlePointClick(i)">
  {{ point }}
  <span v-if="getSourceMappingForPoint(i)" class="source-tag"
        :class="getSourceMethod(i) || ''">
    {{ getSourceMethod(i) === 'hybrid' ? 'Hybrid attribution' : 
       getSourceMethod(i) === 'semantic' ? 'Semantic attribution' : 'Attribution' }}
  </span>
</li>
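
The template depends on per-point lookups such as `getSourceMappingForPoint(i)`. As a rough sketch of the payload it presumably consumes, assuming the back end returns `key_points` alongside the fused `source_mappings`: the class names `SummaryPayload` and `SourceMapping` and the `mapping_for_point` helper are hypothetical, and only the field names come from the code in this article.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SourceMapping:
    point_index: int
    block_indices: list[int]
    method: str  # "hybrid", "llm", or "semantic"


@dataclass
class SummaryPayload:
    key_points: list[str]
    source_mappings: list[SourceMapping] = field(default_factory=list)

    def mapping_for_point(self, i: int) -> Optional[SourceMapping]:
        """Python counterpart of the per-item lookup the template performs."""
        return next((m for m in self.source_mappings if m.point_index == i), None)


payload = SummaryPayload(
    key_points=["The court held the contract valid", "The plaintiff claims damages"],
    source_mappings=[SourceMapping(point_index=0, block_indices=[12, 4], method="hybrid")],
)
print(json.dumps(asdict(payload), ensure_ascii=False, indent=2))
```

A key point without a mapping (index 1 here) renders as plain text with no source tag, which matches the `v-if` guard in the template.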

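DocumentView.vue: automatic jump + highlight animation:

let clearHighlightTimer: ReturnType<typeof setTimeout> | undefined

function handleLocateSource(blockIndices: number[]) {
  if (blockIndices.length === 0) return
  // 1. Switch to the source-blocks tab
  activeTab.value = 'blocks'
  // 2. Highlight the target blocks
  highlightedBlocks.value = new Set(blockIndices)
  // 3. Scroll to the first target block once the DOM has updated
  nextTick(() => {
    const el = document.getElementById(`block-${blockIndices[0]}`)
    el?.scrollIntoView({ behavior: 'smooth', block: 'center' })
  })
  // 4. Clear the highlight after 4 seconds; cancel any earlier timer so an
  //    older click cannot wipe the highlight set by a newer one
  clearTimeout(clearHighlightTimer)
  clearHighlightTimer = setTimeout(() => { highlightedBlocks.value.clear() }, 4000)
}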

5.2 Highlight Animation CSS

.highlighted {
  border-left: 3px solid var(--accent);
  background: var(--accent-light);
  animation: pulse 1.5s ease-in-out;
}
@keyframes pulse {
  0%, 100% { opacity: 1; }
  50% { opacity: 0.6; }
}

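6. Performance Analysis

Stage	Latency	Notes
Block numbering	<10ms	Pure string operations
LLM summary generation	3-8s	Depends on document length and model
Embedding computation	0.3-1s	API call; about 60 blocks fit in a single batch
Cosine similarity matrix	<10ms	NumPy vector operations
Fusion	<1ms	Simple dictionary operations
Total added latency	about 0.5-1.5s	Relative to plain LLM summarization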
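
Figures like these depend on the document, the model, and network conditions, so they are worth re-measuring on your own deployment. A minimal sketch of per-stage timing, assuming a hypothetical `timed` context manager and stand-in workloads in place of the real pipeline stages:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of a pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0


# Stand-in workloads; in the real system these would wrap the actual stages.
with timed("numbering"):
    numbered_text = "\n".join(f"[Block {i}] text" for i in range(60))

with timed("fusion"):
    fused_map = {i: {"point_index": i, "block_indices": [i]} for i in range(10)}

for stage, ms in timings.items():
    print(f"{stage}: {ms:.3f} ms")
```

Wrapping the embedding call the same way separates network latency from local computation, which is where most of the added time in the table comes from.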