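Overview

As enterprises push further into digital transformation, they have to process and analyze large volumes of contracts, agreements, invoices, and reports. Manual review is slow and does not scale to that volume. A document intelligent parsing and review system addresses this by combining document parsing with large language models (LLMs) to understand documents, extract summaries, and check compliance.

This series of articles describes how to build a complete document parsing and review system on the Java stack, covering PDF parsing, LLM integration, a compliance rule engine, and review report generation.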

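1 Application Scenarios for Intelligent Document Parsing and Review

1.1 Contract Review

Contract review is one of the most common document-processing scenarios in day-to-day business operations. A standard commercial contract usually contains the following key elements:

  • **Parties**: names and basic information of Party A and Party B
  • **Subject matter**: the specific goods or services covered
  • **Amount and payment terms**: payment amount, installment schedule, payment method
  • **Performance period**: start date, end date, key milestones
  • **Liability for breach**: how liquidated damages are calculated, compensation caps
  • **Dispute resolution**: arbitration clause, court of jurisdiction

In a manual review, legal staff read the contract line by line, which is time-consuming and makes it easy to miss key clauses. A document parsing and review system can instead:

1. **Extract key contract elements automatically**: use PDF parsing to extract the text, then identify the parties, amount, term, and other core information

2. **Identify risk points automatically**: rely on the LLM's semantic understanding to flag potentially risky clauses

3. **Check clause compliance**: compare clauses against a predefined compliance rule base and mark the non-compliant ones

4. **Generate a summary automatically**: produce a contract summary so readers can grasp the core content quickly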
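The first step in the list above, pulling key elements out of extracted text, can be approximated with plain pattern matching before any LLM is involved. The sketch below is only an illustration, not the system's actual extractor: the class name ContractFieldExtractor, the English labels "Party A:" and "Party B:", and the "CNY <number>" amount format are all assumptions chosen for the example, and real contracts would need patterns that match their own wording.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls a few key elements out of plain contract text. */
final class ContractFieldExtractor {

    // Assumed label format: "Party A: <name>" up to the end of the line.
    private static final Pattern PARTY_A = Pattern.compile("Party A:\\s*(\\S[^\\n]*)");
    private static final Pattern PARTY_B = Pattern.compile("Party B:\\s*(\\S[^\\n]*)");
    // Assumed amount format: "CNY 1,200,000.00" or "RMB 500000".
    private static final Pattern AMOUNT =
            Pattern.compile("(?:CNY|RMB)\\s*([0-9][0-9,]*(?:\\.[0-9]+)?)");

    static Optional<String> partyA(String text) {
        return first(PARTY_A, text);
    }

    static Optional<String> partyB(String text) {
        return first(PARTY_B, text);
    }

    /** Parses the first amount found, dropping thousands separators. */
    static Optional<BigDecimal> amount(String text) {
        return first(AMOUNT, text).map(s -> new BigDecimal(s.replace(",", "")));
    }

    private static Optional<String> first(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Party A: Example Tech Co., Ltd.\n"
                + "Party B: Sample Trading LLC\n"
                + "Total contract amount: CNY 1,200,000.00.\n";
        System.out.println(partyA(text).orElse("?"));
        System.out.println(partyB(text).orElse("?"));
        System.out.println(amount(text).map(BigDecimal::toPlainString).orElse("?"));
    }
}
```

Pattern matching of this kind is cheap and deterministic, so it can serve as a cross-check on what the LLM extracts, but it breaks as soon as the wording varies, which is where the model-based extraction described in this article takes over.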

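1.2 Invoice Recognition

Invoices are key vouchers for financial accounting. An invoice recognition system needs to capture the following fields:

import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.List;

public class InvoiceDTO {
    private String invoiceNumber;      // invoice number
    private String invoiceCode;        // invoice code
    private String sellerName;         // seller name
    private String buyerName;          // buyer name
    private BigDecimal totalAmount;    // total amount including tax
    private LocalDate invoiceDate;     // issue date
    private BigDecimal taxAmount;      // tax amount (BigDecimal, consistent with totalAmount)
    private List<InvoiceItem> items;   // line items
}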

[java]

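The core challenges of invoice recognition are:

  • **Format diversity**: invoice layouts differ from region to region
  • **Table parsing**: the line-item table has to be extracted accurately
  • **Seal occlusion**: red seals on invoices often cover part of the text
  • **Blurred text**: scanned copies may contain blurred characters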
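Because seals and blurred scans can corrupt individual digits, one inexpensive safeguard is an arithmetic cross-check on the recognized values: the line-item amounts plus the tax should equal the tax-inclusive total. The following is a minimal sketch under assumptions. InvoiceChecks, the simplified Item record, and the tolerance parameter are hypothetical names introduced for illustration, and it assumes item amounts are stored excluding tax.

```java
import java.math.BigDecimal;
import java.util.List;

/** Hypothetical post-recognition sanity check for invoice amounts. */
final class InvoiceChecks {

    /** Simplified stand-in for a line item; the field names are assumptions. */
    record Item(String name, BigDecimal amountExcludingTax) {}

    /**
     * Returns true when the sum of item amounts plus tax equals the stated
     * total, allowing for a small rounding tolerance.
     */
    static boolean totalsAreConsistent(List<Item> items, BigDecimal taxAmount,
                                       BigDecimal totalAmount, BigDecimal tolerance) {
        BigDecimal itemSum = items.stream()
                .map(Item::amountExcludingTax)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
        BigDecimal difference = itemSum.add(taxAmount).subtract(totalAmount).abs();
        return difference.compareTo(tolerance) <= 0;
    }

    public static void main(String[] args) {
        List<Item> items = List.of(
                new Item("Consulting", new BigDecimal("100.00")),
                new Item("Hosting", new BigDecimal("200.00")));
        boolean ok = totalsAreConsistent(items, new BigDecimal("39.00"),
                new BigDecimal("339.00"), new BigDecimal("0.01"));
        System.out.println("consistent: " + ok);
    }
}
```

An invoice that fails this check is a good candidate for a second OCR pass or for manual review.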

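1.3 Report Summarization

Summarizing long reports (audit reports, research reports, feasibility reports) is another important scenario:

| Report type | Average length (pages) | Key elements to extract |
| --- | --- | --- |
| Audit report | 30-50 | Audit opinion, financial position, risk warnings |
| Research report | 20-100 | Research conclusions, data sources, methodology |
| Feasibility report | 40-80 | Project overview, investment estimate, benefit analysis |

LLMs are well suited to report summarization: they can follow a report's logical structure, pull out its main points, and produce a coherent summary.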

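2 Overall Architecture

2.1 Layer Overview

The system uses a four-layer architecture. From bottom to top, the layers are: document parsing, LLM, review, and report.

The **document parsing layer** accepts documents in various formats (PDF, Word, Excel, and so on) and converts them into a structured intermediate representation. Its core techniques include PDF parsing with Apache PDFBox, Office document parsing with Apache POI, OCR, table structure recognition, and extraction of images and seals.

The **LLM layer** is built on LangChain4j and provides prompt template management, conversation context management, and chain orchestration. A unified interface adapts multiple models (ChatGLM, Qwen, Ernie, GPT-4, and others), so the underlying model can be swapped with little change to calling code.

The **review layer** takes the output of the parsing and LLM layers and runs compliance rule checks, risk identification, and clause comparison. It uses a hybrid strategy that combines a rule engine with AI models, aiming for review results that are both accurate and explainable.

The **report layer** generates the structured review report, including the summary, the risk list, and compliance recommendations. The report can be emitted as JSON for system integration or rendered as an HTML page for human reviewers.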

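2.2 Data Flow Design

Document input → Parsing → Structured text → LLM understanding → Review decision → Report output
    ↓              ↓         ↓                 ↓                   ↓                 ↓
  PDF file       PDFBox    DocumentDTO       ChatMessage         ReviewResult      ReportDTO
  Word file      POI       TextBlock         Prompt              RiskItem          JSON/HTML
  Scanned image  OCR       TableDTO          Response            Suggestion

The data flow follows these principles:

1. **One-way flow**: data moves in one direction from the parsing layer to the report layer, and layers interact through well-defined interfaces

2. **Structured hand-off**: each layer structures the data it receives so that the next layer can consume it easily

3. **Caching**: intermediate results are cached in Redis to avoid repeated computation

4. **Traceability**: every processing step is logged to make troubleshooting easier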
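The one-way flow and caching principles above can be sketched with plain function composition. This is an illustration under stated assumptions rather than the system's real pipeline: ReviewPipeline, cached, and build are hypothetical names, every stage is reduced to a String-to-String function, and an in-memory map stands in for Redis.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical one-way pipeline: parse, understand, review, report. */
final class ReviewPipeline {

    /**
     * Wraps a stage so that repeated inputs are answered from a cache.
     * The in-memory map is a stand-in for Redis in this sketch.
     */
    static <I, O> Function<I, O> cached(Function<I, O> stage) {
        Map<I, O> cache = new ConcurrentHashMap<>();
        return input -> cache.computeIfAbsent(input, stage);
    }

    /** Composes the stages in a single direction; only parsing is cached here. */
    static Function<String, String> build(Function<String, String> parse,
                                          Function<String, String> understand,
                                          Function<String, String> review,
                                          Function<String, String> report) {
        return cached(parse).andThen(understand).andThen(review).andThen(report);
    }

    public static void main(String[] args) {
        Function<String, String> pipeline = build(
                file -> "text-of(" + file + ")",
                text -> "facts-from(" + text + ")",
                facts -> "risks-in(" + facts + ")",
                risks -> "report-on(" + risks + ")");
        System.out.println(pipeline.apply("contract.pdf"));
    }
}
```

Each stage only sees the output of the stage before it, which is what keeps the layers replaceable. In a real deployment the cache key would more likely be a content hash of the uploaded file than the file name.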

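2.3 Core Technology Choices

**Spring Boot 3.x** is the application framework, providing dependency injection, AOP, web services, and other infrastructure. Its auto-configuration simplifies project setup, and its mature ecosystem supports system stability.

**LangChain4j** is a widely used LLM integration framework in the Java ecosystem, offering chains, prompt templates, tool calling, and related features. Compared with calling model APIs directly, it provides a higher level of abstraction and reduces development effort.

**Apache PDFBox** is the Apache Software Foundation's open-source PDF library, covering document loading, text extraction, and image extraction. It is a lightweight pure-Java library, which makes it a common choice for PDF parsing.

**Apache POI** is the standard Java solution for Office documents and supports reading and writing Word, Excel, and PowerPoint formats.

**Elasticsearch** serves as the vector store for document embeddings and supports semantic similarity search. In contract review, its vector search can quickly retrieve similar historical contracts for reference.

**Redis** caches frequently accessed data, including LLM API token counters, cached parsing results, and user session data.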

<!-- Core Maven dependencies -->
<dependencies>
    <!-- Spring Boot Starter -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
        <version>3.2.0</version>
    </dependency>

    <!-- LangChain4j -->
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j</artifactId>
        <version>0.27.0</version>
    </dependency>
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-open-ai</artifactId>
        <version>0.27.0</version>
    </dependency>

    <!-- Apache PDFBox -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>3.0.0</version>
    </dependency>

    <!-- Apache POI -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.5</version>
    </dependency>

    <!-- Elasticsearch -->
    <dependency>
        <groupId>co.elastic.clients</groupId>
        <artifactId>elasticsearch-java</artifactId>
        <version>8.11.0</version>
    </dependency>

    <!-- Redis -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-redis</artifactId>
        <version>3.2.0</version>
    </dependency>
</dependencies>

[xml]

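3 Technology Choices in Detail

3.1 Advantages of the Spring Boot Framework

Spring Boot 3.x brings several improvements over earlier versions:

1. **JDK 17+ baseline**: Spring Boot 3.x requires Java 17 or later, so the project can use the performance improvements and language features of recent JDKs

2. **AOT compilation support**: ahead-of-time processing and GraalVM native images can shorten startup time and reduce memory use

3. **Virtual thread support**: on JDK 21 with Spring Boot 3.2 or later, request handling can run on virtual threads, which helps when many concurrent requests block on I/O (a Java 17 build does not have this option)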

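// Spring Boot application entry point
@SpringBootApplication
@EnableScheduling
public class ContractReviewApplication {
    public static void main(String[] args) {
        SpringApplication.run(ContractReviewApplication.class, args);
    }
}

// Example configuration class.
// LangChainProperties and ElasticsearchProperties are project-defined
// @ConfigurationProperties classes.
@Configuration
@EnableConfigurationProperties({
    LangChainProperties.class,
    ElasticsearchProperties.class
})
public class AppConfig {

    // In LangChain4j 0.27.0 the chat interface is
    // dev.langchain4j.model.chat.ChatLanguageModel.
    @Bean
    public ChatLanguageModel chatModel(LangChainProperties properties) {
        return OpenAiChatModel.builder()
            .apiKey(properties.getOpenai().getApiKey())
            .modelName(properties.getOpenai().getModel())
            .temperature(0.7)
            .build();
    }

    // The Elasticsearch Java API client has no static builder(); it is created
    // from a transport that wraps the low-level RestClient.
    @Bean
    public ElasticsearchClient elasticsearchClient(
            ElasticsearchProperties properties) {
        RestClient restClient = RestClient.builder(new HttpHost(
                properties.getHost(),
                properties.getPort(),
                "https"
            )).build();
        ElasticsearchTransport transport =
            new RestClientTransport(restClient, new JacksonJsonpMapper());
        return new ElasticsearchClient(transport);
    }
}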

[java]

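3.2 LangChain4j Core Concepts

LangChain4j is an LLM integration library for Java inspired by LangChain. Its core concepts include:

**ChatLanguageModel**: the chat interface to an LLM (this is its name in the 0.27.0 release used in this article). Its calls are synchronous; streaming responses go through the separate StreamingChatLanguageModel interface

public interface ChatLanguageModel {
    String generate(String userMessage);
    Response<AiMessage> generate(List<ChatMessage> messages);
}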

[java]

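**PromptTemplate**: prompt template management with variable substitution

PromptTemplate template = PromptTemplate.from(
    "Extract the {{field}} from the following contract:\n{{content}}"
);

Map<String, Object> variables = Map.of(
    "field", "name of Party A",
    "content", "Party A of this contract is Alibaba Group..."
);

// apply() returns a Prompt object; text() yields the rendered string
Prompt prompt = template.apply(variables);
String promptText = prompt.text();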

[java]

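**Chain**: a chain links several operations into a pipeline. In LangChain4j, Chain<Input, Output> is an interface with a single execute method, so a multi-step pipeline such as the one below is assembled in project code rather than through a built-in step builder

// Contract summary pipeline (conceptual sketch).
// StepPipeline, step(...) and the *Step classes are hypothetical project code,
// not LangChain4j API.
StepPipeline contractSummaryPipeline = StepPipeline.builder()
    .step(new LoadDocumentStep(parserService))
    .step(new ExtractKeyInfoStep(chatModel))
    .step(new GenerateSummaryStep(chatModel))
    .step(new FormatOutputStep())
    .build();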

[java]

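**Memory**: conversation context management for multi-turn dialogue

MessageWindowChatMemory memory = MessageWindowChatMemory.builder()
    .maxMessages(20)
    .build();

// In LangChain4j 0.27.0 the builder method is chatLanguageModel(...)
ConversationalChain chain = ConversationalChain.builder()
    .chatLanguageModel(chatModel)
    .chatMemory(memory)
    .build();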

[java]

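3.3 Elasticsearch Vector Search

In contract review, vector search is used for:

1. **Finding similar contracts**: given the current contract, retrieve similar historical contracts for reference

2. **Clause library lookup**: retrieve applicable standard clauses from the clause library

3. **Risk case matching**: find similar risk cases for reference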
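All three uses come down to ranking stored vectors by their similarity to a query vector. Elasticsearch does this with approximate kNN search; the brute-force sketch below shows the underlying idea with cosine similarity and an exact top-k in plain Java. VectorSearchSketch and the tiny two-dimensional vectors are assumptions for illustration only, since real embeddings have hundreds of dimensions or more.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical brute-force similarity search over a small in-memory index. */
final class VectorSearchSketch {

    /** Cosine similarity: dot(a, b) / (|a| * |b|); 0 if either vector is all zeros. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the ids of the k vectors most similar to the query, best first. */
    static List<String> topK(float[] query, Map<String, float[]> index, int k) {
        return index.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, float[]> e) -> cosine(query, e.getValue()))
                        .reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        Map<String, float[]> index = Map.of(
                "lease-2021", new float[]{1f, 0.1f},
                "nda-2022", new float[]{0f, 1f},
                "supply-2023", new float[]{0.7f, 0.7f});
        System.out.println(topK(new float[]{1f, 0f}, index, 2));
    }
}
```

The exact scan is linear in the number of stored vectors, which is why a production system delegates the search to an index that supports approximate nearest-neighbor queries.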

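// Vector store configuration.
// VectorStore, ElasticsearchVectorStore and Document are treated here as
// project-level abstractions; they are not LangChain4j classes.
@Configuration
public class VectorStoreConfig {

    @Bean
    public VectorStore vectorStore(ElasticsearchClient client,
                                   EmbeddingModel embeddingModel) {
        return new ElasticsearchVectorStore(
            client,
            "contract_embeddings",  // index name
            embeddingModel          // injected: EmbeddingModel is an interface and cannot be instantiated
        );
    }
}

// Index the vectors of a contract
public void indexContract(ContractDTO contract) {
    // 1. Split the document content into chunks
    List<String> chunks = textChunker.chunk(contract.getContent());

    // 2. Generate embeddings (embedAll takes TextSegments and returns a Response)
    List<TextSegment> segments = chunks.stream()
        .map(TextSegment::from)
        .toList();
    List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

    // 3. Store in Elasticsearch
    for (int i = 0; i < chunks.size(); i++) {
        vectorStore.add(Document.builder()
            .id(contract.getId() + "_" + i)
            .text(chunks.get(i))
            .embedding(embeddings.get(i))
            .metadata(Map.of(
                "contractId", contract.getId(),
                "chunkIndex", i
            ))
            .build());
    }
}

// Similarity search
public List<ContractDTO> findSimilarContracts(
        String query, int topK) throws IOException {

    // 1. Convert the query into a vector
    Embedding queryEmbedding = embeddingModel.embed(query).content();

    // 2. kNN search in Elasticsearch
    SearchResponse<ContractSearchResult> response = client.search(s -> s
        .index("contract_embeddings")
        .knn(k -> k
            .field("embedding")
            .queryVector(queryEmbedding.vectorAsList())
            .k(topK)
            .numCandidates(topK * 10)  // candidates per shard; the factor 10 is an assumed starting point
        ),
        ContractSearchResult.class
    );

    // 3. Map hits back to contracts; several chunks may belong to the same contract
    return response.hits().hits().stream()
        .map(hit -> hit.source().getContractId())  // assumes ContractSearchResult exposes this getter
        .distinct()
        .map(contractService::findById)
        .collect(Collectors.toList());
}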

[java]
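The indexing code above calls textChunker.chunk(...) without showing how the text is split. One common approach is a fixed-size sliding window with some overlap, so that a clause cut at a chunk boundary still appears whole in a neighboring chunk. The sketch below is one possible implementation, not necessarily the one this system uses: SlidingWindowChunker and its parameters are hypothetical, and sizes are measured in Java chars rather than model tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chunker: fixed-size windows that overlap by a given number of chars. */
final class SlidingWindowChunker {

    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require chunkSize > overlap >= 0");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;  // how far the window advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break;  // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Windows of 4 chars that share 1 char with the previous window
        System.out.println(chunk("abcdefghij", 4, 1));
    }
}
```

Splitting on sentence or clause boundaries usually gives better retrieval quality than raw character windows, at the cost of a more involved splitter.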

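3.4 Redis Caching Strategy

Redis has several roles in the system: per-user request counters for rate limiting, a cache for parsing results, and a distributed lock that prevents the same document from being processed concurrently.

# application.yml sample configuration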
spring:
  data:
    redis:
      host: localhost
      port: 6379
      password: ${REDIS_PASSWORD:}
      database: 0
      timeout: 3000ms
      lettuce:
        pool:
          max-active: 20
          max-idle: 10
          min-idle: 5

[yaml]

@Service
public class CacheService {

    private final RedisTemplate<String, Object> redisTemplate;

    public CacheService(RedisTemplate<String, Object> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    // Request counter used to limit API call frequency (one-minute window per user).
    // increment and expire are two separate commands, so this is not atomic.
    public void incrementTokenCount(String userId) {
        String key = "token:count:" + userId;
        Long count = redisTemplate.opsForValue().increment(key);
        if (count != null && count == 1) {
            redisTemplate.expire(key, Duration.ofMinutes(1));
        }
    }

    public boolean isRateLimited(String userId, int maxRequests) {
        String key = "token:count:" + userId;
        // Depending on the value serializer, the counter may come back as
        // Integer, Long or String, so parse it instead of casting to Integer.
        Object value = redisTemplate.opsForValue().get(key);
        return value != null && Long.parseLong(value.toString()) >= maxRequests;
    }

    // Cache for parsing results
    public void cacheParseResult(String documentId,
                                 DocumentDTO result) {
        String key = "parse:result:" + documentId;
        redisTemplate.opsForValue().set(key, result, Duration.ofHours(24));
    }

    public DocumentDTO getCachedParseResult(String documentId) {
        String key = "parse:result:" + documentId;
        return (DocumentDTO) redisTemplate.opsForValue().get(key);
    }

    // Distributed lock (prevents concurrent processing of the same document).
    // SET with NX and an expiry, so a crashed holder cannot block others forever.
    public boolean tryLock(String documentId, Duration timeout) {
        String key = "lock:document:" + documentId;
        return Boolean.TRUE.equals(
            redisTemplate.opsForValue().setIfAbsent(
                key, Thread.currentThread().getName(), timeout)
        );
    }
}

[java]
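The counter in CacheService follows a fixed-window scheme: the first request in a window starts a one-minute timer, and requests are refused once the count reaches the limit. The sketch below reproduces a variant of that logic in memory so it can be run without Redis. FixedWindowRateLimiter and its explicit nowMillis parameter are assumptions introduced for illustration; unlike the service above, it checks and counts in a single call.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory fixed-window rate limiter, one window per user. */
final class FixedWindowRateLimiter {

    /** Start time of the current window and the number of requests counted in it. */
    private record Window(long startMillis, int count) {}

    private final int maxRequests;
    private final long windowMillis;
    private final Map<String, Window> windows = new HashMap<>();

    FixedWindowRateLimiter(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    /** Returns true and counts the request if the user is still under the limit. */
    synchronized boolean tryAcquire(String userId, long nowMillis) {
        Window window = windows.get(userId);
        if (window == null || nowMillis - window.startMillis() >= windowMillis) {
            window = new Window(nowMillis, 0);  // the previous window expired: start a new one
        }
        if (window.count() >= maxRequests) {
            windows.put(userId, window);
            return false;
        }
        windows.put(userId, new Window(window.startMillis(), window.count() + 1));
        return true;
    }

    public static void main(String[] args) {
        FixedWindowRateLimiter limiter = new FixedWindowRateLimiter(2, 60_000);
        long now = System.currentTimeMillis();
        for (int i = 1; i <= 3; i++) {
            System.out.println("request " + i + " allowed: " + limiter.tryAcquire("user-1", now));
        }
    }
}
```

Passing the clock in as a parameter keeps the class easy to test. A known weakness of fixed windows is that a burst straddling two windows can briefly allow up to twice the limit.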

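4 Module Layout and Dependencies

4.1 Module Structure Overview

contract-parent/
├── contract-core/          # core module
├── contract-parser/        # document parsing module
├── contract-ai/            # LLM service module
├── contract-review/        # compliance review module
├── contract-api/           # REST API module
└── contract-web/           # web front-end module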
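The required dependencies listed in the module sections of this article form a directed acyclic graph, and a valid build order is any topological order of that graph. The sketch below computes one with a depth-first traversal. ModuleOrder and articleModules are hypothetical names; the map encodes only the dependencies this article marks as required, and contract-web is left out because its dependencies are not stated.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that derives a build order from module dependencies. */
final class ModuleOrder {

    /** Depth-first topological sort: every module appears after its dependencies. */
    static List<String> buildOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String module : deps.keySet()) {
            visit(module, deps, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String module, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(module)) {
            return;
        }
        if (!inProgress.add(module)) {
            throw new IllegalStateException("dependency cycle at " + module);
        }
        for (String dependency : deps.getOrDefault(module, List.of())) {
            visit(dependency, deps, done, inProgress, order);
        }
        inProgress.remove(module);
        done.add(module);
        order.add(module);
    }

    /** Required dependencies as stated in this article (optional ones omitted). */
    static Map<String, List<String>> articleModules() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("contract-api", List.of(
                "contract-core", "contract-parser", "contract-ai", "contract-review"));
        deps.put("contract-review", List.of("contract-core", "contract-ai"));
        deps.put("contract-ai", List.of("contract-core"));
        deps.put("contract-parser", List.of("contract-core"));
        deps.put("contract-core", List.of());
        return deps;
    }

    public static void main(String[] args) {
        System.out.println(buildOrder(articleModules()));
    }
}
```

Maven's reactor performs this ordering itself; the point of the sketch is that the cycle check fails fast if, for example, contract-core were ever made to depend on contract-ai.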

4.2 contract-core: Core Module

The core module contains the base components shared by all other modules:

// Entity example
@Entity
@Table(name = "contract")
public class Contract {

    @Id
    private String id;

    private String title;              // contract title
    @Lob
    private String content;            // full contract text (large, hence @Lob)
    private String partyA;             // Party A
    private String partyB;             // Party B
    private BigDecimal amount;         // contract amount
    private LocalDate startDate;       // start date
    private LocalDate endDate;         // end date
    @Enumerated(EnumType.STRING)       // persist the enum name rather than its ordinal
    private ContractStatus status;     // status
    private LocalDateTime createdAt;
    private LocalDateTime updatedAt;

    @OneToMany(mappedBy = "contract", cascade = CascadeType.ALL)
    private List<ReviewResult> reviewResults;
}

// Enum example
public enum ContractStatus {
    DRAFT("Draft"),
    PENDING_REVIEW("Pending review"),
    IN_REVIEW("In review"),
    APPROVED("Approved"),
    REJECTED("Rejected"),
    EXPIRED("Expired");

    private final String description;

    ContractStatus(String description) {
        this.description = description;
    }

    public String getDescription() {
        return description;
    }
}

// Interface definitions
public interface DocumentParser {
    DocumentDTO parse(MultipartFile file) throws ParseException;
    boolean supports(String fileType);
}

public interface ReviewEngine {
    ReviewResult review(DocumentDTO document, ReviewRules rules);
}

public interface SummaryGenerator {
    SummaryDTO generateSummary(DocumentDTO document);
}

[java]

4.3 contract-parser: Document Parsing Module

// Module dependencies
// contract-core (required)
// Third party: Apache PDFBox, Apache POI, Tesseract OCR

// PDF parsing service.
// ParseException here is a project-defined exception with a (String, Throwable)
// constructor; java.text.ParseException has no such constructor.
@Service
public class PdfParserService implements DocumentParser {

    @Override
    public DocumentDTO parse(MultipartFile file) throws ParseException {
        // PDFBox 3.x removed PDDocument.load(...); documents are opened with
        // org.apache.pdfbox.Loader.loadPDF(...)
        try (PDDocument document = Loader.loadPDF(file.getBytes())) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true);

            DocumentDTO dto = new DocumentDTO();
            dto.setId(UUID.randomUUID().toString());
            dto.setFileName(file.getOriginalFilename());
            dto.setPageCount(document.getNumberOfPages());

            // Extract the text page by page (PDFTextStripper page numbers are 1-based)
            List<TextBlock> blocks = new ArrayList<>();
            for (int i = 1; i <= document.getNumberOfPages(); i++) {
                stripper.setStartPage(i);
                stripper.setEndPage(i);
                String text = stripper.getText(document);

                TextBlock block = new TextBlock();
                block.setPageNumber(i);
                block.setText(text);
                block.setBlockType(BlockType.PARAGRAPH);
                blocks.add(block);
            }

            dto.setBlocks(blocks);
            dto.setRawText(blocks.stream()
                .map(TextBlock::getText)
                .collect(Collectors.joining("\n")));

            return dto;
        } catch (IOException e) {
            throw new ParseException("Failed to parse PDF: " + e.getMessage(), e);
        }
    }

    @Override
    public boolean supports(String fileType) {
        return "pdf".equalsIgnoreCase(fileType);
    }
}

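// Text cleaning utility
@Component
public class TextCleaner {

    public String clean(String rawText) {
        if (rawText == null) {
            return "";
        }

        return rawText
            // Remove leading line numbers. This runs per line ((?m)) and before
            // whitespace is collapsed; it also strips any bare number that starts a line.
            .replaceAll("(?m)^\\d+\\s+", "")
            // Collapse runs of whitespace, including line breaks, into one space
            .replaceAll("\\s+", " ")
            // Remove remaining control characters
            .replaceAll("[\\x00-\\x1F\\x7F]", "")
            // Normalize curly quotes to straight quotes
            .replaceAll("[\u2018\u2019]", "'")
            .replaceAll("[\u201C\u201D]", "\"")
            .trim();
    }

    public String extractSection(String text, String sectionTitle) {
        // Extract the body of the named section. The title is quoted so that regex
        // metacharacters in it match literally. The lookahead stops at the next
        // numbered heading marked with 章 (chapter), 节 (section) or 条 (article).
        Pattern pattern = Pattern.compile(
            Pattern.quote(sectionTitle)
                + "[::]*\\s*([\\s\\S]*?)(?=\\n\\s*\\d+\\s*[章节条]|\\Z)",
            Pattern.CASE_INSENSITIVE
        );
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            return matcher.group(1).trim();
        }
        return "";
    }
}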

[java]

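4.4 contract-ai: LLM Service Module

// Module dependencies
// contract-core (required)
// contract-parser (optional, for parse-then-understand flows)
// Third party: LangChain4j

@Configuration
public class LangChainConfig {

    @Value("${langchain.chatglm.api-key}")
    private String chatglmApiKey;

    @Value("${langchain.chatglm.base-url}")
    private String chatglmBaseUrl;

    // Assumption: the configured endpoint speaks an OpenAI-compatible chat API, so
    // OpenAiChatModel from langchain4j-open-ai is pointed at it through baseUrl.
    // Swap in a provider-specific model class if your LangChain4j version ships one.
    @Bean
    public ChatLanguageModel chatModel() {
        return OpenAiChatModel.builder()
            .apiKey(chatglmApiKey)
            .baseUrl(chatglmBaseUrl)
            .modelName("chatglm-4")   // model id as given in this article; check the provider's current list
            .temperature(0.7)
            .maxTokens(2000)
            .build();
    }

    // AiServices builds a proxy that implements ContractAssistant, so the bean
    // type is the interface itself. Note that one shared ChatMemory means all
    // callers share the same conversation history.
    @Bean
    public ContractAssistant contractAssistant(ChatLanguageModel chatModel) {
        return AiServices.builder(ContractAssistant.class)
            .chatLanguageModel(chatModel)
            .chatMemory(MessageWindowChatMemory.withMaxMessages(20))
            .build();
    }
}

// AI service interface.
// Template variables are bound with @V (dev.langchain4j.service.V).
public interface ContractAssistant {

    @SystemMessage("""
        You are a professional contract review assistant.
        Your responsibilities are:
        1. Extract key information from the contract (Party A, Party B, amount, term, etc.)
        2. Identify potential risk points in the contract
        3. Assess the contract's compliance
        Always output the result in JSON format.
        """)
    @UserMessage("Extract the key information from the following contract:\n{{content}}")
    ContractKeyInfo extractKeyInfo(@V("content") String content);

    @SystemMessage("You are a professional legal advisor who specializes in analyzing the legal risks of contract clauses.")
    @UserMessage("Analyze the risks of the following contract clause:\n{{clause}}")
    RiskAnalysis analyzeRisk(@V("clause") String clause);
}

// Usage example
@Service
public class ContractAiService {

    private final ContractAssistant assistant;
    private final TextCleaner textCleaner;

    public ContractAiService(ContractAssistant assistant, TextCleaner textCleaner) {
        this.assistant = assistant;
        this.textCleaner = textCleaner;
    }

    public ContractKeyInfo extractContractInfo(DocumentDTO document) {
        String content = textCleaner.clean(document.getRawText());
        // Truncate to 4000 characters to bound the prompt size; text beyond that
        // point is not seen by the model. StringUtils is
        // org.apache.commons.lang3.StringUtils (commons-lang3 must be on the classpath).
        return assistant.extractKeyInfo(StringUtils.abbreviate(content, 4000));
    }
}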

[java]

4.5 contract-review: Compliance Review Module

// Module dependencies
// contract-core (required)
// contract-ai (required)
// contract-parser (optional)

// Compliance rule engine
@Service
public class ComplianceRuleEngine {

    private final List<ComplianceRule> rules;

    public ComplianceRuleEngine(
            @Autowired(required = false)
            List<ComplianceRule> rules) {
        this.rules = rules != null ? rules : Collections.emptyList();
    }

    public ReviewResult review(DocumentDTO document) {
        ReviewResult result = new ReviewResult();
        result.setDocumentId(document.getId());
        result.setReviewTime(LocalDateTime.now());

        List<RiskItem> riskItems = new ArrayList<>();
        List<Suggestion> suggestions = new ArrayList<>();

        for (ComplianceRule rule : rules) {
            if (rule.isApplicable(document)) {
                RuleCheckResult checkResult = rule.check(document);
                if (!checkResult.isPassed()) {
                    RiskItem risk = new RiskItem();
                    risk.setRuleId(rule.getId());
                    risk.setRuleName(rule.getName());
                    risk.setRiskLevel(checkResult.getRiskLevel());
                    risk.setDescription(checkResult.getDescription());
                    risk.setEvidence(checkResult.getEvidence());
                    riskItems.add(risk);

                    suggestions.addAll(rule.getSuggestions(checkResult));
                }
            }
        }

        result.setRiskItems(riskItems);
        result.setSuggestions(suggestions);
        result.setOverallRiskLevel(calculateOverallRisk(riskItems));
        result.setPassed(riskItems.isEmpty());

        return result;
    }

    private RiskLevel calculateOverallRisk(List<RiskItem> items) {
        if (items.isEmpty()) {
            return RiskLevel.NONE;
        }
        boolean hasHigh = items.stream()
            .anyMatch(r -> r.getRiskLevel() == RiskLevel.HIGH);
        boolean hasMedium = items.stream()
            .anyMatch(r -> r.getRiskLevel() == RiskLevel.MEDIUM);

        if (hasHigh) {
            return RiskLevel.HIGH;
        } else if (hasMedium) {
            return RiskLevel.MEDIUM;
        } else {
            return RiskLevel.LOW;
        }
    }
}

// Example compliance rule
@Component
@RequiredArgsConstructor
public class AmountLimitRule implements ComplianceRule {

    private final BigDecimal defaultLimit = new BigDecimal("1000000"); // 1,000,000 yuan

    @Override
    public String getId() {
        return "AMOUNT_LIMIT_001";
    }

    @Override
    public String getName() {
        return "Contract amount limit check";
    }

    @Override
    public boolean isApplicable(DocumentDTO document) {
        return document.getKeyInfo() != null
            && document.getKeyInfo().getAmount() != null;
    }

    @Override
    public RuleCheckResult check(DocumentDTO document) {
        BigDecimal amount = document.getKeyInfo().getAmount();
        RuleCheckResult result = new RuleCheckResult();

        if (amount.compareTo(defaultLimit) > 0) {
            result.setPassed(false);
            result.setRiskLevel(RiskLevel.HIGH);
            result.setDescription("Contract amount exceeds the default limit of 1,000,000 yuan");
            result.setEvidence(Map.of(
                "amount", amount,
                "limit", defaultLimit
            ));
        } else {
            result.setPassed(true);
        }

        return result;
    }

    @Override
    public List<Suggestion> getSuggestions(RuleCheckResult checkResult) {
        List<Suggestion> suggestions = new ArrayList<>();
        if (!checkResult.isPassed()) {
            suggestions.add(Suggestion.builder()
                .type(SuggestionType.WARNING)
                .content("Recommendation: the contract amount is large; please confirm whether additional approval is required")
                .build());
        }
        return suggestions;
    }
}

[java]
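Rules other than the amount limit follow the same pattern: inspect the document, report findings with a risk level, and let the engine take the highest level as the overall result. As a further illustration, the sketch below checks that required clause headings are present. It deliberately uses simplified stand-in types rather than the article's ComplianceRule, RiskItem, and DocumentDTO, and the names RequiredClauseCheck, Finding, and the MEDIUM level assigned to a missing clause are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical rule: every required clause heading must occur in the contract text. */
final class RequiredClauseCheck {

    enum RiskLevel { NONE, LOW, MEDIUM, HIGH }

    /** Simplified stand-in for a risk item. */
    record Finding(String clause, RiskLevel level) {}

    /** Reports each required heading that is missing (case-insensitive match). */
    static List<Finding> missingClauses(String contractText, List<String> requiredHeadings) {
        List<Finding> findings = new ArrayList<>();
        String normalized = contractText.toLowerCase(Locale.ROOT);
        for (String heading : requiredHeadings) {
            if (!normalized.contains(heading.toLowerCase(Locale.ROOT))) {
                findings.add(new Finding(heading, RiskLevel.MEDIUM));
            }
        }
        return findings;
    }

    /** Highest level wins; relies on the enum order NONE < LOW < MEDIUM < HIGH. */
    static RiskLevel overall(List<Finding> findings) {
        return findings.stream()
                .map(Finding::level)
                .max(Comparator.naturalOrder())
                .orElse(RiskLevel.NONE);
    }

    public static void main(String[] args) {
        String text = "Liability for breach: ...\nDispute resolution: arbitration in ...";
        List<Finding> findings = missingClauses(text,
                List.of("Liability for breach", "Dispute resolution", "Confidentiality"));
        System.out.println(findings + " -> " + overall(findings));
    }
}
```

A plain substring test is the weakest possible detector and will miss clauses that use different wording, which is the gap the LLM-based risk analysis in contract-ai is meant to cover.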

4.6 contract-api: REST API Module

// Module dependencies
// contract-core (required)
// contract-parser (required)
// contract-ai (required)
// contract-review (required)

// REST controller
@RestController
@RequestMapping("/api/v1/contracts")
@RequiredArgsConstructor
public class ContractController {

    private final ContractService contractService;
    private final DocumentParserService parserService;
    private final ReviewService reviewService;

    @PostMapping("/upload")
    public ResponseEntity<ContractUploadResponse> uploadContract(
            @RequestParam("file") MultipartFile file) {
        try {
            // 1. Parse the document
            DocumentDTO document = parserService.parse(file);

            // 2. Save the contract
            Contract contract = contractService.save(document);

            // 3. Return the result
            return ResponseEntity.ok(ContractUploadResponse.builder()
                .contractId(contract.getId())
                .fileName(file.getOriginalFilename())
                .pageCount(document.getPageCount())
                .status("SUCCESS")
                .build());
        } catch (ParseException e) {
            return ResponseEntity.badRequest()
                .body(ContractUploadResponse.builder()
                    .status("FAILED")
                    .errorMessage("Failed to parse document: " + e.getMessage())
                    .build());
        }
    }

    @PostMapping("/{id}/review")
    public ResponseEntity<ReviewResponse> reviewContract(
            @PathVariable String id,
            @RequestBody(required = false) ReviewRequest request) {
        try {
            ReviewResult result = reviewService.review(id, request);
            return ResponseEntity.ok(ReviewResponse.from(result));
        } catch (ContractNotFoundException e) {
            return ResponseEntity.notFound().build();
        }
    }

    @GetMapping("/{id}/summary")
    public ResponseEntity<SummaryResponse> getContractSummary(
            @PathVariable String id) {
        Contract contract = contractService.findById(id);
        SummaryDTO summary = reviewService.generateSummary(
            contract.getDocument());
        return ResponseEntity.ok(SummaryResponse.from(summary));
    }
}

[java]

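5 Code and Runtime Details

5.1 Maven Dependency Configuration

The project's parent pom.xml is shown below (the list of managed modules is abridged):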

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>contract-review-system</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>pom</packaging>

    <name>Contract Review System</name>
    <description>Document Intelligent Parsing and Review System</description>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.2.0</version>
        <relativePath/>
    </parent>

    <modules>
        <module>contract-core</module>
        <module>contract-parser</module>
        <module>contract-ai</module>
        <module>contract-review</module>
        <module>contract-api</module>
        <module>contract-web</module>
    </modules>

    <properties>
        <java.version>17</java.version>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

        <langchain4j.version>0.27.0</langchain4j.version>
        <pdfbox.version>3.0.0</pdfbox.version>
        <poi.version>5.2.5</poi.version>
        <elasticsearch.version>8.11.0</elasticsearch.version>
    </properties>

    <dependencyManagement>
        <dependencies>
            <!-- Internal modules -->
            <dependency>
                <groupId>com.example</groupId>
                <artifactId>contract-core</artifactId>
                <version>${project.version}</version>
            </dependency>
            <!-- ... other modules ... -->

            <!-- LangChain4j -->
            <dependency>
                <groupId>dev.langchain4j</groupId>
                <artifactId>langchain4j</artifactId>
                <version>${langchain4j.version}</version>
            </dependency>

            <!-- Apache PDFBox -->
            <dependency>
                <groupId>org.apache.pdfbox</groupId>
                <artifactId>pdfbox</artifactId>
                <version>${pdfbox.version}</version>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.11.0</version>
                <configuration>
                    <source>17</source>
                    <target>17</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

[xml]

5.2 Application Configuration File

# application.yml
spring:
  application:
    name: contract-review-system

  servlet:
    multipart:
      max-file-size: 50MB
      max-request-size: 50MB

  data:
    redis:
      host: ${REDIS_HOST:localhost}
      port: ${REDIS_PORT:6379}
      password: ${REDIS_PASSWORD:}
      database: 0
    elasticsearch:
      repositories:
        enabled: true

# LangChain settings
langchain:
  chatglm:
    api-key: ${CHATGLM_API_KEY:}
    base-url: ${CHATGLM_BASE_URL:https://open.bigmodel.cn/api/paas/v4}
    model: chatglm-4
    temperature: 0.7
    max-tokens: 2000

# Elasticsearch settings
elasticsearch:
  host: ${ES_HOST:localhost}
  port: ${ES_PORT:9200}
  scheme: https
  index:
    contract-embeddings: contract_embeddings

# File storage settings
storage:
  type: local
  local:
    path: /data/contracts
    temp-path: /tmp/contracts

# Review rule settings
review:
  rules:
    enabled: true
    amount-limit: 1000000
    auto-review: true

# Logging settings
logging:
  level:
    root: INFO
    com.example.contract: DEBUG
    dev.langchain4j: DEBUG
  pattern:
    console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n"

[yaml]

5.3 Sample Run Output

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::                (v3.2.0)

2024-01-15 10:30:00 [main] INFO  c.e.c.ContractReviewApplication - Starting ContractReviewApplication
2024-01-15 10:30:01 [main] INFO  c.e.c.ContractReviewApplication - The following 1 profile is active: "default"
2024-01-15 10:30:05 [main] INFO  o.s.b.w.embedded.tomcat.TomcatWebServer - Tomcat initialized with port 8080
2024-01-15 10:30:06 [main] INFO  o.apache.catalina.core.StandardService - Starting service [Tomcat]
2024-01-15 10:30:07 [main] INFO  o.apache.catalina.core.StandardEngine - Starting Servlet engine: [Apache Tomcat/10.1.17]
2024-01-15 10:30:10 [main] INFO  o.s.b.w.embedded.tomcat.TomcatWebServer - Tomcat started on port 8080
2024-01-15 10:30:11 [main] INFO  c.e.c.ContractReviewApplication - Started ContractReviewApplication in 11.5 seconds

# API call log
2024-01-15 10:35:00 [http-nio-8080-exec-1] INFO  c.e.c.a.ContractController - Contract upload request: test_contract.pdf
2024-01-15 10:35:01 [http-nio-8080-exec-1] DEBUG c.e.c.p.PdfParserService - Loading PDF document: test_contract.pdf
2024-01-15 10:35:02 [http-nio-8080-exec-1] DEBUG c.e.c.p.PdfParserService - Extracted 15 pages
2024-01-15 10:35:02 [http-nio-8080-exec-1] DEBUG c.e.c.p.TextCleaner - Cleaning text, raw length: 45230
2024-01-15 10:35:03 [http-nio-8080-exec-1] INFO  c.e.c.a.ContractController - Contract saved with ID: contract-001
2024-01-15 10:35:03 [http-nio-8080-exec-1] INFO  c.e.c.r.ReviewService - Starting review for contract: contract-001
2024-01-15 10:35:04 [http-nio-8080-exec-1] DEBUG c.e.c.r.ComplianceRuleEngine - Loaded 5 compliance rules
2024-01-15 10:35:05 [http-nio-8080-exec-1] INFO  c.e.c.ai.ContractAiService - Calling LLM for key info extraction
2024-01-15 10:35:08 [http-nio-8080-exec-1] DEBUG c.e.c.ai.ContractAiService - LLM response received, tokens: 850
2024-01-15 10:35:09 [http-nio-8080-exec-1] INFO  c.e.c.r.ReviewService - Review completed, risk level: MEDIUM, 2 risk items found

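6 Summary

This article described the overall architecture of the document intelligent parsing and review system, covering:

1. **Application scenarios**: contract review, invoice recognition, report summarization, and other core scenarios

2. **Overall architecture**: a four-layer design of document parsing layer → LLM layer → review layer → report layer

3. **Technology choices**: Spring Boot 3.x + LangChain4j + Elasticsearch + Redis

4. **Module layout**: the responsibilities and dependencies of the six Maven modules

5. **Core code examples**: Maven configuration, application configuration, and core classes

Later articles in the series will cover:

  • **Part 2**: PDF parsing in practice (Apache PDFBox in detail)
  • **Part 3**: LLM integration and service encapsulation (advanced LangChain4j usage)
  • **Part 4**: design and implementation of the compliance rule engine
  • **Part 5**: review report generation and visualization

Stay tuned!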

---

*Author: 洛水石*

*All rights reserved. Reproduction without permission is prohibited.*
