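Building a Vertical-Domain QA System: Engineering Practice with QLoRA Fine-Tuning and Hybrid RAG
Practice-oriented: General-purpose LLMs lack specialized knowledge and tend to hallucinate in specific business scenarios. Using a vertical-domain QA system as the example, this article records the full engineering path from data processing and model fine-tuning, through hybrid retrieval, to containerized deployment.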
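1. Overall System Architecture and Project Structure
In a real business system, the application, algorithm, and data layers should be deployed independently of one another. This project separates the front end from the back end and uses a microservice architecture. The stack covers Python (algorithms and API), PostgreSQL/PgVector (relational vector database), Shell (task orchestration), and Docker (environment isolation).
1.1 Production Architecture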
[User request]
      │
      ▼
[Nginx / API Gateway] ──> [FastAPI backend (Python)]
      │                        │
      │                        ├─ 1. Security check
      │                        ├─ 2. Hybrid retrieval
      │                        │    ├─ PgVector (SQL: dense vector retrieval)
      │                        │    └─ BM25 (Python: sparse retrieval)
      │                        ├─ 3. Rerank
      │                        └─ 4. vLLM inference service
      │                             └─ Merged QLoRA fine-tuned model
      ▼
[Persistence layer] ──> PostgreSQL + PgVector extension
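The request flow in the diagram can be sketched as a single function that chains the four stages. This is only an illustration: `answer_query` and the four callables passed to it are hypothetical names, not part of the project code, and the toy stand-ins exist only so the sketch runs without a database or GPU.

```python
from typing import Callable

def answer_query(
    query: str,
    is_safe: Callable[[str], bool],
    retrieve: Callable[[str], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Run the four backend stages in the order shown in the diagram."""
    if not is_safe(query):                  # 1. security check
        return "Request rejected by the security check."
    candidates = retrieve(query)            # 2. hybrid retrieval (dense + sparse)
    top_docs = rerank(query, candidates)    # 3. rerank and truncate
    return generate(query, top_docs)        # 4. LLM inference

# Toy stand-ins (assumptions for illustration only)
def toy_is_safe(q: str) -> bool:
    return "DROP TABLE" not in q.upper()

def toy_retrieve(q: str) -> list[str]:
    return ["doc A", "doc B", "doc C", "doc D"]

def toy_rerank(q: str, docs: list[str]) -> list[str]:
    return docs[:3]

def toy_generate(q: str, docs: list[str]) -> str:
    return f"answer to '{q}' from {len(docs)} documents"

result = answer_query("What is the refund policy?", toy_is_safe, toy_retrieve, toy_rerank, toy_generate)
print(result)
```

Passing the stages in as callables keeps each layer replaceable, which matches the decoupling goal stated above.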
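1.2 Project Directory Layout
vertical_qa_system/
├── data/
│   ├── raw/                  # Raw business documents
│   └── processed/            # Cleaned JSONL data
├── models/
│   ├── base_model/           # Base model weights
│   ├── lora_weights/         # Trained LoRA adapter
│   └── merged_model/         # Base model with the LoRA weights merged in
├── src/
│   ├── data_engine.py        # Data cleaning and processing
│   ├── train_engine.py       # QLoRA fine-tuning script
│   ├── rag_engine.py         # Hybrid retrieval and rerank (Python)
│   └── security.py           # Security module
├── deploy/
│   ├── api_server.py         # FastAPI service code
│   ├── Dockerfile.api        # API backend image build
│   └── docker-compose.yml    # Container orchestration file
├── scripts/
│   ├── run_train.sh          # Training launch and weight-merge script
│   └── init_db.sql           # Database initialization SQL script
└── requirements.txt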
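2. Vector Database Design and Data Cleaning
Lightweight local vector stores such as ChromaDB are not recommended for production. This project uses PostgreSQL with the PgVector extension, which supports high concurrency and transactions and lets structured columns and vectors be queried together.
2.1 Table Schema (SQL)
The SQL below initializes the knowledge-base table, which stores each text chunk together with its vector representation.
scripts/init_db.sql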
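-- Enable the vector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create the knowledge-base table
CREATE TABLE IF NOT EXISTS knowledge_base (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,                          -- Text content
    source VARCHAR(255),                            -- Data source identifier
    embedding VECTOR(1024),                         -- Dimension depends on the embedding model (e.g. bge-large-zh)
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create the cosine-similarity index.
-- Note: IVFFlat picks its cluster centers from existing rows, so recall is better
-- if the index is built (or rebuilt with REINDEX) after the data has been loaded.
CREATE INDEX IF NOT EXISTS knowledge_base_embedding_idx
    ON knowledge_base USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);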
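To make the "structured columns and vectors queried together" point concrete, here is a minimal sketch that builds a parameterized query combining a `source` filter with cosine-distance ordering. The helper names `to_vector_literal` and `build_filtered_search` are hypothetical; the returned SQL and parameter tuple are meant to be passed to a psycopg2 cursor's `execute`, which this sketch does not call.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector text literal, e.g. '[0.1,2.0,-3.5]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"

def build_filtered_search(source: str, embedding: list[float], top_k: int = 5) -> tuple[str, tuple]:
    """Return (sql, params) for a search restricted to one data source."""
    sql = (
        "SELECT id, content, embedding <=> %s::vector AS distance "
        "FROM knowledge_base "
        "WHERE source = %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s;"
    )
    vec = to_vector_literal(embedding)
    # The vector literal appears twice: once for the output column, once for ORDER BY
    return sql, (vec, source, vec, top_k)

sql, params = build_filtered_search("product_manual", [0.1, 2, -3.5], top_k=3)
print(sql)
print(params)
```

One caveat worth keeping in mind: with an approximate index such as IVFFlat, a `WHERE` filter is applied to the rows the index scan returns, so a selective filter may yield fewer than `top_k` rows.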
2.2 Data Cleaning and Deduplication (Python)
Business data usually contains a lot of noise and needs to be deduplicated and normalized.
src/data_engine.py
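import json
import re
from datasketch import MinHash, MinHashLSH

# Keep CJK characters, ASCII letters and digits, common punctuation, and whitespace
_DISALLOWED = re.compile(r"[^\u4e00-\u9fa5a-zA-Z0-9,。!?、;:“”‘’()《》\s]")

def clean_text(text: str) -> str:
    """Basic cleaning: strip HTML tags and special characters, collapse whitespace."""
    text = re.sub(r"<[^>]+>", "", text)
    text = _DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def _shingles(text: str, k: int = 3) -> set[str]:
    """Character k-grams. Splitting on whitespace yields almost no tokens for Chinese text."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_dedup(data_list: list[dict], threshold: float = 0.85) -> list[dict]:
    """Near-duplicate removal with MinHash LSH (surface-text overlap, not semantic similarity)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_data = []
    for idx, item in enumerate(data_list):
        text = item.get("content", "")
        m = MinHash(num_perm=128)
        for shingle in _shingles(text):
            m.update(shingle.encode("utf8"))
        # Keep the item only if no earlier item is a likely near-duplicate
        if not lsh.query(m):
            lsh.insert(str(idx), m)
            unique_data.append(item)
    return unique_data

def convert_to_chatml(instruction: str, output: str) -> dict:
    """Wrap one instruction/answer pair in the chat-message format used for SFT."""
    return {
        "messages": [
            # System prompt (kept in Chinese for the Chinese-language model):
            # "You are a rigorous vertical-domain assistant; answer only from facts."
            {"role": "system", "content": "你是一个严谨的垂直领域助手,只基于事实回答问题。"},
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": output}
        ]
    }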
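MinHash estimates the Jaccard similarity between two sets, so it helps to see the exact quantity that the `threshold` is compared against. The sketch below computes it directly on character 3-gram sets, one common way to build such sets for Chinese text, where whitespace splitting produces very few tokens. The function names are hypothetical and the sketch uses only the standard library.

```python
def char_shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of character k-grams of `text`."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| over character k-gram sets."""
    sa, sb = char_shingles(a, k), char_shingles(b, k)
    return len(sa & sb) / len(sa | sb)

same = jaccard("退货流程说明", "退货流程说明")       # identical texts
near = jaccard("abcd", "abce")                       # {abc, bcd} vs {abc, bce}
far = jaccard("退货流程说明", "物流配送时效")        # no shared 3-grams
print(same, near, far)
```

With a threshold of 0.85, only pairs whose shingle sets overlap almost entirely are treated as duplicates; paraphrases with different wording are not caught by this method.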
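3. Model Fine-Tuning and Automation Pipeline (Python + Shell)
Fine-tuning uses QLoRA to reduce GPU memory requirements. After training, the LoRA weights must be merged back into the base model. A shell script chains these steps into a reproducible pipeline.
3.1 Core QLoRA Fine-Tuning Logic (Python)
src/train_engine.py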
import torch
from pathlib import Path
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
from trl import SFTTrainer

# Resolve paths from the project root so the script works from any working directory
PROJECT_ROOT = Path(__file__).resolve().parent.parent
MODEL_PATH = str(PROJECT_ROOT / "models/base_model/Qwen2-7B-Instruct")
LORA_DIR = str(PROJECT_ROOT / "models/lora_weights")
MERGED_DIR = str(PROJECT_ROOT / "models/merged_model")

def run_training():
    dataset_path = str(PROJECT_ROOT / "data/processed/train.jsonl")
    # 4-bit NF4 quantization with double quantization; matmuls run in bfloat16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH, quantization_config=bnb_config, device_map="auto", trust_remote_code=True
    )
    model = prepare_model_for_kbit_training(model)
    # Attach LoRA adapters to the attention projections only
    peft_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, peft_config)
    dataset = load_dataset("json", data_files=dataset_path, split="train")
    training_args = TrainingArguments(
        output_dir=LORA_DIR, per_device_train_batch_size=2, gradient_accumulation_steps=4,
        optim="paged_adamw_32bit", save_steps=50, learning_rate=2e-4, max_steps=200,
        bf16=True, lr_scheduler_type="cosine",
    )
    # The model is already wrapped by get_peft_model, so peft_config is not passed again.
    # max_seq_length / tokenizer follow older trl releases; newer releases move
    # max_seq_length into SFTConfig and rename tokenizer to processing_class.
    trainer = SFTTrainer(
        model=model, train_dataset=dataset,
        max_seq_length=2048, tokenizer=tokenizer, args=training_args,
    )
    trainer.train()
    trainer.save_model(LORA_DIR)

def merge_lora_weights():
    """Merge the LoRA adapter into the base model; called from the shell script."""
    model = AutoPeftModelForCausalLM.from_pretrained(
        LORA_DIR, device_map="auto", torch_dtype=torch.float16
    )
    model = model.merge_and_unload()
    model.save_pretrained(MERGED_DIR)
    # Save the tokenizer alongside the weights so the merged directory loads on its own
    AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True).save_pretrained(MERGED_DIR)

if __name__ == "__main__":
    run_training()
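The hyperparameters above imply a few numbers that are easy to check by hand. For a linear layer of shape d_in × d_out, LoRA with rank r adds r × (d_in + d_out) trainable parameters. The projection shapes and layer count below are assumptions based on the published Qwen2-7B configuration (hidden size 3584, 512-dimensional key/value projections, 28 layers) and should be checked against the `config.json` of the model actually used; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r: int, shapes: dict[str, tuple[int, int]], num_layers: int) -> int:
    """Trainable LoRA parameters: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# ASSUMED shapes (d_in, d_out) and depth; verify against the model's config.json
ASSUMED_SHAPES = {
    "q_proj": (3584, 3584),
    "k_proj": (3584, 512),
    "v_proj": (3584, 512),
    "o_proj": (3584, 3584),
}
ASSUMED_LAYERS = 28

trainable = lora_param_count(r=16, shapes=ASSUMED_SHAPES, num_layers=ASSUMED_LAYERS)
lora_scaling = 32 / 16                  # lora_alpha / r
effective_batch = 2 * 4                 # per-device batch size * gradient accumulation steps
samples_per_device = effective_batch * 200   # effective batch * max_steps

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling factor: {lora_scaling}")
print(f"samples seen per device: {samples_per_device}")
```

Under these assumptions the adapter holds roughly ten million trainable parameters, a small fraction of the 7B base model, and 200 steps cover 1,600 training samples per device.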
3.2 Training and Merge Automation Script
scripts/run_train.sh
#!/bin/bash
set -euo pipefail
# Run from the project root so that "src" is importable and ./models resolves
cd "$(dirname "$0")/.."
echo "===== Stage 1: QLoRA fine-tuning ====="
python src/train_engine.py
echo "===== Stage 2: Merge LoRA weights ====="
python -c "from src.train_engine import merge_lora_weights; merge_lora_weights()"
echo "===== Stage 3: Validate the merged model ====="
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = './models/merged_model'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map='auto')
# Smoke-test prompt (Chinese): 'Hello, please introduce yourself.'
inputs = tokenizer('你好,请介绍一下你自己。', return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"
echo "===== Pipeline finished ====="
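4. Hybrid RAG Engine: Combining PgVector and BM25 (Python + SQL)
Dense vector retrieval runs as native SQL in the database, and BM25 sparse retrieval runs in Python. The two result sets are merged and then reranked.
src/rag_engine.py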
import psycopg2
from rank_bm25 import BM25Okapi
import jieba
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

class HybridRAG:
    def __init__(self, db_url: str):
        # Connect to the PgVector database
        self.conn = psycopg2.connect(db_url)
        self.reranker = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-large")
        self.bm25 = None
        self.bm25_docs = []

    def build_bm25_index(self):
        """Load all texts from the database and build the in-memory BM25 index."""
        with self.conn.cursor() as cur:
            cur.execute("SELECT content FROM knowledge_base;")
            self.bm25_docs = [row[0] for row in cur.fetchall()]
        tokenized_corpus = [list(jieba.cut(doc)) for doc in self.bm25_docs]
        self.bm25 = BM25Okapi(tokenized_corpus)

    def dense_search_pgvector(self, query_embedding: list[float], top_k: int = 10) -> list[str]:
        """Dense retrieval in SQL; <=> is PgVector's cosine-distance operator."""
        sql = """
            SELECT content
            FROM knowledge_base
            ORDER BY embedding <=> %s::vector
            LIMIT %s;
        """
        with self.conn.cursor() as cur:
            # Convert the Python list to PostgreSQL's vector text format
            vec_str = "[" + ",".join(map(str, query_embedding)) + "]"
            cur.execute(sql, (vec_str, top_k))
            return [row[0] for row in cur.fetchall()]

    def sparse_search_bm25(self, query: str, top_k: int = 10) -> list[str]:
        """BM25 sparse retrieval in Python."""
        if self.bm25 is None:
            raise RuntimeError("Call build_bm25_index() before sparse_search_bm25().")
        tokenized_query = list(jieba.cut(query))
        bm25_scores = self.bm25.get_scores(tokenized_query)
        top_idx = sorted(range(len(bm25_scores)), key=lambda i: bm25_scores[i], reverse=True)[:top_k]
        return [self.bm25_docs[i] for i in top_idx]

    def retrieve_with_rerank(self, query: str, query_embedding: list[float]) -> list[str]:
        # 1. Two-way recall
        dense_results = self.dense_search_pgvector(query_embedding)
        sparse_results = self.sparse_search_bm25(query)
        # 2. Deduplicate while keeping a deterministic order (set() would shuffle it)
        union_results = list(dict.fromkeys(dense_results + sparse_results))
        if not union_results:
            return []
        # 3. Rerank; HuggingFaceCrossEncoder exposes score(), not predict()
        pairs = [(query, doc) for doc in union_results]
        scores = self.reranker.score(pairs)
        # Sort by score in descending order and keep the top 3
        ranked_results = sorted(zip(union_results, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, score in ranked_results[:3]]
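The merge-then-rerank step can be exercised without a database or a cross-encoder model. The sketch below reproduces the same three steps (union, deduplicate, score and truncate) with a toy scoring function standing in for the reranker. `merge_and_rerank` and `char_overlap` are hypothetical names, and character overlap is only a placeholder for a real relevance model.

```python
from typing import Callable

def merge_and_rerank(
    query: str,
    dense: list[str],
    sparse: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Union two recall lists, drop duplicates in first-seen order, then rerank."""
    candidates = list(dict.fromkeys(dense + sparse))
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable: ties keep recall order
    return [doc for doc, _ in scored[:top_n]]

def char_overlap(query: str, doc: str) -> float:
    """Toy relevance score: fraction of the query's distinct characters found in doc."""
    q_chars = set(query)
    return len(q_chars & set(doc)) / max(len(q_chars), 1)

dense_hits = ["refund policy details", "shipping times", "contact us"]
sparse_hits = ["refund policy details", "refund form"]
top = merge_and_rerank("refund", dense_hits, sparse_hits, char_overlap, top_n=2)
print(top)
```

Using `dict.fromkeys` rather than `set` keeps the candidate order deterministic, so ties in the rerank score resolve the same way on every run.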
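5. Containerized Deployment and Going Live
Finally, Docker Compose orchestrates the PostgreSQL database, the vLLM inference engine, and the FastAPI backend together so that the whole system can be brought up with one command.
5.1 FastAPI Backend (Python)
deploy/api_server.py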
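from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import requests  # Used to call the separately deployed vLLM service

app = FastAPI(title="Vertical-Domain QA System API")

class ChatRequest(BaseModel):
    query: str

@app.post("/v1/chat")
def chat_endpoint(req: ChatRequest):
    # Production code must add the RAG logic and security checks here;
    # this simplified version only calls the vLLM inference service.
    # The vLLM service is assumed to listen on port 8001.
    vllm_url = "http://vllm-server:8001/v1/completions"
    payload = {
        "model": "/app/models/merged_model",
        # Prompt (Chinese): "User question: ... Please answer based on professional knowledge:"
        "prompt": f"用户问题:{req.query}\n请根据专业知识回答:",
        "max_tokens": 512,
        "temperature": 0.3
    }
    try:
        # Fail fast instead of hanging if the inference service is down or slow
        response = requests.post(vllm_url, json=payload, timeout=60)
        response.raise_for_status()
    except requests.RequestException as exc:
        raise HTTPException(status_code=502, detail="Inference service unavailable") from exc
    choices = response.json().get("choices") or [{}]
    answer = choices[0].get("text", "Service error")
    return {"query": req.query, "answer": answer.strip()}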
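The endpoint above sends the bare question to the model. As a minimal sketch of where retrieved passages would enter, the function below assembles a prompt from the question and a list of context strings, stopping once a character budget is used up. `build_prompt`, the `max_chars` budget, and the English prompt wording are all assumptions for illustration; the project's own prompt is written in Chinese.

```python
def build_prompt(query: str, contexts: list[str], max_chars: int = 1500) -> str:
    """Assemble a grounded prompt from retrieved passages, within a character budget."""
    kept, used = [], 0
    for i, ctx in enumerate(contexts, start=1):
        if used + len(ctx) > max_chars:
            break  # stop at the first passage that would exceed the budget
        kept.append(f"[{i}] {ctx}")
        used += len(ctx)
    context_block = "\n".join(kept) if kept else "(no reference material retrieved)"
    return (
        "Reference material:\n"
        f"{context_block}\n\n"
        f"User question: {query}\n"
        "Answer using only the reference material above:"
    )

prompt = build_prompt(
    "How long is the warranty?",
    ["The warranty period is 12 months.", "x" * 2000],
)
print(prompt)
```

Because the passages arrive already ranked, truncating from the end drops the least relevant material first.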
5.2 Docker Compose Orchestration (YAML)
deploy/docker-compose.yml
version: '3.8'
services:
  # 1. Relational vector database service
  postgres-db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: vertical_qa_db
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data
      - ../scripts/init_db.sql:/docker-entrypoint-initdb.d/init.sql  # Mount the initialization SQL
  # 2. vLLM high-performance inference service
  vllm-server:
    image: vllm/vllm-openai:latest
    command: --model /app/models/merged_model --trust-remote-code --port 8001
    ports:
      - "8001:8001"
    volumes:
      - ../models/merged_model:/app/models/merged_model
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  # 3. FastAPI business backend
  api-backend:
    build:
      # Use the project root as the build context: requirements.txt and src/
      # live outside deploy/, and COPY cannot reach outside the context.
      context: ..
      dockerfile: deploy/Dockerfile.api
    ports:
      - "8000:8000"
    depends_on:
      - postgres-db
      - vllm-server
volumes:
  pg_data:
5.3 API Backend Image Build (Dockerfile)
deploy/Dockerfile.api
FROM python:3.10-slim
WORKDIR /app
# Install base dependencies and algorithm libraries
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Paths are relative to the build context (the project root)
COPY deploy/api_server.py .
COPY src/ ./src/
CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8000"]
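To start the system, run docker-compose up -d in the deploy directory. A multi-language, multi-component architecture of this kind is what taking a vertical-domain model to production typically looks like.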