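Data quality sets the ceiling for model fine-tuning. However advanced the fine-tuning algorithm, training data that is noisy, biased, or badly formatted will sharply degrade the final result. The "garbage in, garbage out" principle is especially pronounced in LLM fine-tuning, because a large model's strong fitting capacity means it will faithfully learn whatever erroneous patterns the data contains.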
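This article covers the full data-engineering pipeline for fine-tuning: from raw data collection, through cleaning, formatting, and quality evaluation, to the final construction of the training set. It is the most underrated part of LLM fine-tuning, and the one that most rewards investment.

## Sources and Types of Fine-Tuning Data

### Instruction-Following Data

Instruction following is the most common fine-tuning task. The training data takes the form of (instruction, input, output) triples:

```json
{
  "instruction": "Convert the following Python code into equivalent Go code",
  "input": "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)",
  "output": "func fibonacci(n int) int {\n    if n <= 1 {\n        return n\n    }\n    return fibonacci(n-1) + fibonacci(n-2)\n}"
}
```

Data sourcing strategies:

1. **Human annotation**: highest quality, highest cost. Best suited to a carefully labeled dataset for core capabilities (500-2,000 samples).
2. **Generation with GPT-4/Claude**: use a strong model to generate data for fine-tuning a weaker one (the "self-instruct" idea). This requires careful quality filtering.
3. **Extraction from existing documents**: turn internal FAQs, operating manuals, and API documentation into question-answer pairs.
4. **Distillation from user logs**: extract high-quality samples from logs of real user interactions with an AI system.

### Conversation Data

Multi-turn conversation data uses ChatML-style role/content messages:

```json
{
  "conversations": [
    {"role": "system", "content": "You are a professional Python programming assistant"},
    {"role": "user", "content": "How do I read a CSV file in Python?"},
    {"role": "assistant", "content": "You can use the read_csv function from pandas:\n```python\nimport pandas as pd\ndf = pd.read_csv('file.csv')\n```"},
    {"role": "user", "content": "What if the CSV file is GBK-encoded?"},
    {"role": "assistant", "content": "Just pass the encoding parameter:\n```python\ndf = pd.read_csv('file.csv', encoding='gbk')\n```"}
  ]
}
```

### Preference Data (RLHF/DPO)

Data for preference optimization needs paired "chosen" and "rejected" outputs:

```json
{
  "prompt": "Explain quantum entanglement",
  "chosen": "Quantum entanglement is a phenomenon in quantum mechanics: once two particles are in an entangled state, measuring one of them instantly affects the state of the other, no matter how far apart they are. Einstein called this 'spooky action at a distance'...",
  "rejected": "Quantum entanglement is just two particles joined together; when one moves, the other moves too."
}
```

## Data Cleaning Engineering

### Automated Cleaning Pipeline

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CleaningResult:
    passed: bool
    cleaned_data: Optional[dict]
    rejection_reason: Optional[str]


class DataCleaner:
    def __init__(self, config: dict):
        self.config = config

    def clean_sample(self, sample: dict) -> CleaningResult:
        """Run the full cleaning pipeline on a single sample."""
        # 1. Format validation
        format_check = self._check_format(sample)
        if not format_check.passed:
            return format_check

        # 2. Length filtering
        length_check = self._check_length(sample)
        if not length_check.passed:
            return length_check

        # 3. Content cleaning
        sample = self._clean_content(sample)

        # 4. Quality filtering
        quality_check = self._check_quality(sample)
        if not quality_check.passed:
            return quality_check

        # 5. Duplicate detection needs external state, so it is handled
        #    at the pipeline level and omitted here.
        return CleaningResult(passed=True, cleaned_data=sample, rejection_reason=None)

    def _check_format(self, sample: dict) -> CleaningResult:
        required_fields = self.config.get("required_fields", ["instruction", "output"])
        for field in required_fields:
            if field not in sample or not sample[field]:
                return CleaningResult(
                    passed=False,
                    cleaned_data=None,
                    rejection_reason=f"Missing required field: {field}",
                )
        return CleaningResult(passed=True, cleaned_data=sample, rejection_reason=None)

    def _check_length(self, sample: dict) -> CleaningResult:
        max_length = self.config.get("max_total_length", 4096)
        min_output_length = self.config.get("min_output_length", 20)

        output = sample.get("output", "")
        if len(output) < min_output_length:
            return CleaningResult(
                passed=False,
                cleaned_data=None,
                rejection_reason=f"Output too short: {len(output)} chars",
            )

        total_length = sum(len(str(v)) for v in sample.values())
        # Rough estimate of ~4 characters per token. This ratio suits English;
        # Chinese text usually has fewer characters per token, so use a real
        # tokenizer when precision matters.
        if total_length > max_length * 4:
            return CleaningResult(
                passed=False,
                cleaned_data=None,
                rejection_reason=f"Sample too long: ~{total_length // 4} tokens",
            )
        return CleaningResult(passed=True, cleaned_data=sample, rejection_reason=None)

    def _clean_content(self, sample: dict) -> dict:
        """Strip HTML tags and normalize whitespace in every string field."""
        cleaned = {}
        for key, value in sample.items():
            if isinstance(value, str):
                # Strip HTML tags. Note: this also removes angle-bracket
                # constructs in code, such as generics.
                value = re.sub(r'<[^>]+>', '', value)
                # Normalize whitespace. Note: collapsing runs of spaces also
                # flattens code indentation, so skip this for code-heavy fields.
                value = re.sub(r'\n{3,}', '\n\n', value)
                value = re.sub(r' {2,}', ' ', value)
                # Trim leading and trailing whitespace
                value = value.strip()
            cleaned[key] = value
        return cleaned

    def _check_quality(self, sample: dict) -> CleaningResult:
        output = sample.get("output", "")

        # Common quality problems. The filler pattern targets Chinese data and
        # matches openers such as "当然" ("of course") and "好的" ("okay").
        quality_issues = [
            # Filler openings typical of LLM-generated data
            (r'^(当然|好的|当然可以|当然了|好的,我来).{0,10}[!。!]', "AI-style filler start"),
            # Truncated output (common in scraped data)
            (r'\.\.\.$|……$', "Truncated output"),
            # Highly repetitive content. Whitespace splitting is a weak signal
            # for Chinese text, which is not space-delimited.
            (lambda s: len(set(s.split())) / max(len(s.split()), 1) < 0.2, "High repetition"),
        ]

        for check, reason in quality_issues:
            if callable(check):
                if check(output):
                    return CleaningResult(passed=False, cleaned_data=None, rejection_reason=reason)
            elif re.search(check, output):
                return CleaningResult(passed=False, cleaned_data=None, rejection_reason=reason)

        return CleaningResult(passed=True, cleaned_data=sample, rejection_reason=None)
```

### LLM-Based Quality Evaluation

For complex quality judgments, such as whether an answer is correct or genuinely helpful, a strong model can do automated scoring:

```python
import json
import random


class LLMQualityScorer:
    def __init__(self, llm_client):
        # llm_client is any client exposing an async generate() method that
        # returns a JSON string; adapt the call below to your own SDK.
        self.llm = llm_client

    async def score_sample(self, sample: dict) -> dict:
        """Score a sample on several dimensions (1-5) using an LLM."""
        score_prompt = f"""Evaluate the quality of the following training sample and give scores from 1 to 5 (1 = very poor, 5 = excellent):

Instruction: {sample.get('instruction', '')}
Input: {sample.get('input', 'none')}
Output: {sample.get('output', '')}

Scoring dimensions (1-5 each):
1. Accuracy: is the output correct?
2. Completeness: does the output fully answer the instruction?
3. Clarity: is the output clearly expressed?
4. Harmlessness: is the output free of harmful content?

Return JSON:
{{"accuracy": X, "completeness": X, "clarity": X, "safety": X, "overall": X}}"""

        response = await self.llm.generate(score_prompt, response_format="json")
        scores = json.loads(response)

        return {
            **sample,
            "_quality_scores": scores,
            "_quality_overall": scores.get("overall", 0),
        }

    async def filter_by_quality(
        self,
        dataset: list[dict],
        min_score: float = 3.5,
        sample_rate: float = 0.1,  # Score only 10% of samples with the LLM to control cost
    ) -> list[dict]:
        # Randomly choose which samples get LLM scoring
        sampled_indices = set(
            random.sample(range(len(dataset)), int(len(dataset) * sample_rate))
        )

        filtered = []
        for i, sample in enumerate(dataset):
            if i in sampled_indices:
                scored = await self.score_sample(sample)
                if scored["_quality_overall"] >= min_score:
                    filtered.append(sample)
            else:
                filtered.append(sample)  # Unscored samples are kept by default

        return filtered
```

## Data Diversity and Balance

### Inspecting the Data Distribution

```python
class DatasetAnalyzer:
    def analyze_diversity(self, dataset: list[dict]) -> dict:
        """Compute diversity metrics for a dataset."""
        if not dataset:
            return {"total_samples": 0}

        outputs = [s.get("output", "") for s in dataset]
        instructions = [s.get("instruction", "") for s in dataset]

        # Length distribution, in whitespace-separated tokens. For Chinese
        # text, swap in a real tokenizer, since words are not space-delimited.
        output_lengths = sorted(len(o.split()) for o in outputs)
        n = len(output_lengths)

        # Vocabulary diversity (type-token ratio)
        all_tokens = " ".join(outputs).split()
        ttr = len(set(all_tokens)) / len(all_tokens) if all_tokens else 0

        # Topic distribution (simple keyword classification)
        topic_distribution = self._classify_topics(instructions)

        return {
            "total_samples": n,
            "output_length": {
                "mean": sum(output_lengths) / n,
                "min": output_lengths[0],
                "max": output_lengths[-1],
                "p50": output_lengths[n // 2],
                "p95": output_lengths[int(n * 0.95)],
            },
            "vocabulary_diversity": round(ttr, 3),
            "topic_distribution": topic_distribution,
        }

    def _classify_topics(self, instructions: list[str]) -> dict:
        # Keywords are kept in Chinese because they match a Chinese dataset.
        # Each instruction counts toward the first topic that matches.
        topic_keywords = {
            "coding": ["代码", "程序", "函数", "debug", "Python", "Java"],
            "writing": ["写作", "文章", "邮件", "报告", "总结"],
            "analysis": ["分析", "对比", "评估", "理解", "解释"],
            "math": ["计算", "数学", "公式", "统计"],
            "other": [],
        }

        counts = {topic: 0 for topic in topic_keywords}
        for instruction in instructions:
            classified = False
            for topic, keywords in topic_keywords.items():
                if any(kw in instruction for kw in keywords):
                    counts[topic] += 1
                    classified = True
                    break
            if not classified:
                counts["other"] += 1

        total = max(len(instructions), 1)
        return {topic: f"{count / total:.1%}" for topic, count in counts.items()}
```

### Oversampling and Undersampling

When some categories in the dataset are severely imbalanced, rebalance them:

```python
import random


def balance_dataset(dataset: list[dict], target_per_category: int = 500) -> list[dict]:
    """Rebalance a dataset by category."""
    # Group samples by category
    categories = {}
    for sample in dataset:
        cat = sample.get("_category", "other")
        categories.setdefault(cat, []).append(sample)

    balanced = []
    for cat, samples in categories.items():
        if len(samples) > target_per_category:
            # Undersample: pick a random subset
            selected = random.sample(samples, target_per_category)
        elif len(samples) < target_per_category * 0.5:
            # Warn on severe shortage (heavy oversampling is not recommended)
            print(f"Warning: Category '{cat}' only has {len(samples)} samples")
            selected = samples
        else:
            selected = samples
        balanced.extend(selected)

    random.shuffle(balanced)
    return balanced
```

## Best Practices for Building the Training Set

**Golden ratio**: for task-specific fine-tuning, combining high-quality hand-labeled data (500-2,000 samples) with general instruction data (5,000-20,000 samples) usually beats relying on a large volume of low-quality data alone.

**Validation set isolation**: split into a training set (90%) and a validation set (10%) from the very start. The validation set takes no part in any cleaning decision, which keeps the evaluation objective.

**Data versioning**: like code, training data should be under version control. DVC (Data Version Control) or LakeFS is recommended for tracking how a dataset evolves.

**Continuous monitoring**: after fine-tuning, track the model's performance on the test set along each dimension, and trace any problem back to the data as soon as it appears. Many "model problems" are really "data problems" in disguise.

A high-quality fine-tuning dataset is the foundation of LLM customization. Time and effort invested in data engineering often pay off more than tuning training hyperparameters.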
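Two steps described above are left without code: the duplicate detection that the cleaning pipeline defers to the pipeline level, and the 90/10 train/validation split. The following is a minimal sketch of both, not the article's own implementation. The helper names (`normalize`, `sample_key`, `dedup_and_split`), the whitespace-and-case normalization, the SHA-256 fingerprint, and the choice to deduplicate before splitting (so the same sample cannot land in both sets) are all assumptions made for illustration. It catches only exact and near-exact duplicates, not paraphrases.

```python
import hashlib
import random


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return " ".join(text.split()).lower()


def sample_key(sample: dict) -> str:
    """Fingerprint of the (instruction, input, output) triple (hypothetical helper)."""
    parts = [
        normalize(str(sample.get(field, "")))
        for field in ("instruction", "input", "output")
    ]
    # Join with a separator that is unlikely to occur in the text itself
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def dedup_and_split(dataset: list[dict], val_ratio: float = 0.1, seed: int = 42):
    """Drop duplicate samples, then split into (train, validation)."""
    seen = set()
    unique = []
    for sample in dataset:
        key = sample_key(sample)
        if key not in seen:  # Keep only the first occurrence
            seen.add(key)
            unique.append(sample)

    # A seeded shuffle keeps the split reproducible across runs
    rng = random.Random(seed)
    rng.shuffle(unique)

    n_val = int(len(unique) * val_ratio)
    return unique[n_val:], unique[:n_val]
```

Calling `train, val = dedup_and_split(dataset)` on the raw samples, before any quality filtering is tuned, gives a fixed validation set that later cleaning decisions cannot influence.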
