Section 33: The PyTorch Fine-Tuning Framework, from Beginner to Expert [Part 4: Large-Model Fine-Tuning]

Thomas.Sir · Published 2026-04-18 08:16:18

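Part 4: Large-Model Fine-Tuning

10. Fine-Tuning Large Language Models (LLMs)

10.1 Characteristics and Challenges of LLM Fine-Tuning

A large language model (LLM) usually refers to an autoregressive language model with more than 1 billion (1B) parameters, such as LLaMA, Qwen, ChatGLM, or GPT-3. Compared with "small" models such as BERT, fine-tuning an LLM brings a new set of challenges and opportunities.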

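Core characteristics of LLM fine-tuning:

Huge parameter counts: models range from 7B to 180B parameters. Full fine-tuning must hold weights, gradients, and optimizer state at the same time, which takes tens of GB for a 7B model and hundreds of GB or more for larger ones, beyond a single GPU and often beyond a single node.
Different data requirements: pretraining has already given the LLM broad world knowledge and language ability, so fine-tuning mostly "elicits" or "steers" a specific behavior rather than teaching from scratch. Instruction tuning therefore usually needs only tens of thousands to a few hundred thousand high-quality examples.
Training stability: large models are sensitive to hyperparameters. A learning rate that is too high can make the model emit garbage ("collapse"), while one that is too low fails to learn the new task.
Complex inference and deployment: the fine-tuned model has to be served efficiently, so quantization, pruning, and inference acceleration are indispensable.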

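Main challenges:

Challenge	Description	Mitigation
Insufficient GPU memory	Full fine-tuning of a 7B model needs roughly 60-80 GB or more, depending on optimizer-state precision	QLoRA, gradient checkpointing, ZeRO
Long training time	Full fine-tuning takes days or even weeks	PEFT methods such as LoRA, multi-GPU parallelism
Catastrophic forgetting	Overfitting to one task and losing general ability	Mixed-data fine-tuning, PEFT, regularization
Complex data formats	Data must be built as instruction-input-output records	Standard templates such as Alpaca and ShareGPT
Hard to evaluate	Generation tasks are difficult to score automatically	LLM-as-judge evaluation (e.g. GPT-4), with BLEU/ROUGE as auxiliary metrics

This chapter focuses on fine-tuning LLMs efficiently under limited resources.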
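The memory figures in the table above can be sanity-checked with simple per-parameter arithmetic. The sketch below is only a rough estimate under stated assumptions: the helper name `training_memory_gb` and its byte counts are illustrative, activations are excluded, and 12 bytes per parameter of optimizer state assumes an FP32 master copy plus two FP32 Adam moments.

```python
def training_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Rough memory for weights, gradients and optimizer state (activations excluded)."""
    total_bytes = n_params * (weight_bytes + grad_bytes + optimizer_bytes)
    return total_bytes / 1e9

n = 7e9  # a 7B-parameter model

# FP16 weights + FP16 gradients + FP32 master weights and two FP32 Adam moments
mixed = training_memory_gb(n)

# Assumption: both Adam moments kept in 16-bit and no FP32 master copy
half_states = training_memory_gb(n, optimizer_bytes=4)

print(f"mixed precision: {mixed:.0f} GB, 16-bit optimizer state: {half_states:.0f} GB")
```

Which of the two numbers applies depends on how the optimizer state is stored, which is why published estimates for 7B full fine-tuning differ so widely.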

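10.2 Loading LLMs with HuggingFace Transformers (LLaMA, Qwen, ChatGLM)

The HuggingFace ecosystem provides one loading interface for many LLMs. Individual models differ in configuration details, but the core API is the same.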

# ========== 安装必要的库 ==========
# pip install torch transformers accelerate bitsandbytes

import torch
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer
)
import os

# 设置设备
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"使用设备: {device}")
print(f"CUDA 显存: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB" if torch.cuda.is_available() else "CPU 模式")

# ========== 1. 加载 LLaMA 2 模型（示例） ==========
print("=" * 60)
print("1. 加载 LLaMA 2 模型")
print("=" * 60)

# 注意：LLaMA 2 需要向 Meta 申请访问权限，使用你的 HuggingFace token
# 此处使用较小的模型 "TinyLlama/TinyLlama-1.1B-Chat-v1.0" 作为演示（无需授权）
model_name_llama = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# 使用 4-bit 量化加载（节省显存）
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit 量化类型
    bnb_4bit_compute_dtype=torch.float16, # 计算使用 FP16
    bnb_4bit_use_double_quant=True,       # 双重量化
)

tokenizer_llama = AutoTokenizer.from_pretrained(model_name_llama)
tokenizer_llama.pad_token = tokenizer_llama.eos_token

model_llama = AutoModelForCausalLM.from_pretrained(
    model_name_llama,
    quantization_config=bnb_config,       # 使用 4-bit 量化
    device_map="auto",                    # 自动分配到可用设备
    trust_remote_code=True,
)

print(f"LLaMA 风格模型加载完成")
print(f"模型类型: {type(model_llama)}")
print(f"参数量: {sum(p.numel() for p in model_llama.parameters()):,}")

# ========== 2. 加载 Qwen（通义千问）模型 ==========
print("\n" + "=" * 60)
print("2. 加载 Qwen 模型")
print("=" * 60)

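# Qwen-1.8B example (a small model, convenient for demos)
model_name_qwen = "Qwen/Qwen-1_8B"  # or "Qwen/Qwen-7B"

tokenizer_qwen = AutoTokenizer.from_pretrained(model_name_qwen, trust_remote_code=True)
# The first-generation Qwen tokenizer does not define eos_token, so
# `pad_token = eos_token` would leave the pad token unset. Qwen's own
# fine-tuning script pads with the end-of-document id instead.
tokenizer_qwen.pad_token_id = tokenizer_qwen.eod_id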

# 使用 4-bit 量化加载
model_qwen = AutoModelForCausalLM.from_pretrained(
    model_name_qwen,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

print(f"Qwen 模型加载完成")

# ========== 3. 加载 ChatGLM3 ==========
print("\n" + "=" * 60)
print("3. 加载 ChatGLM3 模型")
print("=" * 60)

# ChatGLM3-6B 示例（需要较大显存，可使用 4-bit 量化）
model_name_chatglm = "THUDM/chatglm3-6b"

tokenizer_chatglm = AutoTokenizer.from_pretrained(model_name_chatglm, trust_remote_code=True)
model_chatglm = AutoModelForCausalLM.from_pretrained(
    model_name_chatglm,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

print(f"ChatGLM3 模型加载完成")

# ========== 4. 通用加载函数 ==========
print("\n" + "=" * 60)
print("4. 通用 LLM 加载函数")
print("=" * 60)

def load_llm(model_id, use_4bit=True, use_8bit=False):
    """
    Generic LLM loader with optional 4-bit or 8-bit quantization.

    Args:
        model_id: HuggingFace model ID
        use_4bit: load with 4-bit (NF4) quantization
        use_8bit: load with 8-bit quantization (mutually exclusive with
                  use_4bit, so pass use_4bit=False when requesting 8-bit)

    Returns:
        model, tokenizer
    """
    if use_4bit and use_8bit:
        raise ValueError("use_4bit and use_8bit are mutually exclusive")

    # Configure quantization
    if use_4bit:
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
        )
        print(f"Loading {model_id} with 4-bit quantization")
    elif use_8bit:
        quantization_config = BitsAndBytesConfig(load_in_8bit=True)
        print(f"Loading {model_id} with 8-bit quantization")
    else:
        quantization_config = None
        print(f"Loading {model_id} in FP16 without quantization (needs much more GPU memory)")

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load the model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.float16 if quantization_config is None else None,
    )

    return model, tokenizer

# 示例：加载 TinyLlama（无量化，用于演示）
# model, tokenizer = load_llm("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_4bit=False)

print("通用加载函数已定义")

# ========== 5. 测试模型推理 ==========
print("\n" + "=" * 60)
print("5. 测试 LLM 推理")
print("=" * 60)

def generate_text(model, tokenizer, prompt, max_new_tokens=100):
    """
    使用 LLM 生成文本
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# 使用 TinyLlama 测试
test_prompt = "Hello, how are you?"
print(f"Prompt: {test_prompt}")
# response = generate_text(model_llama, tokenizer_llama, test_prompt)
# print(f"Response: {response}")

print("推理函数已定义（实际生成需在 GPU 上运行）")

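10.3 Instruction Tuning Data Formats

Instruction tuning is the key technique for making an LLM follow human instructions. The data is usually stored as instruction-input-output triples. The common formats and their preprocessing are shown below.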

print("\n" + "=" * 60)
print("10.3 指令微调数据格式")
print("=" * 60)

# ========== 1. 常见数据格式 ==========

# Alpaca 格式
alpaca_format = {
    "instruction": "将以下句子翻译成英文",
    "input": "今天天气很好",
    "output": "The weather is nice today."
}

# ShareGPT 格式（多轮对话）
sharegpt_format = {
    "conversations": [
        {"from": "human", "value": "什么是机器学习？"},
        {"from": "gpt", "value": "机器学习是人工智能的一个分支..."},
        {"from": "human", "value": "能举个例子吗？"},
        {"from": "gpt", "value": "当然，比如图像分类..."}
    ]
}

# 自定义格式（简洁）
custom_format = {
    "prompt": "用户: 解释一下量子计算\n助手:",
    "completion": "量子计算是一种利用量子力学原理的计算方式..."
}

print("1. Alpaca 格式（单轮指令）")
print(f"   示例: {alpaca_format}")
print("\n2. ShareGPT 格式（多轮对话）")
print(f"   示例: {sharegpt_format['conversations'][:2]}")
print("\n3. 自定义格式（灵活）")

# ========== 2. 构造指令微调数据集 ==========
from torch.utils.data import Dataset
import json

class InstructionDataset(Dataset):
    """
    指令微调数据集
    支持 Alpaca 和 ShareGPT 格式
    """
    
    def __init__(self, data_path, tokenizer, max_length=512, format_type="alpaca"):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.format_type = format_type
        
        # 加载数据
        with open(data_path, 'r', encoding='utf-8') as f:
            if data_path.endswith('.json'):
                self.data = json.load(f)
            else:
                self.data = [json.loads(line) for line in f]
        
        print(f"加载了 {len(self.data)} 条指令数据")
    
    def _format_alpaca(self, example):
        """将 Alpaca 格式转换为模型输入"""
        instruction = example.get("instruction", "")
        input_text = example.get("input", "")
        output = example.get("output", "")
        
        if input_text:
            prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
        else:
            prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
        
        # 完整文本（prompt + output）
        full_text = prompt + output
        
        return prompt, full_text
    
    def _format_sharegpt(self, example):
        """Convert a ShareGPT-format example into (prompt, full_text)."""
        conversations = example.get("conversations", [])

        # Tag each turn with its role
        turns = []
        for turn in conversations:
            role = "用户" if turn["from"] == "human" else "助手"
            turns.append((role, turn["value"]))

        assistant_idx = [i for i, (role, _) in enumerate(turns) if role == "助手"]
        if not assistant_idx:
            # No assistant reply: there is no target, so the whole text is prompt
            prompt = "\n".join(f"{r}: {v}" for r, v in turns)
            return prompt, prompt

        # The last assistant reply is the training target;
        # everything before it forms the prompt.
        last = assistant_idx[-1]
        history = [f"{r}: {v}" for r, v in turns[:last]]
        prompt = "\n".join(history + ["助手:"])
        full_text = prompt + " " + turns[last][1]

        return prompt, full_text
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        example = self.data[idx]
        
        if self.format_type == "alpaca":
            prompt, full_text = self._format_alpaca(example)
        else:
            prompt, full_text = self._format_sharegpt(example)
        
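        # Append EOS so the model learns where a reply ends
        # (most HF tokenizers map the EOS string back to its special token id)
        if self.tokenizer.eos_token:
            full_text = full_text + self.tokenizer.eos_token

        # Tokenize the full text (prompt + reply)
        tokenized_full = self.tokenizer(
            full_text,
            truncation=True,
            max_length=self.max_length,
            padding=False,
            return_tensors=None
        )

        # Tokenize the prompt alone to find how many leading tokens to mask
        tokenized_prompt = self.tokenizer(
            prompt,
            truncation=True,
            max_length=self.max_length,
            padding=False,
            return_tensors=None
        )

        input_ids = tokenized_full["input_ids"]
        labels = list(input_ids)

        # Set the prompt labels to -100 so the loss ignores them. Clamp to the
        # sequence length: an oversized slice assignment would otherwise make
        # labels longer than input_ids when the text is truncated.
        prompt_len = min(len(tokenized_prompt["input_ids"]), len(input_ids))
        labels[:prompt_len] = [-100] * prompt_len

        attention_mask = tokenized_full.get("attention_mask", [1] * len(input_ids))

        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }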

# ========== 3. 创建模拟数据用于演示 ==========
print("\n创建模拟指令数据...")

# 创建示例数据文件（实际使用时替换为真实数据）
sample_data = [
    {
        "instruction": "将以下句子翻译成英文",
        "input": "你好，世界",
        "output": "Hello, world"
    },
    {
        "instruction": "解释什么是深度学习",
        "input": "",
        "output": "深度学习是机器学习的一个子集，使用多层神经网络来学习数据的层次化表示。"
    },
    {
        "instruction": "写一首关于春天的短诗",
        "input": "",
        "output": "春风拂柳绿，\n花开满园香。\n燕子归来早，\n人间好时光。"
    }
]

# 保存为临时文件
import tempfile
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
    json.dump(sample_data, f)
    sample_data_path = f.name

print(f"模拟数据已保存到 {sample_data_path}")

# 示例：创建数据集（需要真实的 tokenizer）
# dataset = InstructionDataset(sample_data_path, tokenizer_llama, format_type="alpaca")
# print(f"数据集大小: {len(dataset)}")

print("\n指令微调数据格式总结:")
print("  1. 指令应清晰明确，避免歧义")
print("  2. 输入可以为空（仅指令）")
print("  3. 输出应为高质量、格式规范的回复")
print("  4. 建议使用标准模板，与模型预训练格式对齐")
print("  5. 数据量：通常 1k-100k 条即可见效")

# 清理临时文件
import os
os.unlink(sample_data_path)
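The label-masking step in `InstructionDataset` can be illustrated without downloading a tokenizer. The following is a minimal sketch, assuming a hypothetical whitespace tokenizer (`toy_tokenize`) that stands in for a real one; only the masking logic is the point, and real subword tokenizers can split text differently at the prompt/answer boundary.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default

def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""
    return [sum(ord(ch) for ch in word) % 1000 for word in text.split()]

def build_example(prompt, answer, max_length=16):
    prompt_ids = toy_tokenize(prompt)
    input_ids = (prompt_ids + toy_tokenize(answer))[:max_length]
    # Clamp so a truncated sequence never gets more labels than tokens.
    prompt_len = min(len(prompt_ids), len(input_ids))
    # Prompt positions are ignored by the loss; answer positions keep their ids.
    labels = [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]
    return {"input_ids": input_ids, "labels": labels}

example = build_example("### Instruction: Translate to English ### Response:", "Hello , world")
print(example["input_ids"])
print(example["labels"])
```

Only the answer tokens contribute to the loss, so the model is trained to produce replies rather than to reproduce prompts.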

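10.4 Full Fine-Tuning vs LoRA/QLoRA

Full fine-tuning and parameter-efficient fine-tuning differ markedly in resource cost and in results. This section compares how the two are implemented in code.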

print("\n" + "=" * 60)
print("10.4 完整微调 vs LoRA/QLoRA")
print("=" * 60)

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# ========== 1. 全量微调配置（需要大量显存） ==========
print("1. 全量微调配置")

full_finetune_config = TrainingArguments(
    output_dir="./full_finetune_checkpoints",
    per_device_train_batch_size=1,           # 批次极小
    gradient_accumulation_steps=8,           # 梯度累积模拟大 batch
    num_train_epochs=3,
    learning_rate=2e-5,
    fp16=True,                               # 使用 FP16 混合精度
    logging_steps=10,
    save_strategy="epoch",
    optim="adamw_torch",
)

print("Full fine-tuning: batch_size=1, gradient accumulation=8, effective batch=8")
print("Estimated memory: roughly 66 GB or more for a 7B model, not feasible on consumer GPUs")

# ========== 2. LoRA 微调配置 ==========
print("\n2. LoRA 微调配置")

lora_config_llm = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],      # LLaMA 风格的模块名
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

print(f"LoRA config: r={lora_config_llm.r}, alpha={lora_config_llm.lora_alpha}")
print("Estimated memory: roughly 20-25 GB for a 7B model (dominated by FP16 weights and activations)")

# ========== 3. QLoRA 配置（4-bit + LoRA） ==========
print("\n3. QLoRA 配置")

# 4-bit 量化配置
bnb_config_qlora = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# 加载模型时使用量化
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=bnb_config_qlora,
#     device_map="auto",
# )

# 准备模型用于 k-bit 训练（需要梯度检查点）
# model = prepare_model_for_kbit_training(model)

# 应用 LoRA
# model = get_peft_model(model, lora_config_llm)

print("QLoRA: 4-bit quantized base model + LoRA adapters")
print("Estimated memory: roughly 10-16 GB for a 7B model (runs on consumer GPUs)")

# ========== 4. 完整微调 vs LoRA 代码对比 ==========
print("\n" + "=" * 60)
print("4. 代码实现对比")
print("=" * 60)

# 全量微调代码模板
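full_ft_template = """
# Full fine-tuning
# Load in BF16 and train with bf16=True. Loading the weights in FP16 and
# training with fp16=True fails, because the AMP gradient scaler cannot
# unscale FP16 gradients.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
# All parameters have requires_grad = True (the default)
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()
"""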

# LoRA 微调代码模板
lora_template = """
# LoRA 微调
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
# 只训练 LoRA 参数
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()
"""

# QLoRA 代码模板
qlora_template = """
# QLoRA fine-tuning
import torch
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()
"""

print("全量微调模板:")
print(full_ft_template)
print("\nLoRA 模板:")
print(lora_template)
print("\nQLoRA 模板:")
print(qlora_template)

# ========== 5. 资源消耗对比表 ==========
print("\n" + "=" * 60)
print("5. 资源消耗对比（7B 模型）")
print("=" * 60)

# Rough estimates only. The optimizer row assumes 16-bit Adam moments; FP32
# moments plus FP32 master weights push full fine-tuning well past 100 GB.
# Trainable parameters assume r=8 on q_proj and v_proj in 32 layers of width 4096.
comparison = {
    "Method": ["Full fine-tuning (16-bit)", "LoRA (FP16)", "QLoRA (4-bit)"],
    "Weights": ["14 GB", "14 GB", "~4 GB"],
    "Gradients": ["14 GB", "<0.1 GB", "<0.1 GB"],
    "Optimizer state": ["28 GB", "<0.1 GB", "<0.1 GB"],
    "Activations": ["~10 GB", "~10 GB", "~6 GB"],
    "Total": ["~66 GB", "~24 GB", "~10 GB"],
    "Trainable params": ["7B", "~4.2M (0.06%)", "~4.2M (0.06%)"],
}

columns = ["Method", "Weights", "Gradients", "Optimizer state", "Activations", "Total", "Trainable params"]
widths = [28, 10, 11, 17, 13, 9, 18]
print(" ".join(f"{name:<{w}}" for name, w in zip(columns, widths)))
print("-" * 112)
for i in range(len(comparison["Method"])):
    print(" ".join(f"{comparison[name][i]:<{w}}" for name, w in zip(columns, widths)))

print("\n结论:")
print("  - 全量微调: 需要 A100 80GB 或 2×A100")
print("  - LoRA: 可在 24GB 消费级 GPU 上运行（如 RTX 3090/4090）")
print("  - QLoRA: 可在 16GB 消费级 GPU 上运行（如 RTX 4060 Ti 16GB）")
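The trainable-parameter count for LoRA follows directly from the adapter shapes: each adapted weight of shape d_out x d_in gains two small matrices, A (r x d_in) and B (d_out x r). The sketch below works through the arithmetic; the helper `lora_param_count` is illustrative, and the shapes (hidden size 4096, 32 layers, roughly 6.74B base parameters) are assumptions for a LLaMA-7B-style model rather than values read from a checkpoint.

```python
def lora_param_count(r, d_in, d_out, modules_per_layer, num_layers):
    """Each adapted weight W (d_out x d_in) gains A (r x d_in) and B (d_out x r)."""
    per_module = r * (d_in + d_out)
    return per_module * modules_per_layer * num_layers

# Assumed shapes: hidden size 4096, 32 layers, q_proj and v_proj adapted.
trainable = lora_param_count(r=8, d_in=4096, d_out=4096, modules_per_layer=2, num_layers=32)
base_params = 6.74e9  # approximate size of a 7B LLaMA-style model (assumption)

share = 100 * trainable / base_params
print(f"{trainable:,} trainable parameters ({share:.2f}% of the base model)")
```

Doubling `r` or adding `k_proj` and `o_proj` to the target modules doubles the count, which is still a tiny fraction of the base model.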

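10.5 QLoRA: 4-bit Quantization + LoRA

QLoRA (Quantized LoRA) is one of the most widely used methods for fine-tuning LLMs. It quantizes the base model to 4 bits, keeps it frozen, and trains LoRA adapters on top of it. This makes it possible to fine-tune 7B-13B models on consumer GPUs, with quality close to full fine-tuning.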

print("\n" + "=" * 60)
print("10.5 QLoRA 实战")
print("=" * 60)

# ========== 1. QLoRA 完整训练示例 ==========
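def qlora_training_example(model_id, dataset, tokenizer, output_dir="./qlora_output"):
    """
    End-to-end QLoRA training setup (function skeleton).

    Args:
        model_id: model ID
        dataset: instruction dataset yielding input_ids, attention_mask, labels tensors
        tokenizer: tokenizer matching model_id (supplies the padding token id)
        output_dir: output directory
    """
    # 1. Configure 4-bit quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",          # NormalFloat 4-bit
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )

    # 2. Load the model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )

    # 3. Prepare the model for k-bit training (enables gradient checkpointing, etc.)
    model = prepare_model_for_kbit_training(model)

    # 4. Configure LoRA
    lora_config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # adjust to the model architecture
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # 5. Apply LoRA
    model = get_peft_model(model, lora_config)

    # 6. Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        save_strategy="epoch",
        optim="paged_adamw_8bit",            # paged optimizer, commonly used with QLoRA
    )

    # 7. Collate function: pad every field to the longest sample in the batch.
    #    Labels are padded with -100 so the prompt masking from the dataset is
    #    kept. DataCollatorForLanguageModeling(mlm=False) is not used here
    #    because it rebuilds labels from input_ids and would drop that masking.
    def collate(features):
        pad = torch.nn.utils.rnn.pad_sequence
        return {
            "input_ids": pad([f["input_ids"] for f in features],
                             batch_first=True, padding_value=tokenizer.pad_token_id),
            "attention_mask": pad([f["attention_mask"] for f in features],
                                  batch_first=True, padding_value=0),
            "labels": pad([f["labels"] for f in features],
                          batch_first=True, padding_value=-100),
        }

    # 8. Create the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=collate,
    )

    # 9. Start training
    # trainer.train()

    # 10. Save the adapter
    # model.save_pretrained(output_dir)

    print("QLoRA training pipeline configured")
    return model, trainer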

# ========== 2. QLoRA 关键参数调优 ==========
print("\nQLoRA 超参数建议:")

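qlora_hyperparams = {
    "Learning rate (learning_rate)": "1e-4 to 2e-4 is a common starting range; tune per task and model size",
    "Batch size (batch_size)": "1-4 (depends on GPU memory)",
    "Gradient accumulation (gradient_accumulation)": "2-8 (keep the effective batch around 16-64 across all GPUs)",
    "Optimizer (optim)": "paged_adamw_8bit or paged_adamw_32bit (paged optimizers are optional but commonly paired with QLoRA)",
    "LoRA r": "4-16 (8 is common)",
    "LoRA alpha": "16-64 (32 is common)",
    "Target modules": "q_proj, v_proj (optionally add k_proj, o_proj)",
}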

for param, value in qlora_hyperparams.items():
    print(f"  {param}: {value}")

# ========== 3. 双重量化说明 ==========
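print("\nDouble quantization:")
print("  - On top of 4-bit quantization, QLoRA quantizes the per-block quantization constants to 8 bits")
print("  - Saves about 0.37 bits per parameter on average (roughly 3 GB for a 65B model, per the QLoRA paper)")
print("  - Enable with: bnb_4bit_use_double_quant=True")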

# ========== 4. 分页优化器 ==========
print("\n分页优化器（Paged Optimizer）:")
print("  - 使用 CPU 内存作为优化器状态的缓冲")
print("  - 防止显存不足时 OOM")
print("  - 启用方式: optim='paged_adamw_8bit'")

# ========== 5. QLoRA 效果评估 ==========
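print("\nQLoRA quality (as reported in the QLoRA paper):")
print("  - 4-bit NF4 QLoRA with double quantization matches 16-bit full fine-tuning and 16-bit LoRA on the benchmarks tested")
print("  - Guanaco 65B, trained with QLoRA, reaches 99.3% of ChatGPT's score on the Vicuna benchmark (GPT-4 as judge)")
print("  - Results vary with task and model size, so evaluate on your own data before assuming parity")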
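The saving from double quantization can be reproduced with a few lines of arithmetic. This sketch assumes the block sizes described in the QLoRA paper (one constant per 64 weights, and second-level blocks of 256 constants); the function name `constant_overhead_bits` and its parameters are illustrative and are not part of the bitsandbytes API.

```python
def constant_overhead_bits(block_size=64, constant_bits=32, double_quant=False,
                           second_block_size=256, second_constant_bits=32,
                           first_level_bits=8):
    """Extra bits per parameter spent on storing quantization constants."""
    if not double_quant:
        # One full-precision constant shared by every block of weights.
        return constant_bits / block_size
    # First-level constants are stored in 8 bits; every `second_block_size`
    # of them share one 32-bit second-level constant.
    return (first_level_bits / block_size
            + second_constant_bits / (block_size * second_block_size))

plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)
saved_bits = plain - double
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9  # bits -> bytes -> GB for 65B parameters

print(f"{plain:.3f} -> {double:.3f} bits/param, "
      f"saving {saved_bits:.3f} bits (~{saved_gb_65b:.1f} GB at 65B)")
```

The overhead drops from 0.5 to about 0.127 bits per parameter, which matters mostly for very large models where a few GB decide whether the model fits on one device.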

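10.6 Gradient Checkpointing and Memory Optimization

Gradient checkpointing trades compute time for GPU memory. During the forward pass it keeps only some of the intermediate activations, and during the backward pass it recomputes the ones that were not kept. For large-model fine-tuning it is an essential memory optimization.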
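The idea described above can be shown on a toy chain of scalar layers, without any framework. This is a minimal sketch under simplifying assumptions: each "layer" is `tanh(w * x)`, checkpoints are kept every `segment` layers, and the stored-value counts only illustrate the trade-off. It is not how PyTorch implements checkpointing internally.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_store_all(weights, x0):
    acts = [x0]
    for w in weights:                       # forward: keep every activation
        acts.append(layer(w, acts[-1]))
    grad = 1.0
    for w, x in zip(reversed(weights), reversed(acts[:-1])):
        grad *= layer_grad(w, x)            # backward: chain rule
    return grad, len(acts)

def grad_checkpointed(weights, x0, segment=2):
    checkpoints = {0: x0}                   # forward: keep only segment inputs
    x = x0
    for i, w in enumerate(weights):
        x = layer(w, x)
        if (i + 1) % segment == 0 and i + 1 < len(weights):
            checkpoints[i + 1] = x
    grad, peak = 1.0, 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        seg_acts = [checkpoints[start]]
        for w in weights[start:end - 1]:    # recompute activations in this segment
            seg_acts.append(layer(w, seg_acts[-1]))
        peak = max(peak, len(checkpoints) + len(seg_acts) - 1)
        for w, x in zip(reversed(weights[start:end]), reversed(seg_acts)):
            grad *= layer_grad(w, x)
    return grad, peak

weights = [0.9, 1.1, 0.8, 1.2, 0.7, 1.3, 0.95, 1.05]
g_full, stored_full = grad_store_all(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, segment=2)
print(g_full, g_ckpt, stored_full, stored_ckpt)
```

Both functions return the same gradient, but the checkpointed version holds fewer values at once and pays for it by running parts of the forward pass a second time.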

print("\n" + "=" * 60)
print("10.6 梯度检查点与显存优化")
print("=" * 60)

# ========== 1. 梯度检查点 ==========
print("1. 梯度检查点（Gradient Checkpointing）")

def enable_gradient_checkpointing(model):
    """
    启用梯度检查点
    """
    # 对于 transformers 模型
    if hasattr(model, "config"):
        model.config.use_cache = False  # 关闭 KV cache（训练时不需要）
    
    # 启用梯度检查点
    model.gradient_checkpointing_enable()
    print("梯度检查点已启用")
    
    return model

# 使用示例
# model = AutoModelForCausalLM.from_pretrained(model_id)
# model = enable_gradient_checkpointing(model)

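print("Effect of gradient checkpointing (illustrative; depends on sequence length and batch size):")
print("  - 7B model: memory drops from about 25 GB to about 18 GB (roughly 30% saved)")
print("  - Training speed: about 20-30% slower")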

# ========== 2. 其他显存优化技巧 ==========
print("\n2. 显存优化技巧汇总")

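optimization_tips = {
    "Gradient checkpointing": "store only some activations and recompute the rest during backward",
    "Mixed precision (FP16/BF16)": "mainly shrinks activation memory and speeds up compute; FP32 master weights are usually still kept",
    "4-bit/8-bit quantization": "greatly reduces memory for model weights",
    "Gradient accumulation": "simulates a larger batch without the extra activation memory",
    "Freeing intermediates": "del variables you no longer need",
    "torch.compile": "optimizes the compute graph and may reduce memory fragmentation",
    "CPU offloading": "move optimizer state or activations to CPU (DeepSpeed ZeRO-Offload)",
    "Selective activation recomputation": "a finer-grained variant of gradient checkpointing that recomputes only chosen operations",
}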

for tip, desc in optimization_tips.items():
    print(f"  - {tip}: {desc}")

# ========== 3. 显存分析工具 ==========
print("\n3. 显存分析工具")

def print_memory_usage():
    """打印当前 CUDA 显存使用情况"""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        max_allocated = torch.cuda.max_memory_allocated() / 1e9
        print(f"  已分配显存: {allocated:.2f} GB")
        print(f"  保留显存: {reserved:.2f} GB")
        print(f"  峰值显存: {max_allocated:.2f} GB")

print("使用 torch.cuda.memory_summary() 查看详细分配")
print("使用 nvidia-smi 或 gpustat 实时监控")

# ========== 4. DeepSpeed ZeRO 优化简介 ==========
print("\n4. DeepSpeed ZeRO 优化")

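# Contents of deepspeed_config.json. JSON does not allow comments, so the notes live here:
#   - stage 2 (ZeRO-2) partitions optimizer state and gradients across GPUs
#   - offload_optimizer moves optimizer state to CPU memory
#   - "auto" lets the HuggingFace Trainer fill in values from TrainingArguments,
#     which avoids batch-size mismatches between the two configs
#   - gradient checkpointing is enabled through TrainingArguments, not in this file
deepspeed_config_example = """
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        }
    }
}
"""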

print("ZeRO 阶段:")
print("  - ZeRO-1: 分割优化器状态")
print("  - ZeRO-2: 分割优化器状态 + 梯度")
print("  - ZeRO-3: 分割优化器状态 + 梯度 + 模型参数")
print("\n使用方式:")
print("  training_args = TrainingArguments(..., deepspeed='deepspeed_config.json')")

# ========== 5. 完整显存优化训练示例 ==========
print("\n" + "=" * 60)
print("5. 完整显存优化训练示例（配置汇总）")
print("=" * 60)

def create_memory_efficient_training_args(output_dir="./output"):
    """
    创建显存优化的训练参数
    """
    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=2,          # 小批次
        gradient_accumulation_steps=4,          # 梯度累积
        gradient_checkpointing=True,             # 梯度检查点
        fp16=True,                               # 混合精度
        optim="adamw_torch",                     # 优化器
        logging_steps=10,
        save_strategy="steps",
        save_steps=500,
        remove_unused_columns=False,
        dataloader_num_workers=2,
        group_by_length=True,                    # 按长度分组减少 padding
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
    )
    return training_args

args = create_memory_efficient_training_args()
print("显存优化训练参数:")
for key, value in args.__dict__.items():
    if not key.startswith('_'):
        print(f"  {key}: {value}")

print("\n显存优化最佳实践总结:")
print("  1. 始终使用梯度检查点 + 混合精度")
print("  2. 优先使用 QLoRA（4-bit + LoRA）")
print("  3. 使用梯度累积代替增大 batch_size")
print("  4. 对于超大模型，使用 DeepSpeed ZeRO-3 + CPU Offload")
print("  5. 监控显存使用，及时调整配置")

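11. Fine-Tuning Multimodal Models

Multimodal models can process several modalities at once, such as text, images, and audio. This section focuses on fine-tuning vision-language models (VLMs).

11.1 Model Architectures: CLIP, BLIP, LLaVA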

print("\n" + "=" * 60)
print("11.1 多模态模型结构解析")
print("=" * 60)

# ========== 1. CLIP 模型 ==========
print("1. CLIP (Contrastive Language-Image Pre-training)")

clip_structure = """
CLIP 由两个编码器组成：
┌─────────────────┐     ┌─────────────────┐
│   图像编码器     │     │   文本编码器     │
│  (ViT 或 ResNet) │     │  (Transformer)  │
└────────┬────────┘     └────────┬────────┘
         │                       │
         ▼                       ▼
    图像特征向量              文本特征向量
    (d_model)                 (d_model)
         │                       │
         └───────────┬───────────┘
                     ▼
             对比学习 (Contrastive Loss)
             拉近匹配的图像-文本对
             推远不匹配的对
"""

print(clip_structure)
print("CLIP 核心: 对比学习，无需额外分类头")

# ========== 2. BLIP 模型 ==========
print("\n2. BLIP (Bootstrapping Language-Image Pre-training)")

blip_structure = """
BLIP 包含三个组件：
1. 图像编码器 (ViT) - 提取图像特征
2. 文本编码器 (BERT) - 编码文本
3. 多模态编码器 (Cross-Attention) - 融合图文信息

训练任务：
- ITC (Image-Text Contrastive Loss): 对比学习
- ITM (Image-Text Matching Loss): 图文匹配
- LM (Language Modeling Loss): 语言建模（用于生成）
"""

print(blip_structure)

# ========== 3. LLaVA 模型 ==========
print("\n3. LLaVA (Large Language and Vision Assistant)")

llava_structure = """
LLaVA = 视觉编码器 (CLIP ViT) + 投影层 (MLP) + 大语言模型 (LLaMA/Vicuna)

结构：
图像 ──► ViT ──► 视觉特征 ──► 投影层 ──► 视觉 token ──►
                                                      │
文本 ──► 文本 token ────────────────────────────────► 拼接 ──► LLM ──► 输出

训练阶段：
1. 预训练: 对齐视觉和语言特征（仅训练投影层）
2. 指令微调: 微调投影层 + LLM (或仅投影层)
"""

print(llava_structure)

# ========== 4. 加载多模态模型示例 ==========
print("\n4. 加载多模态模型")

from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

# CLIP 示例
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
print(f"CLIP 模型加载成功，参数量: {sum(p.numel() for p in clip_model.parameters()):,}")

# BLIP 示例（用于图像描述）
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
print(f"BLIP 模型加载成功")

# LLaVA 示例（需要 transformers >= 4.35.0）
# from transformers import LlavaForConditionalGeneration, AutoProcessor
# llava_model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
# print("LLaVA 模型加载成功")
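In the LLaVA layout shown above, the projected visual tokens are concatenated with the text tokens, so every image adds a fixed number of positions to the LLM input. The following sketch estimates that number under stated assumptions: a 336-pixel input with 14-pixel patches (the CLIP ViT-L/14 encoder used by LLaVA-1.5), and the class token dropped before projection. The function names are illustrative.

```python
def num_visual_tokens(image_size=336, patch_size=14):
    """A ViT splits the image into non-overlapping patches; each patch becomes one token."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2

def llm_sequence_length(num_text_tokens, image_size=336, patch_size=14):
    # Projected visual tokens are concatenated with the text tokens
    # before the sequence enters the LLM.
    return num_visual_tokens(image_size, patch_size) + num_text_tokens

print(num_visual_tokens())          # tokens contributed by one image
print(llm_sequence_length(64))      # one image plus a 64-token prompt
```

Because the visual tokens take up a large share of the context window, `max_length` and batch size for VLM fine-tuning usually have to be chosen with them in mind.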

11.2 Fine-Tuning Strategies for Vision-Language Models

Fine-tuning a VLM means handling image and text inputs together. Fine-tuning examples for CLIP and LLaVA follow.

print("\n" + "=" * 60)
print("11.2 视觉-语言模型微调策略")
print("=" * 60)

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import requests
from io import BytesIO

# ========== 1. CLIP 微调（对比学习） ==========
print("1. CLIP 微调示例")

class CLIPFineTuneDataset(Dataset):
    """CLIP 微调数据集，包含图像-文本对"""
    
    def __init__(self, image_paths, texts, processor):
        self.image_paths = image_paths
        self.texts = texts
        self.processor = processor
    
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        # 加载图像（模拟）
        # image = Image.open(self.image_paths[idx])
        image = Image.new('RGB', (224, 224), color='white')  # 模拟
        text = self.texts[idx]
        
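        # Pad to a fixed length so the default collate function can stack
        # samples into a batch. padding=True only pads to the longest item in
        # this call, which for a single sample means no padding at all.
        inputs = self.processor(
            images=image, 
            text=text, 
            return_tensors="pt",
            padding="max_length",
            truncation=True
        )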
        
        # 移除 batch 维度
        return {
            "pixel_values": inputs.pixel_values.squeeze(0),
            "input_ids": inputs.input_ids.squeeze(0),
            "attention_mask": inputs.attention_mask.squeeze(0),
        }

def train_clip(model, dataloader, optimizer, device, num_epochs=3):
    """
    微调 CLIP 模型（对比损失）
    """
    model.train()
    
    for epoch in range(num_epochs):
        total_loss = 0
        for batch in dataloader:
            pixel_values = batch["pixel_values"].to(device)
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            
            # 前向传播
            outputs = model(
                pixel_values=pixel_values,
                input_ids=input_ids,
                attention_mask=attention_mask,
                return_loss=True
            )
            loss = outputs.loss
            
            # 反向传播
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")
    
    return model

print("CLIP 微调函数已定义")

# ========== 2. LLaVA 微调（指令微调） ==========
print("\n2. LLaVA 微调示例")

class LLaVADataset(Dataset):
    """
    LLaVA 指令微调数据集
    每个样本包含: 图像 + 指令 + 回复
    """
    
    def __init__(self, data, processor):
        self.data = data
        self.processor = processor
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        image = item.get("image")  # PIL Image
        instruction = item.get("instruction")
        response = item.get("response")
        
        # 构建对话格式
        conversation = [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response}
        ]
        
        # 使用 processor 处理
        # 注意：LLaVA processor 接受 text 和 images
        # 实际实现需根据具体模型 API 调整
        
        # 模拟返回
        return {
            "pixel_values": torch.randn(3, 336, 336),
            "input_ids": torch.randint(0, 32000, (512,)),
            "labels": torch.randint(0, 32000, (512,)),
        }

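def train_llava(model, dataloader, optimizer, device, use_lora=True):
    """
    Fine-tune LLaVA (with optional LoRA)
    """
    if use_lora:
        from peft import LoraConfig, get_peft_model
        lora_config = LoraConfig(
            r=8,
            lora_alpha=32,
            # These names also exist in the CLIP vision tower's attention
            # layers, so adapters are added there too unless target_modules
            # is narrowed to the language model.
            target_modules=["q_proj", "v_proj"],
            lora_dropout=0.1,
        )
        model = get_peft_model(model, lora_config)
        # An optimizer built before this point would not contain the new
        # LoRA parameters, so it is rebuilt below.
        optimizer = None
        print("LLaVA: LoRA fine-tuning enabled")
    
    model.train()
    if optimizer is None:
        trainable = [p for p in model.parameters() if p.requires_grad]
        optimizer = torch.optim.AdamW(trainable, lr=2e-5)
    
    # Training loop (similar to a text-only LLM)
    # ...
    
    return model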

print("LLaVA 微调函数已定义")

# ========== 3. 微调策略对比 ==========
print("\n3. 微调策略对比")

vlm_strategies = {
    "CLIP": {
        "常见任务": "图文检索、零样本分类",
        "微调策略": "全量微调或仅微调投影层",
        "学习率": "1e-6 到 5e-6",
        "数据需求": "图文对，数千到数万",
    },
    "BLIP": {
        "常见任务": "图像描述、视觉问答",
        "微调策略": "微调解码器部分",
        "学习率": "1e-5 到 2e-5",
        "数据需求": "图文对，带描述",
    },
    "LLaVA": {
        "常见任务": "视觉对话、多模态指令",
        "微调策略": "LoRA/QLoRA 微调 LLM 部分",
        "学习率": "2e-5 (LLM), 1e-4 (投影层)",
        "数据需求": "多模态指令数据，数千到数万",
    },
}

for model_name, info in vlm_strategies.items():
    print(f"\n{model_name}:")
    for k, v in info.items():
        print(f"  {k}: {v}")

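11.3 Handling Contrastive and Matching Losses

The core loss functions of multimodal models include the contrastive loss and the matching loss. Understanding how they are implemented is essential for fine-tuning.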
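As a small numeric illustration of the contrastive loss, the sketch below computes the symmetric image-to-text and text-to-image cross-entropy on hand-written 2x2 similarity matrices using only the standard library. The matrices and helper names are made up for illustration; 0.07 is used as the temperature because it is the value CLIP starts from.

```python
import math

def cross_entropy_row(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

def symmetric_contrastive_loss(sim, temperature=0.07):
    """sim[i][j] is the cosine similarity between image i and text j."""
    n = len(sim)
    logits = [[v / temperature for v in row] for row in sim]
    # Image -> text: each row should pick its own column.
    i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text -> image: each column should pick its own row.
    t2i = sum(cross_entropy_row([logits[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return (i2t + t2i) / 2

aligned = [[0.9, 0.1], [0.2, 0.8]]    # matching pairs score highest
shuffled = [[0.1, 0.9], [0.8, 0.2]]   # matching pairs score lowest
uninformative = [[0.0, 0.0], [0.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(aligned)
loss_shuffled = symmetric_contrastive_loss(shuffled)
loss_uniform = symmetric_contrastive_loss(uninformative)
print(loss_aligned, loss_shuffled, loss_uniform)
```

With no signal at all the loss equals log(batch size), which is a handy reference point when reading training curves: a loss stuck near that value means the model is not separating matched from mismatched pairs.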

print("\n" + "=" * 60)
print("11.3 对比损失与匹配损失")
print("=" * 60)

import torch.nn.functional as F

# ========== 1. 对比损失（InfoNCE） ==========
print("1. 对比损失 (Contrastive Loss / InfoNCE)")

def contrastive_loss(image_features, text_features, temperature=0.07):
    """
    计算 CLIP 风格的对比损失
    
    Args:
        image_features: 图像特征 [batch_size, feature_dim]
        text_features: 文本特征 [batch_size, feature_dim]
        temperature: 温度参数，控制分布的平滑度
    
    Returns:
        loss: 标量损失
    """
    # L2 归一化
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    
    # 计算相似度矩阵
    logits = torch.matmul(image_features, text_features.T) / temperature
    batch_size = logits.shape[0]
    
    # 标签：对角线为正样本
    labels = torch.arange(batch_size, device=logits.device)
    
    # 对称损失：图像->文本 和 文本->图像
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.T, labels)
    
    loss = (loss_i2t + loss_t2i) / 2
    return loss

# 模拟数据
batch_size = 4
feat_dim = 512
image_feats = torch.randn(batch_size, feat_dim)
text_feats = torch.randn(batch_size, feat_dim)

loss_clip = contrastive_loss(image_feats, text_feats, temperature=0.07)
print(f"对比损失示例值: {loss_clip.item():.4f}")

# ========== 2. 匹配损失（ITM Loss） ==========
print("\n2. 匹配损失 (Image-Text Matching Loss)")

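def itm_loss(image_features, text_features, fusion_model, itm_head, labels):
    """
    Image-text matching loss (binary classification)
    
    Args:
        image_features: image features
        text_features: text features
        fusion_model: multimodal fusion module (e.g. cross-attention)
        itm_head: trainable nn.Linear(fused_dim, 2), created once and owned
                  by the model (a layer built inside this function would be
                  re-initialized on every call and never trained)
        labels: LongTensor [batch]; 1 for matched pairs, 0 for mismatched pairs
    
    Returns:
        loss: two-class cross-entropy loss
    """
    # Fuse image and text features
    fused = fusion_model(image_features, text_features)
    
    # Binary classification head
    logits = itm_head(fused)
    
    # Training needs both classes: pair each image with its own text (label 1)
    # and with a mismatched text (label 0), e.g. a random or hard negative.
    loss = F.cross_entropy(logits, labels)
    return loss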

print("ITM 损失函数已定义")

# ========== 3. 语言建模损失（用于生成任务） ==========
print("\n3. 语言建模损失 (Language Modeling Loss)")

def lm_loss(logits, labels, ignore_index=-100):
    """
    自回归语言建模损失
    """
    # 移位：预测下一个 token
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index
    )
    return loss

print("LM 损失函数已定义")

# ========== 4. BLIP multi-task loss combination ==========
print("\n4. BLIP multi-task loss combination")

class BLIPMultiTaskLoss(nn.Module):
    """
    Multi-task loss for BLIP-style models:
    - ITC: contrastive loss
    - ITM: matching loss
    - LM: language modeling loss
    """

    def __init__(self, itc_weight=1.0, itm_weight=1.0, lm_weight=1.0):
        super().__init__()
        self.itc_weight = itc_weight
        self.itm_weight = itm_weight
        self.lm_weight = lm_weight

    def forward(self, itc_loss, itm_loss, lm_loss):
        # Weighted sum of the three already-computed scalar losses
        total = (self.itc_weight * itc_loss +
                 self.itm_weight * itm_loss +
                 self.lm_weight * lm_loss)
        return total

# Example
loss_combiner = BLIPMultiTaskLoss(itc_weight=1.0, itm_weight=0.5, lm_weight=1.0)
print(f"Multi-task loss combiner: ITC weight={loss_combiner.itc_weight}, "
      f"ITM weight={loss_combiner.itm_weight}, LM weight={loss_combiner.lm_weight}")

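# ========== 5. Negative sample construction ==========
print("\n5. Negative sample construction")

def construct_negative_pairs(image_features, text_features, method="random"):
    """
    Construct negative samples for contrastive learning (one negative text per image).

    Args:
        image_features: [batch, dim]
        text_features: [batch, dim]
        method: "random", "hard", "shuffle"
    """
    batch_size = image_features.shape[0]

    if method == "random":
        # Randomly permute the text features to use as negatives.
        # Note: a random permutation can leave some index in place, in which
        # case that row is still the true (positive) pair.
        indices = torch.randperm(batch_size, device=text_features.device)
        negative_text = text_features[indices]

    elif method == "hard":
        # Hard negatives: mismatched pairs with high similarity
        sim_matrix = torch.matmul(F.normalize(image_features, dim=-1),
                                  F.normalize(text_features, dim=-1).T)
        # Exclude the diagonal (positive pairs)
        sim_matrix.fill_diagonal_(-float('inf'))
        # Pick the most similar mismatched text as the hard negative
        hard_indices = sim_matrix.argmax(dim=1)
        negative_text = text_features[hard_indices]

    else:  # shuffle
        # Cyclic shift by one position
        negative_text = torch.cat([text_features[1:], text_features[:1]], dim=0)

    return negative_text

print("Negative construction methods: random, hard, shuffle")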

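# ========== 6. Temperature tuning ==========
print("\n6. Effect of the temperature parameter")

temperature_analysis = {
    "temperature=0.01": "Very low: extremely sharp distribution, the model becomes overconfident and overfits easily",
    "temperature=0.07": "CLIP's initial value (CLIP learns the temperature during training); a good balance",
    "temperature=0.1": "Slightly higher: smoother gradients",
    "temperature=0.5": "Fairly high: suited to small-batch training",
    "temperature=1.0": "Plain softmax with no scaling: often too flat, since cosine similarities lie in [-1, 1]",
}

for temp, effect in temperature_analysis.items():
    print(f"  {temp}: {effect}")

print("\nRecommendation: start from temperature=0.07 and adjust based on training stability")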

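# ========== 7. Complete multimodal fine-tuning loop example ==========
print("\n" + "=" * 60)
print("7. Complete multimodal fine-tuning example (pseudocode)")
print("=" * 60)

multimodal_training_loop = """
# Pseudocode: fine-tuning a multimodal model
model = load_multimodal_model("path/to/model")
processor = load_processor("path/to/processor")

# Optional: apply LoRA to the LLM part
if use_lora:
    model = apply_lora_to_llm(model)

# Only optimize parameters that are still trainable (matters with LoRA or frozen encoders)
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=2e-5)

for epoch in range(num_epochs):
    for batch in dataloader:
        images = batch["images"]  # raw images; the processor converts them to tensors
        texts = batch["texts"]

        # Preprocessing
        inputs = processor(images=images, text=texts, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Forward pass. The model only returns a loss if it is given what it needs,
        # e.g. labels for generation models or return_loss=True for HuggingFace CLIP.
        outputs = model(**inputs)

        # Compute the loss according to the model type
        if model_type == "clip":
            loss = outputs.loss  # built-in contrastive loss
        elif model_type == "blip":
            loss = outputs.loss  # multi-task loss combination
        elif model_type == "llava":
            loss = outputs.loss  # language modeling loss

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f"Loss: {loss.item():.4f}")
"""

print(multimodal_training_loop)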

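# ========== 8. Summary of multimodal fine-tuning best practices ==========
print("\n" + "=" * 60)
print("8. Multimodal fine-tuning best practices")
print("=" * 60)

best_practices = """
1. Data preprocessing:
   - Use a uniform image size (e.g. 224x224 or 336x336)
   - Use the same normalization as in pretraining
   - Process text with the correct tokenizer and prompt template

2. Memory optimization:
   - Use gradient checkpointing
   - Consider freezing the vision encoder and fine-tuning only the projection layer and the LLM
   - Fine-tune the LLM part with LoRA/QLoRA

3. Training tips:
   - Use a small learning rate (1e-6 to 5e-5)
   - Use warmup and cosine annealing
   - Monitor image-text alignment quality

4. Evaluation:
   - Image-text retrieval: Recall@K
   - Image captioning: BLEU-4, CIDEr, SPICE
   - Visual question answering: accuracy

5. Common problems:
   - Modality mismatch: make sure image and text features are aligned
   - Overfitting: use data augmentation, dropout, LoRA
   - Unstable training: lower the learning rate, increase the warmup steps
"""

print(best_practices)

print("\n" + "=" * 60)
print("Part 4 complete")
print("=" * 60)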
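
The training loop in step 7 is pseudocode, and the best-practice list recommends warmup with cosine annealing without showing it. The following is a minimal runnable sketch of both ideas on a toy problem. It makes several assumptions that are not from the article: the frozen encoders are replaced by fixed random feature tensors, only two small projection layers are trained, and the helper names `warmup_cosine_lambda` and `clip_style_loss`, the step counts, and the learning rate are illustrative choices.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def warmup_cosine_lambda(step, warmup_steps, total_steps):
    """LR multiplier: linear warmup up to 1.0, then cosine decay towards 0.0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

def clip_style_loss(img, txt, temperature=0.07):
    """Symmetric contrastive loss; row i of img is the positive for row i of txt."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.T / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

torch.manual_seed(0)

# Assumption: fixed random tensors stand in for the outputs of frozen encoders.
image_feats = torch.randn(32, 64)
text_feats = torch.randn(32, 64)

# Only these two projection layers are trainable.
image_proj = nn.Linear(64, 32)
text_proj = nn.Linear(64, 32)
params = list(image_proj.parameters()) + list(text_proj.parameters())

optimizer = torch.optim.AdamW(params, lr=1e-3)
total_steps, warmup_steps = 100, 10
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda s: warmup_cosine_lambda(s, warmup_steps, total_steps)
)

losses, lrs = [], []
for step in range(total_steps):
    loss = clip_style_loss(image_proj(image_feats), text_proj(text_feats))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used for this step
    scheduler.step()                        # advance the schedule after the optimizer
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"first lr: {lrs[0]:.2e}, peak lr: {max(lrs):.2e}, final lr: {lrs[-1]:.2e}")
```

In a real fine-tuning run the random tensors would be replaced by model outputs and the schedule length by the actual number of optimizer steps; the ordering of `optimizer.step()` before `scheduler.step()` stays the same.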

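Summary

Part 4 of this article covered fine-tuning techniques for large language models and multimodal models:

LLM fine-tuning: analyzed the characteristics and challenges of LLM fine-tuning, showed how to load mainstream models such as LLaMA, Qwen, and ChatGLM with HuggingFace, introduced the data formats used for instruction tuning, compared the resource cost of full fine-tuning with LoRA/QLoRA, walked through the implementation details of QLoRA, and covered memory optimizations such as gradient checkpointing.
Multimodal fine-tuning: described the architecture of representative models such as CLIP, BLIP, and LLaVA, gave fine-tuning strategies for each, and explained how the core loss functions (contrastive loss, matching loss, and language modeling loss) are implemented.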
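
The best-practice list in the code above names Recall@K as the metric for image-text retrieval but does not implement it. The following is a minimal sketch of image-to-text Recall@K, assuming row i of the image features is the correct match for row i of the text features. The function name `recall_at_k` and the synthetic data are illustrative and not part of the original article.

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_features, text_features, ks=(1, 5, 10)):
    """
    Image->text Recall@K.
    Assumes row i of image_features matches row i of text_features.
    Returns a dict {k: fraction of images whose match is in the top k}.
    """
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    sim = img @ txt.T                    # [N, N] cosine similarities
    correct = sim.diag().unsqueeze(1)    # [N, 1] similarity of the true pair
    # Rank of the true text = number of texts scored strictly higher (0 = first)
    ranks = (sim > correct).sum(dim=1)
    return {k: (ranks < k).float().mean().item() for k in ks}

torch.manual_seed(0)
text = torch.randn(100, 64)
aligned_images = text + 0.1 * torch.randn(100, 64)  # nearly aligned pairs
random_images = torch.randn(100, 64)                # unrelated pairs

print("aligned:", recall_at_k(aligned_images, text))
print("random: ", recall_at_k(random_images, text))
```

Text-to-image recall follows by swapping the two arguments. On a real validation set the features would come from the fine-tuned encoders rather than from random tensors.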

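Fine-tuning is the bridge between pretrained large models and concrete application scenarios, and mastering these techniques lets you get the most out of large models in real projects. As models keep evolving, fine-tuning methods keep developing as well. Readers are encouraged to follow recent research (such as DoRA and MoRA) and to keep building experience through practice.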
