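Apart from API access to the leading vendors' models, most civilian LLM research and applications at the 500B-parameter scale and below (setting aside TAALAS's technology) have now split into clearly distinct tiers.

1. Hundred-billion-parameter models (100B+ to 500B), represented by OpenAI GPT-OSS-120B. This tier is currently defined by low bit widths, such as 4 bits (natively trained precision), and by strong information encoding, with the baseline being that efficiency improves without a loss in quality. GPT-OSS-120B, for example, serves as the benchmark and is positioned directly against GPT-4-class models. Without strong model optimization or theoretical support, domestic (Chinese) developers may not attempt trillion-parameter models, and even if they do, such models would struggle to compete with the 120-billion-parameter GPT-OSS-120B. This is a stage-specific limitation that stems from constraints in the mathematical and theoretical foundations of domestic model development, not a rejection of models with ten trillion or a hundred trillion parameters. Models at the hundred-billion scale usually need multiple accelerator cards, however, which creates a certain technical barrier.

2. Ten-billion-parameter models, represented by OpenAI GPT-OSS-20B. Classic examples at the 14B and 7B sizes include the Microsoft Phi-4 series and the distilled models of the Qwen and DeepSeek series, which are the mainstream offerings of first- and second-tier vendors. With industrial 4-bit quantization, these models run directly on consumer computers (30-50 GB), and their token generation rate is comparable to the speed at which an individual user can absorb information. Fine-tuning them needs relatively little compute, so small and medium-sized enterprises can do secondary development directly at the model level, in some cases for as little as a few thousand yuan. With further efficiency gains or sub-4-bit optimization, they could become the mainstream for edge computing.

3. Billion-parameter models, represented by DeepSeek, with the DeepSeek 1.5B distilled model as the classic example; memory use is typically 3 GB or less. With further optimization, these models can be placed directly on mobile phones or small computers. Their technical setup is comparable to that of the ten-billion tier, so they suit teaching, and training costs are very low, usually within a few tens of yuan. Techniques learned at this scale can usually be transferred quickly to ten-billion-parameter models, which makes this tier a good way to experiment by trial and error.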
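
The memory figures quoted for these tiers follow from a simple back-of-the-envelope relation: weight memory is roughly the parameter count times the bits per weight, divided by eight. The sketch below is only that arithmetic. The helper name `estimate_weight_memory_gb` is hypothetical, and the estimate ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Illustrative sizes taken from the three tiers discussed above.
    for name, params_b, bits in [
        ("1.5B at 16-bit", 1.5, 16),
        ("20B at 4-bit", 20, 4),
        ("120B at 4-bit", 120, 4),
    ]:
        print(f"{name}: ~{estimate_weight_memory_gb(params_b, bits):.1f} GB")
```

Under these assumptions a 1.5B model stored at 16 bits comes to about 3 GB of weights, which is consistent with the figure quoted for the billion-parameter tier.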

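Next, this article shows how to fine-tune the model identified as "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" on an SCNet node. The aim is to introduce SCNet, deployment in a cloud environment, and the kind of rapid-development code that small and medium-sized enterprises need. The code is provided as a public-interest resource; it was generated with the assistance of DeepSeek and has been tested on a working example.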

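1. Register at SCNet (超算互联): https://www.scnet.cn

2. Click the red "Console" (控制台) button in the upper right corner.

3. Click "Service Navigation" (服务导航) -> "Artificial Intelligence" (人工智能, the blue button).

This opens the AI Notebook page: https://www.scnet.cn/ui/console/index.html#/notebook

4. Click "Billing" (费用) -> "Overview" (总览) in the upper right corner.

5. Click "Top up" (充值) -> "Alipay" (支付宝).

Choose the top-up amount according to the services you need; the example that follows costs less than 10 yuan (about 2 yuan in the actual test).

6. After topping up, return to the AI Notebook page (https://www.scnet.cn/ui/console/index.html#/notebook).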

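7. Click "Notebook" -> "Create Notebook" (创建Notebook), select "Group 013" (013组; East China Zone 1, Kunshan), and set "Number of accelerator cards" (加速卡数量) to 1, using a 4090 accelerator card with 24 GB of video memory (enough headroom that beginners should not need to debug memory problems).

8. Click "Development Image" (开发镜像) -> "Base Image" (基础镜像) -> framework name "PyTorch" -> framework version "2.6.0" -> Python version "py3.12-ubuntu22.04" -> CUDA/DTK version "cuda12.4", then click the red "Create" (创建) button in the lower right corner.

The instance is created automatically and the page switches back to the Notebook view, where you can work in Jupyter Notebook directly or log in from VS Code via Remote SSH.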

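9. Click "Quick Tools" (快捷工具) -> "JupyterLab".

The interface then loads. The system normally allows network access automatically; if it does not, contact customer support.

10. Click "root" -> "Notebook" (笔记本) -> "Python3".

It is also a good idea to click the "+" at the top middle of the page to add a tab, then open "Other" (其他) -> "Terminal" (终端).

Note that the default account inside the container is root.

11. In the terminal, enter the following (the code can be copied and pasted directly):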

pip install --upgrade pip

If that fails, switch to the Aliyun mirror:

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install --upgrade pip

12. Install the required libraries:

pip install transformers accelerate peft bitsandbytes datasets trl scikit-learn pandas
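
Before moving on, it can help to confirm that the installation actually succeeded. The following is a minimal sketch that uses only the standard library; the helper name `missing_modules` is hypothetical, and the module list is an assumption derived from the pip command above (note that scikit-learn is imported as `sklearn`, and `torch` is expected to come from the base image).

```python
import importlib.util

# Import names, which differ from pip names for some packages
# (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = [
    "torch", "transformers", "accelerate", "peft",
    "bitsandbytes", "datasets", "trl", "sklearn", "pandas",
]


def missing_modules(module_names):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("All packages found." if not missing else f"Missing: {missing}")
```

Run it in a notebook cell or with `python` in the terminal; any name it reports can be installed again with pip.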

13. Download the crowdflower/twitter-airline-sentiment dataset as the example (a free account is required).

https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

Click "Download" -> "Download dataset as zip". The result is a .csv file of about 3 MB whose contents look roughly like this:

tweet_id	sentiment	author	content
1956967341	empty	xoshayzers	@tiffanylue i know  i was listenin to bad habit earlier and i started freakin at his part =[
1956967666	sadness	wannamama	Layin n bed with a headache  ughhhh...waitin on your call...

This is a very well-known data file; you can also substitute any dataset of your choice.
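
Whatever dataset is used, it is worth looking at how the labels are distributed before fine-tuning, since a heavily skewed label column will shape what the model learns. The sketch below counts labels using only the standard library. The helper name `label_counts` is hypothetical, the two rows are shortened versions of the sample shown above, and the tab delimiter is an assumption made for readability; for the real file, pass the delimiter it actually uses.

```python
import csv
import io
from collections import Counter

# Two rows in the column layout shown above (tweet text shortened).
SAMPLE = (
    "tweet_id\tsentiment\tauthor\tcontent\n"
    "1956967341\tempty\txoshayzers\t@tiffanylue i know i was listenin to bad habit earlier\n"
    "1956967666\tsadness\twannamama\tLayin n bed with a headache ughhhh...waitin on your call...\n"
)


def label_counts(text, delimiter="\t", label_column="sentiment"):
    """Count how often each label appears in delimited text with a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return Counter(row[label_column] for row in reader)


if __name__ == "__main__":
    for label, count in label_counts(SAMPLE).most_common():
        print(f"{label}: {count}")
```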

14. Unzip the .zip file and drag "twitter-airline-sentimentSentiment_Analysis.csv" into the folder in the left pane of Jupyter Notebook that holds the .ipynb file (/root/).

15. Download the LLM "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" to the local directory "/root/private_data/DeepSeek1.5B".

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Define local save directory for the 1.5B model
local_model_dir = "/root/private_data/DeepSeek1.5B"

# Hugging Face model ID (about 1.5 billion parameters)
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

print(f"Loading model: {model_name} (Official size: 1.54B parameters)")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,   # bfloat16 is safe and memory efficient
    device_map="auto"             # automatic device placement
)

# Save locally
tokenizer.save_pretrained(local_model_dir)
model.save_pretrained(local_model_dir)
print(f"Model saved to {local_model_dir}")

This takes about 10 minutes.

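16. A note from the official SCNet documentation:

# https://www.scnet.cn/help/docs/mainsite/ai/notebook/function-introduction/

# I. Saving the environment at shutdown / saving an image

# You can save the development environment at shutdown, or use the "Save image" function to back it up. This keeps the machine's environment and configuration consistent for restarting the environment, setting up a team development environment, or reproducing the environment on another platform. An image can be saved whether the container instance is running or shut down.

# Note: to ensure the image runs correctly, a single image layer must not exceed 15 GiB when the environment image is saved. The system checks the image size; if it exceeds the limit, you need to move files from the container environment into file storage manually.

# You can use the following code to quickly locate large files (including folders) in the current environment:

# cd /
# # Excluded, in order: /proc, personal files, platform-shared files,
# # team-shared files, and the two shared-storage mounts.
# find . -path "./proc" -prune -o \
#        -path "./root/private_data" -prune -o \
#        -path "./root/public_data" -prune -o \
#        -path "./root/group_data" -prune -o \
#        -path "./public" -prune -o \
#        -path "./work" -prune -o \
#        -type f -exec du -h {} + | sort -hr | head -n 20   # show the 20 largest files

# Once large files are identified, use the following to move them into file storage for permanent retention:

# mv /root/model_file /root/private_data/model_file

# After the move, the files may be unusable in file storage (they are owned by root); run the following in the current environment to change ownership:

# # Replace user_name with your compute user name
# chown user_name:user_name /root/private_data/model_file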
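
The same search can be done from a notebook cell. The following is a rough Python sketch of the idea, not SCNet's own tooling: the function name `largest_files` is hypothetical, and the list of excluded directories simply mirrors the paths named in the note above.

```python
import os


def largest_files(root, exclude_dirs=(), top_n=20):
    """Return up to top_n (size_in_bytes, path) pairs under root, largest first."""
    excluded = {os.path.abspath(d) for d in exclude_dirs}
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [
            d for d in dirnames
            if os.path.abspath(os.path.join(dirpath, d)) not in excluded
        ]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                if not os.path.islink(path):
                    sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # the file vanished or cannot be read
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    skip = ["/proc", "/root/private_data", "/root/public_data",
            "/root/group_data", "/public", "/work"]
    for size, path in largest_files("/", exclude_dirs=skip):
        print(f"{size / 2**20:10.1f} MiB  {path}")
```

Anything large that shows up outside the excluded directories is a candidate for moving into /root/private_data before the image is saved.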

17. Test the downloaded model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

local_model_dir = "/root/private_data/DeepSeek1.5B"

tokenizer = AutoTokenizer.from_pretrained(local_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    local_model_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example prompt
input_text = "Explain quantum computing in simple terms"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

18. Load the .csv data

import pandas as pd

df = pd.read_csv('twitter-airline-sentimentSentiment_Analysis.csv')
train_df = df.head(5000)  # the first 5,000 rows are used for fine-tuning
print(f"Loaded {len(train_df)} rows")
print(train_df[['tweet_id', 'sentiment', 'author', 'content']].head())

Note that the file must sit in the same folder as the Jupyter notebook.

19. Prepare the fine-tuning data (the first 5,000 rows)

# Define instruction
instruction = "Analyze the sentiment of the following tweet:"

# Create a list of formatted texts
formatted_texts = []
for idx, row in train_df.iterrows():
    text = f"Instruction: {instruction}\nInput: {row['content']}\nOutput: {row['sentiment']}"
    formatted_texts.append(text)

# Convert to a Hugging Face Dataset
from datasets import Dataset
dataset = Dataset.from_dict({"text": formatted_texts})

print(dataset[0]['text'])
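
The prompt used at inference time has to match this training format exactly, apart from the missing label. One way to keep the two in sync is to build both from a single helper. This is a small sketch under that assumption; the function name `build_prompt` is hypothetical and does not appear in the code above.

```python
def build_prompt(instruction, tweet, label=None):
    """Format one example; omit the label to get an inference prompt."""
    prompt = f"Instruction: {instruction}\nInput: {tweet}\nOutput:"
    if label is not None:
        prompt += f" {label}"
    return prompt


if __name__ == "__main__":
    instruction = "Analyze the sentiment of the following tweet:"
    tweet = "Layin n bed with a headache"
    print(build_prompt(instruction, tweet, "sadness"))  # training example
    print("---")
    print(build_prompt(instruction, tweet))             # inference prompt
```

With the label supplied, the result has the same layout as the training strings built in the loop above; without it, the string ends at "Output:" so the model completes the label.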

20. Configure 4-bit quantization and prepare the model for training

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import Dataset

# 4‑bit quantization config (saves memory)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load model with quantization
model_name = "/root/private_data/DeepSeek1.5B"  # or the HF name
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# LoRA configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention query/value projections (the model uses a Qwen2-style architecture)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # should show ~0.1% trainable
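
The small trainable fraction can be checked by hand. For each wrapped linear layer, LoRA adds two matrices, A (rank x in_features) and B (out_features x rank), so the extra parameter count is rank x (in_features + out_features). The sketch below applies that formula; the function names are hypothetical, and the layer dimensions (hidden size 1536, key/value projection width 256, 28 layers) are assumptions about this model that should be confirmed against `model.config`.

```python
def lora_param_count(rank, in_features, out_features):
    """Parameters added by one LoRA pair: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


def total_lora_params(rank, hidden_size, kv_dim, num_layers):
    """Total LoRA parameters when q_proj and v_proj are wrapped in every layer."""
    q_proj = lora_param_count(rank, hidden_size, hidden_size)
    v_proj = lora_param_count(rank, hidden_size, kv_dim)
    return (q_proj + v_proj) * num_layers


if __name__ == "__main__":
    # Assumed dimensions; verify with model.config before relying on them.
    total = total_lora_params(rank=8, hidden_size=1536, kv_dim=256, num_layers=28)
    print(f"Trainable LoRA parameters: {total:,}")
```

If the assumed dimensions hold, the result should agree with the trainable count that `print_trainable_parameters()` reports.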

21. Prepare the training data

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors=None  # we'll handle with data collator
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask"])

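def set_labels(example):
    # For causal LM training the labels start as a copy of input_ids.
    # DataCollatorForLanguageModeling(mlm=False), used in the next step, rebuilds
    # the labels from input_ids and sets padding positions to -100, so this
    # column mainly documents intent.
    example["labels"] = example["input_ids"].clone()
    return example

tokenized_dataset = tokenized_dataset.map(set_labels)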
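
Because every example is padded to 512 tokens, most positions in a short tweet are padding, and those positions should not contribute to the loss. One common convention is to set their labels to -100, the default ignore index of PyTorch's cross-entropy loss. The sketch below shows the idea in plain Python with made-up token IDs; the helper name `mask_padding_labels` is hypothetical, and masking by attention mask is an illustrative choice rather than what the code above does.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    if len(input_ids) != len(attention_mask):
        raise ValueError("input_ids and attention_mask must have the same length")
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


if __name__ == "__main__":
    # Made-up token IDs: three real tokens followed by two padding tokens.
    ids = [101, 2054, 2003, 0, 0]
    mask = [1, 1, 1, 0, 0]
    print(mask_padding_labels(ids, mask))
```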

22. Several training methods are possible from here; for simplicity, this example uses the standard Hugging Face Trainer workflow


from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # causal LM
)

23. Set the training parameters

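training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size: 4 x 4 = 16
    learning_rate=2e-4,
    fp16=True,                       # mixed-precision training
    logging_steps=3,                 # log the loss every 3 optimizer steps
    save_strategy="epoch",           # write a checkpoint at the end of each epoch
    num_train_epochs=3,
    optim="paged_adamw_8bit",        # 8-bit paged AdamW from bitsandbytes
    report_to="none"                 # disable external experiment trackers
)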
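
These settings determine how many optimizer updates the run performs, which is useful for judging how often the loss will be logged and roughly how long training will take. The sketch below is plain arithmetic with a hypothetical helper name; the exact step count the Trainer reports may differ slightly depending on how the installed version rounds the last partial batch.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_epochs, num_devices=1):
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


if __name__ == "__main__":
    # Values from this tutorial: 5,000 examples, batch 4, accumulation 4, 3 epochs.
    batch, steps = optimizer_steps(5000, 4, 4, 3)
    print(f"Effective batch size: {batch}, approximate optimizer steps: {steps}")
```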

24. Start training (takes about 16 minutes)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

trainer.train()

25. Save the fine-tuned model

output_dir = "/root/private_data/DeepSeek1.5B_finetuned"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"LoRA adapter saved to {output_dir}")

26. Demonstrate the fine-tuned model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Paths
# base_model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"   # Hugging Face ID, if the model was not saved locally
base_model_name = "/root/private_data/DeepSeek1.5B"
adapter_path = "/root/private_data/DeepSeek1.5B_finetuned"

# Optional: 4‑bit quantization (same as during training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token   # important for generation

# Load base model (with or without quantization)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,    # remove if you didn't use quantization
    device_map="auto",
    trust_remote_code=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_path)

# Switch to evaluation mode
model.eval()

# --- Test the model ---
# Example tweet input
tweet = "I love this new phone! It's amazing 😍"
instruction = "Analyze the sentiment of the following tweet:"

# Format the prompt exactly as during training
prompt = f"Instruction: {instruction}\nInput: {tweet}\nOutput:"

# Tokenize and move to same device as model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate (limit new tokens to a short answer, e.g., sentiment label)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,            # sentiment label is short
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated part (skip the prompt)
generated_ids = outputs[0][inputs.input_ids.shape[1]:]   # take only new tokens
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(f"Tweet: {tweet}")
print(f"Predicted sentiment: {generated_text}")
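
Generated text is not guaranteed to be a clean label: the model may add extra words or continue onto further lines. A small post-processing step can map the raw output onto a known label set. The sketch below is one way to do that; the helper name `extract_label` is hypothetical, and the label list in the example contains only the two labels visible in the sample rows earlier, so it should be replaced with the full set from the dataset's sentiment column.

```python
def extract_label(generated_text, known_labels):
    """Return the known label that appears earliest in the first non-empty line, or None."""
    lines = [line.strip() for line in generated_text.splitlines() if line.strip()]
    if not lines:
        return None
    first_line = lines[0].lower()
    best = None  # (position, label) of the earliest match so far
    for label in known_labels:
        position = first_line.find(label.lower())
        if position != -1 and (best is None or position < best[0]):
            best = (position, label)
    return best[1] if best else None


if __name__ == "__main__":
    labels = ["empty", "sadness"]  # replace with the dataset's full label set
    print(extract_label(" sadness\nInstruction: ...", labels))
    print(extract_label("something unexpected", labels))
```

Returning None for unrecognized output makes failures visible instead of silently counting them as a label.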

27. Compare the fine-tuned model's predictions against examples from the dataset used for fine-tuning

import torch
import pandas as pd
import random
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# --------------------------
# 1. Paths and model loading
# --------------------------
# base_model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"   # Hugging Face ID, if the model was not saved locally
base_model_name = "/root/private_data/DeepSeek1.5B"
adapter_path = "/root/private_data/DeepSeek1.5B_finetuned"

# If you used 4‑bit quantization during training, load with same config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token   # important for generation

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,    # remove if you didn't use quantization
    device_map="auto",
    trust_remote_code=True
)

model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()   # inference mode

# --------------------------
# 2. Load the original CSV
# --------------------------
csv_path = "twitter-airline-sentimentSentiment_Analysis.csv"   # adjust if needed
df = pd.read_csv(csv_path)

# Ensure we have the required columns
print(f"CSV loaded with {len(df)} rows. Columns: {df.columns.tolist()}")

# --------------------------
# 3. Randomly pick 5 tweets
# --------------------------
sample_rows = df.sample(n=5, random_state=42)   # fixed seed for reproducibility

instruction = "Analyze the sentiment of the following tweet:"

# --------------------------
# 4. Test each sample
# --------------------------
for idx, row in sample_rows.iterrows():
    tweet = row['content']
    actual_sentiment = row['sentiment']
    
    # Build prompt exactly as during training
    prompt = f"Instruction: {instruction}\nInput: {tweet}\nOutput:"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=20,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens (skip the prompt)
    generated_ids = outputs[0][inputs.input_ids.shape[1]:]
    predicted_text = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
    
    # Optional: clean up predicted text (remove trailing newline, etc.)
    predicted_text = predicted_text.split('\n')[0]   # take first line
    
    print("\n" + "="*60)
    print(f"Tweet: {tweet[:100]}...")
    print(f"Actual sentiment: {actual_sentiment}")
    print(f"Predicted sentiment: {predicted_text}")
    print("="*60)
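
Five printed examples give a feel for the output but not a number. To summarize a larger sample, the actual and predicted labels can be collected as pairs and scored by exact match. This is a minimal sketch with a hypothetical helper name. Because the rows here are sampled from the whole CSV, some may also have been in the 5,000 training rows, so the score is not a held-out estimate.

```python
def exact_match_accuracy(pairs):
    """Fraction of (actual, predicted) pairs that match, ignoring case and outer whitespace."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(
        1 for actual, predicted in pairs
        if actual.strip().lower() == predicted.strip().lower()
    )
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical results, for illustration only.
    results = [("sadness", "Sadness "), ("empty", "worry")]
    print(f"Exact-match accuracy: {exact_match_accuracy(results):.2f}")
```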

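28. Return to "Console" -> "Notebook" -> "Actions" (操作) and shut down the container to avoid further charges.

29. Under "Artificial Intelligence" -> "File Management" (文件管理) on the left, you can download the fine-tuned "DeepSeek1.5B_finetuned" model (the saved LoRA adapter).

That completes an example of fine-tuning a billion-parameter model on a single card.

Models fine-tuned this way can be deployed efficiently on local hardware, greatly reducing the time cost (latency) and fees of API calls. The same method carries over directly to ten-billion-tier models that fit within 80 GB, such as 7B LLMs.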

I am looking for work. For HR or project collaboration, please contact: yucongcai_business@outlook.com
For research-related matters, please contact: yucongcai_research@outlook.com
