Training a GPT-2-style LLM (10-50M parameters) from scratch on SCNet: a worked example

YucongCai · Published 2026-04-01 20:16:47

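This post extends "SCNet 超算互联网 LLM Fine-Tuning LoRA 实例" (SCNet Supercomputing Internet: an LLM fine-tuning example with LoRA) and focuses on a teaching example of training a GPT-style model from scratch.

https://blog.csdn.net/YucongCai/article/details/159696147?spm=1001.2014.3001.5501

Readers who have already read that post can go straight to its step 12 to quickly deploy the development code.

After installing the libraries and downloading the free crowdflower/twitter-airline-sentiment dataset from Kaggle as the example data, this post walks through the training workflow for a GPT-2-style LLM.

https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

This post does not cover architecture development or optimization; it is only a hands-on technical example. The code is provided for the public good, was generated with the help of DeepSeek, and has been tested on a working example.

The post has two parts. Part 1 trains with GPT2LMHeadModel directly; Part 2 re-implements its architecture in PyTorch. The post only mimics the GPT-2 training workflow and architecture; all training data is sentiment-analysis data, not general-purpose LLM data.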

15. Load and preprocess the dataset

import pandas as pd
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import Dataset

# Load tokenizer early to use for truncation
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# -------------------------------
# 1. Load and prepare the dataset
# -------------------------------
df = pd.read_csv('twitter-airline-sentimentSentiment_Analysis.csv')
data = df.head(50000)
print(f"Loaded {len(data)} rows")

# Define instruction and compute fixed token count
instruction = "Analyze the sentiment of the following tweet. Output exactly one word, with no punctuation or extra text:"
# The fixed text without tweet and label
fixed_template = f"Instruction: {instruction}\nInput:\nOutput:"
fixed_token_count = len(tokenizer.encode(fixed_template))
print(f"Fixed tokens (without tweet and label): {fixed_token_count}")

# Target total tokens per example (111 here); the model's n_positions must use the same value
max_total_tokens = 31 + 80
# Leave 1 token for the sentiment label
max_tweet_tokens = max_total_tokens - fixed_token_count - 1
print(f"Max tokens allowed for tweet content: {max_tweet_tokens}")

def truncate_tweet(tweet, max_tokens):
    tokens = tokenizer.encode(tweet, truncation=True, max_length=max_tokens)
    return tokenizer.decode(tokens, skip_special_tokens=True)

# Format each row with truncated tweet
formatted_texts = []
for _, row in data.iterrows():
    short_tweet = truncate_tweet(row['content'], max_tweet_tokens)
    text = f"Instruction: {instruction}\nInput: {short_tweet}\nOutput: {row['sentiment']}"
    formatted_texts.append(text)

# Create dataset
dataset = Dataset.from_dict({"text": formatted_texts})

Note that max_total_tokens must match n_positions in the GPT-2 config.
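As a quick check of the token arithmetic in step 15, here is a minimal sketch of the budget calculation in plain Python. The helper name `tweet_token_budget` and the template length of 31 tokens are assumptions for illustration only; the real count is whatever `len(tokenizer.encode(fixed_template))` prints.

```python
def tweet_token_budget(max_total_tokens: int, fixed_token_count: int, label_tokens: int = 1) -> int:
    """Tokens left for the tweet after reserving the template and the label."""
    budget = max_total_tokens - fixed_token_count - label_tokens
    if budget <= 0:
        raise ValueError("max_total_tokens is too small for the template and the label")
    return budget

# Assumed numbers: a 31-token template and the 31 + 80 = 111 total used in step 15.
print(tweet_token_budget(31 + 80, 31))
```

If the budget comes out non-positive, the template alone already fills the context window and max_total_tokens (and therefore n_positions) has to grow.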

16. Prepare the training data

# -------------------------------
# 2. Tokenize the dataset
# -------------------------------
def tokenize_function(examples):
    # Pad or truncate every example to the total token limit, which equals n_positions
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=max_total_tokens,   # 31 + 80 = 111 in step 15
        # Do NOT set return_tensors here – let collator handle it
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

17. Inspect the input data

# -------------------------------
# 3. Inspect the tokenized dataset
# -------------------------------
print("\n--- Inspection of tokenized dataset ---")
for i in range(3):   # first 3 examples
    sample = tokenized_dataset[i]
    input_ids = sample["input_ids"]
    print(f"\nSample {i}:")
    print("  Input IDs length:", len(input_ids))
    actual_len = sum(1 for token in input_ids if token != tokenizer.pad_token_id)
    print("  Actual tokens (excluding padding):", actual_len)
    print("  Decoded text:")
    print(tokenizer.decode(input_ids, skip_special_tokens=True))
    print("-" * 50)

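# Overall statistics on real tokens only. Every row is padded to max_total_tokens,
# so len(input_ids) would be the same for all rows; the attention mask counts real tokens.
token_counts = [sum(sample["attention_mask"]) for sample in tokenized_dataset]
print(f"\nMax token length: {max(token_counts)}")
print(f"Min token length: {min(token_counts)}")
print(f"Average token length: {sum(token_counts)/len(token_counts):.1f}")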

Part 1

18.a Set the GPT-2 model size

# -------------------------------
# 3. Configure a small GPT-2 model (~45M parameters with the settings below)
# -------------------------------
config = GPT2Config(
    vocab_size=50257,               # same as GPT-2/GPT-3
    n_positions=max_total_tokens,   # context length; sets the rows of the position-embedding
                                    # matrix (n_positions, n_embd) added to the token embeddings
    n_embd=512,                     # embedding dimension (768 with n_layer=12 gives GPT-2 small);
                                    # must be divisible by n_head
    n_layer=6,                      # number of transformer blocks
    n_head=8,                       # number of attention heads
    resid_pdrop=0.1,                # dropout for residuals
    embd_pdrop=0.1,
    attn_pdrop=0.1,
)

model = GPT2LMHeadModel(config)
print(f"Model has {model.num_parameters():,} parameters")
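To see where the parameters come from, the following sketch counts them by hand from the four size settings. It assumes the standard GPT-2 layout (fused Q/K/V projection, 4x MLP expansion, biases on every linear layer, lm_head tied to the token embedding); the function name `gpt2_param_count` is made up for this illustration.

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Count GPT-2 parameters, with lm_head tied to the token embedding."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d     # wte + wpe
    attention = (d * 3 * d + 3 * d) + (d * d + d)      # c_attn + c_proj (weights and biases)
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)        # c_fc + c_proj (weights and biases)
    layer_norms = 2 * (2 * d)                          # ln_1 + ln_2, each with weight and bias
    final_norm = 2 * d                                 # ln_f
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm

# Settings from step 18.a with n_positions = 31 + 80 = 111.
print(f"{gpt2_param_count(50257, 111, 512, 6):,}")
```

With a vocabulary of 50257 tokens, the token embedding alone accounts for more than half of the total at this size, so changing n_layer has less effect on the count than changing n_embd.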

19.a Align the embedding table with the tokenizer vocabulary

model.resize_token_embeddings(len(tokenizer))

Check the hardware

import torch
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
print("Device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)   # move the existing model; re-creating it would discard the resize above
print("Model device:", next(model.parameters()).device)

20.a Set the training arguments


# -------------------------------
# 4. Prepare data collator and training arguments
# -------------------------------
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,                 # we are training a causal LM, not masked LM
)

# Training arguments (adjust according to your hardware)
training_args = TrainingArguments(
    output_dir="/root/private_data/gpt2-small-twitter-checkpoint",
    overwrite_output_dir=True,
    num_train_epochs=3,               # small dataset, few epochs
    per_device_train_batch_size=8,    # depends on your GPU memory (4090 has 24GB)
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=50,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
    fp16=torch.cuda.is_available(),   # mixed precision for speed; only valid on a GPU
    dataloader_num_workers=4,
    report_to="none",                 # disable wandb/tensorboard if not needed
)
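The warmup_steps=100 setting interacts with the Trainer's default linear schedule: the learning rate climbs from zero to its peak over the first 100 optimizer steps, then decays linearly to zero. The sketch below reproduces that shape in plain Python. The peak of 5e-5 is the TrainingArguments default, since the arguments above do not set learning_rate; the function name `linear_lr` and the total of 1000 steps are assumptions for illustration.

```python
def linear_lr(step: int, total_steps: int, warmup_steps: int = 100, base_lr: float = 5e-5) -> float:
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = 1000  # hypothetical total number of optimizer steps
for s in (0, 50, 100, 550, 1000):
    print(s, linear_lr(s, total))
```

On a short run, 100 warmup steps can be a noticeable share of training, so it is worth comparing warmup_steps with the total step count that the Trainer logs.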

21.a Start training (takes several minutes)

# -------------------------------
# 5. Create Trainer and start training
# -------------------------------
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,   # newer transformers releases name this argument processing_class
)

# Train the model
trainer.train()

22.a Save the trained model

# -------------------------------
# 6. Save the trained model and tokenizer
# -------------------------------
model.save_pretrained("/root/private_data/gpt2-small-twitter")
tokenizer.save_pretrained("/root/private_data/gpt2-small-twitter")
print("Model and tokenizer saved to /root/private_data/gpt2-small-twitter")

23.a Load the model and test the training result

import pandas as pd
import torch
import random
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# -------------------------------
# 1. Load the saved model and tokenizer
# -------------------------------
model_path = "/root/private_data/gpt2-small-twitter"
model = GPT2LMHeadModel.from_pretrained(model_path)
tokenizer = GPT2Tokenizer.from_pretrained(model_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
print(f"Model loaded on {device}")

# -------------------------------
# 2. Preprocessing parameters (must match training)
# -------------------------------
df = pd.read_csv('twitter-airline-sentimentSentiment_Analysis.csv')
data = df.head(500)

instruction = "Analyze the sentiment of the following tweet. Output exactly one word, with no punctuation or extra text:"
fixed_template = f"Instruction: {instruction}\nInput:\nOutput:"
fixed_token_count = len(tokenizer.encode(fixed_template))
print(f"Fixed tokens (without tweet and label): {fixed_token_count}")

# This must match the max_total_tokens used during training (31 + 80 = 111 in step 15).
# The saved config stores it as n_positions, so read it from there.
max_total_tokens = model.config.n_positions
max_tweet_tokens = max_total_tokens - fixed_token_count - 1   # leave 1 for the label
print(f"Max tokens allowed for tweet content: {max_tweet_tokens}")

def truncate_tweet(tweet, max_tokens):
    tokens = tokenizer.encode(tweet, truncation=True, max_length=max_tokens)
    return tokenizer.decode(tokens, skip_special_tokens=True)

# Build formatted texts (for loss calculation)
formatted_texts = []
for _, row in data.iterrows():
    short_tweet = truncate_tweet(row['content'], max_tweet_tokens)
    text = f"Instruction: {instruction}\nInput: {short_tweet}\nOutput: {row['sentiment']}"
    formatted_texts.append(text)

print(f"Loaded {len(formatted_texts)} formatted examples")

# -------------------------------
# 3. Choose 50 random examples
# -------------------------------
random.seed(42)
indices = random.sample(range(len(formatted_texts)), 50)

# -------------------------------
# 4. Test each example
# -------------------------------
for idx in indices:
    full_text = formatted_texts[idx]
    tweet = data.iloc[idx]['content']
    expected_sentiment = data.iloc[idx]['sentiment']
    short_tweet = truncate_tweet(tweet, max_tweet_tokens)
    
    print(f"\n--- Example {idx} ---")
    print(f"Original tweet: {tweet[:100]}...")
    print(f"Truncated tweet: {short_tweet[:100]}...")
    print(f"Expected sentiment: {expected_sentiment}")
    
    # ---- Compute loss on the full formatted text ----
    inputs = tokenizer(full_text, return_tensors="pt", truncation=True, max_length=max_total_tokens)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
        perplexity = torch.exp(loss)
    print(f"Loss: {loss.item():.4f}, Perplexity: {perplexity.item():.2f}")
    
    # ---- Generate the sentiment ----
    # Build prompt: only up to "Output:"
    prompt = f"Instruction: {instruction}\nInput: {short_tweet}\nOutput:"
    # Tokenize WITHOUT padding to get actual length
    prompt_ids = tokenizer(prompt, return_tensors="pt", truncation=True)
    prompt_ids = {k: v.to(device) for k, v in prompt_ids.items()}
    
    # Determine how many new tokens we can generate without exceeding n_positions
    available_positions = model.config.n_positions - prompt_ids["input_ids"].shape[1]
    max_new_tokens = min(5, available_positions)  # we only need one word, but ensure it fits
    print(f"Prompt length: {prompt_ids['input_ids'].shape[1]}, Available positions: {available_positions}, Generating up to {max_new_tokens} tokens.")
    
    if max_new_tokens <= 0:
        print("WARNING: No room for generation; skipping.")
        continue
    
    with torch.no_grad():
        generated_ids = model.generate(
            **prompt_ids,                          # input_ids plus attention_mask
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

        
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"Full generated text: {generated_text}")   # for debugging
    
    # Extract the generated output after the last "Output:"
    if "Output:" in generated_text:
        after_output = generated_text.split("Output:")[-1].strip()
        if after_output:
            predicted = after_output.split()[0]   # first word
        else:
            predicted = "<no output>"
    else:
        # If "Output:" not found, take the whole generated text (fallback)
        predicted = generated_text.strip().split()[0] if generated_text.strip() else "<no output>"
    
    print(f"Predicted sentiment: {predicted}")
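The loop above prints one prediction per example. To turn those into a single number, a small helper can reuse the same "first word after Output:" rule and report accuracy. This is a sketch only: the function names and the sample strings are made up, and exact case-insensitive matching is an assumption about how predictions should be scored.

```python
def first_word_after_output(generated_text: str) -> str:
    """Return the first word after the last 'Output:' marker, or '<no output>'."""
    tail = generated_text.split("Output:")[-1].strip()
    return tail.split()[0] if tail else "<no output>"

def accuracy(predictions, expected) -> float:
    """Fraction of case-insensitive exact matches."""
    pairs = list(zip(predictions, expected))
    if not pairs:
        return 0.0
    return sum(p.lower() == e.lower() for p, e in pairs) / len(pairs)

# Made-up generations, only to show the calls.
texts = [
    "Input: late again\nOutput: negative",
    "Input: thanks!\nOutput: positive ok",
    "Input: hm\nOutput:",
]
preds = [first_word_after_output(t) for t in texts]
print(preds, accuracy(preds, ["negative", "neutral", "neutral"]))
```

Collecting `predicted` and `expected_sentiment` into two lists inside the test loop and passing them to `accuracy` would give one score for the sampled examples.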

At this point, a GPT-2-style LLM has been trained. Note that the training data is used only for sentiment analysis; the model is not a general-purpose few-shot solver.

Part 2

This part does not follow step 18.a. Instead, it defines the GPT-2 architecture directly in PyTorch and then trains the model manually.

18.b First, delete GPT2LMHeadModel and re-import the required objects

GPT2LMHeadModel
# Notebook output: transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel

del GPT2LMHeadModel

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple

19.b Define the multi-head attention layer

class GPT2Attention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = self.n_embd // self.n_head
        assert self.head_dim * self.n_head == self.n_embd, "n_embd must be divisible by n_head"

        # Projections
        self.c_attn = nn.Linear(self.n_embd, 3 * self.n_embd)   # Q, K, V
        self.c_proj = nn.Linear(self.n_embd, self.n_embd)
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)

        # Causal mask (lower triangular: position i may attend to positions 0..i)
        self.register_buffer("bias", torch.tril(torch.ones(config.n_positions, config.n_positions))
                                     .view(1, 1, config.n_positions, config.n_positions))

    def forward(self, x, attention_mask=None):
        B, T, C = x.size()   # batch, sequence length, embedding dim
        # Compute Q, K, V
        qkv = self.c_attn(x)   # (B, T, 3*C)
        q, k, v = qkv.split(self.n_embd, dim=2)

        # Reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)

        # Scaled dot‑product attention
        att = (q @ k.transpose(-2, -1)) * (1.0 / (self.head_dim ** 0.5))   # (B, n_head, T, T)
        # Apply causal mask
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        # Apply optional attention mask (from padding)
        if attention_mask is not None:
            # attention_mask: (B, T) with 1 for real tokens, 0 for padding
            att_mask = attention_mask[:, None, None, :]   # (B, 1, 1, T)
            att = att.masked_fill(att_mask == 0, float('-inf'))

        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v   # (B, n_head, T, head_dim)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.resid_dropout(self.c_proj(y))
        return y
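The effect of the causal mask is easiest to see on a tiny example. The sketch below applies the same idea in plain Python, without PyTorch: scores for future positions are set to -inf before the softmax, so they receive zero weight. The function name `causal_softmax` and the all-zero score matrix are illustrative choices, not part of the model code.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with future positions masked."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Query i may only attend to keys 0..i; later keys get -inf.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]   # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

for row in causal_softmax([[0.0, 0.0, 0.0]] * 3):
    print([round(w, 3) for w in row])
```

With equal scores, each row spreads its weight uniformly over the positions it is allowed to see: the first token attends only to itself, the second splits evenly over two positions, and so on.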

20.b Define the MLP layer

class GPT2MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.dropout = nn.Dropout(config.resid_pdrop)
        self.act = nn.GELU()

    def forward(self, x):
        x = self.act(self.c_fc(x))
        x = self.dropout(self.c_proj(x))
        return x

21.b Combine the multi-head attention layer and the MLP layer into one GPT-2 Transformer block (decoder block)

class GPT2Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = GPT2Attention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = GPT2MLP(config)

    def forward(self, x, attention_mask=None):
        # Pre‑norm residual connection
        x = x + self.attn(self.ln_1(x), attention_mask)
        x = x + self.mlp(self.ln_2(x))
        return x

22.b Stack GPT2Block modules into the GPT-2 model (embeddings + Transformer blocks)

class GPT2Model(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        self.wpe = nn.Embedding(config.n_positions, config.n_embd)
        self.blocks = nn.ModuleList([GPT2Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)

    def forward(self, input_ids, attention_mask=None):
        B, T = input_ids.size()
        assert T <= self.config.n_positions, f"Sequence length {T} exceeds n_positions {self.config.n_positions}"

        # Token and position embeddings
        token_embeds = self.wte(input_ids)                     # (B, T, n_embd)
        position_ids = torch.arange(T, device=input_ids.device).unsqueeze(0)   # (1, T)
        pos_embeds = self.wpe(position_ids)                    # (1, T, n_embd)
        x = token_embeds + pos_embeds

        # Apply transformer blocks
        for block in self.blocks:
            x = block(x, attention_mask)

        x = self.ln_f(x)
        return x

23.b Add the language-model wrapper, in particular the linear head (lm_head), to form a trainable GPT2LMHeadModel

class GPT2LMHeadModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = GPT2Model(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # Tie weights (optional but common)
        self.lm_head.weight = self.transformer.wte.weight

    def forward(self, input_ids, attention_mask=None, labels=None):
        hidden_states = self.transformer(input_ids, attention_mask)
        logits = self.lm_head(hidden_states)   # (B, T, vocab_size)

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict token n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

        # Return a dictionary (compatible with Hugging Face Trainer)
        return {"loss": loss, "logits": logits} if loss is not None else {"logits": logits}
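The shift in the loss computation pairs the output at position t with the token at position t + 1. The plain-Python sketch below lists which pairs actually contribute to the loss when padded positions carry the label -100, the value that F.cross_entropy skips by default. The function name and the toy token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by F.cross_entropy by default

def next_token_pairs(input_ids, labels):
    """List (position, input token, target token) triples that contribute to the loss."""
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]          # position t predicts the token at t + 1
        if target != IGNORE_INDEX:
            pairs.append((t, input_ids[t], target))
    return pairs

# Toy ids: three real tokens followed by two padding tokens (id 0 here).
ids = [11, 12, 13, 0, 0]
labels = [11, 12, 13, IGNORE_INDEX, IGNORE_INDEX]
print(next_token_pairs(ids, labels))
```

A sequence of T tokens therefore yields at most T - 1 training targets, and fewer when padding is masked out of the labels.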

24.b Define the training setup

# Define the config (adjust as needed)
class Config:
    vocab_size = 50257
    n_positions = max_total_tokens             # must match max_total_tokens
    n_embd = 128
    n_layer = 4
    n_head = 4                  # 128/4 = 32 head dimension
    attn_pdrop = 0.1
    resid_pdrop = 0.1
    embd_pdrop = 0.1            # not used in our implementation, but you can add dropout on embeddings if desired

config = Config()
model = GPT2LMHeadModel(config)

# Move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Count parameters
print(f"Model has {sum(p.numel() for p in model.parameters()):,} parameters")


# -------------------------------
# 4. Prepare data collator and training arguments
# -------------------------------
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,                 # we are training a causal LM, not masked LM
)

25.b Train GPT2LMHeadModel

from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

# Prepare dataset (already tokenized)
train_dataloader = DataLoader(tokenized_dataset, batch_size=32, shuffle=True, collate_fn=data_collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):
    total_loss = 0
    for batch in train_dataloader:
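        # batch contains input_ids, attention_mask, and labels (padding positions set to -100)
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)   # F.cross_entropy ignores -100, so padding adds no loss
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)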
        loss = outputs["loss"]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, loss: {total_loss/len(train_dataloader):.4f}")

26.b Test the trained model

# -------------------------------
# 2. Helper: Greedy generation for custom model
# -------------------------------
def generate(model, input_ids, max_new_tokens, eos_token_id, pad_token_id=None):
    """Greedy generation for causal LM."""
    generated = input_ids.clone()
    for _ in range(max_new_tokens):
        with torch.no_grad():
            # Get logits for the last position
            outputs = model(generated)
            logits = outputs["logits"]           # (B, T, vocab_size)
            next_token_logits = logits[:, -1, :]  # (B, vocab_size)
            next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)  # (B, 1)
        generated = torch.cat([generated, next_token], dim=-1)
        # Stop if EOS token generated
        if (next_token == eos_token_id).any():
            break
    return generated

# -------------------------------
# 3. Preprocessing parameters (must match training)
# -------------------------------
df = pd.read_csv('twitter-airline-sentimentSentiment_Analysis.csv')
data = df.head(500)

instruction = "Analyze the sentiment of the following tweet. Output exactly one word, with no punctuation or extra text:"
fixed_template = f"Instruction: {instruction}\nInput:\nOutput:"
fixed_token_count = len(tokenizer.encode(fixed_template))
print(f"Fixed tokens (without tweet and label): {fixed_token_count}")

# This must match the max_total_tokens used during training; the Config class stores it as n_positions.
max_total_tokens = model.config.n_positions
max_tweet_tokens = max_total_tokens - fixed_token_count - 1   # leave 1 for the label
print(f"Max tokens allowed for tweet content: {max_tweet_tokens}")

def truncate_tweet(tweet, max_tokens):
    tokens = tokenizer.encode(tweet, truncation=True, max_length=max_tokens)
    return tokenizer.decode(tokens, skip_special_tokens=True)

# Build formatted texts (for loss calculation)
formatted_texts = []
for _, row in data.iterrows():
    short_tweet = truncate_tweet(row['content'], max_tweet_tokens)
    text = f"Instruction: {instruction}\nInput: {short_tweet}\nOutput: {row['sentiment']}"
    formatted_texts.append(text)

print(f"Loaded {len(formatted_texts)} formatted examples")

# -------------------------------
# 4. Choose 5 random examples
# -------------------------------
random.seed(42)
indices = random.sample(range(len(formatted_texts)), 5)

# -------------------------------
# 5. Test each example
# -------------------------------
model.eval()   # switch off dropout; the training loop left the model in train mode
for idx in indices:
    full_text = formatted_texts[idx]
    tweet = data.iloc[idx]['content']
    expected_sentiment = data.iloc[idx]['sentiment']
    short_tweet = truncate_tweet(tweet, max_tweet_tokens)
    
    print(f"\n--- Example {idx} ---")
    print(f"Original tweet: {tweet[:100]}...")
    print(f"Truncated tweet: {short_tweet[:100]}...")
    print(f"Expected sentiment: {expected_sentiment}")
    
    # ---- Compute loss on the full formatted text ----
    inputs = tokenizer(full_text, return_tensors="pt", truncation=True, max_length=max_total_tokens)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs["loss"]
        perplexity = torch.exp(loss)
    print(f"Loss: {loss.item():.4f}, Perplexity: {perplexity.item():.2f}")
    
    # ---- Generate the sentiment ----
    # Build prompt: only up to "Output:"
    prompt = f"Instruction: {instruction}\nInput: {short_tweet}\nOutput:"
    prompt_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_total_tokens - 1)  # leave room for generation
    prompt_ids = prompt_ids["input_ids"].to(device)
    
    # Determine how many new tokens we can generate
    available_positions = model.config.n_positions - prompt_ids.shape[1]
    max_new_tokens = min(5, available_positions)  # we only need one word, but ensure it fits
    print(f"Prompt length: {prompt_ids.shape[1]}, Available positions: {available_positions}, Generating up to {max_new_tokens} tokens.")
    
    if max_new_tokens <= 0:
        print("WARNING: No room for generation; skipping.")
        continue
    
    # Generate using our custom function
    with torch.no_grad():
        generated_ids = generate(
            model,
            prompt_ids,
            max_new_tokens,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,  # pad with EOS for generation
        )
    
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"Full generated text: {generated_text}")
    
    # Extract the generated output after the last "Output:"
    if "Output:" in generated_text:
        after_output = generated_text.split("Output:")[-1].strip()
        if after_output:
            predicted = after_output.split()[0]   # first word
        else:
            predicted = "<no output>"
    else:
        predicted = generated_text.strip().split()[0] if generated_text.strip() else "<no output>"
    
    print(f"Predicted sentiment: {predicted}")
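The perplexity printed in the test loop is the exponential of the mean cross-entropy loss, which is the same as the exponential of the average negative log-probability that the model gives to each correct next token. A minimal sketch in plain Python, with made-up probabilities and a hypothetical function name:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up probabilities the model assigned to each correct next token.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.9, 0.5, 0.1]))
```

A model that always spreads its probability evenly over k choices has perplexity k, which gives a rough way to read the numbers: lower means the model is less surprised by the text.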

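This model can be saved in several ways, for example by saving a PyTorch file directly or by converting it to a Hugging Face model first; this post does not cover them.

At this point, a small GPT-2-style LLM has been trained. With the Part 2 settings above (n_embd = 128, n_layer = 4) it has about 7M parameters; larger n_embd and n_layer values bring it into the 10M-50M range. As before, the training data targets sentiment analysis, so this is not a real general-purpose LLM.

I am looking for a job. For HR or project collaboration, contact: yucongcai_business@outlook.com
For research-related matters, contact: yucongcai_research@outlook.com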
