This article documents the full process of deploying two models, Qwen3-4B and Qwen3.5-4B, with vLLM and transformers on an AWS SageMaker Notebook instance (ml.g4dn.xlarge, Tesla T4 16GB), and comparing them with a custom benchmark.

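Qwen3-4B vs Qwen3.5-4B: architectural differences

Feature            Qwen3-4B                                Qwen3.5-4B
Release            Q2 2025                                 February 2026
Architecture       Standard Transformer (attention only)   Hybrid (Gated DeltaNet + attention)
Parameters         ~4B                                     ~4B
Model size         7.6 GB                                  8.8 GB
FlashAttention2    Requires compute capability ≥ 8.0       Requires compute capability ≥ 8.0
Requirements       transformers ≥ 4.49                     transformers ≥ 4.57 (new architecture qwen3_5)

Qwen3.5-4B introduces a hybrid architecture that alternates Gated DeltaNet layers with standard attention layers, an approach similar in spirit to state space models such as Mamba. This improves long-context handling, but it also increases inference complexity and GPU memory usage.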

AgentScope is Alibaba's open-source multi-agent framework. Its built-in evaluation module provides systematic evaluation of models and agents.

Core architecture

Evaluator
  │
  ├── Benchmark
  │     ├── Task 1 { input, ground_truth, metrics[] }
  │     ├── Task 2 { input, ground_truth, metrics[] }
  │     └── Task N { input, ground_truth, metrics[] }
  │
  ├── Solution Function (the agent's solving logic)
  │     └── Task in → agent inference → SolutionOutput out
  │
  └── Metric Evaluation (scoring)
        └── Compare SolutionOutput vs ground_truth → MetricResult (0~1)

AgentScope code example

The following is the standard way to write an evaluation with the official AgentScope API:

import asyncio
from agentscope.evaluate import (
    Task, BenchmarkBase, MetricBase, MetricResult,
    MetricType, SolutionOutput, GeneralEvaluator, FileEvaluatorStorage,
)
from agentscope.agent import ReActAgent
from agentscope.model import OpenAIChatModel
from agentscope.message import Msg
from agentscope.formatter import OpenAIChatFormatter
from typing import Generator, Callable


# ===== Step 1: Define the scoring metric =====
class ContainsAnswer(MetricBase):
    """Rule-based metric: checks whether the model output contains the expected answer string."""
    def __init__(self, ground_truth: str):
        super().__init__(
            name="contains_answer",
            metric_type=MetricType.NUMERICAL,
            description="Check if response contains the expected answer",
        )
        self.ground_truth = ground_truth

    async def __call__(self, solution: SolutionOutput) -> MetricResult:
        output = str(solution.output).lower()
        expected = self.ground_truth.lower()
        if expected in output:
            return MetricResult(name=self.name, result=1.0, message="Correct")
        return MetricResult(name=self.name, result=0.0, message="Incorrect")


# ===== Step 2: Define the benchmark =====
TASKS = [
    {"id": "math_01", "question": "Calculate: 15 + 27 = ?", "answer": "42"},
    {"id": "math_02", "question": "Calculate: 100 - 37 = ?", "answer": "63"},
    {"id": "reason_01", "question": "What comes next: 2, 4, 8, 16, ?", "answer": "32"},
]

class MyBenchmark(BenchmarkBase):
    def __init__(self):
        super().__init__(name="Qwen Eval", description="Math & Reasoning benchmark")
        self.dataset = [
            Task(
                id=t["id"], input=t["question"], ground_truth=t["answer"],
                metrics=[ContainsAnswer(t["answer"])],
            )
            for t in TASKS
        ]

    def __iter__(self) -> Generator[Task, None, None]:
        yield from self.dataset

    def __len__(self) -> int:
        return len(self.dataset)


# ===== Step 3: Define the agent's solving logic =====
async def solution_func(task: Task, pre_hook: Callable) -> SolutionOutput:
    """Solution function matching the AgentScope WorkflowType signature."""
    model = OpenAIChatModel(
        model_name="Qwen3-4B",
        api_key="not-needed",
        # Local vLLM service; base_url is passed through client_kwargs
        client_kwargs={"base_url": "http://localhost:8000/v1"},
    )
    agent = ReActAgent(
        name="solver",
        sys_prompt="Answer directly and concisely.",
        model=model,
        formatter=OpenAIChatFormatter(),
    )
    response = await agent(Msg("user", task.input, role="user"))
    return SolutionOutput(success=True, output=response.get_text_content())


# ===== Step 4: Run the evaluation =====
async def main():
    evaluator = GeneralEvaluator(
        name="Qwen3-4B Evaluation",
        benchmark=MyBenchmark(),
        n_repeat=1,
        n_workers=1,
        storage=FileEvaluatorStorage(save_dir="./eval_results"),
    )
    await evaluator.run(solution_func)

asyncio.run(main())

Sample output:

Metric: contains_answer
    Type: MetricType.NUMERICAL
    Involved tasks: 20
    Completed tasks: 20
    Aggregation: { "mean": 0.95, "max": 1.0, "min": 0.0 }

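This evaluation was run directly with AgentScope's GeneralEvaluator.

  1. Qwen3-4B: called through OpenAIChatModel against vLLM's OpenAI-compatible API
  2. Qwen3.5-4B: since vLLM cannot deploy this model on a T4, the solution function runs inference directly with transformers

Key parameters for connecting AgentScope's OpenAIChatModel to a local vLLM server: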

model = OpenAIChatModel(
    model_name="Qwen3-4B",
    api_key="not-needed",
    stream=False,
    client_kwargs={"base_url": "http://localhost:8000/v1"},
    generate_kwargs={"temperature": 0.1, "max_tokens": 512},
)
# Note: model() is an async call and returns a ChatResponse object
response = await model(messages=[{"role": "user", "content": prompt}])

AgentScope evaluation output format (evaluation_result.json):

{
    "total_tasks": 20,
    "repeats": {
        "0": {
            "metrics": {
                "contains_answer": {
                    "aggregation": {"mean": 0.95, "max": 1.0, "min": 0.0},
                    "distribution": {"math_01": 1.0, "know_02": 0.0, ...}
                }
            }
        }
    }
}
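
To pull the headline numbers out of such a file, a minimal sketch follows. It assumes the layout shown above (repeats → repeat id → metrics → metric name → aggregation / distribution). The helper name `summarize` and the inline sample are illustrative and not part of AgentScope.

```python
import json

def summarize(result: dict, metric: str = "contains_answer") -> dict:
    """Return, per repeat, the mean score and the ids of tasks scoring below 1.0."""
    summary = {}
    for repeat_id, repeat in result["repeats"].items():
        data = repeat["metrics"][metric]
        distribution = data.get("distribution", {})
        failed = sorted(tid for tid, score in distribution.items() if score < 1.0)
        summary[repeat_id] = {"mean": data["aggregation"]["mean"], "failed": failed}
    return summary

# Inline sample in the same shape as the file above (values are illustrative).
SAMPLE = json.loads("""
{
  "total_tasks": 2,
  "repeats": {"0": {"metrics": {"contains_answer": {
      "aggregation": {"mean": 0.5, "max": 1.0, "min": 0.0},
      "distribution": {"math_01": 1.0, "know_02": 0.0}}}}}
}
""")

print(summarize(SAMPLE))
```

For a real run, replace SAMPLE with the parsed contents of the evaluation_result.json written under the storage directory.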

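This evaluation uses a self-built benchmark of 20 questions across 6 categories:

Category                                  Questions  Difficulty                  Skills tested
Math (English math)                       6          Easy×3 + Medium×3           Arithmetic, percentages
Reasoning                                 5          Easy×2 + Medium×1 + Hard×2  Logical reasoning, cognitive biases
Knowledge (English general knowledge)     3          Easy×3                      Basic facts
Knowledge_zh (Chinese general knowledge)  2          Easy×2                      Chinese comprehension
Math_zh (Chinese math)                    1          Medium×1                    Math posed in Chinese
Code (programming)                        3          Easy×3                      Python basics

Full test dataset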

BENCHMARK_DATA = [
    # Math (English)
    {"id": "math_01", "question": "Calculate: 15 + 27 = ?", "answer": "42", "category": "math", "difficulty": "easy"},
    {"id": "math_02", "question": "Calculate: 100 - 37 = ?", "answer": "63", "category": "math", "difficulty": "easy"},
    {"id": "math_03", "question": "Calculate: 8 * 7 = ?", "answer": "56", "category": "math", "difficulty": "easy"},
    {"id": "math_04", "question": "Calculate: 123 + 456 + 789 = ?", "answer": "1368", "category": "math", "difficulty": "medium"},
    {"id": "math_05", "question": "Calculate: 25 * 16 = ?", "answer": "400", "category": "math", "difficulty": "medium"},
    {"id": "math_06", "question": "What is 15% of 200?", "answer": "30", "category": "math", "difficulty": "medium"},
    # Reasoning
    {"id": "reason_01", "question": "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly? Answer yes or no only.", "answer": "no", "category": "reasoning", "difficulty": "medium"},
    {"id": "reason_02", "question": "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost in cents?", "answer": "5", "category": "reasoning", "difficulty": "hard"},
    {"id": "reason_03", "question": "If it takes 5 machines 5 minutes to make 5 widgets, how many minutes would it take 100 machines to make 100 widgets?", "answer": "5", "category": "reasoning", "difficulty": "hard"},
    {"id": "reason_04", "question": "If you have 3 apples and take away 2, how many do YOU have?", "answer": "2", "category": "reasoning", "difficulty": "easy"},
    {"id": "reason_05", "question": "What comes next: 2, 4, 8, 16, ?", "answer": "32", "category": "reasoning", "difficulty": "easy"},
    # Knowledge (English)
    {"id": "know_01", "question": "What is the capital of France?", "answer": "Paris", "category": "knowledge", "difficulty": "easy"},
    {"id": "know_02", "question": "What is the chemical symbol for water?", "answer": "H2O", "category": "knowledge", "difficulty": "easy"},
    {"id": "know_03", "question": "How many planets are in our solar system?", "answer": "8", "category": "knowledge", "difficulty": "easy"},
    # Knowledge (Chinese)
    {"id": "zh_01", "question": "中国的首都是哪里?只回答城市名。", "answer": "北京", "category": "knowledge_zh", "difficulty": "easy"},
    {"id": "zh_03", "question": "地球上最大的海洋是什么?只回答名称。", "answer": "太平洋", "category": "knowledge_zh", "difficulty": "easy"},
    # Math (Chinese)
    {"id": "zh_02", "question": "计算:99 * 99 = ? 只回答数字。", "answer": "9801", "category": "math_zh", "difficulty": "medium"},
    # Code
    {"id": "code_01", "question": "What is the output of: print(len('hello'))?", "answer": "5", "category": "code", "difficulty": "easy"},
    {"id": "code_02", "question": "In Python, what does [1,2,3][1] return?", "answer": "2", "category": "code", "difficulty": "easy"},
    {"id": "code_03", "question": "What is 2**10?", "answer": "1024", "category": "code", "difficulty": "easy"},
]
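
Before running an evaluation it can help to sanity-check the dataset: every record has the expected fields, ids are unique, and the per-category counts match the table above. The sketch below is one way to do that. `validate` is a hypothetical helper, and SUBSET holds three records copied from BENCHMARK_DATA to keep the example short.

```python
from collections import Counter

REQUIRED = {"id", "question", "answer", "category", "difficulty"}

def validate(dataset):
    """Check required fields and unique ids; return the number of records per category."""
    for index, item in enumerate(dataset):
        missing = REQUIRED - set(item)
        if missing:
            raise ValueError(f"record {index}: missing fields {sorted(missing)}")
        if not str(item["answer"]).strip():
            raise ValueError(f"{item['id']}: empty answer")
    id_counts = Counter(item["id"] for item in dataset)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate ids: {duplicates}")
    return Counter(item["category"] for item in dataset)

# Three records copied from BENCHMARK_DATA; pass the full list in practice.
SUBSET = [
    {"id": "math_01", "question": "Calculate: 15 + 27 = ?", "answer": "42", "category": "math", "difficulty": "easy"},
    {"id": "math_06", "question": "What is 15% of 200?", "answer": "30", "category": "math", "difficulty": "medium"},
    {"id": "code_01", "question": "What is the output of: print(len('hello'))?", "answer": "5", "category": "code", "difficulty": "easy"},
]

print(validate(SUBSET))
```

Run against the full BENCHMARK_DATA, the returned counts should line up with the category table (6, 5, 3, 2, 1, 3).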

Scoring logic

import re

def extract_answer(content):
    """Strip Qwen3's <think>...</think> block and return the actual answer."""
    if not content:
        return ""
    cleaned = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()
    if not cleaned and '<think>' in content:
        # Fallback: keep whatever follows the last </think>
        parts = content.split('</think>')
        cleaned = parts[-1].strip() if len(parts) > 1 else content
    return cleaned

def check_answer(response_text, expected):
    """Containment scoring: 1.0 if the expected answer appears in the model output."""
    cleaned = extract_answer(response_text).lower().strip()
    expected_lower = expected.lower().strip()
    # Direct substring match
    if expected_lower in cleaned:
        return 1.0
    # Match against numbers extracted from the output
    numbers = re.findall(r'-?\d+\.?\d*', cleaned)
    if expected_lower in numbers:
        return 1.0
    return 0.0
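
One caveat of containment matching is that a short numeric answer can match inside a longer number: with the expected answer 5, the output "The answer is 15" also scores 1.0. A stricter variant is sketched below as an assumption, not the scoring actually used in this evaluation. It compares against whole numeric tokens when the expected answer is itself a number. `check_answer_strict` is a hypothetical name, and the input is expected to be already stripped of any think block.

```python
import re

NUMBER = re.compile(r'-?\d+(?:\.\d+)?')

def check_answer_strict(cleaned_text: str, expected: str) -> float:
    """Whole-number match for numeric answers, substring match otherwise."""
    text = cleaned_text.lower().strip()
    target = expected.lower().strip()
    if NUMBER.fullmatch(target):
        # Compare against complete numeric tokens, so "5" does not match "15".
        return 1.0 if target in NUMBER.findall(text) else 0.0
    return 1.0 if target in text else 0.0

print(check_answer_strict("The answer is 15", "5"))       # 0.0
print(check_answer_strict("It costs 5 cents.", "5"))      # 1.0
print(check_answer_strict("The capital is Paris.", "Paris"))  # 1.0
```

The trade-off is that formatted numbers such as "1,368" would no longer match "1368", so the stricter rule is not automatically better for every question.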

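Dataset design considerations

  1. Why build a custom set instead of using a public benchmark? Public benchmarks (MMLU, GSM8K) are large and each run takes a long time. 20 hand-picked questions can be evaluated in roughly five to six minutes (the runs in this article took ~360s and 296.7s), which suits a quick comparison.
  2. Why include Chinese questions? The Qwen series is bilingual (Chinese and English), so both languages need to be checked.
  3. Why include a cognitive-bias question (reason_02)? The bat-and-ball problem is a classic System 1 vs System 2 test that helps distinguish whether a model actually reasons.
  4. Limitations: 20 questions offer limited discriminative power. For production use, 100+ questions with more hard items are recommended.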

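Model download

Download with a Docker container plus ModelScope (suited to network conditions in mainland China):

# Start the download container (pre-configured to mount /home/ec2-user/SageMaker/models → /models)
docker start downloader
docker exec downloader pip install modelscope -q

# Download both models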
docker exec downloader modelscope download --model Qwen/Qwen3-4B --local_dir /models/Qwen3-4B
docker exec downloader modelscope download --model Qwen/Qwen3.5-4B --local_dir /models/Qwen3.5-4B

7.6G    /home/ec2-user/SageMaker/models/Qwen3-4B
8.8G    /home/ec2-user/SageMaker/models/Qwen3.5-4B

vLLM Docker Compose configuration

version: '3.3'
services:
  vllm:
    container_name: vllm-server
    image: public.ecr.aws/kraft-llm/vllm/vllm-openai:v0.9.2
    volumes:
      - /home/ec2-user/SageMaker/models:/models
    ports:
      - '8000:8000'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model /models/Qwen3-4B
      --served-model-name Qwen3-4B
      --host 0.0.0.0
      --port 8000
      --max-model-len 4096
      --gpu-memory-utilization 0.85
      --dtype half
      --trust-remote-code
    restart: unless-stopped

Client dependency installation

source activate pytorch_p310

pip install transformers --upgrade -i https://mirrors.aliyun.com/pypi/simple/
pip install tiktoken --only-binary=:all: -i https://mirrors.aliyun.com/pypi/simple/
pip install agentscope openai accelerate -i https://mirrors.aliyun.com/pypi/simple/

Qwen3-4B evaluation

Start the vLLM service

docker-compose -f /home/ec2-user/SageMaker/docker-compose/vllm/compose.yaml up -d

Key log lines:

INFO: vLLM API server version 0.9.2
INFO: model='/models/Qwen3-4B', dtype=torch.float16, max_seq_len=4096
INFO: Using XFormers backend.  # T4 does not support FA2, so vLLM falls back to XFormers
INFO: Application startup complete.
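
Once "Application startup complete" appears, the server can also be probed from Python before starting the evaluation. The sketch below assumes the OpenAI-compatible model-listing endpoint under the base URL used in this article. The function names are illustrative, and the network call is left for the reader to invoke.

```python
import json
import urllib.request

def models_url(base_url: str) -> str:
    """Build the model-listing endpoint from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"

def parse_model_ids(payload: str) -> list:
    """Extract model ids from an OpenAI-style model-listing response body."""
    return [entry["id"] for entry in json.loads(payload).get("data", [])]

def served_models(base_url: str, timeout: float = 5.0) -> list:
    """Query the server; raises URLError/OSError if it is not reachable yet."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))

# Usage (requires the vLLM container to be running):
#   served_models("http://localhost:8000/v1")
```

With the compose file above, the returned list would be expected to contain the name given by --served-model-name, since that is the model_name the client must send.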

AgentScope evaluation script (Qwen3-4B)

# eval_agentscope.py - core solution function
async def qwen_solution(task: Task, pre_hook: Callable) -> SolutionOutput:
    model = OpenAIChatModel(
        model_name="Qwen3-4B",
        api_key="not-needed",
        stream=False,
        client_kwargs={"base_url": "http://localhost:8000/v1"},
        generate_kwargs={"temperature": 0.1, "max_tokens": 512},
    )
    prompt = f"{task.input}\nAnswer directly and concisely."
    response = await model(messages=[{"role": "user", "content": prompt}])
    text = extract_text_from_response(response)
    return SolutionOutput(success=True, output=text, trajectory=[])
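
The `extract_text_from_response` helper is not shown in the excerpt. A plausible sketch follows, under the assumption that the response exposes a `content` attribute that is either a plain string or a list of dict-like blocks with a "type" key, where text blocks carry their payload under "text". The real helper in the script may differ.

```python
from types import SimpleNamespace

def extract_text_from_response(response) -> str:
    """Join the text blocks of a response whose .content is a str or a list of blocks."""
    content = getattr(response, "content", response)
    if isinstance(content, str):
        return content
    parts = []
    for block in content or []:
        # Assumed block shape: {"type": "text", "text": "..."}; other types are skipped.
        if isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
    return "".join(parts)

# Stand-in object for demonstration; a real response comes from `await model(...)`.
fake = SimpleNamespace(content=[
    {"type": "thinking", "thinking": "15 + 27 ..."},
    {"type": "text", "text": "42"},
])
print(extract_text_from_response(fake))  # prints 42
```

Skipping non-text blocks keeps reasoning traces out of the scored output, which matters for a containment metric.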

Run

source activate pytorch_p310
python3 eval_agentscope.py Qwen3-4B http://localhost:8000/v1

AgentScope evaluation results

============================================================
AgentScope Evaluation: Qwen3-4B
API: http://localhost:8000/v1
============================================================

  [math_01] expected=42, got=42
  [math_02] expected=63, got=63
  [math_03] expected=56, got=56
  [math_04] expected=1368, got=1368
  [math_05] expected=400, got=400
  [math_06] expected=30, got=30
  [reason_01] expected=no, got=no
  [reason_02] expected=5, got=5
  [reason_03] expected=5, got=5
  [know_01] expected=Paris, got=Paris
  [know_02] expected=H2O, got=H₂O          ← FAIL (Unicode formatting difference)
  [know_03] expected=8, got=8
  [zh_01] expected=北京, got=北京
  [zh_02] expected=9801, got=9801
  [zh_03] expected=太平洋, got=太平洋
  [code_01] expected=5, got=5
  [code_02] expected=2, got=2
  [code_03] expected=1024, got=1024
  [reason_04] expected=2, got=2
  [reason_05] expected=32, got=32

Repeat ID: 0
    Metric: contains_answer
        Type: MetricType.NUMERICAL
        Involved tasks: 20
        Completed tasks: 20
        Aggregation: { "mean": 0.95, "max": 1.0, "min": 0.0 }

Total evaluation time: ~360s
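
The know_02 miss is a formatting mismatch rather than a wrong answer: the model wrote the subscript digit "₂" where the expected string has "2". One possible hardening, shown here as a sketch and not as the metric used for the numbers in this article, is to apply Unicode NFKC normalization to both sides before the containment check. The function names are illustrative.

```python
import unicodedata

def normalize(text: str) -> str:
    """Fold compatibility characters (such as subscript digits) and letter case."""
    return unicodedata.normalize("NFKC", text).lower().strip()

def contains_normalized(output: str, expected: str) -> bool:
    """Containment check on NFKC-normalized, lowercased strings."""
    return normalize(expected) in normalize(output)

print(contains_normalized("The chemical symbol for water is H₂O.", "H2O"))  # True
print("h2o" in "the chemical symbol for water is h₂o.")                     # False
```

NFKC maps compatibility forms such as subscripts, superscripts and full-width digits to their plain equivalents, so it also helps with full-width numbers in Chinese answers.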

evaluation_result.json produced by AgentScope

{
    "total_tasks": 20,
    "total_stats": {
        "chat_usage": {
            "Qwen3-4B": {"input_tokens": 679, "output_tokens": 5800}
        }
    },
    "repeats": {
        "0": {
            "completed_tasks": 20,
            "metrics": {
                "contains_answer": {
                    "aggregation": {"mean": 0.95, "max": 1.0, "min": 0.0}
                }
            }
        }
    }
}

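Qwen3.5-4B evaluation

Failed vLLM deployment attempts

vLLM v0.9.2: architecture not supported. vLLM v0.9.2 bundles transformers 4.53.1, which does not recognize the new Qwen3.5 architecture identifier.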

ValidationError: The checkpoint has model type `qwen3_5` 
but Transformers does not recognize this architecture.

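vLLM v0.20.2: out of GPU memory. The image was pulled with:

docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/vllm/vllm-openai:v0.20.2

Loading the model takes 8.61 GiB, and torch.compile plus CUDA graph profiling need additional memory on top of that, which the T4's 16GB cannot provide. Qwen3.5-4B therefore cannot be deployed through vLLM on a T4 16GB; an A10G (24GB) or a GPU with more memory is required.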

torch.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 1.03 GiB. GPU has 14.56 GiB total, 56.81 MiB free.
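
The numbers in these messages allow a rough budget check. The sketch below subtracts the reported weight size from the usable GPU memory. The 0.85 utilization factor is taken from the Qwen3-4B compose file and is only an assumption for the v0.20.2 run, whose flags are not shown.

```python
def remaining_gib(total_gib: float, weights_gib: float, utilization: float = 1.0) -> float:
    """Memory left for KV cache, activations and CUDA graphs after loading weights."""
    return total_gib * utilization - weights_gib

TOTAL_GIB = 14.56    # total GPU memory reported in the OOM message
WEIGHTS_GIB = 8.61   # model load size reported for Qwen3.5-4B

print(round(remaining_gib(TOTAL_GIB, WEIGHTS_GIB), 2))        # 5.95
print(round(remaining_gib(TOTAL_GIB, WEIGHTS_GIB, 0.85), 2))  # 3.77
```

Whatever remains has to cover the CUDA context, activations, torch.compile workspaces, CUDA graph capture and the KV cache, which is consistent with a 1.03 GiB allocation failing late in startup.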

Direct transformers inference with AgentScope evaluation

Since vLLM cannot deploy Qwen3.5-4B, the AgentScope solution function runs inference directly with transformers:

# eval_agentscope_transformers.py - core part
import sys

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = sys.argv[1]  # first CLI argument, as in the run command below

# Load the model once at module level
TOKENIZER = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
MODEL = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

def generate(prompt, max_new_tokens=256):
    messages = [{"role": "user", "content": prompt}]
    text = TOKENIZER.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = TOKENIZER(text, return_tensors="pt").to(MODEL.device)
    with torch.no_grad():
        # Unpack the encoding so input_ids and attention_mask are passed by keyword
        outputs = MODEL.generate(**inputs, max_new_tokens=max_new_tokens,
                                  temperature=0.1, do_sample=True, top_p=0.9)
    # Decode only the newly generated tokens
    return TOKENIZER.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# AgentScope solution function
async def qwen35_solution(task: Task, pre_hook: Callable) -> SolutionOutput:
    prompt = f"{task.input}\nAnswer directly and concisely."
    text = generate(prompt)  # synchronous transformers call
    return SolutionOutput(success=True, output=text, trajectory=[])
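
Because `generate` is synchronous, calling it inside an async solution function blocks the event loop for the duration of each generation. With n_workers=1 this is harmless, but one way to keep the loop responsive is to move the blocking call to a worker thread. The sketch below uses a stand-in function instead of the real model; `blocking_generate` and `solve` are hypothetical names.

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    """Stand-in for the synchronous transformers generate() call."""
    time.sleep(0.05)
    return f"echo: {prompt}"

async def solve(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_generate, prompt)

async def demo():
    return await asyncio.gather(solve("a"), solve("b"))

print(asyncio.run(demo()))  # ['echo: a', 'echo: b']
```

On a single GPU this does not make generation faster; it only prevents the evaluator's other coroutines from stalling while a generation is in progress.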

Run

source activate pytorch_p310
python3 eval_agentscope_transformers.py /home/ec2-user/SageMaker/models/Qwen3.5-4B Qwen3.5-4B

AgentScope evaluation results

Model loaded. GPU memory: 7.83 GB

============================================================
AgentScope Evaluation: Qwen3.5-4B
============================================================

  [math_01] expected=42, got=42
  [math_02] expected=63, got=63
  [math_03] expected=56, got=56
  [math_04] expected=1368, got=1368
  [math_05] expected=400, got=400
  [math_06] expected=30, got=30
  [reason_01] expected=no, got=no
  [reason_02] expected=5, got=???           ← FAIL (cognitive-bias question)
  [reason_03] expected=5, got=5
  [know_01] expected=Paris, got=Paris
  [know_02] expected=H2O, got=H₂O          ← FAIL (Unicode formatting difference)
  [know_03] expected=8, got=8
  [zh_01] expected=北京, got=北京
  [zh_02] expected=9801, got=9801
  [zh_03] expected=太平洋, got=太平洋
  [code_01] expected=5, got=5
  [code_02] expected=2, got=2
  [code_03] expected=1024, got=1024
  [reason_04] expected=2, got=2
  [reason_05] expected=32, got=32

Repeat ID: 0
    Metric: contains_answer
        Type: MetricType.NUMERICAL
        Involved tasks: 20
        Completed tasks: 20
        Aggregation: { "mean": 0.9, "max": 1.0, "min": 0.0 }

Total evaluation time: 296.7s

Comparative analysis

Accuracy comparison

Category                       Qwen3-4B           Qwen3.5-4B
Math (6 questions)             6/6 = 100%         6/6 = 100%
Reasoning (5 questions)        5/5 = 100%         4/5 = 80%
Knowledge (3 questions)        2/3 = 66.7%        2/3 = 66.7%
Knowledge_zh (2 questions)     2/2 = 100%         2/2 = 100%
Math_zh (1 question)           1/1 = 100%         1/1 = 100%
Code (3 questions)             3/3 = 100%         3/3 = 100%
Total                          19/20 = 95.0%      18/20 = 90.0%
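
The per-category figures can be recomputed from the list of failed task ids in the two logs. The sketch below rebuilds the id-to-category mapping from BENCHMARK_DATA and tallies correct answers per category. `per_category` is an illustrative helper, not part of the evaluation scripts.

```python
from collections import defaultdict

# Task id -> category, mirroring BENCHMARK_DATA.
CATEGORY = {
    **{f"math_0{i}": "math" for i in range(1, 7)},
    **{f"reason_0{i}": "reasoning" for i in range(1, 6)},
    **{f"know_0{i}": "knowledge" for i in range(1, 4)},
    "zh_01": "knowledge_zh", "zh_03": "knowledge_zh", "zh_02": "math_zh",
    **{f"code_0{i}": "code" for i in range(1, 4)},
}

def per_category(failed_ids):
    """Return {category: (correct, total)} given the ids of failed tasks."""
    stats = defaultdict(lambda: [0, 0])
    for task_id, category in CATEGORY.items():
        stats[category][1] += 1
        if task_id not in failed_ids:
            stats[category][0] += 1
    return {category: tuple(counts) for category, counts in stats.items()}

# Failed ids as marked in the two result logs.
for name, failed in [("Qwen3-4B", {"know_02"}), ("Qwen3.5-4B", {"reason_02", "know_02"})]:
    table = per_category(failed)
    correct = sum(c for c, _ in table.values())
    total = sum(t for _, t in table.values())
    print(name, table, f"{correct}/{total}")
```

Deriving the table from the raw failures avoids transcription slips when the benchmark grows beyond 20 questions.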

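Why does the newer model score lower?

On the surface Qwen3.5-4B (90%) trails Qwen3-4B (95%), but this does not show that the newer model is worse:

  • Not statistically significant. A one-question gap out of 20 is a 5-point difference, well within random variation. Averaged over 3 runs, the result could flip.
  • Qwen3.5 has different design goals. Its improvements target long-context processing (where the DeltaNet architecture has an advantage on long sequences), coding, and multi-turn dialogue. These strengths do not show up in 20 short, simple questions; it is like judging a marathon runner by a 100-meter sprint.
  • The failure may be an output-format issue. On reason_02, Qwen3.5's structured "Thinking Process" output is longer and more complex, so the answer may have been missed by the regex extraction logic rather than miscalculated by the model.
  • Inference conditions differ. Qwen3-4B ran on vLLM (with its KV cache optimizations) while Qwen3.5-4B ran on plain transformers, and the generation parameters are not identical (for example max_tokens=512 versus max_new_tokens=256).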
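
The significance point can be made concrete with a confidence interval on each accuracy. The sketch below uses the Wilson score interval at the 95% level (z = 1.96). That choice of interval is an assumption made for illustration and is not something the evaluation itself computed.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

for name, correct in [("Qwen3-4B", 19), ("Qwen3.5-4B", 18)]:
    low, high = wilson_interval(correct, 20)
    print(f"{name}: {correct}/20, 95% interval = [{low:.2f}, {high:.2f}]")
```

With only 20 questions the two intervals come out at roughly 0.76 to 0.99 and 0.70 to 0.97, so they overlap almost entirely and the 5-point gap cannot separate the models.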

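A truly fair comparison would need:

  • A larger test set (200+ questions, averaged over multiple runs)
  • Coverage of the scenarios Qwen3.5 is good at (long-document summarization, code generation, multi-turn dialogue)
  • Identical inference conditions (both on vLLM or both on transformers, on the same hardware)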

References

  • https://doc.agentscope.io/zh_CN/tutorial/task_eval.html