Evaluating Qwen3-4B and Qwen3.5-4B with AgentScope: A Hands-On Walkthrough

zhaojiew10 · Published 2026-05-15 09:05:31

This post documents the full process of deploying two models, Qwen3-4B and Qwen3.5-4B, on an AWS SageMaker Notebook instance (ml.g4dn.xlarge, Tesla T4 16GB) with vLLM and transformers, and comparing them on a custom benchmark.

Qwen3-4B vs Qwen3.5-4B: Architecture Differences

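Feature	Qwen3-4B	Qwen3.5-4B
Release date	Q2 2025	February 2026
Architecture	Standard Transformer (attention only)	Hybrid (Gated DeltaNet + attention)
Parameters	~4B	~4B
Model size on disk	7.6 GB	8.8 GB
FlashAttention2	Requires compute capability (CC) ≥ 8.0	Requires CC ≥ 8.0
Special requirements	transformers ≥ 4.51	transformers ≥ 4.57 (new architecture `qwen3_5`)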

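Qwen3.5-4B introduces a hybrid architecture that alternates Gated DeltaNet layers with standard attention layers, an approach similar in spirit to state space models such as Mamba. This improves long-context handling, but it also increases inference complexity and GPU memory usage.

AgentScope is an open-source multi-agent framework from Alibaba. Its built-in evaluation module provides a systematic way to evaluate models and agents.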

Core architecture

Evaluator
  │
  ├── Benchmark
  │     ├── Task 1 { input, ground_truth, metrics[] }
  │     ├── Task 2 { input, ground_truth, metrics[] }
  │     └── Task N { input, ground_truth, metrics[] }
  │
  ├── Solution Function (the agent's solving logic)
  │     └── Task in → agent reasoning → SolutionOutput out
  │
  └── Metric Evaluation (scoring)
        └── Compare SolutionOutput vs ground_truth → MetricResult (0 to 1)
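
The flow in the diagram above can be sketched in plain Python, without AgentScope, to make the three roles concrete. This is only an illustration under stated assumptions: `ToyTask`, `contains_metric`, `run_eval`, and `echo_solver` are hypothetical names invented for this sketch and are not part of the AgentScope API.

```python
import asyncio
from dataclasses import dataclass
from statistics import mean
from typing import Awaitable, Callable


@dataclass
class ToyTask:
    """Hypothetical stand-in for a benchmark task."""
    id: str
    input: str
    ground_truth: str


def contains_metric(output: str, ground_truth: str) -> float:
    """Score 1.0 if the expected answer appears in the output, else 0.0."""
    return 1.0 if ground_truth.lower() in output.lower() else 0.0


async def run_eval(
    tasks: list[ToyTask],
    solve: Callable[[ToyTask], Awaitable[str]],
) -> dict:
    """Run every task through the solution function and aggregate the scores."""
    scores = {}
    for task in tasks:
        output = await solve(task)
        scores[task.id] = contains_metric(output, task.ground_truth)
    values = list(scores.values())
    return {
        "distribution": scores,
        "aggregation": {"mean": mean(values), "max": max(values), "min": min(values)},
    }


async def echo_solver(task: ToyTask) -> str:
    """Toy solution function: answers only the first task correctly."""
    return "The answer is 42." if task.id == "math_01" else "I am not sure."


tasks = [
    ToyTask("math_01", "Calculate: 15 + 27 = ?", "42"),
    ToyTask("math_02", "Calculate: 100 - 37 = ?", "63"),
]
result = asyncio.run(run_eval(tasks, echo_solver))
print(result["aggregation"])
```

The real evaluator adds repeats, parallel workers, and result storage on top of this loop, but the data flow (task in, solution out, metric score, aggregation) is the same.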

AgentScope code example

The following shows the standard way to write an evaluation with the official AgentScope API:

import asyncio
from agentscope.evaluate import (
    Task, BenchmarkBase, MetricBase, MetricResult,
    MetricType, SolutionOutput, GeneralEvaluator, FileEvaluatorStorage,
)
from agentscope.agent import ReActAgent
from agentscope.model import OpenAIChatModel
from agentscope.message import Msg
from agentscope.formatter import OpenAIChatFormatter
from typing import Generator, Callable


# ===== Step 1: Define the scoring metric =====
class ContainsAnswer(MetricBase):
    """Rule-based metric: scores 1.0 if the model output contains the expected answer string."""
    def __init__(self, ground_truth: str):
        super().__init__(
            name="contains_answer",
            metric_type=MetricType.NUMERICAL,
            description="Check if response contains the expected answer",
        )
        self.ground_truth = ground_truth

    async def __call__(self, solution: SolutionOutput) -> MetricResult:
        output = str(solution.output).lower()
        expected = self.ground_truth.lower()
        if expected in output:
            return MetricResult(name=self.name, result=1.0, message="Correct")
        return MetricResult(name=self.name, result=0.0, message="Incorrect")


# ===== Step 2: Define the benchmark =====
TASKS = [
    {"id": "math_01", "question": "Calculate: 15 + 27 = ?", "answer": "42"},
    {"id": "math_02", "question": "Calculate: 100 - 37 = ?", "answer": "63"},
    {"id": "reason_01", "question": "What comes next: 2, 4, 8, 16, ?", "answer": "32"},
]

class MyBenchmark(BenchmarkBase):
    def __init__(self):
        super().__init__(name="Qwen Eval", description="Math & Reasoning benchmark")
        self.dataset = [
            Task(
                id=t["id"], input=t["question"], ground_truth=t["answer"],
                metrics=[ContainsAnswer(t["answer"])],
            )
            for t in TASKS
        ]

    def __iter__(self) -> Generator[Task, None, None]:
        yield from self.dataset

    def __len__(self) -> int:
        return len(self.dataset)


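# ===== Step 3: Define the agent's solving logic =====
async def solution_func(task: Task, pre_hook: Callable) -> SolutionOutput:
    """Agent solution function, matching the AgentScope WorkflowType signature."""
    model = OpenAIChatModel(
        model_name="Qwen3-4B",
        api_key="not-needed",
        # Local vLLM service; base_url is passed through client_kwargs,
        # consistent with the later examples in this post.
        client_kwargs={"base_url": "http://localhost:8000/v1"},
    )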
    agent = ReActAgent(
        name="solver",
        sys_prompt="Answer directly and concisely.",
        model=model,
        formatter=OpenAIChatFormatter(),
    )
    response = await agent(Msg("user", task.input, role="user"))
    return SolutionOutput(success=True, output=response.get_text_content())


# ===== Step 4: Run the evaluation =====
async def main():
    evaluator = GeneralEvaluator(
        name="Qwen3-4B Evaluation",
        benchmark=MyBenchmark(),
        n_repeat=1,
        n_workers=1,
        storage=FileEvaluatorStorage(save_dir="./eval_results"),
    )
    await evaluator.run(solution_func)

asyncio.run(main())

Example output (taken from the full 20-task run described below):

Metric: contains_answer
    Type: MetricType.NUMERICAL
    Involved tasks: 20
    Completed tasks: 20
    Aggregation: { "mean": 0.95, "max": 1.0, "min": 0.0 }

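The formal evaluation in this post uses AgentScope's GeneralEvaluator directly.

Qwen3-4B: called through OpenAIChatModel against vLLM's OpenAI-compatible API
Qwen3.5-4B: vLLM could not deploy this model on the T4, so the solution function runs inference directly with transformers

Key parameters for connecting AgentScope's OpenAIChatModel to a local vLLM server: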

model = OpenAIChatModel(
    model_name="Qwen3-4B",
    api_key="not-needed",
    stream=False,
    client_kwargs={"base_url": "http://localhost:8000/v1"},
    generate_kwargs={"temperature": 0.1, "max_tokens": 512},
)
# Note: model() is an async call and returns a ChatResponse object
response = await model(messages=[{"role": "user", "content": prompt}])

AgentScope evaluation output format (evaluation_result.json):

{
    "total_tasks": 20,
    "repeats": {
        "0": {
            "metrics": {
                "contains_answer": {
                    "aggregation": {"mean": 0.95, "max": 1.0, "min": 0.0},
                    "distribution": {"math_01": 1.0, "know_02": 0.0, ...}
                }
            }
        }
    }
}
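
A result file with this layout can be summarized with a few lines of standard-library Python. The sketch below is an assumption-laden illustration: `summarize` is a hypothetical helper, and the inline `sample` only mirrors the fields shown above (with the elided distribution entries left out), rather than a complete file from a real run.

```python
import json

# Inline sample mirroring the fields shown above; a real run would instead
# load evaluation_result.json from the evaluator's save directory.
sample = """
{
    "total_tasks": 20,
    "repeats": {
        "0": {
            "metrics": {
                "contains_answer": {
                    "aggregation": {"mean": 0.95, "max": 1.0, "min": 0.0},
                    "distribution": {"math_01": 1.0, "know_02": 0.0}
                }
            }
        }
    }
}
"""


def summarize(result: dict) -> dict:
    """Return {repeat_id: {metric_name: (mean, [failed task ids])}}."""
    summary = {}
    for repeat_id, repeat in result["repeats"].items():
        summary[repeat_id] = {}
        for name, metric in repeat["metrics"].items():
            distribution = metric.get("distribution", {})
            failed = [tid for tid, score in distribution.items() if score < 1.0]
            summary[repeat_id][name] = (metric["aggregation"]["mean"], failed)
    return summary


summary = summarize(json.loads(sample))
print(summary)
```

Listing the failed task ids next to the mean makes it easier to see which questions drive a score difference.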

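This evaluation uses a self-built benchmark of 20 questions across 6 categories:

Category	Questions	Difficulty mix	Skills tested
Math (English)	6	Easy×3 + Medium×3	Arithmetic, percentages
Reasoning	5	Easy×2 + Medium×1 + Hard×2	Logical reasoning, cognitive biases
Knowledge (English)	3	Easy×3	Basic facts
Knowledge_zh (Chinese)	2	Easy×2	Chinese comprehension
Math_zh (Chinese)	1	Medium×1	Math posed in Chinese
Code	3	Easy×3	Python basics

Full test dataset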

BENCHMARK_DATA = [
    # Math (English)
    {"id": "math_01", "question": "Calculate: 15 + 27 = ?", "answer": "42", "category": "math", "difficulty": "easy"},
    {"id": "math_02", "question": "Calculate: 100 - 37 = ?", "answer": "63", "category": "math", "difficulty": "easy"},
    {"id": "math_03", "question": "Calculate: 8 * 7 = ?", "answer": "56", "category": "math", "difficulty": "easy"},
    {"id": "math_04", "question": "Calculate: 123 + 456 + 789 = ?", "answer": "1368", "category": "math", "difficulty": "medium"},
    {"id": "math_05", "question": "Calculate: 25 * 16 = ?", "answer": "400", "category": "math", "difficulty": "medium"},
    {"id": "math_06", "question": "What is 15% of 200?", "answer": "30", "category": "math", "difficulty": "medium"},
    # Reasoning
    {"id": "reason_01", "question": "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly? Answer yes or no only.", "answer": "no", "category": "reasoning", "difficulty": "medium"},
    {"id": "reason_02", "question": "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost in cents?", "answer": "5", "category": "reasoning", "difficulty": "hard"},
    {"id": "reason_03", "question": "If it takes 5 machines 5 minutes to make 5 widgets, how many minutes would it take 100 machines to make 100 widgets?", "answer": "5", "category": "reasoning", "difficulty": "hard"},
    {"id": "reason_04", "question": "If you have 3 apples and take away 2, how many do YOU have?", "answer": "2", "category": "reasoning", "difficulty": "easy"},
    {"id": "reason_05", "question": "What comes next: 2, 4, 8, 16, ?", "answer": "32", "category": "reasoning", "difficulty": "easy"},
    # Knowledge (English)
    {"id": "know_01", "question": "What is the capital of France?", "answer": "Paris", "category": "knowledge", "difficulty": "easy"},
    {"id": "know_02", "question": "What is the chemical symbol for water?", "answer": "H2O", "category": "knowledge", "difficulty": "easy"},
    {"id": "know_03", "question": "How many planets are in our solar system?", "answer": "8", "category": "knowledge", "difficulty": "easy"},
    # Knowledge (Chinese)
    {"id": "zh_01", "question": "中国的首都是哪里？只回答城市名。", "answer": "北京", "category": "knowledge_zh", "difficulty": "easy"},
    {"id": "zh_03", "question": "地球上最大的海洋是什么？只回答名称。", "answer": "太平洋", "category": "knowledge_zh", "difficulty": "easy"},
    # Math (Chinese)
    {"id": "zh_02", "question": "计算：99 * 99 = ? 只回答数字。", "answer": "9801", "category": "math_zh", "difficulty": "medium"},
    # Code
    {"id": "code_01", "question": "What is the output of: print(len('hello'))?", "answer": "5", "category": "code", "difficulty": "easy"},
    {"id": "code_02", "question": "In Python, what does [1,2,3][1] return?", "answer": "2", "category": "code", "difficulty": "easy"},
    {"id": "code_03", "question": "What is 2**10?", "answer": "1024", "category": "code", "difficulty": "easy"},
]

Scoring logic

import re

def extract_answer(content):
    """Strip Qwen3's <think>...</think> block and return the actual answer."""
    if not content:
        return ""
    # Non-greedy match so each think block is removed separately
    cleaned = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()
    if not cleaned and '<think>' in content:
        parts = content.split('</think>')
        cleaned = parts[-1].strip() if len(parts) > 1 else content
    return cleaned

def check_answer(response_text, expected):
    """Containment scoring: full score if the expected answer appears in the model output."""
    cleaned = extract_answer(response_text).lower().strip()
    expected_lower = expected.lower().strip()
    # Direct substring match
    if expected_lower in cleaned:
        return 1.0
    # Match against numbers extracted from the output
    numbers = re.findall(r'-?\d+\.?\d*', cleaned)
    if expected_lower in numbers:
        return 1.0
    return 0.0
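
Plain substring matching has two weak spots that show up in this benchmark: an expected answer such as "5" also matches inside "15", and "H2O" does not match the Unicode form "H₂O". The following is a minimal sketch of a stricter variant, not the scoring used for the results in this post. `normalize` and `check_answer_strict` are hypothetical names, and the sketch assumes the `<think>` block has already been stripped.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold compatibility characters (e.g. subscript digits) and lowercase."""
    return unicodedata.normalize("NFKC", text).lower().strip()


def check_answer_strict(response_text: str, expected: str) -> float:
    """Score 1.0 only when the expected answer appears as a standalone token."""
    cleaned = normalize(response_text)
    target = normalize(expected)
    # Lookarounds reject matches embedded in a longer number or ASCII word,
    # so "5" does not match inside "15" and "no" does not match inside "know".
    pattern = r"(?<![0-9a-z.])" + re.escape(target) + r"(?![0-9a-z])"
    return 1.0 if re.search(pattern, cleaned) else 0.0


print(check_answer_strict("H₂O", "H2O"))
print(check_answer_strict("The ball costs 15 cents", "5"))
```

The trade-off is that stricter matching can reject correct answers written in an unexpected form (for example "five"), so it is worth inspecting the failed cases by hand either way.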

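Dataset design considerations

Why a self-built set instead of a public benchmark? Public benchmarks (MMLU, GSM8K) are large and take a long time per run. Twenty hand-picked questions finish in roughly 5 to 6 minutes (about 360 s and 297 s in the runs below), which suits quick comparisons.
Why include Chinese questions? The Qwen series is bilingual (Chinese and English), so both languages need to be checked.
Why include a "cognitive bias" question (reason_02)? The bat-and-ball problem is a classic System 1 vs. System 2 test that helps show whether a model actually reasons.
Limitations: 20 questions offer limited discriminative power. For production use, 100+ questions with more hard items are recommended.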

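Model download

Download with a Docker container plus ModelScope (suited to network conditions in mainland China):

# Start the download container (pre-configured to mount /home/ec2-user/SageMaker/models → /models)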
docker start downloader
docker exec downloader pip install modelscope -q

# Download both models
docker exec downloader modelscope download --model Qwen/Qwen3-4B --local_dir /models/Qwen3-4B
docker exec downloader modelscope download --model Qwen/Qwen3.5-4B --local_dir /models/Qwen3.5-4B

7.6G    /home/ec2-user/SageMaker/models/Qwen3-4B
8.8G    /home/ec2-user/SageMaker/models/Qwen3.5-4B

vLLM Docker Compose configuration

version: '3.3'
services:
  vllm:
    container_name: vllm-server
    image: public.ecr.aws/kraft-llm/vllm/vllm-openai:v0.9.2
    volumes:
      - /home/ec2-user/SageMaker/models:/models
    ports:
      - '8000:8000'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model /models/Qwen3-4B
      --served-model-name Qwen3-4B
      --host 0.0.0.0
      --port 8000
      --max-model-len 4096
      --gpu-memory-utilization 0.85
      --dtype half
      --trust-remote-code
    restart: unless-stopped

Installing client dependencies

source activate pytorch_p310

pip install transformers --upgrade -i https://mirrors.aliyun.com/pypi/simple/
pip install tiktoken --only-binary=:all: -i https://mirrors.aliyun.com/pypi/simple/
pip install agentscope openai accelerate -i https://mirrors.aliyun.com/pypi/simple/

Qwen3-4B evaluation

Starting the vLLM service

docker-compose -f /home/ec2-user/SageMaker/docker-compose/vllm/compose.yaml up -d

Key log lines:

INFO: vLLM API server version 0.9.2
INFO: model='/models/Qwen3-4B', dtype=torch.float16, max_seq_len=4096
INFO: Using XFormers backend.  # T4 does not support FA2, so vLLM falls back to XFormers
INFO: Application startup complete.
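
Model loading takes a while after `up -d` returns, so it can help to wait for the API before launching the evaluation. The sketch below polls the OpenAI-compatible `/v1/models` endpoint using only the standard library. `wait_for_server` is a hypothetical helper and the timeout values are arbitrary assumptions.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Poll GET {base_url}/models until the server answers or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                served = [m["id"] for m in json.load(resp)["data"]]
                print("Server ready, serving:", served)
                return True
        except (urllib.error.URLError, OSError, ValueError, KeyError):
            # Not reachable yet, or the response was not the expected JSON; retry.
            time.sleep(interval_s)
    return False
```

Calling `wait_for_server("http://localhost:8000/v1")` before the evaluation script avoids connection errors on the first few tasks; it returns False if the server never comes up within the timeout.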

AgentScope evaluation script (Qwen3-4B)

# eval_agentscope.py - core solution function
async def qwen_solution(task: Task, pre_hook: Callable) -> SolutionOutput:
    model = OpenAIChatModel(
        model_name="Qwen3-4B",
        api_key="not-needed",
        stream=False,
        client_kwargs={"base_url": "http://localhost:8000/v1"},
        generate_kwargs={"temperature": 0.1, "max_tokens": 512},
    )
    prompt = f"{task.input}\nAnswer directly and concisely."
    response = await model(messages=[{"role": "user", "content": prompt}])
    text = extract_text_from_response(response)
    return SolutionOutput(success=True, output=text, trajectory=[])
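
The script above calls `extract_text_from_response`, which is not shown in the excerpt. One plausible shape for it is sketched below. This is an assumption, not the author's code: it presumes the response exposes a `content` attribute holding a list of dict-like blocks such as `{"type": "text", "text": "..."}`, and it uses a stand-in object instead of a real model response.

```python
from types import SimpleNamespace
from typing import Any


def extract_text_from_response(response: Any) -> str:
    """Concatenate the text of all text blocks in a chat response (assumed layout)."""
    content = getattr(response, "content", response)
    if isinstance(content, str):
        return content
    parts = []
    for block in content or []:
        # Blocks are assumed to be dict-like, e.g. {"type": "text", "text": "..."}.
        if isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
    return "".join(parts)


# Stand-in object for illustration; a real run would pass the model's response.
fake = SimpleNamespace(content=[
    {"type": "thinking", "thinking": "15 + 27 ..."},
    {"type": "text", "text": "42"},
])
print(extract_text_from_response(fake))
```

Skipping non-text blocks keeps the model's reasoning trace out of the string that the metric scores.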

Run

source activate pytorch_p310
python3 eval_agentscope.py Qwen3-4B http://localhost:8000/v1

AgentScope evaluation results

============================================================
AgentScope Evaluation: Qwen3-4B
API: http://localhost:8000/v1
============================================================

  [math_01] expected=42, got=42
  [math_02] expected=63, got=63
  [math_03] expected=56, got=56
  [math_04] expected=1368, got=1368
  [math_05] expected=400, got=400
  [math_06] expected=30, got=30
  [reason_01] expected=no, got=no
  [reason_02] expected=5, got=5
  [reason_03] expected=5, got=5
  [know_01] expected=Paris, got=Paris
  [know_02] expected=H2O, got=H₂O          ← FAIL (Unicode formatting difference)
  [know_03] expected=8, got=8
  [zh_01] expected=北京, got=北京
  [zh_02] expected=9801, got=9801
  [zh_03] expected=太平洋, got=太平洋
  [code_01] expected=5, got=5
  [code_02] expected=2, got=2
  [code_03] expected=1024, got=1024
  [reason_04] expected=2, got=2
  [reason_05] expected=32, got=32

Repeat ID: 0
    Metric: contains_answer
        Type: MetricType.NUMERICAL
        Involved tasks: 20
        Completed tasks: 20
        Aggregation: { "mean": 0.95, "max": 1.0, "min": 0.0 }

Total evaluation time: ~360s

The evaluation_result.json written by AgentScope:

{
    "total_tasks": 20,
    "total_stats": {
        "chat_usage": {
            "Qwen3-4B": {"input_tokens": 679, "output_tokens": 5800}
        }
    },
    "repeats": {
        "0": {
            "completed_tasks": 20,
            "metrics": {
                "contains_answer": {
                    "aggregation": {"mean": 0.95, "max": 1.0, "min": 0.0}
                }
            }
        }
    }
}

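Qwen3.5-4B evaluation

vLLM deployment attempts failed

vLLM v0.9.2: architecture not supported. vLLM v0.9.2 bundles transformers 4.53.1, which does not recognize Qwen3.5's new architecture identifier.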

ValidationError: The checkpoint has model type `qwen3_5` 
but Transformers does not recognize this architecture.

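vLLM v0.20.2: out of GPU memory. The image was pulled with docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/vllm/vllm-openai:v0.20.2. Loading the model takes 8.61 GiB, and torch.compile plus CUDA graph profiling need additional memory on top of that, which the T4's 16GB cannot provide. In this setup, Qwen3.5-4B could not be deployed through vLLM on a T4 16GB; an A10G (24GB) or a GPU with more memory is needed.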

torch.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 1.03 GiB. GPU has 14.56 GiB total, 56.81 MiB free.

Direct transformers inference under AgentScope

Because vLLM could not deploy Qwen3.5-4B, the AgentScope solution function runs inference directly with transformers:

# eval_agentscope_transformers.py - core parts
import sys

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# The model path is the first CLI argument (see the run command below)
model_path = sys.argv[1]

# Load the model once, globally
TOKENIZER = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
MODEL = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

def generate(prompt, max_new_tokens=256):
    messages = [{"role": "user", "content": prompt}]
    text = TOKENIZER.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = TOKENIZER(text, return_tensors="pt").to(MODEL.device)
    with torch.no_grad():
        # Unpack the tokenizer output so input_ids and attention_mask are passed as keyword arguments
        outputs = MODEL.generate(**inputs, max_new_tokens=max_new_tokens,
                                  temperature=0.1, do_sample=True, top_p=0.9)
    # Decode only the newly generated tokens, skipping the prompt
    return TOKENIZER.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# AgentScope solution function
async def qwen35_solution(task: Task, pre_hook: Callable) -> SolutionOutput:
    prompt = f"{task.input}\nAnswer directly and concisely."
    text = generate(prompt)  # synchronous transformers call
    return SolutionOutput(success=True, output=text, trajectory=[])
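
Because `generate` is a blocking call inside an async function, it holds up the event loop while the model is generating. With a single worker that is harmless, but one possible refinement is to hand the call to a worker thread. The sketch below uses a placeholder `generate` so that it runs without a GPU; `solve` is a hypothetical name.

```python
import asyncio
import time


def generate(prompt: str) -> str:
    """Placeholder for the blocking transformers call shown above."""
    time.sleep(0.05)
    return f"echo: {prompt}"


async def solve(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(generate, prompt)


print(asyncio.run(solve("Calculate: 15 + 27 = ?")))
```

This does not speed up generation on a single GPU; it only keeps other coroutines (for example result storage) from stalling during inference.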

Run

source activate pytorch_p310
python3 eval_agentscope_transformers.py /home/ec2-user/SageMaker/models/Qwen3.5-4B Qwen3.5-4B

AgentScope evaluation results

Model loaded. GPU memory: 7.83 GB

============================================================
AgentScope Evaluation: Qwen3.5-4B
============================================================

  [math_01] expected=42, got=42
  [math_02] expected=63, got=63
  [math_03] expected=56, got=56
  [math_04] expected=1368, got=1368
  [math_05] expected=400, got=400
  [math_06] expected=30, got=30
  [reason_01] expected=no, got=no
  [reason_02] expected=5, got=???           ← FAIL (cognitive-bias question)
  [reason_03] expected=5, got=5
  [know_01] expected=Paris, got=Paris
  [know_02] expected=H2O, got=H₂O          ← FAIL (Unicode formatting difference)
  [know_03] expected=8, got=8
  [zh_01] expected=北京, got=北京
  [zh_02] expected=9801, got=9801
  [zh_03] expected=太平洋, got=太平洋
  [code_01] expected=5, got=5
  [code_02] expected=2, got=2
  [code_03] expected=1024, got=1024
  [reason_04] expected=2, got=2
  [reason_05] expected=32, got=32

Repeat ID: 0
    Metric: contains_answer
        Type: MetricType.NUMERICAL
        Involved tasks: 20
        Completed tasks: 20
        Aggregation: { "mean": 0.9, "max": 1.0, "min": 0.0 }

Total evaluation time: 296.7s

Comparative analysis

Accuracy comparison

Category	Qwen3-4B	Qwen3.5-4B
Math (6 questions)	6/6 = 100%	6/6 = 100%
Reasoning (5 questions)	5/5 = 100%	4/5 = 80%
Knowledge (3 questions)	2/3 = 66.7%	2/3 = 66.7%
Knowledge_zh (2 questions)	2/2 = 100%	2/2 = 100%
Math_zh (1 question)	1/1 = 100%	1/1 = 100%
Code (3 questions)	3/3 = 100%	3/3 = 100%
Total	19/20 = 95.0%	18/20 = 90.0%
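
The per-category numbers in the table can be reproduced from the task ids and the failures marked in the two logs. The snippet below is a small sketch of that bookkeeping: `CATEGORY`, `FAILED`, and `per_category` are names made up for this illustration, and the failed ids are copied from the FAIL markers in the logs above.

```python
from collections import Counter

# Task id -> category, as listed in the dataset above.
CATEGORY = {
    **{f"math_0{i}": "math" for i in range(1, 7)},
    **{f"reason_0{i}": "reasoning" for i in range(1, 6)},
    **{f"know_0{i}": "knowledge" for i in range(1, 4)},
    "zh_01": "knowledge_zh", "zh_03": "knowledge_zh", "zh_02": "math_zh",
    **{f"code_0{i}": "code" for i in range(1, 4)},
}

# Failed task ids as marked in the two evaluation logs.
FAILED = {
    "Qwen3-4B": {"know_02"},
    "Qwen3.5-4B": {"know_02", "reason_02"},
}


def per_category(failed: set[str]) -> dict[str, tuple[int, int]]:
    """Return {category: (passed, total)} given the set of failed task ids."""
    total = Counter(CATEGORY.values())
    missed = Counter(CATEGORY[t] for t in failed)
    return {cat: (n - missed[cat], n) for cat, n in total.items()}


for model, failed in FAILED.items():
    table = per_category(failed)
    passed = sum(p for p, _ in table.values())
    print(model, table, f"overall {passed}/{len(CATEGORY)}")
```

In a larger benchmark the `FAILED` sets would come from the per-task scores in evaluation_result.json rather than being typed in by hand.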

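Why does the newer model score lower?

On the surface Qwen3.5-4B (90%) trails Qwen3-4B (95%), but this does not show that the newer model is worse:

Not statistically significant. One question out of 20 is a 5-point difference, well within random variation. Averaging over 3 runs could reverse the result.
Different design goals. Qwen3.5's improvements target long-context processing (where the DeltaNet architecture has its advantage), coding ability, and multi-turn dialogue. None of this shows up in 20 short, simple questions; it is like judging a marathon runner by a 100-meter sprint.
The failure may be an output-format issue. On reason_02, Qwen3.5's structured "Thinking Process" output is longer and more complex, so the answer may have been missed by the regex extraction logic rather than miscalculated by the model.
Different inference conditions. Qwen3-4B ran on vLLM (with its KV cache optimizations) while Qwen3.5-4B ran on plain transformers, and the generation parameters differ: the vLLM path uses max_tokens=512, while the transformers path uses max_new_tokens=256 and top_p=0.9.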
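
The point about statistical significance can be made concrete with a confidence interval for each accuracy. The sketch below uses the Wilson score interval, a standard choice for small samples. It treats the 20 questions as independent trials, which is a simplifying assumption, and `wilson_interval` is a helper written for this illustration.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


for name, correct in [("Qwen3-4B", 19), ("Qwen3.5-4B", 18)]:
    low, high = wilson_interval(correct, 20)
    print(f"{name}: {correct}/20, 95% interval [{low:.3f}, {high:.3f}]")
```

With only 20 questions the two intervals are each more than 20 points wide and overlap almost entirely, which is why a one-question gap should not be read as a real difference between the models.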

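A truly fair comparison would need:

A larger test set (200+ questions, averaged over multiple runs)
Coverage of the scenarios Qwen3.5 is designed for (long-document summarization, code generation, multi-turn dialogue)
Identical inference conditions (both on vLLM or both on transformers, on the same hardware)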

References

https://doc.agentscope.io/zh_CN/tutorial/task_eval.html
