Stuck on computing the NovelSum data-novelty metric — asking for guidance
While reading the MagicAgent paper recently, I came across a passage in Section 4 saying that the model is trained on different subsets of synthetic data. To ensure data diversity, the framework adopts the NovelSum metric, which computes a novelty score for each candidate sample from a proximity weight and a density factor.
原文链接:https://arxiv.org/abs/2602.19000
Relevant excerpt from the paper:
We adopt the Qwen3 series as the base model and further train it using our synthetic dataset, including the Hierarchical Task Decomposition, Tool-Augmented Planning, Multi-Constraint Scheduling, Procedural Logic Orchestration, and Long-Horizon Tool Execution tasks. To ensure comprehensive coverage and diversity across domains, we employ the NovelSum metric (Yang et al., 2025b) to construct a representative sub-training set $\mathcal{D}_{\text{sample}}$, thereby facilitating the effective fusion of heterogeneous planning data. Specifically, the novelty of a candidate sample $x$ relative to the currently selected training set $\mathcal{D}_{\text{sample}}$ is defined as follows:
$$v(x) = \sum_{x_j \in \mathcal{D}_{\text{sample}}} w(x, x_j)^{\kappa_1} \cdot \sigma(x_j)^{\kappa_2} \cdot d(x, x_j) \tag{1}$$
where the proximity weight $w(x, x_j) = \frac{1}{\pi(j)}$ assigns higher importance to closer neighbors, with $\pi(j)$ indicating the rank of $x_j$ when samples are ordered by increasing distance to $x$. The density factor is defined as $\sigma(x_j) = \frac{1}{\sum_{k=1}^{K} d(x_j, N_k(x_j))}$, where $N_k(x_j)$ denotes the $k$-th nearest neighbor of $x_j$, and $d(\cdot, \cdot)$ denotes the distance between sample embeddings. The hyper-parameters $\kappa_1$ and $\kappa_2$ control the degree of the proximity weight and density factor, respectively. At each iteration, the sample $x_{\text{high}}$ with the highest novelty score is selected as $x_{\text{high}} = \arg\max_x v(x)$ and added to the training set: $\mathcal{D}_{\text{sample}} \leftarrow \mathcal{D}_{\text{sample}} \cup \{x_{\text{high}}\}$. This procedure is repeated, starting from $\mathcal{D}_{\text{sample}} = \emptyset$, until the predefined data budget is met.
The loss function formulation for the optimization objective of the SFT stage is defined as the standard token-level cross-entropy loss:
$$L_{\text{sft}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{V} y_{ij} \log(p_{ij}) \tag{2}$$
where $N$ is the sequence length, $V$ is the vocabulary size, $y_{ij}$ is the ground-truth indicator of class $j$ at token $i$, and $p_{ij}$ is the predicted probability of class $j$ at token $i$.
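Equation (2) is the standard cross-entropy, so it is easy to sanity-check with a small NumPy sketch (toy numbers of my own; since $y$ is one-hot, the inner sum over $V$ just picks out the probability assigned to the ground-truth token at each position):

```python
import numpy as np

def sft_loss(probs, targets):
    """Token-level cross-entropy per Eq. (2).

    probs:   (N, V) predicted distributions over the vocabulary
    targets: (N,)   ground-truth token ids (the one-hot y collapses to this)
    """
    n = len(targets)
    # Select p_{i, y_i} at each position, then average the negative logs.
    return -np.mean(np.log(probs[np.arange(n), targets]))

# Toy example: N = 2 positions, V = 3 vocabulary entries.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])
print(sft_loss(probs, targets))  # -(ln 0.7 + ln 0.8) / 2 ≈ 0.2899
```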
I wanted to work through formula (1) on a concrete example to see how candidates get selected into $D_{sample}$, but I couldn't figure out how to plug numbers in. Suppose I have five samples $x_1, x_2, x_3, x_4, x_5$, with pairwise distances $d$ given in the table below:
| Distance | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |
|---|---|---|---|---|---|
| $x_1$ | 0 | 0.2 | 0.8 | 0.9 | 1.0 |
| $x_2$ | 0.2 | 0 | 0.75 | 0.85 | 0.95 |
| $x_3$ | 0.8 | 0.75 | 0 | 0.3 | 0.35 |
| $x_4$ | 0.9 | 0.85 | 0.3 | 0 | 0.25 |
| $x_5$ | 1.0 | 0.95 | 0.35 | 0.25 | 0 |
For the initial state, I set the hyper-parameters $\kappa_1 = 1$, $\kappa_2 = 1$. At this point $D_{sample} = \emptyset$; to keep things simple, instead of picking the first sample at random I just fix it to $x_1$, so $D_{sample} = \{x_1\}$.
Next, in the first round, we need to compute $v(x)$ for each candidate using formula (1) above.
Following the definitions, we compute
$$v(x_2) = w(x_2, x_1) \cdot \sigma(x_1) \cdot d(x_2, x_1) = 1 \cdot 0.2 \cdot \sigma(x_1) = \frac{1}{\sum_{k=1}^{K} d(x_1, N_k(x_1))} \cdot 0.2$$
But at this step I got stuck: I don't know how to evaluate the denominator $\sum_{k=1}^{K} d(x_1, N_k(x_1))$ — which samples count as the neighbors $N_k(x_1)$, and what should $K$ be?
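My current guess is the following reading, which I sketched in Python — assuming $N_k(x_j)$ are the $K$ nearest neighbors of $x_j$ within the full candidate pool (the excerpt doesn't say whether the pool is $D_{sample}$ or all candidates, nor does it fix $K$; I take $K = 2$ here), and that the rank $\pi(j)$ starts at 1 — but I'm not sure this is what the paper intends:

```python
import numpy as np

# Distance matrix for x1..x5 from the table above.
D = np.array([
    [0.00, 0.20, 0.80, 0.90, 1.00],
    [0.20, 0.00, 0.75, 0.85, 0.95],
    [0.80, 0.75, 0.00, 0.30, 0.35],
    [0.90, 0.85, 0.30, 0.00, 0.25],
    [1.00, 0.95, 0.35, 0.25, 0.00],
])
K, k1, k2 = 2, 1.0, 1.0  # K is my guess; kappa_1 = kappa_2 = 1 as in the post

def density(j):
    # sigma(x_j): inverse of the summed distances to x_j's K nearest
    # neighbors. Assumption: neighbors come from the FULL pool, not D_sample.
    dists = np.sort(np.delete(D[j], j))[:K]
    return 1.0 / dists.sum()

def novelty(x, sample):
    # v(x) per Eq. (1): rank-based proximity weight w = 1/pi(j) over D_sample.
    order = sorted(sample, key=lambda j: D[x][j])  # nearest member first
    return sum((1.0 / (rank + 1)) ** k1 * density(j) ** k2 * D[x][j]
               for rank, j in enumerate(order))

sample = [0]            # start with x1 fixed, as in my example
pool = [1, 2, 3, 4]
while pool:             # greedy: add the argmax-novelty candidate each round
    best = max(pool, key=lambda x: novelty(x, sample))
    print(f"pick x{best + 1}, v = {novelty(best, sample):.4f}")
    sample.append(best)
    pool.remove(best)
```

Under this reading, $\sigma(x_1) = 1/(0.2 + 0.8) = 1$, so the first round gives $v(x_5) = 1 \cdot 1 \cdot 1.0 = 1.0$ as the maximum and selects $x_5$. Is this interpretation of $N_k$ and $K$ correct?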
I'd really appreciate it if someone could point me in the right direction. Thanks in advance!