While working through the MagicAgent paper recently, I came across a passage in Section 4 saying that the model is trained on different subsets of synthetic data. To ensure data diversity, the framework adopts the NovelSum metric, which computes a novelty score for each candidate sample from its proximity weights and density factors.

Paper link: https://arxiv.org/abs/2602.19000

Relevant excerpt from the paper:

We adopt the Qwen3 series as the base model and further train it using our synthetic dataset, including the Hierarchical Task Decomposition, Tool-Augmented Planning, Multi-Constraint Scheduling, Procedural Logic Orchestration, and Long-Horizon Tool Execution tasks. To ensure comprehensive coverage and diversity across domains, we employ the NovelSum metric (Yang et al., 2025b) to construct a representative sub-training set $\mathcal{D}_{\text{sample}}$, thereby facilitating the effective fusion of heterogeneous planning data. Specifically, the novelty of a candidate sample $x$ relative to the currently selected training set $\mathcal{D}_{\text{sample}}$ is defined as follows:
$$v(x) = \sum_{x_j \in \mathcal{D}_{\text{sample}}} w(x, x_j)^{\kappa_1} \cdot \sigma(x_j)^{\kappa_2} \cdot d(x, x_j), \tag{1}$$
where the proximity weight $w(x, x_j) = \frac{1}{\pi(j)}$ assigns higher importance to closer neighbors, with $\pi(j)$ indicating the rank of $x_j$ when samples are ordered by increasing distance to $x$. The density factor is defined as $\sigma(x_j) = \frac{1}{\sum_{k=1}^{K} d(x_j, N_k(x_j))}$, $N_k(x_j)$ denotes the $k$-th nearest neighbor of $x$, and $d(\cdot, \cdot)$ denotes the distance between sample embeddings. The hyper-parameters $\kappa_1$ and $\kappa_2$ control the degree of the proximity weight and density factor, respectively. At each iteration, the sample $x_{\text{high}}$ with the highest novelty score is selected with $x_{\text{high}} = \arg\max_x v(x)$, and added to the training set as $\mathcal{D}_{\text{sample}} \leftarrow \mathcal{D}_{\text{sample}} \cup \{x_{\text{high}}\}$. This procedure is repeated, starting from $\mathcal{D}_{\text{sample}} = \emptyset$, until the predefined data budget is met.
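The greedy loop described above can be sketched in Python. This is only my own reading of the excerpt, not the authors' code: in particular, the value of $K$ and where the $K$ nearest neighbors for $\sigma(x_j)$ come from (here I assume the full candidate pool, excluding $x_j$ itself) are assumptions, and `novelty` / `novelsum_select` are names of my choosing.

```python
import numpy as np

def novelty(x_idx, selected, dist, kappa1=1.0, kappa2=1.0, K=2):
    """Novelty v(x) of candidate x_idx w.r.t. the selected set, per Eq. (1)."""
    # pi(j): rank of x_j among the selected samples, ordered by distance to x
    order = sorted(selected, key=lambda j: dist[x_idx, j])
    v = 0.0
    for rank, j in enumerate(order, start=1):
        w = 1.0 / rank                       # proximity weight w = 1/pi(j)
        # density factor sigma(x_j): ASSUMPTION -- the K nearest neighbors
        # are taken from the full candidate pool, excluding x_j itself
        knn = np.sort(np.delete(dist[j], j))[:K]
        sigma = 1.0 / knn.sum()
        v += (w ** kappa1) * (sigma ** kappa2) * dist[x_idx, j]
    return v

def novelsum_select(dist, budget, seed, **kw):
    """Greedy selection: repeatedly add argmax_x v(x) until the budget is met."""
    n = dist.shape[0]
    selected = [seed]        # the paper starts from an empty set; the first
                             # argmax is degenerate (v is an empty sum), so
                             # the first pick is seeded explicitly here
    while len(selected) < budget:
        remaining = [i for i in range(n) if i not in selected]
        best = max(remaining, key=lambda i: novelty(i, selected, dist, **kw))
        selected.append(best)
    return selected
```

`dist` is a symmetric pairwise-distance matrix over the candidate pool; with a budget and a seed index, `novelsum_select` returns the indices of the chosen sub-training set.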

The loss function formulation for the optimization objective of the SFT stage is defined as the standard token-level cross-entropy loss:

$$L_{\text{sft}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{V} y_{ij} \log(p_{ij}), \tag{2}$$

where $N$ is the sequence length, $V$ is the vocabulary size, $y_{ij}$ is the ground-truth indicator for token $i$, and $p_{ij}$ is the predicted probability for class $j$ at token $i$.
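As a sanity check, Eq. (2) is just the standard per-token cross-entropy and can be written out directly with NumPy; `y` (one-hot ground truth) and `p` (predicted distributions) are names of my choosing.

```python
import numpy as np

def sft_loss(y, p):
    """Eq. (2): token-level cross-entropy, averaged over the N tokens.

    y: (N, V) one-hot ground-truth indicators
    p: (N, V) predicted probabilities (each row sums to 1)
    """
    N = y.shape[0]
    # y zeroes out every term except -log p of the ground-truth class
    return -np.sum(y * np.log(p)) / N
```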


I wanted to work through a concrete instance of this formula to see how candidate data gets selected into $\mathcal{D}_{\text{sample}}$, but I could not figure out how to plug numbers in. Suppose I have five samples $x_1, x_2, x_3, x_4, x_5$, with the pairwise distances $d$ between them given in the table below:

| Distance $d$ | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |
| --- | --- | --- | --- | --- | --- |
| $x_1$ | 0 | 0.2 | 0.8 | 0.9 | 1.0 |
| $x_2$ | 0.2 | 0 | 0.75 | 0.85 | 0.95 |
| $x_3$ | 0.8 | 0.75 | 0 | 0.3 | 0.35 |
| $x_4$ | 0.9 | 0.85 | 0.3 | 0 | 0.25 |
| $x_5$ | 1.0 | 0.95 | 0.35 | 0.25 | 0 |

In the initial state, I set the hyper-parameters $\kappa_1 = 1$ and $\kappa_2 = 1$. At this point $\mathcal{D}_{\text{sample}} = \emptyset$; as a non-rigorous shortcut, instead of making the first pick at random I simply fix the first selected sample to be $x_1$, i.e. $\mathcal{D}_{\text{sample}} = \{x_1\}$.
Next, in the first round, we need to compute $v(x)$ for the remaining candidates using Equation (1) above.
Following the definitions, I compute
$$v(x_2) = w(x_2, x_1) \cdot \sigma(x_1) \cdot d(x_2, x_1) = 1 \cdot 0.2 \cdot \sigma(x_1) = \frac{1}{\sum_{k=1}^{K} d(x_1, N_k(x_1))} \cdot 0.2$$
But at this step I'm stuck: I don't see how to evaluate the denominator $\sum_{k=1}^{K} d(x_1, N_k(x_1))$ — what is $K$, and which samples count as the neighbors $N_k(x_1)$?
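To make the ambiguity concrete, here is the one reading I can come up with (an assumption on my part, not something the excerpt confirms): treat $K$ as a hyper-parameter and take the $K$ nearest neighbors of $x_1$ from the full candidate pool, excluding $x_1$ itself. With $K = 2$ and the distance table above, the calculation would go:

```python
# ASSUMPTION: N_k(x_1) ranges over the K nearest neighbors of x_1 in the
# full pool {x_2..x_5}, with K = 2 as an illustrative hyper-parameter value.
d_x1 = [0.2, 0.8, 0.9, 1.0]                # distances from x1 to x2..x5
sigma_x1 = 1.0 / sum(sorted(d_x1)[:2])     # 1 / (0.2 + 0.8) = 1.0
v_x2 = 1.0 * 0.2 * sigma_x1                # w(x2, x1) = 1/pi(1) = 1
print(v_x2)                                # 0.2 under this reading
```

Whether this matches the authors' intent, and what $K$ they actually used, is exactly what I'm unsure about.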

I'd really appreciate it if someone more experienced could point me in the right direction.
