While working through the MagicAgent paper recently, I came across a passage in Section 4 saying that the model is trained on different subsets of synthetic data. To ensure data diversity, the framework adopts the NovelSum metric, which computes a novelty score for each candidate sample from its proximity weights and density factors.

Paper link: https://arxiv.org/abs/2602.19000

Relevant excerpt from the paper:

We adopt the Qwen3 series as the base model and further train it using our synthetic dataset, including the Hierarchical Task Decomposition, Tool-Augmented Planning, Multi-Constraint Scheduling, Procedural Logic Orchestration, and Long-Horizon Tool Execution tasks. To ensure comprehensive coverage and diversity across domains, we employ the NovelSum metric (Yang et al., 2025b) to construct a representative sub-training set $\mathcal{D}_{\text{sample}}$, thereby facilitating the effective fusion of heterogeneous planning data. Specifically, the novelty of a candidate sample $x$ relative to the currently selected training set $\mathcal{D}_{\text{sample}}$ is defined as follows:
$$v(x) = \sum_{x_j \in \mathcal{D}_{\text{sample}}} w(x, x_j)^{\kappa_1} \cdot \sigma(x_j)^{\kappa_2} \cdot d(x, x_j), \tag{1}$$
where the proximity weight $w(x, x_j) = \frac{1}{\pi(j)}$ assigns higher importance to closer neighbors, with $\pi(j)$ indicating the rank of $x_j$ when samples are ordered by increasing distance to $x$. The density factor is defined as $\sigma(x_j) = \frac{1}{\sum_{k=1}^{K} d(x_j, N_k(x_j))}$, $N_k(x_j)$ denotes the $k$-th nearest neighbor of $x$, and $d(\cdot, \cdot)$ denotes the distance between sample embeddings. The hyper-parameters $\kappa_1$ and $\kappa_2$ control the degree of the proximity weight and density factor, respectively. At each iteration, the sample $x_{\text{high}}$ with the highest novelty score is selected with $x_{\text{high}} = \arg\max_x v(x)$, and added to the training set as $\mathcal{D}_{\text{sample}} \leftarrow \mathcal{D}_{\text{sample}} \cup \{x_{\text{high}}\}$. This procedure is repeated, starting from $\mathcal{D}_{\text{sample}} = \emptyset$, until the predefined data budget is met.
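The greedy loop described above can be sketched in Python. This is only my own reading of the excerpt, not the authors' code: in particular, the value of $K$ and where the $K$ nearest neighbors for $\sigma(x_j)$ come from (here I assume the full candidate pool, excluding $x_j$ itself) are assumptions, and `novelty` / `novelsum_select` are names of my choosing.

```python
import numpy as np

def novelty(x_idx, selected, dist, kappa1=1.0, kappa2=1.0, K=2):
    """Novelty v(x) of candidate x_idx w.r.t. the selected set, per Eq. (1)."""
    # pi(j): rank of x_j among the selected samples, ordered by distance to x
    order = sorted(selected, key=lambda j: dist[x_idx, j])
    v = 0.0
    for rank, j in enumerate(order, start=1):
        w = 1.0 / rank                       # proximity weight w = 1/pi(j)
        # density factor sigma(x_j): ASSUMPTION -- the K nearest neighbors
        # are taken from the full candidate pool, excluding x_j itself
        knn = np.sort(np.delete(dist[j], j))[:K]
        sigma = 1.0 / knn.sum()
        v += (w ** kappa1) * (sigma ** kappa2) * dist[x_idx, j]
    return v

def novelsum_select(dist, budget, seed, **kw):
    """Greedy selection: repeatedly add argmax_x v(x) until the budget is met."""
    n = dist.shape[0]
    selected = [seed]        # the paper starts from an empty set; the first
                             # argmax is degenerate (v is an empty sum), so
                             # the first pick is seeded explicitly here
    while len(selected) < budget:
        remaining = [i for i in range(n) if i not in selected]
        best = max(remaining, key=lambda i: novelty(i, selected, dist, **kw))
        selected.append(best)
    return selected
```

`dist` is a symmetric pairwise-distance matrix over the candidate pool; with a budget and a seed index, `novelsum_select` returns the indices of the chosen sub-training set.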

The loss function formulation for the optimization objective of the SFT stage is defined as the standard token-level cross-entropy loss:

$$L_{\text{sft}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{V} y_{ij} \log(p_{ij}), \tag{2}$$

where $N$ is the sequence length, $V$ is the vocabulary size, $y_{ij}$ is the ground-truth indicator for token $i$, and $p_{ij}$ is the predicted probability for class $j$ at token $i$.
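As a sanity check, Eq. (2) is just the standard per-token cross-entropy and can be written out directly with NumPy; `y` (one-hot ground truth) and `p` (predicted distributions) are names of my choosing.

```python
import numpy as np

def sft_loss(y, p):
    """Eq. (2): token-level cross-entropy, averaged over the N tokens.

    y: (N, V) one-hot ground-truth indicators
    p: (N, V) predicted probabilities (each row sums to 1)
    """
    N = y.shape[0]
    # y zeroes out every term except -log p of the ground-truth class
    return -np.sum(y * np.log(p)) / N
```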


I wanted to work through a concrete instance of this formula to see how candidate data gets selected into $\mathcal{D}_{\text{sample}}$, but I could not figure out how to plug numbers in. Suppose I have five samples $x_1, x_2, x_3, x_4, x_5$, with the pairwise distances $d$ between them given in the table below:

| Distance $d$ | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |
| --- | --- | --- | --- | --- | --- |
| $x_1$ | 0 | 0.2 | 0.8 | 0.9 | 1.0 |
| $x_2$ | 0.2 | 0 | 0.75 | 0.85 | 0.95 |
| $x_3$ | 0.8 | 0.75 | 0 | 0.3 | 0.35 |
| $x_4$ | 0.9 | 0.85 | 0.3 | 0 | 0.25 |
| $x_5$ | 1.0 | 0.95 | 0.35 | 0.25 | 0 |

In the initial state, I set the hyper-parameters $\kappa_1 = 1$ and $\kappa_2 = 1$. At this point $\mathcal{D}_{\text{sample}} = \emptyset$; as a non-rigorous shortcut, instead of making the first pick at random I simply fix the first selected sample to be $x_1$, i.e. $\mathcal{D}_{\text{sample}} = \{x_1\}$.
Next, in the first round, we need to compute $v(x)$ for the remaining candidates using Equation (1) above.
Following the definitions, I compute
$$v(x_2) = w(x_2, x_1) \cdot \sigma(x_1) \cdot d(x_2, x_1) = 1 \cdot 0.2 \cdot \sigma(x_1) = \frac{1}{\sum_{k=1}^{K} d(x_1, N_k(x_1))} \cdot 0.2$$
But at this step I'm stuck: I don't see how to evaluate the denominator $\sum_{k=1}^{K} d(x_1, N_k(x_1))$ — what is $K$, and which samples count as the neighbors $N_k(x_1)$?
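To make the ambiguity concrete, here is the one reading I can come up with (an assumption on my part, not something the excerpt confirms): treat $K$ as a hyper-parameter and take the $K$ nearest neighbors of $x_1$ from the full candidate pool, excluding $x_1$ itself. With $K = 2$ and the distance table above, the calculation would go:

```python
# ASSUMPTION: N_k(x_1) ranges over the K nearest neighbors of x_1 in the
# full pool {x_2..x_5}, with K = 2 as an illustrative hyper-parameter value.
d_x1 = [0.2, 0.8, 0.9, 1.0]                # distances from x1 to x2..x5
sigma_x1 = 1.0 / sum(sorted(d_x1)[:2])     # 1 / (0.2 + 0.8) = 1.0
v_x2 = 1.0 * 0.2 * sigma_x1                # w(x2, x1) = 1/pi(1) = 1
print(v_x2)                                # 0.2 under this reading
```

Whether this matches the authors' intent, and what $K$ they actually used, is exactly what I'm unsure about.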

I'd really appreciate it if someone more experienced could point me in the right direction.
