[Shandong University School of Software Innovation Project Training] (4) Data Processing

I Ferb · Published 2025-04-12 23:03:47

Dataset overview: ROCStories

ROCStories and the Story Cloze Test

[1604.01696] A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories

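Released in 2016, the corpus contains roughly 100,000 everyday English stories (98,159 stories, plus 3,744 stories without titles). A story consists of a title (where present) and five sentences, and it captures the common causal and temporal relations between everyday events.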

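The Story Cloze Test (SCT) dataset was introduced in the same 2016 paper by researchers from the University of Rochester, Microsoft Research, and collaborating institutions. It is designed to evaluate how well natural language processing systems understand and generate coherent stories. Its core task is to predict the ending of a story given the first four sentences, which tests a model's grasp of plot development and logical reasoning.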

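The traditional evaluation method, the Narrative Cloze Test:

Given a set of observed events, predict a held-out event (a verb and its arguments).

For example: {X threw, pulled X, told X, ???, X completed}.

This evaluation does not push models to learn commonsense knowledge or to make their language and generated text more logical; instead, it leads models to focus on how to pass a shallow test.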
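
To make the Narrative Cloze setup concrete, here is a minimal sketch in which a held-out event is ranked by how often it co-occurs with the observed events. The toy chains, the `(verb, role)` event encoding, and the plain co-occurrence count are all assumptions made for illustration; published systems use more refined statistics.

```python
from collections import Counter
from itertools import combinations

# Toy event chains, invented for illustration. Each event is a
# (verb, protagonist_role) pair, e.g. ("throw", "subj") for "X threw".
TRAIN_CHAINS = [
    [("throw", "subj"), ("pull", "obj"), ("tell", "obj"), ("start", "subj"), ("complete", "subj")],
    [("tell", "obj"), ("start", "subj"), ("complete", "subj")],
    [("throw", "subj"), ("catch", "subj")],
]


def count_cooccurrences(chains):
    """Count how many chains contain each unordered pair of events."""
    pair_counts = Counter()
    for chain in chains:
        for pair in combinations(sorted(set(chain)), 2):
            pair_counts[pair] += 1
    return pair_counts


def rank_candidates(observed, candidates, pair_counts):
    """Rank candidate events by summed co-occurrence with the observed events."""
    def score(candidate):
        return sum(pair_counts[tuple(sorted((candidate, event)))] for event in observed)
    return sorted(candidates, key=score, reverse=True)


pair_counts = count_cooccurrences(TRAIN_CHAINS)
# The chain {X threw, pulled X, told X, ???, X completed} with one event held out.
observed = [("throw", "subj"), ("pull", "obj"), ("tell", "obj"), ("complete", "subj")]
ranking = rank_candidates(observed, [("catch", "subj"), ("start", "subj")], pair_counts)
print(ranking[0])
```

A scorer like this can do well on the test purely from event statistics, without any understanding of why the events follow one another, which is the weakness described above.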

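The Story Cloze Test proposed with ROCStories:

Each item provides the first four sentences, one correct ending, one illogical ending, and a label. The fields are as follows: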

{
  "InputSentence1": "Rick grew up in a troubled household.",
  "InputSentence2": "He never found good support in family, and turned to gangs.",
  "InputSentence3": "It wasnt long before Rick got shot in a robbery.",
  "InputSentence4": "The incident caused him to turn a new leaf.",
  "RandomFifthSentenceQuiz1": "He is happy now.",
  "RandomFifthSentenceQuiz2": "He joined a gang.",
  "AnswerRightEnding": 1
}
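
Scoring a system on this format amounts to checking whether it prefers the ending named by `AnswerRightEnding`. The sketch below assumes a scoring function `score_fn(context, ending)` that returns a higher value for the preferred ending; `overlap_score` is only a placeholder baseline, not a method used in this project.

```python
import re

ITEM = {
    "InputSentence1": "Rick grew up in a troubled household.",
    "InputSentence2": "He never found good support in family, and turned to gangs.",
    "InputSentence3": "It wasnt long before Rick got shot in a robbery.",
    "InputSentence4": "The incident caused him to turn a new leaf.",
    "RandomFifthSentenceQuiz1": "He is happy now.",
    "RandomFifthSentenceQuiz2": "He joined a gang.",
    "AnswerRightEnding": 1,
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def overlap_score(context, ending):
    """Placeholder scorer: number of distinct words shared with the context."""
    return len(tokenize(context) & tokenize(ending))


def choose_ending(item, score_fn):
    """Return 1 or 2, the ending that score_fn rates higher (ties go to 1)."""
    context = " ".join(item[f"InputSentence{i}"] for i in range(1, 5))
    score1 = score_fn(context, item["RandomFifthSentenceQuiz1"])
    score2 = score_fn(context, item["RandomFifthSentenceQuiz2"])
    return 1 if score1 >= score2 else 2


def story_cloze_accuracy(items, score_fn):
    """Fraction of items on which the chosen ending matches the label."""
    correct = sum(choose_ending(item, score_fn) == int(item["AnswerRightEnding"]) for item in items)
    return correct / len(items)


print(choose_ending(ITEM, overlap_score))
```

On the sample item the overlap baseline picks ending 2, because "He joined a gang." shares both "he" and "a" with the context while "He is happy now." shares only "he". Surface cues alone therefore choose the illogical ending here.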

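We try to build datasets from ROCStories so that fine-tuning improves the logic and coherence of the stories an LLM generates. This requires two abilities: understanding stories (logic) and generating stories (coherence).

We tried several ways of constructing the fine-tuning dataset:

A. Generate a story from keywords supplied by the user:

The user input is constructed by extracting keywords back out of each story.

Use spaCy to tokenize the text and tag parts of speech.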

import pandas as pd
import spacy
import json
from collections import Counter

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

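Keep only words tagged as nouns (NOUN), verbs (VERB), or adjectives (ADJ);

Explicitly filter out stop words (such as "is" and "are") so that meaningless words are not selected;

Count the remaining words and select keywords by frequency; each word appears once in the result, and words with equal counts keep their original order of appearance.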

# Stop-word filtering and keyword extraction
def extract_keywords_with_spacy(story, num_keywords=5):
    # Join the story's sentences into a single text
    full_story = " ".join(story)
    # Process the text with spaCy
    doc = nlp(full_story)
    # Keep nouns, verbs, and adjectives, excluding stop words
    words = [token.text for token in doc if token.pos_ in {"NOUN", "VERB", "ADJ"} and not token.is_stop]
    # Count word frequencies
    word_counts = Counter(words)
    # Take the num_keywords most frequent words as keywords
    keywords = [word for word, _ in word_counts.most_common(num_keywords)]
    return keywords
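
The selection step can be tried without spaCy. The sketch below applies the same `Counter.most_common` logic to a hand-written word list that only approximates what the part-of-speech and stop-word filter might keep for one story; the list is an assumption, not real spaCy output. Since Python 3.7, `most_common` orders words with equal counts by first appearance.

```python
from collections import Counter


def top_keywords(content_words, num_keywords=5):
    """Pick the most frequent words; ties keep their first-seen order."""
    counts = Counter(content_words)
    return [word for word, _ in counts.most_common(num_keywords)]


# Hand-written approximation of the filtered tokens of one story.
content_words = [
    "recalled", "flourishing", "garden", "mother", "house",
    "decided", "start", "garden", "yard",
    "gathering", "supplies", "seeds", "started", "digging",
    "grove", "digging", "shovel", "hit", "item",
    "pleased", "discover", "safe", "buried", "jewels", "yard",
]

print(top_keywords(content_words))
```

Words that occur twice ("garden", "yard", "digging") come first, followed by the earliest single-occurrence words, so the position of a keyword in the list reflects both frequency and story order.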

Build the Alpaca-format dataset:

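def create_alpaca_dataset(df, output_file, num_samples=1000):
    alpaca_data = []
    for i in range(min(num_samples, len(df))):
        # Get the story content of each row
        story = df.iloc[i, 2:7].tolist()  # sentence1 through sentence5
        # Extract keywords
        keywords = extract_keywords_with_spacy(story)
        # Build the Alpaca-format entry (the prompts are kept in Chinese, as used for training).
        # instruction: "Use the words provided below to write a fun story. The plot should be
        #   vivid and coherent, and the words should be woven into the story naturally."
        # system: "You are an assistant skilled at writing English-learning stories with rich
        #   plots and a sense of fun; the stories must naturally incorporate the given words
        #   to aid memorization."
        entry = {
            "instruction": "请使用下面提供的单词创作一个趣味故事，要求故事情节生动、连贯，并巧妙地将这些单词自然地融入故事中。",
            "input": ", ".join(keywords),
            "output": " ".join(story),
            "system": "你是一个擅长创作富有情节和趣味性的英文学习故事的助手，故事需自然地融合指定单词，帮助记忆。",
            "history": []
        }
        alpaca_data.append(entry)

    # Save as a JSON file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(alpaca_data, f, ensure_ascii=False, indent=2)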
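
The function above selects the five sentences by position (`df.iloc[i, 2:7]`), which silently breaks if the column order differs. A minimal sketch of reading them by name instead, using only the standard library; the header names `storyid`, `storytitle`, and `sentence1` to `sentence5` are assumed from the code comment and should be checked against the actual file.

```python
import csv
import io

# Assumed column names for the five story sentences.
SENTENCE_COLUMNS = [f"sentence{i}" for i in range(1, 6)]


def read_stories(csv_file):
    """Read each row's five sentences by column name, failing loudly if any is absent."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in SENTENCE_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [[row[c].strip() for c in SENTENCE_COLUMNS] for row in reader]


# In-memory stand-in for the CSV file, invented for illustration.
sample = io.StringIO(
    "storyid,storytitle,sentence1,sentence2,sentence3,sentence4,sentence5\n"
    "id-0,Title,One.,Two.,Three.,Four.,Five.\n"
)
stories = read_stories(sample)
print(stories)
```

With a real file, `read_stories(open(path, newline="", encoding="utf-8"))` would yield the same list-of-sentences shape that `extract_keywords_with_spacy` expects.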

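Entries in the generated JSON file look like this (the instruction asks for a fun, vivid, coherent story that naturally weaves in the given words; the system prompt describes an assistant that writes entertaining English-learning stories incorporating those words to aid memorization):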

  {
    "instruction": "请使用下面提供的单词创作一个趣味故事，要求故事情节生动、连贯，并巧妙地将这些单词自然地融入故事中。",
    "input": "garden, yard, digging, recalled, flourishing",
    "output": "Andy recalled the flourishing garden at his mother's house. Andy decided that he would start a garden in his yard. After gathering the supplies and seeds, Andy started digging. Andy was in the grove of digging, when his shovel hit a item. Andy was pleased to discover a safe buried with jewels in his yard.",
    "system": "你是一个擅长创作富有情节和趣味性的英文学习故事的助手，故事需自然地融合指定单词，帮助记忆。",
    "history": []
  },

Chatting with the fine-tuned model:

Model checkpoint path: train_2025-04-10-17-23-52

———————————————————————————————————————————

instruction:

Use the words provided below to write a fun story. The plot should be vivid and coherent, and the words should be woven into the story naturally.

input:cartoon sexy lips

output:

Bessie was a very sexy cartoon. She had a very big mouth. Bessie was always having people come to her for advice. Bessie had a very popular show on TV. Bessie was a very big star. Bessie was a very sexy cartoon.

———————————————————————————————————————————

instruction:

Use the words provided below to write a fun story. The plot should be vivid and coherent, and the words should be woven into the story naturally.

input:elephant philosophy smooth

output:

I went to the elephant's philosophy class.He taught me about life and the universe.He was very smooth and cool.Iwanted to learn more.I asked him to teach me how to meditate

———————————————————————————————————————————

instruction:

Use the words provided below to write a fun story. The plot should be vivid and coherent, and the words should be woven into the story naturally.

input:tank phobia university

output:

Trent had a phobia of university. Trent was very afraid of university. Trent was very scared of the university. Trent was very scared of the university. Trent was very afraid of the university. Trent had a phobia of university.

———————————————————————————————————————————

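The model has clearly adopted the persona from the system prompt ("You are an assistant skilled at writing English-learning stories with rich plots and a sense of fun; the stories must naturally incorporate the given words to aid memorization.") and can generate five-sentence short stories in the same format as the dataset.

(We never asked for five-sentence stories; this may be overfitting caused by too much training data.)

However, the model does not produce correct output reliably. Sometimes the five sentences of a story contain repeated sentences, and once this happens it tends to recur frequently. The stories are also sometimes weak in logic, merely stitching together one sentence per keyword.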
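
The repetition problem can be flagged automatically. The following is a minimal sketch, assuming a naive sentence split on `.`, `!`, and `?` (adequate for short ROCStories-style outputs, not for general text); the function names are illustrative.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naively split text into sentences at '.', '!' or '?'."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]


def repeated_sentences(text):
    """Map each sentence that occurs more than once (case-insensitive) to its count."""
    normalized = [re.sub(r"\s+", " ", s.lower()) for s in split_sentences(text)]
    counts = Counter(normalized)
    return {sentence: n for sentence, n in counts.items() if n > 1}


OUTPUT = (
    "Trent had a phobia of university. Trent was very afraid of university. "
    "Trent was very scared of the university. Trent was very scared of the university. "
    "Trent was very afraid of the university. Trent had a phobia of university."
)
print(repeated_sentences(OUTPUT))
```

For the "tank phobia university" output this reports two sentences that each appear twice. Near-duplicates such as "afraid of university" versus "afraid of the university" are not caught by exact matching and would need a fuzzier comparison.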

B. Based on the Story Cloze Test, guide the model to choose the more logical ending:

Build the preference dataset:

def create_preference_dataset(df, output_file):
    preference_data = []
    for _, row in df.iterrows():
        # Build the story context from the first four sentences
        story = " ".join([
            row["InputSentence1"],
            row["InputSentence2"],
            row["InputSentence3"],
            row["InputSentence4"]
        ])

        # Decide chosen and rejected from AnswerRightEnding
        if row["AnswerRightEnding"] == 1:
            chosen = row["RandomFifthSentenceQuiz1"]
            rejected = row["RandomFifthSentenceQuiz2"]
        else:
            chosen = row["RandomFifthSentenceQuiz2"]
            rejected = row["RandomFifthSentenceQuiz1"]

        # Build the preference entry (the instruction is kept in Chinese, as used for training).
        # instruction: "Choose the more suitable ending for the following story."
        entry = {
            "instruction": "请根据以下故事选择一个更合适的结尾。",
            "input": story,
            "chosen": chosen,
            "rejected": rejected
        }
        preference_data.append(entry)

    # Save as a JSON file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(preference_data, f, ensure_ascii=False, indent=2)
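
One fragile spot in the function above is the comparison `row["AnswerRightEnding"] == 1`: if the label column were ever loaded as text, every row would fall into the `else` branch and the pairs would be silently swapped for label-1 rows. Whether that happens depends on how the file is read, so the following is only a defensive sketch; `pick_endings` is a hypothetical helper, not part of the original code.

```python
def pick_endings(row):
    """Return (chosen, rejected) for one Story Cloze row, validating the label."""
    label = int(row["AnswerRightEnding"])  # accepts 1, "1", and similar
    if label not in (1, 2):
        raise ValueError(f"unexpected AnswerRightEnding: {row['AnswerRightEnding']!r}")
    endings = (row["RandomFifthSentenceQuiz1"], row["RandomFifthSentenceQuiz2"])
    # label 1 -> (ending 1, ending 2); label 2 -> (ending 2, ending 1)
    return endings[label - 1], endings[2 - label]


row = {
    "RandomFifthSentenceQuiz1": "He is happy now.",
    "RandomFifthSentenceQuiz2": "He joined a gang.",
    "AnswerRightEnding": "1",  # a string label, as it might arrive from a text-typed column
}
print(pick_endings(row))
```

The helper works on any mapping with these three keys, so it could be called on the rows yielded by `df.iterrows()` as well as on plain dictionaries.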

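Entries in the generated JSON file look like this (the instruction reads "Choose the more suitable ending for the following story."):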

  {
    "instruction": "请根据以下故事选择一个更合适的结尾。",
    "input": "Laverne needs to prepare something for her friend's party. She decides to bake a batch of brownies. She chooses a recipe and follows it closely. Laverne tests one of the brownies to make sure it is delicious.",
    "chosen": "The brownies are so delicious Laverne eats two of them.",
    "rejected": "Laverne doesn't go to her friend's party."
  },

C. Randomly delete one sentence from a complete story and guide the model to infer the missing plot:
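
Method C could be built along the following lines. This is a minimal sketch under assumptions: the `[MISSING]` placeholder, the instruction wording, and the choice to emit supervised (input/output) entries are all hypothetical and may differ from the format actually used.

```python
import random

MASK = "[MISSING]"  # hypothetical placeholder marking the deleted sentence
# Hypothetical instruction wording.
INFILL_INSTRUCTION = "One sentence of the story is missing. Write the missing sentence."


def make_infill_example(sentences, rng):
    """Remove one randomly chosen sentence and ask the model to restore it."""
    idx = rng.randrange(len(sentences))
    masked = list(sentences)
    missing = masked[idx]
    masked[idx] = MASK
    return {"instruction": INFILL_INSTRUCTION, "input": " ".join(masked), "output": missing}


story = [
    "Laverne needs to prepare something for her friend's party.",
    "She decides to bake a batch of brownies.",
    "She chooses a recipe and follows it closely.",
    "Laverne tests one of the brownies to make sure it is delicious.",
    "The brownies are so delicious Laverne eats two of them.",
]
example = make_infill_example(story, random.Random(0))
print(example["input"])
```

Passing a seeded `random.Random` keeps the choice of deleted sentence reproducible across runs, which makes it easier to compare fine-tuning experiments on the same data.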

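In all three approaches we kept the words in their original forms (part of speech, plurals, tense, and so on) to preserve the diversity of the data. Even so, all three approaches overfit: DeepSeek's deep-thinking output disappeared, the returned stories always had exactly the same format as our training stories, and meaningless word repetition was frequent.

In the end we took 2,000 stories for training, mixed in the latter two datasets described above, and trained in stages. In the first stage the model is given the story plot and predicts the ending, so that it learns story coherence and emotion. In the second stage the model is guided to generate a complete story from keywords, which teaches it to use the words naturally while generating.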
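
One way to assemble data for the two stages is sketched below. The entry formats, both instruction strings, and the `keywords_fn` parameter (a stand-in for a keyword extractor such as `extract_keywords_with_spacy`) are assumptions; in particular, stage one is shown as supervised ending prediction for simplicity, while preference-format data from method B could be used in its place.

```python
import random

# Hypothetical instruction strings.
STAGE1_INSTRUCTION = "Read the beginning of the story and write its ending."
STAGE2_INSTRUCTION = "Write a short story that naturally uses the given words."


def build_stage_datasets(stories, keywords_fn, num_stories=2000, seed=0):
    """Sample five-sentence stories and derive one training set per stage."""
    rng = random.Random(seed)
    picked = rng.sample(stories, min(num_stories, len(stories)))
    # Stage one: first four sentences in, fifth sentence out.
    stage1 = [
        {"instruction": STAGE1_INSTRUCTION, "input": " ".join(s[:4]), "output": s[4]}
        for s in picked
    ]
    # Stage two: keywords in, full story out.
    stage2 = [
        {"instruction": STAGE2_INSTRUCTION, "input": ", ".join(keywords_fn(s)), "output": " ".join(s)}
        for s in picked
    ]
    return stage1, stage2


# Toy stories and a toy keyword function, invented for illustration.
toy_stories = [[f"Story {n} sentence {k}." for k in range(1, 6)] for n in range(3)]
stage1, stage2 = build_stage_datasets(toy_stories, lambda s: s[0].split()[:2], num_stories=2)
print(len(stage1), len(stage2))
```

Training would then run on `stage1` first and continue from that checkpoint on `stage2`.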

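Fine-tuning record:

prompt

Use the words provided below to write a fun story. The plot should be vivid and coherent, and the words should be woven into the story naturally.

Keywords: cartoon sexy lips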

I had a bite from a dog. It hurt so bad I developed a cartilage allergy. I became a professional dog trainer. One day I saw a "sexy lips" cartoon. It was so cute!

Keywords: elephant philosophy smooth

Daryl was an elephant with a smooth skin. Daryl was very popular in the herd. Daryl was very smooth and kind. Daryl had a great philosophy. Daryl was very smooth and kind.

Keywords: elephant philosophy smooth

Daryl was an elephant with a smooth skin. Daryl was very popular in the herd. Daryl was very smooth and kind. Daryl had a great philosophy. Daryl was very smooth and kind.

Keywords: elephant philosophy smooth

I went to the elephant's philosophy class.He taught me about life and the universe.He was very smooth and cool.Iwanted to learn more.I asked him to teach me how to meditate

Keywords: tank phobia university

Frank was afraid of tanks. He went to the university and he was assigned to take a course in military history. Frank was scared to death. He avoided all the lectures. Finally, he had to take a test. He was still scared. He ran away from the test and dropped the course.
