[Study Notes] Preprocessing and Exploring the SQuAD Dataset with Python transformers
1. Dataset Overview
SQuAD official website
SQuAD (Stanford Question Answering Dataset) is a widely used machine reading comprehension dataset developed at Stanford University. It is used to train and evaluate question-answering systems, and it tests a model's ability to understand questions and answers grounded in natural-language text.
Detailed introduction: An Introduction to the SQuAD Dataset
1.1 Evaluation Metrics
SQuAD is evaluated with two metrics: Exact Match (EM) and the F1 score, a partial-match measure.
1.1.1 Exact Match (EM)
Exact match scores a prediction as correct only when the model's answer is identical to the reference answer.
If the model's answer exactly matches the reference answer, the EM score is 1; otherwise it is 0.
Formula:
EM = 1 if the answer is identical to the reference answer;
EM = 0 if the answer differs from the reference answer.
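A minimal EM sketch (assuming SQuAD's usual answer normalization of lowercasing, stripping punctuation and the articles a/an/the, and collapsing whitespace; the helper names are illustrative):

import re
import string

def normalize_answer(s):
    # SQuAD-style normalization: lowercase, drop punctuation and articles, collapse whitespace
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    # EM = 1 when the normalized strings are identical, else 0
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))

print(exact_match("Rome", "rome."))    # 1
print(exact_match("in Rome", "Rome"))  # 0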
1.1.2 Partial Match (F1 Score)
Partial match evaluates answer similarity by comparing the tokens shared between the model's answer and the reference answer.
The F1 score is computed from the degree of overlap between the two; prediction denotes the model's answer, and ground truth denotes the reference answer.
Computation:
First, split the prediction and the ground truth into lists of individual words or characters.
Then, count the tokens that the prediction shares with the ground truth; call this count_same.
Next, compute precision and recall from this overlap, and combine them into the F1 score:
Precision: precision = count_same / len(prediction)
Recall: recall = count_same / len(ground_truth)
F1 score: F1 = 2 * (precision * recall) / (precision + recall)
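A minimal token-overlap F1 sketch in this spirit (reusing the normalize_answer helper sketched above; Counter intersection counts shared tokens with multiplicity):

from collections import Counter

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # shared tokens, with multiplicity
    count_same = sum(common.values())
    if count_same == 0:
        return 0.0
    precision = count_same / len(pred_tokens)
    recall = count_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("over 1,600 wins", "over 1,600"))  # precision 2/3, recall 2/2 -> 0.8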
1.1.2.1 What the F1 Score Measures
The F1 score is a common metric for binary classification that combines a model's precision and recall. Intuitively, it measures the model's overall ability to identify positive and negative samples.
In binary classification, we usually care about two quantities:
Precision: the fraction of samples predicted positive that are actually positive. Precision measures how accurate the model is on the samples it predicts as positive.
Recall: the fraction of actual positive samples that the model predicts as positive. Recall measures how well the model covers the actual positives.
The F1 score combines the two by taking their harmonic mean, giving a single number that reflects performance when precision and recall are weighed together.
F1 ranges from 0 to 1, where 1 is best and 0 is worst. When both precision and recall are high, F1 approaches 1; when either one drops, F1 decreases.
F1 is widely used in machine learning evaluation, especially on imbalanced datasets or whenever both precision and recall matter. Optimizing a model's F1 score seeks a balanced, overall performance on positive and negative samples.
1.2 Data Structure
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
A single example:
{'id': '5733bed24776f41900661188',
'title': 'University_of_Notre_Dame',
'context': 'The university is the major seat of the Congregation of Holy Cross (albeit not its official headquarters, which are in Rome). Its main seminary, Moreau Seminary, is located on the campus across St. Joseph lake from the Main Building. Old College, the oldest building on campus and located near the shore of St. Mary lake, houses undergraduate seminarians. Retired priests and brothers reside in Fatima House (a former retreat center), Holy Cross House, as well as Columba Hall near the Grotto. The university through the Moreau Seminary has ties to theologian Frederick Buechner. While not Catholic, Buechner has praised writers from Notre Dame and Moreau Seminary created a Buechner Prize for Preaching.',
'question': 'Where is the headquarters of the Congregation of the Holy Cross?',
'answers': {'text': ['Rome'], 'answer_start': [119]}}
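Note that answer_start is a character offset into context, so the reference answer can be recovered by slicing. A quick sanity check (assuming example holds the record shown above):

answers = example['answers']
start = answers['answer_start'][0]
text = answers['text'][0]
# The slice of the context at the stored offset must equal the answer text
assert example['context'][start:start + len(text)] == text  # 'Rome'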
2. Data Preprocessing
2.1 Displaying Random Examples
from datasets import load_dataset
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=3):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw num_examples distinct random indices
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    print(dataset.features.items())
    for column, typ in dataset.features.items():
        print(column, typ)
        # Convert label indices back to label names for display
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    # display(HTML(df.to_html()))
    print(df)

"""
"squad_v2" if squad_v2 else "squad" is a conditional (ternary) expression:
if squad_v2 is True, it returns "squad_v2", loading the SQuAD 2.0 dataset;
if squad_v2 is False, it returns "squad", loading the SQuAD 1.1 dataset."""
squad_v2 = False
datasets = load_dataset("squad_v2" if squad_v2 else "squad")
show_random_elements(datasets["train"])
Output:
dict_items([('id', Value(dtype='string', id=None)), ('title', Value(dtype='string', id=None)), ('context', Value(dtype='string', id=None)), ('question', Value(dtype='string', id=None)), ('answers', Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None))])
id Value(dtype='string', id=None)
title Value(dtype='string', id=None)
context Value(dtype='string', id=None)
question Value(dtype='string', id=None)
answers Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)
id ... answers
0 572812cd2ca10214002d9d4a ... {'text': ['group homomorphisms'], 'answer_star...
1 56faebc18f12f319006302d3 ... {'text': ['Nadifa Mohamed'], 'answer_start': [...
2 5726e490dd62a815002e9420 ... {'text': ['Rabobank, a large bank, has its hea...
[3 rows x 5 columns]
Regarding ClassLabel and Sequence in from datasets import ClassLabel, Sequence:
In Hugging Face's datasets library, ClassLabel and Sequence are classes for describing and processing features in a dataset.
2.1.1 What ClassLabel Does
- ClassLabel represents a feature with discrete categories, typically the labels in a classification task.
- It provides a convenient way to convert between label indices and label names, as well as operations on and access to the class labels.
- A ClassLabel instance can be used as a feature type in a dataset to specify the feature's set of class labels.
from datasets import ClassLabel

# Define a ClassLabel feature
label_feature = ClassLabel(names=["cat", "dog", "bird"])

# Number of class labels
num_labels = label_feature.num_classes
print(num_labels)

# Convert an index to a label name
label_index = 1
label_name = label_feature.names[label_index]
print(label_name)

# Convert a label name to an index
label_name = "bird"
label_index = label_feature.str2int(label_name)
print(label_index)

print(label_feature)
Output:
3
dog
2
ClassLabel(names=['cat', 'dog', 'bird'], id=None)
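ClassLabel also exposes int2str as the inverse of str2int; a small illustration with the same label_feature as above:

# Convert indices back to names, and names to indices
print(label_feature.int2str(2))      # bird
print(label_feature.str2int("cat"))  # 0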
2.1.2 What Sequence Does
- Sequence is a container class in Hugging Face's datasets library for representing sequence features, such as the token sequence of a text or time-series data.
- It allows sequence data to be operated on, accessed, and converted, and provides methods for handling it.
Code:
from datasets import Sequence, Value

# Define a feature holding a sequence of float32 values
sequence_feature = Sequence(Value("float32"))

# An example sequence
example_sequence = [1, 2, 3]

print(sequence_feature)
print(type(sequence_feature))
print(sequence_feature.length)
print(sequence_feature.feature)

# Note: this overwrites the feature *definition* with a raw list, which is not how
# Sequence is normally used; it is done here only to inspect the object's attributes.
sequence_feature.feature = example_sequence
print(type(sequence_feature.feature[0]))
print(sequence_feature)
Output:
Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None)
<class 'datasets.features.features.Sequence'>
-1
Value(dtype='float32', id=None)
<class 'int'>
Sequence(feature=[1, 2, 3], length=-1, id=None)
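In practice, Sequence is normally declared inside a Features schema rather than mutated in place; a minimal sketch (the column names here are illustrative):

from datasets import Dataset, Features, Sequence, Value

features = Features({
    "tokens": Sequence(Value("string")),
    "scores": Sequence(Value("float32")),
})
ds = Dataset.from_dict(
    {"tokens": [["hello", "world"]], "scores": [[0.1, 0.9]]},
    features=features,
)
print(ds.features)
print(ds[0])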
2.2 Tokenizer Processing
Code:
from datasets import load_dataset
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from transformers import AutoTokenizer
import transformers

"""
"squad_v2" if squad_v2 else "squad" is a conditional (ternary) expression:
if squad_v2 is True, it returns "squad_v2", loading the SQuAD 2.0 dataset;
if squad_v2 is False, it returns "squad", loading the SQuAD 1.1 dataset."""
squad_v2 = False
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Assert that our tokenizer is a fast tokenizer (Rust implementation), which has speed
# and feature advantages, such as offset mappings.
# isinstance() is a built-in function that checks whether an object is an instance of a given class/type.
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

print(tokenizer("What is your name?", "My name is Sylvain."))
print(tokenizer("What is your name?"))
print(tokenizer("My name is Sylvain."))
"""
在问答预处理中的一个特定问题是如何处理非常长的文档。
在其他任务中,当文档的长度超过模型最大句子长度时,我们通常会截断它们,但在这里,删除上下文的一部分可能会导致我们丢失正在寻找的答案。
为了解决这个问题,我们允许数据集中的一个(长)示例生成多个输入特征,每个特征的长度都小于模型的最大长度(或我们设置的超参数)。
"""
# The maximum length of a feature (question and context)
max_length = 384
print("--------------------在训练集数据中找到第一个超过384个标记的样本,并停止遍历----------------------")
for i, example in enumerate(datasets["train"]):
if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
break
example = datasets["train"][i]
print(example)
print(tokenizer(example["question"], example["context"]))
print("--------------------截断上下文不保留超出部分------------------------")
tokenized_example0 = tokenizer(example["question"],
example["context"],
max_length=max_length,
truncation="only_second")
print(tokenized_example0)
print(tokenized_example0["input_ids"])
print("--------------------截断上下文,保留问题和超出部分------------------------")
"""
直接截断超出部分: truncation=only_second
仅截断上下文(context),保留问题(question):return_overflowing_tokens=True & 设置stride"""
# 当需要拆分时,上下文的两个部分之间的授权重叠
# The authorized overlap between two part of the context when splitting it is needed.
doc_stride = 120
tokenized_example1 = tokenizer(
example["question"],
example["context"],
max_length=max_length,
truncation="only_second",
return_overflowing_tokens=True,
stride=doc_stride
)
print([len(x) for x in tokenized_example1["input_ids"]])
for x in tokenized_example1["input_ids"][:2]:
print(tokenizer.decode(x))
"""
使用 offsets_mapping 获取原始的 input_ids
设置 return_offsets_mapping=True,将使得截断分割生成的多个 input_ids 列表中的 token,通过映射保留原始文本的 input_ids。
如下所示:第一个标记([CLS])的起始和结束字符都是(0, 0),因为它不对应问题/答案的任何部分,然后第二个标记与问题(question)的字符0到3相同."""
tokenized_example2 = tokenizer(
example["question"],
example["context"],
max_length=max_length,
truncation="only_second",
return_overflowing_tokens=True,
return_offsets_mapping=True,
stride=doc_stride
)
print(len(tokenized_example2["offset_mapping"][0]))
print(tokenized_example2["offset_mapping"][0][:20])
print([len(x) for x in tokenized_example2["offset_mapping"]])
"""
可以使用这个映射来找到答案在给定特征中的起始和结束标记的位置。
只需区分偏移的哪些部分对应于问题,哪些部分对应于上下文。"""
print(example["question"])
first_token_id = tokenized_example2["input_ids"][0][1]
offsets = tokenized_example2["offset_mapping"][0][1]
print(first_token_id)
print(offsets)
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])
second_token_id = tokenized_example2["input_ids"][0][2]
offsets = tokenized_example2["offset_mapping"][0][2]
print(tokenizer.convert_ids_to_tokens([second_token_id])[0], example["question"][offsets[0]:offsets[1]])
"""
借助tokenized_example的sequence_ids方法,我们可以方便的区分token的来源编号:
对于特殊标记:返回None,
对于正文Token:返回句子编号(从0开始编号)。
综上,可以很方便的在一个输入特征中找到答案的起始和结束 Token。"""
sequence_ids = tokenized_example2.sequence_ids()
print(sequence_ids)
print("-------------------answers----------------------")
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])
print(answers)
print(end_char)
# Start token index of the current span in the feature.
"""
The loop stops at the first position whose sequence id is 1;
token_start_index is then the index of the first context token in this span."""
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1
print("token_start_index : ", token_start_index)

# End token index of the current span in the feature.
token_end_index = len(tokenized_example2["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1
print("token_end_index : ", token_end_index)
# Detect whether the answer is outside the span (if so, the feature would be labeled with the CLS token index).
offsets = tokenized_example2["offset_mapping"][0]
# Using the answer's character-level start and end positions (start_char and end_char),
# refine the token-level indices (token_start_index and token_end_index).
# If the answer starts at or after the first context token's start and ends at or before
# the last context token's end, the answer lies within this feature.
if offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char:
    # Move token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last token if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature...")

# Decode the answer from the context via the offset mapping positions
print(tokenizer.decode(tokenized_example2["input_ids"][0][start_position: end_position + 1]))
# Print the gold answer from the dataset (answers["text"]) directly
print(answers["text"][0])
Output:
{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
-------------------- Find the first training sample longer than 384 tokens, then stop ----------------------
{'id': '5733caf74776f4190066124c', 'title': 'University_of_Notre_Dame', 'context': "The men's basketball team has over 1,600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 NCAA tournaments. Former player Austin Carr holds the record for most points scored in a single game of the tournament with 61. Although the team has never won the NCAA Tournament, they were named by the Helms Athletic Foundation as national champions twice. The team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending UCLA's record 88-game winning streak in 1974. The team has beaten an additional eight number-one teams, and those nine wins rank second, to UCLA's 10, all-time in wins against the top team. The team plays in newly renovated Purcell Pavilion (within the Edmund P. Joyce Center), which reopened for the beginning of the 2009–2010 season. The team is coached by Mike Brey, who, as of the 2014–15 season, his fifteenth at Notre Dame, has achieved a 332-165 record. In 2009 they were invited to the NIT, where they advanced to the semifinals but were beaten by Penn State who went on and beat Baylor in the championship. The 2010–11 team concluded its regular season ranked number seven in the country, with a record of 25–5, Brey's fifth straight 20-win season, and a second-place finish in the Big East. During the 2014-15 season, the team went 32-6 and won the ACC conference tournament, later advancing to the Elite 8, where the Fighting Irish lost on a missed buzzer-beater against then undefeated Kentucky. Led by NBA draft picks Jerian Grant and Pat Connaughton, the Fighting Irish beat the eventual national champion Duke Blue Devils twice during the season. The 32 wins were the most by the Fighting Irish team since 1908-09.", 'question': "How many wins does the Notre Dame men's basketball team have?", 'answers': {'text': ['over 1,600'], 'answer_start': [30]}}
{'input_ids': [101, 2129, 2116, 5222, 2515, 1996, 10289, 8214, 2273, 1005, 1055, 3455, 2136, 2031, 1029, 102, 1996, 2273, 1005, 1055, 3455, 2136, 2038, 2058, 1015, 1010, 5174, 5222, 1010, 2028, 1997, 2069, 2260, 2816, 2040, 2031, 2584, 2008, 2928, 1010, 1998, 2031, 2596, 1999, 2654, 5803, 8504, 1012, 2280, 2447, 5899, 12385, 4324, 1996, 2501, 2005, 2087, 2685, 3195, 1999, 1037, 2309, 2208, 1997, 1996, 2977, 2007, 6079, 1012, 2348, 1996, 2136, 2038, 2196, 2180, 1996, 5803, 2977, 1010, 2027, 2020, 2315, 2011, 1996, 16254, 2015, 5188, 3192, 2004, 2120, 3966, 3807, 1012, 1996, 2136, 2038, 23339, 1037, 2193, 1997, 6314, 2015, 1997, 2193, 2028, 4396, 2780, 1010, 1996, 2087, 3862, 1997, 2029, 2001, 4566, 12389, 1005, 1055, 2501, 6070, 1011, 2208, 3045, 9039, 1999, 3326, 1012, 1996, 2136, 2038, 7854, 2019, 3176, 2809, 2193, 1011, 2028, 2780, 1010, 1998, 2216, 3157, 5222, 4635, 2117, 1010, 2000, 12389, 1005, 1055, 2184, 1010, 2035, 1011, 2051, 1999, 5222, 2114, 1996, 2327, 2136, 1012, 1996, 2136, 3248, 1999, 4397, 10601, 26429, 10531, 1006, 2306, 1996, 9493, 1052, 1012, 11830, 2415, 1007, 1010, 2029, 11882, 2005, 1996, 2927, 1997, 1996, 2268, 1516, 2230, 2161, 1012, 1996, 2136, 2003, 8868, 2011, 3505, 7987, 3240, 1010, 2040, 1010, 2004, 1997, 1996, 2297, 1516, 2321, 2161, 1010, 2010, 16249, 2012, 10289, 8214, 1010, 2038, 4719, 1037, 29327, 1011, 13913, 2501, 1012, 1999, 2268, 2027, 2020, 4778, 2000, 1996, 9152, 2102, 1010, 2073, 2027, 3935, 2000, 1996, 8565, 2021, 2020, 7854, 2011, 9502, 2110, 2040, 2253, 2006, 1998, 3786, 23950, 1999, 1996, 2528, 1012, 1996, 2230, 1516, 2340, 2136, 5531, 2049, 3180, 2161, 4396, 2193, 2698, 1999, 1996, 2406, 1010, 2007, 1037, 2501, 1997, 2423, 1516, 1019, 1010, 7987, 3240, 1005, 1055, 3587, 3442, 2322, 1011, 2663, 2161, 1010, 1998, 1037, 2117, 1011, 2173, 3926, 1999, 1996, 2502, 2264, 1012, 2076, 1996, 2297, 1011, 2321, 2161, 1010, 1996, 2136, 2253, 3590, 1011, 1020, 1998, 2180, 1996, 16222, 3034, 2977, 1010, 2101, 10787, 2000, 1996, 7069, 1022, 1010, 2073, 1996, 3554, 3493, 2439, 2006, 1037, 4771, 12610, 2121, 1011, 3786, 2121, 2114, 2059, 15188, 5612, 1012, 2419, 2011, 6452, 4433, 11214, 15333, 6862, 3946, 1998, 6986, 9530, 2532, 18533, 2239, 1010, 1996, 3554, 3493, 3786, 1996, 9523, 2120, 3410, 3804, 2630, 13664, 3807, 2076, 1996, 2161, 1012, 1996, 3590, 5222, 2020, 1996, 2087, 2011, 1996, 3554, 3493, 2136, 2144, 5316, 1011, 5641, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
-------------------- Truncate the context, discarding the overflow ------------------------
{'input_ids': [101, 2129, 2116, 5222, 2515, 1996, 10289, 8214, 2273, 1005, 1055, 3455, 2136, 2031, 1029, 102, 1996, 2273, 1005, 1055, 3455, 2136, 2038, 2058, 1015, 1010, 5174, 5222, 1010, 2028, 1997, 2069, 2260, 2816, 2040, 2031, 2584, 2008, 2928, 1010, 1998, 2031, 2596, 1999, 2654, 5803, 8504, 1012, 2280, 2447, 5899, 12385, 4324, 1996, 2501, 2005, 2087, 2685, 3195, 1999, 1037, 2309, 2208, 1997, 1996, 2977, 2007, 6079, 1012, 2348, 1996, 2136, 2038, 2196, 2180, 1996, 5803, 2977, 1010, 2027, 2020, 2315, 2011, 1996, 16254, 2015, 5188, 3192, 2004, 2120, 3966, 3807, 1012, 1996, 2136, 2038, 23339, 1037, 2193, 1997, 6314, 2015, 1997, 2193, 2028, 4396, 2780, 1010, 1996, 2087, 3862, 1997, 2029, 2001, 4566, 12389, 1005, 1055, 2501, 6070, 1011, 2208, 3045, 9039, 1999, 3326, 1012, 1996, 2136, 2038, 7854, 2019, 3176, 2809, 2193, 1011, 2028, 2780, 1010, 1998, 2216, 3157, 5222, 4635, 2117, 1010, 2000, 12389, 1005, 1055, 2184, 1010, 2035, 1011, 2051, 1999, 5222, 2114, 1996, 2327, 2136, 1012, 1996, 2136, 3248, 1999, 4397, 10601, 26429, 10531, 1006, 2306, 1996, 9493, 1052, 1012, 11830, 2415, 1007, 1010, 2029, 11882, 2005, 1996, 2927, 1997, 1996, 2268, 1516, 2230, 2161, 1012, 1996, 2136, 2003, 8868, 2011, 3505, 7987, 3240, 1010, 2040, 1010, 2004, 1997, 1996, 2297, 1516, 2321, 2161, 1010, 2010, 16249, 2012, 10289, 8214, 1010, 2038, 4719, 1037, 29327, 1011, 13913, 2501, 1012, 1999, 2268, 2027, 2020, 4778, 2000, 1996, 9152, 2102, 1010, 2073, 2027, 3935, 2000, 1996, 8565, 2021, 2020, 7854, 2011, 9502, 2110, 2040, 2253, 2006, 1998, 3786, 23950, 1999, 1996, 2528, 1012, 1996, 2230, 1516, 2340, 2136, 5531, 2049, 3180, 2161, 4396, 2193, 2698, 1999, 1996, 2406, 1010, 2007, 1037, 2501, 1997, 2423, 1516, 1019, 1010, 7987, 3240, 1005, 1055, 3587, 3442, 2322, 1011, 2663, 2161, 1010, 1998, 1037, 2117, 1011, 2173, 3926, 1999, 1996, 2502, 2264, 1012, 2076, 1996, 2297, 1011, 2321, 2161, 1010, 1996, 2136, 2253, 3590, 1011, 1020, 1998, 2180, 1996, 16222, 3034, 2977, 1010, 2101, 10787, 2000, 1996, 7069, 1022, 1010, 2073, 1996, 3554, 3493, 2439, 2006, 1037, 4771, 12610, 2121, 1011, 3786, 2121, 2114, 2059, 15188, 5612, 1012, 2419, 2011, 6452, 4433, 11214, 15333, 6862, 3946, 1998, 6986, 9530, 2532, 18533, 2239, 1010, 1996, 3554, 3493, 3786, 1996, 9523, 2120, 3410, 3804, 2630, 13664, 3807, 2076, 1996, 2161, 1012, 1996, 3590, 5222, 2020, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[101, 2129, 2116, 5222, 2515, 1996, 10289, 8214, 2273, 1005, 1055, 3455, 2136, 2031, 1029, 102, 1996, 2273, 1005, 1055, 3455, 2136, 2038, 2058, 1015, 1010, 5174, 5222, 1010, 2028, 1997, 2069, 2260, 2816, 2040, 2031, 2584, 2008, 2928, 1010, 1998, 2031, 2596, 1999, 2654, 5803, 8504, 1012, 2280, 2447, 5899, 12385, 4324, 1996, 2501, 2005, 2087, 2685, 3195, 1999, 1037, 2309, 2208, 1997, 1996, 2977, 2007, 6079, 1012, 2348, 1996, 2136, 2038, 2196, 2180, 1996, 5803, 2977, 1010, 2027, 2020, 2315, 2011, 1996, 16254, 2015, 5188, 3192, 2004, 2120, 3966, 3807, 1012, 1996, 2136, 2038, 23339, 1037, 2193, 1997, 6314, 2015, 1997, 2193, 2028, 4396, 2780, 1010, 1996, 2087, 3862, 1997, 2029, 2001, 4566, 12389, 1005, 1055, 2501, 6070, 1011, 2208, 3045, 9039, 1999, 3326, 1012, 1996, 2136, 2038, 7854, 2019, 3176, 2809, 2193, 1011, 2028, 2780, 1010, 1998, 2216, 3157, 5222, 4635, 2117, 1010, 2000, 12389, 1005, 1055, 2184, 1010, 2035, 1011, 2051, 1999, 5222, 2114, 1996, 2327, 2136, 1012, 1996, 2136, 3248, 1999, 4397, 10601, 26429, 10531, 1006, 2306, 1996, 9493, 1052, 1012, 11830, 2415, 1007, 1010, 2029, 11882, 2005, 1996, 2927, 1997, 1996, 2268, 1516, 2230, 2161, 1012, 1996, 2136, 2003, 8868, 2011, 3505, 7987, 3240, 1010, 2040, 1010, 2004, 1997, 1996, 2297, 1516, 2321, 2161, 1010, 2010, 16249, 2012, 10289, 8214, 1010, 2038, 4719, 1037, 29327, 1011, 13913, 2501, 1012, 1999, 2268, 2027, 2020, 4778, 2000, 1996, 9152, 2102, 1010, 2073, 2027, 3935, 2000, 1996, 8565, 2021, 2020, 7854, 2011, 9502, 2110, 2040, 2253, 2006, 1998, 3786, 23950, 1999, 1996, 2528, 1012, 1996, 2230, 1516, 2340, 2136, 5531, 2049, 3180, 2161, 4396, 2193, 2698, 1999, 1996, 2406, 1010, 2007, 1037, 2501, 1997, 2423, 1516, 1019, 1010, 7987, 3240, 1005, 1055, 3587, 3442, 2322, 1011, 2663, 2161, 1010, 1998, 1037, 2117, 1011, 2173, 3926, 1999, 1996, 2502, 2264, 1012, 2076, 1996, 2297, 1011, 2321, 2161, 1010, 1996, 2136, 2253, 3590, 1011, 1020, 1998, 2180, 1996, 16222, 3034, 2977, 1010, 2101, 10787, 2000, 1996, 7069, 1022, 1010, 2073, 1996, 3554, 3493, 2439, 2006, 1037, 4771, 12610, 2121, 1011, 3786, 2121, 2114, 2059, 15188, 5612, 1012, 2419, 2011, 6452, 4433, 11214, 15333, 6862, 3946, 1998, 6986, 9530, 2532, 18533, 2239, 1010, 1996, 3554, 3493, 3786, 1996, 9523, 2120, 3410, 3804, 2630, 13664, 3807, 2076, 1996, 2161, 1012, 1996, 3590, 5222, 2020, 102]
-------------------- Truncate the context, keeping the question and the overflow ------------------------
[384, 149]
[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notre dame, has achieved a 332 - 165 record. in 2009 they were invited to the nit, where they advanced to the semifinals but were beaten by penn state who went on and beat baylor in the championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were [SEP]
[CLS] how many wins does the notre dame men's basketball team have? [SEP] its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were the most by the fighting irish team since 1908 - 09. [SEP]
384
[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 38), (38, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 3), (4, 7), (7, 8), (8, 9)]
[384, 149]
How many wins does the Notre Dame men's basketball team have?
2129
(0, 3)
how How
many many
[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]
-------------------answers----------------------
{'text': ['over 1,600'], 'answer_start': [30]}
40
token_start_index : 16
token_end_index : 382
23 26
over 1, 600
over 1,600
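Putting the walkthrough together: in practice this logic is collected into one preprocessing function and applied to the whole dataset with datasets.map. A condensed sketch in the spirit of the official transformers question-answering example (it reuses tokenizer, max_length, and doc_stride from above; padding and SQuAD 2.0 handling are simplified):

def prepare_train_features(examples):
    # Tokenize with truncation of the context only; long contexts overflow into
    # several features that overlap by doc_stride tokens.
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        stride=doc_stride,
        padding="max_length",
    )
    # Map each feature back to the example it came from.
    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    tokenized["start_positions"] = []
    tokenized["end_positions"] = []
    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized.sequence_ids(i)
        answers = examples["answers"][sample_mapping[i]]
        if len(answers["answer_start"]) == 0:
            # Unanswerable question (SQuAD 2.0): label with the CLS index.
            tokenized["start_positions"].append(cls_index)
            tokenized["end_positions"].append(cls_index)
            continue
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])
        # First and last context tokens of this span.
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1
        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1
        # If the answer is outside this span, label the feature with the CLS index;
        # otherwise, move the indices onto the answer's two ends as shown above.
        if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
            tokenized["start_positions"].append(cls_index)
            tokenized["end_positions"].append(cls_index)
        else:
            while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                token_start_index += 1
            tokenized["start_positions"].append(token_start_index - 1)
            while offsets[token_end_index][1] >= end_char:
                token_end_index -= 1
            tokenized["end_positions"].append(token_end_index + 1)
    return tokenized

tokenized_datasets = datasets.map(prepare_train_features, batched=True,
                                  remove_columns=datasets["train"].column_names)

Each original example can yield several features, and overflow_to_sample_mapping ties every feature back to the example whose answer should label it.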