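Using a pipeline

Extractive question answering is the task of extracting the answer to a question from a given text. The SQuAD dataset is built entirely around this task.

Below is an example that uses a pipeline for extractive question answering. It relies on a model fine-tuned on the SQuAD dataset:

Example code: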

from transformers import pipeline

nlp = pipeline("question-answering")

context = r"""
Last year, I went to the countryside to get my internship, my duty was to be a teacher, teaching the middle school students English. 
"""

result = nlp(question="When did I go to countryside to teach someone?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

result = nlp(question="What is my occupation?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

result = nlp(question="What subject does I teach?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Output:

Answer: 'Last year,', score: 0.9787, start: 1, end: 11
Answer: 'teacher,', score: 0.9525, start: 80, end: 88
Answer: 'English.', score: 0.9585, start: 125, end: 134
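
The start and end values in the output above are character offsets into the context string, with end being exclusive. The sketch below illustrates this without calling a model: the result dictionary is hard-coded from the first line of the output above (it is not produced by a live pipeline call), and slicing the context with those offsets recovers the answer text.

```python
# The same context string used in the pipeline example above.
context = r"""
Last year, I went to the countryside to get my internship, my duty was to be a teacher, teaching the middle school students English. 
"""

# Hard-coded copy of the first result printed above (assumption: not a live model call).
result = {"answer": "Last year,", "score": 0.9787, "start": 1, "end": 11}

# Slicing the context with the character offsets gives back the answer span.
# Offset 0 is the leading newline of the raw string, so the answer starts at 1.
span = context[result["start"]:result["end"]]
print(repr(span))
```

This is handy when the answer has to be highlighted in the original text rather than just printed.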

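Using a model and a tokenizer

Besides building things quickly with a pipeline, we can also do question answering with a model and a tokenizer directly. The steps are as follows:

  1. Instantiate a pretrained BERT model and its matching tokenizer.
  2. Provide a text and a few questions.
  3. Iterate over the questions and encode each one together with the text into a single sequence, with the token type IDs and attention mask the model expects.
  4. Pass the sequence through the model and get the output. The output contains two parts, start_logits and end_logits: the former scores each token as the start of the answer, the latter scores each token as the end of the answer.
  5. Apply a softmax to obtain, for each token, the probability that it is the start or the end of the answer.
  6. Take the tokens between the chosen start and end positions and convert them to a string, which is the answer text.
  7. Print the result.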
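
Steps 4 to 6 can be sketched on toy numbers. The logits below are made up for a six-token sequence and stand in for outputs["start_logits"] and outputs["end_logits"]; the softmax is written with plain Python lists to keep the sketch free of dependencies (on tensors, torch.softmax(logits, dim=-1) computes the same thing). Multiplying the two probabilities to score the span is one common choice, shown here as an assumption rather than as what any particular library does.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 6-token sequence (not real model output).
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.5, 3.5, 0.2, 0.1]

start_probs = softmax(start_logits)
end_probs = softmax(end_logits)

# Most likely start token, and one past the most likely end token,
# so that tokens[answer_start:answer_end] covers the whole answer.
answer_start = start_probs.index(max(start_probs))
answer_end = end_probs.index(max(end_probs)) + 1

# Assumed scoring rule: product of the start and end probabilities.
span_score = start_probs[answer_start] * end_probs[answer_end - 1]
print(answer_start, answer_end, round(span_score, 4))
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw logits selects the same positions as taking the argmax of the probabilities. The softmax only matters when a probability-like score is needed.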

Example code:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", cache_dir="./transformersModels/question-answering")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", cache_dir="./transformersModels/question-answering", return_dict=True)

text = r"""
Last year, I went to the countryside to get my internship, my duty was to be a teacher, teaching the middle school students English. 
"""

questions = [
    "When did I go to countryside to teach someone?",
    "What is my occupation?",
    "What subject does I teach?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    outputs = model(**inputs)
    answer_start_scores = outputs["start_logits"]
    answer_end_scores = outputs["end_logits"]
    answer_start = torch.argmax(
        answer_start_scores
    )  # index of the token most likely to start the answer
    answer_end = torch.argmax(answer_end_scores) + 1  # one past the token most likely to end the answer, so the slice below includes it
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print()

Output:

Question: When did I go to countryside to teach someone?
Answer: last year

Question: What is my occupation?
Answer: teacher

Question: What subject does I teach?
Answer: english

  • Note

    The tokenizer here encodes both the question and the text, and inserts special tokens at both ends and between the two. The encoded sequence looks like [CLS] what subject does i teach ? [SEP] last year , i went to the countryside to get my internship , my duty was to be a teacher , teaching the middle school students english . [SEP] where [CLS] and [SEP] are special tokens used by BERT.
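
The layout described in the note can be mimicked without loading a tokenizer. In the sketch below, build_pair is a hypothetical helper (not part of transformers), and splitting on whitespace is only a stand-in for real WordPiece tokenization. It shows where the special tokens go and how the BERT-style token type IDs mark the question segment with 0 and the text segment with 1.

```python
def build_pair(question_tokens, context_tokens):
    """Hypothetical helper: lay out a question/context pair the way BERT expects."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids

# Whitespace splitting stands in for the real tokenizer (assumption).
question_tokens = "what subject does i teach ?".split()
context_tokens = "my duty was to be a teacher .".split()

tokens, token_type_ids = build_pair(question_tokens, context_tokens)

# Position of the first context token inside the combined sequence.
context_start = token_type_ids.index(1)
print(tokens)
print(token_type_ids)
print(context_start)
```

The start and end indices that the model predicts refer to positions in this combined sequence, question included, which is why the example code slices input_ids directly instead of indexing into the text on its own.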
