Hugging Face Transformers Handbook (Part 2): Fast Inference with Pretrained Hugging Face Transformers Models via the Pipeline API


Cleo_Gao · Published 2023-08-07 15:20:47

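Compared with inference, training adds two things: data augmentation and backpropagation. Even though only these two things are added, training code is much more complex than inference code, so we start with the simpler inference process.

In Hugging Face Transformers, inference is handled entirely by the Pipeline class.

Pipelines are a simple way to run inference.

Instantiation: pipeline() returns a Pipeline object.

A Pipeline object consists of:

A tokenizer in charge of mapping raw textual input to tokens.
A model to make predictions from the inputs.
Some (optional) post-processing for enhancing the model's output.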
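The three components can be pictured with a toy, library-free sketch. Everything in it (`ToyPipeline`, the whitespace tokenizer, the word-list "model") is a made-up stand-in for illustration and is not how transformers implements these stages:

```python
# Toy stand-in for the three stages of a pipeline; not the transformers implementation.
POSITIVE_WORDS = {"awesome", "great", "good"}   # hypothetical vocabulary
NEGATIVE_WORDS = {"awful", "bad", "terrible"}   # hypothetical vocabulary


class ToyPipeline:
    def tokenize(self, text):
        # Stage 1: map raw text to tokens (here: lowercase whitespace split).
        return text.lower().split()

    def model(self, tokens):
        # Stage 2: produce a raw score from the tokens.
        pos = sum(tok in POSITIVE_WORDS for tok in tokens)
        neg = sum(tok in NEGATIVE_WORDS for tok in tokens)
        return pos - neg

    def postprocess(self, raw_score):
        # Stage 3: turn the raw score into a readable result.
        label = "POSITIVE" if raw_score >= 0 else "NEGATIVE"
        return [{"label": label, "raw_score": raw_score}]

    def __call__(self, text):
        return self.postprocess(self.model(self.tokenize(text)))


toy_pipe = ToyPipeline()
print(toy_pipe("This restaurant is awesome"))
```

Calling the object runs the stages in order, which is the same shape as calling a real Pipeline object on raw text.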

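Pipeline usage examples

Getting a pipeline for a task

The first positional argument of pipeline() is task.

>>> from transformers import pipeline
>>> # Get a Pipeline object; the string argument selects the pipeline type and is bound to task by default.
>>> pipe = pipeline("text-classification")
>>> # Pass input data to the pipeline object to get predictions back.
>>> pipe("This restaurant is awesome")
[{'label': 'POSITIVE', 'score': 0.9998743534088135}]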

Getting a pipeline for a specific model

Instead of a task, you can pass the name of the model you want:

# Pass a model name
>>> pipe = pipeline(model="roberta-large-mnli")
>>> pipe("This restaurant is awesome")
[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]

Passing a model object

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Sentiment analysis pipeline
analyzer = pipeline("sentiment-analysis")

# Question answering pipeline, specifying the checkpoint identifier
oracle = pipeline(
    "question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="bert-base-cased"
)

# Named entity recognition pipeline, passing in a specific model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
recognizer = pipeline("ner", model=model, tokenizer=tokenizer)

Predicting multiple inputs at once with a pipeline

Use a list for multiple inputs

>>> pipe = pipeline("text-classification")
>>> pipe(["This restaurant is awesome", "This restaurant is awful"])
[{'label': 'POSITIVE', 'score': 0.9998743534088135},
 {'label': 'NEGATIVE', 'score': 0.9996669292449951}]

Using datasets directly

import datasets
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")

# Pass the dataset to the pipeline instance
# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
# as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset
for out in tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....

In fact, passing a generator also works:

from transformers import pipeline

pipe = pipeline("text-classification")


def data():
    while True:
        # This could come from a dataset, a database, a queue or HTTP request
        # in a server
        # Caveat: because this is iterative, you cannot use `num_workers > 1` variable
        # to use multiple threads to preprocess data. You can still have 1 thread that
        # does the preprocessing while the main runs the big inference
        yield "This is a test"


for out in pipe(data()):
    print(out)
    # {"label": ..., "score": ...}
    # {"label": ..., "score": ...}
    # ....
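One way to picture why a generator is enough: the consumer can pull items lazily and group them into batches without ever materialising the whole stream. The sketch below uses only the standard library; `batched` and `fake_classify` are hypothetical helpers, not transformers functions, and the real pipeline internals differ:

```python
from itertools import islice


def batched(iterable, batch_size):
    # Pull items lazily and group them into lists of at most batch_size.
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_classify(batch):
    # Hypothetical stand-in for a model call: one result dict per input.
    return [{"text": text, "length": len(text)} for text in batch]


def data():
    # An endless stream, like the generator above.
    while True:
        yield "This is a test"


# Take only the first 5 items of the endless stream, in batches of 2.
results = []
for batch in batched(islice(data(), 5), batch_size=2):
    results.extend(fake_classify(batch))

print(len(results))
```

Because nothing is read ahead of time, the same loop works for a dataset, a queue, or any other source that can be iterated.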

`transformers.pipeline()` parameters

There are many parameters; only the most important ones are covered here.

task: str
model: str, PreTrainedModel, or TFPreTrainedModel
config: str or PretrainedConfig
- These are the hyperparameters needed to build the model, not the ones needed to train it.
tokenizer: str, PreTrainedTokenizer, or PreTrainedTokenizerFast
device: int / str / torch.device
num_workers (int, optional, defaults to 8)
batch_size (int, optional, defaults to 1)
feature_extractor: str or SequenceFeatureExtractor
- The feature extractor that will be used by the pipeline to encode data for the model.
- Feature extractors are used for non-NLP models, such as Speech or Vision models as well as multi-modal models. Multi-modal models will also require a tokenizer to be passed.
image_processor: str or BaseImageProcessor
framework: str
- Either “pt” for PyTorch or “tf” for TensorFlow.
revision: str, defaults to “main”
- A git revision of the model repository (branch name, tag name, or commit id); rarely needed.
use_fast: bool
- Whether or not to use a fast tokenizer if possible.
model_kwargs: dict
- Additional arguments passed to from_pretrained().
kwargs: dict
- Additional arguments required by a specific pipeline.

Supported tasks

“audio-classification”: will return an AudioClassificationPipeline.
“automatic-speech-recognition”: will return an AutomaticSpeechRecognitionPipeline.
“conversational”: will return a ConversationalPipeline.
“depth-estimation”: will return a DepthEstimationPipeline.
“document-question-answering”: will return a DocumentQuestionAnsweringPipeline.
“feature-extraction”: will return a FeatureExtractionPipeline.
“fill-mask”: will return a FillMaskPipeline.
“image-classification”: will return an ImageClassificationPipeline.
“image-segmentation”: will return an ImageSegmentationPipeline.
“image-to-text”: will return an ImageToTextPipeline.
“mask-generation”: will return a MaskGenerationPipeline.
“object-detection”: will return an ObjectDetectionPipeline.
“question-answering”: will return a QuestionAnsweringPipeline.
“summarization”: will return a SummarizationPipeline.
“table-question-answering”: will return a TableQuestionAnsweringPipeline.
“text2text-generation”: will return a Text2TextGenerationPipeline.
“text-classification” (alias “sentiment-analysis” available): will return a TextClassificationPipeline.
“text-generation”: will return a TextGenerationPipeline.
“token-classification” (alias “ner” available): will return a TokenClassificationPipeline.
“translation”: will return a TranslationPipeline.
“translation_xx_to_yy”: will return a TranslationPipeline.
“video-classification”: will return a VideoClassificationPipeline.
“visual-question-answering”: will return a VisualQuestionAnsweringPipeline.
“zero-shot-classification”: will return a ZeroShotClassificationPipeline.
“zero-shot-image-classification”: will return a ZeroShotImageClassificationPipeline.
“zero-shot-audio-classification”: will return a ZeroShotAudioClassificationPipeline.
“zero-shot-object-detection”: will return a ZeroShotObjectDetectionPipeline.

Pipeline chunk batching

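zero-shot-classification and question-answering use ChunkPipeline.

The reason is that a single input might yield multiple forward passes of the model. Under normal circumstances, this would cause issues with the batch_size argument.

From the caller's side nothing changes: data is still passed straight to the pipeline. Internally, a regular pipeline runs these three methods in order: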

pipe.preprocess()
pipe.forward()
pipe.postprocess()

In a ChunkPipeline, preprocess() yields several chunks per input, so the flow becomes:

all_model_outputs = []
for preprocessed in pipe.preprocess(inputs):
    model_outputs = pipe.forward(preprocessed)
    all_model_outputs.append(model_outputs)
outputs = pipe.postprocess(all_model_outputs)
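The loop above can be mimicked with a small library-free sketch, assuming a long input is split into fixed-size chunks. `ToyChunkPipeline` and its longest-word "model" are hypothetical; they only illustrate the shape of one input producing several forward passes:

```python
class ToyChunkPipeline:
    def __init__(self, chunk_size=3):
        self.chunk_size = chunk_size
        self.forward_calls = 0

    def preprocess(self, text):
        # One input yields several chunks (a generator, not a single item).
        words = text.split()
        for start in range(0, len(words), self.chunk_size):
            yield words[start:start + self.chunk_size]

    def forward(self, chunk):
        # Stand-in for one model forward pass over one chunk.
        self.forward_calls += 1
        return {"longest": max(chunk, key=len)}

    def postprocess(self, all_model_outputs):
        # Combine the per-chunk outputs into a single answer.
        return max((out["longest"] for out in all_model_outputs), key=len)

    def __call__(self, text):
        all_model_outputs = [self.forward(chunk) for chunk in self.preprocess(text)]
        return self.postprocess(all_model_outputs)


chunk_pipe = ToyChunkPipeline(chunk_size=3)
answer = chunk_pipe("a single input might yield multiple forward passes")
print(answer, chunk_pipe.forward_calls)
```

The eight-word input is split into three chunks, so forward() runs three times before postprocess() sees all of the outputs at once.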

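Writing your own pipeline

First, decide what the inputs and outputs are.

Input: strings / raw bytes / dictionaries / …; this will be the input to preprocess
Output: as simple as possible; this will be the output of postprocess

Four methods need to be implemented: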

import torch
from transformers import Pipeline


class MyPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # Route each supported keyword to the stage that consumes it.
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, maybe_arg=2):
        # Turn the raw pipeline input into the dict that is passed to the model.
        model_input = torch.tensor(inputs["input_ids"])
        return {"model_input": model_input}

    def _forward(self, model_inputs):
        # model_inputs == {"model_input": model_input}
        outputs = self.model(**model_inputs)
        # Maybe {"logits": Tensor(...)}
        return outputs

    def postprocess(self, model_outputs):
        best_class = model_outputs["logits"].softmax(-1)
        return best_class

`preprocess()`

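The input is the raw input you decided on at the start. This method processes it into the model's input, which is the output of preprocess(). (Keep the pipeline's input and the model's input distinct.)

The output of preprocess() is usually a dictionary, which is then passed to the model as **kwargs.

`_forward()`

forward() contains safeguard code that makes sure everything runs on the expected device. All other model-related code goes in _forward(), and forward() calls _forward().

Note that only model-related code belongs in _forward(); preprocessing and postprocessing go in their own methods.

`postprocess()`

The output of _forward() is the input to postprocess(), which turns it into the output the user wants.

`_sanitize_parameters()`

This function exists to allow users to pass any parameters whenever they wish, be it at initialization time, pipeline(...., maybe_arg=4), or at call time, pipe = pipeline(...); output = pipe(...., maybe_arg=4).

This method returns 3 dicts, which are passed to preprocess(), _forward(), and postprocess() respectively.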
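A minimal sketch of the idea in plain Python: defaults captured at construction time are merged with call-time overrides, and each resulting dict feeds one stage. `ToySanitizingPipeline` is a hypothetical illustration of that merging and is not the transformers source:

```python
class ToySanitizingPipeline:
    def __init__(self, **kwargs):
        # Parameters given at construction time become per-stage defaults.
        self._defaults = self._sanitize_parameters(**kwargs)

    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs, forward_kwargs, postprocess_kwargs = {}, {}, {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        if "top_k" in kwargs:
            postprocess_kwargs["top_k"] = kwargs["top_k"]
        return preprocess_kwargs, forward_kwargs, postprocess_kwargs

    def preprocess(self, inputs, maybe_arg=2):
        return [value * maybe_arg for value in inputs]

    def _forward(self, model_inputs):
        return sorted(model_inputs, reverse=True)

    def postprocess(self, model_outputs, top_k=5):
        return model_outputs[:top_k]

    def __call__(self, inputs, **kwargs):
        call_time = self._sanitize_parameters(**kwargs)
        # Call-time values override construction-time defaults, stage by stage.
        pre, fwd, post = ({**default, **override}
                          for default, override in zip(self._defaults, call_time))
        model_inputs = self.preprocess(inputs, **pre)
        model_outputs = self._forward(model_inputs, **fwd)
        return self.postprocess(model_outputs, **post)


toy = ToySanitizingPipeline(maybe_arg=4)
print(toy([1, 3, 2]))            # maybe_arg=4 comes from construction time
print(toy([1, 3, 2], top_k=2))   # top_k is supplied at call time
```

Routing every keyword through one function means a parameter behaves the same whether it is given when the pipeline is built or when it is called.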

Example

Target behavior:

>>> pipe = pipeline("my-new-task")
>>> pipe("This is a test")
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05},
 {"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}]

>>> pipe("This is a test", top_k=2)
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}]

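Subclassing Pipeline

The first call passes nothing besides the input data, yet 5 results come back, so top_k defaults to 5 (and it should be a parameter of postprocess()). To support this, give postprocess() a top_k parameter and edit _sanitize_parameters() so that it routes top_k through:

# Both functions are methods of the MyPipeline class.
def postprocess(self, model_outputs, top_k=5):
    best_class = model_outputs["logits"].softmax(-1)
    # Add logic to handle top_k here.
    return best_class

def _sanitize_parameters(self, **kwargs):
    preprocess_kwargs = {}
    if "maybe_arg" in kwargs:
        preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]

    postprocess_kwargs = {}
    if "top_k" in kwargs:
        postprocess_kwargs["top_k"] = kwargs["top_k"]
    # Always return the three dicts, whether or not top_k was given.
    return preprocess_kwargs, {}, postprocess_kwargs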
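The top_k handling itself is not spelled out in the snippet above. A library-free sketch of what it could look like, assuming the model returns one logit per label; the label names and logit values here are made up for illustration:

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k_labels(logits, labels, top_k=5):
    # Pair each label with its probability and keep the top_k highest.
    scores = softmax(logits)
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [{"label": label, "score": score} for label, score in ranked[:top_k]]


labels = ["1-star", "2-star", "3-star", "4-star", "5-star"]  # hypothetical labels
logits = [3.0, 1.0, 0.2, -0.5, -0.5]                         # made-up logits

for row in top_k_labels(logits, labels, top_k=2):
    print(row["label"], round(row["score"], 3))
```

In a real postprocess() the same ranking would typically be done on the tensor returned by the model, with the label names read from the model configuration.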

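Registration

Call PIPELINE_REGISTRY.register_pipeline():

from transformers import AutoModelForSequenceClassification
from transformers.pipelines import PIPELINE_REGISTRY

# MyPipeline is the class defined above; the task name matches the target example.
PIPELINE_REGISTRY.register_pipeline(
    "my-new-task",
    pipeline_class=MyPipeline,
    pt_model=AutoModelForSequenceClassification,
)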

Pipelines for specific tasks

ImageClassificationPipeline

>>> from transformers import pipeline
>>> classifier = pipeline(model="microsoft/beit-base-patch16-224-pt22k-ft22k")

>>> classifier("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'score': 0.442, 'label': 'macaw'}, {'score': 0.088, 'label': 'popinjay'}, {'score': 0.075, 'label': 'parrot'}, {'score': 0.073, 'label': 'parodist, lampooner'}, {'score': 0.046, 'label': 'poll, poll_parrot'}]

__call__() parameters:

images (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
top_k (int, optional, defaults to 5) — The number of top labels that will be returned by the pipeline.

ImageSegmentationPipeline

>>> from transformers import pipeline

>>> segmenter = pipeline(model="facebook/detr-resnet-50-panoptic")
>>> segments = segmenter("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
>>> len(segments)
2

>>> segments[0]["label"]
'bird'

>>> segments[1]["label"]
'bird'

>>> type(segments[0]["mask"])  # This is a black and white mask showing where is the bird on the original image.
<class 'PIL.Image.Image'>

>>> segments[0]["mask"].size
(768, 512)

__call__() parameters:

images (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
subtask (str, optional) — Segmentation task to be performed, choose [semantic, instance and panoptic] depending on model capabilities. If not set, the pipeline will attempt to resolve in the following order: panoptic, instance, semantic.
threshold (float, optional, defaults to 0.9) — Probability threshold to filter out predicted masks.
mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
overlap_mask_area_threshold (float, optional, defaults to 0.5) — Mask overlap threshold to eliminate small, disconnected segments.
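To make mask_threshold concrete: turning a soft mask into a binary one is a per-pixel comparison. A tiny sketch on a made-up 2x3 grid of probabilities (the real pipeline works on model outputs and returns PIL images, and whether its comparison is strict at exactly the threshold is not covered here):

```python
def binarize(mask, mask_threshold=0.5):
    # 1 where the predicted probability exceeds the threshold, else 0.
    return [[1 if value > mask_threshold else 0 for value in row] for row in mask]


soft_mask = [
    [0.10, 0.60, 0.90],
    [0.40, 0.55, 0.20],
]  # made-up probabilities

print(binarize(soft_mask))
print(binarize(soft_mask, mask_threshold=0.8))
```

Raising the threshold keeps only the pixels the model is most certain about, so the resulting mask shrinks.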

ObjectDetectionPipeline

>>> from transformers import pipeline

>>> detector = pipeline(model="facebook/detr-resnet-50")

>>> detector("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'score': 0.997, 'label': 'bird', 'box': {'xmin': 69, 'ymin': 171, 'xmax': 396, 'ymax': 507}}, {'score': 0.999, 'label': 'bird', 'box': {'xmin': 398, 'ymin': 105, 'xmax': 767, 'ymax': 507}}]

>>> # x, y  are expressed relative to the top left hand corner.

__call__() parameters:

images (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
threshold (float, optional, defaults to 0.9) — The probability necessary to make a prediction.
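The returned boxes are plain dictionaries, so they can be filtered and measured without any extra library. The sketch below reuses the two detections printed above; `box_area` and `keep_confident` are hypothetical helper names:

```python
# The two detections shown in the output above.
detections = [
    {"score": 0.997, "label": "bird", "box": {"xmin": 69, "ymin": 171, "xmax": 396, "ymax": 507}},
    {"score": 0.999, "label": "bird", "box": {"xmin": 398, "ymin": 105, "xmax": 767, "ymax": 507}},
]


def box_area(box):
    # Width times height in pixels; coordinates are relative to the top-left corner.
    return (box["xmax"] - box["xmin"]) * (box["ymax"] - box["ymin"])


def keep_confident(results, threshold):
    # Client-side filter that keeps detections at or above the threshold.
    return [item for item in results if item["score"] >= threshold]


areas = [box_area(item["box"]) for item in detections]
print(areas)
print(len(keep_confident(detections, threshold=0.998)))
```

Filtering after the call like this is separate from the pipeline's own threshold argument, which decides what the pipeline returns in the first place.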

ImageToTextPipeline

>>> from transformers import pipeline
>>> captioner = pipeline(model="ydshieh/vit-gpt2-coco-en")

>>> captioner("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'generated_text': 'two birds are standing next to each other '}]

__call__() parameters:

images (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
max_new_tokens (int, optional) — The amount of maximum tokens to generate. By default it will use generate default.
generate_kwargs (Dict, optional) — Pass it to send all of these arguments directly to generate allowing full control of this function.

VisualQuestionAnsweringPipeline

This visual question answering pipeline can currently be loaded from pipeline() using the following task identifiers: “visual-question-answering”, “vqa”.

>>> from transformers import pipeline

>>> oracle = pipeline(model="dandelin/vilt-b32-finetuned-vqa")
>>> image_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/lena.png"
>>> oracle(question="What is she wearing ?", image=image_url)
[{'score': 0.948, 'answer': 'hat'}, {'score': 0.009, 'answer': 'fedora'}, {'score': 0.003, 'answer': 'clothes'}, {'score': 0.003, 'answer': 'sun hat'}, {'score': 0.002, 'answer': 'nothing'}]

>>> oracle(question="What is she wearing ?", image=image_url, top_k=1)
[{'score': 0.948, 'answer': 'hat'}]

>>> oracle(question="Is this a person ?", image=image_url, top_k=1)
[{'score': 0.993, 'answer': 'yes'}]

>>> oracle(question="Is this a man ?", image=image_url, top_k=1)
[{'score': 0.996, 'answer': 'no'}]

__call__() parameters:

image (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
question (str, List[str]) — The question(s) asked. If given a single question, it can be broadcast to multiple images.
top_k (int, optional, defaults to 5) — The number of top answers returned by the pipeline.
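The broadcasting of a single question over several images can be sketched in plain Python. `pair_questions_with_images` is a hypothetical helper written for illustration, not the pipeline's own code, and the file names are placeholders:

```python
def pair_questions_with_images(images, question):
    # Accept a single image or a list, and a single question or a list.
    image_list = images if isinstance(images, list) else [images]
    if isinstance(question, str):
        # Broadcast: reuse the one question for every image.
        question_list = [question] * len(image_list)
    else:
        question_list = list(question)
    if len(question_list) != len(image_list):
        raise ValueError("pass one question per image, or a single question")
    return [{"image": image, "question": text}
            for image, text in zip(image_list, question_list)]


pairs = pair_questions_with_images(
    ["parrots.png", "lena.png"],  # placeholder file names
    "Is this a person ?",
)
print(pairs)
```

Each resulting dictionary has the same image/question shape as the keyword arguments used in the calls above.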
