huggingface/transformers: data preprocessing


梆子井欢喜坨 · Published 2022-08-21 23:59:01

1. Natural language

1.1 Tokenize

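The main tool for processing text data is the tokenizer.

The tokenizer first splits the text into tokens according to a set of rules.

A token-to-index mapping (usually called the vocab) then converts each token to its index in the vocabulary, and these indices are used to build the tensors that serve as model input.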
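The split-then-look-up idea can be sketched in a few lines of plain Python. This is a toy illustration only: `toy_vocab`, `toy_tokenize` and `tokens_to_ids` are hypothetical names, and real tokenizers such as BERT's use subword splitting and a much larger vocab.

```python
# Hypothetical miniature vocab; a real vocab has tens of thousands of entries.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "do": 4, "not": 5, "meddle": 6}

def toy_tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real splitting rules)."""
    return text.lower().split()

def tokens_to_ids(tokens, vocab):
    """Look up each token's index; tokens missing from the vocab map to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

tokens = toy_tokenize("Do not meddle today")
ids = tokens_to_ids(tokens, toy_vocab)
print(tokens)  # ['do', 'not', 'meddle', 'today']
print(ids)     # [4, 5, 6, 1]
```

Because "today" is not in the toy vocab, it falls back to the [UNK] index, which is one reason the text must be split and indexed the same way as during pretraining.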

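Any other inputs the model requires are also added by the tokenizer.

If you use a pretrained model, make sure the text is split the same way as the pretraining corpus and that the same vocab is used as during pretraining.

To get started quickly, load a pretrained tokenizer with the AutoTokenizer class. This downloads the vocab that the model was pretrained with.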

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Pass a sentence to the tokenizer:

encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)

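Output (the exact ids depend on the checkpoint's vocab; the ids listed below look like they come from an uncased BERT vocab, so running bert-base-cased may print different values):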

{
    'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], 
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
    'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}

The tokenizer returns a dictionary with three important items:

input_ids are the indices corresponding to each token in the sentence.
attention_mask indicates whether a token should be attended to or not.
token_type_ids identifies which sequence a token belongs to when there is more than one sequence.
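To make the three fields concrete, here is a minimal sketch of how they line up for a sentence pair in a BERT-style layout ([CLS] A [SEP] B [SEP]). The helper `build_pair_inputs` is hypothetical and the content ids are made up; only the special-token ids 101 and 102 are taken from the outputs in this post.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble [CLS] A [SEP] B [SEP] with matching segment ids and mask."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding here, so every position should be attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

pair = build_pair_inputs([7, 8, 9], [10, 11])  # made-up token ids
print(pair["input_ids"])       # [101, 7, 8, 9, 102, 10, 11, 102]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1]
print(pair["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1]
```

With a single sentence, as in the example above, token_type_ids is all zeros because there is only one segment.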

You can decode the input_ids to recover the original input:

tokenizer.decode(encoded_input["input_ids"])

Output

'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

The tokenizer added two special tokens: [CLS] (classifier) and [SEP] (separator).

Pass a list of sentences to the tokenizer:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

Output

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}

1.2 Pad

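A common problem: the sentences in a batch are not always the same length, but the tensors fed to the model need a uniform shape.

Padding is a strategy that adds a special padding token to shorter sentences so that the tensors fed to the model all have the same length.

Set the padding parameter to True to pad the shorter sequences in a batch to match the longest one: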

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)

Output

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

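Notice that the tokens of the first and third sentences have been padded with 0s to the length of the second sentence, and that their attention_mask is 0 at the padded positions.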
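What padding to the longest sequence does can be sketched in plain Python. This is a simplified illustration, not the library's implementation; `pad_batch` is a hypothetical helper and the pad id 0 matches the zeros in the output above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 1252, 102], [101, 1327, 1164, 136, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 1252, 102, 0, 0], [101, 1327, 1164, 136, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.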

See the official documentation for more detailed settings:

https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#tokenizer

max_length (int, optional): Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.

1.3 Truncation

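On the other hand, a sequence is sometimes too long for the model to handle. In that case you need to truncate it to a shorter length.

Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model: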

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)
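The three example sentences are all far shorter than the model's limit, so truncation=True leaves them unchanged here. To show the effect, here is a minimal sketch of truncating a single sequence to a small, made-up maximum length. `truncate_ids` is a hypothetical helper that assumes the sequence ends with [SEP] (id 102) and that max_length is at least 2; the library handles more cases, such as sentence pairs and different truncation strategies.

```python
def truncate_ids(input_ids, max_length, sep_id=102):
    """Cut a [CLS] ... [SEP] sequence to max_length while keeping the final [SEP]."""
    if len(input_ids) <= max_length:
        return list(input_ids)  # already short enough, nothing to cut
    # Keep the first max_length - 1 ids, then re-append [SEP] as the last id.
    return input_ids[: max_length - 1] + [sep_id]

long_ids = [101, 1790, 112, 189, 1341, 102]
print(truncate_ids(long_ids, max_length=4))   # [101, 1790, 112, 102]
print(truncate_ids(long_ids, max_length=10))  # [101, 1790, 112, 189, 1341, 102]
```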

1.4 Build tensors

Combine the steps above and set the type of tensors to return (PyTorch: return_tensors="pt" / TensorFlow: return_tensors="tf"):

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)

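2. Images

2.1 Feature extraction

A feature extractor is also used to process images for vision tasks. The goal is to convert the raw images into a batch of tensors as input.

Let's load the food101 dataset for this tutorial.

Use the Datasets split parameter to load only a small sample from the training split, since the dataset is quite large: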

from datasets import load_dataset

dataset = load_dataset("food101", split="train[:100]")

# Look at one image
dataset[1]["image"]

Load the image feature extractor with AutoFeatureExtractor.from_pretrained():

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")

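2.2 Data augmentation

For vision tasks, it is common to add some type of data augmentation to the images as part of preprocessing.

You can add augmentations with any library you like, but this tutorial uses torchvision's transforms module.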

Normalize the image and use Compose to chain some transforms - RandomResizedCrop and ColorJitter - together:

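from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor

# Normalize with the per-channel mean and std stored in the feature extractor
normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
# Note: feature_extractor.size was an int when this was written; in newer
# transformers releases it may be a dict, in which case pass its height/width instead.
_transforms = Compose(
    [RandomResizedCrop(feature_extractor.size), 
     ColorJitter(brightness=0.5, hue=0.5), 
     ToTensor(), 
     normalize]
)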
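For reference, Normalize applies (value - mean) / std to each channel. The sketch below reproduces that arithmetic on a single RGB pixel in plain Python. The mean and std of 0.5 per channel are an assumption for illustration (check feature_extractor.image_mean and feature_extractor.image_std for the real values), and `normalize_pixel` / `denormalize_pixel` are hypothetical helpers.

```python
def normalize_pixel(rgb, mean, std):
    """Apply (value - mean) / std independently to each channel."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]

def denormalize_pixel(rgb, mean, std):
    """Invert the normalization: value * std + mean."""
    return [v * s + m for v, m, s in zip(rgb, mean, std)]

mean = [0.5, 0.5, 0.5]  # assumed values, for illustration only
std = [0.5, 0.5, 0.5]   # assumed values, for illustration only

pixel = [0.0, 0.75, 1.0]  # ToTensor scales pixel values into [0, 1]
normalized = normalize_pixel(pixel, mean, std)
restored = denormalize_pixel(normalized, mean, std)
print(normalized)  # [-1.0, 0.5, 1.0]
print(restored)    # [0.0, 0.75, 1.0]
```

Under these assumed statistics, values in [0, 1] are mapped into [-1, 1], which is why normalized pixel values can be negative.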

The model accepts pixel_values, which the feature extractor normally produces, as its input.

Create a function that generates pixel_values from the transforms:

def transforms(examples):
    examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples

Use the Datasets set_transform method to apply the transforms on the fly:

dataset.set_transform(transforms)

Look at the example again:

dataset[1]

Output

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F971B093FD0>, 'label': 6, 'pixel_values': tensor([[[ 0.7098,  0.7490,  0.7804,  ..., -0.3333, -0.3412, -0.3255],
         [ 0.6941,  0.7333,  0.7804,  ..., -0.3176, -0.3255, -0.3255],
         [ 0.6941,  0.7098,  0.7647,  ..., -0.3647, -0.3176, -0.3176],
         ...,
         [-0.5686, -0.7176, -0.7725,  ..., -0.5686, -0.6000, -0.6235],
         [-0.5686, -0.6863, -0.7333,  ..., -0.5922, -0.6000, -0.6000],
         [-0.5373, -0.6235, -0.7020,  ..., -0.5922, -0.6078, -0.6000]],

        [[ 0.7098,  0.7490,  0.7725,  ..., -0.1843, -0.1608, -0.1451],
         [ 0.7098,  0.7490,  0.7804,  ..., -0.1686, -0.1608, -0.1608],
         [ 0.7098,  0.7333,  0.7725,  ..., -0.2000, -0.1529, -0.1294],
         ...,
         [-0.3176, -0.4588, -0.5216,  ..., -0.3412, -0.3412, -0.3647],
         [-0.3176, -0.4431, -0.4980,  ..., -0.3569, -0.3412, -0.3412],
         [-0.2941, -0.3882, -0.4588,  ..., -0.3647, -0.3490, -0.3412]],

        [[ 0.5529,  0.6000,  0.6392,  ...,  0.0980,  0.0902,  0.1059],
         [ 0.5451,  0.5843,  0.6235,  ...,  0.1137,  0.1216,  0.1137],
         [ 0.5373,  0.5608,  0.6000,  ...,  0.0980,  0.1294,  0.1451],
         ...,
         [-0.1137, -0.2706, -0.3176,  ..., -0.0039, -0.0275, -0.0431],
         [-0.1294, -0.2549, -0.3176,  ..., -0.0118, -0.0196, -0.0275],
         [-0.1137, -0.2000, -0.2941,  ..., -0.0196, -0.0275, -0.0196]]])}

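This is what the image looks like after preprocessing. As you would expect from the applied transforms, the image has been randomly cropped and its color properties are different.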

import numpy as np
import matplotlib.pyplot as plt

img = dataset[0]["pixel_values"]
plt.imshow(img.permute(1, 2, 0))
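The permute(1, 2, 0) call is needed because ToTensor yields a channels-first layout (C, H, W), while plt.imshow expects channels last (H, W, C). The plain-Python sketch below shows the same reordering on nested lists; `chw_to_hwc` is a hypothetical helper used only for illustration.

```python
def chw_to_hwc(image):
    """Reorder a nested list from (channels, height, width) to (height, width, channels)."""
    channels = len(image)
    height = len(image[0])
    width = len(image[0][0])
    return [
        [[image[c][h][w] for c in range(channels)] for w in range(width)]
        for h in range(height)
    ]

# A tiny 3-channel image with height 1 and width 2.
chw = [
    [[1, 2]],  # channel 0
    [[3, 4]],  # channel 1
    [[5, 6]],  # channel 2
]
hwc = chw_to_hwc(chw)
print(hwc)  # [[[1, 3, 5], [2, 4, 6]]]
```

Each inner list of the result holds the three channel values of one pixel. Note that the normalized tensor can contain values outside [0, 1], which imshow clips for float data, so the displayed colors may look off.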

This is not quite what I had in mind when I think of image feature extraction?
