Data preprocessing with huggingface/transformers
References
https://huggingface.co/docs/transformers/main/en/preprocessing#preprocess
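This post is based on the official tutorial.
1. Natural language
1.1 Tokenize
The main tool for processing text data is the tokenizer.
The tokenizer first splits the text into tokens according to a set of rules.
A tokens-to-index mapping (usually called the vocab) then converts each token, via a lookup table, to its index in the vocabulary, and those indices are assembled into the tensors that the model takes as input.
The tokenizer also adds any other inputs the model requires.
When using a pretrained model, use the tokenizer that belongs to it, so that the text is split the same way as the pretraining corpus and the same vocab is used as during pretraining.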
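To make the two steps (split, then look up) concrete, here is a minimal toy sketch. It is not how the library's tokenizers work internally: real BERT tokenizers split into subwords, and the vocabulary and helper names below (`TOY_VOCAB`, `toy_tokenize`, `toy_convert_tokens_to_ids`) are made up for illustration.

```python
# Toy illustration only: this is NOT how AutoTokenizer is implemented.
# TOY_VOCAB and the helper names are hypothetical, invented for this sketch.
TOY_VOCAB = {"[UNK]": 0, "do": 1, "not": 2, "meddle": 3}


def toy_tokenize(text):
    """Splitting rule: lowercase the text and split on whitespace."""
    return text.lower().split()


def toy_convert_tokens_to_ids(tokens, vocab=TOY_VOCAB):
    """Look each token up in the vocab; unknown tokens map to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = toy_tokenize("Do not meddle")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)  # ['do', 'not', 'meddle']
print(ids)     # [1, 2, 3]
```

A different splitting rule or a different vocab would produce different ids for the same text, which is why the tokenizer has to match the one used in pretraining.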
Get started quickly by loading a pretrained tokenizer with the AutoTokenizer class. This downloads the vocab the model was pretrained with.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
Pass a sentence to the tokenizer:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
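Output (copied from the official tutorial; these particular ids look like they come from an uncased BERT vocab, so the numbers printed by bert-base-cased may differ):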
{
'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
The tokenizer returns a dictionary with three important items:
- input_ids are the indices corresponding to each token in the sentence.
- attention_mask indicates whether a token should be attended to or not.
- token_type_ids identifies which sequence a token belongs to when there is more than one sequence.
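As a rough sketch of how the three fields relate, the snippet below assembles them by hand for one or two already-tokenized sequences, following the BERT layout `[CLS] A [SEP] B [SEP]`. The ids 101 and 102 are the `[CLS]` and `[SEP]` ids visible in the outputs in this post; the function name `toy_encode_pair` and the sample ids are only for illustration, and the real tokenizer does this for you.

```python
# Toy sketch of the BERT input layout; the real tokenizer builds these fields itself.
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] as seen in the outputs in this post


def toy_encode_pair(ids_a, ids_b=None):
    """Build input_ids, token_type_ids and attention_mask for one or two sequences."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)      # first sequence (with [CLS] and its [SEP]) -> 0
    if ids_b is not None:
        second = ids_b + [SEP_ID]
        input_ids = input_ids + second
        token_type_ids = token_type_ids + [1] * len(second)  # second sequence -> 1
    attention_mask = [1] * len(input_ids)      # no padding yet, so attend to everything
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


enc = toy_encode_pair([1252, 1184], [1327, 1164])
print(enc["input_ids"])       # [101, 1252, 1184, 102, 1327, 1164, 102]
print(enc["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1]
print(enc["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1]
```

With a single sequence, token_type_ids is all zeros, which matches the outputs shown in this post.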
You can decode the input_ids to recover the original input:
tokenizer.decode(encoded_input["input_ids"])
Output:
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
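The tokenizer added two special tokens, [CLS] and [SEP], at the start and end of the sentence.
To process several sentences at once, pass them to the tokenizer as a list: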
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
Output:
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
[101, 1327, 1164, 5450, 23434, 136, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1]]}
1.2 Pad
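A common problem: the sentences in a batch are not always the same length, but the tensors fed to the model must have a uniform shape.
Padding is a strategy that makes the tensors rectangular by adding a special padding token to the shorter sentences.
Set the padding parameter to True to pad the shorter sequences in a batch to the length of the longest one: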
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)
Output:
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
[101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
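Note that the first and third sentences have been padded with 0s to the length of the second sentence, and that their attention_mask is 0 at the padded positions.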
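The padding step itself is simple enough to sketch by hand. The snippet below is a toy version, assuming a pad id of 0 (the value seen in the output above); it takes the unpadded ids from the earlier batch output and reproduces the padded input_ids and attention_mask. The name `toy_pad` is made up for this sketch.

```python
PAD_ID = 0  # assumption: the padding token id, as it appears in the output above


def toy_pad(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching mask."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids copied from the batch output earlier in this post
batch = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
padded = toy_pad(batch)
print(padded["input_ids"][0])
print(padded["attention_mask"][2])
```

The attention mask is what lets the model ignore the padded positions.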
See the official documentation for more detailed settings:
https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#tokenizer
- max_length (int, optional): Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
1.3 Truncation
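On the other hand, a sequence is sometimes too long for the model to handle. In that case you need to truncate it to a shorter length.
Set the truncation parameter to True to truncate sequences to the maximum length the model accepts: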
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)
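The three example sentences are far shorter than BERT's limit, so truncation=True should leave them unchanged here. To see what truncation does, here is a toy sketch with a deliberately small, made-up limit of 8. It assumes the simple strategy of cutting tokens from the end while keeping `[CLS]` and `[SEP]`; the helper name `toy_truncate` is hypothetical, and the real tokenizer supports several truncation strategies.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] as seen in the outputs in this post


def toy_truncate(body_ids, max_length):
    """Keep at most max_length ids in total, reserving two slots for [CLS] and [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    kept = body_ids[: max_length - 2]  # drop tokens from the end
    return [CLS_ID] + kept + [SEP_ID]


# Ids of the second example sentence, without its special tokens
body = [1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119]
truncated = toy_truncate(body, max_length=8)
print(truncated)       # [101, 1790, 112, 189, 1341, 1119, 3520, 102]
print(len(truncated))  # 8
```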
1.4 Build tensors
Putting the steps above together, and setting the type of tensor to return (PyTorch: return_tensors="pt" / TensorFlow: return_tensors="tf"):
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)
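Padding matters here because framework tensors are rectangular: every row of a batch must have the same length. The toy check below illustrates that constraint in plain Python. The helper `batch_shape` and the short id lists are made up for this sketch; they are not library calls or real tokenizer output.

```python
def batch_shape(batch):
    """Return (batch_size, seq_len), or raise if the rows have different lengths."""
    lengths = {len(row) for row in batch}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch), lengths.pop())


# Made-up id lists, only to show the shape constraint
ragged = [[101, 1252, 102], [101, 1327, 1164, 102]]
padded = [[101, 1252, 102, 0], [101, 1327, 1164, 102]]

print(batch_shape(padded))  # (2, 4)
try:
    batch_shape(ragged)
except ValueError as err:
    print(err)              # ragged batch, row lengths: [3, 4]
```

In practice this is why return_tensors is usually combined with padding (and often truncation) when encoding a batch of sentences with different lengths.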
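2. Images
2.1 Feature extraction
A feature extractor is also used to process images for vision tasks; the goal is to convert raw images into a batch of tensors to use as input.
Load the food101 dataset for this tutorial.
Use the Datasets split parameter to load only a small sample from the training split, since the dataset is quite large: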
from datasets import load_dataset
dataset = load_dataset("food101", split="train[:100]")
# Look at one image
dataset[1]["image"]
Load the image feature extractor with AutoFeatureExtractor.from_pretrained() (newer transformers releases recommend AutoImageProcessor for images):
from transformers import AutoFeatureExtractor
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
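2.2 Data augmentation
For vision tasks, it is common to add some type of data augmentation to the images as part of preprocessing.
You can add augmentation with any library you like; this tutorial uses torchvision's transforms module.
- Normalize the image and use Compose to chain some transforms (RandomResizedCrop and ColorJitter) together: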
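from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
# Normalize with the per-channel mean and std the pretrained model expects
normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
_transforms = Compose(
    [
        # Random crop, resized to the model's input size.
        # Caveat: in some transformers versions `size` is a dict rather than an int; adapt this argument if so.
        RandomResizedCrop(feature_extractor.size),
        ColorJitter(brightness=0.5, hue=0.5),  # randomly perturb brightness and hue
        ToTensor(),  # PIL image -> float tensor in [0, 1], shape (C, H, W)
        normalize,
    ]
)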
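To see what the ToTensor and Normalize steps do to a single pixel value, here is a small worked example in plain Python. It assumes a mean and std of 0.5, which is an assumption made for this sketch; the actual values should be read from feature_extractor.image_mean and feature_extractor.image_std. The helper names are hypothetical.

```python
# Assumed values for illustration; read the real ones from the feature extractor.
MEAN, STD = 0.5, 0.5


def to_unit_range(value_0_255):
    """What ToTensor does to an 8-bit pixel value: scale it into [0, 1]."""
    return value_0_255 / 255.0


def normalize_value(x, mean=MEAN, std=STD):
    """What Normalize does per channel: subtract the mean, divide by the std."""
    return (x - mean) / std


pixel = 218  # an arbitrary 8-bit intensity
result = normalize_value(to_unit_range(pixel))
print(round(result, 4))  # 0.7098
```

Under these assumed values the normalized range is [-1, 1], so negative values in pixel_values are expected.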
- The model accepts pixel_values, the pixel values produced by this preprocessing, as its input. Create a function that generates pixel_values from the transforms:
def transforms(examples):
    # Convert each image to RGB, then apply the augmentation pipeline
    examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples
- Use the Datasets set_transform method to apply the transforms on the fly:
dataset.set_transform(transforms)
- Look at the example again:
dataset[1]
Output:
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F971B093FD0>, 'label': 6, 'pixel_values': tensor([[[ 0.7098, 0.7490, 0.7804, ..., -0.3333, -0.3412, -0.3255],
[ 0.6941, 0.7333, 0.7804, ..., -0.3176, -0.3255, -0.3255],
[ 0.6941, 0.7098, 0.7647, ..., -0.3647, -0.3176, -0.3176],
...,
[-0.5686, -0.7176, -0.7725, ..., -0.5686, -0.6000, -0.6235],
[-0.5686, -0.6863, -0.7333, ..., -0.5922, -0.6000, -0.6000],
[-0.5373, -0.6235, -0.7020, ..., -0.5922, -0.6078, -0.6000]],
[[ 0.7098, 0.7490, 0.7725, ..., -0.1843, -0.1608, -0.1451],
[ 0.7098, 0.7490, 0.7804, ..., -0.1686, -0.1608, -0.1608],
[ 0.7098, 0.7333, 0.7725, ..., -0.2000, -0.1529, -0.1294],
...,
[-0.3176, -0.4588, -0.5216, ..., -0.3412, -0.3412, -0.3647],
[-0.3176, -0.4431, -0.4980, ..., -0.3569, -0.3412, -0.3412],
[-0.2941, -0.3882, -0.4588, ..., -0.3647, -0.3490, -0.3412]],
[[ 0.5529, 0.6000, 0.6392, ..., 0.0980, 0.0902, 0.1059],
[ 0.5451, 0.5843, 0.6235, ..., 0.1137, 0.1216, 0.1137],
[ 0.5373, 0.5608, 0.6000, ..., 0.0980, 0.1294, 0.1451],
...,
[-0.1137, -0.2706, -0.3176, ..., -0.0039, -0.0275, -0.0431],
[-0.1294, -0.2549, -0.3176, ..., -0.0118, -0.0196, -0.0275],
[-0.1137, -0.2000, -0.2941, ..., -0.0196, -0.0275, -0.0196]]])}
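This is what the image looks like after preprocessing. As you would expect from the applied transforms, the image has been randomly cropped and its color properties are different.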
import numpy as np
import matplotlib.pyplot as plt
img = dataset[0]["pixel_values"]
plt.imshow(img.permute(1, 2, 0))
This is not quite what I had pictured image feature extraction to be?
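One likely reason the plot looks odd, besides the random crop and color jitter: the tensor has been normalized, so it contains negative values, and imshow clips float RGB data to [0, 1]. Undoing the normalization before plotting gives a more natural picture. Below is a minimal sketch of the arithmetic, assuming a mean and std of 0.5 (check feature_extractor.image_mean and feature_extractor.image_std for the real values); the helper names are made up.

```python
def denormalize(x, mean=0.5, std=0.5):
    """Invert Normalize: (x - mean) / std  ->  x * std + mean."""
    return x * std + mean


def clip01(x):
    """What imshow effectively does to float RGB values outside [0, 1]."""
    return min(1.0, max(0.0, x))


value = -0.3333                # a normalized value like those in the output above
print(clip01(value))           # 0.0: shown as black if plotted directly
print(denormalize(value))      # about 0.33: the intensity before Normalize
```

With the tensor from this post, the analogous operation would be something like `img * 0.5 + 0.5` before `permute(1, 2, 0)`, again assuming those mean and std values.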