Notes on Reading the Source of tokenizer.encode() in the Transformers Package
1 Introduction
- Hugging Face's transformers package makes it extremely convenient to load pretrained models: BERT, ALBERT, GPT-2, and so on.
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased')
- These two lines import the bert-base-uncased pretrained model and the BertForTokenClassification fine-tuning model for NER tasks; it could hardly be more convenient.
- To convert our sentence from tokens to ids, here is the example given in the official docs:
import torch

input_ids = torch.tensor(
    tokenizer.encode("Hello, my dog is cute",
                     add_special_tokens=True)).unsqueeze(0)  # Batch size 1
- This NER example uses the tokenizer's encode method, which raises some questions:
- What is the difference between encode and tokenize? What does each method output?
- What operations can encode perform on a sentence? Which parameters can we choose from?
We will explore and organize the answers one by one below.
2 Reading the Code
2.1 The difference between encode and tokenize
- Code makes the comparison most direct (the sentence is deliberately made up, to show off the WordPiece behavior):
sentence = "Hello, my son is cuting."
input_ids_method1 = torch.tensor(
tokenizer.encode(sentence, add_special_tokens=True)) # Batch size 1
# tensor([ 101, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102])
input_token2 = tokenizer.tokenize(sentence)
# ['hello', ',', 'my', 'son', 'is', 'cut', '##ing', '.']
input_ids_method2 = tokenizer.convert_tokens_to_ids(input_token2)
# tensor([7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012])
# 并没有开头和结尾的标记:[cls]、[sep]
- As the example shows, encode produces model-ready input in a single step.
- By contrast, tokenize only splits the sentence (into WordPiece units), and after tokenizing you still have to call convert_tokens_to_ids by hand, which is tedious.
- Reading the source shows that encode calls tokenize internally, so by setting encode's parameters we can turn raw data into a trainable format in one shot, as the sketch below illustrates. The rest of this section introduces encode's parameters and what they do.
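As a quick taste before the parameter walkthrough, here is a minimal sketch (using the bert-base-uncased tokenizer loaded above and the 2.x-era API quoted in section 2.2; the parameter values are made up for illustration) of one encode() call doing tokenization, id conversion, truncation, and padding at once:

# One call replaces tokenize + convert_tokens_to_ids + manual padding/truncation.
input_ids = torch.tensor(
    tokenizer.encode(
        "Hello, my son is cuting.",
        add_special_tokens=True,   # wrap with [CLS] ... [SEP]
        max_length=16,             # truncate anything longer than 16 ids
        pad_to_max_length=True,    # pad shorter sequences up to 16
    )
).unsqueeze(0)                     # add a batch dimension: shape (1, 16)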
2.2 Parameters of tokenizer.encode()
- The source:
def encode(
    self,
    text: str,  # the sentence to convert
    text_pair: Optional[str] = None,
    add_special_tokens: bool = True,
    max_length: Optional[int] = None,
    stride: int = 0,
    truncation_strategy: str = "longest_first",
    pad_to_max_length: bool = False,
    return_tensors: Optional[str] = None,
    **kwargs
):
    """
    Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
    Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``.

    Args:
        text (:obj:`str` or :obj:`List[str]`):
            The first sequence to be encoded. This can be a string, a list of strings (tokenized string using
            the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
            method)
        text_pair (:obj:`str` or :obj:`List[str]`, `optional`, defaults to :obj:`None`):
            Optional second sequence to be encoded. This can be a string, a list of strings (tokenized
            string using the `tokenize` method) or a list of integers (tokenized string ids using the
            `convert_tokens_to_ids` method)
        add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
            If set to ``True``, the sequences will be encoded with the special tokens relative
            to their model.
        max_length (:obj:`int`, `optional`, defaults to :obj:`None`):
            If set to a number, will limit the total sequence returned so that it has a maximum length.
            If there are overflowing tokens, those will be added to the returned dictionary
        stride (:obj:`int`, `optional`, defaults to ``0``):
            If set to a number along with max_length, the overflowing tokens returned will contain some tokens
            from the main sequence returned. The value of this argument defines the number of additional tokens.
        truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):
            String selected in the following options:
            - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length
              starting from the longest one at each token (when there is a pair of input sequences)
            - 'only_first': Only truncate the first sequence
            - 'only_second': Only truncate the second sequence
            - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
        pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):
            If set to True, the returned sequences will be padded according to the model's padding side and
            padding index, up to their max length. If no max length is specified, the padding is done up to the
            model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`
            which can be set to the following strings:
            - 'left': pads on the left of the sequences
            - 'right': pads on the right of the sequences
            Defaults to False: no padding.
        return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):
            Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`
            or PyTorch :obj:`torch.Tensor` instead of a list of python integers.
        **kwargs: passed to the `self.tokenize()` method
    """
    encoded_inputs = self.encode_plus(
        text,
        text_pair=text_pair,
        max_length=max_length,
        add_special_tokens=add_special_tokens,
        stride=stride,
        truncation_strategy=truncation_strategy,
        pad_to_max_length=pad_to_max_length,
        return_tensors=return_tensors,
        **kwargs,
    )

    return encoded_inputs["input_ids"]
- add_special_tokens: bool = True — wraps the sentence in the model's expected input format (for BERT, [CLS] and [SEP]); enabled by default.
- max_length — sets the maximum length. If left unset, the model's own maximum applies (512 for BERT), and a sentence longer than 512 tokens produces the error below:
Token indices sequence length is longer than the specified maximum sequence length for this model (5904 > 512). Running this sequence through the model will result in indexing errors
In that case we either have to cut the sentence down ourselves, or enable this parameter with the maximum length we want; the function then keeps only max_length - 2 tokens (reserving room for [CLS] and [SEP]) and converts them to ids.
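A minimal sketch of truncation (assuming the bert-base-uncased tokenizer loaded above):

# Truncate an over-long input down to the model's 512-token limit.
long_sentence = "hello " * 1000           # far more than 512 tokens
ids = tokenizer.encode(long_sentence, add_special_tokens=True, max_length=512)
print(len(ids))                           # 512: [CLS] + 510 content tokens + [SEP]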
- pad_to_max_length: bool = False — whether to pad sequences up to the maximum length; disabled by default. Setting tokenizer.padding_side = 'left' makes the padding be inserted on the left instead.
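A sketch of left-side padding (the output shown is illustrative, reusing the ids from section 2.1; BERT's [PAD] id is 0):

tokenizer.padding_side = 'left'           # default is 'right'
ids = tokenizer.encode("Hello, my son is cuting.",
                       add_special_tokens=True,
                       max_length=16,
                       pad_to_max_length=True)
# [0, 0, 0, 0, 0, 0, 101, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102]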
- truncation_strategy: str = "longest_first" — the truncation mechanism; there are four ways to cut the input down to size:
  - 'longest_first' (default): iteratively trim the longer of the two sequences, one token at a time, until the total fits under max_length
  - 'only_first': truncate only the first sequence
  - 'only_second': truncate only the second sequence
  - 'do_not_truncate': no truncation; raise an error if the input is longer than max_length
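A sketch with a sentence pair, trimming only the second sequence (toy inputs made up for illustration):

# 'only_second' keeps the first sequence intact and shortens text_pair.
ids = tokenizer.encode(
    "the first sequence, kept intact",
    text_pair="the second sequence, repeated until it overflows " * 10,
    add_special_tokens=True,
    max_length=32,
    truncation_strategy='only_second',
)
print(len(ids))                           # 32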
- return_tensors: Optional[str] = None — the type of the returned data; defaults to None (a plain Python list of ints). Set to 'tf' for TensorFlow tensors or 'pt' for PyTorch tensors.
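A sketch; note that with return_tensors='pt' the result already carries a batch dimension, so the torch.tensor(...).unsqueeze(0) wrapping from section 1 becomes unnecessary:

pt_ids = tokenizer.encode(sentence, add_special_tokens=True, return_tensors='pt')
print(type(pt_ids), pt_ids.shape)         # <class 'torch.Tensor'> torch.Size([1, 10])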
3 Closing Thoughts
- Once you have read an API's source, that API leaves its mark on all the code you write afterwards.
- The barrier to entry for NLP keeps getting lower, and rapid model-assembly toolkits like Hugging Face's Transformers will only become more numerous and more convenient. In my view, an applied NLP engineer's core competency, beyond a large stockpile of models and algorithms, is the ability to "examine" a task like a seasoned doctor and immediately prescribe an experimental plan, land the algorithm quickly, and get the project shipped. Since speed is the goal, we had better keep a few of these "crude but effective" kitchen-knife tools ready to hand (however good your kung fu, you still fear the kitchen knife).