1. Loading a pretrained BERT model with the transformers package

# 1. Imports
import torch
from transformers import BertTokenizer

# 2. Name of the pretrained model to use
model_name = 'bert-base-uncased'

# 3. Load the tokenizer from the pretrained vocabulary
tokenizer = BertTokenizer.from_pretrained(model_name)
sentence = "A very clean and well decorated empty bathroom."

2. Getting the sentence embedding

(1) The encode() method: returns only input_ids

def encode(
        self,
        text: Union[TextInput, PreTokenizedInput, EncodedInput],
        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = False,
        max_length: Optional[int] = None,
        stride: int = 0,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs
    )
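
Because add_special_tokens defaults to True, encode() prepends [CLS] (id 101) and appends [SEP] (id 102). A small sketch that maps the ids back to tokens to verify this:

ids = tokenizer.encode(sentence)
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'a', 'very', 'clean', 'and', 'well', 'decorated', 'empty', 'bathroom', '.', '[SEP]']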

(2) The encode_plus() method: returns all of the encoding information

def encode_plus(
        self,
        text: Union[TextInput, PreTokenizedInput, EncodedInput],
        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = False,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_split_into_words: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        **kwargs
    )
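
Among these arguments, return_tensors deserves a note: passing return_tensors='pt' makes the result come back as torch tensors with a leading batch dimension, ready to feed into a model. A minimal sketch:

enc = tokenizer.encode_plus(sentence, return_tensors='pt')
print(enc['input_ids'].shape)
# torch.Size([1, 11])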

The commonly returned fields are:

  1. input_ids: the vocabulary ids of the tokens
  2. token_type_ids: segment ids distinguishing the two sentences of a pair (first sentence all 0, second sentence all 1)
  3. attention_mask: marks which positions are real tokens (1) and which are padding (0), i.e. which positions self-attention should attend to
Examples:

# 4. Get the encodings
# (1) encode() returns only input_ids
print(tokenizer.encode(sentence))
# [101, 1037, 2200, 4550, 1998, 2092, 7429, 4064, 5723, 1012, 102]

# (2) encode_plus() returns all of the encoding information
print(tokenizer.encode_plus(sentence))

# (3) Customize the encode_plus() arguments
print(tokenizer.encode_plus(sentence, max_length=15, padding='max_length', return_attention_mask=True, return_token_type_ids=True, truncation=True))
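
The examples above encode a single sentence, so token_type_ids comes back all 0. A minimal sketch of a sentence-pair call that shows the 0/1 segment split (the second sentence is made up for illustration):

# Pass text and text_pair to get segment ids for two sentences
pair = tokenizer.encode_plus("A very clean and well decorated empty bathroom.", "It was spotless.")
print(pair['token_type_ids'])
# [CLS] + first sentence + first [SEP] are 0; second sentence + final [SEP] are 1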

3. Example: the code above, consolidated and runnable

import torch
from transformers import BertTokenizer

model_name = 'bert-base-uncased'

# a. Load the tokenizer from the pretrained vocabulary
tokenizer = BertTokenizer.from_pretrained(model_name)
sentence = "A very clean and well decorated empty bathroom."

print(tokenizer.encode(sentence))  # encode returns only input_ids
# [101, 1037, 2200, 4550, 1998, 2092, 7429, 4064, 5723, 1012, 102]
print(tokenizer.encode_plus(sentence))  # encode_plus returns all of the encoding information
print(tokenizer.encode_plus(sentence, max_length=15, padding='max_length', return_attention_mask=True, return_token_type_ids=True, truncation=True))
# {'input_ids': [101, 1037, 2200, 4550, 1998, 2092, 7429, 4064, 5723, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# {'input_ids': [101, 1037, 2200, 4550, 1998, 2092, 7429, 4064, 5723, 1012, 102, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]}
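
The ids above are vocabulary indices, not embeddings yet. A minimal sketch of feeding them through BertModel to obtain actual vectors; mean pooling over the token vectors is an assumed choice here, not the only option:

from transformers import BertModel

# b. Load the pretrained model weights
model = BertModel.from_pretrained(model_name)
model.eval()

# Encode with return_tensors='pt' to get batched torch tensors
inputs = tokenizer.encode_plus(sentence, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: one 768-dim vector per token, shape (1, seq_len, 768)
token_embeddings = outputs.last_hidden_state
# Assumed sentence-level embedding: mean over the token vectors
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])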

