Downloading and Using Bert-base-Chinese from Hugging Face

Download Bert_Base_Chinese from Hugging Face and use the model for a few basic tasks.

little pierce · published 2023-07-25 16:55:58

Hugging Face

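Hugging Face started out as a New York-based chatbot startup. The team set out to build a chatbot and open-sourced the Transformers library on GitHub along the way. The chatbot business never took off, but the library quickly became hugely popular in the machine learning community. At the time of writing, the platform hosts more than 100,000 pretrained models and 10,000 datasets, and has become the GitHub of machine learning.

The official Hugging Face website is http://www.huggingface.co. The main resources you will need there are:

Datasets: datasets and their download links
Models: pretrained models
course: a free NLP course, though only in English
docs: documentation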

Bert_Base_Chinese

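BERT is a powerful pretrained model: starting from the pretrained parameters released by Google, it can be fine-tuned on a variety of downstream tasks. Bert_Base_Chinese has been pretrained for Chinese, with training and random input masking applied independently to word pieces (as in the original BERT paper). Note that a token in Bert_Base_Chinese is a single Chinese character, not a word. Other aspects of BERT are outside the scope of this article and are not covered here.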
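
To make the character-level point concrete, here is a minimal sketch of the splitting rule in plain Python rather than with the real tokenizer. It assumes only the basic CJK block (U+4E00 to U+9FFF) and skips lower-casing and WordPiece splitting, so `split_cjk` is a hypothetical helper and not the library's implementation.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block (assumed range)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def split_cjk(text: str) -> list[str]:
    """Put spaces around every CJK character, then split on whitespace."""
    spaced = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)
    return spaced.split()


print(split_cjk("月光"))      # ['月', '光']
print(split_cjk("BERT模型"))  # ['BERT', '模', '型']
```

Each Chinese character ends up as its own token, which is why a two-character word such as "月光" is never a single vocabulary entry by default.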

How BERT embeddings work

Bert_Tokenizer

Simple encoder

Bert_Base_Chinese download page: https://huggingface.co/bert-base-chinese

Downloading the model and tokenizer automatically:

from transformers import BertModel, BertTokenizer

# Download the model and tokenizer over the network; files are cached under C:\Users\admin\.cache\huggingface\hub\models--bert-base-chinese

model_name = 'bert-base-chinese'
tokenizer = BertTokenizer.from_pretrained(model_name)

sents = ["白日依山尽", "黄河入海流"]
out = tokenizer.encode(
    # the two sentences to encode
    text=sents[0],
    text_pair=sents[1],
    # truncate inputs longer than max_length
    truncation=True,
    # always pad up to max_length
    padding='max_length',
    add_special_tokens=True,
    max_length=30,
    return_tensors=None,
)
print(out)
print(tokenizer.decode(out))

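If the network is unreliable, the automatic download may not work. In that case, download the files manually from the page above and load them as shown next; the result is the same. The file layout used for the manual import is shown in the figure:

[Figure: model file paths]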

from transformers import BertModel, BertTokenizer

# Load the manually downloaded model and tokenizer from local directories

vocab_file = 'model/vocab.txt'
tokenizer = BertTokenizer(vocab_file)
bert = BertModel.from_pretrained("model/bert-base-chinese/")

# Only the tokenizer is used here, so loading the model on the previous line is optional
sents = ["白日依山尽", "黄河入海流"]
out = tokenizer.encode(
    # the two sentences to encode
    text=sents[0],
    text_pair=sents[1],
    # truncate inputs longer than max_length
    truncation=True,
    # always pad up to max_length
    padding='max_length',
    add_special_tokens=True,
    max_length=30,
    return_tensors=None,
)
print(out)
print(tokenizer.decode(out))

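Both methods produce the output shown in the figure below. Because max_length is set to 30, [PAD] tokens are appended until the sequence has 30 tokens. The sequence starts with [CLS], and [SEP] marks both the boundary between the two sentences and the end. Calling len() on the tokenizer shows a vocabulary size of 21128. BERT accepts either one sentence or two sentences per input.

[Figure: program output]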
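
The layout described above can be reproduced by hand. The sketch below is plain Python and works on token strings instead of ids; `layout_pair` is a hypothetical helper that assumes one token per character, which holds for these all-Chinese sentences.

```python
def layout_pair(first: str, second: str, max_length: int) -> list[str]:
    """Build [CLS] first [SEP] second [SEP], then pad with [PAD] up to max_length."""
    tokens = ["[CLS]", *first, "[SEP]", *second, "[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("truncation is not handled in this sketch")
    return tokens + ["[PAD]"] * (max_length - len(tokens))


tokens = layout_pair("白日依山尽", "黄河入海流", max_length=30)
print(" ".join(tokens))
print(len(tokens), tokens.count("[PAD]"))  # 30 17
```

The 13 real tokens are 5 + 5 characters plus three special tokens; the remaining 17 positions are padding.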

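Enhanced encoder

When the simple encoder encode() is not enough, use encode_plus(). Compared with encode(), it lets you choose the returned tensor type and optionally return the token type ids, the attention mask, the special-tokens mask, and the length.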

from transformers import BertModel, BertTokenizer

# Use the enhanced encoder function encode_plus

vocab_file = 'model/vocab.txt'
tokenizer = BertTokenizer(vocab_file)
bert = BertModel.from_pretrained("model/bert-base-chinese/")

# Only the tokenizer is used here, so loading the model on the previous line is optional
sents = ["白日依山尽", "黄河入海流"]
out = tokenizer.encode_plus(
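    # the two sentences to encode
    text=sents[0],
    text_pair=sents[1],
    # truncate inputs longer than max_length
    truncation=True,
    # always pad up to max_length
    padding='max_length',
    add_special_tokens=True,
    max_length=30,
    # one of 'tf', 'pt', 'np' (TensorFlow, PyTorch, NumPy); None returns Python lists
    return_tensors=None,
    # return token_type_ids: 0 for the first sentence and its special tokens,
    # 1 for the second sentence and its closing [SEP], 0 for padding
    return_token_type_ids=True,
    # return attention_mask: 0 for padding, 1 elsewhere
    return_attention_mask=True,
    # return special_tokens_mask: 1 for special tokens, 0 elsewhere
    return_special_tokens_mask=True,
    # return the length
    return_length=True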
)
for k, v in out.items():
    print(k, ':', v)

print(tokenizer.decode(out['input_ids']))

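The output shows that token_type_ids is 0 for the first sentence, 1 for the second sentence, and 0 for padding; special_tokens_mask is 0 for ordinary characters and 1 for the special tokens (start, separator, end, and padding); attention_mask marks the valid part with 1 and the padding with 0; and the returned length is the configured length, that is, valid part plus padding, 30 in total.

[Figure: program output]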
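
The three masks follow mechanically from the token layout. Here is a small sketch in plain Python, assuming the [CLS] A [SEP] B [SEP] [PAD]... layout and one token per character; `build_masks` is an illustrative helper, not a library function.

```python
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}


def build_masks(first: str, second: str, max_length: int) -> dict[str, list[int]]:
    """Derive token_type_ids, attention_mask and special_tokens_mask for a sentence pair."""
    seg_a = ["[CLS]", *first, "[SEP]"]   # segment 0, including its special tokens
    seg_b = [*second, "[SEP]"]           # segment 1, including the closing [SEP]
    pad = max_length - len(seg_a) - len(seg_b)
    if pad < 0:
        raise ValueError("truncation is not handled in this sketch")
    tokens = seg_a + seg_b + ["[PAD]"] * pad
    return {
        "token_type_ids": [0] * len(seg_a) + [1] * len(seg_b) + [0] * pad,
        "attention_mask": [1] * (len(seg_a) + len(seg_b)) + [0] * pad,
        "special_tokens_mask": [1 if t in SPECIAL else 0 for t in tokens],
    }


masks = build_masks("白日依山尽", "黄河入海流", max_length=30)
for name, values in masks.items():
    print(name, ":", values)
```

Summing attention_mask gives the number of real tokens (13 here), which is a handy way to recover the unpadded length from a padded batch.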

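Batch encoder

In most cases sentences are not fed to the encoder one at a time but in batches, so the most commonly used encoder function is batch_encode_plus(). The implementation follows. The main difference is that the input parameter becomes batch_text_or_text_pairs, a list whose elements can each be a single sentence or a tuple of two sentences.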

from transformers import BertModel, BertTokenizer

# Use the batch encoder function batch_encode_plus

vocab_file = 'model/vocab.txt'
tokenizer = BertTokenizer(vocab_file)
bert = BertModel.from_pretrained("model/bert-base-chinese/")

# Only the tokenizer is used here, so loading the model on the previous line is optional
sents = ["白日依山尽", "黄河入海流", "欲穷千里目", "更上一层楼"]
out = tokenizer.batch_encode_plus(
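    # all inputs as single sentences
    # batch_text_or_text_pairs=sents,
    # all inputs, some of them sentence pairs
    batch_text_or_text_pairs=[sents[0], (sents[1], sents[2]), sents[3]],
    # truncate inputs longer than max_length
    truncation=True,
    # always pad up to max_length
    padding='max_length',
    add_special_tokens=True,
    max_length=30,
    # one of 'tf', 'pt', 'np' (TensorFlow, PyTorch, NumPy); None returns Python lists
    return_tensors=None,
    # return token_type_ids: 0 for the first sentence and its special tokens,
    # 1 for the second sentence and its closing [SEP], 0 for padding
    return_token_type_ids=True,
    # return attention_mask: 0 for padding, 1 elsewhere
    return_attention_mask=True,
    # return special_tokens_mask: 1 for special tokens, 0 elsewhere
    return_special_tokens_mask=True,
    # return the length; here it is the real length, not the configured length of 30
    return_length=True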
)
for k, v in out.items():
    print(k, ':', v)
for i in range(len(out['input_ids'])):
    print(tokenizer.decode(out['input_ids'][i]))

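The output is shown below. input_ids is now a two-dimensional array, and length no longer reports the padded total but the actual length of each item.

[Figure: program output]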
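
The per-item lengths can be checked by counting: a single sentence gets [CLS] and one [SEP], while a pair gets [CLS] and two [SEP]. A minimal sketch, assuming one token per character and no truncation; `real_length` is a hypothetical helper.

```python
def real_length(item) -> int:
    """Token count before padding for a sentence (str) or a sentence pair (tuple)."""
    if isinstance(item, tuple):
        first, second = item
        return len(first) + len(second) + 3   # [CLS] + first + [SEP] + second + [SEP]
    return len(item) + 2                      # [CLS] + sentence + [SEP]


sents = ["白日依山尽", "黄河入海流", "欲穷千里目", "更上一层楼"]
batch = [sents[0], (sents[1], sents[2]), sents[3]]
print([real_length(item) for item in batch])  # [7, 13, 7]
```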

Vocabulary

The vocabulary is not fixed: you can add your own tokens. There are two functions for this, add_tokens() for ordinary tokens and add_special_tokens() for special tokens.

from transformers import BertModel, BertTokenizer

# Working with the vocabulary

vocab_file = 'model/vocab.txt'
tokenizer = BertTokenizer(vocab_file)
bert = BertModel.from_pretrained("model/bert-base-chinese/")

zidian = tokenizer.get_vocab()
# bert_base_chinese tokenizes by character
print(type(zidian), ' ', len(zidian), ' ', '月光' in zidian, ' ', '月' in zidian, ' ', '光' in zidian)

tokenizer.add_tokens(new_tokens=['月光'])
tokenizer.add_special_tokens({'eos_token': '[EOS]'})
zidian = tokenizer.get_vocab()
print('月光' in zidian, ' ', '[EOS]' in zidian, ' ', len(zidian))

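The result is shown in the figure below. The vocabulary originally has 21128 entries, the same as the tokenizer length seen earlier. Because Bert_Base_Chinese tokenizes by character, "月光" is not in the vocabulary. After manually adding "月光" and the special token "[EOS]", the vocabulary size grows by 2.

[Figure: program output]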
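
How the vocabulary grows can be sketched with a plain dict. This toy version assumes new tokens simply receive the next free id and that tokens already present are skipped; `toy_add_tokens` is a stand-in written for illustration, not the tokenizer's own method.

```python
def toy_add_tokens(vocab: dict[str, int], new_tokens: list[str]) -> int:
    """Append unseen tokens with the next free ids; return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            added += 1
    return added


vocab = {"[PAD]": 0, "月": 1, "光": 2}   # toy vocabulary, not the real 21128 entries
print(toy_add_tokens(vocab, ["月光", "[EOS]"]), len(vocab))  # 2 5
print(toy_add_tokens(vocab, ["月光"]))                        # 0, already present
```

With the real tokenizer and model, growing the vocabulary usually goes together with `model.resize_token_embeddings(len(tokenizer))`, so that the embedding matrix has a row for every new id.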

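Dataset

This article uses the ChnSentiCorp dataset, and all later experiments use it as well. It can be downloaded from Hugging Face at https://huggingface.co/datasets/seamew/ChnSentiCorp/tree/main.

ChnSentiCorp is a sentiment classification dataset. After a manual download, the file layout is as follows:

[Figure: dataset file paths]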

from datasets import load_dataset

dataset = load_dataset(path='datasets', split='train')
print(len(dataset))
print(dataset[0])

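The code above checks that the dataset was imported successfully by printing its size and the first example:

[Figure: program output]

The training split has 9600 examples. Each example has a text and a label; the label is the sentiment judgment for the text.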

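Chinese classification task

Now for the first experiment using the pieces above: Chinese text classification. The experiment uses the pretrained Bert_Base_Chinese as is: its parameters are frozen and it runs under torch.no_grad(), so no gradients are computed for it. A fully connected layer is added on top and trained for binary classification, with 16 sentences per batch. To save time, training stops after 300 steps (batches, not epochs).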
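
A quick count puts the 300 steps in perspective. The sketch assumes the 9600 training examples reported for the dataset, a batch size of 16, and drop_last=True, which are the settings used in this experiment.

```python
train_size = 9600      # training examples reported for ChnSentiCorp
batch_size = 16
last_index = 300       # the training loop breaks once i == 300

steps_per_epoch = train_size // batch_size   # drop_last=True discards any partial batch
steps_run = last_index + 1                   # i runs over 0..300 inclusive
print(steps_per_epoch, steps_run, round(steps_run / steps_per_epoch, 2))  # 600 301 0.5
```

In other words, the classifier head sees only about half of the training data once.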

CPU version:

import torch
from datasets import load_dataset
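from transformers import BertTokenizer, BertModel
# transformers.AdamW is deprecated and may be missing in newer releases, so torch's
# AdamW is used instead (note: its default weight_decay is 0.01 rather than 0)
from torch.optim import AdamW

# Chinese text classification


# Define the dataset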
class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        self.dataset = load_dataset(path='datasets', split=split)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]['text']
        label = self.dataset[i]['label']

        return text, label


dataset = Dataset('train')

# len(dataset), dataset[0]

# Load the vocabulary and tokenizer
token = BertTokenizer.from_pretrained('bert-base-chinese')


def collate_fn(data):
    sents = [i[0] for i in data]
    labels = [i[1] for i in data]
    # Encode the batch
    data = token.batch_encode_plus(batch_text_or_text_pairs=sents,
                                   truncation=True,
                                   padding='max_length',
                                   max_length=500,
                                   return_tensors='pt',
                                   return_length=True)
    # input_ids: the token ids after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    labels = torch.LongTensor(labels)
    #print(data['length'], data['length'].max())
    return input_ids, attention_mask, token_type_ids, labels


# Data loader
loader = torch.utils.data.DataLoader(dataset=dataset,
                                     batch_size=16,
                                     collate_fn=collate_fn,
                                     shuffle=True,
                                     drop_last=True)

for i, (input_ids, attention_mask, token_type_ids,
        labels) in enumerate(loader):
    break
print(len(loader))
print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels)

# Load the pretrained Chinese BERT model
pretrained = BertModel.from_pretrained('model/bert-base-chinese')
# The encoder is not trained, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)
# Trial forward pass
out = pretrained(input_ids=input_ids,
           attention_mask=attention_mask,
           token_type_ids=token_type_ids)
# 16 sentences, 500 tokens per sentence, 768 dimensions per token
print(out.last_hidden_state.shape)


# Define the downstream task model
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids,
                       attention_mask=attention_mask,
                       token_type_ids=token_type_ids)
        out = self.fc(out.last_hidden_state[:, 0])
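        # Note: torch.nn.CrossEntropyLoss applies log-softmax itself, so it is normally
        # given raw logits; the softmax here is kept as in the original code.
        out = out.softmax(dim=1)
        return out


model = Model()
# Output shape is (16, 2): 16 sentences, 2 classes
# print(model(input_ids=input_ids,
#      attention_mask=attention_mask,
#      token_type_ids=token_type_ids).shape)

# Training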
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()
model.train()
for i, (input_ids, attention_mask, token_type_ids,
        labels) in enumerate(loader):
    out = model(input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids)
    loss = criterion(out, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if i % 5 == 0:
        out = out.argmax(dim=1)
        accuracy = (out == labels).sum().item() / len(labels)
        print(i, loss.item(), accuracy)
    if i == 300:
        break


# Evaluation
def test():
    model.eval()
    correct = 0
    total = 0

    loader_test = torch.utils.data.DataLoader(dataset=Dataset('validation'),
                                              batch_size=32,
                                              collate_fn=collate_fn,
                                              shuffle=True,
                                              drop_last=True)

    for i, (input_ids, attention_mask, token_type_ids,
            labels) in enumerate(loader_test):
        if i == 5:
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        out = out.argmax(dim=1)
        correct += (out == labels).sum().item()
        total += len(labels)
    print(correct / total)


test()

GPU version. The logic is the same as in the CPU version and gives equivalent results, only faster:

import torch
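from transformers import BertModel, BertTokenizer
# transformers.AdamW is deprecated and may be missing in newer releases, so torch's
# AdamW is used instead (note: its default weight_decay is 0.01 rather than 0)
from torch.optim import AdamW
from datasets import load_dataset

# CUDA version of Chinese text classification

# Quick demo
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('device=', device)

# Load the pretrained model
pretrained = BertModel.from_pretrained('model/bert-base-chinese/')
# Move it to the selected device
pretrained.to(device)

# The encoder is not trained, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)


# Define the downstream task model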
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids,
                             attention_mask=attention_mask,
                             token_type_ids=token_type_ids)
        out = self.fc(out.last_hidden_state[:, 0])
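        # Note: torch.nn.CrossEntropyLoss applies log-softmax itself, so it is normally
        # given raw logits; the softmax here is kept as in the original code.
        out = out.softmax(dim=1)
        return out


model = Model()
# The head must be moved to the same device
model.to(device)


# The rest is identical to the CPU classification code, except that it runs on the device
# Define the dataset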
class Dataset(torch.utils.data.Dataset):

    def __init__(self, split):
        self.dataset = load_dataset('datasets')[split]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]['text']
        label = self.dataset[i]['label']
        return text, label


dataset = Dataset('train')

# Load the vocabulary and tokenizer
token = BertTokenizer.from_pretrained('bert-base-chinese')


def collate_fn(data):
    sents = [i[0] for i in data]
    labels = [i[1] for i in data]

    # Encode the batch
    data = token.batch_encode_plus(batch_text_or_text_pairs=sents,
                                   truncation=True,
                                   padding='max_length',
                                   max_length=500,
                                   return_tensors='pt',
                                   return_length=True)
    # input_ids: the token ids after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    input_ids = data['input_ids'].to(device)
    attention_mask = data['attention_mask'].to(device)
    token_type_ids = data['token_type_ids'].to(device)
    labels = torch.LongTensor(labels).to(device)
    #print(data['length'], data['length'].max())
    return input_ids, attention_mask, token_type_ids, labels


# Data loader
loader = torch.utils.data.DataLoader(dataset=dataset,
                                     batch_size=16,
                                     collate_fn=collate_fn,
                                     shuffle=True,
                                     drop_last=True)
for i, (input_ids, attention_mask, token_type_ids,
        labels) in enumerate(loader):
    break

# print(len(loader))
# print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels)

# Training
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()
model.train()
for i, (input_ids, attention_mask, token_type_ids,
        labels) in enumerate(loader):
    out = model(input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids)
    loss = criterion(out, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if i % 5 == 0:
        out = out.argmax(dim=1)
        accuracy = (out == labels).sum().item() / len(labels)
        print(i, loss.item(), accuracy)
    if i == 300:
        break


# Evaluation
def test():
    model.eval()
    correct = 0
    total = 0
    loader_test = torch.utils.data.DataLoader(dataset=Dataset('validation'),
                                              batch_size=32,
                                              collate_fn=collate_fn,
                                              shuffle=True,
                                              drop_last=True)
    for i, (input_ids, attention_mask, token_type_ids,
            labels) in enumerate(loader_test):
        if i == 5:
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        out = out.argmax(dim=1)
        correct += (out == labels).sum().item()
        total += len(labels)
    print(correct / total)


test()

Experimental results:

0 0.7015336751937866 0.375
5 0.6688764095306396 0.625
10 0.6294441223144531 0.6875
15 0.6134305596351624 0.8125
20 0.5924522876739502 0.6875
25 0.5516412258148193 0.9375
30 0.5142663717269897 0.9375
35 0.5576078295707703 0.75
40 0.49155497550964355 0.9375
45 0.5733523964881897 0.75
50 0.47572264075279236 0.875
55 0.4810750186443329 0.875
60 0.584780216217041 0.6875
65 0.4616132080554962 0.9375
70 0.46279385685920715 0.9375
75 0.5439828038215637 0.875
80 0.5376919507980347 0.75
85 0.44682416319847107 0.8125
90 0.5576794743537903 0.75
95 0.5322697162628174 0.75
100 0.538276195526123 0.8125
105 0.5001559853553772 0.875
110 0.5196012258529663 0.8125
115 0.4310966730117798 0.9375
120 0.43399879336357117 0.9375
125 0.5088990926742554 0.8125
130 0.5933794975280762 0.6875
135 0.5021435022354126 0.75
140 0.5618072748184204 0.6875
145 0.4775013327598572 0.875
150 0.44216257333755493 0.875
155 0.5286621451377869 0.75
160 0.4359947443008423 0.875
165 0.3895459473133087 1.0
170 0.5000126957893372 0.8125
175 0.3741750121116638 1.0
180 0.4112277626991272 0.875
185 0.4835755228996277 0.8125
190 0.5347906351089478 0.8125
195 0.47410687804222107 0.8125
200 0.454181969165802 0.875
205 0.5046591758728027 0.875
210 0.37064653635025024 1.0
215 0.4531233012676239 0.9375
220 0.5533742308616638 0.75
225 0.3597880005836487 1.0
230 0.37045764923095703 1.0
235 0.5207308530807495 0.75
240 0.44153541326522827 0.875
245 0.4212343990802765 0.9375
250 0.4749773442745209 0.875
255 0.3914490044116974 0.9375
260 0.4207759201526642 0.9375
265 0.4993809163570404 0.875
270 0.41594651341438293 0.9375
275 0.4802340865135193 0.875
280 0.47708410024642944 0.8125
285 0.4468970000743866 0.9375
290 0.5039204359054565 0.8125
295 0.41882529854774475 0.875
300 0.500678539276123 0.8125
0
1
2
3
4
0.91875

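After only about 300 training steps, the training accuracy is already around 80%. Because only a single fully connected layer is trained, training is fast and remarkably efficient. On the validation set the accuracy reaches about 92% (0.91875 over five batches of 32).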
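
One detail in the model code is worth checking: forward() applies softmax, and the result is then passed to CrossEntropyLoss, which applies log-softmax again. The sketch below re-implements the per-example cross-entropy in plain Python (an assumption made for illustration, instead of calling torch) and shows that with probabilities as input, the loss of a two-class problem cannot fall below about 0.313. That is consistent with the logged losses staying above roughly 0.36.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores: list[float], target: int) -> float:
    """-log(softmax(scores)[target]) for a single example."""
    log_sum_exp = math.log(sum(math.exp(s) for s in scores))
    return log_sum_exp - scores[target]


logits = [8.0, -8.0]                       # a very confident, correct prediction for class 0
print(cross_entropy(logits, 0))            # close to 0
print(cross_entropy(softmax(logits), 0))   # about 0.3133: the floor with an extra softmax
print(math.log(1 + math.exp(-1)))          # the floor itself, reached for the input [1, 0]
```

Returning the raw output of the linear layer from forward() and leaving the softmax to the loss is the usual arrangement; argmax gives the same predictions either way, since softmax preserves the ordering of the scores.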

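Chinese fill-in-the-blank task

This is BERT's familiar masked-token task: tokens are replaced with [MASK] and the model predicts the hidden token, much like a cloze test. The code follows. It is largely the same as the previous task; the difference is that during encoding the token at index 15 is deliberately replaced with [MASK] to fit the task.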

import torch
from datasets import load_dataset
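from transformers import BertTokenizer, BertModel
# transformers.AdamW is deprecated and may be missing in newer releases, so torch's
# AdamW is used instead (note: its default weight_decay is 0.01 rather than 0)
from torch.optim import AdamW

# Chinese fill-in-the-blank


# Define the dataset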
class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        dataset = load_dataset(path='datasets', split=split)
        def f(data):
            return len(data['text']) > 30
        self.dataset = dataset.filter(f)
    def __len__(self):
        return len(self.dataset)
    def __getitem__(self, i):
        text = self.dataset[i]['text']
        return text


dataset = Dataset('train')
# len(dataset), dataset[0]

# Load the vocabulary and tokenizer
token = BertTokenizer.from_pretrained('bert-base-chinese')


def collate_fn(data):
    # Encode the batch
    data = token.batch_encode_plus(batch_text_or_text_pairs=data,
                                   truncation=True,
                                   padding='max_length',
                                   max_length=30,
                                   return_tensors='pt',
                                   return_length=True)

    # input_ids: the token ids after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    # Always replace the token at index 15 with [MASK]; its original id becomes the label
    labels = input_ids[:, 15].reshape(-1).clone()
    input_ids[:, 15] = token.get_vocab()[token.mask_token]
    # print(data['length'], data['length'].max())
    return input_ids, attention_mask, token_type_ids, labels


# Data loader
loader = torch.utils.data.DataLoader(dataset=dataset,
                                     batch_size=16,
                                     collate_fn=collate_fn,
                                     shuffle=True,
                                     drop_last=True)
for i, (input_ids, attention_mask, token_type_ids,
        labels) in enumerate(loader):
    break

# print(len(loader))
# print(token.decode(input_ids[0]))
# print(token.decode(labels[0]))
# print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels.shape)

# Load the pretrained model
pretrained = BertModel.from_pretrained('model/bert-base-chinese')
# The encoder is not trained, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)
# Trial forward pass
# out = pretrained(input_ids=input_ids,
#           attention_mask=attention_mask,
#           token_type_ids=token_type_ids)
# print(out.last_hidden_state.shape)


# Define the downstream task model
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = torch.nn.Linear(768, token.vocab_size, bias=False)
        self.bias = torch.nn.Parameter(torch.zeros(token.vocab_size))
        self.decoder.bias = self.bias

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids,
                             attention_mask=attention_mask,
                             token_type_ids=token_type_ids)

        out = self.decoder(out.last_hidden_state[:, 15])
        return out


model = Model()
# print(model(input_ids=input_ids,
#       attention_mask=attention_mask,
#       token_type_ids=token_type_ids).shape)

# Training
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()
model.train()
for epoch in range(5):
    for i, (input_ids, attention_mask, token_type_ids,
            labels) in enumerate(loader):
        out = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids)
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if i % 50 == 0:
            out = out.argmax(dim=1)
            accuracy = (out == labels).sum().item() / len(labels)
            print(epoch, i, loss.item(), accuracy)


# Evaluation
def test():
    model.eval()
    correct = 0
    total = 0
    loader_test = torch.utils.data.DataLoader(dataset=Dataset('test'),
                                              batch_size=32,
                                              collate_fn=collate_fn,
                                              shuffle=True,
                                              drop_last=True)
    for i, (input_ids, attention_mask, token_type_ids,
            labels) in enumerate(loader_test):
        if i == 15:
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        out = out.argmax(dim=1)
        correct += (out == labels).sum().item()
        total += len(labels)
        print(token.decode(input_ids[0]))
        # Print the predicted token next to the true token for the first example
        print(token.decode(out[0]), token.decode(labels[0]))
    print(correct / total)


test()
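
The masking step in collate_fn can be sketched on token strings. The helper below is hypothetical and assumes one token per character; it shows that index 15 counts from [CLS] at index 0, so the hidden token is the 15th character of the text.

```python
def mask_position(text: str, pos: int = 15, max_length: int = 30):
    """Return (masked_tokens, label) for a text truncated to max_length tokens."""
    body = list(text)[: max_length - 2]              # leave room for [CLS] and [SEP]
    tokens = ["[CLS]", *body, "[SEP]"]
    label = tokens[pos]                              # remember the original token
    masked = tokens[:pos] + ["[MASK]"] + tokens[pos + 1:]
    return masked, label


text = "".join(chr(0x4E00 + k) for k in range(31))  # synthetic text of 31 distinct characters
masked, label = mask_position(text)
print(len(masked), masked[15], label == text[14])    # 30 [MASK] True
```

Because the dataset keeps only texts longer than 30 characters, every encoded sequence fills all 30 positions, so index 15 always holds a real character rather than padding.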

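Chinese sentence relatedness task

This is another common BERT training task: given two sentences, decide whether they are related. The code follows and is again largely the same. The only change is that sents now holds sentence pairs, so two sentences are encoded at a time; everything else, including the fully connected layer, is the same as in the first task: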

import torch
from datasets import load_dataset
import random
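from transformers import BertTokenizer, BertModel
# transformers.AdamW is deprecated and may be missing in newer releases, so torch's
# AdamW is used instead (note: its default weight_decay is 0.01 rather than 0)
from torch.optim import AdamW

# Chinese sentence relatedness


# Define the dataset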
class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        dataset = load_dataset(path='datasets', split=split)

        def f(data):
            return len(data['text']) > 40
        self.dataset = dataset.filter(f)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]['text']
        # Split one text into a first half and a second half
        sentence1 = text[:20]
        sentence2 = text[20:40]
        label = 0
        # With probability 1/2, replace the second half with an unrelated one (label 1)
        if random.randint(0, 1) == 0:
            j = random.randint(0, len(self.dataset) - 1)
            sentence2 = self.dataset[j]['text'][20:40]
            label = 1
        return sentence1, sentence2, label


dataset = Dataset('train')

# Load the vocabulary and tokenizer
token = BertTokenizer.from_pretrained('bert-base-chinese')


def collate_fn(data):
    sents = [i[:2] for i in data]
    labels = [i[2] for i in data]

    # Encode the batch
    data = token.batch_encode_plus(batch_text_or_text_pairs=sents,
                                   truncation=True,
                                   padding='max_length',
                                   max_length=45,
                                   return_tensors='pt',
                                   return_length=True,
                                   add_special_tokens=True)
    # input_ids: the token ids after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    # token_type_ids: 0 for the first sentence and its special tokens, 1 for the second sentence and its closing [SEP]
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    labels = torch.LongTensor(labels)
    # print(data['length'], data['length'].max())
    return input_ids, attention_mask, token_type_ids, labels


# Data loader
loader = torch.utils.data.DataLoader(dataset=dataset,
                                     batch_size=8,
                                     collate_fn=collate_fn,
                                     shuffle=True,
                                     drop_last=True)
for i, (input_ids, attention_mask, token_type_ids,
        labels) in enumerate(loader):
    break

# print(len(loader))
# print(token.decode(input_ids[0]))
# print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels)

# Load the pretrained model
pretrained = BertModel.from_pretrained('model/bert-base-chinese')
# The encoder is not trained, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)
# Trial forward pass
out = pretrained(input_ids=input_ids,
           attention_mask=attention_mask,
           token_type_ids=token_type_ids)
# print(out.last_hidden_state.shape)


# Define the downstream task model
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids,
                             attention_mask=attention_mask,
                             token_type_ids=token_type_ids)

        out = self.fc(out.last_hidden_state[:, 0])
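        # Note: torch.nn.CrossEntropyLoss applies log-softmax itself, so it is normally
        # given raw logits; the softmax here is kept as in the original code.
        out = out.softmax(dim=1)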
        return out


model = Model()
# print(model(input_ids=input_ids,
#       attention_mask=attention_mask,
#       token_type_ids=token_type_ids).shape)

# Training
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for i, (input_ids, attention_mask, token_type_ids,
        labels) in enumerate(loader):
    out = model(input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids)
    loss = criterion(out, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if i % 5 == 0:
        out = out.argmax(dim=1)
        accuracy = (out == labels).sum().item() / len(labels)
        print(i, loss.item(), accuracy)
    if i == 300:
        break


# Evaluation
def test():
    model.eval()
    correct = 0
    total = 0

    loader_test = torch.utils.data.DataLoader(dataset=Dataset('test'),
                                              batch_size=32,
                                              collate_fn=collate_fn,
                                              shuffle=True,
                                              drop_last=True)
    for i, (input_ids, attention_mask, token_type_ids,
            labels) in enumerate(loader_test):
        if i == 5:
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        pred = out.argmax(dim=1)
        correct += (pred == labels).sum().item()
        total += len(labels)
    print(correct / total)


test()
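
The pair construction in __getitem__ can be isolated as a small function. In this sketch, `make_pair` is a hypothetical helper; it takes an explicit random.Random for reproducibility and, unlike the code above, redraws when the random index equals the current one, which is an added assumption. Note the label convention: 0 means the second half really follows the first, 1 means it was replaced.

```python
import random


def make_pair(texts: list[str], i: int, rng: random.Random):
    """Split texts[i] into two 20-character halves; swap the second half 50% of the time."""
    first, second, label = texts[i][:20], texts[i][20:40], 0
    if rng.randint(0, 1) == 0:
        j = rng.randrange(len(texts))
        while j == i:                      # needs at least two texts to terminate
            j = rng.randrange(len(texts))
        second, label = texts[j][20:40], 1
    return first, second, label


texts = [chr(0x4E00 + k) * 41 for k in range(4)]   # synthetic stand-ins for reviews
rng = random.Random(0)
for _ in range(5):
    print(make_pair(texts, 0, rng))
```

Encoded as a pair, each example has at most 20 + 20 characters plus three special tokens, 43 tokens in total, which fits within the max_length of 45 used in collate_fn.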