[LLMs] LoRA Model Fine-Tuning: An Example Based on the transformers Framework
This article gives a complete walkthrough of fine-tuning a model with LoRA, using the transformers framework (version 4.41.2).
The code can be found on my GitHub[1].
Install the dependencies
```
pip -q install datasets evaluate transformers[sentencepiece] peft
```
Among these, peft (Parameter-Efficient Fine-Tuning) is the package we rely on most: it contains the LoRA implementation (see the previous article for details).
Import the required packages
```python
from datasets import load_dataset, DatasetDict, Dataset
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    EvalPrediction
)
from peft import (
    PeftModel,
    PeftConfig,
    get_peft_model,
    LoraConfig
)
import evaluate
import torch
import numpy as np
```
Load the base model
1. Load the base model configuration
Every model hosted on the Hugging Face Hub comes with a configuration that defines its key parameters, such as the embedding dimension, vocabulary size, number of attention heads, and number of layers. As an example, we use the distilbert-base-uncased base model here. It is relatively small; other base models supported by transformers can be browsed here[2].
```python
model_checkpoint = "distilbert-base-uncased"
config = AutoConfig.from_pretrained(model_checkpoint)
```
Printing the config shows, among other things, that the embedding dimension is 768 and the vocabulary size is 30522.
```
DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.41.2",
  "vocab_size": 30522
}
```
2. Load the base model
We use distilbert-base-uncased as the base model for **sequence classification**; see BertForSequenceClassification[3] for details on sequence classification.
```python
config.num_labels = 2  # we have 2 classes
config.id2label = {0: "Negative", 1: "Positive"}
config.label2id = {"Negative": 0, "Positive": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, config=config)
```
Here, id2label and label2id are PretrainedConfig parameters used for fine-tuning; see PretrainedConfig[4] for details.
The fine-tuning dataset
1. Load the fine-tuning dataset
```python
dataset = load_dataset("shawhin/imdb-truncated")
dataset
```
Output:
```
DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
})
```
Let's print the first example of the train split:
```python
dataset['train'][0]
```
Output:
```
{'label': 1, 'text': '. . . or type on a computer keyboard, they'd probably give this eponymous film a rating of "10." After all, no elephants are shown being killed during the movie; it is not even implied that any are hurt. To the contrary, the master of ELEPHANT WALK, John Wiley (Peter Finch), complains that he cannot shoot any of the pachyderms–no matter how menacing–without a permit from the government (and his tone suggests such permits are not within the realm of probability). Furthermore, the elements conspire–in the form of an unusual drought and a human cholera epidemic–to leave the Wiley plantation house vulnerable to total destruction by the Elephant People (as the natives dub them) to close the story. If you happen to see the current release EARTH, you'll detect the Elephant People are faring less well today.'}
```
2. Tokenize the dataset
```python
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)
```
Check the special tokens:
```python
special_tokens = (
    tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token,
    tokenizer.unk_token, tokenizer.mask_token, tokenizer.sep_token,
    tokenizer.cls_token
)
special_token_ids = (
    tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id,
    tokenizer.unk_token_id, tokenizer.mask_token_id, tokenizer.sep_token_id,
    tokenizer.cls_token_id
)
dict(zip(special_tokens, special_token_ids))
```
The printed special tokens:
```
{None: None, '[PAD]': 0, '[UNK]': 100, '[MASK]': 103, '[SEP]': 102, '[CLS]': 101}
```
The tokenization step
We apply the tokenization function to the dataset with the .map() method.
```python
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["text"]

    # tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )
    return tokenized_inputs

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# tokenize training and validation datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset
```
Notes (each point can be verified with the quick check shown after the output below):

- The max_length here is kept consistent with max_position_embeddings in the config; both are 512.
- Truncation is done from the left (truncation_side = "left").
- len(tokenizer) equals vocab_size (30522).
Output:
```
DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})
```
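A quick way to verify the notes above (a minimal sketch, not in the original article):

```python
# Sanity-check the notes above.
print(config.max_position_embeddings)     # 512, matches the max_length used above
print(tokenizer.truncation_side)          # 'left' (set inside tokenize_function)
print(len(tokenizer), config.vocab_size)  # 30522 30522
```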
Let's take a closer look at what the tokenized result actually looks like.
```python
token_item = tokenized_dataset['train'][0]
len(token_item['input_ids'])
```
Here the input_ids of the first example has length 183, which tells us that tokenization did not pad the original text. Two things are worth noting:
- Even though tokenize_function() uses return_tensors="np", tokenized_dataset['train']['input_ids'] comes back as a **2D list**: each row corresponds to one example's text, and each example's input_ids is itself a list. These lists have different lengths, which is exactly why a list is used for storage: other data structures such as np.ndarray or torch.tensor require every example's input_ids to have the same length.
- After tokenization, the distilbert-base-uncased tokenizer has **automatically added** the special tokens [CLS] and [SEP] at the beginning and end of the text, respectively. See the code below:
```python
token_item['input_ids'][:10], token_item['attention_mask'][:10]
token_item['input_ids'][-10:], token_item['attention_mask'][-10:]
```
Output:
```
([101, 1012, 1012, 1012, 2030, 2828, 2006, 1037, 3274, 9019],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
([10777, 2111, 2024, 2521, 2075, 2625, 2092, 2651, 1012, 102],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
```
We can see that 101 is the id of [CLS] and 102 is the id of [SEP].
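To double-check the id-to-token mapping, here is a small sketch (not in the original article) using the tokenizer's lookup helpers:

```python
# Confirm which tokens the ids 101 and 102 correspond to.
print(tokenizer.convert_ids_to_tokens([101, 102]))     # ['[CLS]', '[SEP]']
print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # 101 102
```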
Data collation (data collator)
Since input_ids is a list and every example has a different length, how does input_ids get passed to the model's forward()? This is handled by a collator function that regularizes the data: it pads each example and converts the data from list to torch.tensor. The code is as follows:
```python
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
DataCollatorWithPadding does exactly this kind of collation; see DataCollatorWithPadding[5]. Note that **DataCollatorWithPadding cannot handle string data, so we need to remove the text column from tokenized_dataset**.
Let's run a quick test:
```python
tokenized_dataset_trn = tokenized_dataset['train'].remove_columns(['text'])
x = data_collator(tokenized_dataset_trn[:3])
x
```
The result is:
```
{'input_ids': tensor([[ 101, 1012, 1012,  ...,     0,     0,     0],
        [ 101, 2076, 4537,  ...,     0,     0,     0],
        [ 101, 2292, 2033,  ...,  5487, 23872,   102]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': tensor([1, 1, 0])}
```
Printing x['input_ids'].shape gives:
```
torch.Size([3, 487])
```
Why is the length 487 rather than 512? Because DataCollatorWithPadding pads each batch independently: 487 is simply the length of the longest example in the batch tokenized_dataset_trn[:3] (namely the third example), and this batch contains only 3 examples. This is easy to verify, as the short check below shows.
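A minimal check of this (not from the original article):

```python
# The collator pads to the longest sequence in the batch, not to max_length.
batch_lengths = [len(ids) for ids in tokenized_dataset_trn[:3]["input_ids"]]
print(batch_lengths)                                  # the longest of the three is 487
print(max(batch_lengths) == x["input_ids"].shape[1])  # True
```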
Load the evaluation metric
```python
# import accuracy evaluation metric
accuracy = evaluate.load("accuracy")

# define an evaluation function to pass into trainer later
def compute_metrics(p: EvalPrediction):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}
```
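To sanity-check compute_metrics outside the Trainer, we can call it on dummy logits and labels (a sketch, not part of the original article). Anything that unpacks into (predictions, labels) works here, and note that accuracy.compute itself returns a dict, so the result is nested:

```python
# Dummy logits for 2 examples over 2 classes, plus their true labels.
dummy_logits = np.array([[0.1, 0.9],   # argmax -> class 1
                         [0.8, 0.2]])  # argmax -> class 0
dummy_labels = np.array([1, 0])
print(compute_metrics((dummy_logits, dummy_labels)))
# e.g. {'accuracy': {'accuracy': 1.0}}
```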
Testing
Zero-shot performance test
First, let's mock up a small test set:
```python
# define list of examples
text_list = [
    "It was good.",
    "Not a fan, don't recommed.",
    "Better than the first one.",
    "This is not worth watching even once.",
    "This one is a pass.",
]
```
Tokenize the first example of the test set:
```python
x = tokenizer.encode(text_list[0], return_tensors="pt")
x
```
Result:
```
tensor([[ 101, 2009, 2001, 2204, 1012,  102]])
```
Decode the result:
```python
tokenizer.decode(x[0])
```
Result:
```
[CLS] it was good. [SEP]
```
Now that we know how the tokenizer encodes and decodes text, let's run the zero-shot test:
print("Untrained model predictions:") print("----------------------------") for text in text_list: # tokenize text inputs = tokenizer.encode(text, return_tensors="pt") # compute logits logits = model(inputs).logits # convert logits to label predictions = torch.argmax(logits) print(text + " - " + config.id2label[predictions.tolist()])
Output:
```
Untrained model predictions:
----------------------------
It was good. - Positive
Not a fan, don't recommed. - Positive
Better than the first one. - Positive
This is not worth watching even once. - Positive
This one is a pass. - Positive
```
Fine-tuning with LoRA
To fine-tune with LoRA, we first need to configure LoRA. The peft library provides the LoRA configuration class; see Hugging Face LoRA[6] and GitHub lora/config.py[7] for details.
```python
peft_config = LoraConfig(
    task_type="SEQ_CLS",       # sequence classification
    r=4,                       # intrinsic rank of trainable weight matrix
    lora_alpha=32,             # scaling factor for the LoRA update (roughly acts like a learning rate)
    lora_dropout=0.01,         # probability of dropout
    target_modules=['q_lin'])  # we apply lora to the query layer only
```
Here we set target_modules = ['q_lin']. So which modules can be targeted for LoRA? Generally modules such as nn.Linear and nn.Conv1D. Printing the distilbert-base-uncased model lets us see the q_lin module:
```
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
```
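If you would rather enumerate the candidate modules programmatically instead of reading the printout, here is a small sketch (not part of the original article):

```python
import torch.nn as nn

# Collect the unique leaf names of all nn.Linear submodules;
# any of these names could be passed to target_modules.
linear_leaf_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
}
print(linear_leaf_names)
# e.g. {'q_lin', 'k_lin', 'v_lin', 'out_lin', 'lin1', 'lin2', 'pre_classifier', 'classifier'}
```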
Also, we set task_type="SEQ_CLS" here. What other task_type values are there? See peft/utils/peft_types.py[8]:
```python
class TaskType(str, enum.Enum):
    """
    Enum class for the different types of tasks supported by PEFT.

    Overview of the supported task types:
    - SEQ_CLS: Text classification.
    - SEQ_2_SEQ_LM: Sequence-to-sequence language modeling.
    - CAUSAL_LM: Causal language modeling.
    - TOKEN_CLS: Token classification.
    - QUESTION_ANS: Question answering.
    - FEATURE_EXTRACTION: Feature extraction. Provides the hidden states which can be used as embeddings
      or features for downstream tasks.
    """

    SEQ_CLS = "SEQ_CLS"
    SEQ_2_SEQ_LM = "SEQ_2_SEQ_LM"
    CAUSAL_LM = "CAUSAL_LM"
    TOKEN_CLS = "TOKEN_CLS"
    QUESTION_ANS = "QUESTION_ANS"
    FEATURE_EXTRACTION = "FEATURE_EXTRACTION"
```
Different task_type values map to different PEFT model classes; see peft/mapping.py[9]:
```python
MODEL_TYPE_TO_PEFT_MODEL_MAPPING: dict[str, type[PeftModel]] = {
    "SEQ_CLS": PeftModelForSequenceClassification,
    "SEQ_2_SEQ_LM": PeftModelForSeq2SeqLM,
    "CAUSAL_LM": PeftModelForCausalLM,
    "TOKEN_CLS": PeftModelForTokenClassification,
    "QUESTION_ANS": PeftModelForQuestionAnswering,
    "FEATURE_EXTRACTION": PeftModelForFeatureExtraction,
}
```
After configuring LoRA, we wrap the model with peft:
```python
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```
The trainable parameter count printed is:

```
trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9307
```
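As a rough sanity check (a back-of-the-envelope sketch, not from the original article, and assuming that for SEQ_CLS peft also keeps DistilBERT's pre_classifier and classifier heads trainable via modules_to_save), the count can be reproduced by hand:

```python
# LoRA adds A (r x 768) and B (768 x r) to each of the 6 q_lin layers;
# the sequence-classification head is assumed to stay trainable as well.
r, d = 4, 768
lora_params = 6 * (r * d + d * r)        # 36,864
head_params = (d * d + d) + (d * 2 + 2)  # pre_classifier + classifier = 592,130
print(lora_params + head_params)         # 628,994, matching print_trainable_parameters()
```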
Run the fine-tuning
```python
# hyperparameters
lr = 1e-3        # size of optimization step
batch_size = 4   # number of examples processed per optimization step
num_epochs = 10  # number of times model runs through training data

# define training arguments
training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# create trainer object
trainer = Trainer(
    model=model,                                   # our peft model
    args=training_args,                            # hyperparameters
    train_dataset=tokenized_dataset["train"],      # training data
    eval_dataset=tokenized_dataset["validation"],  # validation data
    tokenizer=tokenizer,                           # define tokenizer
    data_collator=data_collator,                   # dynamically pads examples in each batch to equal length
    compute_metrics=compute_metrics,               # evaluates model using the compute_metrics() function from before
)

# train model
trainer.train()
```
The Trainer prints the training and evaluation results for each epoch as training proceeds.
After training, the model is still on the GPU; we need to move it to the CPU:
```python
model.device
# output
# device(type='cuda', index=0)

model.to('cpu')  # or 'mps' for Mac

model.device
# output
# device(type='cpu')
```
Test the fine-tuned model
print("Trained model predictions:") print("--------------------------") for text in text_list: inputs = tokenizer.encode(text, return_tensors="pt").to("cpu") # moving to mps for Mac (mps) (can alternatively do 'cpu') logits = model(inputs).logits predictions = torch.max(logits,1).indices print(text + " - " + config.id2label[predictions.tolist()[0]])
The result:
```
Trained model predictions:
--------------------------
It was good. - Positive
Not a fan, don't recommed. - Negative
Better than the first one. - Positive
This is not worth watching even once. - Negative
This one is a pass. - Negative
```
With that, we have walked through the entire process of fine-tuning a model with LoRA.
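As an optional follow-up (a minimal sketch, not part of the original walkthrough; the directory name "lora-imdb-adapter" is hypothetical), the PeftModel and PeftConfig imports from the top of the article can be used to save just the LoRA adapter and reload it later:

```python
# Save only the LoRA adapter weights (the base model is not duplicated).
model.save_pretrained("lora-imdb-adapter")

# Later / elsewhere: rebuild the base model and attach the adapter.
peft_config = PeftConfig.from_pretrained("lora-imdb-adapter")
base_model = AutoModelForSequenceClassification.from_pretrained(
    peft_config.base_model_name_or_path, num_labels=2
)
reloaded_model = PeftModel.from_pretrained(base_model, "lora-imdb-adapter")
```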