This article walks through the complete process of fine-tuning a model with LoRA, using the transformers framework (version 4.41.2).

The code can be found on my GitHub[1].

Installing the dependencies

```
pip -q install datasets evaluate transformers[sentencepiece] peft
```

peft (parameter-efficient fine-tuning) is the package we rely on most here: it contains the LoRA implementation discussed earlier.

Importing the required packages

```python
from datasets import load_dataset, DatasetDict, Dataset

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    EvalPrediction
)

from peft import (
    PeftModel,
    PeftConfig,
    get_peft_model,
    LoraConfig
)

import evaluate
import torch
import numpy as np
```

Loading the base model

1. Load the base model configuration

Every model hosted on the Hugging Face Hub has a corresponding configuration that defines its parameters, such as the embedding dimension, vocabulary size, number of attention heads, and number of layers. As an example we use the distilbert-base-uncased base model, which is relatively small; other base models provided by transformers can be browsed here[2].

```python
model_checkpoint = "distilbert-base-uncased"
config = AutoConfig.from_pretrained(model_checkpoint)
```

Printing the config shows, among other things, that the embedding dimension is 768 and the vocabulary size is 30522.

```
DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.41.2",
  "vocab_size": 30522
}
```

2. Load the base model

We use distilbert-base-uncased as the base model for **sequence classification**; see BertForSequenceClassification[3] for details on sequence classification.

```python
config.num_labels = 2  # we have 2 classes
config.id2label = {0: "Negative", 1: "Positive"}
config.label2id = {"Negative": 0, "Positive": 1}
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    config=config)
```

Here, id2label and label2id are PretrainedConfig parameters used for fine-tuning; see PretrainedConfig[4].
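As a quick illustrative check (my addition, not from the original post), these mappings end up on the model's config and are what later turn predicted class indices back into label names:

```python
# Illustrative check: the label mappings now live on the model's config.
print(model.config.id2label)   # {0: 'Negative', 1: 'Positive'}
print(model.config.label2id)   # {'Negative': 0, 'Positive': 1}
```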

The fine-tuning dataset

1. Load the fine-tuning dataset

```python
dataset = load_dataset("shawhin/imdb-truncated")
dataset
```

Output:

```
DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
})
```

Let's print the first example of the train split:

```python
dataset['train'][0]
```

Output:

```
{'label': 1, 'text': '. . . or type on a computer keyboard, they'd probably give this eponymous film a rating of "10." After all, no elephants are shown being killed during the movie; it is not even implied that any are hurt. To the contrary, the master of ELEPHANT WALK, John Wiley (Peter Finch), complains that he cannot shoot any of the pachyderms–no matter how menacing–without a permit from the government (and his tone suggests such permits are not within the realm of probability). Furthermore, the elements conspire–in the form of an unusual drought and a human cholera epidemic–to leave the Wiley plantation house vulnerable to total destruction by the Elephant People (as the natives dub them) to close the story. If you happen to see the current release EARTH, you'll detect the Elephant People are faring less well today.'}
```
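Before tokenizing, a small optional sanity check (my addition, not from the original post; I have not verified the split statistics, so no output is shown) is to look at the label balance of each split:

```python
# Optional sanity check: label balance per split.
from collections import Counter

print(Counter(dataset["train"]["label"]))
print(Counter(dataset["validation"]["label"]))
```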

2. Tokenize the dataset

```python
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)
```

Inspect the special tokens:

```python
special_tokens = (
    tokenizer.bos_token,
    tokenizer.eos_token,
    tokenizer.pad_token,
    tokenizer.unk_token,
    tokenizer.mask_token,
    tokenizer.sep_token,
    tokenizer.cls_token
)
special_token_ids = (
    tokenizer.bos_token_id,
    tokenizer.eos_token_id,
    tokenizer.pad_token_id,
    tokenizer.unk_token_id,
    tokenizer.mask_token_id,
    tokenizer.sep_token_id,
    tokenizer.cls_token_id
)

dict(zip(special_tokens, special_token_ids))
```

The special tokens are:

```
{None: None,
 '[PAD]': 0,
 '[UNK]': 100,
 '[MASK]': 103,
 '[SEP]': 102,
 '[CLS]': 101}
```
The tokenization step

We use the .map() function to apply the tokenization function to the whole dataset.

```python
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["text"]

    # tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# tokenize training and validation datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset
```

Notes:

  • max_length here is kept consistent with max_position_embeddings in the config; both are 512.

  • Truncation is applied from the left (truncation_side = "left"); see the short sketch after these notes.

  • len(tokenizer) equals vocab_size (30522).
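To make the left-truncation behaviour concrete, here is a minimal sketch (my own toy example, not from the original post) comparing the two truncation sides:

```python
# Toy comparison of truncation sides (illustrative only).
demo_text = "word " * 600  # longer than max_length once tokenized

tokenizer.truncation_side = "right"
right_ids = tokenizer(demo_text, truncation=True, max_length=512)["input_ids"]

tokenizer.truncation_side = "left"   # restore "left", as tokenize_function expects
left_ids = tokenizer(demo_text, truncation=True, max_length=512)["input_ids"]

# Both are 512 ids long, but right truncation keeps the start of the text,
# while left truncation keeps the end (useful when the end of a review matters).
print(len(right_ids), len(left_ids))
```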

The output of the tokenized dataset above is:

```
DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})
```

Let's take a closer look at what the tokenized result actually contains.

```python
token_item = tokenized_dataset['train'][0]
len(token_item['input_ids'])
```

Here we can see that the input_ids of the first example has length 183, which shows that the original text was not padded during tokenization. A few points worth noting:

  • Even though return_tensors="np" is set in tokenize_function() above, tokenized_dataset['train']['input_ids'] comes back as a **2-D list**, where each row corresponds to one example's text, and each example's input_ids is itself a list whose length varies from example to example. This also explains why a list is used for storage: other data structures such as np.ndarray or torch.tensor require every example's input_ids to have the same length.

  • After tokenization, the distilbert-base-uncased tokenizer has **automatically added** the special tokens [CLS] and [SEP] at the beginning and end of the text, respectively. See the code below:

```python
token_item['input_ids'][:10], token_item['attention_mask'][:10]

token_item['input_ids'][-10:], token_item['attention_mask'][-10:]
```

Output:

```
([101, 1012, 1012, 1012, 2030, 2828, 2006, 1037, 3274, 9019],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

([10777, 2111, 2024, 2521, 2075, 2625, 2092, 2651, 1012, 102],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
```

You can see that 101 is the id of [CLS] and 102 is the id of [SEP].
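An illustrative cross-check (my addition) is to map those ids back to tokens:

```python
# Cross-check: ids 101 and 102 correspond to [CLS] and [SEP].
print(tokenizer.convert_ids_to_tokens([101, 102]))      # ['[CLS]', '[SEP]']
print(tokenizer.cls_token_id, tokenizer.sep_token_id)   # 101 102
```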

Data collation (data collator)

Since input_ids is a list and every example has a different length, how does input_ids get passed to the model's forward()? This is handled by a collate function that regularizes the data: it pads each example and converts the data type from list to torch.tensor. The code is as follows:

```python
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

DataCollatorWithPadding performs exactly this regularization; see DataCollatorWithPadding[5]. Note that DataCollatorWithPadding cannot handle string-typed data, so we need to **remove the text column** from tokenized_dataset.

So we run the following test:

```python
tokenized_dataset_trn = tokenized_dataset['train'].remove_columns(['text'])
x = data_collator(tokenized_dataset_trn[:3])
x
```

Which gives:

```
{'input_ids': tensor([[  101,  1012,  1012,  ...,     0,     0,     0],
        [  101,  2076,  4537,  ...,     0,     0,     0],
        [  101,  2292,  2033,  ...,  5487, 23872,   102]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': tensor([1, 1, 0])}
```

And printing x['input_ids'].shape gives:

```
torch.Size([3, 487])
```

Why is the length 487 rather than 512? Because DataCollatorWithPadding regularizes each batch independently: 487 is simply the length of the longest example in the batch tokenized_dataset_trn[:3] (here the third example), and this batch contains only 3 examples.
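To see this per-batch dynamic padding in the form the Trainer uses it internally, here is a minimal sketch (my addition; the Trainer builds an equivalent DataLoader for you):

```python
# The collator pads each DataLoader batch only to that batch's longest example.
from torch.utils.data import DataLoader

loader = DataLoader(
    tokenized_dataset_trn,      # 'text' column already removed above
    batch_size=3,
    shuffle=False,
    collate_fn=data_collator,
)

first_batch = next(iter(loader))
print(first_batch["input_ids"].shape)   # should match torch.Size([3, 487]) for the first three examples
```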

Loading the evaluation metric

```python
# import accuracy evaluation metric
accuracy = evaluate.load("accuracy")

# define an evaluation function to pass into trainer later
def compute_metrics(p: EvalPrediction):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)

    return {"accuracy": accuracy.compute(predictions=predictions,
                                         references=labels)}
```
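A quick way to sanity-check the metric function before training (my addition, using made-up toy logits) is:

```python
# Sanity check with toy logits (illustrative only).
fake_logits = np.array([[0.1, 0.9],    # predicted class 1
                        [0.8, 0.2]])   # predicted class 0
fake_labels = np.array([1, 1])
print(compute_metrics(EvalPrediction(predictions=fake_logits, label_ids=fake_labels)))
# expected: {'accuracy': {'accuracy': 0.5}} -- note accuracy.compute() itself returns a dict
```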

Testing

Zero-shot performance test

First, we mock up a small test set:

```python
# define list of examples
text_list = ["It was good.", "Not a fan, don't recommed.",
             "Better than the first one.", "This is not worth watching even once.",
             "This one is a pass."]
```

We tokenize the test examples:

```python
x = tokenizer.encode(text_list[0], return_tensors="pt")
x
```

Result:

```
tensor([[ 101, 2009, 2001, 2204, 1012,  102]])
```

Decoding this result:

```python
tokenizer.decode(x[0])
```

Gives:

```
[CLS] it was good. [SEP]
```

Now that we know how the model encodes and decodes text, we run the zero-shot test:

print("Untrained model predictions:")   print("----------------------------")   for text in text_list:       # tokenize text       inputs = tokenizer.encode(text, return_tensors="pt")       # compute logits       logits = model(inputs).logits       # convert logits to label       predictions = torch.argmax(logits)          print(text + " - " + config.id2label[predictions.tolist()])   

Result:

```
Untrained model predictions:
----------------------------
It was good. - Positive
Not a fan, don't recommed. - Positive
Better than the first one. - Positive
This is not worth watching even once. - Positive
This one is a pass. - Positive
```

Fine-tuning with LoRA

To fine-tune with LoRA we first need a LoRA configuration. The peft library provides the configuration class; see Huggingface LoRA[6] and Github lora/config.py[7] for details.

```python
peft_config = LoraConfig(
    task_type="SEQ_CLS",      # sequence classification
    r=4,                      # intrinsic rank of trainable weight matrix
    lora_alpha=32,            # this is like a learning rate
    lora_dropout=0.01,        # probability of dropout
    target_modules=['q_lin']) # we apply lora to query layer only
```

Here we set target_modules = ['q_lin']. So which modules can actually be targeted for fine-tuning? Typically modules such as nn.Linear and nn.Conv1D. Printing the distilbert-base-uncased model shows the q_lin module:

```
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
```
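For larger models where reading the printed module tree is tedious, here is a minimal sketch (my addition) that lists candidate Linear modules programmatically:

```python
# List all nn.Linear module names; any of these could be passed to target_modules.
import torch.nn as nn

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)
# e.g. distilbert.transformer.layer.0.attention.q_lin, ...k_lin, ...v_lin, ...out_lin,
#      ...ffn.lin1, ...ffn.lin2, pre_classifier, classifier
```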

Also, we used task_type="SEQ_CLS" here. What task types are available? See peft/utils/peft_types.py[8]:

```python
class TaskType(str, enum.Enum):
    """
    Enum class for the different types of tasks supported by PEFT.

    Overview of the supported task types:
    - SEQ_CLS: Text classification.
    - SEQ_2_SEQ_LM: Sequence-to-sequence language modeling.
    - CAUSAL_LM: Causal language modeling.
    - TOKEN_CLS: Token classification.
    - QUESTION_ANS: Question answering.
    - FEATURE_EXTRACTION: Feature extraction. Provides the hidden states which can be used as embeddings or features
      for downstream tasks.
    """

    SEQ_CLS = "SEQ_CLS"
    SEQ_2_SEQ_LM = "SEQ_2_SEQ_LM"
    CAUSAL_LM = "CAUSAL_LM"
    TOKEN_CLS = "TOKEN_CLS"
    QUESTION_ANS = "QUESTION_ANS"
    FEATURE_EXTRACTION = "FEATURE_EXTRACTION"
```

Different task_type values map to different PEFT model classes; see peft/mapping.py[9]:

```python
MODEL_TYPE_TO_PEFT_MODEL_MAPPING: dict[str, type[PeftModel]] = {
    "SEQ_CLS": PeftModelForSequenceClassification,
    "SEQ_2_SEQ_LM": PeftModelForSeq2SeqLM,
    "CAUSAL_LM": PeftModelForCausalLM,
    "TOKEN_CLS": PeftModelForTokenClassification,
    "QUESTION_ANS": PeftModelForQuestionAnswering,
    "FEATURE_EXTRACTION": PeftModelForFeatureExtraction,
}
```

With LoRA configured, we wrap the model with peft:

```python
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```

The printed trainable parameter count is: trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9307
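If you want to see exactly which parameters those are, a small sketch (my addition):

```python
# Inspect which parameters LoRA left trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(len(trainable))
print(trainable[:4])
# Expect lora_A / lora_B matrices on each q_lin, plus the pre_classifier and
# classifier head, which PEFT keeps trainable for the SEQ_CLS task type.
```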

Running the fine-tuning

```python
# hyperparameters
lr = 1e-3 # size of optimization step
batch_size = 4 # number of examples processed per optimization step
num_epochs = 10 # number of times model runs through training data

# define training arguments
training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# create trainer object
trainer = Trainer(
    model=model, # our peft model
    args=training_args, # hyperparameters
    train_dataset=tokenized_dataset["train"], # training data
    eval_dataset=tokenized_dataset["validation"], # validation data
    tokenizer=tokenizer, # define tokenizer
    data_collator=data_collator, # this will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics, # evaluates model using compute_metrics() function from before
)

# train model
trainer.train()
```

The training results are as follows (the per-epoch log is omitted here).
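If you prefer the final validation metrics as a dictionary rather than reading the log, a hedged option (my addition, not in the original post) is:

```python
# Evaluate the fine-tuned model on the validation split
# (run this while the model is still on its training device).
metrics = trainer.evaluate(eval_dataset=tokenized_dataset["validation"])
print(metrics)
```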

After training the model is still on the GPU, so we move it to the CPU:

```python
model.device
# output
# device(type='cuda', index=0)

model.to('cpu') # or 'mps' for Mac
model.device
# output
# device(type='cpu')
```
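The imports at the top include PeftModel, which the post never actually uses. As a hedged aside (my addition; the directory name is made up), saving and reloading just the LoRA adapter could look like this:

```python
# Save only the LoRA adapter weights and config (directory name is hypothetical).
model.save_pretrained("distilbert-lora-imdb-adapter")

# Later: reload by attaching the adapter to a freshly loaded base model.
base = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)
reloaded = PeftModel.from_pretrained(base, "distilbert-lora-imdb-adapter")
reloaded.eval()
```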

Testing the fine-tuned model

print("Trained model predictions:")   print("--------------------------")   for text in text_list:       inputs = tokenizer.encode(text, return_tensors="pt").to("cpu") # moving to mps for Mac (mps) (can alternatively do 'cpu')          logits = model(inputs).logits       predictions = torch.max(logits,1).indices          print(text + " - " + config.id2label[predictions.tolist()[0]])   

The result is:

```
Trained model predictions:
--------------------------
It was good. - Positive
Not a fan, don't recommed. - Negative
Better than the first one. - Positive
This is not worth watching even once. - Negative
This one is a pass. - Negative
```

That completes the full process of fine-tuning a model with LoRA.

