Preface
This article explains how to perform causal language modeling.
Hugging Face GitHub homepage: https://github.com/huggingface
There are two types of language modeling: causal and masked. This guide covers causal language modeling. Causal language models are frequently used for text generation. Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left; this means the model cannot see future tokens. GPT-2 is an example of a causal language model.
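To make this concrete, here is a minimal sketch (reusing the distilgpt2 checkpoint that this guide fine-tunes below) of how a causal language model scores candidates for the next token using only the tokens to its left:
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> inputs = tokenizer("The capital of France is", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> # The logits at the last position are the scores for the *next* token
>>> tokenizer.decode(logits[0, -1].argmax())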
This guide will show you how to:
- Finetune DistilGPT2 on the ELI5-Category dataset (built from the r/explainlikeimfive subreddit).
- Use your finetuned model for inference.
You can follow the same steps in this guide to finetune other architectures for causal language modeling. Choose one of the following architectures:
BART, BERT, Bert-Generation, BigBird, BigBird-Pegasus, BioGpt, Blenderbot, BlenderbotSmall, BLOOM, CamemBERT, CodeLlama, CodeGen, Cohere, CPM-Ant, CTRL, Data2VecText, ELECTRA, ERNIE, Falcon, Fuyu, Gemma, GIT, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, GPT NeoX Japanese, GPT-J, LLaMA, Mamba, Marian, mBART, MEGA, Megatron-BERT, Mistral, Mixtral, MPT, MusicGen, MusicGen Melody, MVP, OpenLlama, OpenAI GPT, OPT, Pegasus, Persimmon, Phi, PLBart, ProphetNet, QDQBert, Qwen2, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, RWKV, Speech2Text2, StableLm, Starcoder2, Transformer-XL, TrOCR, Whisper, XGLM, XLM, XLM-ProphetNet, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login
>>> notebook_login()
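If you're working in a terminal rather than a notebook, you can log in with the Hugging Face CLI instead:
huggingface-cli login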
Operating system: Windows 11 Home (Chinese edition)
Reference documentation: https://huggingface.co/docs/transformers/tasks/language_modeling
Load ELI5-Category dataset
Start by loading the first 5000 examples from the ELI5-Category dataset with the 🤗 Datasets library. This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
>>> from datasets import load_dataset
>>> eli5 = load_dataset("eli5_category", split="train[:5000]")
Split the dataset's train split into a train and test set with the [~datasets.Dataset.train_test_split] method:
>>> eli5 = eli5.train_test_split(test_size=0.2)
Then take a look at an example:
>>> eli5["train"][0]
{'q_id': '7h191n',
'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
'selftext': '',
'category': 'Economics',
'subreddit': 'explainlikeimfive',
'answers': {'a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
'text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
'score': [21, 19, 5, 3],
'text_urls': [[],
[],
[],
['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']]},
'title_urls': ['url'],
'selftext_urls': ['url']}
While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is that you don't need labels (also known as an unsupervised task) because the next word is the label.
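As a minimal sketch of what "the next word is the label" means (a made-up token sequence for illustration), the target at each position is simply the token that follows it:
>>> tokens = ["Somatic", "hypermutation", "allows", "the", "immune"]
>>> inputs = tokens[:-1]  # ["Somatic", "hypermutation", "allows", "the"]
>>> labels = tokens[1:]   # ["hypermutation", "allows", "the", "immune"]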
Preprocess
The next step is to load a DistilGPT2 tokenizer to process the text subfield:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
You'll notice from the example above that the **text field is actually nested inside answers.** This means you'll need to extract the text subfield from its nested structure with the flatten method:
>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'q_id': '7h191n',
'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
'selftext': '',
'category': 'Economics',
'subreddit': 'explainlikeimfive',
'answers.a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
'answers.text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
'answers.score': [21, 19, 5, 3],
'answers.text_urls': [[],
[],
[],
['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']],
'title_urls': ['url'],
'selftext_urls': ['url']}
Each subfield is now a separate column, as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.
Here is a first preprocessing function to join the list of strings for each example and tokenize the result:
>>> def preprocess_function(examples):
... return tokenizer([" ".join(x) for x in examples["answers.text"]])
To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [~datasets.Dataset.map] method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and increase the number of processes with num_proc. Remove any columns you don't need:
>>> tokenized_eli5 = eli5.map(
... preprocess_function,
... batched=True,
... num_proc=4,
... remove_columns=eli5["train"].column_names,
... )
This dataset contains the token sequences, but some of them are longer than the model's maximum input length. You can now use a second preprocessing function to:
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM
>>> block_size = 128
>>> def group_texts(examples):
... # Concatenate all texts.
... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
... total_length = len(concatenated_examples[list(examples.keys())[0]])
... # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
... # customize this part to your needs.
... if total_length >= block_size:
... total_length = (total_length // block_size) * block_size
... # Split by chunks of block_size.
... result = {
... k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
... for k, t in concatenated_examples.items()
... }
... result["labels"] = result["input_ids"].copy()
... return result
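For example, with the block_size of 128 set above, a hypothetical batch containing a single 300-token sequence yields two full 128-token blocks, and the 44-token remainder is dropped:
>>> fake_batch = {"input_ids": [list(range(300))], "attention_mask": [[1] * 300]}
>>> [len(chunk) for chunk in group_texts(fake_batch)["input_ids"]]
[128, 128]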
Apply the group_texts function over the entire dataset:
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
Now create a batch of examples using [DataCollatorForLanguageModeling]. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
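As a quick sanity check (assuming the lm_dataset built above), you can collate a couple of examples and confirm that the collator returns a labels tensor alongside input_ids:
>>> batch = data_collator([lm_dataset["train"][i] for i in range(2)])
>>> batch["input_ids"].shape, batch["labels"].shape
Both shapes should be (2, 128) here, since group_texts made every example exactly block_size tokens long.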
Train
You're ready to start training your model now! Load DistilGPT2 with [AutoModelForCausalLM]:
>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
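Optionally, you can sanity-check the size of the model you just loaded (the DistilGPT2 model card reports roughly 82 million parameters):
>>> print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")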
At this point, only three steps remain:
- Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. Setting push_to_hub=True pushes this model to the Hub (you need to be signed in to Hugging Face to upload your model).
- Pass the training arguments to [Trainer] along with the model, datasets, and data collator.
- Call [~Trainer.train] to finetune your model.
>>> training_args = TrainingArguments(
... output_dir="my_awesome_eli5_clm-model",
... evaluation_strategy="epoch",
... learning_rate=2e-5,
... weight_decay=0.01,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=lm_dataset["train"],
... eval_dataset=lm_dataset["test"],
... data_collator=data_collator,
... )
>>> trainer.train()
Once training is completed, use the [~transformers.Trainer.evaluate] method to evaluate your model and get its perplexity, which is the exponential of the evaluation cross-entropy loss:
>>> import math
>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 49.61
Then share your model to the Hub with the [~transformers.Trainer.push_to_hub] method so everyone can use it:
>>> trainer.push_to_hub()
For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Come up with a prompt you'd like to generate text from.
>>> prompt = "Somatic hypermutation allows the immune system to"
The simplest way to try out your finetuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for text generation with your model, and pass your text to it:
>>> from transformers import pipeline
>>> generator = pipeline("text-generation", model="username/my_awesome_eli5_clm-model")
>>> generator(prompt)
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]
Tokenize the text and return the input_ids as PyTorch tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids
Use the [~transformers.generation_utils.GenerationMixin.generate] method to generate text.
For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("username/my_awesome_eli5_clm-model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
Decode the generated token ids back into text:
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]
Conclusion
That wraps up my 132nd blog post. So happy!!!!
Today is another day full of hope.