00279 Summarization


Preface

This post explains how to perform summarization.

Hugging Face GitHub page: https://github.com/huggingface

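Summarization creates a shorter version of a document or article that captures all the important information. Like translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:

  • Extractive: extract the most relevant information from a document.
  • Abstractive: generate new text that captures the most relevant information.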
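To make the difference concrete, here is a minimal sketch of the extractive approach, assuming a naive word-frequency score (a toy heuristic chosen for illustration, not what T5 or this guide uses). It can only return sentences that already exist in the document, whereas the abstractive model fine-tuned below generates new text.

```python
import re
from collections import Counter


def extractive_summary(document, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = (
    "The bill lowers drug costs. "
    "The bill lowers energy costs and drug costs for families. "
    "Critics disagree."
)
summary = extractive_summary(doc, num_sentences=1)
print(summary)  # The bill lowers drug costs.
```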

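This guide will show you how to:

  1. Fine-tune T5 on the California state bill subset of the BillSum dataset for abstractive summarization.
  2. Use your fine-tuned model for inference.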

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate rouge_score

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Operating system: Windows 11 Home (Chinese edition)

References

  1. Summarization

Load the BillSum dataset

Start by loading the smaller California state bill subset of the BillSum dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset

>>> billsum = load_dataset("billsum", split="ca_test")

Split the dataset into a training set and a test set with the train_test_split method:

>>> billsum = billsum.train_test_split(test_size=0.2)
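As a rough illustration of what a shuffled 80/20 split does, the sketch below shuffles row indices and cuts them in two. The helper `split_indices` is hypothetical and the rounding of the test share is an assumption; the library's own implementation may round differently. Because `train_test_split` shuffles by default, passing a fixed `seed` is what makes a split reproducible across runs.

```python
import math
import random


def split_indices(num_rows, test_size=0.2, seed=0):
    """Shuffle row indices and cut them into (train, test) index lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: round the test share up; the library's rounding may differ.
    num_test = math.ceil(num_rows * test_size)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(10, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 8 2
```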

Then take a look at an example:

>>> billsum["train"][0]
{'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. By expanding the scope of a crime, this bill would impose a state-mandated local program.\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\nThis bill would provide that no reimbursement is required by this act for a specified reason.',
 'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 10295.35 is added to the Public Contract Code, to read:\n10295.35.\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more with a contractor that, in the provision of benefits, discriminates between employees on the basis of an employee’s or dependent’s actual or perceived gender identity, including, but not limited to, the employee’s or dependent’s identification as transgender.\n(2) For purposes of this section, “contract” includes contracts with a cumulative amount of one hundred thousand dollars ($100,000) or more per contractor in each fiscal year.\n(3) For purposes of this section, an employee health plan is discriminatory if the plan is not consistent with Section 1365.5 of the Health and Safety Code and Section 10140 of the Insurance Code.\n(4) The requirements of this section shall apply only to those portions of a contractor’s operations that occur under any of the following conditions:\n(A) Within the state.\n(B) On real property outside the state if the property is owned by the state or if the state has a right to occupy the property, and if the contractor’s presence at that location is connected to a contract with the state.\n(C) Elsewhere in the United States where work related to a state contract is being performed.\n(b) Contractors shall treat as confidential, to the maximum extent allowed by law or by the requirement of the contractor’s insurance provider, any request by an employee or applicant for employment benefits or any documentation of eligibility for benefits submitted by an employee or applicant for employment.\n(c) After taking all reasonable measures to find a contractor that complies with this section, as determined by the state agency, the requirements of this section may be waived under any of the 
following circumstances:\n(1) There is only one prospective contractor willing to enter into a specific contract with the state agency.\n(2) The contract is necessary to respond to an emergency, as determined by the state agency, that endangers the public health, welfare, or safety, or the contract is necessary for the provision of essential services, and no entity that complies with the requirements of this section capable of responding to the emergency is immediately available.\n(3) The requirements of this section violate, or are inconsistent with, the terms or conditions of a grant, subvention, or agreement, if the agency has made a good faith attempt to change the terms or conditions of any grant, subvention, or agreement to authorize application of this section.\n(4) The contractor is providing wholesale or bulk water, power, or natural gas, the conveyance or transmission of the same, or ancillary services, as required for ensuring reliable services in accordance with good utility practice, if the purchase of the same cannot practically be accomplished through the standard competitive bidding procedures and the contractor is not providing direct retail services to end users.\n(d) (1) A contractor shall not be deemed to discriminate in the provision of benefits if the contractor, in providing the benefits, pays the actual costs incurred in obtaining the benefit.\n(2) If a contractor is unable to provide a certain benefit, despite taking reasonable measures to do so, the contractor shall not be deemed to discriminate in the provision of benefits.\n(e) (1) Every contract subject to this chapter shall contain a statement by which the contractor certifies that the contractor is in compliance with this section.\n(2) The department or other contracting agency shall enforce this section pursuant to its existing enforcement powers.\n(3) (A) If a contractor falsely certifies that it is in compliance with this section, the contract with that contractor shall be subject 
to Article 9 (commencing with Section 10420), unless, within a time period specified by the department or other contracting agency, the contractor provides to the department or agency proof that it has complied, or is in the process of complying, with this section.\n(B) The application of the remedies or penalties contained in Article 9 (commencing with Section 10420) to a contract subject to this chapter shall not preclude the application of any existing remedies otherwise available to the department or other contracting agency under its existing enforcement powers.\n(f) Nothing in this section is intended to regulate the contracting practices of any local jurisdiction.\n(g) This section shall be construed so as not to conflict with applicable federal laws, rules, or regulations. In the event that a court or agency of competent jurisdiction holds that federal law, rule, or regulation invalidates any clause, sentence, paragraph, or section of this code or the application thereof to any person or circumstances, it is the intent of the state that the court or agency sever that clause, sentence, paragraph, or section so that the remainder of this section shall remain in effect.\nSEC. 2.\nSection 10295.35 of the Public Contract Code shall not be construed to create any new enforcement authority or responsibility in the Department of General Services or any other contracting agency.\nSEC. 3.\nNo reimbursement is required by this act pursuant to Section 6 of Article XIII\u2009B of the California Constitution because the only costs that may be incurred by a local agency or school district will be incurred because this act creates a new crime or infraction, eliminates a crime or infraction, or changes the penalty for a crime or infraction, within the meaning of Section 17556 of the Government Code, or changes the definition of a crime within the meaning of Section 6 of Article XIII\u2009B of the California Constitution.',
 'title': 'An act to add Section 10295.35 to the Public Contract Code, relating to public contracts.'}

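There are two fields that you'll want to use:

  1. text: the text of the bill, which will be the input to the model.
  2. summary: a condensed version of text, which will be the model target.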

Preprocess

The next step is to load a T5 tokenizer to process text and summary:

>>> from transformers import AutoTokenizer

>>> checkpoint = "google-t5/t5-small"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)

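The preprocessing function you want to create needs to:

  1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
  2. Use the keyword text_target argument when tokenizing labels.
  3. Truncate sequences to be no longer than the maximum length set by the max_length parameter.

>>> prefix = "summarize: "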


>>> def preprocess_function(examples):
...     inputs = [prefix + doc for doc in examples["text"]]
...     model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
...
...     labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)
...
...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs
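The effect of the prefix and of truncation can be seen with a toy stand-in for the tokenizer. This is only a sketch: `toy_encode` is a hypothetical helper that splits on whitespace, while the real T5 tokenizer produces subword ids and appends an end-of-sequence token.

```python
def toy_encode(text, max_length):
    """Stand-in for a tokenizer call with truncation=True: one token per word."""
    tokens = text.split()
    # Keep at most max_length tokens and drop the rest.
    return tokens[:max_length]


prefix = "summarize: "
document = "The people of the State of California do enact as follows"
target = "This bill would add a section to the Public Contract Code"

model_input = toy_encode(prefix + document, max_length=8)
labels = toy_encode(target, max_length=4)

print(model_input)  # ['summarize:', 'The', 'people', 'of', 'the', 'State', 'of', 'California']
print(labels)       # ['This', 'bill', 'would', 'add']
```

Note that the prefix counts toward the input budget, so a long bill loses a little more of its tail once the prefix is added.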

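To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up map by setting batched=True to process multiple elements of the dataset at once: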

>>> tokenized_billsum = billsum.map(preprocess_function, batched=True)
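With batched=True the function receives a dictionary of lists (one list per column) rather than a single example, which is why preprocess_function iterates over examples["text"]. The sketch below imitates that calling convention in plain Python. `toy_batched_map` is a hypothetical helper, and unlike the real map it keeps only the columns the function returns.

```python
def toy_batched_map(rows, function, batch_size=1000):
    """Call `function` on column-wise batches and collect the output rows."""
    output_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict of lists: one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        # Turn the column-wise result back into one dict per row.
        num_out = len(next(iter(result.values())))
        for i in range(num_out):
            output_rows.append({key: values[i] for key, values in result.items()})
    return output_rows


def add_prefix(examples):
    return {"text": ["summarize: " + doc for doc in examples["text"]]}


rows = [{"text": "bill one"}, {"text": "bill two"}, {"text": "bill three"}]
mapped = toy_batched_map(rows, add_prefix, batch_size=2)
print(mapped[0])  # {'text': 'summarize: bill one'}
```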

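Now create a batch of examples using DataCollatorForSeq2Seq. It is more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length.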

>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
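Dynamic padding can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the collator's actual code: `pad_batch` is a hypothetical helper, 0 is assumed to be the pad token id (as it is for T5), and labels are padded with -100 so that the loss can skip those positions.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Label padding uses -100 so the loss can skip those positions.
        l_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * l_pad)
    return batch


# Made-up token ids, for illustration only.
features = [
    {"input_ids": [21, 13, 7, 1], "labels": [5, 1]},
    {"input_ids": [21, 9, 1], "labels": [8, 4, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[21, 13, 7, 1], [21, 9, 1, 0]]
print(batch["labels"])     # [[5, 1, -100], [8, 4, 1]]
```

Because the padded length depends only on the longest item in each batch, a batch of short bills costs far less compute than one padded to 1024 tokens.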

Evaluate

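Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the ROUGE metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):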

>>> import evaluate

>>> rouge = evaluate.load("rouge")
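For intuition about what ROUGE measures, here is a minimal sketch of ROUGE-1 as a unigram-overlap F1 score. It is a simplification: `rouge1_f1` is a hypothetical helper that splits on whitespace and does no stemming, so its numbers will not match the metric loaded above exactly.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a single reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the bill lowers drug costs", "the bill would lower drug costs")
print(round(score, 4))  # 0.7273
```

Here four of the five predicted words appear in the six-word reference, giving a precision of 0.8 and a recall of about 0.67. With stemming enabled, "lowers" and "lower" would also match and the score would rise.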

Then create a function that passes your predictions and labels to compute to calculate the ROUGE metric:

>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
...     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
...     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
...
...     result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
...
...     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
...     result["gen_len"] = np.mean(prediction_lens)
...
...     return {k: round(v, 4) for k, v in result.items()}

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
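The np.where line in compute_metrics exists because -100 is a loss-masking marker, not a real token id, so it has to be swapped for the pad id before decoding. The sketch below shows the same idea in plain Python, with a hypothetical five-word vocabulary standing in for the tokenizer.

```python
IGNORE_INDEX = -100
PAD_TOKEN_ID = 0  # assumption: the pad id of the T5 tokenizer

# Hypothetical id-to-token table, standing in for tokenizer.batch_decode.
toy_vocab = {0: "<pad>", 1: "</s>", 5: "bill", 8: "lowers", 4: "costs"}


def toy_decode(ids, skip_special_tokens=True):
    """Map ids to tokens; raises KeyError for ids outside the vocabulary."""
    special = {"<pad>", "</s>"}
    words = [toy_vocab[i] for i in ids]
    if skip_special_tokens:
        words = [w for w in words if w not in special]
    return " ".join(words)


labels = [[5, 1, IGNORE_INDEX, IGNORE_INDEX], [5, 8, 4, 1]]
# Same effect as np.where(labels != -100, labels, pad_token_id).
restored = [[PAD_TOKEN_ID if t == IGNORE_INDEX else t for t in row] for row in labels]
decoded = [toy_decode(row) for row in restored]
print(decoded)  # ['bill', 'bill lowers costs']
```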

Train

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial first!

You're ready to start training your model now! Load T5 with AutoModelForSeq2SeqLM:

>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

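At this point, only three steps remain:

  1. Define your training hyperparameters in Seq2SeqTrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Setting push_to_hub=True pushes this model to the Hub (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer evaluates the ROUGE metric and saves the training checkpoint.
  2. Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  3. Call train() to fine-tune your model.

>>> training_args = Seq2SeqTrainingArguments(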
...     output_dir="my_awesome_billsum_model",
...     eval_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=4,
...     predict_with_generate=True,
...     fp16=True, #change to bf16=True for XPU
...     push_to_hub=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_billsum["train"],
...     eval_dataset=tokenized_billsum["test"],
...     processing_class=tokenizer,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
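To get a feel for how long such a run is, the number of optimizer updates can be estimated from the batch size and epoch count. The sketch assumes a single device and no gradient accumulation, and the training-set size of 989 is an illustrative figure rather than one stated in this post.

```python
import math


def total_optimizer_steps(num_train_examples, batch_size, num_epochs):
    """Estimate optimizer updates, assuming one device and no accumulation."""
    # The last, smaller batch of each epoch still counts as one step.
    steps_per_epoch = math.ceil(num_train_examples / batch_size)
    return steps_per_epoch * num_epochs


# 989 is an illustrative training-set size, not a figure from this post.
steps = total_optimizer_steps(989, batch_size=16, num_epochs=4)
print(steps)  # 248
```

With eval_strategy="epoch", evaluation runs once per epoch, so this configuration would evaluate four times over the run.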

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use it:

>>> trainer.push_to_hub()

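For a more in-depth example of how to fine-tune a model for summarization, take a look at the corresponding PyTorch notebook or TensorFlow notebook.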

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

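Come up with some text you'd like to summarize. For T5, you need to prefix your input depending on the task you're working on. For summarization, prefix your input as shown below: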

>>> text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

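The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for summarization with your model, and pass your text to it: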

>>> from transformers import pipeline

>>> summarizer = pipeline("summarization", model="username/my_awesome_billsum_model")
>>> summarizer(text)
[{"summary_text": "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}]

You can also manually replicate the results of the pipeline if you'd like:

Tokenize the text and return the input_ids as PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_billsum_model")
>>> inputs = tokenizer(text, return_tensors="pt").input_ids

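Use the generate() method to create the summary. For more details about the different text generation strategies and parameters for controlling generation, check out the Text Generation API.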

>>> from transformers import AutoModelForSeq2SeqLM

>>> model = AutoModelForSeq2SeqLM.from_pretrained("username/my_awesome_billsum_model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

Decode the generated token ids back into text:

>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
"the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share."
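Assuming the number of beams is left at 1, calling generate() with do_sample=False amounts to greedy decoding: at each step the highest-scoring next token is appended until an end-of-sequence token appears or max_new_tokens is reached. The sketch below shows that loop with a hypothetical scoring table in place of a model. The ids 0 (start) and 1 (end) mirror T5's conventions, while every other id and score is made up.

```python
def greedy_decode(next_token_scores, start_id, eos_id, max_new_tokens):
    """Repeatedly append the highest-scoring next token until EOS or the limit."""
    sequence = [start_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores(sequence)  # dict: token id -> score
        best = max(scores, key=scores.get)
        sequence.append(best)
        if best == eos_id:
            break
    return sequence


# Hypothetical scores keyed by the last token only; a real model scores the
# whole vocabulary using the encoder output and all previous tokens.
TOY_SCORES = {
    0: {5: 0.7, 8: 0.2, 1: 0.1},
    5: {8: 0.6, 4: 0.3, 1: 0.1},
    8: {4: 0.8, 1: 0.2},
    4: {1: 0.9, 5: 0.1},
}

output = greedy_decode(lambda seq: TOY_SCORES[seq[-1]], start_id=0, eos_id=1, max_new_tokens=100)
print(output)  # [0, 5, 8, 4, 1]
```

Greedy decoding is deterministic, which is why repeated calls return the same summary. Sampling or beam search trade that simplicity for diversity or a wider search.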

Closing remarks

Blog post number 279 is done, and I'm happy!!!!

Today is another day full of hope.


Author: LuYF-Lemon-love
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit LuYF-Lemon-love when reposting!