Transformers 文件

摘要

Hugging Face's logo
加入 Hugging Face 社群

並獲得增強的文件體驗

開始使用

摘要

摘要建立文件或文章的更短版本,其中包含所有重要資訊。與翻譯一樣,它也是一個可以表述為序列到序列任務的示例。摘要可以是

  • 提取式:從文件中提取最相關的資訊。
  • 抽象式:生成捕獲最相關資訊的新文字。

本指南將向您展示如何:

  1. BillSum 資料集的加州法案子集上對 T5 進行微調以進行抽象式摘要。
  2. 使用您的微調模型進行推理。

要檢視與此任務相容的所有架構和檢查點,我們建議檢視任務頁面

在開始之前,請確保您已安裝所有必要的庫

pip install transformers datasets evaluate rouge_score

我們鼓勵您登入 Hugging Face 賬戶,以便您可以上傳並與社群分享您的模型。出現提示時,輸入您的令牌進行登入

>>> from huggingface_hub import notebook_login

>>> notebook_login()

載入 BillSum 資料集

首先從 🤗 Datasets 庫中載入 BillSum 資料集較小的加州法案子集

>>> from datasets import load_dataset

>>> billsum = load_dataset("billsum", split="ca_test")

使用 train_test_split 方法將資料集拆分為訓練集和測試集

>>> billsum = billsum.train_test_split(test_size=0.2)

然後檢視一個示例

>>> billsum["train"][0]
{'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. By expanding the scope of a crime, this bill would impose a state-mandated local program.\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\nThis bill would provide that no reimbursement is required by this act for a specified reason.',
 'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 10295.35 is added to the Public Contract Code, to read:\n10295.35.\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more with a contractor that, in the provision of benefits, discriminates between employees on the basis of an employee’s or dependent’s actual or perceived gender identity, including, but not limited to, the employee’s or dependent’s identification as transgender.\n(2) For purposes of this section, “contract” includes contracts with a cumulative amount of one hundred thousand dollars ($100,000) or more per contractor in each fiscal year.\n(3) For purposes of this section, an employee health plan is discriminatory if the plan is not consistent with Section 1365.5 of the Health and Safety Code and Section 10140 of the Insurance Code.\n(4) The requirements of this section shall apply only to those portions of a contractor’s operations that occur under any of the following conditions:\n(A) Within the state.\n(B) On real property outside the state if the property is owned by the state or if the state has a right to occupy the property, and if the contractor’s presence at that location is connected to a contract with the state.\n(C) Elsewhere in the United States where work related to a state contract is being performed.\n(b) Contractors shall treat as confidential, to the maximum extent allowed by law or by the requirement of the contractor’s insurance provider, any request by an employee or applicant for employment benefits or any documentation of eligibility for benefits submitted by an employee or applicant for employment.\n(c) After taking all reasonable measures to find a contractor that complies with this section, as determined by the state agency, the requirements of this section may be waived under any of the following circumstances:\n(1) There is only one prospective contractor willing to enter into a specific contract with the state agency.\n(2) The contract is necessary to respond to an emergency, as determined by the state agency, that endangers the public health, welfare, or safety, or the contract is necessary for the provision of essential services, and no entity that complies with the requirements of this section capable of responding to the emergency is immediately available.\n(3) The requirements of this section violate, or are inconsistent with, the terms or conditions of a grant, subvention, or agreement, if the agency has made a good faith attempt to change the terms or conditions of any grant, subvention, or agreement to authorize application of this section.\n(4) The contractor is providing wholesale or bulk water, power, or natural gas, the conveyance or transmission of the same, or ancillary services, as required for ensuring reliable services in accordance with good utility practice, if the purchase of the same cannot practically be accomplished through the standard competitive bidding procedures and the contractor is not providing direct retail services to end users.\n(d) (1) A contractor shall not be deemed to discriminate in the provision of benefits if the contractor, in providing the benefits, pays the actual costs incurred in obtaining the benefit.\n(2) If a contractor is unable to provide a certain benefit, despite taking reasonable measures to do so, the contractor shall not be deemed to discriminate in the provision of benefits.\n(e) (1) Every contract subject to this chapter shall contain a statement by which the contractor certifies that the contractor is in compliance with this section.\n(2) The department or other contracting agency shall enforce this section pursuant to its existing enforcement powers.\n(3) (A) If a contractor falsely certifies that it is in compliance with this section, the contract with that contractor shall be subject to Article 9 (commencing with Section 10420), unless, within a time period specified by the department or other contracting agency, the contractor provides to the department or agency proof that it has complied, or is in the process of complying, with this section.\n(B) The application of the remedies or penalties contained in Article 9 (commencing with Section 10420) to a contract subject to this chapter shall not preclude the application of any existing remedies otherwise available to the department or other contracting agency under its existing enforcement powers.\n(f) Nothing in this section is intended to regulate the contracting practices of any local jurisdiction.\n(g) This section shall be construed so as not to conflict with applicable federal laws, rules, or regulations. In the event that a court or agency of competent jurisdiction holds that federal law, rule, or regulation invalidates any clause, sentence, paragraph, or section of this code or the application thereof to any person or circumstances, it is the intent of the state that the court or agency sever that clause, sentence, paragraph, or section so that the remainder of this section shall remain in effect.\nSEC. 2.\nSection 10295.35 of the Public Contract Code shall not be construed to create any new enforcement authority or responsibility in the Department of General Services or any other contracting agency.\nSEC. 3.\nNo reimbursement is required by this act pursuant to Section 6 of Article XIII\u2009B of the California Constitution because the only costs that may be incurred by a local agency or school district will be incurred because this act creates a new crime or infraction, eliminates a crime or infraction, or changes the penalty for a crime or infraction, within the meaning of Section 17556 of the Government Code, or changes the definition of a crime within the meaning of Section 6 of Article XIII\u2009B of the California Constitution.',
 'title': 'An act to add Section 10295.35 to the Public Contract Code, relating to public contracts.'}

您將要使用兩個欄位

  • text:法案的文字,它將作為模型的輸入。
  • summarytext的精簡版本,它將作為模型的目標。

預處理

下一步是載入 T5 tokenizer 來處理 textsummary

>>> from transformers import AutoTokenizer

>>> checkpoint = "google-t5/t5-small"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)

您需要建立的預處理函式需要

  1. 在輸入前加上提示,以便 T5 知道這是一個摘要任務。一些能夠執行多個 NLP 任務的模型需要針對特定任務進行提示。
  2. 在標記標籤時使用關鍵字 text_target 引數。
  3. 將序列截斷,使其長度不超過 max_length 引數設定的最大長度。
>>> prefix = "summarize: "


>>> def preprocess_function(examples):
...     inputs = [prefix + doc for doc in examples["text"]]
...     model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

...     labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs

要將預處理函式應用於整個資料集,請使用 🤗 Datasets map 方法。您可以透過設定 batched=True 來一次處理資料集的多個元素來加速 map 函式

>>> tokenized_billsum = billsum.map(preprocess_function, batched=True)

現在使用 DataCollatorForSeq2Seq 建立一批示例。在整理過程中,將句子動態填充到批次中最長的長度,而不是將整個資料集填充到最大長度,效率更高。

Pytorch
隱藏 Pytorch 內容
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
TensorFlow
隱藏 TensorFlow 內容
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")

評估

在訓練期間包含指標通常有助於評估模型的效能。您可以使用 🤗 Evaluate 庫快速載入評估方法。對於此任務,載入 ROUGE 指標(請參閱 🤗 Evaluate 快速入門以瞭解如何載入和計算指標)

>>> import evaluate

>>> rouge = evaluate.load("rouge")

然後建立一個函式,將您的預測和標籤傳遞給 compute 以計算 ROUGE 指標

>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
...     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
...     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

...     result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

...     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
...     result["gen_len"] = np.mean(prediction_lens)

...     return {k: round(v, 4) for k, v in result.items()}

您的 compute_metrics 函式現在可以使用了,您將在設定訓練時再次用到它。

訓練

Pytorch
隱藏 Pytorch 內容

如果您不熟悉如何使用 Trainer 對模型進行微調,請參閱此處的基本教程!

您現在已經準備好開始訓練模型了!使用 AutoModelForSeq2SeqLM 載入 T5

>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

此時,只剩下三個步驟

  1. Seq2SeqTrainingArguments 中定義您的訓練超引數。唯一必需的引數是 output_dir,它指定儲存模型的位置。您將透過設定 push_to_hub=True 將此模型推送到 Hub(您需要登入 Hugging Face 才能上傳模型)。在每個 epoch 結束時,Trainer 將評估 ROUGE 指標並儲存訓練檢查點。
  2. 將訓練引數以及模型、資料集、分詞器、資料收集器和 compute_metrics 函式傳遞給 Seq2SeqTrainer
  3. 呼叫 train() 來微調您的模型。
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="my_awesome_billsum_model",
...     eval_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=4,
...     predict_with_generate=True,
...     fp16=True, #change to bf16=True for XPU
...     push_to_hub=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_billsum["train"],
...     eval_dataset=tokenized_billsum["test"],
...     processing_class=tokenizer,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

訓練完成後,使用 push_to_hub() 方法將您的模型分享到 Hub,以便所有人都可以使用您的模型。

>>> trainer.push_to_hub()
TensorFlow
隱藏 TensorFlow 內容

如果您不熟悉如何使用 Keras 對模型進行微調,請參閱此處的基本教程!

要在 TensorFlow 中對模型進行微調,首先要設定最佳化器函式、學習率排程和一些訓練超引數
>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

然後您可以使用 TFAutoModelForSeq2SeqLM 載入 T5

>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

使用 prepare_tf_dataset() 將資料集轉換為 tf.data.Dataset 格式

>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_billsum["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     tokenized_billsum["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

使用 compile 配置模型進行訓練。請注意,Transformers 模型都具有預設的任務相關損失函式,因此除非您想要,否則無需指定一個

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

在開始訓練之前,最後需要設定兩件事:從預測中計算 ROUGE 分數,並提供一種將模型推送到 Hub 的方法。這兩件事都可以透過使用 Keras callbacks 來完成。

將您的 compute_metrics 函式傳遞給 KerasMetricCallback

>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

PushToHubCallback 中指定將模型和分詞器推送到何處

>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_billsum_model",
...     tokenizer=tokenizer,
... )

然後將回調函式捆綁在一起

>>> callbacks = [metric_callback, push_to_hub_callback]

最後,您準備好開始訓練模型了!呼叫 fit,傳入您的訓練和驗證資料集、epoch 數量以及回撥函式,以對模型進行微調

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks)

訓練完成後,您的模型會自動上傳到 Hub,供所有人使用!

有關如何微調模型進行摘要的更深入示例,請參閱相應的 PyTorch notebookTensorFlow notebook

推理

太棒了,現在您已經微調了模型,您可以將其用於推理了!

想一些您想摘要的文字。對於 T5,您需要根據您正在處理的任務來給輸入新增字首。對於摘要,您應該按照如下所示給輸入新增字首

>>> text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

嘗試微調模型進行推理的最簡單方法是在 pipeline() 中使用它。使用您的模型例項化一個用於摘要的 pipeline,並將您的文字傳遞給它

>>> from transformers import pipeline

>>> summarizer = pipeline("summarization", model="username/my_awesome_billsum_model")
>>> summarizer(text)
[{"summary_text": "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}]

如果需要,您也可以手動複製 pipeline 的結果

Pytorch
隱藏 Pytorch 內容

對文字進行標記並返回 input_ids 作為 PyTorch 張量

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_billsum_model")
>>> inputs = tokenizer(text, return_tensors="pt").input_ids

使用 generate() 方法建立摘要。有關不同的文字生成策略和控制生成的引數的更多詳細資訊,請檢視 文字生成 API。

>>> from transformers import AutoModelForSeq2SeqLM

>>> model = AutoModelForSeq2SeqLM.from_pretrained("username/my_awesome_billsum_model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

將生成的 token ID 解碼迴文本

>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.'
TensorFlow
隱藏 TensorFlow 內容

對文字進行標記,並將 input_ids 作為 TensorFlow 張量返回

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_billsum_model")
>>> inputs = tokenizer(text, return_tensors="tf").input_ids

使用 ~transformers.generation_tf_utils.TFGenerationMixin.generate 方法建立摘要。有關不同的文字生成策略和控制生成的引數的更多詳細資訊,請檢視 文字生成 API。

>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("username/my_awesome_billsum_model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

將生成的 token ID 解碼迴文本

>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.'
< > 在 GitHub 上更新

© . This site is unofficial and not affiliated with Hugging Face, Inc.