Causal language modeling
There are two types of language modeling, causal and masked; this guide focuses on causal language modeling. Causal language models are frequently used for text generation. You can use these models for creative applications like choosing your own text adventure, or an intelligent coding assistant like Copilot or CodeParrot.
Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
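To make the "attend only to the left" constraint concrete, here is a minimal sketch (an illustration, not part of the original guide) of a causal attention mask. Row i marks the positions token i may attend to: only itself and earlier tokens.

>>> import torch

>>> seq_len = 4
>>> # Lower-triangular mask: True means "may attend", False hides future tokens
>>> torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
tensor([[ True, False, False, False],
        [ True,  True, False, False],
        [ True,  True,  True, False],
        [ True,  True,  True,  True]])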
This guide will show you how to:
- Finetune DistilGPT2 on the ELI5-Category dataset.
- Use your finetuned model for inference.
To see all architectures and checkpoints compatible with this task, we recommend checking the task page.
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login
>>> notebook_login()
Load ELI5 dataset
Start by loading the first 5000 examples from the ELI5-Category dataset with the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.
>>> from datasets import load_dataset
>>> eli5 = load_dataset("eli5_category", split="train[:5000]")
Split the dataset's `train` split into a train and test set with the train_test_split method:
>>> eli5 = eli5.train_test_split(test_size=0.2)
Then take a look at an example:
>>> eli5["train"][0]
{'q_id': '7h191n',
'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
'selftext': '',
'category': 'Economics',
'subreddit': 'explainlikeimfive',
'answers': {'a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
'text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
'score': [21, 19, 5, 3],
'text_urls': [[],
[],
[],
['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']]},
'title_urls': ['url'],
'selftext_urls': ['url']}
While this may look like a lot, you're only really interested in the `text` field. What's cool about language modeling tasks is that you don't need labels (this is also known as an unsupervised task) because the next word *is* the label.
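To sketch why the next word is the label (a toy illustration, not from the original guide): the training target at each position is simply the token that follows it, so the labels are just the inputs shifted by one:

>>> tokens = ["The", "tax", "bill", "is", "500", "pages"]
>>> # Pair each token with the token the model learns to predict next
>>> list(zip(tokens[:-1], tokens[1:]))
[('The', 'tax'), ('tax', 'bill'), ('bill', 'is'), ('is', '500'), ('500', 'pages')]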
Preprocess
The next step is to load a DistilGPT2 tokenizer to process the `text` subfield:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
You'll notice from the example above that the `text` field is actually nested inside `answers`. This means you'll need to extract the `text` subfield from its nested structure with the `flatten` method:
>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'q_id': '7h191n',
'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
'selftext': '',
'category': 'Economics',
'subreddit': 'explainlikeimfive',
'answers.a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
'answers.text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
'answers.score': [21, 19, 5, 3],
'answers.text_urls': [[],
[],
[],
['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']],
'title_urls': ['url'],
'selftext_urls': ['url']}
Each subfield is now a separate column as indicated by the `answers` prefix, and the `text` field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.
Here is a first preprocessing function to join the list of strings for each example and tokenize the result:
>>> def preprocess_function(examples):
... return tokenizer([" ".join(x) for x in examples["answers.text"]])
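For instance, joining converts an example's list of answer strings into a single string before tokenization (a toy illustration with made-up answers):

>>> " ".join(["First answer.", "Second answer."])
'First answer. Second answer.'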
To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and by increasing the number of processes with `num_proc`. Remove any columns you don't need:
>>> tokenized_eli5 = eli5.map(
... preprocess_function,
... batched=True,
... num_proc=4,
... remove_columns=eli5["train"].column_names,
... )
This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.
You can now use a second preprocessing function to:
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM (a toy demonstration follows the function below).
>>> block_size = 128
>>> def group_texts(examples):
... # Concatenate all texts.
... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
... total_length = len(concatenated_examples[list(examples.keys())[0]])
... # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
... # customize this part to your needs.
... if total_length >= block_size:
... total_length = (total_length // block_size) * block_size
... # Split by chunks of block_size.
... result = {
... k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
... for k, t in concatenated_examples.items()
... }
... result["labels"] = result["input_ids"].copy()
... return result
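To see what `group_texts` does, here is a hypothetical toy run with `block_size` temporarily shrunk to 4: ten concatenated tokens yield two full blocks, and the two-token remainder is dropped:

>>> block_size = 4  # shrink temporarily, for illustration only
>>> group_texts({"input_ids": [list(range(10))]})
{'input_ids': [[0, 1, 2, 3], [4, 5, 6, 7]], 'labels': [[0, 1, 2, 3], [4, 5, 6, 7]]}
>>> block_size = 128  # restore the real value before mapping over the dataset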
Apply the `group_texts` function over the entire dataset:
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
Use the end-of-sequence token as the padding token and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
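As a quick sanity check (a hypothetical toy batch, not from the original guide): the collator pads each batch to its longest example with the eos token and masks the padded label positions with -100 so the loss ignores them:

>>> batch = data_collator([{"input_ids": [0, 1, 2]}, {"input_ids": [3, 4]}])
>>> batch["labels"]
tensor([[   0,    1,    2],
        [   3,    4, -100]])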
For TensorFlow, create the data collator the same way, but have it return TensorFlow tensors:
>>> from transformers import DataCollatorForLanguageModeling
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
Train
You're ready to start training your model now! Load DistilGPT2 with AutoModelForCausalLM:
>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
At this point, only three steps remain:
- Define your training hyperparameters in TrainingArguments. The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
- Pass the training arguments to Trainer along with the model, datasets, and data collator.
- Call train() to finetune your model.
>>> training_args = TrainingArguments(
... output_dir="my_awesome_eli5_clm-model",
... eval_strategy="epoch",
... learning_rate=2e-5,
... weight_decay=0.01,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=lm_dataset["train"],
... eval_dataset=lm_dataset["test"],
... data_collator=data_collator,
... tokenizer=tokenizer,
... )
>>> trainer.train()
Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:
>>> import math
>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 49.61
Then share your model to the Hub with the push_to_hub() method so everyone can use your model:
>>> trainer.push_to_hub()
If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial! To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate, and some training hyperparameters:
>>> from transformers import create_optimizer, AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
Then you can load DistilGPT2 with TFAutoModelForCausalLM:
>>> from transformers import TFAutoModelForCausalLM
>>> model = TFAutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
>>> tf_train_set = model.prepare_tf_dataset(
... lm_dataset["train"],
... shuffle=True,
... batch_size=16,
... collate_fn=data_collator,
... )
>>> tf_test_set = model.prepare_tf_dataset(
... lm_dataset["test"],
... shuffle=False,
... batch_size=16,
... collate_fn=data_collator,
... )
Configure the model for training with `compile`. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer) # No loss argument!
The last thing to set up before you start training is to provide a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback:
>>> from transformers.keras_callbacks import PushToHubCallback
>>> callback = PushToHubCallback(
... output_dir="my_awesome_eli5_clm-model",
... tokenizer=tokenizer,
... )
Finally, you're ready to start training your model! Call `fit` with your training and validation datasets, the number of epochs, and your callback to finetune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Great, now that you've finetuned a model, you can use it for inference!
Come up with a prompt you'd like to generate text from:
>>> prompt = "Somatic hypermutation allows the immune system to"
The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a `pipeline` for text generation with your model, and pass your text to it:
>>> from transformers import pipeline
>>> generator = pipeline("text-generation", model="username/my_awesome_eli5_clm-model")
>>> generator(prompt)
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]
Tokenize the text and return the `input_ids` as PyTorch tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids
Use the generate() method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("username/my_awesome_eli5_clm-model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
Decode the generated token ids back into text:
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]
Tokenize the text and return the `input_ids` as TensorFlow tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="tf").input_ids
Use the generate() method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.
>>> from transformers import TFAutoModelForCausalLM
>>> model = TFAutoModelForCausalLM.from_pretrained("username/my_awesome_eli5_clm-model")
>>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
Decode the generated token ids back into text:
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']