Masked language modeling

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
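Before any fine-tuning, you can get a feel for what a masked language model does with a quick sketch like the one below. It queries the pretrained bert-base-uncased checkpoint (chosen here only for illustration; it is not the model fine-tuned in this guide) through the fill-mask pipeline. Note that BERT marks the blank with [MASK], whereas the RoBERTa-style model used later in this guide uses `<mask>`.

>>> from transformers import pipeline

>>> # Pretrained BERT, used only to illustrate the task itself
>>> unmasker = pipeline("fill-mask", model="bert-base-uncased")
>>> # The pipeline predicts the most likely tokens for the [MASK] position
>>> preds = unmasker("The capital of France is [MASK].", top_k=2)
>>> [p["token_str"] for p in preds]  # the top prediction is very likely "paris"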

This guide will show you how to:

  1. Fine-tune DistilRoBERTa on the ELI5-Category dataset.
  2. Use your fine-tuned model for inference.

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load ELI5 dataset

Start by loading the first 5000 examples from the ELI5-Category dataset with the 🤗 Datasets library. This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

>>> from datasets import load_dataset

>>> eli5 = load_dataset("eli5_category", split="train[:5000]")

Split the dataset's `train` split into a train and test set with the train_test_split method:

>>> eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

>>> eli5["train"][0]
{'q_id': '7h191n',
 'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
 'selftext': '',
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
  'text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
   'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
   'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
   'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
  'score': [21, 19, 5, 3],
  'text_urls': [[],
   [],
   [],
   ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

While this may look like a lot, you're really only interested in the `text` field. What's cool about language modeling tasks is that you don't need labels (also known as an unsupervised task), because the tokens themselves serve as the labels; here, the randomly masked tokens are what the model learns to predict.

Preprocess

For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the `text` subfield:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")

You'll notice from the example above that the `text` field is actually nested inside `answers`. This means you'll need to extract the `text` subfield from its nested structure with the `flatten` method:

>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'q_id': '7h191n',
 'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
 'selftext': '',
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
 'answers.text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
  'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
  'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
  'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
 'answers.score': [21, 19, 5, 3],
 'answers.text_urls': [[],
  [],
  [],
  ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']],
 'title_urls': ['url'],
 'selftext_urls': ['url']}

Each subfield is now a separate column, as indicated by the `answers` prefix, and the `text` field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and by increasing the number of processes with `num_proc`. Remove any columns you don't need:

>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )

This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

  • concatenate all the sequences
  • split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.
>>> block_size = 128


>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
...     # customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split by chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     return result

Apply the `group_texts` function over the entire dataset:

>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length.

Pytorch

Use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time you iterate over the data:

>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
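
If you want to check what the collator actually produces, the optional sketch below (not part of the original recipe) collates two grouped examples. Roughly 15% of the positions are selected for masking and keep their original token id as the label; every other position gets the label -100 so it is ignored by the loss.

>>> # Collate two grouped examples; each chunk is block_size (128) tokens long
>>> batch = data_collator([lm_dataset["train"][i] for i in range(2)])
>>> batch["input_ids"].shape  # expected: torch.Size([2, 128])
>>> # Labels are -100 everywhere except the randomly selected masked positions
>>> (batch["labels"] == -100).float().mean()  # roughly 0.85 on average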
TensorFlow

Use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time you iterate over the data:

>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")

Train

Pytorch

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here!

You're now ready to start training your model! Load DistilRoBERTa with AutoModelForMaskedLM:

>>> from transformers import AutoModelForMaskedLM

>>> model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")

At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments. The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
  2. Pass the training arguments to Trainer along with the model, datasets, and data collator.
  3. Call train() to fine-tune your model.

>>> from transformers import TrainingArguments, Trainer

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_eli5_mlm_model",
...     eval_strategy="epoch",
...     learning_rate=2e-5,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
...     tokenizer=tokenizer,
... )

>>> trainer.train()

Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:

>>> import math

>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 8.76

Then share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()
TensorFlow

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial here!

To fine-tune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load DistilRoBERTa with TFAutoModelForMaskedLM:

>>> from transformers import TFAutoModelForMaskedLM

>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Configure the model for training with `compile`. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_mlm_model",
...     tokenizer=tokenizer,
... )

Finally, you're ready to start training your model! Call `fit` with your training and validation datasets, the number of epochs, and your callback to fine-tune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to fine-tune a model for masked language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Come up with some text you'd like the model to fill in the blank with, and use the special `<mask>` token to indicate the blank:

>>> text = "The Milky Way is a <mask> galaxy."

The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a `pipeline` for fill-mask with your model, and pass your text to it. If you like, you can use the `top_k` parameter to specify how many predictions to return:

>>> from transformers import pipeline

>>> mask_filler = pipeline("fill-mask", "username/my_awesome_eli5_mlm_model")
>>> mask_filler(text, top_k=3)
[{'score': 0.5150994658470154,
  'token': 21300,
  'token_str': ' spiral',
  'sequence': 'The Milky Way is a spiral galaxy.'},
 {'score': 0.07087188959121704,
  'token': 2232,
  'token_str': ' massive',
  'sequence': 'The Milky Way is a massive galaxy.'},
 {'score': 0.06434620916843414,
  'token': 650,
  'token_str': ' small',
  'sequence': 'The Milky Way is a small galaxy.'}]
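As an optional aside that is not part of the original guide, the fill-mask `pipeline` also accepts a `targets` argument when you only want to score a few specific candidate words; for RoBERTa-style tokenizers the candidates usually need a leading space:

>>> # Score only a handful of hand-picked candidates instead of the whole vocabulary
>>> mask_filler(text, targets=[" spiral", " dwarf", " elliptical"])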
Pytorch

Tokenize the text and return the `input_ids` as PyTorch tensors. You'll also need to specify the position of the `<mask>` token:

>>> import torch
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="pt")
>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

Pass your inputs to the model and return the `logits` of the masked token:

>>> from transformers import AutoModelForMaskedLM

>>> model = AutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]

Then return the three masked tokens with the highest probability and print them out:

>>> top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

>>> for token in top_3_tokens:
...     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.
TensorFlow

Tokenize the text and return the `input_ids` as TensorFlow tensors. You'll also need to specify the position of the `<mask>` token:

>>> import tensorflow as tf
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="tf")
>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

Pass your inputs to the model and return the `logits` of the masked token:

>>> from transformers import TFAutoModelForMaskedLM

>>> model = TFAutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]

Then return the three masked tokens with the highest probability and print them out:

>>> top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()

>>> for token in top_3_tokens:
...     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.
