Multiple choice

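A multiple choice task is similar to question answering, except that several candidate answers are provided along with a context, and the model is trained to select the correct one.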

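This guide will show you how to:

Fine-tune BERT on the regular configuration of the SWAG dataset to select the best answer given multiple options and some context.
Use your fine-tuned model for inference.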

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load SWAG dataset

Start by loading the regular configuration of the SWAG dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset

>>> swag = load_dataset("swag", "regular")

Then take a look at an example:

>>> swag["train"][0]
{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}

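While it looks like there are a lot of fields here, it is actually pretty straightforward:

sent1 and sent2: these fields show how a sentence starts. Put the two together and you get the startphrase field.
ending0 through ending3: each suggests a possible ending for the sentence, but only one of them is correct.
label: identifies the correct sentence ending.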
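
As a quick check of how these fields relate, the sketch below rebuilds startphrase from sent1 and sent2 and uses label to pick the correct ending. It works on a plain dictionary copied from the record printed above (unused fields omitted), so it does not need the dataset to be downloaded. The variable names are illustrative only.

```python
# One SWAG record, copied from the example shown above (unused fields omitted).
example = {
    "sent1": "Members of the procession walk down the street holding small horn brass instruments.",
    "sent2": "A drum line",
    "startphrase": "Members of the procession walk down the street holding small horn brass instruments. A drum line",
    "ending0": "passes by walking down the street playing their instruments.",
    "ending1": "has heard approaching them.",
    "ending2": "arrives and they're outside dancing and asleep.",
    "ending3": "turns the lead singer watches the performance.",
    "label": 0,
}

# sent1 and sent2 joined by a single space reproduce startphrase.
rebuilt_startphrase = f"{example['sent1']} {example['sent2']}"

# label is the index of the correct ending field (ending0 ... ending3).
correct_ending = example[f"ending{example['label']}"]
full_sentence = f"{example['sent2']} {correct_ending}"

print(rebuilt_startphrase == example["startphrase"])  # True
print(full_sentence)
```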

Preprocess

The next step is to load a BERT tokenizer to process the sentence starts and the four possible endings:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

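The preprocessing function you want to create needs to:

Make four copies of the sent1 field, one per candidate ending, so that each copy can be paired with sent2 to recreate how the sentence starts.
Combine sent2 with each of the four possible sentence endings.
Flatten these two lists so you can tokenize them, and then unflatten the result afterward so that each example again holds one input_ids and attention_mask entry per candidate ending, alongside its label.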

>>> ending_names = ["ending0", "ending1", "ending2", "ending3"]


>>> def preprocess_function(examples):
...     first_sentences = [[context] * 4 for context in examples["sent1"]]
...     question_headers = examples["sent2"]
...     second_sentences = [
...         [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
...     ]

...     first_sentences = sum(first_sentences, [])
...     second_sentences = sum(second_sentences, [])

...     tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
...     return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
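
To make the flatten and unflatten steps easier to follow, here is a minimal sketch that runs the same list manipulation on a toy batch without a tokenizer. The strings are placeholders rather than SWAG data, and the `zip` call is only a stand-in for tokenization; it is an assumption made for illustration, not what the tokenizer actually returns.

```python
# Toy batch of two examples; the strings are placeholders, not SWAG data.
batch = {
    "sent1": ["ctx A", "ctx B"],
    "sent2": ["start A", "start B"],
    "ending0": ["a0", "b0"],
    "ending1": ["a1", "b1"],
    "ending2": ["a2", "b2"],
    "ending3": ["a3", "b3"],
}
ending_names = ["ending0", "ending1", "ending2", "ending3"]

# Repeat each context four times, one copy per candidate ending.
first = [[ctx] * 4 for ctx in batch["sent1"]]
# Prepend sent2 to each of the four endings of the same example.
second = [
    [f"{head} {batch[end][i]}" for end in ending_names]
    for i, head in enumerate(batch["sent2"])
]

# Flatten: 2 examples x 4 choices become 8 (context, continuation) pairs.
flat_first = sum(first, [])
flat_second = sum(second, [])

# Stand-in for the tokenizer: one "feature" per pair (here, the pair itself).
flat_features = list(zip(flat_first, flat_second))

# Unflatten: regroup in chunks of four so each example owns its candidates again.
grouped = [flat_features[i : i + 4] for i in range(0, len(flat_features), 4)]

print(len(flat_features))  # 8
print(grouped[1][2])       # ('ctx B', 'start B b2')
```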

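To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once: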

>>> tokenized_swag = swag.map(preprocess_function, batched=True)

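To create a batch of examples, it is more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. DataCollatorForMultipleChoice flattens all the model inputs, applies padding, and then unflattens the results.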

>>> from transformers import DataCollatorForMultipleChoice
>>> collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
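
The following is a rough, dependency-free sketch of the idea behind dynamic padding for multiple choice inputs: flatten, pad to the longest sequence in the batch, then regroup. It is not the implementation of DataCollatorForMultipleChoice; the function name `pad_multiple_choice_batch`, the `pad_id` default, and the token ids are all hypothetical.

```python
def pad_multiple_choice_batch(batch, pad_id=0):
    """Pad a batch shaped [example][choice][token] to its longest sequence."""
    num_choices = len(batch[0])
    # Flatten: one row per (example, choice) pair.
    flat = [seq for example in batch for seq in example]
    max_len = max(len(seq) for seq in flat)
    # Pad every row on the right and record which positions are real tokens.
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in flat]
    masks = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in flat]

    def regroup(rows):
        # Unflatten back to [example][choice][token].
        return [rows[i : i + num_choices] for i in range(0, len(rows), num_choices)]

    return {"input_ids": regroup(padded), "attention_mask": regroup(masks)}


# Two examples with two choices each; the token ids are made up.
toy_batch = [
    [[101, 7, 102], [101, 7, 8, 9, 102]],
    [[101, 5, 6, 102], [101, 102]],
]
out = pad_multiple_choice_batch(toy_batch)
print(out["input_ids"][0][0])       # [101, 7, 102, 0, 0]
print(out["attention_mask"][1][1])  # [1, 1, 0, 0, 0]
```

Because the padding length is taken from the batch itself, a batch of short sequences stays short instead of being padded to a dataset-wide maximum.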

Evaluate

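Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):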

>>> import evaluate

>>> accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute to calculate the accuracy:

>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
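
To illustrate what the metric computes, the sketch below reproduces the argmax-then-compare logic in plain Python. The logits and labels are hypothetical values with one row per example and one column per candidate ending; the helper `argmax` is defined here only for illustration.

```python
def argmax(row):
    """Return the index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical logits for three examples with four candidate endings each.
logits = [
    [2.1, 0.3, -1.0, 0.5],
    [0.2, 0.1, 1.7, -0.4],
    [-0.3, 1.2, 0.8, 0.0],
]
labels = [0, 2, 3]

# The predicted ending is the column with the highest score in each row.
predictions = [argmax(row) for row in logits]
# Accuracy is the fraction of examples where the prediction matches the label.
accuracy_value = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)  # [0, 2, 1]
print(accuracy_value)
```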

Train

PyTorch

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial first.

You're ready to start training your model now! Load BERT with AutoModelForMultipleChoice:

>>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

>>> model = AutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased")

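At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
Call train() to fine-tune your model.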

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_swag_model",
...     eval_strategy="epoch",
...     save_strategy="epoch",
...     load_best_model_at_end=True,
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_swag["train"],
...     eval_dataset=tokenized_swag["validation"],
...     processing_class=tokenizer,
...     data_collator=collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial first.

To fine-tune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_train_epochs = 2
>>> total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs
>>> optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
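
As a worked example of the step count passed to the schedule, the sketch below evaluates the same formula with a hypothetical training-set size. The figure 70,010 is an assumption chosen only to make the arithmetic easy to follow; it is not the size of the SWAG training split.

```python
# Hypothetical training-set size, chosen only for illustration.
num_train_examples = 70_010
batch_size = 16
num_train_epochs = 2

# Integer division counts full batches only; the 10 leftover examples are ignored.
steps_per_epoch = num_train_examples // batch_size
total_train_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch)    # 4375
print(total_train_steps)  # 8750
```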

Then you can load BERT with TFAutoModelForMultipleChoice:

>>> from transformers import TFAutoModelForMultipleChoice

>>> model = TFAutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_swag["train"],
...     shuffle=True,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_swag["validation"],
...     shuffle=False,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

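Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: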

>>> model.compile(optimizer=optimizer)  # No loss argument!

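The last two things to set up before you start training are computing the accuracy from the predictions and providing a way to push your model to the Hub. Both are done with Keras callbacks.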

Pass your compute_metrics function to KerasMetricCallback:

>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

Specify where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_model",
...     tokenizer=tokenizer,
... )

Then bundle your callbacks together:

>>> callbacks = [metric_callback, push_to_hub_callback]

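Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model: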

>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=callbacks)

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to fine-tune a model for multiple choice, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Come up with some text and two candidate answers:

>>> prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
>>> candidate1 = "The law does not apply to croissants and brioche."
>>> candidate2 = "The law applies to baguettes."

PyTorch

Tokenize each prompt and candidate answer pair and return PyTorch tensors. You should also create some labels:

>>> import torch
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True)
>>> labels = torch.tensor(0).unsqueeze(0)

Pass your inputs and labels to the model and return the logits:

>>> from transformers import AutoModelForMultipleChoice

>>> model = AutoModelForMultipleChoice.from_pretrained("username/my_awesome_swag_model")
>>> outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels)
>>> logits = outputs.logits

Get the class with the highest probability:

>>> predicted_class = logits.argmax().item()
>>> predicted_class
0
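
The logits are unnormalized scores, one per candidate. As a minimal sketch of why taking the argmax of the logits also gives the class with the highest probability, the code below applies a softmax to hypothetical logits of shape (1, 2). The numbers are invented for illustration and are not output from the fine-tuned model.

```python
import math


def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    top = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one prompt with two candidates: shape (1, 2).
logits = [[1.8, 0.4]]

probs = softmax(logits[0])
# Softmax preserves ordering, so the largest logit has the largest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(probs)
print(predicted_class)  # 0
```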

TensorFlow

Tokenize each prompt and candidate answer pair and return TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="tf", padding=True)

Pass your inputs to the model and return the logits:

>>> import tensorflow as tf
>>> from transformers import TFAutoModelForMultipleChoice

>>> model = TFAutoModelForMultipleChoice.from_pretrained("username/my_awesome_swag_model")
>>> inputs = {k: tf.expand_dims(v, 0) for k, v in inputs.items()}
>>> outputs = model(inputs)
>>> logits = outputs.logits

Get the class with the highest probability:

>>> predicted_class = int(tf.math.argmax(logits, axis=-1)[0])
>>> predicted_class
0
