Question answering

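Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri, or Google what the weather is, then you've used a question answering model before. There are two common types of question answering tasks:

Extractive: extract the answer from the given context.
Abstractive: generate an answer from the context that correctly answers the question.

This guide will show you how to:

Finetune DistilBERT on the SQuAD dataset for extractive question answering.
Use your finetuned model for inference.

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed: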

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

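Load the SQuAD dataset

Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.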

>>> from datasets import load_dataset

>>> squad = load_dataset("squad", split="train[:5000]")

Split the dataset's `train` split into a train and test set with the train_test_split method:

>>> squad = squad.train_test_split(test_size=0.2)

Then take a look at an example:

>>> squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'
}

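There are several important fields here:

answers: the starting character position of the answer within the context, and the answer text.
context: background information from which the model needs to extract the answer.
question: the question the model should answer.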
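To make the relationship between these fields concrete, here is a minimal sketch showing that `answer_start` is a character offset into `context`. The record below is a hypothetical example in the same shape as a SQuAD row, not one taken from the dataset.

```python
# Hypothetical record shaped like a SQuAD example (illustration only).
context = "The Grotto is a replica of the grotto at Lourdes, France."
answer_text = "Lourdes, France"
example = {
    "question": "Where is the original grotto located?",
    "context": context,
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}

# answer_start is a character offset; the end is start + len(text), exclusive.
start_char = example["answers"]["answer_start"][0]
end_char = start_char + len(example["answers"]["text"][0])

# Slicing the context with these offsets recovers the answer text.
recovered = example["context"][start_char:end_char]
print(start_char, end_char, repr(recovered))
```

The preprocessing step later in this guide converts these character offsets into token positions, which is what the model is trained to predict.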

Preprocess

The next step is to load a DistilBERT tokenizer to process the `question` and `context` fields:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

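There are a few preprocessing steps particular to question answering tasks you should be aware of:

Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
Next, map the start and end positions of the answer to the original `context` by setting `return_offsets_mapping=True`.
With the mapping in hand, you can now find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offset corresponds to the `question` and which corresponds to the `context`.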
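The steps above can be pictured with a toy encoder. This is only a sketch: `toy_encode` is a hypothetical whitespace "tokenizer" written for illustration, not the DistilBERT tokenizer, but it returns offsets and sequence ids in the same spirit (special tokens get `(0, 0)` and `None`, question tokens get `0`, context tokens get `1`).

```python
import re

def toy_encode(question, context):
    """Hypothetical whitespace tokenizer returning tokens, offsets, and sequence ids."""
    tokens, offsets, sequence_ids = ["[CLS]"], [(0, 0)], [None]
    for seq_id, text in enumerate((question, context)):
        for match in re.finditer(r"\S+", text):
            tokens.append(match.group())
            # Offsets are character positions inside the text the token came from.
            offsets.append((match.start(), match.end()))
            sequence_ids.append(seq_id)
        tokens.append("[SEP]")
        offsets.append((0, 0))
        sequence_ids.append(None)
    return tokens, offsets, sequence_ids

question = "Who built it?"
context = "The tower was built by Gustave Eiffel in 1889."
tokens, offsets, sequence_ids = toy_encode(question, context)

# Context tokens are the ones whose sequence id is 1.
context_start = sequence_ids.index(1)
context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

# Each context offset slices the context string back to its token.
for idx in range(context_start, context_end + 1):
    start, end = offsets[idx]
    assert context[start:end] == tokens[idx]

print(context_start, context_end, tokens[context_start], tokens[context_end])
```

A real subword tokenizer splits words further, but the bookkeeping is the same: sequence ids tell you where the context begins and ends, and offsets tie each context token to a character range.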

Here is how you can create a function to truncate and map the start and end tokens of the `answer` to the `context`:

>>> def preprocess_function(examples):
...     questions = [q.strip() for q in examples["question"]]
...     inputs = tokenizer(
...         questions,
...         examples["context"],
...         max_length=384,
...         truncation="only_second",
...         return_offsets_mapping=True,
...         padding="max_length",
...     )

...     offset_mapping = inputs.pop("offset_mapping")
...     answers = examples["answers"]
...     start_positions = []
...     end_positions = []

...     for i, offset in enumerate(offset_mapping):
...         answer = answers[i]
...         start_char = answer["answer_start"][0]
...         end_char = answer["answer_start"][0] + len(answer["text"][0])
...         sequence_ids = inputs.sequence_ids(i)

...         # Find the start and end of the context
...         idx = 0
...         while sequence_ids[idx] != 1:
...             idx += 1
...         context_start = idx
...         while sequence_ids[idx] == 1:
...             idx += 1
...         context_end = idx - 1

...         # If the answer is not fully inside the context, label it (0, 0)
...         if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
...             start_positions.append(0)
...             end_positions.append(0)
...         else:
...             # Otherwise it's the start and end token positions
...             idx = context_start
...             while idx <= context_end and offset[idx][0] <= start_char:
...                 idx += 1
...             start_positions.append(idx - 1)

...             idx = context_end
...             while idx >= context_start and offset[idx][1] >= end_char:
...                 idx -= 1
...             end_positions.append(idx + 1)

...     inputs["start_positions"] = start_positions
...     inputs["end_positions"] = end_positions
...     return inputs
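To see the labeling logic in isolation, the sketch below runs the same kind of span search on hand-written offsets, so no tokenizer is needed. The helper name `label_span` and the offsets are hypothetical, chosen for illustration. The check used here treats an answer as unusable unless it lies fully inside the context that survived truncation.

```python
def label_span(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token), or (0, 0) if the answer is not fully in the context."""
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer starts before or ends after the available context: label it (0, 0).
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

# Question "Who built it?" and context "The tower was built by Gustave Eiffel in 1889.",
# split on whitespace; special tokens carry the offset (0, 0).
offsets = [(0, 0), (0, 3), (4, 9), (10, 13), (0, 0),
           (0, 3), (4, 9), (10, 13), (14, 19), (20, 22),
           (23, 30), (31, 37), (38, 40), (41, 46), (0, 0)]
sequence_ids = [None, 0, 0, 0, None] + [1] * 9 + [None]

# "Gustave Eiffel" covers characters 23..37 of the context.
full = label_span(offsets, sequence_ids, 23, 37)

# Simulate truncation that keeps only the first five context tokens.
truncated = label_span(offsets[:10] + [(0, 0)], sequence_ids[:10] + [None], 23, 37)
print(full, truncated)  # (10, 11) (0, 0)
```

Labeling truncated answers `(0, 0)` points both positions at the first token of the sequence, which is the convention the function above uses for "no answer in this window".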

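To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove any columns you don't need: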

>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
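With `batched=True`, the function receives a dictionary that maps each column name to a list of values, which is why `preprocess_function` iterates over `examples["question"]`. The sketch below imitates that calling convention in plain Python. `toy_map` is a hypothetical stand-in written for illustration, not the 🤗 Datasets implementation, and it keeps only the returned columns, similar in effect to removing all original columns.

```python
def toy_map(rows, function, batch_size=1000):
    """Toy stand-in for a batched map: call `function` on dicts of column lists."""
    out_rows = []
    for start in range(0, len(rows), batch_size):
        batch_rows = rows[start:start + batch_size]
        # One batch is a single dict: column name -> list of values.
        batch = {key: [row[key] for row in batch_rows] for key in batch_rows[0]}
        result = function(batch)
        n = len(next(iter(result.values())))
        # Turn the returned columns back into one dict per row.
        out_rows.extend({key: result[key][i] for key in result} for i in range(n))
    return out_rows

def strip_questions(examples):
    # Same access pattern as preprocess_function: examples["question"] is a list.
    return {"question": [q.strip() for q in examples["question"]]}

rows = [
    {"question": "  Who? ", "context": "a"},
    {"question": "What?  ", "context": "b"},
    {"question": " Why?", "context": "c"},
]
processed = toy_map(rows, strip_questions, batch_size=2)
print(processed)
```

Processing a batch per call lets a fast tokenizer encode many examples at once instead of being invoked row by row.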

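Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply any additional preprocessing such as padding. That is fine here because the preprocessing function already pads every example to `max_length`.

PyTorch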

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()

TensorFlow

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")

Train

PyTorch

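If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial!

You're ready to start training your model now! Load DistilBERT with AutoModelForQuestionAnswering: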

>>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased")

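At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
Call train() to finetune your model.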

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_qa_model",
...     eval_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_squad["train"],
...     eval_dataset=tokenized_squad["test"],
...     processing_class=tokenizer,
...     data_collator=data_collator,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

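If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial!

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters: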

>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_epochs = 2
>>> total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
>>> optimizer, schedule = create_optimizer(
...     init_lr=2e-5,
...     num_warmup_steps=0,
...     num_train_steps=total_train_steps,
... )
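As a quick check on the step count passed to the schedule, here is the same arithmetic worked through for this guide's data sizes. The helper name `total_steps` is hypothetical. The sizes follow from loading 5000 examples and holding out 20% of them for testing.

```python
def total_steps(num_examples, batch_size, num_epochs):
    """Full batches per epoch (a trailing partial batch is not counted) times epochs."""
    return (num_examples // batch_size) * num_epochs

num_loaded = 5000
num_test = num_loaded // 5          # test_size=0.2 holds out 1000 examples
num_train = num_loaded - num_test   # 4000 examples remain for training

steps_per_epoch = total_steps(num_train, batch_size=16, num_epochs=1)
print(steps_per_epoch)                                       # 250
print(total_steps(num_train, batch_size=16, num_epochs=2))   # 500
print(total_steps(num_train, batch_size=16, num_epochs=3))   # 750
```

Because the learning rate schedule is sized from this number, the epoch count used here should match the number of epochs you actually train for.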

Then you can load DistilBERT with TFAutoModelForQuestionAnswering:

>>> from transformers import TFAutoModelForQuestionAnswering

>>> model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_squad["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_squad["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

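Configure the model for training with compile. Transformers models have a default task-relevant loss function, so you don't need to specify one unless you want to: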

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)

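The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback: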

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_qa_model",
...     tokenizer=tokenizer,
... )

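Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model: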

>>> # Train for the same number of epochs the learning rate schedule was sized for
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=[callback])

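Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Evaluate

Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The Trainer still calculates the evaluation loss during training, so you're not completely in the dark about your model's performance.

If you have more time and you're interested in how to evaluate your model for question answering, take a look at the question answering chapter from the 🤗 Hugging Face Course!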
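For a rough idea of what scoring looks like once predicted answer text has been recovered, here is a minimal sketch of SQuAD-style exact match and token-level F1. It is written from scratch for illustration and is not the code this guide or the `evaluate` library runs. The function names and the example strings are assumptions.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts tokens shared by prediction and reference.
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Saint Bernadette Soubirous.", "saint bernadette soubirous"))
print(f1_score("the 13 programming languages", "13"))
```

The expensive part of a full evaluation is not these metrics but mapping the model's start and end logits back to text for every example, which is the postprocessing this guide skips.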

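Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a question and some context you'd like the model to predict: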

>>> question = "How many programming languages does BLOOM support?"
>>> context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

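The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a question answering `pipeline` with your model, and pass your text to it: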

>>> from transformers import pipeline

>>> question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
>>> question_answerer(question=question, context=context)
{'score': 0.2058267742395401,
 'start': 10,
 'end': 95,
 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}
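The `start` and `end` values in this output are character offsets into `context`, with `end` exclusive. A small check, using only the values shown above and no model, makes that explicit:

```python
context = (
    "BLOOM has 176 billion parameters and can generate text in 46 languages "
    "natural languages and 13 programming languages."
)
# Values copied from the pipeline output shown above.
result = {
    "start": 10,
    "end": 95,
    "answer": "176 billion parameters and can generate text in 46 languages natural languages and 13",
}

# Slicing the context with start/end reproduces the returned answer string.
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```

This is handy when you want to highlight the answer inside the original text rather than just display the extracted string.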

You can also manually replicate the results of the `pipeline` if you'd like:

PyTorch

Tokenize the text and return PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
>>> inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the `logits`:

>>> import torch
>>> from transformers import AutoModelForQuestionAnswering

>>> model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
>>> with torch.no_grad():
...     outputs = model(**inputs)

Get the token positions with the highest start and end logits from the model output:

>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'

TensorFlow

Tokenize the text and return TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
>>> inputs = tokenizer(question, context, return_tensors="tf")

Pass your inputs to the model and return the `logits`:

>>> import tensorflow as tf
>>> from transformers import TFAutoModelForQuestionAnswering

>>> model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
>>> outputs = model(**inputs)

Get the token positions with the highest start and end logits from the model output:

>>> answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
>>> answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

Decode the predicted tokens to get the answer:

>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
