Fast tokenizers in the QA pipeline

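We will now dive into the question-answering pipeline and see how to leverage the offsets to grab the answer to a question from the context, a bit like we did for the grouped entities in the previous section. Then we will see how to deal with very long contexts that end up being truncated. You can skip this section if you're not interested in the question answering task.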

Using the question-answering pipeline

As we saw in Chapter 1, we can use the question-answering pipeline like this to get the answer to a question:

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

{'score': 0.97773,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}

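Unlike the other pipelines, which can't truncate and split texts that are longer than the maximum length accepted by the model (and thus may miss information at the end of a document), this pipeline can deal with very long contexts and will return the answer to the question even if it's at the end: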

long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)

{'score': 0.97149,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

Let's see how it does all of this!

Using a model for question answering

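Like with any other pipeline, we start by tokenizing our input and then send it through the model. The checkpoint used by default for the question-answering pipeline is distilbert-base-cased-distilled-squad (the "squad" in the name comes from the dataset on which the model was fine-tuned; we'll talk more about the SQuAD dataset in Chapter 7):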

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

Note that we tokenize the question and the context as a pair, with the question first.

An example of tokenization of question and context

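Models for question answering work a little differently from the models we've seen up to now. Using the picture above as an example, the model has been trained to predict the index of the token starting the answer (here 21) and the index of the token where the answer ends (here 24). This is why those models don't return one tensor of logits but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer. Since in this case we have only one input containing 66 tokens, we get: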

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

torch.Size([1, 66]) torch.Size([1, 66])

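To convert those logits into probabilities, we will apply a softmax function, but before that, we need to make sure we mask the indices that are not part of the context. Our input is [CLS] question [SEP] context [SEP], so we need to mask the tokens of the question as well as the [SEP] tokens. We'll keep the [CLS] token, however, as some models use it to indicate that the answer is not in the context.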

Since we will apply a softmax afterward, we just need to replace the logits we want to mask with a large negative number. Here, we use -10000:

import torch

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000
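
To see why a large negative logit works as a mask, here is a minimal sketch on toy data. The `sequence_ids` list and the logit values are made up for illustration (they stand in for `inputs.sequence_ids()` and the model output); the point is only that masked positions end up with a probability of essentially zero after the softmax.

```python
import torch

# Toy stand-in for inputs.sequence_ids(): None marks special tokens,
# 0 marks question tokens and 1 marks context tokens (assumed layout).
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
# Made-up logits for a single input of 8 tokens
toy_logits = torch.tensor([[0.5, 2.0, 1.0, 0.3, 1.5, 3.0, 0.2, 0.1]])

# True means "mask this position": everything that is not context...
toy_mask = [i != 1 for i in sequence_ids]
# ...except the first token, which plays the role of [CLS]
toy_mask[0] = False
toy_mask = torch.tensor(toy_mask)[None]

masked_logits = toy_logits.clone()
masked_logits[toy_mask] = -10000
probs = torch.nn.functional.softmax(masked_logits, dim=-1)[0]

# Masked positions get (numerically) zero probability
print(probs)
```

Working on a clone keeps the original logits available for comparison; the tutorial code overwrites them in place, which is fine when they are not needed afterward.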

Now that we have properly masked the logits corresponding to positions we don't want to predict, we can apply the softmax:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

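At this stage, we could take the argmax of the start and end probabilities, but we might end up with a start index that is greater than the end index, so we need to take a few more precautions. We will compute the probabilities of each possible start_index and end_index where start_index <= end_index, then take the tuple (start_index, end_index) with the highest probability.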

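Assuming the events "the answer starts at start_index" and "the answer ends at end_index" to be independent, the probability that the answer starts at start_index and ends at end_index is $\mathrm{start\_probabilities}[\mathrm{start\_index}] \times \mathrm{end\_probabilities}[\mathrm{end\_index}]$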

So, to compute all the scores, we just need to compute all the products $\mathrm{start\_probabilities}[\mathrm{start\_index}] \times \mathrm{end\_probabilities}[\mathrm{end\_index}]$ where start_index <= end_index.

First let's compute all the possible products:

scores = start_probabilities[:, None] * end_probabilities[None, :]

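Then we'll mask the values where start_index > end_index by setting them to 0 (the other probabilities are all positive numbers). The torch.triu() function returns the upper triangular part of the 2D tensor passed as an argument, so it will do that masking for us: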

scores = torch.triu(scores)
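
The effect of the outer product followed by torch.triu() is easier to see on a tiny example. The sketch below uses made-up probabilities for a three-token input; in it, the largest raw product sits below the diagonal (start after end), and taking the upper triangle removes it.

```python
import torch

# Made-up start/end probabilities for a 3-token input
toy_start = torch.tensor([0.1, 0.7, 0.2])
toy_end = torch.tensor([0.6, 0.1, 0.3])

# toy_scores[i, j] = toy_start[i] * toy_end[j]
toy_scores = toy_start[:, None] * toy_end[None, :]
print(toy_scores)

# Zero out every entry with start index > end index (below the diagonal)
valid_scores = torch.triu(toy_scores)
print(valid_scores)
```

Without the masking, the best pair would be start=1, end=0 (score 0.42), which is not a valid span. After torch.triu() the best remaining pair is start=1, end=2 (score 0.21).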

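Now we just have to get the index of the maximum. Since PyTorch will return the index in the flattened tensor, we need to use the floor division // and modulus % operations to get the start_index and end_index: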

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])

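We're not quite done yet, but at least we already have the correct score for the answer (you can check this by comparing it to the first result in the previous section):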

0.97773
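
As a small illustration of the index arithmetic, the sketch below recovers a (row, column) pair from a flat argmax on a made-up 3x3 score matrix. Python's built-in divmod returns the floor division and the remainder in one call, which is equivalent to using // and % separately.

```python
import torch

# Made-up, already upper-triangular score matrix
toy_scores = torch.tensor(
    [
        [0.06, 0.01, 0.03],
        [0.00, 0.07, 0.21],
        [0.00, 0.00, 0.06],
    ]
)

# argmax() without a dim argument indexes the flattened (row-major) tensor
flat_index = toy_scores.argmax().item()

# Row = flat_index // n_columns, column = flat_index % n_columns
row, col = divmod(flat_index, toy_scores.shape[1])
print(flat_index, row, col)
```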

✏️ Try it out! Compute the start and end indices for the five most likely answers.

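We have the start_index and end_index of the answer in terms of tokens, so now we just need to convert them to character indices in the context. This is where the offsets are very useful. We can grab them and use them like we did in the token classification task: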

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]
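
The mapping from a token span to a character span can be sketched without a tokenizer. The offsets below are hypothetical, written by hand for word-level tokens (a real tokenizer may split words into smaller pieces), but the lookup logic is the same: take the start character of the first token and the end character of the last one.

```python
text = "Jax, PyTorch and TensorFlow are supported."

# Hypothetical offset mapping: one (start_char, end_char) pair per token
toy_offsets = [
    (0, 3),    # Jax
    (3, 4),    # ,
    (5, 12),   # PyTorch
    (13, 16),  # and
    (17, 27),  # TensorFlow
    (28, 31),  # are
    (32, 41),  # supported
    (41, 42),  # .
]

# Token span covering "Jax" through "TensorFlow"
first_token, last_token = 0, 4

span_start, _ = toy_offsets[first_token]
_, span_end = toy_offsets[last_token]
print(text[span_start:span_end])
```

Slicing the original string, rather than decoding the token IDs, preserves the exact spacing and casing of the source text.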

Now we just have to format everything to get our result:

result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)

{'answer': 'Jax, PyTorch and TensorFlow',
 'start': 78,
 'end': 105,
 'score': 0.97773}

Great! That's the same as in our first example!

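✏️ Try it out! Use the best scores you computed earlier to show the five most likely answers. To check your results, go back to the first pipeline and pass in top_k=5 when calling it.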

Handling long contexts

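If we try to tokenize the question and long context we used as an example previously, we'll get a number of tokens higher than the maximum length used in the question-answering pipeline (which is 384):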

inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))

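So, we'll need to truncate our inputs at that maximum length. There are several ways we can do this, but we don't want to truncate the question, only the context. Since the context is the second sentence, we'll use the "only_second" truncation strategy. The problem that arises then is that the answer to the question may not be in the truncated context. Here, for instance, we picked a question where the answer is toward the end of the context, and when we truncate it that answer is not present: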

inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))

"""
[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP

[UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

[UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internal [SEP]
"""

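This means the model will have a hard time picking the correct answer. To fix this, the question-answering pipeline allows us to split the context into smaller chunks, specifying the maximum length. To make sure we don't split the context at exactly the wrong place, which would make it impossible to find the answer, it also includes some overlap between the chunks.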

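We can have the tokenizer (fast or slow) do this for us by adding return_overflowing_tokens=True, and we can specify the overlap we want with the stride argument. Here is an example, using a shorter sentence: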

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

'[CLS] This sentence is not [SEP]'
'[CLS] is not too long [SEP]'
'[CLS] too long but we [SEP]'
'[CLS] but we are going [SEP]'
'[CLS] are going to split [SEP]'
'[CLS] to split it anyway [SEP]'
'[CLS] it anyway. [SEP]'

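As we can see, the sentence has been split into chunks in such a way that each entry in inputs["input_ids"] has at most 6 tokens (we would need to add padding to have the last entry be the same size as the others) and there is an overlap of 2 tokens between consecutive entries.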
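
The windowing logic behind this output can be approximated in a few lines of plain Python. This is only a sketch under stated assumptions, not the library's implementation: `split_with_overlap` is a hypothetical helper, whitespace-separated words stand in for tokens, and the window is 4 because max_length=6 leaves room for 4 tokens once [CLS] and [SEP] are added.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items,
    where consecutive chunks share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start : start + window])
        # Stop once the current window reaches the end of the sequence
        if start + window >= len(tokens):
            break
        # Move forward, keeping `overlap` tokens from the previous chunk
        start += window - overlap
    return chunks


# Words (with the final period split off) stand in for tokens here
words = "This sentence is not too long but we are going to split it anyway .".split()
chunks = split_with_overlap(words, window=4, overlap=2)

for chunk in chunks:
    print(" ".join(chunk))
```

With these inputs the sketch produces 7 chunks, and each chunk starts with the last two words of the previous one, mirroring the decoded output above.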

Let's take a closer look at the result of the tokenization:

print(inputs.keys())

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])

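As expected, we get input IDs and an attention mask. The last key, overflow_to_sample_mapping, is a map that tells us which sentence each of the results corresponds to. Here we have 7 results that all come from the (only) sentence we passed to the tokenizer: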

print(inputs["overflow_to_sample_mapping"])

[0, 0, 0, 0, 0, 0, 0]

This is more useful when we tokenize several sentences together. For instance, this:

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

gets us:

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

which means the first sentence is split into 7 chunks as before, and the next 4 chunks come from the second sentence.
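
One common use of such a mapping is to regroup chunk indices by the sample they came from. The sketch below hard-codes the mapping printed above instead of calling the tokenizer; `chunks_per_sample` is just an illustrative name.

```python
from collections import defaultdict

# The mapping printed above: 7 chunks from sentence 0, 4 from sentence 1
overflow_to_sample_mapping = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# Collect, for each original sentence, the indices of its chunks
chunks_per_sample = defaultdict(list)
for chunk_idx, sample_idx in enumerate(overflow_to_sample_mapping):
    chunks_per_sample[sample_idx].append(chunk_idx)

for sample_idx, chunk_indices in chunks_per_sample.items():
    print(f"sentence {sample_idx}: chunks {chunk_indices}")
```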

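Now let's go back to our long context. By default the question-answering pipeline uses a maximum length of 384, as we mentioned earlier, and a stride of 128, which corresponds to the way the model was fine-tuned (you can adjust those parameters by passing the max_seq_len and doc_stride arguments when calling the pipeline). We will thus use those parameters when tokenizing. We'll also add padding (to have samples of the same length, so we can build tensors) and ask for the offsets: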

inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

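Those inputs will contain the input IDs and attention masks the model expects, as well as the offsets and the overflow_to_sample_mapping we just talked about. Since those two are not parameters used by the model, we'll pop them out of the inputs (and we won't store the map, since it's not useful here) before converting the inputs to tensors: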

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)

torch.Size([2, 384])

Our long context was split in two, which means that after it goes through our model, we will have two sets of start and end logits:

outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

torch.Size([2, 384]) torch.Size([2, 384])

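Like before, we first mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens (as flagged by the attention mask):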

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000
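
The torch.logical_or call relies on broadcasting: the context mask has shape [1, sequence_length] while the attention mask has shape [num_chunks, sequence_length]. A minimal sketch with made-up values for two chunks of six tokens shows how the two masks combine.

```python
import torch

# Made-up context mask shared by all chunks, shape [1, 6]
# (True = position to mask, e.g. question tokens)
context_mask = torch.tensor([False, True, True, False, False, False])[None]

# Made-up attention mask, shape [2, 6]: the second chunk ends with 2 pad tokens
toy_attention_mask = torch.tensor(
    [
        [1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0],
    ]
)

# Broadcasting expands context_mask to [2, 6] before the element-wise OR
full_mask = torch.logical_or(context_mask, toy_attention_mask == 0)
print(full_mask)
```

The first row is identical to the context mask, while the second row additionally masks the two padded positions.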

Then we can use the softmax to convert our logits to probabilities:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

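The next step is similar to what we did for the small context, but we repeat it for each of our two chunks. We attribute a score to all possible answer spans, then take the span with the best score: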

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)

[(0, 18, 0.33867), (173, 184, 0.97149)]

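Those two candidates correspond to the best answers the model was able to find in each chunk. The model is much more confident that the right answer is in the second part (which is a good sign!). Now we just have to map those two token spans to spans of characters in the context (we only need to map the second one to have our answer, but it's interesting to see what the model has picked in the first chunk).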
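
If only the single best answer is needed, the winning chunk can be selected directly from the candidate list. The sketch below reuses the values printed above as hard-coded data, and keeps the chunk index because it tells us which list of offsets to use afterward.

```python
# (start_token, end_token, score) for each chunk, copied from the output above
candidates = [(0, 18, 0.33867), (173, 184, 0.97149)]

# Pick the candidate with the highest score, remembering its chunk index
best_chunk, (start_token, end_token, score) = max(
    enumerate(candidates), key=lambda item: item[1][2]
)
print(best_chunk, start_token, end_token, score)
```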

✏️ Try it out! Adapt the code above to return the scores and spans for the five most likely answers (in total, not per chunk).

The offsets we grabbed earlier are actually a list of offsets, with one list per chunk of text:

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)

{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.97149}

If we ignore the first result, we get the same result as our pipeline for this long context. Great!

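✏️ Try it out! Use the best scores you computed before to show the five most likely answers (for the whole context, not each chunk). To check your results, go back to the first pipeline and pass in top_k=5 when calling it.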

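This concludes our deep dive into the capabilities of the tokenizer. We will put all of this into practice again in the next chapter, when we show you how to fine-tune a model on a range of common NLP tasks.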
