Fast tokenizers in the QA pipeline


We will now dive into the `question-answering` pipeline and see how to leverage the offsets to grab the answer to the question at hand from the context, a bit like we did for the grouped entities in the previous section. Then we will see how we can deal with very long contexts that end up being truncated. You can skip this section if you're not interested in the question answering task.

Using the question-answering pipeline

As we saw in Chapter 1, we can use the `question-answering` pipeline like this to get the answer to a question:

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)
{'score': 0.97773,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}

Unlike the other pipelines, which can't truncate and split texts that are longer than the maximum length accepted by the model (and thus may miss information at the end of a document), this pipeline can deal with very long contexts and will return the answer to the question even if it's at the end:

long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)
{'score': 0.97149,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

Let's see how it does all of this!

Using a model for question answering

Like with any other pipeline, we start by tokenizing our input and then send it through the model. The checkpoint used by default for the `question-answering` pipeline is `distilbert-base-cased-distilled-squad` (the "squad" in the name comes from the dataset the model was fine-tuned on; we'll talk more about the SQuAD dataset in Chapter 7):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

Note that we tokenize the question and the context as a pair, with the question first.

An example of tokenization of question and context
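
If you want to see this pairing for yourself, you can decode the input IDs we just built (a quick check, reusing the `tokenizer` and `inputs` defined above):

# Decode the single sequence in the batch to inspect its structure:
# [CLS] question tokens [SEP] context tokens [SEP]
print(tokenizer.decode(inputs["input_ids"][0]))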

Models for question answering work a little differently from the models we've seen up to now. Using the picture above as an example, the model has been trained to predict the index of the token starting the answer (here 21) and the index of the token where the answer ends (here 24). This is why those models don't return one tensor of logits but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer. Since in this case we have only one input containing 66 tokens, we get:

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([1, 66]) torch.Size([1, 66])

To convert those logits into probabilities, we will apply a softmax function, but before that we need to make sure we mask the indices that are not part of the context. Our input is `[CLS] question [SEP] context [SEP]`, so we need to mask the tokens of the question as well as the `[SEP]` token. We'll keep the `[CLS]` token, however, as some models use it to indicate that the answer is not in the context.

Since we will apply a softmax afterward, we just need to replace the logits we want to mask with a large negative number. Here, we use `-10000`:

import torch

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

Now that we have properly masked the logits corresponding to positions we don't want to predict, we can apply the softmax:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

At this stage, we could take the argmax of the start and end probabilities, but we might end up with a start index that is greater than the end index, so we need to take a few more precautions. We will compute the probabilities of each possible `start_index` and `end_index` where `start_index <= end_index`, then take the tuple `(start_index, end_index)` with the highest probability.

Assuming the events "the answer starts at `start_index`" and "the answer ends at `end_index`" are independent, the probability that the answer starts at `start_index` and ends at `end_index` is:

start_probabilities[start_index] × end_probabilities[end_index]

So, to compute all the scores, we just need to compute all the products `start_probabilities[start_index] × end_probabilities[end_index]` where `start_index <= end_index`.

First let's compute all the possible products:

scores = start_probabilities[:, None] * end_probabilities[None, :]

Then we'll mask the values where `start_index > end_index` by setting them to `0` (the other probabilities are all positive numbers). The `torch.triu()` function returns the upper triangular part of the 2D tensor passed as an argument, so it will do that masking for us:

scores = torch.triu(scores)

Now we just have to get the index of the maximum. Since PyTorch returns the index in the flattened tensor, we need to use the floor division `//` and modulus `%` operations to get the `start_index` and `end_index`:

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])

We're not quite done yet, but at least we already have the correct score for the answer (you can check this by comparing it to the first result in the previous section):

0.97773

✏️ Try it out! Compute the start and end indices for the five most likely answers.
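
One possible way to approach this (just a sketch, other solutions work too): flatten the upper-triangular `scores` matrix we already built and use `torch.topk()` to retrieve the five best `(start_index, end_index)` pairs:

# Sketch: indices and scores of the 5 highest-scoring (start_index, end_index) pairs
top_scores, flat_indices = torch.topk(scores.flatten(), 5)
for flat_index, score in zip(flat_indices, top_scores):
    start_index = flat_index.item() // scores.shape[1]
    end_index = flat_index.item() % scores.shape[1]
    print(start_index, end_index, score.item())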

We have the `start_index` and `end_index` of the answer in terms of tokens, so now we just need to convert them to character indices in the context. This is where the offsets will be extremely useful. We can grab them and use them like we did in the token classification task:

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

Now we just have to format everything to get our result:

result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)
{'answer': 'Jax, PyTorch and TensorFlow',
 'start': 78,
 'end': 105,
 'score': 0.97773}

Great! That's the same as in our first example!

✏️ Try it out! Use the best scores you computed earlier to show the five most likely answers. To check your results, go back to the first pipeline and pass in `top_k=5` when calling it.

Handling long contexts

If we try to tokenize the question and the long context we used as an example previously, we'll get a number of tokens higher than the maximum length used in the `question-answering` pipeline (which is 384):

inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))
461

So we'll need to truncate our inputs at that maximum length. There are several ways we can do this, but we don't want to truncate the question, only the context. Since the context is the second sentence, we'll use the `"only_second"` truncation strategy. The problem that arises then is that the answer to the question may not be in the truncated context. Here, for instance, we picked a question whose answer is toward the end of the context, and when we truncate it that answer is no longer present:

inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))
"""
[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP

[UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

[UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internal [SEP]
"""

This means the model will have a hard time picking the correct answer. To fix this, the `question-answering` pipeline allows us to split the context into smaller chunks, specifying the maximum length. To make sure we don't split the context at exactly the wrong place to make it possible to find the answer, it also includes some overlap between the chunks.

We can have the tokenizer (fast or slow) do this for us by adding `return_overflowing_tokens=True`, and we can specify the overlap we want with the `stride` argument. Here is an example, using a smaller sentence:

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
'[CLS] This sentence is not [SEP]'
'[CLS] is not too long [SEP]'
'[CLS] too long but we [SEP]'
'[CLS] but we are going [SEP]'
'[CLS] are going to split [SEP]'
'[CLS] to split it anyway [SEP]'
'[CLS] it anyway. [SEP]'

As we can see, the sentence has been split into chunks in such a way that each entry in `inputs["input_ids"]` has at most 6 tokens (we would need to add padding to have the last entry be the same size as the others), and there is an overlap of 2 tokens between each of the entries.
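
You can confirm the chunk sizes directly on the `inputs` from the snippet above (a quick sanity check; since we didn't ask for padding, the last chunk is allowed to be shorter):

# Each chunk holds at most max_length=6 token IDs, including [CLS] and [SEP]
print([len(ids) for ids in inputs["input_ids"]])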

Let's take a closer look at the result of the tokenization:

print(inputs.keys())
dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])

As expected, we get the input IDs and an attention mask. The last key, `overflow_to_sample_mapping`, is a map that tells us which sentence each of the results corresponds to. Here we have 7 results that all come from the (only) sentence we passed to the tokenizer:

print(inputs["overflow_to_sample_mapping"])
[0, 0, 0, 0, 0, 0, 0]

This is more useful when we tokenize several sentences together. For instance, this:

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

We get:

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

This means the first sentence was split into 7 chunks as before, and the next 4 chunks come from the second sentence.

Now let's go back to our long context. By default, the `question-answering` pipeline uses a maximum length of 384 (as we mentioned earlier) and a stride of 128, which correspond to the way the model was fine-tuned (you can adjust those parameters by passing `max_seq_len` and `stride` arguments when calling the pipeline). We will thus use those parameters when tokenizing. We'll also add padding (to have samples of the same length, so we can build tensors) as well as ask for the offsets:

inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

Those `inputs` will contain the input IDs and attention masks the model expects, as well as the offsets and the `overflow_to_sample_mapping` we just talked about. Since those two are not parameters used by the model, we'll pop them out of the `inputs` (and we won't store the map, since it's not useful here) before converting them to tensors:

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)
torch.Size([2, 384])

Our long context was split in two, which means that after it goes through our model, we will have two sets of start and end logits:

outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([2, 384]) torch.Size([2, 384])

Like before, we first mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens (as flagged by the attention mask):

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

Then we can use the softmax to convert our logits to probabilities:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

The next step is similar to what we did for the small context, but we repeat it for each of our two chunks. We attribute a score to all possible spans of answer, then take the span with the best score:

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)
[(0, 18, 0.33867), (173, 184, 0.97149)]

Those two candidates correspond to the best answers the model was able to find in each chunk. The model is far more confident that the right answer is in the second part (which is a good sign!). Now we just have to map those two token spans to spans of characters in the context (we only need to map the second one to get our answer, but it's interesting to see what the model picked in the first chunk).

✏️ Try it out! Adapt the code above to return the scores and spans of the five most likely answers (in total, not per chunk).

The `offsets` we grabbed earlier is actually a list of offsets, with one list per chunk of text:

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)
{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.97149}

If we ignore the first result, we get the same result as our pipeline for this long context. Great!
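
If you want a single final prediction like the one the pipeline returns, one option (a minimal sketch built on the `candidates` and `offsets` we already computed) is to keep only the highest-scoring candidate across the chunks:

# Keep only the best-scoring candidate across all chunks
best_chunk = max(range(len(candidates)), key=lambda i: candidates[i][2])
start_token, end_token, score = candidates[best_chunk]
start_char, _ = offsets[best_chunk][start_token]
_, end_char = offsets[best_chunk][end_token]
print({"answer": long_context[start_char:end_char], "start": start_char, "end": end_char, "score": score})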

✏️ Try it out! Use the best scores you computed earlier to show the five most likely answers (for the whole context, not each chunk). To check your results, go back to the first pipeline and pass in `top_k=5` when calling it.

This concludes our deep dive into the tokenizer's capabilities. We will put all of this into practice again in the next chapter, when we show you how to fine-tune a model on a range of common NLP tasks.
