Putting it all together
In the previous sections, we've been doing most of the work by hand. We explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.
However, as we saw in section 2, the 🤗 Transformers API can handle all of this for us with a high-level function that we'll dive into here. When you call your tokenizer directly on a sentence, you get back inputs that are ready to pass to your model:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
Here, the model_inputs variable contains everything that's necessary for a model to operate well. For DistilBERT, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the tokenizer object.
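You can print model_inputs to get a look at what it holds. As a rough illustration (output abridged here), it behaves like a dictionary containing the input IDs and the matching attention mask:
print(model_inputs)
# {'input_ids': [101, 1045, 1005, ...], 'attention_mask': [1, 1, 1, ...]}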
As we'll see in some examples below, this method is very powerful. First, it can tokenize a single sequence:
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
It also handles multiple sequences at a time, with no change in the API:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)
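A quick sanity check: without padding, the two sequences come back with different numbers of token IDs, which is exactly why the padding options shown next exist.
# Each sequence gets its own list of IDs, and the lists have different lengths
print([len(ids) for ids in model_inputs["input_ids"]])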
It can pad according to several objectives:
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
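Note that padding only lengthens the shorter sequences; it never cuts the longer ones, so with max_length=8 above the first sequence keeps its full length. Here is a minimal check of the padding behavior, using the "longest" strategy:
model_inputs = tokenizer(sequences, padding="longest")
# Both lists of IDs now have the same length
print([len(ids) for ids in model_inputs["input_ids"]])
# The attention mask marks real tokens with 1 and padding with 0
print(model_inputs["attention_mask"][1])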
It can also truncate sequences:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
The tokenizer object can handle the conversion to specific framework tensors, which can then be sent directly to the model. For example, in the following code sample we are prompting the tokenizer to return tensors from the different frameworks: "pt" returns PyTorch tensors, "tf" returns TensorFlow tensors, and "np" returns NumPy arrays:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
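As a quick illustration of the difference, you can inspect what the tokenizer returned; with return_tensors="pt", for instance, the input IDs come back as a single PyTorch tensor with the batch dimension first:
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
print(type(model_inputs["input_ids"]))  # <class 'torch.Tensor'>
print(model_inputs["input_ids"].shape)  # (batch size, sequence length)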
Special tokens
If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
One token ID was added at the beginning, and one at the end. Let's decode the two sequences of IDs above to see what this is all about:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."分詞器在開頭添加了特殊詞 [CLS],在結尾添加了特殊詞 [SEP]。這是因為模型是在這些詞的訓練的,因此為了獲得相同的推理結果,我們需要同樣新增它們。請注意,一些模型不會新增特殊詞,或新增不同的特殊詞;模型也可能僅在開頭或結尾新增這些特殊詞。在任何情況下,分詞器都知道哪些是預期的,並將為您處理這些問題。
Wrapping up: From tokenizer to model
Now that we've seen all the individual steps the tokenizer object uses when applied to texts, let's see one final time how it can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
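The output contains the logits, one row per input sequence. As we saw in section 2, a common next step is to turn those logits into probabilities with a softmax; for example:
# One row of logits per sequence, one column per label
print(output.logits.shape)
# Convert the logits to probabilities
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
print(predictions)
# For this checkpoint, the labels map to NEGATIVE and POSITIVE
print(model.config.id2label)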