Putting it all together


In the last few sections, we've been doing most of the work by hand: we explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.

However, as we saw in Section 2, the 🤗 Transformers API can handle all of this for us with a high-level function, which we'll dive into here. When you call your `tokenizer` directly on a sentence, you get back inputs that are ready to pass to your model:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

Here, the `model_inputs` variable contains everything a model needs to run properly. For DistilBERT, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the `tokenizer` object.
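
To see this concretely, you can print `model_inputs`. The output below is what the DistilBERT tokenizer of this checkpoint produces for this sentence (the exact IDs depend on the tokenizer used):

print(model_inputs)
{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}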

As we'll see in the examples below, this method is very powerful. First, it can tokenize a single sequence:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

It can also handle multiple sequences at a time, with no change in the API:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)
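
With several sequences, `input_ids` becomes one list of token IDs per sequence. A quick way to check, assuming the DistilBERT tokenizer and the two sequences above:

print([len(ids) for ids in model_inputs["input_ids"]])
[16, 6]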

It can pad according to several objectives:

# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
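Comparing the resulting lengths makes these strategies concrete. After the last call above (`padding="max_length"` with `max_length=8`), only the shorter sequence is padded to 8; the first sequence is longer than 8 and is left untouched, because `padding` alone never truncates (lengths assume the example sequences and tokenizer above, and 🤗 Transformers may warn that truncation was not activated):

print([len(ids) for ids in model_inputs["input_ids"]])
[16, 8]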

It can also truncate sequences:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
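Comparing lengths again shows the effect: with `max_length=8`, the first sequence is cut down to 8 tokens, while the second one is already shorter and stays untouched (lengths assume the example sequences above):

print([len(ids) for ids in model_inputs["input_ids"]])
[8, 6]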

The `tokenizer` object can handle the conversion to tensors of a specific framework, which can then be sent directly to the model. In the following code sample, for example, we prompt the tokenizer to return tensors from different frameworks, where `"pt"` returns PyTorch tensors and `"np"` returns NumPy arrays:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
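
The returned values are now framework tensors with a batch dimension rather than plain Python lists. Continuing from the last call above (`return_tensors="np"`), and assuming the two example sequences:

print(type(model_inputs["input_ids"]))
print(model_inputs["input_ids"].shape)
<class 'numpy.ndarray'>
(2, 16)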

Special tokens

If we take a look at the input IDs returned by the tokenizer, we'll see that they are a tiny bit different from what we had earlier:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

One token ID was added at the beginning and one at the end. Let's decode the two sequences of IDs above to see what this is about:

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."

The tokenizer added the special word `[CLS]` at the beginning and the special word `[SEP]` at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don't add special words, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
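
If you want to inspect which special tokens a given tokenizer uses, they are exposed as attributes on the tokenizer object. For the DistilBERT tokenizer loaded above (the exact list and ordering may differ by tokenizer):

print(tokenizer.cls_token, tokenizer.cls_token_id)
print(tokenizer.sep_token, tokenizer.sep_token_id)
print(tokenizer.all_special_tokens)
[CLS] 101
[SEP] 102
['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']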

Wrapping up: From tokenizer to model

Now that we've seen all the individual steps the `tokenizer` object applies to texts, let's look one final time at how its main API can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
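
From here, the raw `logits` can be turned into probabilities with a softmax, as we did earlier in the course. A short sketch of that final step, continuing the code above (for this checkpoint, `model.config.id2label` maps 0 to NEGATIVE and 1 to POSITIVE):

# Convert the raw logits into probabilities over the two labels
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
print(predictions)
print(model.config.id2label)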
