Putting it all together


In the last few sections, we've been doing most of the work by hand: we explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.

However, as we saw in Section 2, the 🤗 Transformers API can handle all of this for us with a high-level function, which we'll dive into here. When you call your `tokenizer` directly on a sentence, you get back inputs that are ready to pass to your model:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

Here, the `model_inputs` variable contains everything a model needs to run properly. For DistilBERT, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the `tokenizer` object.
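
To see this concretely, you can print `model_inputs`. The output below is what the DistilBERT tokenizer of this checkpoint produces for this sentence (the exact IDs depend on the tokenizer used):

print(model_inputs)
{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}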

As we'll see in the examples below, this method is very powerful. First, it can tokenize a single sequence:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

It can also handle multiple sequences at a time, with no change in the API:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)
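
With several sequences, `input_ids` becomes one list of token IDs per sequence. A quick way to check, assuming the DistilBERT tokenizer and the two sequences above:

print([len(ids) for ids in model_inputs["input_ids"]])
[16, 6]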

It can pad according to several objectives:

# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
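Comparing the resulting lengths makes these strategies concrete. After the last call above (`padding="max_length"` with `max_length=8`), only the shorter sequence is padded to 8; the first sequence is longer than 8 and is left untouched, because `padding` alone never truncates (lengths assume the example sequences and tokenizer above, and 🤗 Transformers may warn that truncation was not activated):

print([len(ids) for ids in model_inputs["input_ids"]])
[16, 8]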

It can also truncate sequences:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
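Comparing lengths again shows the effect: with `max_length=8`, the first sequence is cut down to 8 tokens, while the second one is already shorter and stays untouched (lengths assume the example sequences above):

print([len(ids) for ids in model_inputs["input_ids"]])
[8, 6]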

The `tokenizer` object can handle the conversion to tensors of a specific framework, which can then be sent directly to the model. In the following code sample, for example, we prompt the tokenizer to return tensors from different frameworks, where `"pt"` returns PyTorch tensors and `"np"` returns NumPy arrays:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
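
The returned values are now framework tensors with a batch dimension rather than plain Python lists. Continuing from the last call above (`return_tensors="np"`), and assuming the two example sequences:

print(type(model_inputs["input_ids"]))
print(model_inputs["input_ids"].shape)
<class 'numpy.ndarray'>
(2, 16)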

Special tokens

If we take a look at the input IDs returned by the tokenizer, we'll see that they are a tiny bit different from what we had earlier:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

One token ID was added at the beginning and one at the end. Let's decode the two sequences of IDs above to see what this is about:

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."

The tokenizer added the special word `[CLS]` at the beginning and the special word `[SEP]` at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don't add special words, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
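
If you want to inspect which special tokens a given tokenizer uses, they are exposed as attributes on the tokenizer object. For the DistilBERT tokenizer loaded above (the exact list and ordering may differ by tokenizer):

print(tokenizer.cls_token, tokenizer.cls_token_id)
print(tokenizer.sep_token, tokenizer.sep_token_id)
print(tokenizer.all_special_tokens)
[CLS] 101
[SEP] 102
['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']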

Wrapping up: From tokenizer to model

Now that we've seen all the individual steps the `tokenizer` object applies to texts, let's look one final time at how its main API can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
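
From here, the raw `logits` can be turned into probabilities with a softmax, as we did earlier in the course. A short sketch of that final step, continuing the code above (for this checkpoint, `model.config.id2label` maps 0 to NEGATIVE and 1 to POSITIVE):

# Convert the raw logits into probabilities over the two labels
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
print(predictions)
print(model.config.id2label)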
