WordPiece tokenization

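WordPiece is the tokenization algorithm Google developed to pretrain BERT. It has since been reused in quite a few Transformer models based on BERT, such as DistilBERT, MobileBERT, Funnel Transformers, and MPNet. It is very similar to BPE in terms of training, but the actual tokenization is done differently.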

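💡 This section covers WordPiece in depth, going as far as showing a full implementation. If you only want a general overview of the tokenization algorithm, you can skip to the end.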

Training algorithm

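⚠️ Google never open-sourced its implementation of the WordPiece training algorithm, so what follows is our best guess based on the published literature. It may not be 100% accurate.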

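Like BPE, WordPiece starts from a small vocabulary containing the special tokens used by the model and the initial alphabet. Since it identifies subwords by adding a prefix (like ## for BERT), each word is initially split by adding that prefix to all the characters inside the word. For instance, "word" gets split like this: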

w ##o ##r ##d

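Thus, the initial alphabet contains all the characters present at the beginning of a word and the characters present inside a word preceded by the WordPiece prefix.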
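The initial split described here can be sketched in a few lines. This is only an illustration: the helper name `initial_split` and the two sample words are ours, not part of any library.

```python
def initial_split(word, prefix="##"):
    """Split a word into characters, prefixing every non-initial character."""
    return [c if i == 0 else f"{prefix}{c}" for i, c in enumerate(word)]


print(initial_split("word"))  # ['w', '##o', '##r', '##d']

# The initial alphabet is the set of all pieces seen across the training words
# (two made-up words here, just for illustration).
words = ["word", "wow"]
alphabet = sorted({piece for w in words for piece in initial_split(w)})
print(alphabet)  # ['##d', '##o', '##r', '##w', 'w']
```

Note that "w" and "##w" are distinct entries: the same character is a different token depending on whether it starts a word or appears inside one.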

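Then, again like BPE, WordPiece learns merge rules. The main difference is the way the pair to be merged is selected. Instead of selecting the most frequent pair, WordPiece computes a score for each pair using the following formula: $\mathrm{score} = (\mathrm{freq\_of\_pair}) / (\mathrm{freq\_of\_first\_element} \times \mathrm{freq\_of\_second\_element})$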

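By dividing the frequency of the pair by the product of the frequencies of each of its parts, the algorithm prioritizes merging pairs whose individual parts are less frequent in the vocabulary. For instance, it won't necessarily merge ("un", "##able") even if that pair occurs very frequently in the vocabulary, because the two parts "un" and "##able" will each likely appear in many other words and have a high frequency. In contrast, a pair like ("hu", "##gging") will probably be merged sooner (assuming the word "hugging" appears often in the vocabulary), since "hu" and "##gging" are likely to be less frequent individually.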

Let's look at the same vocabulary we used in the BPE training example:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

The splits here will be:

("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)

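So the initial vocabulary will be ["b", "h", "p", "##g", "##n", "##s", "##u"] (if we forget about special tokens for now). The most frequent pair is ("##u", "##g") (present 20 times), but the individual frequency of "##u" is very high (36), so its score is not the highest: 20 / (36 × 20) = 1/36. In fact, every pair containing "##u" has that same score of 1/36, so the best score goes to ("##g", "##s"), the only pair without "##u", at 5 / (20 × 5) = 1/20, and the first merge learned is ("##g", "##s") -> ("##gs").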
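To check these numbers, here is a small sketch that computes the score of every pair in this toy corpus. It uses exact fractions so the values can be compared directly with the text; the variable names are ours.

```python
from collections import defaultdict
from fractions import Fraction

# Toy corpus from the text: word -> frequency.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {
    w: [c if i == 0 else f"##{c}" for i, c in enumerate(w)] for w in word_freqs
}

token_freqs = defaultdict(int)
pair_freqs = defaultdict(int)
for word, freq in word_freqs.items():
    pieces = splits[word]
    for piece in pieces:
        token_freqs[piece] += freq
    for a, b in zip(pieces, pieces[1:]):
        pair_freqs[(a, b)] += freq

# score = freq_of_pair / (freq_of_first_element * freq_of_second_element)
scores = {
    pair: Fraction(freq, token_freqs[pair[0]] * token_freqs[pair[1]])
    for pair, freq in pair_freqs.items()
}
for pair, score in scores.items():
    print(pair, score)

best = max(scores, key=scores.get)
print(best)  # ('##g', '##s')
```

Every pair containing "##u" comes out at 1/36, and ("##g", "##s") at 1/20.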

Note that when we merge, we remove the ## between the two tokens, so we add "##gs" to the vocabulary and apply the merge in the words of the corpus:

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs"]
Corpus: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)

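At this point, "##u" is in all the possible pairs, so they all end up with the same score. Let's say that in this case the first pair is merged, so ("h", "##u") -> "hu". This takes us to: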

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu"]
Corpus: ("hu" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)

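After this merge, "hu" and "##g" each occur 15 times, "##gs" 5 times, and "##u" 21 times. Applying the formula, ("hu", "##gs") scores 5 / (15 × 5) = 1/15, the pairs ("p", "##u"), ("##u", "##n"), and ("b", "##u") each score 1/21, and ("hu", "##g") scores 10 / (15 × 15) = 2/45, so a strict application would merge ("hu", "##gs") next. To keep the vocabulary used in the rest of this section, suppose instead that ("hu", "##g") is merged: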

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"]
Corpus: ("hug", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)

We then continue like this until we reach the desired vocabulary size.

✏️ Now your turn! What will the next merge rule be?

Tokenization algorithm

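Tokenization differs between WordPiece and BPE in that WordPiece only saves the final vocabulary, not the merge rules learned. Starting from the word to tokenize, WordPiece finds the longest subword that is in the vocabulary, then splits on it. For instance, if we use the vocabulary learned in the example above, for the word "hugs" the longest subword starting from the beginning that is in the vocabulary is "hug", so we split there and get ["hug", "##s"]. We then continue with "##s", which is in the vocabulary, so the tokenization of "hugs" is ["hug", "##s"].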

With BPE, we would have applied the learned merges in order and tokenized this as ["hu", "##gs"], so the encoding is different.
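To make the contrast concrete, the sketch below applies the three merges from the training example in the order they were learned, the way BPE would. The function name `apply_merges` is hypothetical and only meant for this illustration.

```python
def apply_merges(word, merges):
    """Apply merge rules in the order they were learned (BPE-style)."""
    pieces = [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for a, b in merges:
        i = 0
        while i < len(pieces) - 1:
            if pieces[i] == a and pieces[i + 1] == b:
                # Drop the ## of the second piece when gluing the two together.
                merged = a + b[2:] if b.startswith("##") else a + b
                pieces[i : i + 2] = [merged]
            else:
                i += 1
    return pieces


# Merges in the order they were learned in the training example.
merges = [("##g", "##s"), ("h", "##u"), ("hu", "##g")]
print(apply_merges("hugs", merges))  # ['hu', '##gs']
print(apply_merges("hug", merges))   # ['hug']
```

Because ("##g", "##s") was learned first, "##g" has already been absorbed into "##gs" by the time the ("hu", "##g") rule is tried, so "hugs" never becomes ["hug", "##s"] under ordered merging.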

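As another example, let's see how the word "bugs" would be tokenized. "b" is the longest subword starting at the beginning of the word that is in the vocabulary, so we split there and get ["b", "##ugs"]. Then "##u" is the longest subword starting at the beginning of "##ugs" that is in the vocabulary, so we split there and get ["b", "##u", "##gs"]. Finally, "##gs" is in the vocabulary, so this last list is the tokenization of "bugs".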

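When the tokenization reaches a stage where no subword can be found in the vocabulary, the whole word is tokenized as unknown. For example, "mug" would be tokenized as ["[UNK]"], and so would "bum": even though we can begin with "b" and "##u", "##m" is not in the vocabulary, so the resulting tokenization is just ["[UNK]"], not ["b", "##u", "[UNK]"]. This is another difference from BPE, which would only classify the individual characters not in the vocabulary as unknown.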
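The greedy longest-match procedure and the whole-word [UNK] rule can be sketched as follows, using the toy vocabulary from the training example. The function name `wordpiece_tokenize` and its parameters are our own choices for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, then shrink it from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # pieces inside a word carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk]
        tokens.append(match)
        start = end
    return tokens


vocab = {"b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"}
for w in ["hugs", "bugs", "mug", "bum"]:
    print(w, wordpiece_tokenize(w, vocab))
# hugs ['hug', '##s']
# bugs ['b', '##u', '##gs']
# mug ['[UNK]']
# bum ['[UNK]']
```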

✏️ Now your turn! How will the word "pugs" be tokenized?

Implementing WordPiece

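Now let's take a look at an implementation of the WordPiece algorithm. As with BPE, this is purely pedagogical, and it is not meant to be used on a large corpus.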

We will use the same corpus as in the BPE example:

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

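First, we need to pre-tokenize the corpus into words. Since we are replicating a WordPiece tokenizer (like BERT), we will use the bert-base-cased tokenizer for the pre-tokenization: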

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Then we compute the frequency of each word in the corpus as we do the pre-tokenization:

from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs

defaultdict(
    int, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1, 'about': 1,
    'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1,
    ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1, 'they': 1, 'are': 1,
    'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})
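If you want to experiment without downloading a tokenizer, the word counts can be approximated with a regular expression. Note that this is a swapped-in simplification, not BERT's pre-tokenizer: the pattern and the helper name `simple_pre_tokenize` are assumptions that happen to behave the same on simple English sentences like these.

```python
import re
from collections import Counter

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
]


def simple_pre_tokenize(text):
    # Words are runs of word characters; each punctuation mark stands alone.
    return re.findall(r"\w+|[^\w\s]", text)


word_freqs = Counter(word for text in corpus for word in simple_pre_tokenize(text))
print(simple_pre_tokenize(corpus[0]))
# ['This', 'is', 'the', 'Hugging', 'Face', 'Course', '.']
print(word_freqs["This"], word_freqs["is"], word_freqs["."])  # 2 2 2
```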

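As we saw before, the alphabet is the unique set composed of all the first letters of words, plus all the other letters that appear in words, prefixed by ##: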

alphabet = []
for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()
print(alphabet)

['##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s',
 '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u',
 'w', 'y']

We also add the special tokens used by the model at the beginning of that vocabulary. In the case of BERT, it's the list ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

Next we need to split each word, with every letter that is not the first one prefixed by ##:

splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}

Now that we are ready for training, let's write a function that computes the score of each pair. We will need it at every step of the training:

def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

Let's have a look at a part of this dictionary after the initial splits:

pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

('T', '##h'): 0.125
('##h', '##i'): 0.03409090909090909
('##i', '##s'): 0.02727272727272727
('i', '##s'): 0.1
('t', '##h'): 0.03571428571428571
('##h', '##e'): 0.011904761904761904

Now, finding the pair with the best score only takes a quick loop:

best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

('a', '##b') 0.2
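The same selection can also be written with the built-in `max`. Like the loop above, it keeps the first pair it encounters among ties. The scores below are made-up stand-ins (including the hypothetical pair ('x', '##y')) so that the snippet runs on its own.

```python
# Made-up scores for illustration; ('x', '##y') ties with ('a', '##b').
pair_scores = {
    ("T", "##h"): 0.125,
    ("a", "##b"): 0.2,
    ("i", "##s"): 0.1,
    ("x", "##y"): 0.2,
}

# max() returns the first maximal key in iteration order, so ties resolve
# exactly as with a strict `max_score < score` comparison.
best_pair = max(pair_scores, key=pair_scores.get)
max_score = pair_scores[best_pair]
print(best_pair, max_score)  # ('a', '##b') 0.2
```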

So the first merge to learn is ('a', '##b') -> 'ab', and we add 'ab' to the vocabulary:

vocab.append("ab")

To continue, we need to apply that merge in our splits dictionary. Let's write another function for this:

def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

And we can have a look at the result of the first merge:

splits = merge_pair("a", "##b", splits)
splits["about"]

['ab', '##o', '##u', '##t']
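When a pair occurs several times in one word, every non-overlapping occurrence is merged from left to right. The sketch below shows this on a single list of pieces. It is a rewrite that builds a new list rather than the function above, and the name `merge_in_word` and the sample word are ours.

```python
def merge_in_word(pieces, a, b):
    """Merge every left-to-right occurrence of the pair (a, b) in one word."""
    merged = a + b[2:] if b.startswith("##") else a + b
    out, i = [], 0
    while i < len(pieces):
        if i < len(pieces) - 1 and pieces[i] == a and pieces[i + 1] == b:
            out.append(merged)
            i += 2  # skip both pieces that were just merged
        else:
            out.append(pieces[i])
            i += 1
    return out


# "banana" split into prefixed characters.
pieces = ["b", "##a", "##n", "##a", "##n", "##a"]
print(merge_in_word(pieces, "##a", "##n"))  # ['b', '##an', '##an', '##a']
```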

We now have everything we need to loop until we have learned all the merges we want. Let's aim for a vocabulary size of 70:

vocab_size = 70
while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)

We can then look at the generated vocabulary:

print(vocab)

['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k',
 '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H',
 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y', 'ab', '##fu', 'Fa', 'Fac', '##ct', '##ful', '##full', '##fully',
 'Th', 'ch', '##hm', 'cha', 'chap', 'chapt', '##thm', 'Hu', 'Hug', 'Hugg', 'sh', 'th', 'is', '##thms', '##za', '##zat',
 '##ut']

As we can see, compared to BPE, this tokenizer learns parts of words as tokens a bit faster.

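💡 Using train_new_from_iterator() on the same corpus won't result in the exact same vocabulary. This is because the 🤗 Tokenizers library does not implement WordPiece for training (since we are not completely sure of its internals), but uses BPE instead.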

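To tokenize a new text, we pre-tokenize it, split it, then apply the tokenization algorithm to each word. That is, we look for the longest subword starting at the beginning of the first word and split there, then repeat the process on the remainder, and so on for the rest of that word and the following words in the text: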

def encode_word(word):
    tokens = []
    while len(word) > 0:
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"
    return tokens

Let's test it on one word that can be built from pieces in the vocabulary and another that cannot:

print(encode_word("Hugging"))
print(encode_word("HOgging"))

['Hugg', '##i', '##n', '##g']
['[UNK]']

Now, let's write a function that tokenizes a text:

def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    return sum(encoded_words, [])
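A side note on the last line: `sum(encoded_words, [])` is fine for short texts, but it copies the growing list at every step. A possible alternative, shown here on made-up data, is `itertools.chain.from_iterable`, which produces the same flat list in a single pass.

```python
from itertools import chain

# Stand-in for the per-word token lists produced by encode_word.
encoded_words = [["Th", "##i", "##s"], ["is"], ["[UNK]"]]

flat_sum = sum(encoded_words, [])
flat_chain = list(chain.from_iterable(encoded_words))
print(flat_chain)  # ['Th', '##i', '##s', 'is', '[UNK]']
```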

We can try it on any text:

tokenize("This is the Hugging Face course!")

['Th', '##i', '##s', 'is', 'th', '##e', 'Hugg', '##i', '##n', '##g', 'Fac', '##e', 'c', '##o', '##u', '##r', '##s',
 '##e', '[UNK]']

That's it for the WordPiece algorithm! Now let's take a look at Unigram.
