WordPiece tokenization

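WordPiece is the tokenization algorithm Google developed to pretrain BERT. It has since been reused in quite a few Transformer models based on BERT, such as DistilBERT, MobileBERT, Funnel Transformers, and MPNet. It is very similar to BPE in terms of training, but the actual tokenization is done differently.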

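💡 This section covers WordPiece in depth, going as far as showing a full implementation. You can skip to the end if you just want a general overview of the tokenization algorithm.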

Training algorithm

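⚠️ Google never open-sourced its implementation of the WordPiece training algorithm, so what follows is our best guess based on the published literature. It may not be 100% accurate.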

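Like BPE, WordPiece starts from a small vocabulary that includes the special tokens used by the model and the initial alphabet. Since it identifies subwords by adding a prefix (like ## for BERT), each word is initially split by adding that prefix to all the characters inside the word. So, for instance, "word" gets split like this: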

w ##o ##r ##d

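Thus, the initial alphabet contains all the characters present at the beginning of a word, plus the characters present inside a word preceded by the WordPiece prefix.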
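As a minimal sketch of this initial split, the helpers below prefix every non-initial character and collect the resulting alphabet. The function names `initial_split` and `initial_alphabet` are illustrative assumptions, not part of any library.

```python
def initial_split(word, prefix="##"):
    """Split a word into characters, prefixing every non-initial character."""
    return [ch if i == 0 else prefix + ch for i, ch in enumerate(word)]


def initial_alphabet(words, prefix="##"):
    """Collect the unique tokens produced by the initial split of each word."""
    return sorted({token for word in words for token in initial_split(word, prefix)})


print(initial_split("word"))
# ['w', '##o', '##r', '##d']
print(initial_alphabet(["hug", "pug", "pun", "bun", "hugs"]))
# ['##g', '##n', '##s', '##u', 'b', 'h', 'p']
```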

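Then, again like BPE, WordPiece learns merge rules. The main difference is the way the pair to be merged is selected. Instead of selecting the most frequent pair, WordPiece computes a score for each pair, using the following formula: $\mathrm{score} = (\mathrm{freq\_of\_pair}) / (\mathrm{freq\_of\_first\_element} \times \mathrm{freq\_of\_second\_element})$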

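By dividing the frequency of the pair by the product of the frequencies of each of its parts, the algorithm prioritizes merging pairs whose individual parts are less frequent in the vocabulary. For instance, it won't necessarily merge ("un", "##able") even if that pair occurs very frequently in the vocabulary, because the two parts "un" and "##able" will likely each appear in many other words and have a high frequency. In contrast, a pair like ("hu", "##gging") will probably be merged sooner (assuming the word "hugging" appears often in the vocabulary), since "hu" and "##gging" are likely to be less frequent individually.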

Let's look at the same vocabulary we used in the BPE training example:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

The splits here will be:

("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)

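So the initial vocabulary will be ["b", "h", "p", "##g", "##n", "##s", "##u"] (if we forget about special tokens for now). The most frequent pair is ("##u", "##g") (present 20 times), but the individual frequency of "##u" is very high, so its score is not the highest (it is 1 / 36). All pairs containing "##u" actually have that same score (1 / 36), so the best score goes to the pair ("##g", "##s"), the only one without "##u", at 1 / 20, and the first merge learned is ("##g", "##s") -> ("##gs").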
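The scores quoted above can be reproduced with a short sketch. It assumes token frequencies are counted once per occurrence and weighted by word frequency; `toy_corpus` and `toy_pair_scores` are illustrative names, and exact fractions are used so the values read the same as in the text.

```python
from collections import Counter
from fractions import Fraction

# Initial splits of the toy corpus, mapped to word frequencies.
toy_corpus = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}


def toy_pair_scores(corpus):
    """score = freq_of_pair / (freq_of_first_element * freq_of_second_element)."""
    token_freqs = Counter()
    pair_freqs = Counter()
    for split, freq in corpus.items():
        for token in split:
            token_freqs[token] += freq
        for pair in zip(split, split[1:]):
            pair_freqs[pair] += freq
    return {
        pair: Fraction(freq, token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }


toy_scores = toy_pair_scores(toy_corpus)
for pair, score in toy_scores.items():
    print(pair, score)

best = max(toy_scores, key=toy_scores.get)
print("best pair:", best)  # ('##g', '##s'), with score 1/20
```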

Note that when we merge, we remove the ## between the two tokens, so we add "##gs" to the vocabulary and apply the merge in the words of the corpus:

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs"]
Corpus: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)
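Applying a merge to the toy corpus can be sketched as follows. The `toy_merge` helper is a hypothetical name for illustration; it drops the ## prefix of the second token when concatenating, as described above.

```python
toy_corpus = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}


def toy_merge(a, b, corpus):
    """Replace every adjacent (a, b) in each split with the merged token."""
    merged = a + (b[2:] if b.startswith("##") else b)
    new_corpus = {}
    for split, freq in corpus.items():
        out, i = [], 0
        while i < len(split):
            if i + 1 < len(split) and split[i] == a and split[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(split[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus


merged_corpus = toy_merge("##g", "##s", toy_corpus)
for split, freq in merged_corpus.items():
    print(split, freq)
```

Only the last word changes: ("h", "##u", "##g", "##s") becomes ("h", "##u", "##gs").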

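At this point, "##u" is in all the possible pairs, so they all end up with the same score. Let's say that in this case, the first pair is merged, so ("h", "##u") -> "hu". This takes us to: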

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu"]
Corpus: ("hu" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)

Then the next best score is shared by ("hu", "##g") and ("hu", "##gs") (with 1/15, compared to 1/21 for all the other pairs), so the first pair with the biggest score is merged:

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"]
Corpus: ("hug", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)

We continue like this until we reach the desired vocabulary size.

✏️ Now your turn! What will the next merge rule be?

Tokenization algorithm

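Tokenization differs between WordPiece and BPE in that WordPiece only saves the final vocabulary, not the merge rules learned. Starting from the word to tokenize, WordPiece finds the longest subword that is in the vocabulary, then splits on it. For instance, if we use the vocabulary learned in the example above, for the word "hugs" the longest subword starting from the beginning that is in the vocabulary is "hug", so we split there and get ["hug", "##s"]. We then continue with "##s", which is in the vocabulary, so the tokenization of "hugs" is ["hug", "##s"].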

With BPE, we would have applied the learned merges in order and tokenized this as ["hu", "##gs"], so the encoding is different.
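To make the contrast concrete, here is a rough sketch of merge-order tokenization applied to the toy example. It assumes the three merges learned above are replayed in the order they were learned and keeps the ## notation of this section; `apply_merges_in_order` and `toy_merges` are hypothetical names.

```python
def apply_merges_in_order(word, merges, prefix="##"):
    """Start from characters and replay each merge rule in the order it was learned."""
    tokens = [ch if i == 0 else prefix + ch for i, ch in enumerate(word)]
    for a, b in merges:
        merged = a + (b[len(prefix):] if b.startswith(prefix) else b)
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i : i + 2] = [merged]
            else:
                i += 1
    return tokens


# Merges from the walkthrough above, in learning order.
toy_merges = [("##g", "##s"), ("h", "##u"), ("hu", "##g")]

print(apply_merges_in_order("hugs", toy_merges))  # ['hu', '##gs']
print(apply_merges_in_order("hug", toy_merges))   # ['hug']
```

Because ("##g", "##s") is replayed first, "##g" is already consumed by the time ("hu", "##g") is tried, which is why "hugs" ends up as ["hu", "##gs"] here.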

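As another example, let's see how the word "bugs" would be tokenized. "b" is the longest subword starting at the beginning of the word that is in the vocabulary, so we split there and get ["b", "##ugs"]. Then "##u" is the longest subword starting at the beginning of "##ugs" that is in the vocabulary, so we split there and get ["b", "##u", "##gs"]. Finally, "##gs" is in the vocabulary, so this last list is the tokenization of "bugs".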

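When the tokenization reaches a stage where it is not possible to find a subword in the vocabulary, the whole word is tokenized as unknown. For instance, "mug" would be tokenized as ["[UNK]"], as would "bum": even though we can begin with "b" and "##u", "##m" is not in the vocabulary, and the resulting tokenization will just be ["[UNK]"], not ["b", "##u", "[UNK]"]. This is another difference from BPE, which would only classify the individual characters not in the vocabulary as unknown.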
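The greedy longest-match procedure and its unknown-word rule can be sketched on the toy vocabulary as follows. This is an illustrative version that takes the vocabulary as a parameter; `longest_match_tokenize` and `toy_vocab` are assumed names.

```python
def longest_match_tokenize(word, vocab, prefix="##", unk="[UNK]"):
    """Greedily take the longest vocabulary entry at each position of the word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # word-internal pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # Nothing matches at this position: the whole word is unknown.
            return [unk]
        tokens.append(match)
        start = end
    return tokens


toy_vocab = {"b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"}

for w in ["hugs", "bugs", "mug", "bum"]:
    print(w, longest_match_tokenize(w, toy_vocab))
# hugs ['hug', '##s']
# bugs ['b', '##u', '##gs']
# mug ['[UNK]']
# bum ['[UNK]']
```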

✏️ Now your turn! How will the word "pugs" be tokenized?

Implementing WordPiece

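Now let's take a look at an implementation of the WordPiece algorithm. As with BPE, this is purely pedagogical, and you won't be able to use it on a big corpus.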

We will use the same corpus as in the BPE example:

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

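First, we need to pre-tokenize the corpus into words. Since we are replicating a WordPiece tokenizer (like BERT), we will use the bert-base-cased tokenizer for the pre-tokenization: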

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Then we compute the frequency of each word in the corpus as we do the pre-tokenization:

from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs

defaultdict(
    int, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1, 'about': 1,
    'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1,
    ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1, 'they': 1, 'are': 1,
    'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})
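If downloading the bert-base-cased tokenizer is not convenient, the same counts can be approximated with a regular expression. This is only a rough stand-in, assuming the pre-tokenizer splits on whitespace and isolates each punctuation character; it is not the library's implementation and can differ on other inputs (for example, the regex treats underscores as word characters). The names `sample_corpus`, `approx_pre_tokenize`, and `approx_word_freqs` are illustrative.

```python
import re
from collections import Counter

sample_corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]


def approx_pre_tokenize(text):
    """Approximate pre-tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


approx_word_freqs = Counter(
    word for text in sample_corpus for word in approx_pre_tokenize(text)
)
print(approx_word_freqs["This"], approx_word_freqs["is"], approx_word_freqs["."])
# 3 2 4
```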

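As we saw before, the alphabet is the unique set composed of all the first letters of words, plus all the other letters that appear in words, prefixed by ##: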

alphabet = []
for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()

print(alphabet)

['##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s',
 '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u',
 'w', 'y']

We also add the special tokens used by the model at the beginning of that vocabulary. In the case of BERT, it is the list ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

Next we need to split each word, with every letter except the first prefixed by ##:

splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}

Now that we are ready for training, let's write a function that computes the score of each pair. We'll need to use this at each step of the training:

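def compute_pair_scores(splits):
    # Frequencies of each token and of each adjacent pair, weighted by word frequency.
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            # A single-token word contributes no pairs.
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        # The loop above stops one token short, so count the last token here.
        letter_freqs[split[-1]] += freq

    # score = freq_of_pair / (freq_of_first_element * freq_of_second_element)
    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores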

Let's have a look at a part of this dictionary after the initial splits:

pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

('T', '##h'): 0.125
('##h', '##i'): 0.03409090909090909
('##i', '##s'): 0.02727272727272727
('i', '##s'): 0.1
('t', '##h'): 0.03571428571428571
('##h', '##e'): 0.011904761904761904

Now, finding the pair with the best score only takes a quick loop:

best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

('a', '##b') 0.2

So the first merge to learn is ('a', '##b') -> 'ab', and we add 'ab' to the vocabulary:
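As a sanity check, this score can be reproduced by hand from the word frequencies shown earlier, assuming tokens are counted the way compute_pair_scores counts them. The token "a" starts five words ("about", "able", "algorithms", "are", "and"), "##b" appears in two words ("about", "able"), and the pair occurs in those same two words. Similarly, "T" appears three times (in "This"), "##h" eight times, and the pair three times:

$$\mathrm{score}(\text{a}, \text{\#\#b}) = \frac{2}{5 \times 2} = 0.2, \qquad \mathrm{score}(\text{T}, \text{\#\#h}) = \frac{3}{3 \times 8} = 0.125$$

Both values agree with the printed output above.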

vocab.append("ab")

To continue, we need to apply that merge in our splits dictionary. Let's write another function for this:

def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                # Drop the ## prefix of the second token when concatenating.
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

And we can have a look at the result of the first merge:

splits = merge_pair("a", "##b", splits)
splits["about"]

['ab', '##o', '##u', '##t']

Now we have everything we need to loop until we have learned all the merges we want. Let's aim for a vocabulary size of 70:

vocab_size = 70
while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)

We can then look at the generated vocabulary:

print(vocab)

['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k',
 '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H',
 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y', 'ab', '##fu', 'Fa', 'Fac', '##ct', '##ful', '##full', '##fully',
 'Th', 'ch', '##hm', 'cha', 'chap', 'chapt', '##thm', 'Hu', 'Hug', 'Hugg', 'sh', 'th', 'is', '##thms', '##za', '##zat',
 '##ut']

As we can see, compared to BPE, this tokenizer learns parts of words as tokens a bit faster.

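💡 Using train_new_from_iterator() on the same corpus won't result in the exact same vocabulary. This is because the 🤗 Tokenizers library does not implement WordPiece for the training (since we are not completely sure of its internals), but uses BPE instead.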

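To tokenize a new text, we pre-tokenize it, split it, then apply the tokenization algorithm to each word. That is, we look for the longest subword starting at the beginning of the first word and split it off, then we repeat the process on the remainder, and so on for the rest of that word and the following words in the text: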

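def encode_word(word):
    tokens = []
    while len(word) > 0:
        # Find the longest prefix of the remaining string that is in the vocabulary.
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            # No prefix matches: the whole word is unknown.
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            # The remainder is word-internal, so it carries the ## prefix.
            word = f"##{word}"
    return tokens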

Let's test it on one word that can be built from the vocabulary, and another that can't:

print(encode_word("Hugging"))
print(encode_word("HOgging"))

['Hugg', '##i', '##n', '##g']
['[UNK]']

Now, let's write a function that tokenizes a text:

def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    return sum(encoded_words, [])

We can try it on any text:

tokenize("This is the Hugging Face course!")

['Th', '##i', '##s', 'is', 'th', '##e', 'Hugg', '##i', '##n', '##g', 'Fac', '##e', 'c', '##o', '##u', '##r', '##s',
 '##e', '[UNK]']

That's it for the WordPiece algorithm! Now let's take a look at Unigram.
