How to generate text: using different decoding methods for language generation with Transformers

Published March 1, 2020

Patrick von Platen

patrickvonplaten

Note: Edited in July 2023 with up-to-date references and examples.

Introduction

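In recent years, there has been increasing interest in open-ended language generation thanks to the rise of large transformer-based language models trained on millions of webpages, including OpenAI's ChatGPT and Meta's LLaMA. The results on conditioned open-ended language generation are impressive: these models have been shown to generalize to new tasks, handle code, or take non-text data as input. Besides the improved transformer architecture and massive unsupervised training data, better decoding methods have also played an important role.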

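This blog post gives a brief overview of different decoding strategies and, more importantly, shows how you can implement them with very little effort using the popular transformers library!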

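All of the following functionalities can be used for auto-regressive language generation (here is a refresher). In short, auto-regressive language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions: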

$P(w_{1:T} | W_0 ) = \prod_{t=1}^T P(w_{t} | w_{1: t-1}, W_0) \text{ ,with } w_{1: 0} = \emptyset,$

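where $W_0$ is the initial context word sequence. The length $T$ of the word sequence is usually determined on-the-fly and corresponds to the timestep $t=T$ at which the EOS token is generated from $P(w_{t} | w_{1: t-1}, W_{0})$.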
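To make the factorization concrete, the sketch below multiplies conditional next-word probabilities for a tiny made-up vocabulary. The table `COND`, its words, and its numbers are illustrative assumptions, not values taken from a real model.

```python
from math import prod

# Hypothetical toy conditional distributions P(w_t | w_{1:t-1}, W_0),
# keyed by the word sequence seen so far. All numbers are made up.
COND = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def sequence_probability(context, words):
    """Multiply the conditional probability of each generated word."""
    prefix = tuple(context)
    factors = []
    for word in words:
        factors.append(COND[prefix][word])
        prefix += (word,)
    return prod(factors)


p_nice_woman = sequence_probability(["The"], ["nice", "woman"])
p_dog_has = sequence_probability(["The"], ["dog", "has"])
print(round(p_nice_woman, 2), round(p_dog_has, 2))
```

Each factor depends on the full prefix, which is why the lookup key grows by one word per step.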

We will give a tour of the currently most prominent decoding methods: greedy search, beam search, and sampling.

Let's quickly install transformers and load the model. We will use GPT2 in PyTorch for demonstration, but the API is 1-to-1 the same for TensorFlow and JAX.

!pip install -q transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(torch_device)

Greedy Search

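Greedy search is the simplest decoding method. At each timestep $t$ it selects the word with the highest probability as the next word: $w_t = argmax_{w}P(w | w_{1:t-1})$. The following sketch shows greedy search.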

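Starting from the word $\text{"The"}$, the algorithm greedily chooses the next word of highest probability, $\text{"nice"}$, and so on, so that the final generated word sequence is $(\text{"The"}, \text{"nice"}, \text{"woman"})$, with an overall probability of $0.5 \times 0.4 = 0.2$.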
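The procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not how transformers implements it: `TOY_MODEL` is a hypothetical lookup table standing in for a language model, and only the probabilities 0.5, 0.4, and 0.9 come from the text; the remaining numbers are assumed.

```python
# Hypothetical toy model: maps a context (tuple of words) to next-word
# probabilities. Values other than 0.5, 0.4 and 0.9 are assumptions.
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def greedy_search(context, steps):
    """Pick the single most likely next word at every step."""
    sequence, probability = tuple(context), 1.0
    for _ in range(steps):
        next_dist = TOY_MODEL[sequence]
        word = max(next_dist, key=next_dist.get)  # argmax over next words
        probability *= next_dist[word]
        sequence += (word,)
    return sequence, probability


greedy_sequence, greedy_probability = greedy_search(["The"], steps=2)
print(greedy_sequence, round(greedy_probability, 2))
```

Because only one hypothesis is ever kept, the branch through "dog" is discarded after the first step, no matter how likely its continuations are.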

In the following, we will generate word sequences using GPT2 on the context $(\text{"I"}, \text{"enjoy"}, \text{"walking"}, \text{"with"}, \text{"my"}, \text{"cute"}, \text{"dog"})$. Let's see how greedy search can be used in transformers:

# encode context the generation is conditioned on
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').to(torch_device)

# generate 40 new tokens
greedy_output = model.generate(**model_inputs, max_new_tokens=40)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure

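Alright! We have generated our first short text with GPT2 😊. The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation in general and seems to be even more so in greedy and beam search; see Vijayakumar et al., 2016 and Shao et al., 2017.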

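The major drawback of greedy search, though, is that it misses high-probability words hidden behind a low-probability word, as can be seen in our sketch above: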

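The word $\text{"has"}$, with its high conditional probability of $0.9$, is hidden behind the word $\text{"dog"}$, which has only the second-highest conditional probability, so greedy search misses the word sequence $\text{"The"}, \text{"dog"}, \text{"has"}$.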

Thankfully, we have beam search to alleviate this problem!

Beam search

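Beam search reduces the risk of missing hidden high-probability word sequences by keeping the most likely num_beams hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Let's illustrate with num_beams=2: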

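At time step 1, besides the most likely hypothesis $(\text{"The"}, \text{"nice"})$, beam search also keeps track of the second most likely one, $(\text{"The"}, \text{"dog"})$. At time step 2, beam search finds that the word sequence $(\text{"The"}, \text{"dog"}, \text{"has"})$ has a probability of $0.4 \times 0.9 = 0.36$, which is higher than the $0.2$ of $(\text{"The"}, \text{"nice"}, \text{"woman"})$. Great, it has found the most likely word sequence in our toy example!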

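In practice, beam search usually finds an output sequence with higher probability than greedy search, but it is not guaranteed to find the most likely output.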
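The toy walkthrough above can be reproduced with a short sketch. It is a simplified illustration under stated assumptions: `TOY_MODEL` is a hypothetical lookup table (only 0.5, 0.4, and 0.9 come from the text), and EOS handling and length normalization are left out.

```python
# Hypothetical toy model: context (tuple of words) -> next-word probabilities.
# Values other than 0.5, 0.4 and 0.9 are assumptions.
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(context, steps, num_beams):
    """Keep the num_beams most probable hypotheses after every step."""
    beams = [(tuple(context), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, probability in beams:
            for word, p in TOY_MODEL[sequence].items():
                candidates.append((sequence + (word,), probability * p))
        # Sort all expansions by probability and prune to the beam width.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


best_sequence, best_probability = beam_search(["The"], steps=2, num_beams=2)[0]
one_beam_sequence, one_beam_probability = beam_search(["The"], steps=2, num_beams=1)[0]
print(best_sequence, round(best_probability, 2))
print(one_beam_sequence, round(one_beam_probability, 2))
```

With a beam width of 1 the function degenerates to greedy search, which makes the difference between the two methods easy to compare on the same table.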

Let's see how beam search can be used in transformers. We set num_beams > 1 and early_stopping=True so that generation is finished when all beam hypotheses have reached the EOS token.

# activate beam search and early_stopping
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure

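While the result is arguably more fluent, the output still includes repetitions of the same word sequences. One of the available remedies is to introduce n-gram (i.e. word sequences of n words) penalties, as introduced by Paulus et al. (2017) and Klein et al. (2017). The most common n-gram penalty makes sure that no n-gram appears twice by manually setting the probability of next words that could create an already seen n-gram to 0.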
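The idea behind this penalty can be sketched in plain Python. The helper names `banned_next_tokens` and `apply_ngram_penalty` are hypothetical and the example works on words rather than token ids; it illustrates the principle and is not the library's implementation.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the words that would complete an n-gram already seen in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) words form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


def apply_ngram_penalty(next_word_probs, tokens, ngram_size):
    """Set the probability of every banned next word to 0."""
    banned = banned_next_tokens(tokens, ngram_size)
    return {w: (0.0 if w in banned else p) for w, p in next_word_probs.items()}


history = ["I", "walk", "my", "dog", "and", "I"]
# Made-up next-word probabilities for illustration.
next_word_probs = {"walk": 0.7, "run": 0.2, "sleep": 0.1}
penalized = apply_ngram_penalty(next_word_probs, history, ngram_size=2)
print(penalized)
```

Since the 2-gram ("I", "walk") already occurred, "walk" can no longer follow "I", even though it was the most likely candidate.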

Let's try it out by setting no_repeat_ngram_size=2 so that no 2-gram appears twice:

# set no_repeat_ngram_size to 2
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to

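Nice, that looks much better! We can see that the repetition does not appear anymore. Nevertheless, n-gram penalties have to be used with care. An article generated about the city New York should not use a 2-gram penalty, or otherwise the name of the city would appear only once in the whole text!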

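Another important feature of beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best.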

In transformers, we simply set the parameter num_return_sequences to the number of highest-scoring beams that should be returned. Make sure though that num_return_sequences <= num_beams!

# set num_return_sequences > 1
beam_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea to
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time to take a
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea.

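As can be seen, the five beam hypotheses are only marginally different from each other, which should not be too surprising when using only 5 beams.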

In open-ended generation, several reasons have been brought forward for why beam search might not be the best possible option:

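Beam search can work very well in tasks where the length of the desired generation is more or less predictable, as in machine translation or summarization; see Murray et al. (2018) and Yang et al. (2018). But this is not the case for open-ended generation, where the desired output length can vary greatly, e.g. dialog and story generation.
We have seen that beam search heavily suffers from repetitive generation. This is especially hard to control with n-gram or other penalties in story generation, since finding a good trade-off between inhibiting repetition and repeating cycles of identical n-grams requires a lot of tuning.
As argued in Ari Holtzman et al. (2019), high-quality human language does not follow a distribution of high-probability next words. In other words, as humans, we want generated text to surprise us and not to be boring or predictable. The authors show this nicely by plotting the probability a model would give to human text against what beam search does.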

So let's stop being boring and introduce some randomness 🤪.

Sampling

In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution:

$w_t \sim P(w|w_{1:t-1})$

Taking the example from above, the following graphic visualizes language generation when sampling.

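It becomes obvious that language generation using sampling is not deterministic anymore. The word $(\text{"car"})$ is sampled from the conditional probability distribution $P(w | \text{"The"})$, followed by sampling $(\text{"drives"})$ from $P(w | \text{"The"}, \text{"car"})$.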
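A minimal sketch of this kind of ancestral sampling, using only the standard library, is shown below. `TOY_MODEL` is a hypothetical lookup table with assumed probabilities, and the `seed` argument plays the same role as fixing the random seed for reproducibility.

```python
import random

# Hypothetical toy model: context (tuple of words) -> next-word probabilities.
# All numbers are illustrative assumptions.
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def sample_next_word(next_word_probs, rng):
    """Draw one word with probability proportional to its weight."""
    words = list(next_word_probs)
    weights = [next_word_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def sample_sequence(context, steps, seed):
    """Sample `steps` words one after another, each conditioned on the prefix."""
    rng = random.Random(seed)
    sequence = tuple(context)
    for _ in range(steps):
        sequence += (sample_next_word(TOY_MODEL[sequence], rng),)
    return sequence


for seed in range(3):
    print(seed, sample_sequence(["The"], steps=2, seed=seed))
```

Different seeds can produce different sequences, while the same seed always reproduces the same one.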

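In transformers, we set do_sample=True and deactivate Top-K sampling (more on this later) via top_k=0. In the following, we will fix the random seed for illustration purposes. Feel free to change the set_seed argument to obtain different results, or to remove it for non-determinism.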

# set seed to reproduce results. Feel free to change the seed though to get different results
from transformers import set_seed
set_seed(42)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog for the rest of the day, but this had me staying in an unusual room and not going on nights out with friends (which will always be wondered for a mere minute or so at this point).

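Interesting! The text seems alright, but on closer inspection it is not very coherent and doesn't sound like it was written by a human. That is the big problem when sampling word sequences: the models often generate incoherent gibberish, cf. Ari Holtzman et al. (2019).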

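A trick is to make the distribution $P(w|w_{1:t-1})$ sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words) by lowering the so-called temperature of the softmax.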

An illustration of applying temperature to our example from above could look as follows.

The conditional next-word distribution of step $t=1$ becomes much sharper, leaving almost no chance for the word $(\text{"car"})$ to be selected.
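The effect of temperature can be sketched by dividing the logits by the temperature before the softmax. The logits below are assumed for illustration (they are the logs of made-up probabilities 0.5, 0.4, and 0.1), so the resulting numbers will not match the illustration exactly.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens the result."""
    scaled = [logit / temperature for logit in logits.values()]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return {word: e / total for word, e in zip(logits, exps)}


# Assumed toy logits for the step t=1 distribution.
logits = {"nice": math.log(0.5), "dog": math.log(0.4), "car": math.log(0.1)}

base = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.6)
near_greedy = softmax_with_temperature(logits, temperature=0.01)

for name, dist in [("T=1.0", base), ("T=0.6", sharp), ("T=0.01", near_greedy)]:
    print(name, {w: round(p, 3) for w, p in dist.items()})
```

As the temperature approaches 0, nearly all probability mass moves to the most likely word, so sampling behaves more and more like greedy decoding.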

Let's see how we can cool down the distribution in the library by setting temperature=0.6:

# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0,
    temperature=0.6,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't like to chew on it. I like to eat it and not chew on it. I like to be able to walk with my dog."

So how did you decide

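OK. There are fewer weird n-grams and the output is a bit more coherent now! While applying temperature can make a distribution less random, in its limit, when setting temperature $\to 0$, temperature-scaled sampling becomes equal to greedy decoding and will suffer from the same problems as before.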

Top-K Sampling

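Fan et al. (2018) introduced a simple but very powerful sampling scheme, called Top-K sampling. In Top-K sampling, only the K most likely next words are kept, and the probability mass is redistributed among those K words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.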

We extend the range of words used for both sampling steps in the example above from 3 words to 10 words to better illustrate Top-K sampling.

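Having set $K = 6$, in both sampling steps we limit our sampling pool to 6 words. While the 6 most likely words, defined as $V_{\text{top-K}}$, encompass only about two-thirds of the whole probability mass in the first step, they include almost all of the probability mass in the second step. Nevertheless, we see that it successfully eliminates the rather weird candidates $(\text{"not"}, \text{"the"}, \text{"small"}, \text{"told"})$ in the second sampling step.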
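The filtering step itself is short enough to sketch directly. The 10-word distribution below is an assumption made up for this example (its 6 most likely words hold 0.73 of the mass), and `top_k_filter` is a hypothetical helper, not a library function.

```python
def top_k_filter(next_word_probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    ranked = sorted(next_word_probs.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


# Assumed toy distribution for P(w | "The"); the numbers are illustrative only.
flat_dist = {
    "nice": 0.22, "dog": 0.16, "car": 0.10, "woman": 0.09, "guy": 0.08,
    "man": 0.08, "people": 0.07, "big": 0.07, "house": 0.07, "cat": 0.06,
}

filtered = top_k_filter(flat_dist, k=6)
print({w: round(p, 3) for w, p in filtered.items()})
```

After filtering, sampling proceeds as before, but only over the renormalized pool of K words.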

Let's see how Top-K can be used in the library by setting top_k=50:

# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# set top_k to 50
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog for the rest of the day, but this time it was hard for me to figure out what to do with it. (One reason I asked this for a few months back is that I had a

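Not bad at all! The text is arguably the most human-sounding text so far. One concern with Top-K sampling, though, is that it does not dynamically adapt the number of words that are filtered from the next-word probability distribution $P(w|w_{1:t-1})$. This can be problematic, as some words might be sampled from a very sharp distribution (the distribution on the right in the graph above), whereas others are sampled from a much flatter distribution (the distribution on the left in the graph above).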

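In step $t=1$, Top-K eliminates the possibility of sampling $(\text{"people"}, \text{"big"}, \text{"house"}, \text{"cat"})$, which seem like reasonable candidates. On the other hand, in step $t=2$ the method includes the arguably ill-fitting words $(\text{"down"}, \text{"a"})$ in the sample pool of words. Thus, limiting the sample pool to a fixed size K could endanger the model to produce gibberish for sharp distributions and limit the model's creativity for flat distributions. This intuition led Ari Holtzman et al. (2019) to create Top-p or nucleus sampling.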

Top-p (nucleus) sampling

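Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set of words (i.e. the number of words in the set) can dynamically increase and decrease according to the next word's probability distribution. OK, that was very wordy, so let's visualize it.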

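Having set $p=0.92$, Top-p sampling picks the minimum number of words that together exceed $p=92\%$ of the probability mass, defined as $V_{\text{top-p}}$. In the first example, this includes the 9 most likely words, whereas in the second example it only has to pick the top 3 words to exceed 92%. Quite simple, actually! It can be seen that it keeps a wide range of words where the next word is arguably less predictable, e.g. $P(w | \text{"The"})$, and only a few words when the next word seems more predictable, e.g. $P(w | \text{"The"}, \text{"car"})$.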
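The nucleus selection can be sketched as follows. Both distributions are assumptions chosen so that the nucleus sizes match the 9 and 3 words described above, and `top_p_filter` is a hypothetical helper rather than a library function.

```python
def top_p_filter(next_word_probs, p):
    """Keep the smallest set of most likely words whose cumulative mass reaches p."""
    ranked = sorted(next_word_probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    return {word: prob / total for word, prob in nucleus}


# Assumed toy distributions; the numbers are illustrative only.
flat_dist = {  # stands in for P(w | "The")
    "nice": 0.22, "dog": 0.16, "car": 0.10, "woman": 0.09, "guy": 0.08,
    "man": 0.08, "people": 0.07, "big": 0.07, "house": 0.07, "cat": 0.06,
}
sharp_dist = {  # stands in for P(w | "The", "car")
    "drives": 0.6, "is": 0.25, "turns": 0.1, "stops": 0.02, "down": 0.01,
    "a": 0.01, "not": 0.004, "the": 0.003, "small": 0.002, "told": 0.001,
}

flat_nucleus = top_p_filter(flat_dist, p=0.92)
sharp_nucleus = top_p_filter(sharp_dist, p=0.92)
print(len(flat_nucleus), len(sharp_nucleus))
```

The same threshold yields a large pool for the flat distribution and a small one for the sharp distribution, which is the dynamic behaviour that a fixed K cannot provide.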

Alright, time to check it out in transformers! We activate Top-p sampling by setting 0 < top_p < 1:

# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# deactivate top_k sampling and sample only from the 92% most likely words
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog for the rest of the day, but this had me staying in an unusual room and not going on nights out with friends (which will always be my yearning for such a spacious screen on my desk

Great, that sounds like it could have been written by a human. Well, maybe not quite yet.

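While in theory Top-p seems more elegant than Top-K, both methods work well in practice. Top-p can also be used in combination with Top-K, which can avoid very low-ranked words while allowing for some dynamic selection.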

Finally, to get multiple independently sampled outputs, we can again set the parameter num_return_sequences > 1:

# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog for the rest of the day, but this time it was hard for me to figure out what to do with it. When I finally looked at this for a few moments, I immediately thought, "
1: I enjoy walking with my cute dog. The only time I felt like walking was when I was working, so it was awesome for me. I didn't want to walk for days. I am really curious how she can walk with me
2: I enjoy walking with my cute dog (Chama-I-I-I-I-I), and I really enjoy running. I play in a little game I play with my brother in which I take pictures of our houses.

Cool, now you should have all the tools to let your model write your stories with transformers!

Conclusion

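As ad-hoc decoding methods, top-p and top-K sampling seem to produce more fluent text than traditional greedy and beam search on open-ended language generation. There is evidence that the apparent flaws of greedy and beam search, mainly generating repetitive word sequences, are caused by the model (especially the way the model is trained) rather than by the decoding method, cf. Welleck et al. (2019). Also, as demonstrated in Welleck et al. (2020), it looks as if top-K and top-p sampling also suffer from generating repetitive word sequences.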

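In Welleck et al. (2019), the authors show that, according to human evaluations, beam search can generate more fluent text than Top-p sampling when the model's training objective is adapted.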

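Open-ended language generation is a rapidly evolving field of research, and as is often the case there is no one-size-fits-all method here, so one has to see what works best in one's specific use case.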

Thankfully, you can try out all the different decoding methods in transformers 🤗; you can find an overview of the available methods here.

Thanks to everybody who has contributed to this blog post: Alexander Rush, Julien Chaumand, Thomas Wolf, Victor Sanh, Sam Shleifer, Clément Delangue, Yacine Jernite, Oliver Åstrand and John de Wasseige.

Appendix

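generate has evolved into a highly composable method, with flags to manipulate the resulting text in many directions that were not covered in this blog post. Here are a few helpful pages to guide you: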

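If you find that navigating our docs is challenging and you can't easily find what you're looking for, drop us a message in this GitHub issue. Your feedback is critical to set our future direction! 🤗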
