Generating Human-level Text with Contrastive Search in Transformers 🤗
1. Introduction:
Natural language generation (i.e., text generation) is one of the core tasks in natural language processing (NLP). In this blog post, we introduce the current state-of-the-art decoding method, Contrastive Search, for neural text generation. Contrastive search was originally proposed in the NeurIPS 2022 paper "A Contrastive Framework for Neural Text Generation" [1] ([paper] [official implementation]). Moreover, in the follow-up work "Contrastive Search Is What You Need For Neural Text Generation" [2] ([paper] [official implementation]), the authors further demonstrate that contrastive search can generate human-level text in 16 languages using off-the-shelf language models.
[Remark] For readers who are not familiar with text generation, please refer to this blog post for more details.
2. Hugging Face 🤗 demo of contrastive search:
Contrastive search is now available in 🤗 transformers, in both PyTorch and TensorFlow. You can interact with the examples shown in this blog post, using your framework of choice, in this Colab notebook (linked at the top). We have also built this awesome demo, which directly compares contrastive search with other popular decoding methods (e.g., beam search, top-k sampling [3], and nucleus sampling [4]).
3. Environment installation:
Before running the experiments in the following sections, please install the latest version of transformers as follows:
pip install torch
pip install "transformers==4.24.0"
4. Problems with existing decoding methods:
Decoding methods can be divided into two categories: (i) deterministic methods and (ii) stochastic methods. Let's take a look at both!
4.1. Deterministic methods:
Deterministic methods, such as greedy search and beam search, generate text by selecting the continuation with the highest likelihood as measured by the language model. However, as widely discussed in previous studies [3][4], deterministic methods often lead to the problem of model degeneration, i.e., the generated text is unnatural and contains undesirable repetitions.
Below, let's look at an example of text generated by greedy search using the GPT-2 model.
from transformers import AutoTokenizer, GPT2LMHeadModel
tokenizer = AutoTokenizer.from_pretrained('gpt2-large')
input_ids = tokenizer('DeepMind Company is', return_tensors='pt').input_ids
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
output = model.generate(input_ids, max_length=128)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')
Model output:
Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a leading AI research company, with a focus on deep learning and deep
learning-based systems.
The company's research is focused on the development of deep learning-based systems that
can learn from large amounts of data, and that can be used to solve real-world problems.
DeepMind's research is also used by the UK government to develop new technologies for the
UK's National Health Service.
DeepMind's research is also used by the UK government to develop new technologies for the
UK's National Health Service.
DeepMind's research is also used by the UK government to develop new technologies
----------------------------------------------------------------------------------------------------
[Remark] From the result generated by greedy search, we can see obvious patterns of repetition.
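Beam search, the other deterministic method mentioned above, suffers from the same degeneration problem on open-ended prefixes. As a minimal sketch (the beam width of 5 below is an arbitrary choice for illustration, not a value from the paper), it can be run with the same model, tokenizer, and input_ids as above:
# Beam search with 5 beams; like greedy search, it tends to repeat itself on
# open-ended prefixes such as this one.
output = model.generate(input_ids, max_length=128, num_beams=5)
print(tokenizer.decode(output[0], skip_special_tokens=True))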
4.2. Stochastic methods:
To address the issues posed by deterministic methods, stochastic methods generate text by introducing randomness into the decoding process. Two widely used stochastic methods are (i) top-k sampling [3] and (ii) nucleus sampling (also known as top-p sampling) [4].
Below, we show an example of text generated by nucleus sampling (p=0.95) using the GPT-2 model.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel
tokenizer = AutoTokenizer.from_pretrained('gpt2-large')
input_ids = tokenizer('DeepMind Company is', return_tensors='pt').input_ids
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
torch.manual_seed(0)
output = model.generate(input_ids, do_sample=True, max_length=128, top_p=0.95, top_k=0)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')
Model output:
Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a leading provider of AI-based research, development, and delivery of
AI solutions for security, infrastructure, machine learning, communications, and so on."
'AI is not journalism'
Worse still was the message its researchers hoped would reach the world's media — that it
was not really research, but rather a get-rich-quick scheme to profit from living forces'
ignorance.
"The thing is, we know that people don't consciously assess the value of the others'
information. They understand they will get the same on their own."
One example? Given the details of today
----------------------------------------------------------------------------------------------------
[Remark] While nucleus sampling can generate text free of repetitions, the semantic coherence of the generated text is not well maintained. For instance, the generated phrase "AI is not journalism" is incoherent with the given prefix, i.e., "DeepMind Company".
We note that this semantic inconsistency can be partially remedied by lowering the temperature. However, lowering the temperature brings nucleus sampling closer to greedy search, which can be seen as a trade-off between greedy search and nucleus sampling. Generally, it is challenging to find a prompt- and model-independent temperature that avoids both the pitfalls of greedy search and those of nucleus sampling.
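To illustrate this trade-off, the temperature can be passed directly to the same generate call, reusing the model, tokenizer, and input_ids defined above; the value 0.7 below is an arbitrary choice for illustration:
# Nucleus sampling with a lower temperature sharpens the next-token distribution,
# trading diversity for coherence and moving the behaviour closer to greedy search.
torch.manual_seed(0)
output = model.generate(input_ids, do_sample=True, max_length=128, top_p=0.95, top_k=0, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))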
5. Contrastive search:
In this section, we introduce the new decoding method, Contrastive Search, in detail.
5.1. Decoding objective:
Given the prefix text $x_{<t}$, the output token $x_t$ is selected as

$$x_t = \underset{v \in V^{(k)}}{\arg\max} \Big\{ (1-\alpha)\, p_\theta(v \mid x_{<t}) \;-\; \alpha \max_{1 \le j \le t-1} s\big(h_v, h_{x_j}\big) \Big\}$$

where $V^{(k)}$ is the set of the top-k predictions from the language model's probability distribution $p_\theta(\cdot \mid x_{<t})$. The first term, the model confidence, is the probability of the candidate $v$ predicted by the language model. The second term, the degeneration penalty, measures how discriminative $v$ is with respect to the previous context $x_{<t}$, where the function $s(\cdot,\cdot)$ computes the cosine similarity between token representations. More specifically, the degeneration penalty is defined as the maximum cosine similarity between the token representation of $v$, i.e., $h_v$, and the representations of all tokens in the context, $\{h_{x_1}, \dots, h_{x_{t-1}}\}$. Here, the candidate representation $h_v$ is computed by the language model given the concatenation of $x_{<t}$ and $v$. Intuitively, a larger degeneration penalty for $v$ means it is more similar to the context in the representation space and is therefore more likely to lead to model degeneration. The hyperparameter $\alpha$ balances the importance of these two components. When $\alpha = 0$, contrastive search degenerates to vanilla greedy search.
[Remark] When generating output, contrastive search jointly considers (i) the probability predicted by the language model, to maintain semantic coherence between the generated text and the prefix text, and (ii) the similarity with respect to the previous context, to avoid model degeneration.
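To make the objective above concrete, here is a minimal, illustrative single-step implementation of it in PyTorch. The helper contrastive_step below is hypothetical and written for clarity rather than speed; the built-in implementation in transformers (used in the next section) is the one to rely on in practice.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large').eval()

def contrastive_step(input_ids, k=4, alpha=0.6):
    # Selects the next token according to:
    #   argmax_{v in top-k} { (1 - alpha) * p(v | x_<t) - alpha * max_j s(h_v, h_{x_j}) }
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
        context_hidden = out.hidden_states[-1][0]            # [seq_len, dim], representations of x_<t
        probs = F.softmax(out.logits[0, -1], dim=-1)
        top_probs, top_ids = probs.topk(k)                    # V^(k) and the model confidence term

        # Compute h_v for every candidate by feeding the concatenation of x_<t and v.
        candidate_ids = torch.cat([input_ids.repeat(k, 1), top_ids.unsqueeze(1)], dim=-1)
        cand_hidden = model(candidate_ids, output_hidden_states=True).hidden_states[-1][:, -1, :]  # [k, dim]

        # Degeneration penalty: maximum cosine similarity to any context token.
        cos = F.cosine_similarity(cand_hidden.unsqueeze(1), context_hidden.unsqueeze(0), dim=-1)   # [k, seq_len]
        penalty = cos.max(dim=-1).values

        scores = (1 - alpha) * top_probs - alpha * penalty
        next_token = top_ids[scores.argmax()]
    return torch.cat([input_ids, next_token.view(1, 1)], dim=-1)
Calling contrastive_step repeatedly on a tokenized prefix reproduces the decoding rule above one token at a time.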
5.2. Generating text with contrastive search:
Below, we use the same prefix text as in Sections 4.1 and 4.2 (i.e., "DeepMind Company is") and generate text with contrastive search (k=4 and α=0.6). To fully demonstrate the superior capability of contrastive search, we let the language model generate a long document of 512 tokens, as follows:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
model_name = 'gpt2-large'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)
model.eval()
# prepare the prefix
prefix_text = r'DeepMind Company is'
input_ids = tokenizer(prefix_text, return_tensors='pt').input_ids
# generate the result with contrastive search
output = model.generate(input_ids, penalty_alpha=0.6, top_k=4, max_length=512)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')
The arguments are as follows:
--top_k: the hyperparameter k in contrastive search.
--penalty_alpha: the hyperparameter α in contrastive search.
Model output:
Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a leader in artificial intelligence (AI). We have a long history of working
with companies such as Google, Facebook, Amazon, and Microsoft to build products that improve
people's lives, and today we are excited to announce that DeepMind's AlphaGo program has won the
game of Go, becoming the first program to defeat a professional Go player.
The victory is a testament to the power of deep learning, and to the incredible work of our
research team, which has been at the forefront of AI research for the past five years. AlphaGo
is one of the most advanced Go programs ever created, and its performance is an important step
towards the goal of human-level AI.
"This is the culmination of a decade of hard work," said Andy Ng, co-founder and CTO of DeepMind.
"We are thrilled to have achieved this milestone and look forward to continuing to develop AI that
can be used in a wide range of applications and to help people live better lives."
DeepMind's work on Go began in 2010, when it began to train a neural network to play Go using
millions of games played by top Go players around the world. Since then, the team has refined the
algorithm, adding more and more layers of reinforcement learning to make it better at recognizing
patterns and making decisions based on those patterns. In the past year and a half, the team has
made significant progress in the game, winning a record-tying 13 games in a row to move into the
top four of the world rankings.
"The game of Go is a complex game in which players have to be very careful not to overextend their
territory, and this is something that we have been able to improve over and over again," said
Dr. Demis Hassabis, co-founder and Chief Scientific Officer of DeepMind. "We are very proud of our
team's work, and we hope that it will inspire others to take the next step in their research and
apply the same techniques to other problems."
In addition to the win in Go, DeepMind has also developed an AI system that can learn to play a
number of different games, including poker, Go, and chess. This AI system, called Tarsier, was
developed in partnership with Carnegie Mellon University and the University of California,
Berkeley, and is being used to teach computer vision and machine learning to identify objects in
images and recognize speech in natural language. Tarsier has been trained to play the game of Go
and other games on a
----------------------------------------------------------------------------------------------------
[Remark] We can see that the generated text is of exceptionally high quality. The entire document is grammatically fluent and semantically coherent. Meanwhile, the generated text also well preserves factual correctness. For instance, in the first paragraph, it elaborates "AlphaGo" as the "first program to defeat a professional Go player".
5.3. Visual demonstration of contrastive search:
To better understand how contrastive search works, we provide a visual comparison between greedy search (Section 4.1) and contrastive search. Specifically, we visualize the token similarity matrices of the text generated by greedy search and by contrastive search, respectively. The similarity between two tokens is defined as the cosine similarity between their token representations (i.e., the hidden states of the last transformer layer). The results of greedy search (top) and contrastive search (bottom) are shown in the figures below.
[Figure: token similarity matrix of the text generated by greedy search]
[Figure: token similarity matrix of the text generated by contrastive search]
[Remark] From the result of greedy search, we see high similarity scores off the diagonal, which clearly indicates the repetitions produced by greedy search. On the contrary, in the result of contrastive search, the high similarity scores mostly appear on the diagonal, which verifies that the degeneration problem is successfully addressed. This nice property of contrastive search is achieved by the introduction of the degeneration penalty (see Section 5.1) during the decoding process.
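For reference, the matrices above can be reproduced with a short sketch along the following lines; it assumes the model and the output tensor from Section 5.2 are still in scope, and token_similarity_matrix is a hypothetical helper, not part of transformers:
import torch
import torch.nn.functional as F

def token_similarity_matrix(model, output_ids):
    # Token representations are the last-layer hidden states of the generated sequence.
    with torch.no_grad():
        hidden = model(output_ids, output_hidden_states=True).hidden_states[-1][0]  # [seq_len, dim]
    hidden = F.normalize(hidden, dim=-1)
    return hidden @ hidden.T  # pairwise cosine similarities, [seq_len, seq_len]

# e.g., plot with matplotlib:
# import matplotlib.pyplot as plt
# plt.imshow(token_similarity_matrix(model, output)); plt.show()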
6. More generated examples:
In this section, we provide more generated examples to compare different decoding methods.
6.1. Example One - GPT-2:
In this part, we use GPT-2 to generate text with the prefix text taken from the original OpenAI blog post that announced the release of GPT-2:
"In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
Load the language model and prepare the prefix text:
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel
tokenizer = AutoTokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
prefix_text = r"In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
input_ids = tokenizer(prefix_text, return_tensors='pt').input_ids
6.1.1. Generating text with greedy search:
Code:
output = model.generate(input_ids, max_length=512)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')
Model output:
Output:
----------------------------------------------------------------------------------------------------
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously
unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact
that the unicorns spoke perfect English.
The researchers, led by Dr. David R. Williams of the University of California, Santa Cruz,
discovered the unicorns in the Andes Mountains of Peru. The area is known for its unique geology
and is home to a number of rare species of animals.
The researchers found the unicorns in the Andes Mountains of Peru.
"We were surprised to find that the unicorns were able to communicate with each other," Williams
said. "We were also surprised to find that they were able to communicate in English."
The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the
area around 2,000 years ago.
"The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of
the Andes," Williams said. "They were also the first people to use the Andes Mountains as a place
to hunt and gather food."
The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the
area around 2,000 years ago.
"The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of
the Andes," Williams said. "They were also the first people to use the Andes Mountains as a place
to hunt and gather food."
The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the
area around 2,000 years ago.
"The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of
the Andes," Williams said. "They were also the first people to use the Andes Mountains as a place
to hunt and gather food."
The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the
area around 2,000 years ago.
"The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of
the Andes," Williams said. "They were also the first people to use the Andes Mountains as a place
to hunt and gather food."
The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the
area around 2,000 years ago
----------------------------------------------------------------------------------------------------
6.1.2. Generating text with nucleus sampling:
Code:
torch.manual_seed(0)
output = model.generate(input_ids, do_sample=True, max_length=512, top_p=0.95, top_k=0)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')
Model output:
Output:
----------------------------------------------------------------------------------------------------
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously
unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact
that the unicorns spoke perfect English. The study was published in the Journal of Zoology in
March 2016.
Polygynous mammals such as unicorns have remained largely unknown to science. Professor Gustavo
Giacota, from the University of Oxford who led the study, said that they had been documented as
far as Eastern Siberia in Russia, but had only been seen a handful of times in the Gobi Desert.
Tiny animals with pale and shiny coats live in the presence of human beings and are hardly likely
to be victims of any cruelty. However, there is some evidence of the condition occurring in both
humans and animals in remote regions, which might have similarities to "black moles" that coexist
on the skin.
It is thought that Unicorns could be inside themselves, that they have different scents depending
on their current environment, or just fall out and there are plenty of legends of how they have
survived. Experts speculate that the moths and other animals could be remnants of the Yezidi Isis
and Charon, which literally is both the word which means great bird, and the Greek word for sound.
It is said that the Isis and Charon taught their young the use of voice in the form of calling out
to others.
The scientists think that it could be ancient folklore that has survived and is no longer attributed
to a real entity
----------------------------------------------------------------------------------------------------
6.1.3. Generating text with contrastive search:
Code:
output = model.generate(input_ids, max_length=512, penalty_alpha=0.6, top_k=4)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')
Model output:
Output:
----------------------------------------------------------------------------------------------------
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored
valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns
spoke perfect English.
According to the BBC, a team of scientists led by Dr David MacKay, from the University of Bristol, spent
two years searching for the unicorn herd, which they discovered during a survey of the area.
"It's a very rare find," MacKay told the BBC. "There are a few in the Himalayas, but this is the first
time we've been able to find one in such a remote area."
The team was surprised to find a herd of unicorns living in a region that has been known to be a hotbed
of poaching, with many of the animals poached for their horns, which are used in traditional Chinese
medicine to treat everything from rheumatism to cancer.
"We knew that the area was rich in rhino horn, but we had no idea how many there were, or what they were
doing there," MacKay said. "This is an area of high poaching pressure, and we wanted to find out what was
going on."
In order to do so, the team used GPS collars to track the animals as they moved around the mountain and
the surrounding area. The GPS data was then compared with information gathered from local villagers, who
had a wealth of information about the animals' movements, including where they were eating, what they were
doing at night, and how much time they spent in the mountains each day.
After analyzing the data, the team determined that the herd consisted of at least three species of unicorns,
including a male and two females. One of the females was the mother of the male, and the other two were her
daughters. All three had the same horn color, which is believed to be a sign of purity in the animal kingdom.
While the discovery is exciting, it's not the first time scientists have discovered an animal that speaks
English. Last year, scientists discovered a species of porcupine that can be heard by humans, and has been
dubbed "Porcupine Man" for his ability to converse with the human race.
----------------------------------------------------------------------------------------------------
6.2. Example Two - OPT:
In this part, we use the OPT model [5], recently released by Meta, and generate text by taking the first two sentences of the abstract of the famous ResNet paper [6] as the prefix:
"Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously."
Load the language model and prepare the prefix text:
import torch
from transformers import AutoTokenizer, OPTForCausalLM
model_name = r'facebook/opt-1.3b'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = OPTForCausalLM.from_pretrained(model_name)
prefix_text = r"Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously."
input_ids = tokenizer(prefix_text, return_tensors='pt').input_ids
6.2.1. Generating text with greedy search:
Code:
output = model.generate(input_ids, max_length=256)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')
Model output:
Output:
----------------------------------------------------------------------------------------------------
Deeper neural networks are more difficult to train. We present a residual learning framework to ease
the training of networks that are substantially deeper than those used previously. We show that the
residual learning framework can be used to train deep neural networks that are significantly more
difficult to train than those used previously. We also show that the residual learning framework can
be used to train deep neural networks that are significantly more difficult to train than those used
previously.
The paper presents a new residual learning framework for deep neural networks that is based on the
concept of residuals. The residuals are the residuals of the network that are not used in the training
process. The residuals are computed by taking the residuals of the network that are used in the training
process and subtracting the residuals of the network that are not used in the training process. The
residuals are then used to train the network. The residuals are computed by taking the residuals of
the network that are used in the training process and subtracting the residuals of the network that
are not used in the training process. The residuals are then used to train the network. The residuals
are computed by taking the residuals of the network that are used in the training process and
subtracting the residuals of the
----------------------------------------------------------------------------------------------------
6.2.2. Generating text with nucleus sampling:
Code:
torch.manual_seed(0)
output = model.generate(input_ids, do_sample=True, max_length=256, top_p=0.95, top_k=0)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')
Model output:
Output:
----------------------------------------------------------------------------------------------------
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the
training of networks that are substantially deeper than those used previously. The theory focuses on
several aspects of learning, including the dynamics of replicative and non-replicative aspects of learning.
This framework emphasizes learning by entropy. New randomized algorithms enable training networks with
residual learning, so that deep networks can be deployed as reliably and as efficiently as their more
conventional counterparts.
----------------------------------------------------------------------------------------------------
6.2.3. Generating text with contrastive search:
Code:
output = model.generate(input_ids, max_length=256, penalty_alpha=0.6, top_k=6)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')
Model output:
Output:
----------------------------------------------------------------------------------------------------
Deeper neural networks are more difficult to train. We present a residual learning framework to ease
the training of networks that are substantially deeper than those used previously.
In this paper, we propose a model-based residual learning (MBRL) framework that is based on neural
networks trained on data that is sparse in terms of dimensionality (e.g., 1, 2, 3, etc.). The network
parameters are chosen such that there is a high probability of convergence, i.e., the number of
iterations is large enough to minimize the variance of the residuals. This is achieved by training
the network on a set of training data, in which the data is sparse in terms of dimensionality, and
then discarding the nonparametric part of the data after training is complete.
We show that MBRL outperforms other methods for deep reinforcement learning (RL) and deep convolutional
neural networks (CNNs) by a factor of at least 2. In addition, we show that, compared to CNNs, MBRL
performs better in two-dimensional (2D) and three-dimensional (3D) cases.
----------------------------------------------------------------------------------------------------
7. Resources:
For more details of contrastive search, please check out our papers and code, as follows:
- A Contrastive Framework for Neural Text Generation: (1) paper and (2) official implementation.
- Contrastive Search Is What You Need For Neural Text Generation: (1) paper and (2) official implementation.
8. Citation:
@inproceedings{su2022a,
title={A Contrastive Framework for Neural Text Generation},
author={Yixuan Su and Tian Lan and Yan Wang and Dani Yogatama and Lingpeng Kong and Nigel Collier},
booktitle={Advances in Neural Information Processing Systems},
editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
year={2022},
url={https://openreview.net/forum?id=V88BafmH9Pj}
}
@article{su2022contrastiveiswhatyouneed,
title={Contrastive Search Is What You Need For Neural Text Generation},
author={Su, Yixuan and Collier, Nigel},
journal={arXiv preprint arXiv:2210.14140},
year={2022}
}
References:
[1] Su et al., 2022 "A Contrastive Framework for Neural Text Generation", NeurIPS 2022
[2] Su and Collier, 2022 "Contrastive Search Is What You Need For Neural Text Generation", Arxiv 2022
[3] Fan et al., 2018 "Hierarchical Neural Story Generation", ACL 2018
[4] Holtzman et al., 2020 "The Curious Case of Neural Text Degeneration", ICLR 2020
[5] Zhang et al., 2022 "OPT: Open Pre-trained Transformer Language Models", Arxiv 2022
[6] He et al., 2016 "Deep Residual Learning for Image Recognition", CVPR 2016
- Written by: Yixuan Su and Tian Lan
Acknowledgements:
We would like to thank Joao Gante (@joaogante), Patrick von Platen (@patrickvonplaten), and Sylvain Gugger (@sgugger) for their help and guidance in adding the contrastive search discussed in this blog post to the transformers library.