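Generation strategies

A decoding strategy determines how a model selects the next token to generate. There are many types of decoding strategies, and choosing the right one has a significant impact on the quality of the generated text.

This guide will help you understand the different decoding strategies available in Transformers, and how and when to use them.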

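Basic decoding methods

These are well-established decoding methods and should be your starting point for text generation tasks.

Greedy search

Greedy search is the default decoding strategy. At each step it selects the single most likely next token. Unless specified otherwise in GenerationConfig, this strategy generates a maximum of 20 new tokens.

Greedy search works well for tasks with relatively short outputs where creativity is not a priority. However, it breaks down when generating longer sequences because it begins to repeat itself.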

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# explicitly set to default length because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'Hugging Face is an open-source company that provides a suite of tools and services for building, deploying, and maintaining natural language processing'
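
The selection rule itself can be illustrated without a real model. The following is a minimal sketch, assuming a hypothetical lookup table (`TOY_MODEL`) in place of a language model; it is not how Transformers implements greedy search, only an illustration of "pick the most likely token at every step" and of the repetition problem mentioned above.

```python
# Hypothetical next-token probabilities keyed by the previous token (toy data, not a real model).
TOY_MODEL = {
    "the": {"cat": 0.5, "dog": 0.4, "<eos>": 0.1},
    "cat": {"sat": 0.6, "the": 0.3, "<eos>": 0.1},
    "sat": {"<eos>": 0.9, "the": 0.1},
    "dog": {"ran": 0.7, "<eos>": 0.3},
    "ran": {"<eos>": 1.0},
    "echo": {"echo": 0.9, "<eos>": 0.1},
}


def greedy_decode(first_token, max_new_tokens=20):
    """Append the single most likely next token until <eos> or the token budget is used up."""
    tokens = [first_token]
    for _ in range(max_new_tokens):
        distribution = TOY_MODEL[tokens[-1]]
        next_token = max(distribution, key=distribution.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


print(greedy_decode("the"))                      # stops when <eos> becomes the most likely token
print(greedy_decode("echo", max_new_tokens=3))   # gets stuck repeating the same token
```

The second call shows the failure mode: once the most likely continuation is the token that was just produced, greedy search has no way to escape the loop.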

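Sampling

Sampling, or multinomial sampling, randomly selects a token according to the probability distribution over the model's entire vocabulary (as opposed to always taking the most likely token, as greedy search does). This means every token with a non-zero probability has a chance of being selected. Sampling strategies reduce repetition and can generate more creative and diverse outputs.

Enable multinomial sampling with do_sample=True and num_beams=1.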

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# explicitly set to 50 because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, num_beams=1)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'Hugging Face is an open-source company 🤗\nWe are open-source and believe that open-source is the best way to build technology. Our mission is to make AI accessible to everyone, and we believe that open-source is the best way to achieve that.'
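
As a rough illustration of what "every token with a non-zero probability has a chance" means, the sketch below draws tokens from a hypothetical four-token distribution using only the standard library. The distribution and the helper name `sample_next_token` are assumptions made for this example.

```python
import random


def sample_next_token(distribution, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(distribution)
    weights = [distribution[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution; "fish" has zero probability.
distribution = {"cat": 0.5, "dog": 0.4, "bird": 0.1, "fish": 0.0}

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {token: 0 for token in distribution}
for _ in range(10_000):
    counts[sample_next_token(distribution, rng)] += 1

print(counts)
```

Unlike greedy search, which would return "cat" every time, low-probability tokens such as "bird" are picked occasionally, while a zero-probability token is never picked.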

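Beam search

Beam search keeps track of several generated sequences (beams) at each time step. After a certain number of steps, it selects the sequence with the highest overall probability. Unlike greedy search, this strategy can "look ahead" and pick a sequence with a higher overall probability even if its initial tokens have a lower probability. It is best suited for input-grounded tasks, such as describing an image or speech recognition. You can also use do_sample=True with beam search to sample at each step, but beam search will still greedily prune low-probability sequences between steps.

Check out the beam search visualizer to see how beam search works.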
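
The look-ahead effect can be reproduced on a toy problem. This is a minimal sketch over a hypothetical two-step probability table, not the Transformers implementation: with one beam the search behaves greedily and commits to the likelier first token, while two beams keep the alternative alive long enough to find the sequence with the higher overall probability.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token (toy data).
TOY_MODEL = {
    "<bos>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.9, "w": 0.1},
}


def beam_search(num_beams, num_steps):
    """Return (tokens, log_probability) of the best sequence after num_steps expansions."""
    beams = [(["<bos>"], 0.0)]
    for _ in range(num_steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams highest-scoring partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


greedy_tokens, greedy_score = beam_search(num_beams=1, num_steps=2)
beam_tokens, beam_score = beam_search(num_beams=2, num_steps=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

Here the single-beam run starts with "A" (probability 0.6) and ends with an overall probability of 0.30, whereas the two-beam run finds "B z" with an overall probability of 0.36.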

Enable beam search with the num_beams parameter (it should be greater than 1, otherwise it is equivalent to greedy search).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# explicitly set to 50 because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=2)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
"['Hugging Face is an open-source company that develops and maintains the Hugging Face platform, which is a collection of tools and libraries for building and deploying natural language processing (NLP) models. Hugging Face was founded in 2018 by Thomas Wolf']"

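Advanced decoding methods

Advanced decoding methods aim either to tackle specific generation quality issues (for example, repetition) or to improve generation throughput in certain situations. These techniques are more complex and may not work correctly with all models.

Speculative decoding

Speculative or assisted decoding isn't a search or sampling strategy. Instead, speculative decoding adds a second, smaller model that generates candidate tokens. The main model verifies the candidate tokens in a single forward pass, which speeds up the decoding process overall. This method is especially useful for LLMs, where generating tokens can be more costly and slower. Refer to the speculative decoding guide to learn more.

Currently, speculative decoding only supports greedy search and multinomial sampling. Batched inputs aren't supported either.

Enable speculative decoding with the assistant_model parameter. You'll notice the largest speedup when the assistant model is much smaller than the main model. Add do_sample=True to enable token validation with resampling.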
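
The propose-then-verify loop for the greedy case can be sketched with two stand-in functions. Everything here is hypothetical (`main_next`, `draft_next`, and the target sentence are invented for the example), and a real model verifies all candidate positions in one batched forward pass rather than by calling a function per position.

```python
TARGET_TEXT = "the quick brown fox jumps over the lazy dog".split()


def main_next(tokens):
    """Hypothetical 'large' model: always continues TARGET_TEXT, then emits <eos>."""
    return TARGET_TEXT[len(tokens)] if len(tokens) < len(TARGET_TEXT) else "<eos>"


def draft_next(tokens):
    """Hypothetical 'small' model: agrees with the main model except on one word."""
    token = main_next(tokens)
    return "cat" if token == "fox" else token


def speculative_greedy(prompt, num_candidates=3):
    tokens, main_passes = list(prompt), 0
    while tokens[-1] != "<eos>":
        # 1. The draft model proposes a short run of candidate tokens.
        candidates = []
        for _ in range(num_candidates):
            candidates.append(draft_next(tokens + candidates))
        # 2. The main model predicts the token at every candidate position
        #    (counted as one pass here; a real model does this in a single forward pass).
        main_passes += 1
        verified = [main_next(tokens + candidates[:i]) for i in range(num_candidates + 1)]
        # 3. Accept candidates up to the first disagreement, then append the main model's token.
        accepted = 0
        while accepted < num_candidates and candidates[accepted] == verified[accepted]:
            accepted += 1
        tokens += candidates[:accepted] + [verified[accepted]]
        if "<eos>" in tokens:
            tokens = tokens[: tokens.index("<eos>") + 1]
    return tokens, main_passes


tokens, main_passes = speculative_greedy(["the"])
print(" ".join(tokens), "| main model passes:", main_passes)
```

The output is identical to what the main model would produce on its own, but it takes 3 verification passes instead of the 9 single-token passes plain greedy decoding would need for this sentence.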

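Prompt lookup decoding

Prompt lookup decoding is a variant of speculative decoding that uses overlapping n-grams as the candidate tokens. It works well for input-grounded tasks such as summarization. Refer to the prompt lookup decoding guide to learn more.

Enable prompt lookup decoding with the prompt_lookup_num_tokens parameter.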

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", torch_dtype=torch.float16).to("cuda")
assistant_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M", torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=20, prompt_lookup_num_tokens=5)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'Hugging Face is an open-source company that provides a platform for developers to build and deploy machine learning models. It offers a variety of tools'
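
To show where the candidates come from, here is a simplified sketch of the n-gram lookup, assuming a fixed n-gram size and word-level tokens. The function name `find_prompt_candidates` and its parameters are hypothetical; the library's own candidate generator is more elaborate.

```python
def find_prompt_candidates(tokens, ngram_size=2, num_candidates=5):
    """Return the tokens that followed the most recent earlier occurrence of the trailing n-gram."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan from right to left, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + num_candidates]
    return []


tokens = "the cat sat on the mat because the cat".split()
print(find_prompt_candidates(tokens, ngram_size=2, num_candidates=3))
```

No assistant model is needed to produce these candidates: the text already seen acts as the draft, which is why the approach suits tasks whose output copies spans from the input.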

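Self-speculative decoding

Early exiting feeds earlier hidden states into the language modeling head, effectively skipping layers to yield a lower-quality output. The lower-quality output is used as the assistant output, and self-speculation is applied to fix it using the remaining layers. The final result of this self-speculative method is the same (or has the same distribution) as the original model's generation.

The assistant model is also part of the target model, so caches and weights can be shared, which lowers memory requirements.

For a model trained with early exit, pass assistant_early_exit to generate().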

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"
checkpoint = "facebook/layerskip-llama3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(checkpoint)
outputs = model.generate(**inputs, assistant_early_exit=4, do_sample=False, max_new_tokens=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

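Universal assisted decoding

Universal assisted decoding (UAD) allows the main and assistant models to use different tokenizers. The main model's input tokens are re-encoded into assistant model tokens. Candidate tokens are generated in the assistant encoding and then re-encoded into main model candidate tokens. The candidate tokens are verified as explained in speculative decoding.

Re-encoding involves decoding token ids into text and encoding that text with a different tokenizer. To prevent tokenization discrepancies during re-encoding, UAD finds the longest common sub-sequence between the source and target encodings to ensure the new tokens include the correct prompt suffix.

Add the tokenizer and assistant_tokenizer parameters to generate() to enable UAD.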

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"

assistant_tokenizer = AutoTokenizer.from_pretrained("double7/vicuna-68m")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b")
assistant_model = AutoModelForCausalLM.from_pretrained("double7/vicuna-68m")
outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
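
The re-encoding step (decode to text, then encode with the other tokenizer) can be pictured with two toy tokenizers. This is a sketch under strong assumptions: `ToyTokenizer` and `re_encode` are hypothetical helpers, the vocabularies grow on the fly, and the longest-common-subsequence alignment that UAD uses to handle tokenization discrepancies is left out.

```python
import re


class ToyTokenizer:
    """Hypothetical tokenizer: splits text into pieces and assigns ids on first sight."""

    def __init__(self, split):
        self.split = split  # function mapping text to a list of string pieces
        self.vocab = {}
        self.pieces = []

    def encode(self, text):
        ids = []
        for piece in self.split(text):
            if piece not in self.vocab:
                self.vocab[piece] = len(self.pieces)
                self.pieces.append(piece)
            ids.append(self.vocab[piece])
        return ids

    def decode(self, ids):
        return "".join(self.pieces[i] for i in ids)


def re_encode(ids, source, target):
    """Decode ids with the source tokenizer, then encode the text with the target tokenizer."""
    return target.encode(source.decode(ids))


main_tokenizer = ToyTokenizer(lambda text: re.findall(r"\s*\S+", text))  # word-level pieces
assistant_tokenizer = ToyTokenizer(list)                                  # character-level pieces

main_ids = main_tokenizer.encode("Alice and Bob")
assistant_ids = re_encode(main_ids, main_tokenizer, assistant_tokenizer)

# Pretend the assistant model proposed the text " are" in its own encoding.
candidate_assistant_ids = assistant_tokenizer.encode(" are")
candidate_main_ids = re_encode(candidate_assistant_ids, assistant_tokenizer, main_tokenizer)
print(main_tokenizer.decode(main_ids + candidate_main_ids))
```

The same prompt occupies 3 ids in the word-level encoding and 13 in the character-level one, which is why candidates have to be converted back before the main model can verify them.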

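Contrastive search

Contrastive search is a decoding strategy that aims to reduce repetition even when generating longer sequences. This strategy compares how similar a candidate token is to the previous tokens, and the more similar it is, the larger the penalty applied.

Enable contrastive search with the penalty_alpha and top_k parameters. penalty_alpha controls the strength of the penalty, and top_k is the number of most likely tokens considered as candidates.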

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# explicitly set to 100 because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=100, penalty_alpha=0.6, top_k=4)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'Hugging Face is an open-source company that provides a platform for building and deploying AI models.\nHugging Face is an open-source company that provides a platform for building and deploying AI models. The platform allows developers to build and deploy AI models, as well as collaborate with other developers.\nHugging Face was founded in 2019 by Thibault Wittemberg and Clément Delangue. The company is based in Paris, France.\nHugging Face has'
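
A minimal sketch of the trade-off follows, assuming each candidate is scored as (1 - penalty_alpha) times its probability minus penalty_alpha times its maximum cosine similarity to the hidden states of previous tokens. The two-dimensional "hidden states" and the helper names are invented for the example.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_pick(candidates, previous_states, penalty_alpha):
    """candidates: (token, probability, hidden_state) tuples for the top_k most likely tokens."""

    def score(candidate):
        _, prob, state = candidate
        # Penalty term: how close this candidate is to the most similar previous token.
        degeneration = max(cosine_similarity(state, prev) for prev in previous_states)
        return (1 - penalty_alpha) * prob - penalty_alpha * degeneration

    return max(candidates, key=score)[0]


# Hypothetical hidden states of the tokens generated so far.
previous_states = [[1.0, 0.0], [0.8, 0.6]]
candidates = [
    ("again", 0.6, [1.0, 0.0]),  # most likely, but identical to an earlier state
    ("fresh", 0.4, [0.0, 1.0]),  # less likely, but dissimilar to the context
]

print(contrastive_pick(candidates, previous_states, penalty_alpha=0.0))
print(contrastive_pick(candidates, previous_states, penalty_alpha=0.6))
```

With penalty_alpha=0.0 the choice reduces to greedy search and returns "again"; with penalty_alpha=0.6 the similarity penalty outweighs the probability gap and "fresh" is selected.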

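DoLa

Decoding by Contrasting Layers (DoLa) is a contrastive decoding strategy for improving factuality and reducing hallucination. It works by contrasting the logit differences between the final layer and earlier layers. As a result, factual knowledge localized to particular layers is amplified. DoLa is not recommended for smaller models like GPT-2.

Enable DoLa with the following parameters.

dola_layers are the candidate layers to be contrasted with the final layer. It can be a string (low or high) to contrast against the lower or higher part of the model's layers. high is recommended for short-answer tasks like TruthfulQA. low is recommended for long-answer reasoning tasks like GSM8K, StrategyQA, FACTOR, and VicunaQA.

When a model has tied word embeddings, layer 0 is skipped and the range begins from layer 2.

It can also be a list of integers that represent layer indices between 0 and the total number of layers. Layer 0 is the word embedding, layer 1 is the first transformer layer, and so on. Refer to the table below for the range of layer indices, which depends on the number of model layers.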

Layers	Low	High
> 40	(0, 20, 2)	(N - 20, N, 2)
<= 40	range(0, N // 2, 2)	range(N // 2, N, 2)

Here N is the total number of model layers.

repetition_penalty reduces repetition; a value of 1.2 is recommended.
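
The layer table can be turned into a small helper. This is a sketch of one reading of the table, assuming the triples for models with more than 40 layers are range(start, stop, step) arguments and that tied word embeddings shift the start of the low range from 0 to 2; the function name `dola_candidate_layers` is hypothetical and not part of Transformers.

```python
def dola_candidate_layers(num_layers, dola_layers, tied_word_embeddings=False):
    """Return candidate layer indices for dola_layers='low' or 'high', following the table."""
    # Assumption: with tied word embeddings, layer 0 is skipped and the range starts at layer 2.
    start = 2 if tied_word_embeddings else 0
    if num_layers > 40:
        low = range(start, 20, 2)
        high = range(num_layers - 20, num_layers, 2)
    else:
        low = range(start, num_layers // 2, 2)
        high = range(num_layers // 2, num_layers, 2)
    if dola_layers == "low":
        return list(low)
    if dola_layers == "high":
        return list(high)
    raise ValueError(f"dola_layers must be 'low' or 'high', but is {dola_layers!r}")


print(dola_candidate_layers(32, "low"))
print(dola_candidate_layers(32, "high"))
print(dola_candidate_layers(48, "high"))
```

For a 32-layer model this yields the even layers 0 to 14 for low and 16 to 30 for high, so both settings contrast against the same number of candidate layers.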

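Diverse beam search

Diverse beam search is a variant of beam search that produces more diverse output candidates to choose from. This strategy measures the dissimilarity of sequences and applies a penalty if sequences are too similar. To avoid high computation costs, the beams are divided into groups.

Enable diverse beam search with the num_beams, num_beam_groups and diversity_penalty parameters (num_beams should be divisible by num_beam_groups).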

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# explicitly set to 50 because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=6, num_beam_groups=3, diversity_penalty=1.0, do_sample=False)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'Hugging Face is an open-source company 🤗\nWe are an open-source company. Our mission is to democratize AI and make it accessible to everyone. We believe that AI should be used for the benefit of humanity, not for the benefit of a'
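
One way to picture the penalty is a Hamming-style rule: at a given step, a token's score is lowered once for every earlier group that already chose it. The sketch below is a simplification with one beam per group and invented log-probabilities; `apply_diversity_penalty` is a hypothetical helper, not a Transformers function.

```python
def apply_diversity_penalty(scores, tokens_chosen_by_earlier_groups, diversity_penalty):
    """Subtract diversity_penalty from a token's score once per earlier group that chose it."""
    penalized = dict(scores)
    for token in tokens_chosen_by_earlier_groups:
        if token in penalized:
            penalized[token] -= diversity_penalty
    return penalized


def pick_per_group(scores, num_groups, diversity_penalty):
    """Let each group pick its best token in turn, seeing what earlier groups picked."""
    picks = []
    for _ in range(num_groups):
        adjusted = apply_diversity_penalty(scores, picks, diversity_penalty)
        picks.append(max(adjusted, key=adjusted.get))
    return picks


# Hypothetical log-probabilities for the next token at one decoding step.
step_scores = {"company": -0.2, "platform": -0.9, "library": -1.1}

print(pick_per_group(step_scores, num_groups=3, diversity_penalty=0.0))
print(pick_per_group(step_scores, num_groups=3, diversity_penalty=1.0))
```

Without a penalty all three groups choose "company"; with diversity_penalty=1.0 each group is pushed toward a different token.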

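Custom decoding methods

Custom decoding methods enable specialized generation behavior, such as the following:

have the model continue thinking if it is uncertain;
roll back generation if the model gets stuck;
handle special tokens with custom logic;
use enhanced input preparation for advanced models.

Custom decoding methods are enabled through model repositories, assuming a specific model tag and file structure (see the subsections below). This feature is an extension of custom modeling code and, like it, requires setting trust_remote_code=True.

If a model repository holds a custom decoding method, the easiest way to try it out is to load the model and generate with it: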

from transformers import AutoModelForCausalLM, AutoTokenizer

# `transformers-community/custom_generate_example` holds a copy of `Qwen/Qwen2.5-0.5B-Instruct`, but
# with custom generation code -> calling `generate` uses the custom decoding method!
tokenizer = AutoTokenizer.from_pretrained("transformers-community/custom_generate_example")
model = AutoModelForCausalLM.from_pretrained(
    "transformers-community/custom_generate_example", device_map="auto", trust_remote_code=True
)

inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
# The custom decoding method is a minimal greedy decoding implementation. It also prints a custom message at run time.
gen_out = model.generate(**inputs)
# you should now see its custom message, "✨ using a custom generation method ✨"
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))
'The quick brown fox jumps over a lazy dog, and the dog is a type of animal. Is'

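Model repositories with custom decoding methods have a special property: their decoding method can be loaded from **any** model through the custom_generate argument of generate(). This means anyone can create and share a custom generation method that can potentially work with any Transformers model, without requiring users to install additional Python packages.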

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")

inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
# `custom_generate` replaces the original `generate` by the custom decoding method defined in
# `transformers-community/custom_generate_example`
gen_out = model.generate(**inputs, custom_generate="transformers-community/custom_generate_example", trust_remote_code=True)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
'The quick brown fox jumps over a lazy dog, and the dog is a type of animal. Is'

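You should read the README.md file of the repository containing the custom generation strategy to see what the new arguments and output type differences are, if there are any. Otherwise, you can assume it works like the base generate() method.

You can find all custom decoding methods by searching for their custom tag, custom_generate.

Consider the Hub repository transformers-community/custom_generate_example as an example. Its README.md states that it has an additional input argument, left_padding, which adds a number of padding tokens before the prompt.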

gen_out = model.generate(
    **inputs, custom_generate="transformers-community/custom_generate_example", trust_remote_code=True, left_padding=5
)
print(tokenizer.batch_decode(gen_out)[0])
'<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>The quick brown fox jumps over the lazy dog.\n\nThe sentence "The quick'

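If the custom method has pinned Python requirements that your environment doesn't meet, you'll get an exception about the missing requirements. For instance, transformers-community/custom_generate_bad_requirements defines an impossible set of requirements in its custom_generate/requirements.txt file, and you'll see the error message below if you try to run it.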

ImportError: Missing requirements in your local environment for `transformers-community/custom_generate_bad_requirements`:
foo (installed: None)
bar==0.0.0 (installed: None)
torch>=99.0 (installed: 2.6.0)

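Updating your Python requirements accordingly will remove this error message.

Creating a custom decoding method

To create a new decoding method, you need to create a new **Model** repository and push a few files into it.

The model you designed your decoding method with.
custom_generate/generate.py, which contains all the logic for your custom decoding method.
custom_generate/requirements.txt, used to optionally add new Python requirements and/or pin specific versions needed to use your method correctly.
README.md, where you should add the custom_generate tag and document any new arguments or output type differences of your custom method.

After you've added all required files, your repository should look like this: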

your_repo/
├── README.md          # include the 'custom_generate' tag
├── config.json
├── ...
└── custom_generate/
    ├── generate.py
    └── requirements.txt

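Adding the base model

The starting point for your custom decoding method is a model repository just like any other. The model to add to this repository should be the model you designed your method with, and it is meant to be part of a working, self-contained model-generate pair. When the model in this repository is loaded, your custom decoding method will override generate. Don't worry: your decoding method can still be loaded with any other Transformers model, as explained above.

If you simply want to copy an existing model, you can do the following: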

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("source/model_repo")
model = AutoModelForCausalLM.from_pretrained("source/model_repo")
tokenizer.save_pretrained("your/decoding_method", push_to_hub=True)
model.save_pretrained("your/decoding_method", push_to_hub=True)

generate.py

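This is the core of your decoding method. It **must** contain a method named generate, and this method **must** take a model argument as its first argument. model is the model instance, which means you have access to all attributes and methods of the model, including the ones defined in GenerationMixin (like the base generate method).

generate.py must be placed in a folder named custom_generate, not at the root level of the repository. The file paths for this feature are hardcoded.

Under the hood, when the base generate() method is called with a custom_generate argument, it first checks its Python requirements (if any), then locates the custom generate method in generate.py, and finally calls the custom generate. All received arguments and model are forwarded to your custom generate method, with the exception of the arguments used to trigger the custom generation (trust_remote_code and custom_generate).

This means your generate can mix original and custom arguments (and return a different output type), as shown below.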

import torch

def generate(model, input_ids, generation_config=None, left_padding=None, **kwargs):
    generation_config = generation_config or model.generation_config  # default to the model generation config
    cur_length = input_ids.shape[1]
    max_length = generation_config.max_length or cur_length + generation_config.max_new_tokens

    # Example of custom argument: add `left_padding` (integer) pad tokens before the prompt
    if left_padding is not None:
        if not isinstance(left_padding, int) or left_padding < 0:
            raise ValueError(f"left_padding must be a non-negative integer, but is {left_padding}")

        pad_token = kwargs.pop("pad_token", None) or generation_config.pad_token_id or model.config.pad_token_id
        if pad_token is None:
            raise ValueError("pad_token is not defined")
        batch_size = input_ids.shape[0]
        pad_tensor = torch.full(size=(batch_size, left_padding), fill_value=pad_token).to(input_ids.device)
        input_ids = torch.cat((pad_tensor, input_ids), dim=1)
        cur_length = input_ids.shape[1]

    # Simple greedy decoding loop
    while cur_length < max_length:
        logits = model(input_ids).logits
        next_token_logits = logits[:, -1, :]
        next_tokens = torch.argmax(next_token_logits, dim=-1)
        input_ids = torch.cat((input_ids, next_tokens[:, None]), dim=-1)
        cur_length += 1

    return input_ids
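
The argument forwarding described earlier in this section can be pictured with a stand-in dispatcher. This is only a sketch: `base_generate` and `my_generate` are hypothetical functions, the custom method is passed as a callable rather than a repository name, the requirements check is omitted, and raising an error when trust_remote_code is missing is an assumption made for the example.

```python
def base_generate(model, custom_generate=None, trust_remote_code=None, **kwargs):
    """Hypothetical stand-in for the dispatch step, not the Transformers implementation."""
    if custom_generate is None:
        return {"decoding": "default", "arguments": kwargs}
    if not trust_remote_code:
        raise ValueError("custom decoding methods require trust_remote_code=True")
    # `custom_generate` and `trust_remote_code` are consumed here; everything else is forwarded.
    return custom_generate(model, **kwargs)


def my_generate(model, input_ids, left_padding=None, **kwargs):
    """Hypothetical custom method: reports what it received instead of decoding."""
    return {"model": model, "input_ids": input_ids, "left_padding": left_padding, "extra": kwargs}


result = base_generate(
    "toy-model",
    custom_generate=my_generate,
    trust_remote_code=True,
    input_ids=[1, 2],
    left_padding=5,
    max_new_tokens=3,
)
print(result)
```

The custom method sees its own argument (left_padding) alongside the standard ones, but never the two trigger arguments.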

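Follow the recommended practices below to ensure your custom decoding method works as expected.

Feel free to reuse the validation and input preparation logic from the original generate().
Pin the transformers version in the requirements if you use any private method or attribute of model.
Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.

Your custom generate method can use relative imports for code in the custom_generate folder. For example, if you have a utils.py file, you can import it like this: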

from .utils import some_function

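Only relative imports from the same-level custom_generate folder are supported. Imports from parent or sibling folders are not valid. The custom_generate argument also works locally with any directory that contains a custom_generate structure. This is the recommended workflow for developing your custom decoding method.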

requirements.txt

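You can optionally specify additional Python requirements in a requirements.txt file inside the custom_generate folder. These are checked at runtime, and an exception is raised if any are missing, prompting users to update their environment accordingly.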

README.md

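The root-level README.md in a model repository usually describes the model it contains. However, since the focus of this repository is the custom decoding method, we highly recommend shifting its focus to describing the custom decoding method. In addition to a description of the method, we recommend documenting any input and/or output differences from the original generate(). This way, users can focus on what's new and rely on the Transformers docs for generic implementation details.

For discoverability, we highly recommend adding the custom_generate tag to your repository. To do so, the top of your README.md file should look like the example below. After you push the file, you should see the tag in your repository!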

---
library_name: transformers
tags:
  - custom_generate
---

(your markdown content here)

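Recommended practices

Document input and output differences from generate().
Add self-contained examples to enable quick experimentation.
Describe soft requirements, such as whether the method only works well with a certain family of models.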

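Resources

Read the How to generate text: using different decoding methods for language generation with Transformers blog post for an explanation of how common decoding strategies work.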