Transformers

( sequences: LongTensor scores: typing.Optional[tuple[torch.FloatTensor]] = None logits: typing.Optional[tuple[torch.FloatTensor]] = None attentions: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None hidden_states: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[tuple[tuple[tuple[torch.FloatTensor]]]] = None )

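Parameters

sequences (torch.LongTensor of shape (batch_size, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
scores (tuple(torch.FloatTensor), *optional*, returned when output_scores=True) — Processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size, config.vocab_size).
logits (tuple(torch.FloatTensor), *optional*, returned when output_logits=True) — Unprocessed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size, config.vocab_size).
attentions (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size, num_heads, generated_length, sequence_length).
hidden_states (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size, generated_length, hidden_size).
past_key_values (tuple(tuple(tuple(torch.FloatTensor))), *optional*, returned when use_cache=True) — The model cache, used to speed up decoding. The cache format differs between models, so check the model's documentation. Usually a Cache instance.

Outputs of decoder-only generation models when using non-beam methods.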
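The relationship between `sequences` and the per-step `scores` tuple can be illustrated with a minimal pure-Python sketch. It assumes plain nested lists in place of tensors and that `sequences` still contains the prompt tokens, which is typical for decoder-only models; the helper names `log_softmax` and `token_log_probs` are hypothetical and not part of the library.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def token_log_probs(scores, sequences, prompt_length):
    """Log-probability of each generated token, per batch item.

    scores: list of steps, each step a (batch_size x vocab_size) nested list
    sequences: (batch_size x sequence_length) nested list of token ids
    prompt_length: number of prompt tokens preceding the generated ones
    """
    result = []
    for b, seq in enumerate(sequences):
        generated = seq[prompt_length:]
        result.append(
            [log_softmax(scores[step][b])[token] for step, token in enumerate(generated)]
        )
    return result


# Toy data: batch of 1, vocabulary of 3, prompt of 2 tokens, 2 generated tokens.
toy_scores = [[[2.0, 0.0, 0.0]], [[0.0, 0.0, 3.0]]]
toy_sequences = [[1, 1, 0, 2]]
log_probs = token_log_probs(toy_scores, toy_sequences, prompt_length=2)
```

Each entry of `scores` corresponds to one generation step, so step `i` is paired with the `i`-th token after the prompt.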

class transformers.generation.GenerateEncoderDecoderOutput

( sequences: LongTensor scores: typing.Optional[tuple[torch.FloatTensor]] = None logits: typing.Optional[tuple[torch.FloatTensor]] = None encoder_attentions: typing.Optional[tuple[torch.FloatTensor]] = None encoder_hidden_states: typing.Optional[tuple[torch.FloatTensor]] = None decoder_attentions: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None cross_attentions: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None decoder_hidden_states: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[tuple[tuple[tuple[torch.FloatTensor]]]] = None )

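Parameters

sequences (torch.LongTensor of shape (batch_size*num_return_sequences, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
scores (tuple(torch.FloatTensor), *optional*, returned when output_scores=True) — Processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size, config.vocab_size).
logits (tuple(torch.FloatTensor), *optional*, returned when output_logits=True) — Unprocessed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size, config.vocab_size).
encoder_attentions (tuple(torch.FloatTensor), *optional*, returned when output_attentions=True) — Tuple of torch.FloatTensor (one for each layer of the decoder) of shape (batch_size, num_heads, sequence_length, sequence_length).
encoder_hidden_states (tuple(torch.FloatTensor), *optional*, returned when output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
decoder_attentions (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size, num_heads, generated_length, sequence_length).
cross_attentions (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size, num_heads, generated_length, sequence_length).
decoder_hidden_states (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size, generated_length, hidden_size).
past_key_values (tuple(tuple(tuple(torch.FloatTensor))), *optional*, returned when use_cache=True is passed or when config.use_cache=True) — The model cache, used to speed up decoding. The cache format differs between models, so check the model's documentation. Usually a Cache instance.

Outputs of encoder-decoder generation models when using non-beam methods.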

class transformers.generation.GenerateBeamDecoderOnlyOutput

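Parameters

sequences (torch.LongTensor of shape (batch_size*num_return_sequences, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
sequences_scores (torch.FloatTensor of shape (batch_size*num_return_sequences), *optional*, returned when output_scores=True) — Final beam scores of the generated sequences.
scores (tuple(torch.FloatTensor), *optional*, returned when output_scores=True) — Beam transition scores for each vocabulary token at each generation step. Beam transition scores consist of the log probabilities of tokens, conditioned on the log softmax of the previously generated tokens in this beam. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_beams, config.vocab_size).
logits (tuple(torch.FloatTensor), *optional*, returned when output_logits=True) — Unprocessed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_beams, config.vocab_size).
beam_indices (torch.LongTensor, *optional*, returned when output_scores=True) — Beam indices of the generated token IDs at each generation step. torch.LongTensor of shape (batch_size*num_return_sequences, sequence_length).
attentions (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size*num_beams, num_heads, generated_length, sequence_length).
hidden_states (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size*num_beams*num_return_sequences, generated_length, hidden_size).
past_key_values (tuple(tuple(tuple(torch.FloatTensor))), *optional*, returned when use_cache=True) — The model cache, used to speed up decoding. The cache format differs between models, so check the model's documentation. Usually a Cache instance.

Outputs of decoder-only generation models when using beam methods.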

class transformers.generation.GenerateBeamEncoderDecoderOutput

( sequences: LongTensor sequences_scores: typing.Optional[torch.FloatTensor] = None scores: typing.Optional[tuple[torch.FloatTensor]] = None logits: typing.Optional[tuple[torch.FloatTensor]] = None beam_indices: typing.Optional[torch.LongTensor] = None encoder_attentions: typing.Optional[tuple[torch.FloatTensor]] = None encoder_hidden_states: typing.Optional[tuple[torch.FloatTensor]] = None decoder_attentions: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None cross_attentions: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None decoder_hidden_states: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[tuple[tuple[tuple[torch.FloatTensor]]]] = None )

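Parameters

sequences (torch.LongTensor of shape (batch_size*num_return_sequences, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
sequences_scores (torch.FloatTensor of shape (batch_size*num_return_sequences), *optional*, returned when output_scores=True) — Final beam scores of the generated sequences.
scores (tuple(torch.FloatTensor), *optional*, returned when output_scores=True) — Beam transition scores for each vocabulary token at each generation step. Beam transition scores consist of the log probabilities of tokens, conditioned on the log softmax of the previously generated tokens in this beam. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_beams, config.vocab_size).
logits (tuple(torch.FloatTensor), *optional*, returned when output_logits=True) — Unprocessed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_beams, config.vocab_size).
beam_indices (torch.LongTensor, *optional*, returned when output_scores=True) — Beam indices of the generated token IDs at each generation step. torch.LongTensor of shape (batch_size*num_return_sequences, sequence_length).
encoder_attentions (tuple(torch.FloatTensor), *optional*, returned when output_attentions=True) — Tuple of torch.FloatTensor (one for each layer of the decoder) of shape (batch_size, num_heads, sequence_length, sequence_length).
encoder_hidden_states (tuple(torch.FloatTensor), *optional*, returned when output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size*num_beams*num_return_sequences, sequence_length, hidden_size).
decoder_attentions (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size*num_beams*num_return_sequences, num_heads, generated_length, sequence_length).
cross_attentions (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size, num_heads, generated_length, sequence_length).
decoder_hidden_states (tuple(tuple(torch.FloatTensor)), *optional*, returned when output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of torch.FloatTensor of shape (batch_size*num_beams*num_return_sequences, generated_length, hidden_size).
past_key_values (tuple(tuple(tuple(torch.FloatTensor))), *optional*, returned when use_cache=True) — The model cache, used to speed up decoding. The cache format differs between models, so check the model's documentation. Usually a Cache instance.

Outputs of encoder-decoder generation models when using beam methods.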

TensorFlow

class transformers.generation.TFGreedySearchEncoderDecoderOutput

( sequences: typing.Optional[tensorflow.python.framework.tensor.Tensor] = None scores: typing.Optional[tuple[tensorflow.python.framework.tensor.Tensor]] = None encoder_attentions: typing.Optional[tuple[tensorflow.python.framework.tensor.Tensor]] = None encoder_hidden_states: typing.Optional[tuple[tensorflow.python.framework.tensor.Tensor]] = None decoder_attentions: typing.Optional[tuple[tuple[tensorflow.python.framework.tensor.Tensor]]] = None cross_attentions: typing.Optional[tuple[tuple[tensorflow.python.framework.tensor.Tensor]]] = None decoder_hidden_states: typing.Optional[tuple[tuple[tensorflow.python.framework.tensor.Tensor]]] = None )

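Parameters

sequences (tf.Tensor of shape (batch_size, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of tf.Tensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size, config.vocab_size).
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer of the decoder) of shape (batch_size, num_heads, sequence_length, sequence_length).
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
decoder_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, num_heads, generated_length, sequence_length).
cross_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, num_heads, generated_length, sequence_length).
decoder_hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, generated_length, hidden_size).

Base class for outputs of encoder-decoder generation models using greedy search. The hidden states and attention weights of the encoder can be accessed via the encoder_attentions and encoder_hidden_states attributes, and those of the decoder via the decoder_attentions and decoder_hidden_states attributes.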

class transformers.generation.TFGreedySearchDecoderOnlyOutput

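Parameters

sequences (tf.Tensor of shape (batch_size, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of tf.Tensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size, config.vocab_size).
attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, num_heads, generated_length, sequence_length).
hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, generated_length, hidden_size).

Base class for outputs of decoder-only generation models using greedy search.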

class transformers.generation.TFSampleEncoderDecoderOutput

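Parameters

sequences (tf.Tensor of shape (batch_size*num_return_sequences, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of tf.Tensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_return_sequences, config.vocab_size).
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer of the decoder) of shape (batch_size*num_return_sequences, num_heads, sequence_length, sequence_length).
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size*num_return_sequences, sequence_length, hidden_size).
decoder_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_return_sequences, num_heads, generated_length, sequence_length).
cross_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, num_heads, generated_length, sequence_length).
decoder_hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_return_sequences, generated_length, hidden_size).

Base class for outputs of encoder-decoder generation models using sampling. The hidden states and attention weights of the encoder can be accessed via the encoder_attentions and encoder_hidden_states attributes, and those of the decoder via the decoder_attentions and decoder_hidden_states attributes.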

class transformers.generation.TFSampleDecoderOnlyOutput

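Parameters

sequences (tf.Tensor of shape (batch_size*num_return_sequences, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of tf.Tensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_return_sequences, config.vocab_size).
attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (num_return_sequences*batch_size, num_heads, generated_length, sequence_length).
hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (num_return_sequences*batch_size, generated_length, hidden_size).

Base class for outputs of decoder-only generation models using sampling.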

class transformers.generation.TFBeamSearchEncoderDecoderOutput

( sequences: typing.Optional[tensorflow.python.framework.tensor.Tensor] = None sequences_scores: typing.Optional[tensorflow.python.framework.tensor.Tensor] = None scores: typing.Optional[tuple[tensorflow.python.framework.tensor.Tensor]] = None beam_indices: typing.Optional[tensorflow.python.framework.tensor.Tensor] = None encoder_attentions: typing.Optional[tuple[tensorflow.python.framework.tensor.Tensor]] = None encoder_hidden_states: typing.Optional[tuple[tensorflow.python.framework.tensor.Tensor]] = None decoder_attentions: typing.Optional[tuple[tuple[tensorflow.python.framework.tensor.Tensor]]] = None cross_attentions: typing.Optional[tuple[tuple[tensorflow.python.framework.tensor.Tensor]]] = None decoder_hidden_states: typing.Optional[tuple[tuple[tensorflow.python.framework.tensor.Tensor]]] = None )

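Parameters

sequences (tf.Tensor of shape (batch_size*num_return_sequences, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
sequences_scores (tf.Tensor of shape (batch_size*num_return_sequences), optional, returned when output_scores=True is passed or when config.output_scores=True) — Final beam scores of the generated sequences.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed beam scores for each vocabulary token at each generation step. Beam scores consist of the log softmax score of each vocabulary token plus the sum of the log softmax of the previously generated tokens in this beam. Tuple of tf.Tensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_beams, config.vocab_size).
beam_indices (tf.Tensor, optional, returned when output_scores=True is passed or when config.output_scores=True) — Beam indices of the generated token IDs at each generation step. tf.Tensor of shape (batch_size*num_return_sequences, sequence_length).
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer of the decoder) of shape (batch_size, num_heads, sequence_length, sequence_length).
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size*num_beams*num_return_sequences, sequence_length, hidden_size).
decoder_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_beams*num_return_sequences, num_heads, generated_length, sequence_length).
cross_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, num_heads, generated_length, sequence_length).
decoder_hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_beams*num_return_sequences, generated_length, hidden_size).

Base class for outputs of encoder-decoder generation models using beam search. The hidden states and attention weights of the encoder can be accessed via the encoder_attentions and encoder_hidden_states attributes, and those of the decoder via the decoder_attentions and decoder_hidden_states attributes.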

class transformers.generation.TFBeamSearchDecoderOnlyOutput

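Parameters

sequences (tf.Tensor of shape (batch_size*num_return_sequences, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
sequences_scores (tf.Tensor of shape (batch_size*num_return_sequences), optional, returned when output_scores=True is passed or when config.output_scores=True) — Final beam scores of the generated sequences.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed beam scores for each vocabulary token at each generation step. Beam scores consist of the log softmax score of each vocabulary token plus the sum of the log softmax of the previously generated tokens in this beam. Tuple of `tf.Tensor` with up to `max_new_tokens` elements (one element for each generated token), with each tensor of shape `(batch_size*num_beams*num_return_sequences, config.vocab_size)`.
beam_indices (tf.Tensor, optional, returned when output_scores=True is passed or when config.output_scores=True) — Beam indices of the generated token IDs at each generation step. tf.Tensor of shape (batch_size*num_return_sequences, sequence_length).
attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_beams, num_heads, generated_length, sequence_length).
hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_beams*num_return_sequences, generated_length, hidden_size).

Base class for outputs of decoder-only generation models using beam search.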

class transformers.generation.TFBeamSampleEncoderDecoderOutput

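Parameters

sequences (tf.Tensor of shape (batch_size*num_beams, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
sequences_scores (tf.Tensor of shape (batch_size*num_return_sequences), optional, returned when output_scores=True is passed or when config.output_scores=True) — Final beam scores of the generated sequences.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed beam scores for each vocabulary token at each generation step. Beam scores consist of the log softmax score of each vocabulary token plus the sum of the log softmax of the previously generated tokens in this beam. Tuple of `tf.Tensor` with up to `max_new_tokens` elements (one element for each generated token), with each tensor of shape `(batch_size*num_beams, config.vocab_size)`.
beam_indices (tf.Tensor, optional, returned when output_scores=True is passed or when config.output_scores=True) — Beam indices of the generated token IDs at each generation step. tf.Tensor of shape (batch_size*num_return_sequences, sequence_length).
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of `tf.Tensor` (one for each layer of the decoder) of shape (batch_size, num_heads, sequence_length, sequence_length).
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape (batch_size*num_beams, sequence_length, hidden_size).
decoder_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_beams, num_heads, generated_length, sequence_length).
cross_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, num_heads, generated_length, sequence_length).
decoder_hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_beams, generated_length, hidden_size).

Base class for outputs of encoder-decoder generation models using beam sampling. The hidden states and attention weights of the encoder can be accessed via the encoder_attentions and encoder_hidden_states attributes, and those of the decoder via the decoder_attentions and decoder_hidden_states attributes.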

class transformers.generation.TFBeamSampleDecoderOnlyOutput

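Parameters

sequences (tf.Tensor of shape (batch_size*num_return_sequences, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
sequences_scores (tf.Tensor of shape (batch_size*num_return_sequences), optional, returned when output_scores=True is passed or when config.output_scores=True) — Final beam scores of the generated sequences.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed beam scores for each vocabulary token at each generation step. Beam scores consist of the log softmax score of each vocabulary token plus the sum of the log softmax of the previously generated tokens in this beam. Tuple of `tf.Tensor` with up to `max_new_tokens` elements (one element for each generated token), with each tensor of shape `(batch_size*num_beams*num_return_sequences, config.vocab_size)`.
beam_indices (tf.Tensor, optional, returned when output_scores=True is passed or when config.output_scores=True) — Beam indices of the generated token IDs at each generation step. tf.Tensor of shape (batch_size*num_return_sequences, sequence_length).
attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_beams, num_heads, generated_length, sequence_length).
hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size*num_beams, generated_length, hidden_size).

Base class for outputs of decoder-only generation models using beam sampling.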

class transformers.generation.TFContrastiveSearchEncoderDecoderOutput

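Parameters

sequences (tf.Tensor of shape (batch_size, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of `tf.Tensor` with up to `max_new_tokens` elements (one element for each generated token), with each tensor of shape `(batch_size, config.vocab_size)`.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of `tf.Tensor` (one for each layer of the decoder) of shape (batch_size, num_heads, sequence_length, sequence_length).
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
decoder_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, num_heads, generated_length, sequence_length).
cross_attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, num_heads, generated_length, sequence_length).
decoder_hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, generated_length, hidden_size).

Base class for outputs of encoder-decoder generation models using contrastive search. The hidden states and attention weights of the encoder can be accessed via the encoder_attentions and encoder_hidden_states attributes, and those of the decoder via the decoder_attentions and decoder_hidden_states attributes.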

class transformers.generation.TFContrastiveSearchDecoderOnlyOutput

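Parameters

sequences (tf.Tensor of shape (batch_size, sequence_length)) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
scores (tuple(tf.Tensor), optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of `tf.Tensor` with up to `max_new_tokens` elements (one element for each generated token), with each tensor of shape `(batch_size, config.vocab_size)`.
attentions (tuple(tuple(tf.Tensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, num_heads, generated_length, sequence_length).
hidden_states (tuple(tuple(tf.Tensor)), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of tf.Tensor of shape (batch_size, generated_length, hidden_size).

Base class for outputs of decoder-only generation models using contrastive search.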

FLAX

class transformers.generation.FlaxSampleOutput

( sequences: typing.Optional[jax.Array] = None )

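Parameters

sequences (jnp.ndarray of shape (batch_size, max_length)) — The generated sequences.

Flax base class for outputs of decoder-only generation models using sampling.

replace

( **updates )

Returns a new object, replacing the specified fields with new values.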
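The semantics of `replace` (return a new object rather than mutate in place) mirror the standard library's `dataclasses.replace`. The sketch below uses a hypothetical `ToyOutput` class as a stand-in, since it only illustrates the pattern and does not use the Flax output classes themselves.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyOutput:
    """Hypothetical stand-in for an immutable output container."""

    sequences: Optional[Tuple[int, ...]] = None


original = ToyOutput(sequences=(1, 2, 3))
# Build a new object with the `sequences` field swapped out; `original` is untouched.
updated = replace(original, sequences=(4, 5, 6))
```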

class transformers.generation.FlaxGreedySearchOutput

( sequences: typing.Optional[jax.Array] = None )

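Parameters

sequences (jnp.ndarray of shape (batch_size, max_length)) — The generated sequences.

Flax base class for outputs of decoder-only generation models using greedy search.

replace

( **updates )

Returns a new object, replacing the specified fields with new values.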

class transformers.generation.FlaxBeamSearchOutput

( sequences: typing.Optional[jax.Array] = None scores: typing.Optional[jax.Array] = None )

Parameters

sequences (jnp.ndarray of shape (batch_size, max_length)) — The generated sequences.
scores (jnp.ndarray of shape (batch_size,)) — The scores (log probabilities) of the generated sequences.

Flax base class for outputs of decoder-only generation models using beam search.

replace

( **updates )

Returns a new object, replacing the specified fields with new values.

LogitsProcessor

A LogitsProcessor can be used to modify the prediction scores of a language model head during generation.

PyTorch

class transformers.AlternatingCodebooksLogitsProcessor

( input_start_len: int semantic_vocab_size: int codebook_size: int )

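Parameters

input_start_len (int) — The length of the initial input sequence.
semantic_vocab_size (int) — Vocabulary size of the semantic part, i.e. the number of tokens associated with the semantic vocabulary.
codebook_size (int) — Number of tokens associated with the codebook.

LogitsProcessor enforcing alternated generation between the two codebooks of Bark.

This logits processor is exclusively compatible with Bark's fine submodel. See the model documentation for examples.

__call__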

( input_ids: LongTensor scores: FloatTensor )

class transformers.ClassifierFreeGuidanceLogitsProcessor

( guidance_scale )

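Parameters

guidance_scale (float) — The guidance scale for classifier-free guidance (CFG). CFG is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer quality.

LogitsProcessor for classifier-free guidance (CFG). The scores are split over the batch dimension, where the first half corresponds to the conditional logits (predicted from the input prompt) and the second half corresponds to the unconditional logits (predicted from an empty or "null" prompt). The processor computes a weighted average across the conditional and unconditional logits, parameterised by `guidance_scale`.

See the paper for more information.

This logits processor is exclusively compatible with MusicGen.

Example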

>>> from transformers import AutoProcessor, MusicgenForConditionalGeneration

>>> processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
>>> model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

>>> inputs = processor(
...     text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
...     padding=True,
...     return_tensors="pt",
... )
>>> audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)

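__call__

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.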
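The batch split and the guidance-weighted combination described for this processor can be sketched in plain Python. This is only an illustration under the assumption of the common CFG formulation `uncond + guidance_scale * (cond - uncond)`; the function name `classifier_free_guidance` is hypothetical and nested lists stand in for tensors.

```python
def classifier_free_guidance(scores, guidance_scale):
    """Combine conditional and unconditional logits stacked along the batch axis.

    scores: nested list of shape (2 * n, vocab_size); the first n rows are
    conditional logits, the last n rows are unconditional logits.
    """
    half = len(scores) // 2
    cond, uncond = scores[:half], scores[half:]
    return [
        [u + guidance_scale * (c - u) for c, u in zip(cond_row, uncond_row)]
        for cond_row, uncond_row in zip(cond, uncond)
    ]


# One conditional row followed by one unconditional row, vocabulary of 2.
toy_scores = [[1.0, 2.0], [0.0, 1.0]]
guided = classifier_free_guidance(toy_scores, guidance_scale=3.0)
```

With `guidance_scale=1.0` the result equals the conditional logits, which is consistent with CFG only taking effect for values above 1.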

class transformers.EncoderNoRepeatNGramLogitsProcessor

( encoder_ngram_size: int encoder_input_ids: LongTensor )

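Parameters

encoder_ngram_size (int) — All n-grams of this size can only occur within the encoder input IDs, i.e. they cannot be repeated in the generated IDs.
encoder_input_ids (torch.LongTensor) — The encoder_input_ids that should not be repeated within the decoder IDs.

LogitsProcessor that works similarly to NoRepeatNGramLogitsProcessor, but is applied exclusively to prevent the repetition of n-grams present in the prompt.

It was designed to promote chattiness in a language model, by preventing the generation of n-grams present in previous conversation rounds.

Example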

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")

>>> inputs = tokenizer("Alice: I love cats. What do you love?\nBob:", return_tensors="pt")

>>> # With greedy decoding, we see Bob repeating Alice's opinion. If Bob was a chatbot, it would be a poor one.
>>> outputs = model.generate(**inputs)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
Alice: I love cats. What do you love?
Bob: I love cats. What do you

>>> # With this logits processor, we can prevent Bob from repeating Alice's opinion.
>>> outputs = model.generate(**inputs, encoder_no_repeat_ngram_size=2)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
Alice: I love cats. What do you love?
Bob: My cats are very cute.

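__call__

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.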
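The idea of banning tokens that would complete an n-gram already present in the prompt can be sketched as follows. This is a simplified pure-Python illustration for a single sequence, not the library's implementation; the helper name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(encoder_ids, generated_ids, ngram_size):
    """Tokens that would complete an n-gram already present in encoder_ids."""
    if ngram_size < 1 or len(generated_ids) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) generated tokens form the prefix to match.
    prefix = tuple(generated_ids[len(generated_ids) - (ngram_size - 1):])
    banned = set()
    for i in range(len(encoder_ids) - ngram_size + 1):
        ngram = tuple(encoder_ids[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


toy_encoder_ids = [5, 6, 7, 5, 6, 8]
# The generated text currently ends in (5, 6); the prompt contains (5, 6, 7) and (5, 6, 8).
banned = banned_next_tokens(toy_encoder_ids, [1, 5, 6], ngram_size=3)
```

A processor built on this idea would then set the scores of the banned token IDs to negative infinity before the next token is chosen.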

class transformers.EncoderRepetitionPenaltyLogitsProcessor

( penalty: float encoder_input_ids: LongTensor )

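Parameters

penalty (float) — The parameter for repetition penalty. 1.0 means no penalty. Above 1.0 rewards prompt tokens. Between 0.0 and 1.0 penalizes prompt tokens.
encoder_input_ids (torch.LongTensor) — The encoder_input_ids that should be repeated within the decoder IDs.

LogitsProcessor that works similarly to RepetitionPenaltyLogitsProcessor, but with an *inverse* penalty that is applied to the tokens present in the prompt. In other words, a penalty above 1.0 increases the odds of selecting tokens that were present in the prompt.

It was designed to avoid hallucination in input-grounded tasks, like summarization. Although originally intended for encoder-decoder models, it can also be used with decoder-only models such as large language models (LLMs).

Example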

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
>>> model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

>>> inputs = tokenizer(["Alice and Bob. The third member's name was"], return_tensors="pt")
>>> gen_out = model.generate(**inputs)
>>> print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
Alice and Bob. The third member's name was not mentioned.

>>> # With the `encoder_repetition_penalty` argument we can trigger this logits processor in `generate`, which can
>>> # promote the use of prompt tokens ("Bob" in this example)
>>> gen_out = model.generate(**inputs, encoder_repetition_penalty=1.2)
>>> print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
Alice and Bob. The third member's name was Bob. The third member's name was Bob.

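__call__

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.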
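The inverse penalty can be illustrated on a single row of scores. The sketch below assumes the usual sign-aware convention for repetition penalties (positive scores are multiplied, negative scores are divided, so a penalty above 1.0 always raises the score of a prompt token); the function name is hypothetical and a plain list stands in for a tensor.

```python
def apply_encoder_repetition_penalty(scores, encoder_ids, penalty):
    """Boost (penalty > 1.0) or dampen (penalty < 1.0) tokens seen in the prompt."""
    adjusted = list(scores)
    for token in set(encoder_ids):
        s = adjusted[token]
        # Multiplying a positive score or dividing a negative one both move it upward
        # when penalty > 1.0.
        adjusted[token] = s * penalty if s > 0 else s / penalty
    return adjusted


toy_scores = [2.0, -2.0, 1.0]
# Tokens 0 and 1 appear in the prompt, token 2 does not.
boosted = apply_encoder_repetition_penalty(toy_scores, encoder_ids=[0, 1, 0], penalty=2.0)
```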

class transformers.EpsilonLogitsWarper

( epsilon: float filter_value: float = -inf min_tokens_to_keep: int = 1 )

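Parameters

epsilon (float) — If set to > 0, only the tokens with probabilities `epsilon` or higher are kept for generation.
filter_value (float, *optional*, defaults to -inf) — All filtered values will be set to this float value.
min_tokens_to_keep (int, *optional*, defaults to 1) — Minimum number of tokens that cannot be filtered.

LogitsProcessor that performs epsilon-sampling, i.e. restricting to tokens with `prob >= epsilon`. Takes the min_tokens_to_keep most probable tokens if no tokens satisfy this constraint. See Truncation Sampling as Language Model Desmoothing for more information.

Example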

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

>>> set_seed(1)
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

>>> inputs = tokenizer("A sequence: 1, 2", return_tensors="pt")

>>> # With sampling, the output is unexpected -- sometimes too unexpected.
>>> outputs = model.generate(**inputs, do_sample=True)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: 1, 2, 3 | < 4 (left-hand pointer) ;
<BLANKLINE>
<BLANKLINE>

>>> # With epsilon sampling, the output gets restricted to high-probability tokens. Note that this is similar to
>>> # Top P sampling, which restricts tokens based on their cumulative probability.
>>> # Pro tip: The paper recommends using `epsilon_cutoff` values between 3e-4 and 9e-4
>>> outputs = model.generate(**inputs, do_sample=True, epsilon_cutoff=0.1)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: 1, 2, 3, 4, 5, 6, 7, 8, 9

call

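( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.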

class transformers.EtaLogitsWarper

( epsilon: float filter_value: float = -inf min_tokens_to_keep: int = 1 device: str = 'cpu' )

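Parameters

epsilon (float) — A float value in the range (0, 1). Hyperparameter used to calculate the dynamic cutoff value, `eta`. The suggested values from the paper range from 3e-4 to 4e-3 depending on the size of the model.
filter_value (float, *optional*, defaults to -inf) — All values that are found to be below the dynamic cutoff value, `eta`, are set to this float value. This parameter is useful when logits need to be modified for very low probability tokens that should be excluded from generation entirely.
min_tokens_to_keep (int, *optional*, defaults to 1) — Specifies the minimum number of tokens that must be kept for generation, regardless of their probabilities. For example, if `min_tokens_to_keep` is set to 1, at least one token will always be kept for generation, even if all tokens have probabilities below the cutoff `eta`.
device (str, *optional*, defaults to "cpu") — The device to allocate the tensors.

LogitsProcessor that performs eta-sampling, a technique to filter out tokens with probabilities below a dynamic cutoff value, `eta`, which is calculated based on a combination of the hyperparameter `epsilon` and the entropy of the token probabilities, i.e. `eta := min(epsilon, sqrt(epsilon * e^-entropy(probabilities)))`. Takes the largest min_tokens_to_keep tokens if no tokens satisfy this constraint. It addresses the issue of poor quality in long samples of text generated by neural language models, leading to more coherent and fluent text. See Truncation Sampling as Language Model Desmoothing for more information. Note: `do_sample` must be set to `True` for this `LogitsProcessor` to work.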
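
To see how the cutoff reacts to entropy, here is a small plain-Python sketch that evaluates the formula exactly as it is quoted in the description. It is an illustration only: `eta_cutoff` is a hypothetical helper, the `epsilon=0.5` value is chosen to make the effect visible rather than to be practical, and the library's tensor implementation may differ in detail.

```python
import math


def eta_cutoff(probs, epsilon):
    """Dynamic cutoff following the formula quoted in the description (hypothetical helper)."""
    # Shannon entropy of the distribution, in nats; zero-probability entries contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return min(epsilon, math.sqrt(epsilon * math.exp(-entropy)))


# A peaked (low-entropy) distribution keeps the cutoff at `epsilon`.
peaked = eta_cutoff([0.97, 0.01, 0.01, 0.01], epsilon=0.5)
# A flat (high-entropy) distribution lowers the cutoff, so more tokens survive.
flat = eta_cutoff([0.25, 0.25, 0.25, 0.25], epsilon=0.5)
```

Tokens whose probability falls below the returned cutoff would then be set to `filter_value`, subject to `min_tokens_to_keep`.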

Example

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

>>> set_seed(1)
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

>>> inputs = tokenizer("A sequence: 1, 2", return_tensors="pt")

>>> # With sampling, the output is unexpected -- sometimes too unexpected.
>>> outputs = model.generate(**inputs, do_sample=True)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: 1, 2, 3 | < 4 (left-hand pointer) ;
<BLANKLINE>
<BLANKLINE>

>>> # With eta sampling, the output gets restricted to high-probability tokens. You can see it as a dynamic form of
>>> # epsilon sampling that adapts its cutoff probability based on the entropy (high entropy = lower cutoff).
>>> # Pro tip: The paper recommends using `eta_cutoff` values between 3e-4 and 4e-3
>>> outputs = model.generate(**inputs, do_sample=True, eta_cutoff=0.1)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: 1, 2, 3, 4, 5, 6, 7, 8, 9

call

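( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.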

class transformers.ExponentialDecayLengthPenalty

( exponential_decay_length_penalty: tuple eos_token_id: typing.Union[int, list[int], torch.Tensor] input_ids_seq_length: int )

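Parameters

exponential_decay_length_penalty (tuple(int, float)) — This tuple shall consist of `(start_index, decay_factor)`, where `start_index` indicates where the penalty starts and `decay_factor` represents the factor of exponential decay.
eos_token_id (Union[int, list[int], torch.Tensor]) — The id(s) of the *end-of-sequence* token.
input_ids_seq_length (int) — The length of the input sequence.

LogitsProcessor that exponentially increases the score of the `eos_token_id` after `start_index` has been reached. This allows generating shorter sequences without a hard cutoff, so that the `eos_token` can be predicted in a meaningful position.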
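
One plausible reading of this exponential boost can be sketched in plain Python. This is an assumption-laden illustration, not the library's code: `boost_eos` is a hypothetical helper, `start_index` is compared against the total sequence length for simplicity, and the boost is scaled by the absolute EOS score so that it stays positive for negative logits.

```python
def boost_eos(scores, eos_token_id, cur_len, start_index, decay_factor):
    """Raise the EOS score once `cur_len` passes `start_index` (hypothetical helper)."""
    steps_past_start = max(cur_len - start_index, 0)
    boosted = list(scores)
    # The multiplier grows as decay_factor ** steps, and is zero exactly at start_index.
    growth = decay_factor ** steps_past_start - 1
    boosted[eos_token_id] += abs(scores[eos_token_id]) * growth
    return boosted


# Two steps past the start index with decay_factor=1.5: growth is 1.5**2 - 1 = 1.25.
at_start = boost_eos([1.0, -2.0], eos_token_id=1, cur_len=5, start_index=5, decay_factor=1.5)
later = boost_eos([1.0, -2.0], eos_token_id=1, cur_len=7, start_index=5, decay_factor=1.5)
```

Because the growth is exponential in the number of steps, a large `decay_factor` makes EOS dominate almost immediately, which matches the abrupt endings described in the example.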

Example

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

>>> text = "Just wanted to let you know, I"
>>> inputs = tokenizer(text, return_tensors="pt")

>>> # Let's consider that we want short sentences, so we limit `max_length=30`. However, we observe that the answer
>>> # tends to end abruptly.
>>> set_seed(1)
>>> outputs = model.generate(**inputs, do_sample=True, temperature=0.9, max_length=30, pad_token_id=50256)
>>> print(tokenizer.batch_decode(outputs)[0])
Just wanted to let you know, I received a link to an ebook, the book How To Start A Social Network which was
published in 2010. Although

>>> # To promote the appearance of the EOS token at the right time, we add the `exponential_decay_length_penalty =
>>> # (start_index, decay_factor)`. Instead of cutting at max_tokens, the output comes to an end before and usually
>>> # with more meaning. What happens is that starting from `start_index` the EOS token score will be increased
>>> # by `decay_factor` exponentially. However, if you set a high decay factor, you may also end up with abruptly
>>> # ending sequences.
>>> set_seed(1)
>>> outputs = model.generate(
...     **inputs,
...     do_sample=True,
...     temperature=0.9,
...     max_length=30,
...     pad_token_id=50256,
...     exponential_decay_length_penalty=(15, 1.6),
... )
>>> print(tokenizer.batch_decode(outputs)[0])
Just wanted to let you know, I received a link to an ebook, the book How To Start A Social Network
which<|endoftext|>

>>> # With a small decay factor, you will have a higher chance of getting a meaningful sequence.
>>> set_seed(1)
>>> outputs = model.generate(
...     **inputs,
...     do_sample=True,
...     temperature=0.9,
...     max_length=30,
...     pad_token_id=50256,
...     exponential_decay_length_penalty=(15, 1.01),
... )
>>> print(tokenizer.batch_decode(outputs)[0])
Just wanted to let you know, I received a link to an ebook, the book How To Start A Social Network which was
published in 2010.<|endoftext|>

call

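( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.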

class transformers.ForcedBOSTokenLogitsProcessor

( bos_token_id: int )

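Parameters

bos_token_id (int) — The id of the token to force as the first generated token.

LogitsProcessor that enforces the specified token as the first generated token. Used with encoder-decoder models.

Example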

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
>>> tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

>>> inputs = tokenizer("Translate from English to German: I love cats.", return_tensors="pt")

>>> # By default, it continues generating according to the model's logits
>>> outputs = model.generate(**inputs, max_new_tokens=10)
>>> print(tokenizer.batch_decode(outputs)[0])
<pad> Ich liebe Kitty.</s>

>>> # We can use `forced_bos_token_id` to force the start of generation with an encoder-decoder model
>>> # (including forcing it to end straight away with an EOS token)
>>> outputs = model.generate(**inputs, max_new_tokens=10, forced_bos_token_id=tokenizer.eos_token_id)
>>> print(tokenizer.batch_decode(outputs)[0])
<pad></s>

call

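( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.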

class transformers.ForcedEOSTokenLogitsProcessor

( max_length: int eos_token_id: typing.Union[int, list[int], torch.Tensor] device: str = 'cpu' )

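Parameters

max_length (int) — The maximum length of the sequence to be generated.
eos_token_id (Union[int, list[int], torch.Tensor]) — The id(s) of the *end-of-sequence* token.
device (str, *optional*, defaults to "cpu") — The device to allocate the tensors.

LogitsProcessor that enforces the specified token as the last generated token when `max_length` is reached.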
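
The forcing step can be pictured as follows. This plain-Python sketch works on one row of scores and assumes the forcing happens on the step that produces the final token (`cur_len == max_length - 1`); `force_eos_at_limit` is a hypothetical name, not a library function.

```python
def force_eos_at_limit(scores, cur_len, max_length, eos_token_id):
    """Leave scores untouched until the last step, then allow only EOS (hypothetical helper)."""
    if cur_len != max_length - 1:
        return list(scores)
    # Every token becomes impossible except EOS, which gets a neutral score of 0.
    forced = [float("-inf")] * len(scores)
    forced[eos_token_id] = 0.0
    return forced


early = force_eos_at_limit([0.3, 1.2, -0.5], cur_len=5, max_length=10, eos_token_id=2)
last = force_eos_at_limit([0.3, 1.2, -0.5], cur_len=9, max_length=10, eos_token_id=2)
```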

Example

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

>>> inputs = tokenizer("A sequence: 1, 2, 3", return_tensors="pt")

>>> # By default, it continues generating according to the model's logits
>>> outputs = model.generate(**inputs, max_new_tokens=10)
>>> print(tokenizer.batch_decode(outputs)[0])
A sequence: 1, 2, 3, 4, 5, 6, 7, 8

>>> # `forced_eos_token_id` ensures the generation ends with an EOS token
>>> outputs = model.generate(**inputs, max_new_tokens=10, forced_eos_token_id=tokenizer.eos_token_id)
>>> print(tokenizer.batch_decode(outputs)[0])
A sequence: 1, 2, 3, 4, 5, 6, 7,<|endoftext|>

call

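( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.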

class transformers.HammingDiversityLogitsProcessor

( diversity_penalty: float num_beams: int num_beam_groups: int )

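Parameters

diversity_penalty (float) — This value is subtracted from a beam's score if it generates a token that is the same as a token from any beam of another group at a particular time step. A higher diversity_penalty enforces greater diversity among the beams. Adjusting this value can help strike a balance between diversity and natural likelihood.
num_beams (int) — Number of beams for beam search. 1 means no beam search.
num_beam_groups (int) — Number of groups to divide num_beams into in order to ensure diversity among different groups of beams. See this paper for more details.

LogitsProcessor that enforces diverse beam search.

Note that this logits processor is only effective for PreTrainedModel.group_beam_search. See Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models for more details.

Traditional beam search often generates very similar sequences across different beams. HammingDiversityLogitsProcessor addresses this by penalizing beams that generate tokens already chosen by other beams in the same time step.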
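
The penalty itself amounts to a count-and-subtract step. The sketch below is a simplified, framework-free illustration for the scores of a single beam; `hamming_penalize` is a hypothetical helper, and the assumption that the penalty scales with how many earlier-group beams picked a token is stated here rather than taken from the text.

```python
from collections import Counter


def hamming_penalize(group_scores, tokens_from_earlier_groups, diversity_penalty):
    """Subtract the penalty once per earlier-group beam that chose a token (hypothetical helper)."""
    usage = Counter(tokens_from_earlier_groups)
    return [
        score - diversity_penalty * usage[token_id]
        for token_id, score in enumerate(group_scores)
    ]


# Token 0 was chosen twice and token 2 once by beams in earlier groups.
penalized = hamming_penalize([0.0, -1.0, -2.0], [0, 0, 2], diversity_penalty=0.5)
```

Token 1, which no earlier group selected, keeps its score and therefore becomes relatively more attractive for the current group.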

Example

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> import torch

>>> # Initialize the model and tokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base")

>>> # A long text about the solar system
>>> text = (
...     "The Solar System is a gravitationally bound system comprising the Sun and the objects that orbit it, "
...     "either directly or indirectly. Of the objects that orbit the Sun directly, the largest are the eight "
...     "planets, with the remainder being smaller objects, such as the five dwarf planets and small Solar System "
...     "bodies. The Solar System formed 4.6 billion years ago from the gravitational collapse of a giant "
...     "interstellar molecular cloud."
... )
>>> inputs = tokenizer("summarize: " + text, return_tensors="pt")

>>> # Generate diverse summary
>>> outputs_diverse = model.generate(
...     **inputs,
...     num_beam_groups=2,
...     diversity_penalty=10.0,
...     max_length=100,
...     num_beams=4,
...     num_return_sequences=2,
... )
>>> summaries_diverse = tokenizer.batch_decode(outputs_diverse, skip_special_tokens=True)

>>> # Generate non-diverse summary
>>> outputs_non_diverse = model.generate(
...     **inputs,
...     max_length=100,
...     num_beams=4,
...     num_return_sequences=2,
... )
>>> summary_non_diverse = tokenizer.batch_decode(outputs_non_diverse, skip_special_tokens=True)

>>> # With `diversity_penalty`, the resulting beams are much more diverse
>>> print(summary_non_diverse)
['the solar system formed 4.6 billion years ago from the collapse of a giant interstellar molecular cloud. of the objects that orbit the Sun directly, the largest are the eight planets.',
'the Solar System formed 4.6 billion years ago from the collapse of a giant interstellar molecular cloud. of the objects that orbit the Sun directly, the largest are the eight planets.']

>>> print(summaries_diverse)
['the solar system formed 4.6 billion years ago from the collapse of a giant interstellar molecular cloud. of the objects that orbit the Sun directly, the largest are the eight planets.',
'the solar system formed 4.6 billion years ago from the collapse of a giant interstellar molecular cloud. of the objects that orbit the Sun directly, the largest are the eight planets. the rest of the objects are smaller objects, such as the five dwarf planets and small solar system bodies.']

call

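( input_ids: LongTensor scores: FloatTensor current_tokens: LongTensor beam_group_idx: int ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.
current_tokens (torch.LongTensor of shape (batch_size)) — Indices of input sequence tokens in the vocabulary, corresponding to the tokens selected by the other beam groups in the current generation step.
beam_group_idx (int) — The index of the beam group currently being processed.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.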

class transformers.InfNanRemoveLogitsProcessor

( )

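LogitsProcessor that removes all nan and inf values to avoid the generation method failing. Note that this logits processor should only be used if necessary, since it can slow down the generation method.

This logits processor has no generate example, as there shouldn't be a correct combination of flags that warrants its use.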

call

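( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.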

class transformers.LogitNormalization

( )

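LogitsProcessor for normalizing the scores using log-softmax. It's important to normalize the scores during beam search, after applying the logits processors or warpers, since the search algorithm used in this library doesn't do it (it only does it before, but they may need re-normalization), yet it still assumes that the scores are normalized when comparing hypotheses.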
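
As a quick illustration of what the normalization guarantees, the plain-Python sketch below applies log-softmax to one row of scores and checks that the exponentials sum to 1. The `log_softmax` function here is a local helper written for this sketch, not the PyTorch function of the same name.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats (local helper)."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


raw = [2.0, 0.5, -1.0]
normalized = log_softmax(raw)

# Before normalization the exponentials do not form a probability distribution.
raw_mass = sum(math.exp(s) for s in raw)
# After normalization they sum to 1 (up to floating-point error).
normalized_mass = sum(math.exp(s) for s in normalized)
```

Normalization shifts every score by the same constant, so the ranking of tokens within a row is unchanged.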

Example

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> import torch

>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

>>> inputs = tokenizer("A sequence: 1, 2, 3", return_tensors="pt")

>>> # By default, the scores are not normalized -- the sum of their exponentials is NOT a normalized probability
>>> # distribution, summing to 1
>>> outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
>>> print(torch.allclose(torch.sum(torch.exp(outputs.scores[-1])), torch.Tensor((1.000,)), rtol=1e-4))
False

>>> # Normalizing them may have a positive impact on beam methods, or when using the scores on your application
>>> outputs = model.generate(**inputs, renormalize_logits=True, return_dict_in_generate=True, output_scores=True)
>>> print(torch.allclose(torch.sum(torch.exp(outputs.scores[-1])), torch.Tensor((1.000,)), rtol=1e-4))
True

call

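( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.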

class transformers.LogitsProcessor

( )

Abstract base class for all logit processors that can be applied during generation.

call

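( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.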

class transformers.LogitsProcessorList

( iterable = () )

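This class can be used to create a list of LogitsProcessor to subsequently process a scores input tensor. This class inherits from list and adds a specific `__call__` method to apply each LogitsProcessor to the inputs.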
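
The pattern described here, a list subclass whose call applies each element in order, can be sketched in a few lines of plain Python. `ProcessorChain`, `add_one`, and `double` are hypothetical names used only for illustration, and the per-processor keyword-argument handling of the real class is left out.

```python
class ProcessorChain(list):
    """A list of callables applied in order to the scores (illustrative stand-in)."""

    def __call__(self, input_ids, scores):
        for processor in self:
            # Each processor receives the output of the previous one.
            scores = processor(input_ids, scores)
        return scores


def add_one(input_ids, scores):
    return [s + 1.0 for s in scores]


def double(input_ids, scores):
    return [s * 2.0 for s in scores]


chain = ProcessorChain([add_one, double])
result = chain([0], [1.0, 2.0])
```

Order matters: swapping the two processors above yields different scores, which is why the position of a processor in the list is part of the generation behavior.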

call

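( input_ids: LongTensor scores: FloatTensor **kwargs ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.
kwargs (dict[str, Any], optional) — Additional keyword arguments that are specific to a logits processor.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.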

class transformers.MinLengthLogitsProcessor

( min_length: int eos_token_id: typing.Union[int, list[int], torch.Tensor] device: str = 'cpu' )

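Parameters

min_length (int) — The minimum length below which the score of eos_token_id is set to -float("Inf").
eos_token_id (Union[int, list[int], torch.Tensor]) — The id(s) of the end-of-sequence token.
device (str, optional, defaults to "cpu") — The device to allocate the tensors.

LogitsProcessor enforcing a minimum length by setting the EOS probability to 0. Note that, for decoder-only models like most LLMs, the length includes the prompt.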
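
A minimal plain-Python sketch of this rule for one row of scores is shown below. `block_eos_until` is a hypothetical helper; `cur_len` is assumed to be the current sequence length, which for decoder-only models includes the prompt.

```python
def block_eos_until(scores, cur_len, min_length, eos_token_ids):
    """Set EOS scores to -inf while the sequence is shorter than `min_length` (hypothetical helper)."""
    if cur_len >= min_length:
        return list(scores)
    # A score of -inf corresponds to a probability of 0 after softmax.
    return [
        float("-inf") if token_id in eos_token_ids else score
        for token_id, score in enumerate(scores)
    ]


too_short = block_eos_until([0.1, 0.9], cur_len=3, min_length=5, eos_token_ids={1})
long_enough = block_eos_until([0.1, 0.9], cur_len=5, min_length=5, eos_token_ids={1})
```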

Example

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
>>> model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

>>> inputs = tokenizer("A number:", return_tensors="pt")
>>> gen_out = model.generate(**inputs)
>>> print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
A number: one

>>> # setting `min_length` to a value smaller than the uncontrolled output length has no impact
>>> gen_out = model.generate(**inputs, min_length=3)
>>> print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
A number: one

>>> # setting a larger `min_length` will force the model to generate beyond its natural ending point, which is not
>>> # necessarily incorrect
>>> gen_out = model.generate(**inputs, min_length=10)
>>> print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
A number: one thousand, nine hundred and ninety-four

call

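( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.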

class transformers.MinNewTokensLengthLogitsProcessor

( prompt_length_to_skip: int min_new_tokens: int eos_token_id: typing.Union[int, list[int], torch.Tensor] device: str = 'cpu' )

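Parameters

prompt_length_to_skip (int) — The length of the input tokens. Not a valid argument when used with generate, as it will automatically assign the input length.
min_new_tokens (int) — The minimum *new* tokens length below which the score of eos_token_id is set to -float("Inf").
eos_token_id (Union[int, list[int], torch.Tensor]) — The id(s) of the end-of-sequence token.
device (str, optional, defaults to "cpu") — The device to allocate the tensors.

LogitsProcessor enforcing a minimum length of new tokens by setting the EOS (End-Of-Sequence) token probability to 0. Contrary to MinLengthLogitsProcessor, this processor ignores the prompt.

Example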

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
>>> model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

>>> inputs = tokenizer(["A number:"], return_tensors="pt")
>>> gen_out = model.generate(**inputs)
>>> print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
A number: one

>>> # setting `min_new_tokens` will force the model to generate beyond its natural ending point, which is not
>>> # necessarily incorrect
>>> gen_out = model.generate(**inputs, min_new_tokens=2)
>>> print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
A number: one thousand

call

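( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.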

class transformers.MinPLogitsWarper

( min_p: float filter_value: float = -inf min_tokens_to_keep: int = 1 )

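Parameters

min_p (float) — Minimum token probability, which will be scaled by the probability of the most likely token. It must be a value between 0 and 1. Typical values are in the 0.01-0.2 range, comparably selective as setting top_p in the 0.99-0.8 range (use the opposite of normal top_p values).
filter_value (float, optional, defaults to -inf) — All filtered values will be set to this float value.
min_tokens_to_keep (int, optional, defaults to 1) — Minimum number of tokens that cannot be filtered.

LogitsProcessor that performs min-p, i.e. keeps all tokens that are above a minimum probability, scaled by the probability of the most likely token. As a result, the filter becomes more aggressive in the presence of high-probability tokens, which is a sign of a confident output that we shouldn't deviate from.

Often used together with TemperatureLogitsWarper. Used as an alternative to TopPLogitsWarper and TopKLogitsWarper.

Created by @menhguin and @kalomaze (github handles). Code adapted from this external PR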
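
The scaling by the most likely token is the key difference from a fixed cutoff. A minimal plain-Python sketch for one row of logits follows; `min_p_filter` is a hypothetical helper and this is not the library's tensor implementation.

```python
import math


def min_p_filter(logits, min_p, min_tokens_to_keep=1, filter_value=-math.inf):
    """Mask tokens whose probability is below min_p * max probability (hypothetical helper)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # The threshold moves with the model's confidence in its top token.
    threshold = min_p * max(probs)

    ranked = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    protected = set(ranked[:min_tokens_to_keep])

    return [
        x if (probs[i] >= threshold or i in protected) else filter_value
        for i, x in enumerate(logits)
    ]


# Relative to the top token, the others have ratios of about 0.37 and 0.05.
loose = min_p_filter([3.0, 2.0, 0.0], min_p=0.1)
strict = min_p_filter([3.0, 2.0, 0.0], min_p=0.5)
```

With `min_p=0.1` only the least likely token is masked, while `min_p=0.5` leaves just the top token.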

Example

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

>>> set_seed(1)
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

>>> inputs = tokenizer("A sequence: 1, 2", return_tensors="pt")

>>> # With sampling, the output is unexpected -- sometimes too unexpected.
>>> outputs = model.generate(**inputs, do_sample=True)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: 1, 2, 3 | < 4 (left-hand pointer) ;
<BLANKLINE>
<BLANKLINE>

>>> # With `min_p` sampling, the output gets restricted to high-probability tokens.
>>> # Pro tip: In practice, LLMs use `min_p` in the 0.01-0.2 range.
>>> outputs = model.generate(**inputs, do_sample=True, min_p=0.1)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: 1, 2, 3, 4, 5, 6, 7, 8, 9

call

( input_ids: LongTensor scores: FloatTensor )

class transformers.NoBadWordsLogitsProcessor

( bad_words_ids: list eos_token_id: typing.Union[int, list[int], torch.Tensor, NoneType] = None )

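Parameters

bad_words_ids (list[list[int]]) — List of lists of token ids that are not allowed to be generated.
eos_token_id (Union[int, list[int], torch.Tensor], optional) — The id(s) of the end-of-sequence token.

LogitsProcessor that enforces that specified sequences will never be selected.

In order to get the token ids of the words that should not appear in the generated text, make sure to set add_prefix_space=True when initializing the tokenizer, and use tokenizer(bad_words, add_special_tokens=False).input_ids. The add_prefix_space argument is only supported for some slow tokenizers, as fast tokenizers' prefixing behaviours come from pre tokenizers. Read more here.

Example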

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer(["In a word, the cake is a"], return_tensors="pt")

>>> output_ids = model.generate(inputs["input_ids"], max_new_tokens=5, pad_token_id=tokenizer.eos_token_id)
>>> print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
In a word, the cake is a bit of a mess.

>>> # Now let's take the bad words out. Please note that the tokenizer is initialized differently
>>> tokenizer_with_prefix_space = AutoTokenizer.from_pretrained("openai-community/gpt2", add_prefix_space=True)


>>> def get_tokens_as_list(word_list):
...     "Converts a sequence of words into a list of tokens"
...     tokens_list = []
...     for word in word_list:
...         tokenized_word = tokenizer_with_prefix_space([word], add_special_tokens=False).input_ids[0]
...         tokens_list.append(tokenized_word)
...     return tokens_list


>>> bad_words_ids = get_tokens_as_list(word_list=["mess"])
>>> output_ids = model.generate(
...     inputs["input_ids"], max_new_tokens=5, bad_words_ids=bad_words_ids, pad_token_id=tokenizer.eos_token_id
... )
>>> print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
In a word, the cake is a bit of a surprise.

call

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.NoRepeatNGramLogitsProcessor

( ngram_size: int )

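Parameters

ngram_size (int) — All n-grams of size ngram_size can only occur once.

N-grams are groups of "n" consecutive words, characters, or tokens taken from a sequence of text. Given the sentence "She runs fast", the bi-grams (n=2) would be ("she", "runs") and ("runs", "fast"). In text generation, avoiding repetitions of word sequences provides a more diverse output. This LogitsProcessor enforces no repetition of n-grams by setting the scores of banned tokens to negative infinity, which eliminates those tokens from consideration when further processing the scores. Note that, for decoder-only models like most LLMs, the prompt is also considered to obtain the n-grams. Fairseq.

Use n-gram penalties with care. For instance, penalizing 2-grams (bigrams) in an article about the city of New York might lead to undesirable outcomes where the city's name appears only once in the entire text. Reference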
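
To make the banning rule concrete, the sketch below computes which next tokens would complete an n-gram that already occurs in the sequence. It is a plain-Python illustration with a hypothetical helper name (`banned_next_tokens`), operating on a single list of token ids rather than a batch.

```python
def banned_next_tokens(token_ids, ngram_size):
    """Tokens that would repeat an n-gram already present in `token_ids` (hypothetical helper)."""
    if len(token_ids) + 1 < ngram_size:
        return set()  # Not enough context yet to complete any n-gram.

    # The last (ngram_size - 1) tokens form the prefix of the n-gram being completed.
    prefix = tuple(token_ids[len(token_ids) - (ngram_size - 1):])

    banned = set()
    for start in range(len(token_ids) - ngram_size + 1):
        ngram = tuple(token_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# The sequence ends in token 7, and the bigram (7, 9) has already appeared.
banned = banned_next_tokens([5, 7, 9, 5, 7], ngram_size=2)
```

The scores of the returned token ids would then be set to negative infinity for the current step.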

Example

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
>>> inputs = tokenizer(["Today I"], return_tensors="pt")

>>> output = model.generate(**inputs)
>>> print(tokenizer.decode(output[0], skip_special_tokens=True))
Today I’m not sure if I’m going to be able to do it.

>>> # Now let's add ngram size using `no_repeat_ngram_size`. This stops the repetitions ("I’m") in the output.
>>> output = model.generate(**inputs, no_repeat_ngram_size=2)
>>> print(tokenizer.decode(output[0], skip_special_tokens=True))
Today I’m not sure if I can get a better understanding of the nature of this issue

call

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.PrefixConstrainedLogitsProcessor

( prefix_allowed_tokens_fn: typing.Callable[[int, torch.Tensor], list[int]] num_beams: int )

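Parameters

prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], list[int]]) — This function constrains the beam search to allowed tokens only at each step. It takes 2 arguments: inputs_ids and the batch ID batch_id. It has to return a list with the allowed tokens for the next generation step, conditioned on the previously generated tokens inputs_ids and the batch ID batch_id.

LogitsProcessor that enforces constrained generation and is useful for prefix-conditioned constrained generation. See Autoregressive Entity Retrieval for more information.

Example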

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")

>>> inputs = tokenizer("Alice and Bob", return_tensors="pt")

>>> # By default, it continues generating according to the model's logits
>>> outputs = model.generate(**inputs, max_new_tokens=5)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
Alice and Bob are friends

>>> # We can constrain it with `prefix_allowed_tokens_fn` to force a certain behavior based on a prefix.
>>> # For instance, we can force an entire entity to be generated when its beginning is detected.
>>> entity = tokenizer(" Bob Marley", return_tensors="pt").input_ids[0]  # 3 tokens
>>> def prefix_allowed_tokens_fn(batch_id, input_ids):
...     '''
...     Attempts to generate 'Bob Marley' when 'Bob' is detected.
...     In this case, `batch_id` is not used, but you can set rules for each batch member.
...     '''
...     if input_ids[-1] == entity[0]:
...         return [entity[1].item()]
...     elif input_ids[-2] == entity[0] and input_ids[-1] == entity[1]:
...         return [entity[2].item()]
...     return list(range(tokenizer.vocab_size))  # If no match, allow all tokens

>>> outputs = model.generate(**inputs, max_new_tokens=5, prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
Alice and Bob Marley

call

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.RepetitionPenaltyLogitsProcessor

( penalty: float prompt_ignore_length: typing.Optional[int] = None )

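Parameters

penalty (float) — The parameter for repetition penalty. 1.0 means no penalty. Above 1.0 penalizes previously generated tokens. Between 0.0 and 1.0 rewards previously generated tokens.
prompt_ignore_length (int, optional) — The original input ids sequence length, which if provided, will not be used in the penalty calculation.

LogitsProcessor that prevents the repetition of previous tokens through a penalty. This penalty is applied at most once per token. Note that, for decoder-only models like most LLMs, the considered tokens include the prompt by default.

In the original paper, the authors suggest the use of a penalty of around 1.2 to achieve a good balance between truthful generation and lack of repetition. To penalize and reduce repetition, use penalty values above 1.0, where a higher value penalizes more strongly. To reward and encourage repetition, use penalty values between 0.0 and 1.0, where a lower value rewards more strongly.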
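
The sketch below shows one common way such a penalty is applied to raw logits: positive scores are divided by the penalty and negative scores are multiplied by it, so that either way the token becomes less likely when the penalty is above 1.0. This is a plain-Python illustration under that assumption; `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(scores, seen_token_ids, penalty):
    """Penalize each previously seen token once, whatever its count (hypothetical helper)."""
    penalized = list(scores)
    # Using a set means a token seen several times is still penalized only once.
    for token_id in set(seen_token_ids):
        score = scores[token_id]
        penalized[token_id] = score * penalty if score < 0 else score / penalty
    return penalized


# Tokens 0 and 1 were seen before; token 2 was not and is left untouched.
result = apply_repetition_penalty([2.0, -1.0, 0.5], seen_token_ids=[0, 1, 0], penalty=2.0)
```

To mimic `prompt_ignore_length`, the caller would pass only the ids generated after the prompt as `seen_token_ids`.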

Example

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, RepetitionPenaltyLogitsProcessor

>>> # Initializing the model and tokenizer for it
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
>>> inputs = tokenizer(["I'm not going to"], return_tensors="pt")

>>> # This shows a normal generate without any specific parameters
>>> summary_ids = model.generate(**inputs)
>>> print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
I'm not going to be able to do that. I'm going to be able to do that

>>> # This generates a penalty for repeated tokens
>>> penalized_ids = model.generate(**inputs, repetition_penalty=1.1)
>>> print(tokenizer.batch_decode(penalized_ids, skip_special_tokens=True)[0])
I'm not going to be able to do that. I'll just have to go out and play

>>> # We can also exclude the input prompt by creating an instance of this class
>>> # with a `prompt_ignore_length` and passing it as a custom logit processor
>>> rep_pen_processor = RepetitionPenaltyLogitsProcessor(
...     penalty=1.1,
...     prompt_ignore_length=inputs["input_ids"].shape[-1]
... )
>>> penalized_ids = model.generate(**inputs, logits_processor=[rep_pen_processor])
>>> print(tokenizer.batch_decode(penalized_ids, skip_special_tokens=True)[0])
I'm not going to be able to do that. I'm going to have to go through a lot of things, and

call

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.SequenceBiasLogitsProcessor

( sequence_bias: list )

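Parameters

sequence_bias (list[list[Union[list[int], float]]]) — List of lists that maps a sequence of tokens to its bias term (e.g. [[[10, 45], -2.0], [[64], -7.5]]). Positive biases increase the odds of the sequence being selected, while negative biases do the opposite. If a sequence has a length of 1, its bias will always be applied. Otherwise, the bias will only be applied if the sequence in question is about to be completed (in the token selection step after this processor is applied).

LogitsProcessor that applies an additive bias on sequences. The bias is applied to the last token of a sequence when the next generated token can complete it. Consequently, to make the most of biasing sequences with more than one token, consider using beam methods (to gracefully work around partially completed sequences that have a negative bias) and applying the bias to their prefixes (to ensure the bias is applied earlier).

At a token level, biasing a word is different from biasing a word with a space before it. If you want to bias "foo" mid-sentence, you'll likely want to add a prefix space and bias " foo" instead. Check the tokenizer section of our NLP course to find out why: https://huggingface.co/learn/nlp-course/chapter2/4#pt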
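
The "about to be completed" condition can be illustrated with a plain-Python sketch: a bias is added to the last token of a sequence only when the tokens generated so far end with the rest of that sequence. `apply_sequence_bias` is a hypothetical helper working on one row of scores, not the library's implementation.

```python
def apply_sequence_bias(scores, generated_ids, sequence_bias):
    """Add each bias to the final token of sequences that are one token from completion."""
    biased = list(scores)
    for sequence, bias in sequence_bias:
        prefix, last_token = list(sequence[:-1]), sequence[-1]
        # A single-token sequence has an empty prefix, so its bias always applies.
        tail = list(generated_ids[len(generated_ids) - len(prefix):])
        if len(prefix) <= len(generated_ids) and tail == prefix:
            biased[last_token] += bias
    return biased


sequence_bias = [[[1, 2], -2.0], [[3], 5.0]]
# The text so far ends in token 1, so generating token 2 would complete [1, 2].
completing = apply_sequence_bias([0.0, 0.0, 0.0, 0.0], [0, 1], sequence_bias)
# Here the prefix does not match, so only the single-token bias applies.
not_completing = apply_sequence_bias([0.0, 0.0, 0.0, 0.0], [0, 0], sequence_bias)
```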

Example

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
>>> inputs = tokenizer(["The full name of Donald is Donald"], return_tensors="pt")

>>> summary_ids = model.generate(inputs["input_ids"], max_new_tokens=4, do_sample=False)
>>> print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
The full name of Donald is Donald John Trump Sr.

>>> def get_tokens(word):
...     return tokenizer([word], add_special_tokens=False).input_ids[0]

>>> # IMPORTANT: Remember our tip about adding spaces before words to bias them correctly.
>>> sequence_bias = [[get_tokens("Trump"), -10.0],]  # will fail to apply bias
>>> biased_ids = model.generate(
...     inputs["input_ids"], max_new_tokens=4, do_sample=False, sequence_bias=sequence_bias
... )
>>> print(tokenizer.batch_decode(biased_ids, skip_special_tokens=True)[0])
The full name of Donald is Donald John Trump Sr.

>>> sequence_bias = [[get_tokens(" Trump"), -10.0],]  # will work
>>> biased_ids = model.generate(
...     inputs["input_ids"], max_new_tokens=4, do_sample=False, sequence_bias=sequence_bias
... )
>>> print(tokenizer.batch_decode(biased_ids, skip_special_tokens=True)[0])
The full name of Donald is Donald John Harper. He

>>> # We can also add a positive bias to nudge the model towards specific tokens or continuations. This technique
>>> # is also more effective when paired up with beam search.
>>> sequence_bias = [[get_tokens(" Donald Duck"), 10.0],]
>>> biased_ids = model.generate(
...     inputs["input_ids"], max_new_tokens=4, num_beams=4, do_sample=False, sequence_bias=sequence_bias
... )
>>> print(tokenizer.batch_decode(biased_ids, skip_special_tokens=True)[0])
The full name of Donald is Donald Duck. He is

call

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.SuppressTokensAtBeginLogitsProcessor

( begin_suppress_tokens begin_index device: str = 'cpu' )

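SuppressTokensAtBeginLogitsProcessor suppresses a list of tokens as soon as the generate function starts generating using begin_index tokens. This should ensure that the tokens defined by begin_suppress_tokens are not generated at the beginning. Originally created for Whisper.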
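
As a rough, framework-free sketch of this behaviour: the suppressed tokens are masked only at the step where the sequence length equals begin_index. The function name suppress_at_begin and the unbatched list inputs are assumptions made for illustration.

```python
NEG_INF = float("-inf")


def suppress_at_begin(input_ids, scores, begin_suppress_tokens, begin_index):
    """Mask the listed tokens only on the first freely generated step."""
    out = list(scores)
    if len(input_ids) == begin_index:
        for token in begin_suppress_tokens:
            out[token] = NEG_INF
    return out


first_step = suppress_at_begin([7, 8], [1.0, 2.0, 3.0], [0, 2], begin_index=2)
later_step = suppress_at_begin([7, 8, 9], [1.0, 2.0, 3.0], [0, 2], begin_index=2)
```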

Examples

>>> from transformers import AutoProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> inputs = processor(ds[0]["audio"]["array"], return_tensors="pt")

>>> # Whisper has `begin_suppress_tokens` set by default (= `[220, 50256]`). 50256 is the EOS token, so this means
>>> # it can't generate an EOS token in the first iteration, but it can in the others.
>>> outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
>>> print(outputs.scores[0][0, 50256])
tensor(-inf)
>>> print(outputs.scores[-1][0, 50256])  # in other places we can see some probability mass for EOS
tensor(29.9010)

>>> # If we disable `begin_suppress_tokens`, we can generate EOS in the first iteration.
>>> outputs = model.generate(
...     **inputs, return_dict_in_generate=True, output_scores=True, begin_suppress_tokens=None
... )
>>> print(outputs.scores[0][0, 50256])
tensor(11.2027)

call

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.SuppressTokensLogitsProcessor

( suppress_tokens device: str = 'cpu' )

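This processor can be used to suppress a list of tokens. The processor sets their log probabilities to -inf so that they are not generated. Originally created for Whisper.

Examples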

>>> from transformers import AutoProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> inputs = processor(ds[0]["audio"]["array"], return_tensors="pt")

>>> # Whisper has a long list of suppressed tokens. For instance, in this case, the token 1 is suppressed by default.
>>> outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
>>> print(outputs.scores[1][0, 1])  # 1 (and not 0) is the first freely generated token
tensor(-inf)

>>> # If we disable `suppress_tokens`, we can generate it.
>>> outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True, suppress_tokens=None)
>>> print(outputs.scores[1][0, 1])
tensor(6.0678)

call

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.SynthIDTextWatermarkLogitsProcessor

( ngram_len: int keys: list sampling_table_size: int sampling_table_seed: int context_history_size: int device: device skip_first_ngram_calls: bool = False debug_mode: bool = False )

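Parameters

ngram_len (int) — Ngram length.
keys (list[int]) — A sequence of watermarking keys, one for each depth.
sampling_table_size (int) — Size of the sampling table.
sampling_table_seed (int) — Random seed used to generate the sampling table.
context_history_size (int) — Size of the tensor used to keep track of seen contexts.
device (torch.device) — Device to use.
skip_first_ngram_calls (bool, optional, defaults to False) — Whether to skip the first ngram calls.
debug_mode (bool, optional, defaults to False) — Logits are modified to a uniform distribution before the watermarking modification is applied. This is used to test the implementation.

Logits processor that implements watermarking techniques for text generation models. This class facilitates the application of SynthID text watermarking, a method for embedding imperceptible signals into generated text to aid in detecting synthetic content. It operates by subtly manipulating the probabilities of token selection during text generation in a manner that can be reliably recovered later for verification.

Key features

State management: Maintains internal state to track token sequences and generate watermarking keys dynamically.
Key generation: Computes hashes based on token sequences and watermarking parameters to create unique keys for each position.
G-value sampling: Uses a precomputed sampling table to sample watermarking values (g-values) based on the generated keys.
Score adjustment: Applies the computed g-values to modify token probabilities during generation, thereby embedding the watermark.
Context repetition handling: Includes logic to avoid watermarking tokens in repeated contexts, preserving naturalness.
EOS token masking: Supports masking end-of-sentence tokens to prevent their inclusion in watermarking calculations.
Utility functions: Provides functions to compute g-values directly, check for context repetition, create EOS token masks, and estimate the expected mean g-value.

For more details, see the paper: https://www.nature.com/articles/s41586-024-08025-4

Examples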

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

>>> tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
>>> model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')

>>> # SynthID Text configuration
>>> watermarking_config = SynthIDTextWatermarkingConfig(
...     keys=[654, 400, 836, 123, 340, 443, 597, 160, 57],
...     ngram_len=5,
... )

>>> # Generation with watermarking
>>> tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
>>> output_sequences = model.generate(
...     **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
... )
>>> watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)

call

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.TemperatureLogitsWarper

( temperature: float )

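Parameters

temperature (float) — Strictly positive float value used to modulate the logits distribution. A value smaller than 1 decreases randomness (and vice versa); in the limit as the value approaches 0, all probability mass shifts to the most likely token.

LogitsProcessor for temperature (exponential scaling of the output probability distribution), which effectively means that it can control the randomness of the predicted tokens. Often used together with TopPLogitsWarper and TopKLogitsWarper.

Make sure that do_sample=True is included in the generate arguments, otherwise the temperature value won't have any effect.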
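
To make the effect of the parameter concrete, here is a small sketch in plain Python that divides the logits by the temperature before a softmax. The helper names softmax and apply_temperature are illustrative only and do not come from the library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(logits, temperature):
    """Scale the logits by 1 / temperature (temperature must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    return [x / temperature for x in logits]


logits = [2.0, 1.0, 0.0]
baseline = softmax(logits)
sharp = softmax(apply_temperature(logits, 0.5))  # below 1: more peaked
flat = softmax(apply_temperature(logits, 2.0))   # above 1: closer to uniform
```

With these toy logits the top token gains probability at temperature 0.5 and loses it at temperature 2.0, which is the behaviour the parameter description refers to.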

Examples

>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

>>> set_seed(0)  # for reproducibility

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> model.config.pad_token_id = model.config.eos_token_id
>>> inputs = tokenizer(["Hugging Face Company is"], return_tensors="pt")

>>> # With temperature=1.0, the default, we consistently get random outputs due to random sampling.
>>> generate_kwargs = {"max_new_tokens": 10, "do_sample": True, "temperature": 1.0, "num_return_sequences": 2}
>>> outputs = model.generate(**inputs, **generate_kwargs)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
['Hugging Face Company is one of these companies that is going to take a',
"Hugging Face Company is a brand created by Brian A. O'Neil"]

>>> # However, with temperature close to 0, it approximates greedy decoding strategies (invariant)
>>> generate_kwargs["temperature"] = 0.0001
>>> outputs = model.generate(**inputs, **generate_kwargs)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
['Hugging Face Company is a company that has been around for over 20 years',
'Hugging Face Company is a company that has been around for over 20 years']

call

( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.TopKLogitsWarper

( top_k: int filter_value: float = -inf min_tokens_to_keep: int = 1 )

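Parameters

top_k (int) — The number of highest probability vocabulary tokens to keep for top-k filtering.
filter_value (float, optional, defaults to -inf) — All filtered values will be set to this float value.
min_tokens_to_keep (int, optional, defaults to 1) — Minimum number of tokens that cannot be filtered.

LogitsProcessor that performs top-k filtering, i.e. restricts sampling to the k highest probability elements. Often used together with TemperatureLogitsWarper and TopPLogitsWarper.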
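
A minimal sketch of top-k filtering on a plain list of scores follows. The function name top_k_filter is hypothetical; the parameters mirror the ones documented above.

```python
def top_k_filter(scores, top_k, filter_value=float("-inf"), min_tokens_to_keep=1):
    """Keep the top_k highest scores and replace every other score with filter_value."""
    k = min(max(top_k, min_tokens_to_keep), len(scores))
    threshold = sorted(scores, reverse=True)[k - 1]  # k-th largest score
    return [s if s >= threshold else filter_value for s in scores]


filtered = top_k_filter([1.0, 3.0, 2.0, 0.5], top_k=2)
```

Scores tied with the k-th largest value are all kept in this sketch, so slightly more than k tokens can survive when there are ties.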

Examples

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

>>> set_seed(1)
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

>>> inputs = tokenizer("A sequence: A, B, C, D", return_tensors="pt")

>>> # With sampling, the output is unexpected -- sometimes too unexpected.
>>> outputs = model.generate(**inputs, do_sample=True)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: A, B, C, D, E — S — O, P — R

>>> # With `top_k` sampling, the output gets restricted to the k most likely tokens.
>>> # Pro tip: In practice, LLMs use `top_k` in the 5-50 range.
>>> outputs = model.generate(**inputs, do_sample=True, top_k=2)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: A, B, C, D, E, F, G, H, I

call

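( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.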

class transformers.TopPLogitsWarper

( top_p: float filter_value: float = -inf min_tokens_to_keep: int = 1 )

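Parameters

top_p (float) — If set to < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
filter_value (float, optional, defaults to -inf) — All filtered values will be set to this float value.
min_tokens_to_keep (int, optional, defaults to 1) — Minimum number of tokens that cannot be filtered.

LogitsProcessor that performs top-p filtering, i.e. restricts sampling to the smallest set of top tokens whose cumulative probability reaches top_p. Often used together with TemperatureLogitsWarper and TopKLogitsWarper.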
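
The nucleus selection can be sketched in plain Python as follows. This is an illustration under simplifying assumptions (one unbatched list of scores, a hypothetical function name top_p_filter), not the library's vectorised implementation.

```python
import math


def top_p_filter(scores, top_p, filter_value=float("-inf"), min_tokens_to_keep=1):
    """Keep the smallest set of most probable tokens whose probabilities sum to >= top_p."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk the tokens from most to least probable, stopping once enough mass is covered.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for rank, index in enumerate(order):
        if rank >= min_tokens_to_keep and cumulative >= top_p:
            break
        kept.add(index)
        cumulative += probs[index]
    return [s if i in kept else filter_value for i, s in enumerate(scores)]


scores = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
nucleus = top_p_filter(scores, top_p=0.7)   # keeps the 0.5 and 0.3 tokens
narrow = top_p_filter(scores, top_p=0.1)    # keeps only the 0.5 token
```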

Examples

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

>>> set_seed(1)
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

>>> inputs = tokenizer("A sequence: 1, 2", return_tensors="pt")

>>> # With sampling, the output is unexpected -- sometimes too unexpected.
>>> outputs = model.generate(**inputs, do_sample=True)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: 1, 2, 3 | < 4 (left-hand pointer) ;
<BLANKLINE>
<BLANKLINE>

>>> # With `top_p` sampling, the output gets restricted to high-probability tokens.
>>> # Pro tip: In practice, LLMs use `top_p` in the 0.9-0.95 range.
>>> outputs = model.generate(**inputs, do_sample=True, top_p=0.1)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
A sequence: 1, 2, 3, 4, 5, 6, 7, 8, 9

call

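( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.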

class transformers.TypicalLogitsWarper

( mass: float = 0.9 filter_value: float = -inf min_tokens_to_keep: int = 1 )

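Parameters

mass (float, optional, defaults to 0.9) — Value of typical_p, between 0 and 1 inclusive.
filter_value (float, optional, defaults to -inf) — All filtered values will be set to this float value.
min_tokens_to_keep (int, optional, defaults to 1) — Minimum number of tokens that cannot be filtered.

LogitsProcessor that performs typical decoding. Inspired by how humans use language, it prioritizes tokens whose log probability is close to the entropy of the token probability distribution. This means that the most likely tokens may be discarded in the process.

See Typical Decoding for Natural Language Generation for more information.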
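
The selection rule can be illustrated with a small sketch: tokens are ranked by how close their surprisal (negative log probability) is to the entropy of the distribution, and the closest ones are kept until their probabilities add up to mass. The function name typical_filter and the unbatched list input are assumptions for illustration.

```python
import math


def typical_filter(scores, mass=0.9, filter_value=float("-inf"), min_tokens_to_keep=1):
    """Keep the tokens whose surprisal is closest to the entropy, up to `mass` probability."""
    peak = max(scores)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
    log_probs = [s - log_z for s in scores]
    probs = [math.exp(lp) for lp in log_probs]
    entropy = -sum(p * lp for p, lp in zip(probs, log_probs))

    # Rank tokens by |surprisal - entropy|, most "typical" first.
    order = sorted(range(len(scores)), key=lambda i: abs(-log_probs[i] - entropy))
    kept, cumulative = set(), 0.0
    for rank, index in enumerate(order):
        if rank >= min_tokens_to_keep and cumulative >= mass:
            break
        kept.add(index)
        cumulative += probs[index]
    return [s if i in kept else filter_value for i, s in enumerate(scores)]


scores = [math.log(p) for p in (0.6, 0.2, 0.1, 0.1)]
filtered = typical_filter(scores, mass=0.1)
```

For this toy distribution the token with probability 0.2 has the surprisal closest to the entropy, so with a small mass the most likely token (probability 0.6) is the one that gets filtered out.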

Examples

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

>>> model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")

>>> inputs = tokenizer("1, 2, 3", return_tensors="pt")

>>> # We can see that greedy decoding produces a sequence of numbers
>>> outputs = model.generate(**inputs)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
1, 2, 3, 4, 5, 6, 7, 8, 9, 10,

>>> # For this particular seed, we can see that sampling produces nearly the same low-information (= low entropy)
>>> # sequence
>>> set_seed(18)
>>> outputs = model.generate(**inputs, do_sample=True)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
1, 2, 3, 4, 5, 6, 7, 8, 9 and 10

>>> # With `typical_p` set, the most obvious sequence is no longer produced, which may be good for your problem
>>> set_seed(18)
>>> outputs = model.generate(
...     **inputs, do_sample=True, typical_p=0.1, return_dict_in_generate=True, output_scores=True
... )
>>> print(tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0])
1, 2, 3 and 5

>>> # We can see that the token corresponding to "4" (token 934) in the second position, the most likely token
>>> # as seen with greedy decoding, was entirely blocked out
>>> print(outputs.scores[1][0, 934])
tensor(-inf)

call

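( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.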

class transformers.UnbatchedClassifierFreeGuidanceLogitsProcessor

( guidance_scale: float model unconditional_ids: typing.Optional[torch.LongTensor] = None unconditional_attention_mask: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = True )

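Parameters

guidance_scale (float) — The guidance scale for classifier-free guidance (CFG). CFG is enabled by setting guidance_scale != 1. A higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer quality. A value smaller than 1 has the opposite effect, while making the negative prompt provided with negative_prompt_ids (if any) act as a positive prompt.
model (PreTrainedModel) — The model computing the unconditional scores. It is expected to be the same as the one computing the conditional scores. Both models must use the same tokenizer.
unconditional_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary for the unconditional branch. If unset, defaults to the last token of the prompt.
unconditional_attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Attention mask for unconditional_ids.
use_cache (bool, optional, defaults to True) — Whether to cache key/values during the negative prompt forward pass.

Logits processor for classifier-free guidance (CFG). The processor computes a weighted average across the scores from the prompt-conditional and prompt-unconditional (or negative) logits, parameterized by guidance_scale. The unconditional scores are computed internally by prompting model with the unconditional_ids branch.

See the paper for more information.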
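
The weighted combination can be written out in a few lines. The sketch below assumes both branches are first normalised with a log softmax and then mixed as guidance_scale * (conditional - unconditional) + unconditional; the function names are hypothetical and the inputs are plain lists rather than tensors.

```python
import math


def log_softmax(logits):
    """Log softmax over a plain list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def cfg_combine(cond_logits, uncond_logits, guidance_scale):
    """Mix conditional and unconditional log probabilities with a guidance scale."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [guidance_scale * (c - u) + u for c, u in zip(cond, uncond)]


cond = [2.0, 0.0, -1.0]    # scores given the prompt
uncond = [0.0, 1.0, 0.0]   # scores given the unconditional (or negative) prompt
guided = cfg_combine(cond, uncond, guidance_scale=1.5)
```

Under this formula a scale of 1 returns the conditional scores unchanged and a scale of 0 returns the unconditional ones, which matches the note above that values below 1 make a negative prompt act as a positive prompt.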

Examples

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer(["Today, a dragon flew over Paris, France,"], return_tensors="pt")
>>> out = model.generate(inputs["input_ids"], guidance_scale=1.5)
>>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
'Today, a dragon flew over Paris, France, killing at least 50 people and injuring more than 100'

>>> # with a negative prompt
>>> neg_inputs = tokenizer(["A very happy event happened,"], return_tensors="pt")
>>> out = model.generate(inputs["input_ids"], guidance_scale=2, negative_prompt_ids=neg_inputs["input_ids"])
>>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
'Today, a dragon flew over Paris, France, killing at least 130 people. French media reported that'

>>> # with a positive prompt
>>> neg_inputs = tokenizer(["A very happy event happened,"], return_tensors="pt")
>>> out = model.generate(inputs["input_ids"], guidance_scale=0, negative_prompt_ids=neg_inputs["input_ids"])
>>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
"Today, a dragon flew over Paris, France, and I'm very happy to be here. I"

call

( input_ids scores )

class transformers.WhisperTimeStampLogitsProcessor

( generate_config: GenerationConfig begin_index: int _detect_timestamp_from_logprob: typing.Optional[bool] = None )

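Parameters

generate_config (GenerationConfig) — The generation config used to generate the output. The following parameters are required: eos_token_id (int, optional, defaults to 50257): The id of the end-of-sequence token. no_timestamps_token_id (int, optional, defaults to 50363): The id of the "<|notimestamps|>" token. max_initial_timestamp_index (int, optional, defaults to 1): Used to set the maximum value of the initial timestamp. This is used to prevent the model from predicting timestamps that are too far in the future.
begin_index (int) — Token index of the first token that is generated by the model.
_detect_timestamp_from_logprob (bool, optional) — Whether timestamps can be predicted from logprobs over all timestamps.

LogitsProcessor that modifies the logits for the generation of timestamps in the transcription. When the input tokens are at a specific threshold, the processor sets the scores to negative infinity. The processor makes sure that timestamp tokens appear in pairs, by masking out the logits that would break this pairing pattern. This is done to maintain the consistency and structure of generated timestamps. It also ensures that when the predicted probability of sampling any of the timestamp tokens is greater than that of any individual non-timestamp token, those non-timestamp logits are set to negative infinity. This is done to ensure the generation of timestamps over other potential tokens.

See the paper for more information.

Examples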

>>> import torch
>>> from transformers import AutoProcessor, WhisperForConditionalGeneration, GenerationConfig
>>> from datasets import load_dataset

>>> processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> inputs = processor(ds[3]["audio"]["array"], return_tensors="pt")
>>> input_features = inputs.input_features

>>> # Displaying timestamps
>>> generated_ids = model.generate(inputs=input_features, return_timestamps=True)
>>> transcription = processor.batch_decode(generated_ids, decode_with_timestamps=True)[0]
>>> print("Transcription:", transcription)
Transcription: <|startoftranscript|><|0.00|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can<|6.44|><|6.44|> discover in it but little of rocky Ithaca.<|9.44|><|endoftext|>


>>> # No timestamps & change EOS:
>>> # This allows the user to select a specific token to terminate the sequence on, in this case it's the word "can" (460)
>>> model.generation_config.eos_token_id = 460
>>> generated_ids = model.generate(inputs=input_features, return_timestamps=False)
>>> transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
>>> print("Transcription:", transcription)
Transcription:  He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can

call

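( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.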

class transformers.WatermarkLogitsProcessor

( vocab_size device greenlist_ratio: float = 0.25 bias: float = 2.0 hashing_key: int = 15485863 seeding_scheme: str = 'lefthash' context_width: int = 1 )

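Parameters

vocab_size (int) — The model tokenizer's vocab_size. Used to calculate the ratio of "green" tokens.
device (str) — The device where the model is allocated.
greenlist_ratio (float, optional, defaults to 0.25) — The ratio of "green" tokens used to the vocabulary size.
bias (float, optional, defaults to 2.0) — The bias added to the logits of the selected "green" tokens. Consider lowering the bias if the text generation quality degrades. Recommended values are in the range [0.5, 2.0].
hashing_key (int, optional, defaults to 15485863) — Key used for hashing. If you deploy this watermark, we advise using another private key. The default, 15485863, is the millionth prime.
seeding_scheme (str, optional, defaults to "lefthash") — The seeding scheme used for selecting "green" tokens. Accepted values:
- "lefthash" (default): the "green" token selection depends on the last token (Algorithm 2 from the paper)
- "selfhash": the "green" token selection depends on the current token itself (Algorithm 3 from the paper). The downside of this scheme is that it considers all possible next tokens and can be slower than "lefthash".
context_width (int, optional, defaults to 1) — The number of previous tokens to use when setting the seed. A higher context length makes the watermark more robust.

Logits processor for watermarking generated text. The processor modifies model output scores by adding a small bias to a randomized set of "green" tokens before generating the next token. The "green" token selection process depends on the seeding_scheme used. The code is based on the original repository.

Text generated by this LogitsProcessor can be detected using WatermarkDetector. See __call__() for details.

See the paper for more information.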
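
To show the general idea of a "lefthash"-style scheme, the following sketch seeds a random generator from the last token, picks a pseudo-random subset of the vocabulary as the green list, and adds the bias to those tokens. It is only an illustration: the use of Python's random module, the seed formula, and the function names are assumptions and do not reproduce the library's actual hashing.

```python
import random


def greenlist_ids(last_token, vocab_size, greenlist_ratio=0.25, hashing_key=15485863):
    """Pick a reproducible pseudo-random subset of the vocabulary, seeded by the last token."""
    rng = random.Random(hashing_key * last_token)  # illustrative seed, not the library's hash
    permutation = rng.sample(range(vocab_size), vocab_size)
    return permutation[: int(vocab_size * greenlist_ratio)]


def watermark_scores(input_ids, scores, bias=2.0):
    """Add `bias` to the scores of the green tokens selected from the last generated token."""
    out = list(scores)
    for token in greenlist_ids(input_ids[-1], len(scores)):
        out[token] += bias
    return out


scores = [0.0] * 20  # toy vocabulary of 20 tokens
marked = watermark_scores([3, 7], scores)
```

Because the green list can be recomputed from the preceding token and the key, a detector that knows the key can count how many generated tokens fall in their green lists.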

Examples

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, WatermarkingConfig

>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer(["Alice and Bob are"], return_tensors="pt")

>>> # normal generation
>>> out = model.generate(inputs["input_ids"], max_length=20, do_sample=False)
>>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
'Alice and Bob are both in the same room.\n\n"I\'m not sure if you\'re'

>>> # watermarked generation
>>> watermarking_config = WatermarkingConfig(bias=2.5, context_width=2, seeding_scheme="selfhash")
>>> out = model.generate(inputs["input_ids"], watermarking_config=watermarking_config, max_length=20, do_sample=False)
>>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
'Alice and Bob are both still alive and well and the story is pretty much a one-hour adventure'

>>> # to detect watermarked text use the WatermarkDetector class
>>> from transformers import WatermarkDetector
>>> detector = WatermarkDetector(model_config=model.config, device="cpu", watermarking_config=watermarking_config)
>>> detection_preds = detector(out)
>>> detection_preds
array([ True])

call

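( input_ids: LongTensor scores: FloatTensor ) → torch.FloatTensor of shape (batch_size, config.vocab_size)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.

torch.FloatTensor of shape (batch_size, config.vocab_size)

The processed prediction scores.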

TensorFlow

class transformers.TFForcedBOSTokenLogitsProcessor

( bos_token_id: int )

Parameters

bos_token_id (int) — The id of the token to force as the first generated token.

TFLogitsProcessor that enforces the specified token as the first generated token.

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFForcedEOSTokenLogitsProcessor

( max_length: int eos_token_id: int )

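Parameters

max_length (int) — The maximum length of the sequence to be generated.
eos_token_id (int) — The id of the token to force as the last generated token when max_length is reached.

TFLogitsProcessor that enforces the specified token as the last generated token when max_length is reached.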

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFForceTokensLogitsProcessor

( force_token_map: list )

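This processor takes a list of pairs of integers which indicates a mapping from generation indices to token indices that will be forced before sampling. The processor sets the log probability of each forced token to `0` and that of all other tokens to `-inf`, so that the forced token is sampled at its corresponding index.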
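
A framework-free sketch of this mapping is shown below. The function name force_tokens and the use of plain lists are assumptions for illustration; each pair is read as (generation index, token id).

```python
NEG_INF = float("-inf")


def force_tokens(scores, cur_len, force_token_map):
    """At a mapped generation index, leave only the forced token with a finite score."""
    forced = dict(force_token_map)  # generation index -> token id
    token = forced.get(cur_len)
    if token is None:
        return list(scores)  # nothing is forced at this step
    out = [NEG_INF] * len(scores)
    out[token] = 0.0
    return out


force_token_map = [(1, 3), (2, 0)]
step_one = force_tokens([0.1, 0.2, 0.3, 0.4], cur_len=1, force_token_map=force_token_map)
step_five = force_tokens([0.1, 0.2, 0.3, 0.4], cur_len=5, force_token_map=force_token_map)
```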

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFLogitsProcessor

( )

Abstract base class for all logit processors that can be applied during generation.

call

( input_ids: Tensor scores: Tensor cur_len: int ) → `tf.Tensor` of shape `(batch_size, config.vocab_size)`

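Parameters

input_ids (tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (tf.Tensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.
cur_len (int) — The current length of valid input sequence tokens. In the TF implementation, the sequence length of input_ids is the maximum length that `generate` can produce, so we need to know which of its tokens are valid.
kwargs (dict[str, Any], optional) — Additional logits processor specific kwargs.

`tf.Tensor` of shape `(batch_size, config.vocab_size)`

The processed prediction scores.

TF method for processing logits.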

class transformers.TFLogitsProcessorList

( iterable = () )

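This class can be used to create a list of TFLogitsProcessor to subsequently process a `scores` input tensor. This class inherits from list and adds a specific *__call__* method to apply each TFLogitsProcessor to the inputs.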
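
The "list with a call method" pattern can be sketched in plain Python as follows. The class ProcessorList and the two toy processors are hypothetical stand-ins, not library classes; they only show how each processor receives the scores returned by the previous one.

```python
class ProcessorList(list):
    """A list of callables that are applied to the scores one after another."""

    def __call__(self, input_ids, scores, cur_len):
        for processor in self:
            scores = processor(input_ids, scores, cur_len)
        return scores


def halve(input_ids, scores, cur_len):
    """Toy processor: scale every score by 0.5."""
    return [s / 2 for s in scores]


def block_first_token(input_ids, scores, cur_len):
    """Toy processor: make token 0 impossible to pick."""
    return [float("-inf")] + list(scores[1:])


processors = ProcessorList([halve, block_first_token])
result = processors([5, 6], [4.0, 2.0, 8.0], cur_len=2)
```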

call

( input_ids: Tensor scores: Tensor cur_len: int **kwargs ) → `tf.Tensor` of shape `(batch_size, config.vocab_size)`

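Parameters

input_ids (tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (tf.Tensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.
cur_len (int) — The current length of valid input sequence tokens. In the TF implementation, the sequence length of input_ids is the maximum length that `generate` can produce, so we need to know which of its tokens are valid.
kwargs (dict[str, Any], optional) — Additional logits processor specific kwargs.

`tf.Tensor` of shape `(batch_size, config.vocab_size)`

The processed prediction scores.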

class transformers.TFLogitsWarper

( )

Abstract base class for all logit warpers that can be applied during generation with multinomial sampling.

call

( input_ids: Tensor scores: Tensor cur_len: int ) → `tf.Tensor` of shape `(batch_size, config.vocab_size)`

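Parameters

input_ids (tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (tf.Tensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.
cur_len (int) — The current length of valid input sequence tokens. In the TF implementation, the sequence length of input_ids is the maximum length that `generate` can produce, so we need to know which of its tokens are valid.
kwargs (dict[str, Any], optional) — Additional logits processor specific kwargs.

`tf.Tensor` of shape `(batch_size, config.vocab_size)`

The processed prediction scores.

TF method for warping logits.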

class transformers.TFMinLengthLogitsProcessor

( min_length: int eos_token_id: int )

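Parameters

min_length (int) — The minimum length below which the score of `eos_token_id` is set to `-float("Inf")`.
eos_token_id (int) — The id of the *end-of-sequence* token.

TFLogitsProcessor enforcing a minimum length by setting the EOS probability to 0.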
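
Setting the EOS score to negative infinity is what makes its probability exactly 0 after the softmax. A minimal sketch follows, assuming the mask applies while the current length is strictly below min_length, as the parameter description states; the function name is hypothetical.

```python
import math

NEG_INF = float("-inf")


def enforce_min_length(scores, cur_len, min_length, eos_token_id):
    """Block the EOS token while the sequence is shorter than min_length."""
    out = list(scores)
    if cur_len < min_length:
        out[eos_token_id] = NEG_INF
    return out


too_short = enforce_min_length([0.5, 1.5, 2.5], cur_len=3, min_length=5, eos_token_id=2)
long_enough = enforce_min_length([0.5, 1.5, 2.5], cur_len=5, min_length=5, eos_token_id=2)
eos_weight = math.exp(too_short[2])  # unnormalised softmax weight of the blocked token
```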

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFNoBadWordsLogitsProcessor

( bad_words_ids: list eos_token_id: int )

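Parameters

bad_words_ids (list[list[int]]) — List of lists of token ids that are not allowed to be generated. To get the tokens of the words that should not appear in the generated text, make sure to set `add_prefix_space=True` when initializing the tokenizer, and use `tokenizer(bad_words, add_special_tokens=False).input_ids`. The `add_prefix_space` argument is only supported for some slow tokenizers, as the prefixing behaviour of fast tokenizers comes from `pre_tokenizers`. Read more here.
eos_token_id (int) — The id of the *end-of-sequence* token.

TFLogitsProcessor that enforces that specified sequences will never be sampled.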

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFNoRepeatNGramLogitsProcessor

( ngram_size: int )

Parameters

ngram_size (int) — All n-grams of size `ngram_size` can only occur once.

TFLogitsProcessor that enforces no repetition of n-grams. See Fairseq.
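
The core of the check is to find which next tokens would complete an n-gram that has already appeared. The sketch below works on a single list of token ids; the function name banned_next_tokens is an illustrative assumption.

```python
def banned_next_tokens(input_ids, ngram_size):
    """Return the tokens that would repeat an n-gram already present in input_ids."""
    if len(input_ids) + 1 < ngram_size:
        return set()  # not enough tokens yet to form any n-gram
    # The last (ngram_size - 1) tokens are the prefix of the n-gram being built.
    prefix = tuple(input_ids[len(input_ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(input_ids) - ngram_size + 1):
        ngram = input_ids[start:start + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


repeat = banned_next_tokens([1, 2, 3, 1, 2], ngram_size=3)  # (1, 2, 3) was already seen
fresh = banned_next_tokens([1, 2, 3], ngram_size=2)
```

A processor built on this would set the scores of the returned tokens to negative infinity before sampling.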

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFRepetitionPenaltyLogitsProcessor

( penalty: float )

Parameters

penalty (float) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.

TFLogitsProcessor enforcing an exponential penalty on repeated sequences.
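
One common way to apply such a penalty, sketched here in plain Python, is to divide positive scores of already generated tokens by the penalty and multiply negative ones by it, so that a penalty above 1 always lowers the score. The function name and the unbatched inputs are assumptions for illustration.

```python
def apply_repetition_penalty(input_ids, scores, penalty):
    """Lower the scores of tokens that already appear in input_ids."""
    out = list(scores)
    for token in set(input_ids):
        # Dividing a negative score would raise it, so negative scores are multiplied.
        out[token] = out[token] * penalty if out[token] < 0 else out[token] / penalty
    return out


penalized = apply_repetition_penalty([0, 1], [2.0, -2.0, 1.0], penalty=2.0)
unchanged = apply_repetition_penalty([0, 1], [2.0, -2.0, 1.0], penalty=1.0)
```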

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFSuppressTokensAtBeginLogitsProcessor

( begin_suppress_tokens begin_index )

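TFSuppressTokensAtBeginLogitsProcessor suppresses a list of tokens as soon as the `generate` function starts generating using `begin_index` tokens. This should ensure that the tokens defined by `begin_suppress_tokens` are not sampled at the beginning of the generation.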

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFSuppressTokensLogitsProcessor

( suppress_tokens )

This processor can be used to suppress a list of tokens. The processor sets their log probabilities to `-inf` so that they are not sampled.

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFTemperatureLogitsWarper

( temperature: float )

Parameters

temperature (float) — The value used to modulate the logits distribution.

TFLogitsWarper for temperature (exponential scaling of the output probability distribution).

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFTopKLogitsWarper

( top_k: int filter_value: float = -inf min_tokens_to_keep: int = 1 )

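Parameters

top_k (int) — The number of highest probability vocabulary tokens to keep for top-k filtering.
filter_value (float, optional, defaults to -inf) — All filtered values will be set to this float value.
min_tokens_to_keep (int, optional, defaults to 1) — Minimum number of tokens that cannot be filtered.

TFLogitsWarper that performs top-k filtering, i.e. restricts sampling to the k highest probability elements.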

call

( input_ids: Tensor scores: Tensor cur_len: int )

class transformers.TFTopPLogitsWarper

( top_p: float filter_value: float = -inf min_tokens_to_keep: int = 1 )

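Parameters

top_p (float) — If set to < 1, only the smallest set of most probable tokens with probabilities that add up to `top_p` or higher are kept for generation.
filter_value (float, optional, defaults to -inf) — All filtered values will be set to this float value.
min_tokens_to_keep (int, optional, defaults to 1) — Minimum number of tokens that cannot be filtered.

TFLogitsWarper that performs top-p filtering, i.e. restricts sampling to the smallest set of top tokens whose cumulative probability reaches `top_p`.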

call

( input_ids: Tensor scores: Tensor cur_len: int )

FLAX

class transformers.FlaxForcedBOSTokenLogitsProcessor

( bos_token_id: int )

Parameters

bos_token_id (int) — The id of the token to force as the first generated token.

FlaxLogitsProcessor that enforces the specified token as the first generated token.

call

( input_ids: Array scores: Array cur_len: int )

class transformers.FlaxForcedEOSTokenLogitsProcessor

( max_length: int eos_token_id: int )

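Parameters

max_length (int) — The maximum length of the sequence to be generated.
eos_token_id (int) — The id of the token to force as the last generated token when max_length is reached.

FlaxLogitsProcessor that enforces the specified token as the last generated token when max_length is reached.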

call

( input_ids: Array scores: Array cur_len: int )

class transformers.FlaxForceTokensLogitsProcessor

( force_token_map )

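Parameters

force_token_map (list) — Map giving token ids and the indices at which they will be forced to be sampled.

FlaxLogitsProcessor that takes a list of pairs of integers which indicates a mapping from generation indices to token indices that will be forced before sampling. The processor sets the log probability of each forced token to 0 and that of all other tokens to -inf, so that the forced token is sampled at its corresponding index.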

call

( input_ids: Array scores: Array cur_len: int )

class transformers.FlaxLogitsProcessor

( )

Abstract base class for all logit processors that can be applied during generation.

call

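( input_ids: Array scores: Array ) → jnp.ndarray of shape (batch_size, config.vocab_size)

Parameters

input_ids (jnp.ndarray of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (jnp.ndarray of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.
kwargs (dict[str, Any], optional) — Additional logits processor specific kwargs.

jnp.ndarray of shape (batch_size, config.vocab_size)

The processed prediction scores.

Flax method for processing logits.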

class transformers.FlaxLogitsProcessorList

( iterable = () )

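This class can be used to create a list of FlaxLogitsProcessor or FlaxLogitsWarper to subsequently process a scores input tensor. This class inherits from list and adds a specific __call__ method to apply each FlaxLogitsProcessor or FlaxLogitsWarper to the inputs.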

call

( input_ids: Array scores: Array cur_len: int **kwargs ) → jnp.ndarray of shape (batch_size, config.vocab_size)

Parameters

input_ids (jnp.ndarray of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (jnp.ndarray of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.
kwargs (dict[str, Any], optional) — Additional logits processor specific kwargs.

jnp.ndarray of shape (batch_size, config.vocab_size)

The processed prediction scores.

class transformers.FlaxLogitsWarper

( )

Abstract base class for all logit warpers that can be applied during generation with multinomial sampling.

call

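( input_ids: Array scores: Array ) → jnp.ndarray of shape (batch_size, config.vocab_size)

Parameters

input_ids (jnp.ndarray of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (jnp.ndarray of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be logits for each vocabulary token when not using beam search, or log softmax for each vocabulary token when using beam search.
kwargs (dict[str, Any], optional) — Additional logits processor specific kwargs.

jnp.ndarray of shape (batch_size, config.vocab_size)

The processed prediction scores.

Flax method for warping logits.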

class transformers.FlaxMinLengthLogitsProcessor

( min_length: int eos_token_id: int )

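Parameters

min_length (int) — The minimum length below which the score of eos_token_id is set to -float("Inf").
eos_token_id (int) — The id of the end-of-sequence token.

FlaxLogitsProcessor enforcing a minimum length by setting the EOS probability to 0.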

call

( input_ids: Array scores: Array cur_len: int )

class transformers.FlaxSuppressTokensAtBeginLogitsProcessor

( begin_suppress_tokens begin_index )

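Parameters

begin_suppress_tokens (list[int]) — Tokens not to sample.
begin_index (int) — Index at which the tokens are suppressed.

FlaxLogitsProcessor that suppresses a list of tokens as soon as the generate function starts generating using begin_index tokens. This should ensure that the tokens defined by begin_suppress_tokens are not sampled at the beginning of the generation.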

call

( input_ids scores cur_len: int )

class transformers.FlaxSuppressTokensLogitsProcessor

( suppress_tokens: list )

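Parameters

suppress_tokens (list) — Tokens not to sample.

FlaxLogitsProcessor that suppresses a list of tokens at each decoding step. The processor sets their log probabilities to -inf so that they are not sampled.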

call

( input_ids: Array scores: Array cur_len: int )

class transformers.FlaxTemperatureLogitsWarper

( temperature: float )

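Parameters

temperature (float) — The value used to modulate the logits distribution.

FlaxLogitsWarper for temperature (exponential scaling of the output probability distribution).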

call

( input_ids: Array scores: Array cur_len: int )

class transformers.FlaxTopKLogitsWarper

( top_k: int filter_value: float = -inf min_tokens_to_keep: int = 1 )

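Parameters

top_k (int) — The number of highest probability vocabulary tokens to keep for top-k filtering.
filter_value (float, optional, defaults to -inf) — All filtered values will be set to this float value.
min_tokens_to_keep (int, optional, defaults to 1) — Minimum number of tokens that cannot be filtered.

FlaxLogitsWarper that performs top-k filtering, i.e. restricts sampling to the k highest probability elements.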

call

( input_ids: Array scores: Array cur_len: int )

class transformers.FlaxTopPLogitsWarper

( top_p: float filter_value: float = -inf min_tokens_to_keep: int = 1 )

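Parameters

top_p (float) — If set to < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
filter_value (float, optional, defaults to -inf) — All filtered values will be set to this float value.
min_tokens_to_keep (int, optional, defaults to 1) — Minimum number of tokens that cannot be filtered.

FlaxLogitsWarper that performs top-p filtering, i.e. restricts sampling to the smallest set of top tokens whose cumulative probability reaches top_p.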

call

( input_ids: Array scores: Array cur_len: int )
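The sketch below shows one way to read the top_p description: keep the smallest set of most probable tokens whose probabilities add up to at least top_p. It is a pure-Python illustration on a single row, with hypothetical helper names (`softmax`, `top_p_filter`), and is not the Flax implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(score - peak) for score in row]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(row, top_p, filter_value=-math.inf, min_tokens_to_keep=1):
    """Keep the most probable tokens until their cumulative probability reaches top_p."""
    probs = softmax(row)
    order = sorted(range(len(row)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        # Stop once the kept set already covers top_p and is large enough.
        if cumulative >= top_p and len(keep) >= min_tokens_to_keep:
            break
        keep.add(i)
        cumulative += probs[i]
    return [score if i in keep else filter_value for i, score in enumerate(row)]


# Scores chosen so that the probabilities are 0.5, 0.3, 0.15 and 0.05.
row = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
filtered = top_p_filter(row, top_p=0.7)
```

With `top_p=0.7`, the first two tokens (0.5 + 0.3) already cover the threshold, so the last two are filtered.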

class transformers.FlaxWhisperTimeStampLogitsProcessor

( generate_config model_config decoder_input_length )

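Parameters

generate_config (GenerateConfig) — The generation configuration used to generate the output. The following parameters are required:
eos_token_id (int, optional, defaults to 50257) — The ID of the end-of-sequence token.
no_timestamps_token_id (int, optional, defaults to 50363) — The ID of the "<|notimestamps|>" token.
max_initial_timestamp_index (int, optional, defaults to 1) — Used to set the maximum value of the initial timestamp. This prevents the model from predicting timestamps that are too far in the future.

Whisper-specific processor. This processor can be used to force a list of tokens. The processor sets their log probabilities to inf so that they are sampled at their corresponding index.

__call__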

( input_ids scores cur_len )

StoppingCriteria

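A StoppingCriteria can be used to change when to stop generation (other than the EOS token). Note that this is only available for our PyTorch implementations.

class transformers.StoppingCriteria

( )

Abstract base class for all stopping criteria that can be applied during generation.

If your stopping criteria depends on the scores input, make sure you pass return_dict_in_generate=True, output_scores=True to generate.

__call__

( input_ids: LongTensor scores: FloatTensor **kwargs ) → torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be scores for each vocabulary token before SoftMax or scores for each vocabulary token after SoftMax. If this stopping criteria depends on the scores input, make sure you pass return_dict_in_generate=True, output_scores=True to generate.
kwargs (dict[str, Any], optional) — Additional stopping-criteria-specific kwargs.

torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

True indicates we stop generation for a particular row. False indicates we should continue.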

class transformers.StoppingCriteriaList

( iterable = () )

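__call__

( input_ids: LongTensor scores: FloatTensor **kwargs ) → torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be scores for each vocabulary token before SoftMax or scores for each vocabulary token after SoftMax. If this stopping criteria depends on the scores input, make sure you pass return_dict_in_generate=True, output_scores=True to generate.
kwargs (dict[str, Any], optional) — Additional stopping-criteria-specific kwargs.

torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

True indicates we stop generation for a particular row. False indicates we should continue.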

class transformers.MaxLengthCriteria

( max_length: int max_position_embeddings: typing.Optional[int] = None )

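Parameters

max_length (int) — The maximum number of tokens that the output sequence can have.
max_position_embeddings (int, optional) — The maximum model length, as defined by the model's config.max_position_embeddings attribute.

This class can be used to stop generation whenever the full generated number of tokens exceeds max_length. Keep in mind that for decoder-only transformers, this includes the initial prompt tokens.

__call__

( input_ids: LongTensor scores: FloatTensor **kwargs ) → torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be scores for each vocabulary token before SoftMax or scores for each vocabulary token after SoftMax. If this stopping criteria depends on the scores input, make sure you pass return_dict_in_generate=True, output_scores=True to generate.
kwargs (dict[str, Any], optional) — Additional stopping-criteria-specific kwargs.

torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

True indicates we stop generation for a particular row. False indicates we should continue.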

class transformers.MaxTimeCriteria

( max_time: float initial_timestamp: typing.Optional[float] = None )

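Parameters

max_time (float) — The maximum allowed time in seconds for the generation.
initial_timestamp (float, optional, defaults to time.time()) — The start of the allowed generation time.

This class can be used to stop generation whenever the full generation exceeds some amount of time. By default, the time starts being counted when this object is initialized. You can override this by passing an initial_timestamp.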
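To make the timing rule concrete, here is a small standalone sketch that mimics the behaviour described above using only the standard library. `SimpleMaxTime` is a hypothetical toy class, not `transformers.MaxTimeCriteria`; it returns a plain list of booleans (one per batch row) instead of a `torch.BoolTensor`.

```python
import time


class SimpleMaxTime:
    """Toy time-based stopping rule (illustrative only)."""

    def __init__(self, max_time, initial_timestamp=None):
        self.max_time = max_time
        # By default the clock starts when the object is created.
        self.initial_timestamp = time.time() if initial_timestamp is None else initial_timestamp

    def __call__(self, batch_size):
        # The same decision applies to every row of the batch.
        is_done = time.time() - self.initial_timestamp > self.max_time
        return [is_done] * batch_size


fresh = SimpleMaxTime(max_time=60.0)
# Pretend generation started five seconds ago with a one-second budget.
expired = SimpleMaxTime(max_time=1.0, initial_timestamp=time.time() - 5.0)
```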

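__call__

( input_ids: LongTensor scores: FloatTensor **kwargs ) → torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be scores for each vocabulary token before SoftMax or scores for each vocabulary token after SoftMax. If this stopping criteria depends on the scores input, make sure you pass return_dict_in_generate=True, output_scores=True to generate.
kwargs (dict[str, Any], optional) — Additional stopping-criteria-specific kwargs.

torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

True indicates we stop generation for a particular row. False indicates we should continue.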

class transformers.StopStringCriteria

( tokenizer: PreTrainedTokenizerBase stop_strings: typing.Union[str, list[str]] )

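Parameters

tokenizer (PreTrainedTokenizer) — The model's associated tokenizer (needed to extract the vocabulary and to tokenize the termination sequences).
stop_strings (Union[str, list[str]]) — A list of strings that should end generation. If a single string is passed, it is treated as a list with one element.

This class can be used to stop generation whenever specific string sequences are generated. It preprocesses the strings together with the tokenizer vocabulary to find the positions where tokens can validly complete the stop strings.

Generation is stopped as soon as a token is generated that completes any of the stop strings. We want to catch any instance in which the stop string would be present in the decoded output, which means we must also catch cases with “overhangs” off one or both ends. To make this more concrete, for the stop string “stop”, any of the following token sequences would trigger the match: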

[“st”, “op”]
[“stop”]
[“st”, “opera”]
[“sto”, “pper”]
[“las”, “topper”]
[“s”, “to”, “pped”]

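Note that a match is only triggered when the stop string is at the end of the generated sequence. In other words, these sequences will not trigger a match: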

[“stop”, “at”]
[“st”, “op”, “at”]
[“st”, “opera”, “tion”]

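The reason these are not a match is that the stop string does not overlap with the final token. If you can remove one or more tokens from the end of the sequence without destroying the stop string, then this criterion will not match that stop string. This is by design; because this check is run after each token is generated, we can't miss a valid stop string if one is generated, but we don't want to halt generation just because the stop string exists somewhere in the past input_ids.

How is the match actually performed, though? We do it in quite a confusing way, because we want the entire match process to be compilable with Torch or XLA, which means we cannot use standard string methods. However, it is possible, with some work, to do string matching with pure tensor operations. We first describe the algorithm using standard string operations, and then explain at the end how this is converted to pure tensor operations.

The key to the algorithm is an observation: because the stop string must overlap with the end of the token sequence, we can start at the end of the sequence and work backwards. Specifically, we check that there is an overlap between the start of the final token and the end of the stop string, or to put it another way, stop_string[-i:] == token[:i] for some i > 0. If you look at the positive examples above, you'll see the last token in all of them fulfills this property:

[“st”, “op”] (overlap is “op”, overlap length == 2)
[“stop”] (overlap is “stop”, overlap length == 4)
[“st”, “opera”] (overlap is “op”, overlap length == 2)
[“sto”, “pper”] (overlap is “p”, overlap length == 1)
[“las”, “topper”] (overlap is “top”, overlap length == 3)
[“s”, “to”, “pped”] (overlap is “p”, overlap length == 1)

It's impossible to construct a matching sequence that does not have this property (feel free to verify this yourself). However, although this overlap between the start of the final token and the end of the stop string is necessary for a match, it is not sufficient. We also need to check that the rest of the token sequence is consistent with the stop string.

How do we do that? Let's use [“s”, “to”, “pped”] as an example. We know that the final token, “pped”, has an overlap of 1 with the stop string, “stop”. We then go back to the previous token, “to”. Since we have already matched 1 character from the stop string, the remainder to check is “sto”. We check that the next token “to” matches the end of the remainder, which it does. We have now matched 3 characters from the stop string, and the remainder to match is “s”. We go back to the previous token again, which is also “s”. This is a match, and so we have matched the entire stop string.

How does it work when the tokens run off the start of the stop string, though? Let's consider the example of [“las”, “topper”]. The final token, “topper”, has an overlap of 3 with the stop string, “stop”. Therefore, the remaining stop string to match is “s”. We go back to the previous token, “las”. Because the remainder to match is just “s”, with length 1, we consider only the final 1 character from the token, which is “s”. This matches the stop string, and so the entire string is matched.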
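The string-based version of the backward matching described above can be sketched in plain Python as follows. This is an illustrative reading of the prose, not the library's code; the helper names `end_overlaps` and `matches_stop_string` are hypothetical, and tokens are represented directly as strings.

```python
def end_overlaps(token, stop_string):
    """All i > 0 such that stop_string[-i:] == token[:i]."""
    limit = min(len(token), len(stop_string))
    return [i for i in range(1, limit + 1) if stop_string[-i:] == token[:i]]


def matches_stop_string(tokens, stop_string):
    """True if the stop string ends inside the final token of `tokens`."""
    if not tokens:
        return False
    *earlier, last = tokens
    for overlap in end_overlaps(last, stop_string):
        # Part of the stop string that earlier tokens still have to supply.
        remaining = stop_string[: len(stop_string) - overlap]
        for token in reversed(earlier):
            if not remaining:
                break
            if len(token) >= len(remaining):
                # The token may run off the start of the stop string.
                remaining = "" if token.endswith(remaining) else None
            elif remaining.endswith(token):
                remaining = remaining[: len(remaining) - len(token)]
            else:
                remaining = None
            if remaining is None:
                break  # mismatch for this overlap length
        if remaining == "":
            return True
    return False


positives = [["st", "op"], ["stop"], ["st", "opera"], ["sto", "pper"],
             ["las", "topper"], ["s", "to", "pped"]]
negatives = [["stop", "at"], ["st", "op", "at"], ["st", "opera", "tion"]]
```

Every sequence in `positives` matches the stop string "stop" and every sequence in `negatives` does not, in line with the examples listed earlier.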

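How do we compute these matches with tensor operations, though? Simply: we efficiently precompute the necessary information for all tokens! For every token, we compute:

Its overlap with the end of the stop string, if any
The positions inside the stop string where the token matches, including matches that run off the start.
The total length of the token

For example, for the token “pped”, we would compute an end overlap of 1, no internal matching positions, and a length of 4. For the token “to”, we would compute no end overlap, a single internal matching position of 1 (counting from the end), and a length of 2. For the token “s”, we would compute no end overlap, a single internal matching position of 3 (again counting from the end), and a length of 1.

As long as we have this information, we can execute the algorithm above without any string comparison operations. We simply perform the following steps:

Check if the final token has an end overlap with the stop string
Continue backwards, keeping track of how much of the stop string we've matched so far
At each point, check if the next token has the current position as one of its valid positions
Continue until either a match fails, or we completely match the whole stop string

Again, consider [“s”, “to”, “pped”] as an example. “pped” has an end overlap of 1, so we can begin a match. We have matched 1 character so far, so we check that the next token “to” has 1 as a valid position (again, counting from the end). It does, so we add the length of “to” to our position tracker. We have now matched 3 characters, so we check that the next token “s” has 3 as a valid position. It does, so we add its length to the position tracker. The position tracker is now 4, which is the length of the stop string. We have matched the entire stop string.

In the second case, [“las”, “topper”], “topper” has an end overlap of 3, so we can begin a match. We have matched 3 characters so far, so we check that the next token “las” has 3 as a valid position. It does, because we allow tokens to match positions that run off the start of the stop string. We add its length to the position tracker. The position tracker is now 6, which is greater than the length of the stop string! Don't panic, though: this also counts as a match of the stop string. We have matched the entire stop string.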
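The precomputation idea can be sketched in plain Python: all string comparisons happen once per vocabulary entry, and the per-step check only uses integer lookups. This is a simplified illustration with hypothetical names (`precompute`, `matches`) and Python sets in place of the padded tensors a real tensor implementation would need.

```python
def precompute(vocab, stop_string):
    """Per token: (end overlaps, valid positions counted from the end, length)."""
    n = len(stop_string)
    table = []
    for token in vocab:
        overlaps = {i for i in range(1, min(len(token), n) + 1)
                    if stop_string[-i:] == token[:i]}
        positions = set()
        for matched in range(1, n):
            remaining = stop_string[: n - matched]
            if len(token) >= len(remaining):
                fits = token.endswith(remaining)  # token may run off the start
            else:
                fits = remaining.endswith(token)
            if fits:
                positions.add(matched)
        table.append((overlaps, positions, len(token)))
    return table


def matches(token_ids, table, stop_len):
    """Backward scan that only uses lookups into the precomputed table."""
    if not token_ids:
        return False
    for start in table[token_ids[-1]][0]:
        matched = start  # position tracker
        for token_id in reversed(token_ids[:-1]):
            if matched >= stop_len:
                break
            _, positions, length = table[token_id]
            if matched not in positions:
                matched = -1  # mismatch: this starting overlap fails
                break
            matched += length
        if matched >= stop_len:
            return True
    return False


vocab = ["s", "to", "pped", "las", "topper", "stop", "at"]
table = precompute(vocab, "stop")
ids = {token: i for i, token in enumerate(vocab)}
```

The table entries for “pped”, “to” and “s” reproduce the values quoted in the text: (overlap 1, no positions, length 4), (no overlap, position 1, length 2) and (no overlap, position 3, length 1).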

Example

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
>>> inputs = tokenizer("The biggest states in the USA by land area:", return_tensors="pt")

>>> gen_out = model.generate(**inputs)
>>> print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
The biggest states in the USA by land area:
- Alaska
- Texas
- California

>>> # Passing one or more stop strings will halt generation after those strings are emitted
>>> # Note that generating with stop strings requires you to pass the tokenizer too
>>> gen_out = model.generate(**inputs, stop_strings=["Texas"], tokenizer=tokenizer)
>>> print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
The biggest states in the USA by land area:
- Alaska
- Texas

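__call__

( input_ids: LongTensor scores: FloatTensor **kwargs ) → torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be scores for each vocabulary token before SoftMax or scores for each vocabulary token after SoftMax. If this stopping criteria depends on the scores input, make sure you pass return_dict_in_generate=True, output_scores=True to generate.
kwargs (dict[str, Any], optional) — Additional stopping-criteria-specific kwargs.

torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

True indicates we stop generation for a particular row. False indicates we should continue.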

class transformers.EosTokenCriteria

( eos_token_id: typing.Union[int, list[int], torch.Tensor] )

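Parameters

eos_token_id (Union[int, list[int], torch.Tensor]) — The ID(s) of the end-of-sequence token.

This class can be used to stop generation whenever the end-of-sequence token is generated. By default, it uses the model.generation_config.eos_token_id.

__call__

( input_ids: LongTensor scores: FloatTensor **kwargs ) → torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
scores (torch.FloatTensor of shape (batch_size, config.vocab_size)) — Prediction scores of a language modeling head. These can be scores for each vocabulary token before SoftMax or scores for each vocabulary token after SoftMax. If this stopping criteria depends on the scores input, make sure you pass return_dict_in_generate=True, output_scores=True to generate.
kwargs (dict[str, Any], optional) — Additional stopping-criteria-specific kwargs.

torch.BoolTensor. (torch.BoolTensor of shape (batch_size, 1))

True indicates we stop generation for a particular row. False indicates we should continue.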

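Constraints

A Constraint can be used to force the generation to include specific tokens or sequences in the output. Note that this is only available for our PyTorch implementations.

class transformers.Constraint

( )

Abstract base class for all constraints that can be applied during generation. It must define how the constraint can be satisfied.

All classes that inherit Constraint must follow the requirement that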

completed = False
while not completed:
    _, completed, _ = constraint.update(constraint.advance())

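will always terminate (halt).

advance

( ) → token_ids (Union[int, list[int], None])

token_ids (Union[int, list[int], None])

A single token ID (int) that advances the constraint, or
a list of token IDs that could advance the constraint
None if the constraint is completed or cannot be advanced

When called, returns the token(s) that would take this constraint one step closer to being fulfilled.

copy

( stateful = False ) → constraint(Constraint)

Parameters

stateful(bool) — Whether to not only copy the constraint for a new instance, but also its state.

constraint(Constraint)

The same constraint as the one being called from.

Creates a new instance of this constraint.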

does_advance

( token_id: int )

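Reads in a token and returns whether it creates progress.

remaining

( )

Returns the number of remaining steps of advance() needed to complete this constraint.

reset

( )

Resets the state of this constraint to its initialization. We would call this in cases where the fulfillment of a constraint is interrupted by an unwanted token.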

test

( )

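Tests whether this constraint has been properly defined.

update

( token_id: int ) → stepped(bool)

Parameters

token_id(int) — The ID of a newly generated token in the beam search.

stepped(bool)

Whether this constraint has become one step closer to being fulfilled.
completed(bool) — Whether this constraint has been completely fulfilled by this token being generated.
reset (bool) — Whether this constraint has reset its progress by this token being generated.

Reads in a token and returns booleans that indicate the progress made by it. This function will update the state of this object, unlike does_advance(self, token_id: int).

This isn't to test whether a certain token will advance the progress; it's to update its state as if it has been generated. This becomes important if token_id != desired token (refer to the else statement in PhrasalConstraint).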
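To illustrate the contract described above (advance, does_advance, update, reset, remaining), here is a toy constraint written in plain Python. `ToyPhraseConstraint` is a hypothetical class for illustration only; it does not inherit from `transformers.Constraint` and is not how `PhrasalConstraint` is implemented. It assumes `update` returns the three booleans (stepped, completed, reset) listed in the description.

```python
class ToyPhraseConstraint:
    """Toy constraint: the given token IDs must be generated in order."""

    def __init__(self, token_ids):
        self.token_ids = list(token_ids)
        self.fulfilled_idx = -1  # index of the last satisfied token
        self.completed = False

    def advance(self):
        """Next token that would make progress, or None when completed."""
        if self.completed:
            return None
        return self.token_ids[self.fulfilled_idx + 1]

    def does_advance(self, token_id):
        return not self.completed and token_id == self.token_ids[self.fulfilled_idx + 1]

    def remaining(self):
        return len(self.token_ids) - (self.fulfilled_idx + 1)

    def reset(self):
        self.fulfilled_idx = -1
        self.completed = False

    def update(self, token_id):
        """Update the state as if token_id had been generated."""
        if self.completed:
            return False, True, False
        stepped = completed = reset = False
        if self.does_advance(token_id):
            self.fulfilled_idx += 1
            stepped = True
            if self.fulfilled_idx == len(self.token_ids) - 1:
                self.completed = completed = True
        else:
            # An unwanted token interrupts the phrase: start over.
            reset = True
            self.reset()
        return stepped, completed, reset


constraint = ToyPhraseConstraint([5, 9, 2])
steps = 0
completed = False
while not completed:
    _, completed, _ = constraint.update(constraint.advance())
    steps += 1
```

Feeding the constraint its own `advance()` output terminates after one step per token, which is the termination requirement stated for every Constraint subclass.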

class transformers.PhrasalConstraint

( token_ids: list )

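Parameters

token_ids (list[int]) — The IDs of the tokens that must be generated by the output.

Constraint enforcing that an ordered sequence of tokens is included in the output.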

class transformers.DisjunctiveConstraint

( nested_token_ids: list )

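Parameters

nested_token_ids (list[list[int]]) — A list of words, where each word is a list of IDs. This constraint is fulfilled by generating just one of the words from the list.

A special Constraint that is fulfilled by fulfilling just one of several constraints.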

class transformers.ConstraintListState

( constraints: list )

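Parameters

constraints (list[Constraint]) — A list of Constraint objects that must be fulfilled by the beam scorer.

A class for beam scorers to track their progress through a list of constraints.

advance

( )

The list of tokens to generate such that we can make progress. By "list" we don't mean the list of tokens that will fully fulfill a constraint.

Given constraints c_i = {t_ij | j == # of tokens}, if we're not in the middle of progressing through a specific constraint c_i, we return:

[t_k1 for k in indices of unfulfilled constraints]

If we are in the middle of a constraint, then we return: [t_ij], where i is the index of the in-progress constraint and j is the next step for the constraint.

Though we don't care which constraint is fulfilled first, if we are in the process of fulfilling a constraint, only the tokens related to that constraint are returned.

reset

( token_ids: typing.Optional[list[int]] )

token_ids: the tokens generated thus far, used to reset the state of the progress through the constraints.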

BeamSearch

class transformers.BeamScorer

( )

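Abstract base class for all beam scorers that are used for PreTrainedModel.beam_search and PreTrainedModel.beam_sample.

process

( input_ids: LongTensor next_scores: FloatTensor next_tokens: LongTensor next_indices: LongTensor **kwargs ) → UserDict

Parameters

input_ids (torch.LongTensor of shape (batch_size * num_beams, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using any class inheriting from PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
next_scores (torch.FloatTensor of shape (batch_size, 2 * num_beams)) — Current scores of the top 2 * num_beams non-finished beam hypotheses.
next_tokens (torch.LongTensor of shape (batch_size, 2 * num_beams)) — input_ids of the tokens corresponding to the top 2 * num_beams non-finished beam hypotheses.
next_indices (torch.LongTensor of shape (batch_size, 2 * num_beams)) — Beam indices indicating to which beam hypothesis the next_tokens correspond.
pad_token_id (int, optional) — The ID of the padding token.
eos_token_id (Union[int, list[int]], optional) — The ID of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.
beam_indices (torch.LongTensor, optional) — Beam indices indicating to which beam hypothesis each token corresponds.
group_index (int, optional) — The index of the group of beams. Used with PreTrainedModel.group_beam_search.

UserDict

A dictionary composed of the fields mentioned above:

next_beam_scores (torch.FloatTensor of shape (batch_size * num_beams)) — Updated scores of all non-finished beams.
next_beam_tokens (torch.FloatTensor of shape (batch_size * num_beams)) — Next tokens to be added to the non-finished beam hypotheses.
next_beam_indices (torch.FloatTensor of shape (batch_size * num_beams)) — Beam indices indicating to which beam the next tokens shall be added.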

finalize

( input_ids: LongTensor next_scores: FloatTensor next_tokens: LongTensor next_indices: LongTensor max_length: int **kwargs ) → torch.LongTensor of shape (batch_size * num_return_sequences, sequence_length)

Parameters

input_ids (torch.LongTensor of shape (batch_size * num_beams, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using any class inheriting from PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
final_beam_scores (torch.FloatTensor of shape (batch_size * num_beams)) — The final scores of all non-finished beams.
final_beam_tokens (torch.FloatTensor of shape (batch_size * num_beams)) — The last tokens to be added to the non-finished beam hypotheses.
final_beam_indices (torch.FloatTensor of shape (batch_size * num_beams)) — The beam indices indicating to which beam the final_beam_tokens shall be added.
pad_token_id (int, optional) — The ID of the padding token.
eos_token_id (Union[int, list[int]], optional) — The ID of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

torch.LongTensor of shape (batch_size * num_return_sequences, sequence_length)

The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.

class transformers.BeamSearchScorer

( batch_size: int num_beams: int device: device length_penalty: typing.Optional[float] = 1.0 do_early_stopping: typing.Union[str, bool, NoneType] = False num_beam_hyps_to_keep: typing.Optional[int] = 1 num_beam_groups: typing.Optional[int] = 1 max_length: typing.Optional[int] = None )

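Parameters

batch_size (int) — Batch size of input_ids for which standard beam search decoding is run in parallel.
num_beams (int) — Number of beams for beam search.
device (torch.device) — Defines the device type (e.g., "cpu" or "cuda") on which this instance of BeamSearchScorer will be allocated.
length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.
do_early_stopping (bool or str, optional, defaults to False) — Controls the stopping condition for beam-based methods, like beam search. It accepts the following values: True, where the generation stops as soon as there are num_beams complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
num_beam_hyps_to_keep (int, optional, defaults to 1) — The number of beam hypotheses that shall be returned upon calling finalize().
num_beam_groups (int, optional, defaults to 1) — Number of groups to divide num_beams into in order to ensure diversity among different groups of beams. See this paper for more details.
max_length (int, optional) — The maximum length of the sequence to be generated.

BeamScorer implementing standard beam search decoding.

Adapted in part from Facebook's XLM beam search code.

Reference for the diverse beam search algorithm and implementation: Ashwin Kalyan's DBS implementation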
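The effect of length_penalty can be checked with a few lines of arithmetic. The sketch below follows the parameter description (the summed log-likelihood is divided by the sequence length raised to length_penalty); the function name `beam_score` and the example numbers are made up for illustration, and which length the library actually uses (for example, with or without the prompt) is not covered here.

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalised score: log-likelihood / (length ** length_penalty)."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished hypotheses: (summed log-probability, length).
short_hyp = (-4.0, 4)
long_hyp = (-7.0, 10)

with_penalty = (beam_score(*short_hyp, length_penalty=1.0),
                beam_score(*long_hyp, length_penalty=1.0))
no_penalty = (beam_score(*short_hyp, length_penalty=0.0),
              beam_score(*long_hyp, length_penalty=0.0))
negative_penalty = (beam_score(*short_hyp, length_penalty=-1.0),
                    beam_score(*long_hyp, length_penalty=-1.0))
```

With `length_penalty=1.0` the longer hypothesis scores higher (about -0.7 versus -1.0); with `0.0` or `-1.0` the shorter one wins, which matches the statement that positive values promote longer sequences.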

process

( input_ids: LongTensor next_scores: FloatTensor next_tokens: LongTensor next_indices: LongTensor pad_token_id: typing.Union[int, torch.Tensor, NoneType] = None eos_token_id: typing.Union[int, list[int], torch.Tensor, NoneType] = None beam_indices: typing.Optional[torch.LongTensor] = None group_index: typing.Optional[int] = 0 decoder_prompt_len: typing.Optional[int] = 0 )

finalize

( input_ids: LongTensor final_beam_scores: FloatTensor final_beam_tokens: LongTensor final_beam_indices: LongTensor max_length: int pad_token_id: typing.Union[int, torch.Tensor, NoneType] = None eos_token_id: typing.Union[int, list[int], torch.Tensor, NoneType] = None beam_indices: typing.Optional[torch.LongTensor] = None decoder_prompt_len: typing.Optional[int] = 0 )

class transformers.ConstrainedBeamSearchScorer

( batch_size: int num_beams: int constraints: list device: device length_penalty: typing.Optional[float] = 1.0 do_early_stopping: typing.Union[str, bool, NoneType] = False num_beam_hyps_to_keep: typing.Optional[int] = 1 num_beam_groups: typing.Optional[int] = 1 max_length: typing.Optional[int] = None )

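Parameters

batch_size (int) — Batch size of input_ids for which standard beam search decoding is run in parallel.
num_beams (int) — Number of beams for beam search.
constraints (list[Constraint]) — A list of positive constraints represented as Constraint objects that must be fulfilled in the generation output. For more information, see the documentation of Constraint.
device (torch.device) — Defines the device type (e.g., "cpu" or "cuda") on which this instance of BeamSearchScorer will be allocated.
length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.
do_early_stopping (bool or str, optional, defaults to False) — Controls the stopping condition for beam-based methods, like beam search. It accepts the following values: True, where the generation stops as soon as there are num_beams complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
num_beam_hyps_to_keep (int, optional, defaults to 1) — The number of beam hypotheses that shall be returned upon calling finalize().
num_beam_groups (int, optional, defaults to 1) — Number of groups to divide num_beams into in order to ensure diversity among different groups of beams. See this paper for more details.
max_length (int, optional) — The maximum length of the sequence to be generated.

BeamScorer implementing constrained beam search decoding.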

process

( input_ids: LongTensor next_scores: FloatTensor next_tokens: LongTensor next_indices: LongTensor scores_for_all_vocab: FloatTensor pad_token_id: typing.Union[int, torch.Tensor, NoneType] = None eos_token_id: typing.Union[int, list[int], torch.Tensor, NoneType] = None beam_indices: typing.Optional[torch.LongTensor] = None decoder_prompt_len: typing.Optional[int] = 0 ) → UserDict

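Parameters

input_ids (torch.LongTensor of shape (batch_size * num_beams, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using any class inheriting from PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
next_scores (torch.FloatTensor of shape (batch_size, 2 * num_beams)) — Current scores of the top 2 * num_beams non-finished beam hypotheses.
next_tokens (torch.LongTensor of shape (batch_size, 2 * num_beams)) — input_ids of the tokens corresponding to the top 2 * num_beams non-finished beam hypotheses.
next_indices (torch.LongTensor of shape (batch_size, 2 * num_beams)) — Beam indices indicating to which beam hypothesis the next_tokens correspond.
scores_for_all_vocab (torch.FloatTensor of shape (batch_size * num_beams, sequence_length)) — The scores of all tokens in the vocabulary for each of the beam hypotheses.
pad_token_id (int, optional) — The ID of the padding token.
eos_token_id (Union[int, list[int]], optional) — The ID of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.
beam_indices (torch.LongTensor, optional) — Beam indices indicating to which beam hypothesis each token corresponds.
decoder_prompt_len (int, optional) — The length of the prompt that is included in the input to the decoder.

UserDict

A dictionary composed of the fields mentioned above:

next_beam_scores (torch.FloatTensor of shape (batch_size * num_beams)) — Updated scores of all non-finished beams.
next_beam_tokens (torch.FloatTensor of shape (batch_size * num_beams)) — Next tokens to be added to the non-finished beam hypotheses.
next_beam_indices (torch.FloatTensor of shape (batch_size * num_beams)) — Beam indices indicating to which beam the next tokens shall be added.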

finalize

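Streamers

class transformers.TextStreamer

( tokenizer: AutoTokenizer skip_prompt: bool = False **decode_kwargs )

Parameters

tokenizer (AutoTokenizer) — The tokenizer used to decode the tokens.
skip_prompt (bool, optional, defaults to False) — Whether to skip the prompt to .generate() or not. Useful e.g. for chatbots.
decode_kwargs (dict, optional) — Additional keyword arguments to pass to the tokenizer's decode method.

Simple text streamer that prints the token(s) to stdout as soon as entire words are formed.

The API for the streamer classes is still under development and may change in the future.

Example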

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

>>> tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
>>> streamer = TextStreamer(tok)

>>> # Despite returning the usual output, the streamer will also print the generated text to stdout.
>>> _ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,

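end

( )

Flushes any remaining cache and prints a newline to stdout.

on_finalized_text

( text: str stream_end: bool = False )

Prints the new text to stdout. If the stream is ending, also prints a newline.

put

( value )

Receives tokens, decodes them, and prints them to stdout as soon as they form entire words.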

class transformers.TextIteratorStreamer

( tokenizer: AutoTokenizer skip_prompt: bool = False timeout: Optional[float] = None **decode_kwargs )

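Parameters

tokenizer (AutoTokenizer) — The tokenizer used to decode the tokens.
skip_prompt (bool, optional, defaults to False) — Whether to skip the prompt to .generate() or not. Useful e.g. for chatbots.
timeout (float, optional) — The timeout for the text queue. If None, the queue will block indefinitely. Useful to handle exceptions in .generate() when it is called in a separate thread.
decode_kwargs (dict, optional) — Additional keyword arguments to pass to the tokenizer's decode method.

Streamer that stores print-ready text in a queue, to be used by a downstream application as an iterator. This is useful for applications that benefit from accessing the generated text in a non-blocking way (e.g. in an interactive Gradio demo).

The API for the streamer classes is still under development and may change in the future.

Example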

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
>>> from threading import Thread

>>> tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
>>> streamer = TextIteratorStreamer(tok)

>>> # Run the generation in a separate thread, so that we can fetch the generated text in a non-blocking way.
>>> generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=20)
>>> thread = Thread(target=model.generate, kwargs=generation_kwargs)
>>> thread.start()
>>> generated_text = ""
>>> for new_text in streamer:
...     generated_text += new_text
>>> generated_text
'An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,'

on_finalized_text

( text: str stream_end: bool = False )

Puts the new text in the queue. If the stream is ending, also puts a stop signal in the queue.
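The queue-plus-stop-signal pattern described above can be reproduced with the standard library alone. The sketch below is a simplified stand-in, not the library class: `ToyIteratorStreamer` and `fake_generate` are hypothetical names, and the producer emits ready-made strings instead of decoding tokens.

```python
import queue
import threading


class ToyIteratorStreamer:
    """Toy streamer: a producer pushes text chunks, a consumer iterates over them."""

    def __init__(self, timeout=None):
        self.text_queue = queue.Queue()
        self.stop_signal = None  # sentinel marking the end of the stream
        self.timeout = timeout

    def on_finalized_text(self, text, stream_end=False):
        # Put the new text in the queue; on stream end, also put the stop signal.
        self.text_queue.put(text, timeout=self.timeout)
        if stream_end:
            self.text_queue.put(self.stop_signal, timeout=self.timeout)

    def __iter__(self):
        return self

    def __next__(self):
        value = self.text_queue.get(timeout=self.timeout)
        if value is self.stop_signal:
            raise StopIteration
        return value


def fake_generate(streamer):
    """Stand-in for model.generate running in a background thread."""
    chunks = ["one, ", "two, ", "three"]
    for index, chunk in enumerate(chunks):
        streamer.on_finalized_text(chunk, stream_end=(index == len(chunks) - 1))


streamer = ToyIteratorStreamer(timeout=5.0)
thread = threading.Thread(target=fake_generate, args=(streamer,))
thread.start()
generated_text = "".join(streamer)  # consumes chunks as they arrive
thread.join()
```

If the producer thread fails before sending the stop signal, the `timeout` makes `get` raise `queue.Empty` instead of blocking forever, which is the reason a timeout is useful when generation runs in a separate thread.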

class transformers.AsyncTextIteratorStreamer

( tokenizer: AutoTokenizer skip_prompt: bool = False timeout: Optional[float] = None **decode_kwargs )

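Parameters

tokenizer (AutoTokenizer) — The tokenizer used to decode the tokens.
skip_prompt (bool, optional, defaults to False) — Whether to skip the prompt to .generate() or not. Useful e.g. for chatbots.
timeout (float, optional) — The timeout for the text queue. If None, the queue will block indefinitely. Useful to handle exceptions in .generate() when it is called in a separate thread.
decode_kwargs (dict, optional) — Additional keyword arguments to pass to the tokenizer's decode method.

Raises

TimeoutError

TimeoutError — If token generation time exceeds the timeout value.

Streamer that stores print-ready text in a queue, to be used by a downstream application as an async iterator. This is useful for applications that benefit from accessing the generated text asynchronously (e.g. in an interactive Gradio demo).

The API for the streamer classes is still under development and may change in the future.

Example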

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, AsyncTextIteratorStreamer
>>> from threading import Thread
>>> import asyncio

>>> tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> inputs = tok(["An increasing sequence: one,"], return_tensors="pt")

>>> # Run the generation in a separate thread, so that we can fetch the generated text in a non-blocking way.
>>> async def main():
...     # Important: AsyncTextIteratorStreamer must be initialized inside a coroutine!
...     streamer = AsyncTextIteratorStreamer(tok)
...     generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=20)
...     thread = Thread(target=model.generate, kwargs=generation_kwargs)
...     thread.start()
...     generated_text = ""
...     async for new_text in streamer:
...         generated_text += new_text
...     print(generated_text)
>>> asyncio.run(main())
An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,

on_finalized_text

( text: str stream_end: bool = False )

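Puts the new text in the queue. If the stream is ending, also puts a stop signal in the queue.

Caches

class transformers.Cache

( )

Base, abstract class for all caches. The actual data structure is specific to each subclass.

update

( key_states: Tensor value_states: Tensor layer_idx: int cache_kwargs: typing.Optional[dict[str, typing.Any]] = None )

Parameters

key_states (torch.Tensor) — The new key states to cache.
value_states (torch.Tensor) — The new value states to cache.
layer_idx (int) — The index of the layer to cache the states for.
cache_kwargs (dict[str, Any], optional) — Additional arguments for the cache subclass. These are specific to each subclass and allow new types of cache to be created.

Updates the cache with the new key_states and value_states for the layer layer_idx.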

class transformers.CacheConfig

( cache_implementation: None )

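Base class for cache configs.

update

( **kwargs ) → dict[str, Any]

Parameters

kwargs (dict[str, Any]) — Dictionary of attributes to tentatively update this class.

dict[str, Any]

Dictionary containing all the key-value pairs that were not used to update the instance.

Updates attributes of this class instance with attributes from kwargs if they match existing attributes, returning all the unused kwargs.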
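The update semantics (apply matching attributes, hand back everything else) can be sketched as follows. `ToyConfig` and its two attributes are hypothetical and only illustrate the described behaviour; this is not the library class.

```python
class ToyConfig:
    """Toy config with an update() that returns the kwargs it did not use."""

    def __init__(self, nbits=4, q_group_size=64):
        self.nbits = nbits
        self.q_group_size = q_group_size

    def update(self, **kwargs):
        unused = {}
        for key, value in kwargs.items():
            if key in vars(self):
                # The key matches an existing attribute: overwrite it.
                setattr(self, key, value)
            else:
                # Unknown key: leave the instance alone and report it back.
                unused[key] = value
        return unused


config = ToyConfig()
unused = config.update(nbits=2, temperature=0.7)
```

Returning the unused entries lets a caller forward them to another component instead of silently dropping them.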

class transformers.QuantizedCacheConfig

( backend: str = 'quanto' nbits: typing.Optional[int] = 4 axis_key: typing.Optional[int] = 0 axis_value: typing.Optional[int] = 0 q_group_size: typing.Optional[int] = 64 residual_length: typing.Optional[int] = 128 compute_dtype: typing.Optional[torch.dtype] = torch.float16 device: typing.Optional[str] = 'cpu' )

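Parameters

backend (str, optional, defaults to "quanto") — Backend to use when performing quantization. Can be one of [quanto, HQQ].
nbits (Optional[int], optional, defaults to 4) — Number of bits. Can be 2 or 4 for the quanto backend and one of [1, 2, 3, 4, 8] for the HQQ backend.
axis_key (int, optional, defaults to 0) — Axis over which to perform grouping for the key tensors. Can be [0, -1] for the quanto backend and [0, 1] for the HQQ backend.
axis_value (int, optional, defaults to 0) — Axis over which to perform grouping for the value tensors. Can be [0, -1] for the quanto backend and [0, 1] for the HQQ backend.
q_group_size (Optional[int], optional, defaults to 64) — Size of the quantization group. Should be a divisor of the model's hidden dimension.
residual_length (Optional[int], optional, defaults to 128) — Length of the residual cache, which will always be stored in original precision.
compute_dtype (torch.dtype, optional, defaults to torch.float16) — The default dtype used for computations in the model. Keys and values will be cast to this dtype after dequantization.
device (str, optional, defaults to "cpu") — Device on which to perform computations. Should be the same as the model's device.

Configuration class for quantized cache settings.

validate

( )

Validates that the passed arguments are correct.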

class transformers.DynamicCache

( _distributed_cache_data: typing.Optional[collections.abc.Iterable] = None )

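A cache that grows dynamically as more tokens are generated. This is the default for generative models.

It stores the key and value states as a list of tensors, one for each layer. The expected shape for each tensor is [batch_size, num_heads, seq_len, head_dim].

Example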

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache

>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

>>> inputs = tokenizer(text="My name is Qwen2", return_tensors="pt")

>>> # Prepare a cache class and pass it to model's forward
>>> past_key_values = DynamicCache()
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values # access cache filled with key/values from generation
DynamicCache()

update

( key_states: Tensor value_states: Tensor layer_idx: int cache_kwargs: typing.Optional[dict[str, typing.Any]] = None )

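Parameters

key_states (torch.Tensor) — The new key states to cache.
value_states (torch.Tensor) — The new value states to cache.
layer_idx (int) — The index of the layer to cache the states for.
cache_kwargs (dict[str, Any], optional) — Additional arguments for the cache subclass. No additional arguments are used in DynamicCache.

Updates the cache with the new key_states and value_states for the layer layer_idx.

get_seq_length

( layer_idx: typing.Optional[int] = 0 )

Returns the sequence length of the cached states. A layer index can be optionally passed.

reorder_cache

( beam_idx: LongTensor )

Reorders the cache for beam search, given the selected beam indices.

to_legacy_cache

( )

Converts the DynamicCache instance into its equivalent in the legacy cache format. Used for backward compatibility.

from_legacy_cache

( past_key_values: typing.Optional[tuple[tuple[torch.FloatTensor, torch.FloatTensor]]] = None )

Converts a cache in the legacy cache format into an equivalent DynamicCache. Used for backward compatibility.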

class transformers.QuantizedCache

( cache_config: QuantizedCacheConfig )

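A quantizer cache similar to what is described in the paper KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. It allows the model to generate longer sequences without allocating too much memory for the key and value caches by applying quantization.

The cache has two types of storage, one for original precision and one for the quantized cache. A `residual length` is set as the maximum capacity of the original-precision cache. When the length goes beyond that capacity, the contents of the original-precision cache are quantized, moved into the quantized cache, and the original-precision cache is emptied. In contrast to what is described in the paper, quantization is done per channel with a set `q_group_size` for both keys and values.

It stores keys and values as a list of quantized tensors (tuples in case metadata needs to be stored), one for each layer. Additionally, it stores the key and value states in original precision as a list of tensors, one for each layer. The size of each tensor is `[batch_size, num_heads, seq_len - residual_length, head_dim]`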
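The two-tier storage can be illustrated with a toy one-dimensional store. Everything below is an assumption made for illustration: `quantize_group`, `dequantize_group` and `ToyQuantizedStore` are hypothetical names, plain floats stand in for key/value tensors, and simple min-max affine quantization stands in for the quanto and HQQ backends.

```python
def quantize_group(values, nbits):
    """Min-max affine quantization of one group to integer codes in [0, 2**nbits - 1]."""
    low, high = min(values), max(values)
    levels = 2 ** nbits - 1
    scale = (high - low) / levels if high > low else 1.0
    codes = [round((value - low) / scale) for value in values]
    return codes, scale, low


def dequantize_group(codes, scale, low):
    return [code * scale + low for code in codes]


class ToyQuantizedStore:
    """Recent values stay in full precision; older ones are stored quantized."""

    def __init__(self, nbits=4, q_group_size=4, residual_length=4):
        self.nbits = nbits
        self.q_group_size = q_group_size
        self.residual_length = residual_length
        self.quantized = []  # list of (codes, scale, low) groups
        self.residual = []   # original-precision values

    def update(self, new_values):
        self.residual.extend(new_values)
        if len(self.residual) > self.residual_length:
            # Capacity exceeded: quantize the residual group by group, then empty it.
            for start in range(0, len(self.residual), self.q_group_size):
                group = self.residual[start:start + self.q_group_size]
                self.quantized.append(quantize_group(group, self.nbits))
            self.residual = []

    def read(self):
        values = []
        for codes, scale, low in self.quantized:
            values.extend(dequantize_group(codes, scale, low))
        return values + self.residual


store = ToyQuantizedStore(nbits=4, q_group_size=4, residual_length=4)
store.update([0.0, 1.0, 2.0, 3.0])  # fits in the residual buffer
within_capacity = len(store.quantized) == 0
store.update([4.0, 5.0])            # exceeds residual_length, triggers quantization
restored = store.read()
```

Reading back returns approximations of the quantized values; the error per value is bounded by half a quantization step of its group.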

update

( key_states: Tensor value_states: Tensor layer_idx: int cache_kwargs: typing.Optional[dict[str, typing.Any]] = None )

get_seq_length

( layer_idx: typing.Optional[int] = 0 )

Returns the sequence length of the cached states. A layer index can be optionally passed.

class transformers.QuantoQuantizedCache

( cache_config: CacheConfig )

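Parameters

cache_config (QuantizedCacheConfig) — A configuration containing all the arguments to be used by the quantizer, including axis, qtype and group size.

Quantized cache class that uses quanto as a backend to perform quantization. The current implementation supports int2 and int4 dtypes only.

Example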

>>> # Run pip install quanto first if you don't have it yet
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoQuantizedCache, QuantizedCacheConfig

>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

>>> inputs = tokenizer(text="My name is Qwen2", return_tensors="pt")

>>> # Prepare a cache class and pass it to model's forward
>>> cache_config = QuantizedCacheConfig(nbits=4)
>>> past_key_values = QuantoQuantizedCache(cache_config=cache_config)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values # access cache filled with key/values from generation
QuantoQuantizedCache()

class transformers.HQQQuantizedCache

( cache_config: CacheConfig )

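Parameters

cache_config (QuantizedCacheConfig) — A configuration containing all the arguments to be used by the quantizer, including axis, qtype and group size.

Quantized cache class that uses HQQ as a backend to perform quantization. The current implementation supports int2, int4 and int8 dtypes.

Example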

>>> # Run pip install hqq first if you don't have it yet
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, HQQQuantizedCache, QuantizedCacheConfig

>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

>>> inputs = tokenizer(text="My name is Qwen2", return_tensors="pt")

>>> # Prepare a cache class and pass it to model's forward
>>> cache_config = QuantizedCacheConfig(nbits=4, axis_key=1, axis_value=1)
>>> past_key_values = HQQQuantizedCache(cache_config=cache_config)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values # access cache filled with key/values from generation
HQQQuantizedCache()

class transformers.OffloadedCache

( )

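A drop-in replacement for DynamicCache that conserves accelerator (GPU, XPU) memory at the expense of more CPU memory. Useful for generating from models with very long context.

In addition to the default accelerator stream, where all forward() computations happen, this class uses another stream, the prefetch stream, which it creates itself. Since scheduling of operations on separate streams happens independently, this class uses the prefetch stream to asynchronously prefetch the KV cache of layer k+1 while layer k is executing. The movement of the layer k-1 cache to the CPU is handled by the default stream as a simple way to ensure the eviction is scheduled after all computations on that cache are finished.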

update

( key_states: Tensor value_states: Tensor layer_idx: int cache_kwargs: typing.Optional[dict[str, typing.Any]] = None )

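Parameters

key_states (torch.Tensor) — The new key states to cache.
value_states (torch.Tensor) — The new value states to cache.
layer_idx (int) — The index of the layer to cache the states for.
cache_kwargs (dict[str, Any], optional) — Additional arguments for the cache subclass. No additional arguments are used in OffloadedCache.

Updates the cache with the new key_states and value_states for the layer layer_idx.

prefetch_layer

( layer_idx: int )

Starts prefetching the next layer's cache.

evict_previous_layer

( layer_idx: int )

Moves the previous layer's cache to the CPU.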

class transformers.StaticCache

( config: PretrainedConfig max_batch_size: int max_cache_len: typing.Optional[int] = None device: typing.Union[torch.device, str, NoneType] = None dtype: dtype = torch.float32 layer_device_map: typing.Optional[dict[int, typing.Union[str, torch.device, int]]] = None )

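Parameters

config (PretrainedConfig) — The configuration file defining the shape-related attributes required to initialize the static cache.
max_batch_size (int) — The maximum batch size with which the model will be used. Note that a new instance must be instantiated if a smaller batch size is used. If you are manually setting the batch size, make sure to take into account the number of beams if you are running beam search.
max_cache_len (int, optional) — The maximum sequence length with which the model will be used.
device (torch.device or str, optional) — The device on which the cache should be initialized. If you are using more than one computation device, you should pass the layer_device_map argument instead.
dtype (torch.dtype, optional, defaults to torch.float32) — The default dtype to use when initializing the layers.
layer_device_map (Optional[dict[int, Union[str, torch.device, int]]], optional) — Mapping between the layers and their devices. This is required when you are manually initializing the cache and the model is split between different GPUs. You can find out which layers are mapped to which device by checking the associated device_map: model.hf_device_map.

Static cache class to be used with torch.compile(model) and torch.export().

Example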

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache

>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

>>> inputs = tokenizer(text="My name is Llama", return_tensors="pt")

>>> # Prepare a cache class and pass it to model's forward
>>> # Leave empty space for 10 new tokens, which can be used when calling forward iteratively 10 times to generate
>>> max_generated_length = inputs.input_ids.shape[1] + 10
>>> past_key_values = StaticCache(config=model.config, max_batch_size=1, max_cache_len=max_generated_length, device=model.device, dtype=model.dtype)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values # access cache filled with key/values from generation
StaticCache()

update

( key_states: Tensor value_states: Tensor layer_idx: int cache_kwargs: typing.Optional[dict[str, typing.Any]] = None )

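Parameters

key_states (torch.Tensor) — The new key states to cache.
value_states (torch.Tensor) — The new value states to cache.
layer_idx (int) — The index of the layer to cache the states for.
cache_kwargs (dict[str, Any], optional) — Additional arguments for the cache subclass. The StaticCache needs the cache_position input to know where to write in the cache.

Updates the cache with the new key_states and value_states for the layer layer_idx. It is very important to index using a tensor, otherwise you introduce a copy to the device.

get_seq_length

( layer_idx: typing.Optional[int] = 0 )

Returns the sequence length of the cached states that were seen by the model.

reset

( )

Resets the cache values while preserving the objects.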

class transformers.OffloadedStaticCache

( config: PretrainedConfig max_batch_size: int max_cache_len: typing.Optional[int] device: typing.Union[str, torch.device] dtype: typing.Optional[torch.dtype] = None offload_device: typing.Union[str, torch.device] = device(type='cpu') layer_device_map: typing.Optional[dict[int, typing.Union[str, torch.device, int]]] = None )

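Parameters

config (PretrainedConfig) — The configuration file defining the shape-related attributes required to initialize the static cache.
max_batch_size (int) — The maximum batch size with which the model will be used.
max_cache_len (int) — The maximum sequence length with which the model will be used.
device (Union[str, torch.device]) — The device on which the cache should be initialized. If you are using more than one computation device, you should pass the layer_device_map argument instead.
dtype (torch.dtype, optional) — The default dtype to use when initializing the cache.
offload_device (Union[str, torch.device], optional, defaults to cpu) — The device to offload to.
layer_device_map (dict[int, Union[str, torch.device, int]], optional) — Mapping between the layers and their devices. This is required when you are manually initializing the cache and the model is split between different GPUs. You can find out which layers are mapped to which device by checking the associated device_map: model.hf_device_map.

Static cache class to be used with torch.compile(model) that offloads to the CPU or another device.

Example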

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, OffloadedStaticCache

>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

>>> inputs = tokenizer(text="My name is GPT2", return_tensors="pt")

>>> # Prepare a cache class and pass it to model's forward
>>> # Leave empty space for 10 new tokens, which can be used when calling forward iteratively 10 times to generate
>>> max_generated_length = inputs.input_ids.shape[1] + 10
>>> past_key_values = OffloadedStaticCache(config=model.config, max_batch_size=1, max_cache_len=max_generated_length, device=model.device, dtype=model.dtype)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> past_key_values = outputs.past_key_values # access cache filled with key/values from generation

update

( key_states: Tensor value_states: Tensor layer_idx: int cache_kwargs: typing.Optional[dict[str, typing.Any]] = None )

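Parameters

key_states (torch.Tensor) — The new key states to cache.
value_states (torch.Tensor) — The new value states to cache.
layer_idx (int) — The index of the layer to cache the states for.
cache_kwargs (dict[str, Any], optional) — Additional arguments for the cache subclass. OffloadedStaticCache needs the cache_position input to know where to write in the cache.

Updates the cache with the new key_states and value_states for the layer layer_idx. It is very important to index using a tensor; otherwise you introduce a copy to the device.

get_seq_length

( layer_idx: typing.Optional[int] = 0 )

Returns the sequence length of the cached states that were seen by the model.

reset

( )

Resets the cache values while preserving the objects.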

class transformers.HybridCache

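Parameters

config (PretrainedConfig) — The configuration file defining the shape-related attributes required to initialize the static cache.
max_batch_size (int) — The maximum batch size with which the model will be used. Note that a new instance must be instantiated if a smaller batch size is used.
max_cache_len (int, optional) — The maximum sequence length with which the model will be used.
device (torch.device or str, optional) — The device on which the cache should be initialized. If you're using more than 1 computation device, you should pass the layer_device_map argument instead.
dtype (torch.dtype, optional, defaults to torch.float32) — The default dtype to use when initializing the layer.
layer_device_map (Optional[dict[int, Union[str, torch.device, int]]], optional) — Mapping between the layers and their devices. This is required when you are manually initializing the cache and the model is split across different GPUs. You can see which layers are mapped to which device by checking the associated device_map: model.hf_device_map.

Hybrid cache class to be used with torch.compile for models that alternate between local sliding window attention and global attention in every other layer (originally implemented for Gemma2). Under the hood, the hybrid cache uses SlidingWindowCache for sliding window attention and StaticCache for global attention. For more information, see the documentation of each subcomponent cache class.

Example: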

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, HybridCache

>>> model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
>>> tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

>>> inputs = tokenizer(text="My name is Gemma", return_tensors="pt")

>>> # Prepare a cache class and pass it to model's forward
>>> # Leave empty space for 10 new tokens, which can be used when calling forward iteratively 10 times to generate
>>> max_generated_length = inputs.input_ids.shape[1] + 10
>>> past_key_values = HybridCache(config=model.config, max_batch_size=1, max_cache_len=max_generated_length, device=model.device, dtype=model.dtype)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values # access cache filled with key/values from generation
HybridCache()
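
To make the alternating layout concrete, the sketch below lists which kind of cache each layer would use and how many positions it would hold. It is an illustration under stated assumptions, not the library implementation: hybrid_cache_layout is a hypothetical function, even-indexed layers are assumed to be the sliding-window ones, and a sliding-window layer is assumed to hold min(sliding_window, max_cache_len) positions.

```python
def hybrid_cache_layout(num_hidden_layers, sliding_window, max_cache_len):
    """Return (kind, cache_length) per layer for an alternating local/global model."""
    layout = []
    for layer_idx in range(num_hidden_layers):
        # Assumption: even-indexed layers are the sliding-window ones
        if layer_idx % 2 == 0:
            layout.append(("sliding", min(sliding_window, max_cache_len)))
        else:
            layout.append(("global", max_cache_len))
    return layout


layout = hybrid_cache_layout(num_hidden_layers=4, sliding_window=4096, max_cache_len=8192)
for layer_idx, (kind, length) in enumerate(layout):
    print(layer_idx, kind, length)
```

Under these assumptions, only the global layers grow their memory footprint with max_cache_len once it exceeds the sliding window.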

update

( key_states: Tensor value_states: Tensor layer_idx: int cache_kwargs: typing.Optional[dict[str, typing.Any]] = None )

get_seq_length

( layer_idx: typing.Optional[int] = 0 )

reset

( )

Resets the cache values while preserving the objects.

class transformers.SlidingWindowCache

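Parameters

config (PretrainedConfig) — The configuration file defining the shape-related attributes required to initialize the static cache.
max_batch_size (int) — The maximum batch size with which the model will be used. Note that a new instance must be instantiated if a smaller batch size is used.
max_cache_len (int, optional) — The maximum sequence length with which the model will be used.
device (torch.device or str, optional) — The device on which the cache should be initialized. If you're using more than 1 computation device, you should pass the layer_device_map argument instead.
dtype (torch.dtype, optional, defaults to torch.float32) — The default dtype to use when initializing the layer.
layer_device_map (Optional[dict[int, Union[str, torch.device, int]]], optional) — Mapping between the layers and their devices. This is required when you are manually initializing the cache and the model is split across different GPUs. You can see which layers are mapped to which device by checking the associated device_map: model.hf_device_map.

Sliding window cache class to be used with torch.compile for models like Mistral that support sliding window attention. Every time we try to update the cache, we compute the indices based on cache_position >= self.config.sliding_window - 1. If this is true (meaning the cache cannot hold all the old key/value states together with the new ones because of the sliding window constraint), we need to do a cyclic shift based on indices to replace the oldest states with the new key/value states passed in.

The `to_shift` is only true once we are above sliding_window. Thus with `sliding_window==64`:

indices = (slicing + to_shift[-1].sum()-1) % self.config.sliding_window
tensor([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 0])

We overwrite the cache using these indices, then we always write at cache_position (clamped to `sliding_window`).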
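
The index computation can be reproduced in plain Python to see when the shift kicks in. This is a sketch rather than the library code: sliding_window_indices is a hypothetical function, lists stand in for tensors, and slicing is assumed to be the sequence 1, 2, ..., sliding_window, which is the value consistent with the tensor printed in the description.

```python
def sliding_window_indices(cache_position, sliding_window):
    """Pure-Python version of the index computation described in the text."""
    # Assumption: slicing is 1, 2, ..., sliding_window
    slicing = range(1, sliding_window + 1)
    # to_shift is evaluated per position; only the last position decides the shift
    to_shift = [pos >= sliding_window - 1 for pos in cache_position]
    shift = int(to_shift[-1])
    indices = [(s + shift - 1) % sliding_window for s in slicing]
    # Writes always land inside the window (cache_position is clamped)
    write_position = [min(pos, sliding_window - 1) for pos in cache_position]
    return indices, write_position


indices, write_position = sliding_window_indices([70], sliding_window=64)
print(indices[:3], indices[-2:], write_position)  # [1, 2, 3] [63, 0] [63]

indices, write_position = sliding_window_indices([10], sliding_window=64)
print(indices[:3], indices[-2:], write_position)  # [0, 1, 2] [62, 63] [10]
```

Before the window is full, the indices are the identity permutation and nothing moves. Once the position reaches the end of the window, every entry moves one slot to the left and the new state is written in the last slot.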

Example:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, SlidingWindowCache

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

>>> inputs = tokenizer(text="My name is Mistral", return_tensors="pt")

>>> # Prepare a cache class and pass it to model's forward
>>> # Leave empty space for 10 new tokens, which can be used when calling forward iteratively 10 times to generate
>>> max_generated_length = inputs.input_ids.shape[1] + 10
>>> past_key_values = SlidingWindowCache(config=model.config, max_batch_size=1, max_cache_len=max_generated_length, device=model.device, dtype=model.dtype)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values # access cache filled with key/values from generation
SlidingWindowCache()

update

( key_states: Tensor value_states: Tensor layer_idx: int cache_kwargs: typing.Optional[dict[str, typing.Any]] = None )

reset

( )

class transformers.EncoderDecoderCache

( self_attention_cache: Cache cross_attention_cache: Cache )

Base, abstract class for all encoder-decoder caches. Can be used to hold combinations of self-attention and cross-attention caches.

Example:

>>> from transformers import AutoProcessor, AutoModelForCausalLM, DynamicCache, EncoderDecoderCache

>>> model = AutoModelForCausalLM.from_pretrained("openai/whisper-small")
>>> processor = AutoProcessor.from_pretrained("openai/whisper-small")

>>> inputs = processor(audio=YOUR_AUDIO, return_tensors="pt")  # YOUR_AUDIO is a placeholder for your own raw audio array

>>> # Prepare cache classes for encoder and decoder and pass it to model's forward
>>> self_attention_cache = DynamicCache()
>>> cross_attention_cache = DynamicCache()
>>> past_key_values = EncoderDecoderCache(self_attention_cache, cross_attention_cache)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values # access cache filled with key/values from generation
EncoderDecoderCache()

get_seq_length

( layer_idx: typing.Optional[int] = 0 )

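Returns the sequence length of the cached states. A layer index can be optionally passed.

to_legacy_cache

( )

Converts the EncoderDecoderCache instance into its equivalent in the legacy cache format.

from_legacy_cache

( past_key_values: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None )

Converts a cache in the legacy cache format into an equivalent EncoderDecoderCache.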
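
The two conversions can be sketched with plain tuples. This is a simplified illustration, not the library code: to_legacy and from_legacy are hypothetical functions, strings stand in for tensors, and the legacy format is assumed to be one 4-tuple per layer holding (self-attention key, self-attention value, cross-attention key, cross-attention value).

```python
def to_legacy(self_attention_layers, cross_attention_layers):
    """Merge per-layer (key, value) pairs into one 4-tuple per layer."""
    return tuple(
        self_kv + cross_kv
        for self_kv, cross_kv in zip(self_attention_layers, cross_attention_layers)
    )


def from_legacy(past_key_values):
    """Split each 4-tuple back into self-attention and cross-attention pairs."""
    self_attention_layers = tuple(layer[:2] for layer in past_key_values)
    cross_attention_layers = tuple(layer[2:] for layer in past_key_values)
    return self_attention_layers, cross_attention_layers


self_layers = (("self_k0", "self_v0"), ("self_k1", "self_v1"))
cross_layers = (("cross_k0", "cross_v0"), ("cross_k1", "cross_v1"))
legacy = to_legacy(self_layers, cross_layers)
print(legacy[0])  # ('self_k0', 'self_v0', 'cross_k0', 'cross_v0')
print(from_legacy(legacy) == (self_layers, cross_layers))  # True
```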

reset

( )

reorder_cache

( beam_idx: LongTensor )

Reorders the cache for beam search, given the selected beam indices.
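
Reordering for beam search amounts to selecting rows along the batch (beam) dimension. The sketch below shows the idea with a plain list in place of a tensor; reorder_rows is a hypothetical name, and each list element stands for the cached states of one beam.

```python
def reorder_rows(cached_rows, beam_idx):
    """Keep, for each new beam, the cached states of the beam it continues from."""
    return [cached_rows[i] for i in beam_idx]


cached_rows = ["states_of_beam_0", "states_of_beam_1", "states_of_beam_2"]
# Beams 0 and 1 both continue from old beam 2; beam 2 continues from old beam 0
print(reorder_rows(cached_rows, beam_idx=[2, 2, 0]))
# ['states_of_beam_2', 'states_of_beam_2', 'states_of_beam_0']
```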

class transformers.MambaCache

( config: PretrainedConfig max_batch_size: int dtype: dtype = torch.float16 device: typing.Union[torch.device, str, NoneType] = None )

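Parameters

config (PretrainedConfig) — The configuration file defining the shape-related attributes required to initialize the static cache.
max_batch_size (int) — The maximum batch size with which the model will be used. Note that a new instance must be instantiated if a smaller batch size is used.
dtype (torch.dtype, optional, defaults to torch.float16) — The default dtype to use when initializing the layer.
device (torch.device or str, optional) — The device on which the cache should be initialized. Should be the same as the layer device.

Cache for Mamba models, which have no attention mechanism and no key/value states.

Example: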

>>> from transformers import AutoTokenizer, MambaForCausalLM, MambaCache

>>> model = MambaForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")

>>> inputs = tokenizer(text="My name is Mamba", return_tensors="pt")

>>> # Prepare a cache class and pass it to model's forward
>>> past_key_values = MambaCache(config=model.config, max_batch_size=1, device=model.device, dtype=model.dtype)
>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
>>> outputs.past_key_values
MambaCache()

update_conv_state

( layer_idx: int new_conv_state: Tensor cache_position: LongTensor )
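
Because Mamba has no key/value states, the cache keeps a convolution state and an SSM state per layer instead. One way to picture the convolution state is as a fixed-size rolling window over the most recent inputs, which is what a causal 1D convolution needs at each decoding step. The sketch below is a conceptual illustration only: ToyConvState is a hypothetical class, it ignores cache_position as well as the batch and channel dimensions, and it is not how update_conv_state is implemented.

```python
from collections import deque


class ToyConvState:
    """Rolling window holding the last conv_kernel_size inputs of one channel."""

    def __init__(self, conv_kernel_size):
        # Start from zeros, as if the sequence were left-padded
        self.window = deque([0.0] * conv_kernel_size, maxlen=conv_kernel_size)

    def update(self, new_input):
        # Appending to a full deque drops the oldest element automatically
        self.window.append(new_input)
        return list(self.window)


state = ToyConvState(conv_kernel_size=4)
for step_input in [1.0, 2.0, 3.0, 4.0, 5.0]:
    window = state.update(step_input)
print(window)  # [2.0, 3.0, 4.0, 5.0]
```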

update_ssm_state

( layer_idx: int new_ssm_state: Tensor )

reset

( )

Watermark Utils

class transformers.WatermarkingConfig

( greenlist_ratio: typing.Optional[float] = 0.25 bias: typing.Optional[float] = 2.0 hashing_key: typing.Optional[int] = 15485863 seeding_scheme: typing.Optional[str] = 'lefthash' context_width: typing.Optional[int] = 1 )

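Class that holds arguments for watermark generation and should be passed into `GenerationConfig` during `generate`. See this paper for more details on the arguments.

Accepts the following keys:

greenlist_ratio (float): Used for watermarking. The ratio of "green" tokens used to the vocabulary size. Defaults to 0.25.
bias (float): Used with watermarking. The bias added to the logits of the selected "green" tokens. Defaults to 2.0.
hashing_key (int): Hashing key used for watermarking. Defaults to 15485863 (the millionth prime).
seeding_scheme (str): Algorithm to use for watermarking. Accepts the following values:
- "lefthash" (default): the "green" token selection depends on the last token (Algorithm 2 from the paper)
- "selfhash": the "green" token selection depends on the current token itself (Algorithm 3 from the paper). The downside of this scheme is that it considers all possible next tokens and can be slower than "lefthash".
context_width (int): The context length of previous tokens to use in seeding. A higher context length makes the watermark more robust.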
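
The parameters above fit together roughly as in the following sketch of a "lefthash"-style step with context_width=1. It is an illustration under stated assumptions rather than the library implementation: green_token_ids and add_watermark_bias are hypothetical names, the seed is assumed to be hashing_key multiplied by the last token id, and Python's random module replaces the library's random number generator, so the green lists produced here will not match those produced by `generate`.

```python
import random


def green_token_ids(last_token_id, vocab_size, greenlist_ratio=0.25, hashing_key=15485863):
    """Pick the "green" tokens for the next position from the last token only."""
    # Assumption: the seed is hashing_key * last_token_id ("lefthash"-style)
    rng = random.Random(hashing_key * last_token_id)
    permutation = list(range(vocab_size))
    rng.shuffle(permutation)
    # The first greenlist_ratio * vocab_size tokens of the permutation are green
    return set(permutation[: int(greenlist_ratio * vocab_size)])


def add_watermark_bias(logits, last_token_id, bias=2.0, **greenlist_kwargs):
    """Add bias to the logits of the green tokens and leave the others unchanged."""
    green = green_token_ids(last_token_id, vocab_size=len(logits), **greenlist_kwargs)
    return [logit + bias if token_id in green else logit for token_id, logit in enumerate(logits)]


logits = [0.0] * 100
biased = add_watermark_bias(logits, last_token_id=42)
print(sum(1 for logit in biased if logit > 0))  # 25 tokens out of 100 received the bias
```

Because the seed depends only on the hashing key and the preceding token, a detector that knows the same settings can recompute the green list at every position without access to the model.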

__call__

( *args **kwargs )

Call self as a function.

class transformers.WatermarkDetector

( model_config: PretrainedConfig device: str watermarking_config: typing.Union[transformers.generation.configuration_utils.WatermarkingConfig, dict] ignore_repeated_ngrams: bool = False max_cache_size: int = 128 )

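Parameters

model_config (PretrainedConfig) — The model config that will be used to get the model-specific arguments used when generating.
device (str) — The device that was used during watermarked text generation.
watermarking_config (Union[WatermarkingConfig, Dict]) — The exact same watermarking config and arguments that were used when generating the text.
ignore_repeated_ngrams (bool, optional, defaults to False) — Whether to count every unique ngram only once.
max_cache_size (int, optional, defaults to 128) — The maximum size of the LRU cache used for the seeding/sampling algorithms called for every token.

Detector for watermark-generated text. The detector needs to be given the exact same settings that were used during text generation in order to replicate the watermark greenlist generation and so detect the watermark. This includes the correct device that was used during text generation, the correct watermarking arguments, and the correct tokenizer vocabulary size. The code is based on the original repo.

See the paper for more information.

Example: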

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, WatermarkDetector, WatermarkingConfig

>>> model_id = "openai-community/gpt2"
>>> model = AutoModelForCausalLM.from_pretrained(model_id)
>>> tok = AutoTokenizer.from_pretrained(model_id)
>>> tok.pad_token_id = tok.eos_token_id
>>> tok.padding_side = "left"

>>> inputs = tok(["This is the beginning of a long story", "Alice and Bob are"], padding=True, return_tensors="pt")
>>> input_len = inputs["input_ids"].shape[-1]

>>> # first generate text with watermark and without
>>> watermarking_config = WatermarkingConfig(bias=2.5, seeding_scheme="selfhash")
>>> out_watermarked = model.generate(**inputs, watermarking_config=watermarking_config, do_sample=False, max_length=20)
>>> out = model.generate(**inputs, do_sample=False, max_length=20)

>>> # now we can instantiate the detector and check the generated text
>>> detector = WatermarkDetector(model_config=model.config, device="cpu", watermarking_config=watermarking_config)
>>> detection_out_watermarked = detector(out_watermarked, return_dict=True)
>>> detection_out = detector(out, return_dict=True)
>>> detection_out_watermarked.prediction
array([ True,  True])

>>> detection_out.prediction
array([False,  False])

__call__

( input_ids: LongTensor z_threshold: float = 3.0 return_dict: bool = False ) → WatermarkDetectorOutput or np.array

Parameters

input_ids (torch.LongTensor) — The watermark-generated text. It is advised to remove the prompt, since it can affect the detection.
z_threshold (float, optional, defaults to 3.0) — Changing this threshold will change the sensitivity of the detector. A higher z threshold gives less sensitivity and vice versa.
return_dict (bool, optional, defaults to False) — Whether to return a WatermarkDetectorOutput. If not, boolean predictions are returned.

WatermarkDetectorOutput or np.array

A `WatermarkDetectorOutput` if `return_dict=True`, otherwise an `np.array`.
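
The role of z_threshold can be illustrated with the one-proportion z-test used in greenlist watermark detection: under no watermark, each scored token is green with probability greenlist_ratio, so a large excess of green tokens is unlikely by chance. The sketch below assumes this standard form of the test; watermark_z_score and predict_watermarked are hypothetical names, and the green token count is taken as given instead of being recomputed from input_ids.

```python
import math


def watermark_z_score(green_token_count, num_tokens_scored, greenlist_ratio=0.25):
    """z-score of the observed green count against the no-watermark expectation."""
    expected = greenlist_ratio * num_tokens_scored
    std = math.sqrt(num_tokens_scored * greenlist_ratio * (1 - greenlist_ratio))
    return (green_token_count - expected) / std


def predict_watermarked(green_token_count, num_tokens_scored, z_threshold=3.0, greenlist_ratio=0.25):
    """Flag the text as watermarked when the z-score exceeds the threshold."""
    return watermark_z_score(green_token_count, num_tokens_scored, greenlist_ratio) > z_threshold


print(round(watermark_z_score(12, 16), 2))  # 4.62
print(predict_watermarked(12, 16))  # True
print(predict_watermarked(5, 16))  # False
```

Raising z_threshold requires a larger excess of green tokens before a text is flagged, which lowers the false positive rate at the cost of sensitivity.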

class transformers.BayesianDetectorConfig

( watermarking_depth: typing.Optional[int] = None base_rate: float = 0.5 **kwargs )

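Parameters

watermarking_depth (int, optional) — The number of tournament layers.
base_rate (float, optional, defaults to 0.5) — Prior probability P(w) that a text is watermarked.

This is the configuration class to store the configuration of a BayesianDetectorModel. It is used to instantiate a Bayesian detector model according to the specified arguments.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.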

class transformers.BayesianDetectorModel

( config )

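Parameters

config (BayesianDetectorConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bayesian classifier for watermark detection.

This detector uses Bayes' rule to compute a watermarking score, which is the sigmoid of the log of the ratio of the posterior probabilities P(watermarked|g_values) and P(unwatermarked|g_values). See the section on BayesianScore in the paper for further details. Paper URL: https://www.nature.com/articles/s41586-024-08025-4

Note that this detector only works with non-distortionary, Tournament-based watermarking using the Bernoulli(0.5) g-value distribution.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.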

forward

( g_values: Tensor mask: Tensor labels: typing.Optional[torch.Tensor] = None loss_batch_weight = 1 return_dict = False )

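Parameters

g_values (torch.Tensor of shape (batch_size, seq_len, watermarking_depth, ...)) — The g-values (with values 0 or 1).
mask — A binary array of shape [batch_size, seq_len] indicating which g-values should be used. g-values with a mask value of 0 are discarded.

Computes the watermarked posterior P(watermarked|g_values).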
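
The score described for this model, the sigmoid of the log posterior odds, can be worked through by hand for a single sequence. The sketch below is a simplified illustration and not the trained model: watermarked_posterior is a hypothetical function, g-values are assumed to be independent, and p_g1_watermarked is a made-up fixed probability per depth that a g-value equals 1 in watermarked text, whereas the real detector learns its likelihood model. Unwatermarked g-values follow Bernoulli(0.5), as stated in the description.

```python
import math


def watermarked_posterior(g_values, mask, p_g1_watermarked, base_rate=0.5):
    """g_values: [seq_len][depth] of 0/1, mask: [seq_len] of 0/1, p_g1_watermarked: [depth]."""
    # Start from the prior odds P(w) / (1 - P(w))
    log_odds = math.log(base_rate) - math.log(1 - base_rate)
    for token_g_values, keep in zip(g_values, mask):
        if not keep:
            continue  # masked-out g-values are discarded
        for g, p_one in zip(token_g_values, p_g1_watermarked):
            likelihood_watermarked = p_one if g == 1 else 1 - p_one
            likelihood_unwatermarked = 0.5  # Bernoulli(0.5) g-value distribution
            log_odds += math.log(likelihood_watermarked) - math.log(likelihood_unwatermarked)
    # The sigmoid of the log odds is the posterior P(watermarked | g_values)
    return 1 / (1 + math.exp(-log_odds))


g_values = [[1, 1, 1], [0, 1, 0]]
print(watermarked_posterior(g_values, mask=[0, 0], p_g1_watermarked=[0.6, 0.6, 0.6]))  # 0.5
print(watermarked_posterior(g_values, mask=[1, 0], p_g1_watermarked=[0.6, 0.6, 0.6]))
```

With every position masked out, the posterior equals base_rate. Each unmasked g-value equal to 1 pushes the score up whenever its assumed probability under watermarking is above 0.5.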

class transformers.SynthIDTextWatermarkingConfig

( ngram_len: int keys: list context_history_size: int = 1024 sampling_table_seed: int = 0 sampling_table_size: int = 65536 skip_first_ngram_calls: bool = False debug_mode: bool = False )

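Parameters

ngram_len (int) — Ngram length.
keys (list[int]) — A sequence of watermarking keys, one for each depth.
context_history_size (int, optional, defaults to 1024) — Size of the tensor used to keep track of seen contexts.
sampling_table_seed (int, optional, defaults to 0) — Random seed used to generate the sampling table.
sampling_table_size (int, optional, defaults to 65536) — Size of the sampling table.
skip_first_ngram_calls (bool, optional, defaults to False) — Whether to skip the first ngram calls.
debug_mode (bool, optional, defaults to False) — Logits are modified to a uniform distribution before the watermarking modification is applied. This is used for testing the implementation.

Class that holds arguments for watermark generation and should be passed into `GenerationConfig` during `generate`. See this paper for more details on the arguments.

Example: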

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

>>> tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
>>> model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')

>>> # SynthID Text configuration
>>> watermarking_config = SynthIDTextWatermarkingConfig(
...     keys=[654, 400, 836, 123, 340, 443, 597, 160, 57],
...     ngram_len=5,
... )

>>> # Generation with watermarking
>>> tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
>>> output_sequences = model.generate(
...     **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
... )
>>> watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)

class transformers.SynthIDTextWatermarkDetector

( detector_module: BayesianDetectorModel logits_processor: SynthIDTextWatermarkLogitsProcessor tokenizer: typing.Any )

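Parameters

detector_module (BayesianDetectorModel) — Bayesian detector module object initialized with parameters. See https://github.com/huggingface/transformers-research-projects/tree/main/synthid_text for usage.
logits_processor (SynthIDTextWatermarkLogitsProcessor) — The logits processor used for watermarking.
tokenizer (Any) — The tokenizer used for the model.

SynthID text watermark detector class.

This class has to be initialized with a trained Bayesian detector module. See the script in examples/synthid_text/detector_training.py for an example of training, saving, and loading this detector module. The folder also showcases example use cases for this detector.

Example: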

>>> from transformers import (
...     AutoTokenizer, BayesianDetectorModel, SynthIDTextWatermarkLogitsProcessor, SynthIDTextWatermarkDetector
... )

>>> # Load the detector. See https://github.com/huggingface/transformers-research-projects/tree/main/synthid_text for training a detector.
>>> detector_model = BayesianDetectorModel.from_pretrained("joaogante/dummy_synthid_detector")
>>> logits_processor = SynthIDTextWatermarkLogitsProcessor(
...     **detector_model.config.watermarking_config, device="cpu"
... )
>>> tokenizer = AutoTokenizer.from_pretrained(detector_model.config.model_name)
>>> detector = SynthIDTextWatermarkDetector(detector_model, logits_processor, tokenizer)

>>> # Test whether a certain string is watermarked
>>> test_input = tokenizer(["This is a test input"], return_tensors="pt")
>>> is_watermarked = detector(test_input.input_ids)

__call__

( tokenized_outputs: Tensor )

Compile Utils

class transformers.CompileConfig