DeepSeek-V3

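Overview

The DeepSeek-V3 model was proposed in the DeepSeek-V3 Technical Report by the DeepSeek-AI team.

The abstract from the paper is the following: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.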

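Limitations and call for contribution!

We are very happy to make this code community-powered, and would love to see how you can best optimize the following:

- The current implementation uses a "naive" attention computation (so it is not really MLA).
- The current implementation loops through the experts. This should be replaced. A suggested starting point is `get_packed_weights` from `integrations/tensor_parallel`.
- The current implementation uses the EleutherAI formula for RoPE; using the original formula would be more efficient! (It should still follow our API.)
- Static cache is not supported (this should only be a generation config issue / config shape issue).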
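To make the first item concrete, the following back-of-the-envelope sketch compares how many values would be cached per token and per layer when full keys and values are stored for every head, versus when only a compressed latent is stored. The dimensions are the `DeepseekV3Config` defaults documented on this page. The latent cache layout (one `kv_lora_rank` latent plus one shared RoPE key) is an assumption taken from the MLA description in the DeepSeek papers, not something this implementation does.

```python
# Default dimensions from DeepseekV3Config (see the parameter list on this page).
NUM_ATTENTION_HEADS = 128
QK_NOPE_HEAD_DIM = 128
QK_ROPE_HEAD_DIM = 64
V_HEAD_DIM = 128
KV_LORA_RANK = 512


def naive_cache_elements_per_token() -> int:
    """Values cached per token and per layer when full keys and values are stored."""
    key_dim = QK_NOPE_HEAD_DIM + QK_ROPE_HEAD_DIM  # 192 per head
    return NUM_ATTENTION_HEADS * (key_dim + V_HEAD_DIM)


def latent_cache_elements_per_token() -> int:
    """Values cached per token and per layer with a latent cache.

    Assumption: one compressed KV latent plus one RoPE key shared by all heads.
    """
    return KV_LORA_RANK + QK_ROPE_HEAD_DIM


naive = naive_cache_elements_per_token()
latent = latent_cache_elements_per_token()
print(f"naive: {naive}, latent: {latent}, ratio: {naive / latent:.1f}x")
```

Under these assumptions the naive layout stores 40960 values per token and layer against 576 for the latent layout, which is why a true MLA attention path is listed as an optimization target.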

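Usage tips

The model uses Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective. The model was pre-trained on 14.8 trillion tokens and went through Supervised Fine-Tuning and Reinforcement Learning stages, so it can be used for a variety of language tasks.

You can run the model in `FP8` automatically; 2 nodes of 8 H100 GPUs each should be more than enough!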
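As a rough sanity check of that sizing, the sketch below estimates the memory needed for the weights alone and compares it with the aggregate GPU memory of two 8-GPU nodes. The 80 GB per GPU figure is an assumption about the H100 variant in use, and activations, the KV cache and framework overhead are deliberately left out.

```python
TOTAL_PARAMS = 671_000_000_000  # total parameter count quoted in the abstract
GPUS = 2 * 8                    # 2 nodes with 8 GPUs each
GPU_MEMORY_GB = 80              # assumption: 80 GB H100 variant


def weights_gb(bytes_per_param: int) -> float:
    """Memory taken by the weights alone, in GB (10**9 bytes)."""
    return TOTAL_PARAMS * bytes_per_param / 1e9


cluster_gb = GPUS * GPU_MEMORY_GB
fp8_gb = weights_gb(1)    # FP8 stores one byte per weight
bf16_gb = weights_gb(2)   # BF16 stores two bytes per weight

print(f"cluster memory: {cluster_gb} GB")
print(f"FP8 weights:    {fp8_gb:.0f} GB (headroom {cluster_gb - fp8_gb:.0f} GB)")
print(f"BF16 weights:   {bf16_gb:.0f} GB")
```

With these numbers the FP8 weights take roughly half of the available memory, while the same weights in BF16 would not fit on the two nodes at all.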

# `run_deepseek_r1.py`
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(30)  # fix the seed so runs are comparable

tokenizer = AutoTokenizer.from_pretrained("deepseek-r1")

chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

model = AutoModelForCausalLM.from_pretrained("deepseek-r1", device_map="auto", torch_dtype=torch.bfloat16)
# Render the conversation with the model's chat template and append the assistant generation prompt
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
print(time.time() - start)  # elapsed generation time in seconds

This generated:

<｜Assistant｜><think>
Okay, the user wants to demonstrate how chat templating works. Let me break down what that means. Chat templating is about structuring the conversation data, especially for models that need specific input formats. Maybe they're referring to something like how messages are formatted with roles (user, assistant, system) in APIs like OpenAI.

First, I should explain what chat templating is. It's the process of formatting conversation data into a structured format that the model can understand. This usually includes roles and content. For example, user messages, assistant responses, and system messages each have their own role tags.

They might want an example. Let me think of a simple conversation. The user says "Hello, how are you?" and the assistant responds "I'm doing great. How can I help you today?" Then the user follows up with wanting to show off chat templating. So the example should include the history and the new message.

In some frameworks, like Hugging Face's Transformers, chat templates are applied using Jinja2 templates. The template might look something like combining system messages, then looping through user and assistant messages with appropriate tags. For instance, using {% for message in messages %} and assigning roles like <|user|>, <|assistant|>, etc.

I should structure the example with the messages array, showing each role and content. Then apply a hypothetical template to convert that into a formatted string the model uses. Also, mention that different models have different templating requirements, like using special tokens or varying role labels.

Wait, the user mentioned "chat templating" in the context of showing off. Maybe they want a practical example they can present. So providing a code snippet or a structured data example would be helpful. Let me outline a typical messages array and then the templated output.

Also, it's important to note that proper templating ensures the model knows the conversation flow, which is crucial for generating coherent responses. Maybe include a note about why it's important, like maintaining context and role-specific processing.

Let me check if there are any common mistakes or things to avoid. For example, not closing tags properly, or mismatching roles. But maybe that's too detailed unless the user asks. Focus on the positive example first.

Putting it all together, the response should have an example messages array, the applied template, and the final formatted string. Maybe use angle brackets or special tokens as placeholders. Also, mention that this helps in training or fine-tuning models with structured data.

I think that's a solid approach. Let me structure it step by step to make it clear.
</think>

Chat templating is a way to structure conversation data (e.g., user/assistant interactions) into a format that language models understand. This is especially important for models trained to handle multi-turn dialogues, where the input must explicitly separate roles (user, assistant, system, etc.) and messages. Let’s break this down with an example!

---

### **Step 1: Raw Conversation History**
Suppose we have this conversation:
- **User**: "Hello, how are you?"
- **Assistant**: "I'm doing great. How can I help you today?"
- **User**: "I'd like to show off how chat templating works!"

---

### **Step 2: Structured Messages**
In frameworks like Hugging Face Transformers or OpenAI, conversations are often formatted as a list of dictionaries with `role` and `content`:
```python
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
```

---

### **Step 3: Apply a Chat Template**
A **chat template** converts this structured data into a single string formatted for the model. For example, using a Jinja-style template (common in Hugging Face):

```jinja
{% for message in messages %}
    {% if message['role'] == 'user' %}
        <|user|>{{ message['content'] }}<|end|>
    {% elif message['role'] == 'assistant' %}
        <|assistant|>{{ message['content'] }}<|end|>
    {% endif %}
{% endfor %}
<|assistant|>
```

---

### **Step 4: Final Templated Output**
Applying the template to our `messages` list would produce:
```text
<|user|>Hello, how are you?<|end|>
<|assistant|>I'm doing great. How can I help you today?<|end|>
<|user|>I'd like to show off how chat templating works!<|end|>
<|assistant|>
```

This tells the model:  
1. The conversation history (user/assistant turns).  
2. The model’s turn to generate a response (`<|assistant|>` at the end).  

---

### **Key Notes**:
- **Role Separation**: Tags like `<|user|>` and `<|assistant|>` help the model distinguish speakers.
- **Special Tokens**: Models often use unique tokens (e.g., `<|end|>`) to mark message boundaries.
- **Flexibility**: Templates vary by model (e.g., OpenAI uses `{"role": "user", "content": "..."}` instead of tags).

---

### **Why This Matters**:
- **Consistency**: Ensures the model understands dialogue structure.
- **Context Preservation**: Maintains the flow of multi-turn conversations.
- **Alignment**: Matches the format the model was trained on for better performance.

Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI, Llama, Mistral)? Let me know! 😊<｜end▁of▁sentence｜>

Run it with the following command:

torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py
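In the command above, `--node_rank=0|1` is shorthand: the command is launched once on each node, with rank 0 on the first node and rank 1 on the second. The sketch below spells out the two resulting command lines. `an_id` and `master_addr:master_port` are the placeholders from the command above and must be replaced with your own rendezvous id and endpoint; the helper name is hypothetical.

```python
import shlex


def torchrun_command(node_rank: int, script: str = "run_deepseek_r1.py") -> str:
    """Build the launch command for one node (hypothetical helper, placeholders kept as-is)."""
    args = [
        "torchrun",
        "--nproc_per_node=8",
        "--nnodes=2",
        f"--node_rank={node_rank}",
        "--rdzv-id", "an_id",
        "--rdzv-backend", "c10d",
        "--rdzv-endpoint", "master_addr:master_port",
        script,
    ]
    return shlex.join(args)


# One command per node: rank 0 on the first node, rank 1 on the second.
for rank in (0, 1):
    print(torchrun_command(rank))
```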

If you run into the following error:

[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found

it means that NCCL was probably not loaded.
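When debugging this, it can help to see which network interfaces the node exposes and whether NCCL-related environment variables are set. The following is a minimal diagnostic sketch using only the Python standard library. `NCCL_SOCKET_IFNAME` and `NCCL_DEBUG` are standard NCCL environment variables, but whether they are relevant to your failure is an assumption to verify against your cluster setup; the helper name is hypothetical.

```python
import os
import socket


def nccl_network_report() -> dict:
    """Collect the visible network interfaces and NCCL-related environment variables."""
    interfaces = [name for _, name in socket.if_nameindex()]
    return {
        "interfaces": interfaces,
        # Restricts which interface NCCL uses for bootstrap, if set.
        "NCCL_SOCKET_IFNAME": os.environ.get("NCCL_SOCKET_IFNAME"),
        # Setting this to INFO makes NCCL print more details at startup.
        "NCCL_DEBUG": os.environ.get("NCCL_DEBUG"),
    }


report = nccl_network_report()
for key, value in report.items():
    print(f"{key}: {value}")
```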

DeepseekV3Config

class transformers.DeepseekV3Config

( vocab_size = 129280 hidden_size = 7168 intermediate_size = 18432 moe_intermediate_size = 2048 num_hidden_layers = 61 num_attention_heads = 128 num_key_value_heads = 128 n_shared_experts = 1 n_routed_experts = 256 routed_scaling_factor = 2.5 kv_lora_rank = 512 q_lora_rank = 1536 qk_rope_head_dim = 64 v_head_dim = 128 qk_nope_head_dim = 128 n_group = 8 topk_group = 4 num_experts_per_tok = 8 first_k_dense_replace = 3 norm_topk_prob = True hidden_act = 'silu' max_position_embeddings = 4096 initializer_range = 0.02 rms_norm_eps = 1e-06 use_cache = True pad_token_id = None bos_token_id = 0 eos_token_id = 1 pretraining_tp = 1 tie_word_embeddings = False rope_theta = 10000.0 rope_scaling = None rope_interleave = True attention_bias = False attention_dropout = 0.0 **kwargs )

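Parameters

vocab_size (int, optional, defaults to 129280) — Vocabulary size of the DeepSeek-V3 model. Defines the number of different tokens that can be represented by the `input_ids` passed when calling DeepseekV3Model.
hidden_size (int, optional, defaults to 7168) — Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 18432) — Dimension of the MLP representations.
moe_intermediate_size (int, optional, defaults to 2048) — Dimension of the MoE representations.
num_hidden_layers (int, optional, defaults to 61) — Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 128) — Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (int, optional, defaults to 128) — Number of key/value heads used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, the model uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, see [this paper](https://huggingface.co/papers/2305.13245). If not specified, defaults to `num_attention_heads`.
n_shared_experts (int, optional, defaults to 1) — Number of shared experts.
n_routed_experts (int, optional, defaults to 256) — Number of routed experts.
routed_scaling_factor (float, optional, defaults to 2.5) — Scaling factor for the routed experts.
kv_lora_rank (int, optional, defaults to 512) — Rank of the LoRA matrices for the key and value projections.
q_lora_rank (int, optional, defaults to 1536) — Rank of the LoRA matrices for the query projections.
qk_rope_head_dim (int, optional, defaults to 64) — Dimension of the query/key heads that use rotary position embeddings.
v_head_dim (int, optional, defaults to 128) — Dimension of the value heads.
qk_nope_head_dim (int, optional, defaults to 128) — Dimension of the query/key heads that do not use rotary position embeddings.
n_group (int, optional, defaults to 8) — Number of groups for the routed experts.
topk_group (int, optional, defaults to 4) — Number of groups selected for each token (ensuring that the experts selected for each token come only from `topk_group` groups).
num_experts_per_tok (int, optional, defaults to 8) — Number of selected experts; None means a dense model.
first_k_dense_replace (int, optional, defaults to 3) — Number of dense layers in the shallow layers (embed->dense->dense->…->dense->moe->moe…->lm_head), that is, the first k layers after the embedding are dense.
norm_topk_prob (bool, optional, defaults to True) — Whether to normalize the weights of the routed experts.
hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 4096) — The maximum sequence length that this model might ever be used with.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the RMS normalization layers.
use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/value attentions (not used by all models). Only relevant if `config.is_decoder=True`.
pad_token_id (int, optional) — ID of the padding token.
bos_token_id (int, optional, defaults to 0) — ID of the beginning-of-sequence token.
eos_token_id (int, optional, defaults to 1) — ID of the end-of-sequence token.
pretraining_tp (int, optional, defaults to 1) — Experimental feature. Tensor parallelism rank used during pretraining. This value is necessary to ensure exact reproducibility of the pretraining results.
tie_word_embeddings (bool, optional, defaults to False) — Whether to tie the word embedding weights.
rope_theta (float, optional, defaults to 10000.0) — The base period of the RoPE embeddings.
rope_scaling (Dict, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. Two scaling strategies are currently supported: `linear` and `dynamic`. Their scaling factor must be a float greater than 1. The expected format is `{"type": strategy name, "factor": scaling factor}`. When using this flag, do not update `max_position_embeddings` to the expected new maximum.
rope_interleave (bool, optional, defaults to True) — Whether to interleave the rotary position embeddings.
attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.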
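To show how `n_routed_experts`, `n_group`, `topk_group`, `num_experts_per_tok`, `norm_topk_prob` and `routed_scaling_factor` interact, here is a simplified, pure-Python sketch of group-limited expert routing for a single token. It is an illustration under stated assumptions rather than the library's code: the scores are random stand-ins for router affinities, each group is ranked by the sum of its two highest scores, and any load-balancing bias term is omitted.

```python
import random

N_ROUTED_EXPERTS = 256
N_GROUP = 8
TOPK_GROUP = 4
NUM_EXPERTS_PER_TOK = 8
ROUTED_SCALING_FACTOR = 2.5
NORM_TOPK_PROB = True


def route_token(scores):
    """Select experts for one token from per-expert affinity scores."""
    group_size = N_ROUTED_EXPERTS // N_GROUP
    groups = [scores[g * group_size:(g + 1) * group_size] for g in range(N_GROUP)]
    # Assumption: a group is ranked by the sum of its two highest scores.
    group_scores = [sum(sorted(group, reverse=True)[:2]) for group in groups]
    kept_groups = sorted(range(N_GROUP), key=lambda g: group_scores[g], reverse=True)[:TOPK_GROUP]
    # Only experts that belong to the kept groups remain candidates.
    candidates = [g * group_size + i for g in kept_groups for i in range(group_size)]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:NUM_EXPERTS_PER_TOK]
    weights = [scores[e] for e in chosen]
    if NORM_TOPK_PROB:
        total = sum(weights)
        weights = [w / total for w in weights]
    weights = [w * ROUTED_SCALING_FACTOR for w in weights]
    return chosen, weights, kept_groups


rng = random.Random(0)
scores = [rng.random() for _ in range(N_ROUTED_EXPERTS)]  # stand-in router scores
experts, weights, kept_groups = route_token(scores)
print("groups:", sorted(kept_groups))
print("experts:", experts)
print("weights:", [round(w, 3) for w in weights])
```

With normalization enabled, the selected weights sum to `routed_scaling_factor`, and every selected expert lies in one of the `topk_group` kept groups.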

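This is the configuration class to store the configuration of a DeepseekV3Model. It is used to instantiate a DeepSeek model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of DeepSeek-V3, e.g. bzantium/tiny-deepseek-v3. Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

>>> from transformers import DeepseekV3Model, DeepseekV3Config

>>> # Initializing a Deepseek-V3 style configuration
>>> configuration = DeepseekV3Config()

>>> # Initializing a model (with random weights) from the configuration
>>> # Note: the default configuration describes the full-size model, so this needs a very large amount of memory
>>> model = DeepseekV3Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config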

DeepseekV3Model

class transformers.DeepseekV3Model

( config: DeepseekV3Config )

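Parameters

config (DeepseekV3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Deepseek V3 model outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.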

forward

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **flash_attn_kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

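Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple containing 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the legacy cache format will be returned.

If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that do not have their past key/value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix provides.
use_cache (bool, optional) — If set to `True`, `past_key_values` key/value states are returned and can be used to speed up decoding (see `past_key_values`).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.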
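The relationship between `attention_mask` and `position_ids` is easiest to see on a left-padded batch. The sketch below derives position indices from a mask in plain Python: real tokens are numbered from 0, and padded slots receive a dummy position because they are masked out anyway. Using 1 as the dummy value mirrors a common convention in generation code, which is an assumption here rather than a requirement of this model.

```python
def position_ids_from_mask(attention_mask):
    """Number the unmasked tokens of each row from 0; padded slots get a dummy position of 1."""
    position_ids = []
    for row in attention_mask:
        seen = 0
        positions = []
        for keep in row:
            seen += keep
            positions.append(seen - 1 if keep else 1)
        position_ids.append(positions)
    return position_ids


# Two sequences: the first is left-padded with two padding tokens.
attention_mask = [
    [0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
position_ids = position_ids_from_mask(attention_mask)
print(position_ids)  # [[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]]
```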

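Returns

transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (DeepseekV3Config) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden states at the output of the last layer of the model.

If past_key_values is used, only the last hidden state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (Cache, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — A Cache instance. For more details, see our kv cache guide.

Contains pre-computed hidden states (keys and values in the self-attention blocks and, if `config.is_encoder_decoder=True`, in the cross-attention blocks) that can be used (see the `past_key_values` input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The DeepseekV3Model forward method overrides the `__call__` special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.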

DeepseekV3ForCausalLM

class transformers.DeepseekV3ForCausalLM

( config )

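Parameters

config (DeepseekV3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Deepseek V3 model for causal language modeling.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.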

forward

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.models.deepseek_v3.modeling_deepseek_v3.KwargsForCausalLM] ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

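Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple containing 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the legacy cache format will be returned.

If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that do not have their past key/value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix provides.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the `input_ids` docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to `True`, `past_key_values` key/value states are returned and can be used to speed up decoding (see `past_key_values`).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all `input_ids` (special case). Only the last token's logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or a large vocabulary. If a `torch.Tensor`, it must be 1D and correspond to the indices to keep in the sequence length dimension. This is useful when using a packed tensor format (a single dimension for the batch and sequence length).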
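The `logits_to_keep` semantics described above can be mimicked on a plain Python list. In this sketch the strings stand in for per-position hidden states, and a list of indices stands in for the 1D tensor case; the helper name is hypothetical. Note how `0` keeps every position, because `slice(-0, None)` is the same as `slice(0, None)`.

```python
def select_positions(sequence, logits_to_keep=0):
    """Return the positions for which logits would be computed."""
    if isinstance(logits_to_keep, int):
        # -0 == 0, so the default keeps the whole sequence.
        return sequence[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit indices along the sequence dimension.
    return [sequence[i] for i in logits_to_keep]


hidden_states = ["h0", "h1", "h2", "h3"]  # stand-ins for per-token hidden states
print(select_positions(hidden_states, 0))       # all positions
print(select_positions(hidden_states, 1))       # last position only, enough for generation
print(select_positions(hidden_states, [0, 3]))  # explicit indices
```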

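Returns

transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (DeepseekV3Config) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (Cache, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — A Cache instance. For more details, see our kv cache guide.

Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The DeepseekV3ForCausalLM forward method overrides the `__call__` special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.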
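To illustrate how `labels`, the `-100` ignore value and the returned `loss` relate, here is a small pure-Python sketch of a next-token cross-entropy for one sequence. It assumes the usual causal language modeling convention that the logits at position t are scored against the label at position t+1; the function name and the toy inputs are hypothetical.

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(logits, labels):
    """Mean cross-entropy of predicting token t+1 from position t, skipping ignored labels."""
    losses = []
    # Shift: the last position has no target, and the first label is never predicted.
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        log_normalizer = math.log(sum(math.exp(x) for x in step_logits))
        losses.append(log_normalizer - step_logits[target])
    return sum(losses) / len(losses)


# Toy example: 4 positions, vocabulary of 3 tokens, uniform predictions everywhere.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
labels = [-100, -100, 2, 1]  # the first two tokens (e.g. a prompt) are excluded from the loss
loss = causal_lm_loss(logits, labels)
print(round(loss, 4))  # 1.0986, i.e. log(3) for a uniform prediction over 3 tokens
```

The sketch divides by the number of non-ignored targets, so it expects at least one label that is not `-100`.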

Example

>>> from transformers import AutoTokenizer, DeepseekV3ForCausalLM

>>> # Note: the checkpoint name below appears to be a template placeholder; substitute a real DeepSeek-V3 checkpoint
>>> model = DeepseekV3ForCausalLM.from_pretrained("meta-deepseek_v3/DeepseekV3-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-deepseek_v3/DeepseekV3-2-7b-hf")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."