InternVL

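The InternVL3 family of vision-language models was proposed in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.

The abstract from the paper is the following:

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state of the art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

Overview of the InternVL3 model architecture, which is the same as InternVL2.5. Taken from the original checkpoint.

Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint.

This model was contributed by yonigozlan. The original code can be found here.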

Usage example

Inference with Pipeline

Here is how you can use the image-text-to-text pipeline to perform inference with the InternVL3 models in just a few lines of code:

>>> from transformers import pipeline

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
... ]

>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n   - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'

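Inference on a single image

This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.

[!NOTE] Note that the model has been trained with a specific prompt format for chatting. Use processor.apply_chat_template(my_conversation_dict) to correctly format your prompts.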

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
...             {"type": "text", "text": "Please describe the image explicitly."},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
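
The slice `generate_ids[0, inputs["input_ids"].shape[1] :]` in the example above keeps only the newly generated tokens: the output of `generate` starts with the prompt tokens, so everything up to the prompt length is dropped before decoding. A minimal sketch of the same indexing, using plain Python lists in place of tensors (the token ids are made up for illustration):

```python
# Hypothetical token ids; real ids come from the processor and the model.
prompt_ids = [101, 7, 42, 9]            # stands in for inputs["input_ids"][0]
generated = prompt_ids + [55, 56, 57]   # generate() output: prompt followed by new tokens

prompt_length = len(prompt_ids)         # plays the role of inputs["input_ids"].shape[1]
new_tokens = generated[prompt_length:]  # same idea as generate_ids[0, prompt_length:]

print(new_tokens)  # [55, 56, 57]
```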

Text-only generation

This example shows how to generate text using the InternVL model without providing any image input.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "text", "text": "Write a haiku"},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
"Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."

Batched image and text inputs

InternVL models also support batched image and text inputs.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...                 {"type": "text", "text": "Describe this image"},
...             ],
...         },
...     ],
... ]


>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
 'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
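
Because `batch_decode` is applied to the full sequences here, each decoded string still contains the prompt. One way to keep only the reply is to split on the `assistant\n` marker visible in the outputs above. This is only a sketch: `extract_reply` is a hypothetical helper, and it assumes the marker appears in every decoded string, which depends on the chat template.

```python
def extract_reply(decoded: str, marker: str = "assistant\n") -> str:
    """Return the text after the last marker, or the input unchanged if it is absent."""
    # rpartition splits on the last occurrence, so earlier turns are dropped as well.
    _, found, reply = decoded.rpartition(marker)
    return reply if found else decoded


decoded_outputs = [
    "user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
]
replies = [extract_reply(text) for text in decoded_outputs]
print(replies[0])
```

Slicing the token ids by prompt length before decoding, as in the single-image example, avoids relying on the marker text.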

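Batched multi-image input

This implementation of the InternVL models supports batched text-image inputs with a different number of images for each text.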

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
 'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']

Video input

InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.

>>> import torch
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "video",
...                 "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
...             },
...             {"type": "text", "text": "What type of shot is the man performing?"},
...         ],
...     }
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     return_tensors="pt",
...     add_generation_prompt=True,
...     tokenize=True,
...     return_dict=True,
...     num_frames=8,
... ).to(model.device, dtype=torch.float16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The man is performing a forehand shot.'

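Interleaved image and video inputs

This example showcases how to handle a batch of chat conversations with interleaved image and video inputs using chat templates.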

>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
...                 {"type": "text", "text": "What type of shot is the man performing?"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     padding=True,
...     add_generation_prompt=True,
...     tokenize=True,
...     return_dict=True,
...     return_tensors="pt",
... ).to(model.device, dtype=torch.bfloat16)

>>> outputs = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
>>> decoded_outputs
['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
 'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
 "user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace."]

InternVLVisionConfig

class transformers.InternVLVisionConfig

< source >

( hidden_size = 1024 num_hidden_layers = 24 num_attention_heads = 16 attention_bias = False use_qk_norm = False intermediate_size = 4096 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_dropout = 0.0 projection_dropout = 0.0 initializer_range = 0.02 norm_type = 'layer_norm' layer_norm_eps = 1e-06 image_size = [448, 448] patch_size = [14, 14] num_channels = 3 use_mask_token = False use_absolute_position_embeddings = True layer_scale_init_value = 0.1 use_mean_pooling = True **kwargs )

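Parameters

hidden_size (int, optional, defaults to 1024) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
attention_bias (bool, optional, defaults to False) — Whether to add a bias to the queries, keys and values.
use_qk_norm (bool, optional, defaults to False) — Whether to apply normalization to the queries and keys before the attention operation.
intermediate_size (int, optional, defaults to 4096) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, optional, defaults to 0.0) — Dropout probability for the attention weights.
projection_dropout (float, optional, defaults to 0.0) — Dropout probability for the projection layer.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer used to initialize all weight matrices.
norm_type (str, optional, defaults to "layer_norm") — The type of normalization to use in the encoder. Can be "layer_norm" or "rms_norm".
layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
image_size (int or list[int], optional, defaults to [448, 448]) — The size (resolution) of each image.
patch_size (int or list[int], optional, defaults to [14, 14]) — The size (resolution) of each patch.
num_channels (int, optional, defaults to 3) — The number of input channels.
use_mask_token (bool, optional, defaults to False) — Whether to use a mask token for masked image modeling.
use_absolute_position_embeddings (bool, optional, defaults to True) — Whether to use BERT-style absolute position embeddings.
layer_scale_init_value (float, optional, defaults to 0.1) — Scale to use in the self-attention layers. 0.1 for base models, 1e-5 for large models. Set to 0 to disable layer scale.
use_mean_pooling (bool, optional, defaults to True) — Whether to mean-pool the final hidden states of the patches instead of using the final hidden state of the CLS token before applying the classification head.

This is the configuration class to store the configuration of an InternVLVisionModel. It is used to instantiate an InternVLVisionModel model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of InternVL3-1B, e.g. OpenGVLab/InternVL3-1B-hf.

Example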

>>> from transformers import InternVLVisionConfig, InternVLVisionModel

>>> # Initializing a InternVLVisionModel OpenGVLab/InternVL3-1B-hf style configuration
>>> configuration = InternVLVisionConfig()

>>> # Initializing a model (with random weights) from the OpenGVLab/InternVL3-1B-hf configuration
>>> model = InternVLVisionModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

InternVLConfig

class transformers.InternVLConfig

< source >

( vision_config = None text_config = None image_token_id = 151667 image_seq_length = 256 downsample_ratio = 0.5 projector_hidden_act = 'gelu' vision_feature_layer = -1 vision_feature_select_strategy = 'default' **kwargs )

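Parameters

vision_config (Union[AutoConfig, dict], optional, defaults to InternVLVisionConfig) — The config object or dictionary of the vision backbone.
text_config (Union[AutoConfig, dict], optional, defaults to Qwen2Config) — The config object or dictionary of the text backbone.
image_token_id (int, optional, defaults to 151667) — The image token index used to encode the image prompt.
image_seq_length (int, optional, defaults to 256) — Number of image tokens to use per image patch.
downsample_ratio (float, optional, defaults to 0.5) — Factor by which to downsample the image.
projector_hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the projector.
vision_feature_layer (int, optional, defaults to -1) — The index of the layer to use as the image features.
vision_feature_select_strategy (str, optional, defaults to "default") — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".

This is the configuration class to store the configuration of an InternVLForConditionalGeneration. It is used to instantiate an InternVL model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of InternVL3-1B, e.g. OpenGVLab/InternVL3-1B-hf.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.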

>>> from transformers import InternVLForConditionalGeneration, InternVLConfig

>>> # Initializing a InternVL style configuration
>>> configuration = InternVLConfig()

>>> # Initializing a model (with random weights) from the OpenGVLab/InternVL3-1B-hf configuration
>>> model = InternVLForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

InternVLVisionModel

class transformers.InternVLVisionModel

< source >

( config: InternVLVisionConfig )

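Parameters

config (InternVLVisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare InternVL model outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.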

forward

< source >

( pixel_values: Tensor bool_masked_pos: typing.Optional[torch.BoolTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None ) → transformers.models.internvl.modeling_internvl.InternVLVisionModelOutputWithPooling or tuple(torch.FloatTensor)

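Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using {image_processor_class}. See {image_processor_class}.__call__ for details ({processor_class} uses {image_processor_class} for processing images).
bool_masked_pos (torch.BoolTensor of shape (batch_size, num_patches), optional) — Boolean masked positions. Indicates which patches are masked (1) and which are not (0).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

transformers.models.internvl.modeling_internvl.InternVLVisionModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.models.internvl.modeling_internvl.InternVLVisionModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (InternVLConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Average of the last-layer hidden states of the patch tokens (excluding the [CLS] token) if config.use_mean_pooling is set to True. If set to False, the final hidden state of the [CLS] token is returned.
hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.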
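
The two pooling modes described for pooler_output can be illustrated with plain Python lists. This is a minimal sketch rather than the library's implementation; it assumes the [CLS] token sits at index 0 of the sequence, and `pool` is a hypothetical helper name.

```python
def pool(last_hidden_state, use_mean_pooling=True):
    """Pool one image's token vectors; assumes the [CLS] token comes first."""
    cls_token, patch_tokens = last_hidden_state[0], last_hidden_state[1:]
    if not use_mean_pooling:
        # use_mean_pooling=False: return the final hidden state of the [CLS] token.
        return cls_token
    # use_mean_pooling=True: average the patch tokens, excluding [CLS].
    hidden_size = len(cls_token)
    return [sum(tok[i] for tok in patch_tokens) / len(patch_tokens) for i in range(hidden_size)]


# One [CLS] vector followed by two patch vectors, hidden_size = 2 (toy values).
tokens = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]
print(pool(tokens))         # [2.0, 3.0]
print(pool(tokens, False))  # [9.0, 9.0]
```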

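The InternVLVisionModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.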

InternVLModel

class transformers.InternVLModel

< source >

( config: InternVLConfig )

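Parameters

config (InternVLConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The InternVL model, which consists of a vision backbone and a language model, without a language modeling head.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.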

forward

< source >

( input_ids: LongTensor = None pixel_values: FloatTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[list[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None vision_feature_layer: typing.Union[int, list[int], NoneType] = None vision_feature_select_strategy: typing.Optional[str] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.models.internvl.modeling_internvl.InternVLModelOutputWithPast or tuple(torch.FloatTensor)

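Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using {image_processor_class}. See {image_processor_class}.__call__ for details ({processor_class} uses {image_processor_class} for processing images).
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (list[torch.FloatTensor], optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key-value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.
vision_feature_layer (Union[int, list[int], NoneType]) — The index of the layer from which to select the vision feature. If multiple indices are provided, the vision features of the corresponding indices are concatenated to form the vision features.
vision_feature_select_strategy (str, optional) — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
use_cache (bool, optional) — If set to True, past_key_values key-value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Unlike position_ids, this tensor is not affected by padding. It is used to update the cache at the correct position and to infer the complete sequence length.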
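
To make the attention_mask convention concrete (1 for real tokens, 0 for padding), here is a minimal sketch of padding a batch of token-id lists. In practice the processor builds the mask when padding=True is passed, as in the batched examples earlier. The helper name `pad_batch`, the pad id of 0, and the choice of left padding are assumptions made for illustration only.

```python
def pad_batch(sequences, pad_id=0, padding_side="left"):
    """Pad token-id lists to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pad, pad_mask, real_mask = [pad_id] * n_pad, [0] * n_pad, [1] * len(seq)
        if padding_side == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(list(seq) + pad)
            attention_mask.append(real_mask + pad_mask)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```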

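transformers.models.internvl.modeling_internvl.InternVLModelOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.internvl.modeling_internvl.InternVLModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (InternVLConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the model.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). The image_hidden_states of the model, produced by the vision encoder after projecting the last hidden state.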

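The InternVLModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.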

InternVLForConditionalGeneration

class transformers.InternVLForConditionalGeneration

< source >

( config: InternVLConfig )

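Parameters

config (InternVLConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The InternVL model, which consists of a vision backbone and a language model.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.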

forward

< source >

( input_ids: LongTensor = None pixel_values: FloatTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[list[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None vision_feature_layer: typing.Union[int, list[int], NoneType] = None vision_feature_select_strategy: typing.Optional[str] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 image_sizes: typing.Optional[torch.Tensor] = None **kwargs: typing_extensions.Unpack[transformers.models.internvl.modeling_internvl.KwargsForCausalLM] ) → transformers.models.internvl.modeling_internvl.InternVLCausalLMOutputWithPast or tuple(torch.FloatTensor)

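Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using {image_processor_class}. See {image_processor_class}.__call__ for details ({processor_class} uses {image_processor_class} for processing images).
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (list[torch.FloatTensor], optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key-value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.
vision_feature_layer (Union[int, list[int], NoneType]) — The index of the layer from which to select the vision feature. If multiple indices are provided, the vision features of the corresponding indices are concatenated to form the vision features.
vision_feature_select_strategy (str, optional) — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to True, past_key_values key-value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Unlike position_ids, this tensor is not affected by padding. It is used to update the cache at the correct position and to infer the complete sequence length.
logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only the last token's logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or a large vocabulary size. If a torch.Tensor, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using a packed tensor format (a single dimension for batch and sequence length).
image_sizes (torch.Tensor of shape (batch_size, 2), optional) — The sizes of the images in the batch, given as (height, width) for each image.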
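
The three behaviours of logits_to_keep described above can be sketched with a plain Python list standing in for the sequence dimension. This is an illustration of the documented semantics, not the library's code; `select_positions` is a hypothetical helper, and a list of indices stands in for the 1D tensor case.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Pick the sequence positions for which logits would be computed."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        return hidden_states[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit indices along the sequence dimension.
    return [hidden_states[i] for i in logits_to_keep]


positions = ["h0", "h1", "h2", "h3"]
print(select_positions(positions, 0))       # ['h0', 'h1', 'h2', 'h3']
print(select_positions(positions, 1))       # ['h3']
print(select_positions(positions, [0, 2]))  # ['h0', 'h2']
```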

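transformers.models.internvl.modeling_internvl.InternVLCausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.internvl.modeling_internvl.InternVLCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (InternVLConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). The image_hidden_states of the model, produced by the vision encoder after projecting the last hidden state.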
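
The loss is only returned when labels are provided, and positions labelled -100 do not contribute to it. A common pattern, shown here as an assumption rather than something this page prescribes, is to copy the input ids and overwrite the prompt positions with -100 so that only the answer tokens are scored. The helpers `build_labels` and `masked_mean`, the token ids, and the per-token loss values are all made up for illustration, and the next-token shift is left out for brevity.

```python
IGNORE_INDEX = -100  # label value that is excluded from the loss


def build_labels(input_ids, prompt_length):
    """Copy input_ids, masking the first prompt_length positions."""
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


def masked_mean(per_token_loss, labels):
    """Average the per-token losses over positions whose label is not ignored."""
    kept = [loss for loss, label in zip(per_token_loss, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


labels = build_labels([11, 12, 13, 21, 22], prompt_length=3)
print(labels)                                          # [-100, -100, -100, 21, 22]
print(masked_mean([5.0, 5.0, 5.0, 1.0, 3.0], labels))  # 2.0
```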

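The InternVLForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example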

>>> import torch
>>> from transformers import AutoProcessor, AutoModelForImageTextToText

>>> torch_device = "cuda"
>>> processor = AutoProcessor.from_pretrained("OpenGVLab/InternVL3-1B-hf")
>>> model = AutoModelForImageTextToText.from_pretrained(
...     "OpenGVLab/InternVL3-1B-hf", torch_dtype=torch.bfloat16, device_map=torch_device
... )

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
...             },
...             {
...                 "type": "image",
...                 "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg",
...             },
...             {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...         ],
...     },
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device)
>>> generate_ids = model.generate(**inputs, max_new_tokens=200)
>>> print(processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True))
The images depict the Statue of Liberty and the Golden Gate Bridge.

InternVLProcessor

class transformers.InternVLProcessor

< source >

( image_processor = None tokenizer = None video_processor = None image_seq_length: int = 256 chat_template = None **kwargs )

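Parameters

image_processor (AutoImageProcessor, optional) — The image processor is a required input.
tokenizer ([PreTrainedTokenizer, PreTrainedTokenizerFast], optional) — The tokenizer is a required input.
video_processor (AutoVideoProcessor, optional) — The video processor is a required input.
image_seq_length (int, optional, defaults to 256) — The number of image tokens to use per image patch. It should be set so that: image_seq_length = (config.image_size // config.patch_size) ** 2 * (config.scale_factor**2)
chat_template (str, optional) — A Jinja template used to convert lists of messages in a chat into a tokenizable string.

Constructs an InternVL processor which wraps an AutoImageProcessor and a PretrainedTokenizerFast tokenizer into a single processor that inherits both the image processor and tokenizer functionalities. See __call__() and decode() for more information.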
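
As a worked check of the image_seq_length formula, the sketch below plugs in the defaults listed for the configuration classes on this page (image_size 448, patch_size 14). It assumes that `scale_factor` in the formula corresponds to the `downsample_ratio` of InternVLConfig (0.5); that mapping is an assumption, not something stated here.

```python
def image_seq_length(image_size: int, patch_size: int, scale_factor: float) -> int:
    """Number of image tokens per image patch, following the formula above."""
    patches_per_side = image_size // patch_size  # 448 // 14 = 32
    return int(patches_per_side**2 * scale_factor**2)


# Defaults from this page; scale_factor is assumed to equal downsample_ratio.
print(image_seq_length(image_size=448, patch_size=14, scale_factor=0.5))  # 256
```

Under that assumption the result matches the documented default of 256.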

batch_decode

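< source >

( *args **kwargs )

This method forwards all its arguments to PreTrainedTokenizerFast's batch_decode(). Please refer to the docstring of that method for more information.

decode

< source >

( *args **kwargs )

This method forwards all its arguments to PreTrainedTokenizerFast's decode(). Please refer to the docstring of that method for more information.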

InternVLVideoProcessor

class transformers.InternVLVideoProcessor

< source >

( **kwargs: typing_extensions.Unpack[transformers.models.internvl.video_processing_internvl.InternVLVideoProcessorInitKwargs] )

sample_frames

< source >

( video: torch.Tensor metadata: typing.Union[transformers.video_utils.VideoMetadata, dict, NoneType] = None num_frames: typing.Optional[int] = None fps: typing.Optional[int] = None initial_shift: typing.Union[bool, float, int, NoneType] = None ) → torch.Tensor

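Parameters

video (torch.Tensor) — The video to be sampled.
metadata (VideoMetadata, optional) — Metadata of the video, containing information such as total duration, fps, and total number of frames.
num_frames (int, optional) — Maximum number of frames to sample. Defaults to self.num_frames.
fps (int, optional) — Target number of frames to sample per second. Defaults to self.fps.
initial_shift (bool, float or int, defaults to self.initial_shift) — The initial shift to apply when sampling frames. If True, the shift is set so that frames are sampled from the middle of the video.

torch.Tensor

Sampled video frames.

Default sampling function, which uniformly samples the desired number of frames between 0 and the total number of frames. If fps is passed along with metadata, fps frames per second are sampled uniformly. The arguments num_frames and fps are mutually exclusive.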
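
The uniform sampling described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: `sample_frame_indices` is a hypothetical helper that returns frame indices rather than a tensor, `video_fps` stands in for the frame rate read from the metadata, initial_shift=True is read as half a sampling step, and all frames are returned when neither num_frames nor fps is given.

```python
def sample_frame_indices(total_frames, num_frames=None, fps=None, video_fps=None, initial_shift=0):
    """Return uniformly spaced frame indices in [0, total_frames)."""
    if num_frames is not None and fps is not None:
        raise ValueError("num_frames and fps are mutually exclusive")
    if fps is not None:
        if video_fps is None:
            raise ValueError("video_fps (from the metadata) is needed when fps is given")
        # duration in seconds times target frames per second
        num_frames = int(total_frames / video_fps * fps)
    if num_frames is None:
        return list(range(total_frames))  # assumption: no limit means keep every frame
    num_frames = max(1, min(num_frames, total_frames))
    step = total_frames / num_frames
    # Assumption: True shifts by half a step so samples fall mid-interval.
    shift = step / 2 if initial_shift is True else float(initial_shift)
    return [min(int(shift + i * step), total_frames - 1) for i in range(num_frames)]


print(sample_frame_indices(80, num_frames=8))                      # [0, 10, 20, 30, 40, 50, 60, 70]
print(sample_frame_indices(80, num_frames=8, initial_shift=True))  # [5, 15, 25, 35, 45, 55, 65, 75]
```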
