Qwen2.5-Omni

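Overview

The Qwen2.5-Omni model is a unified multimodal model proposed in the Qwen2.5-Omni Technical Report by the Qwen team, Alibaba Group.

The abstract from the technical report is the following:

We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders use a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoders and entrusting the modeling of long sequences to a large language model. Such a division of labor enhances the fusion of different modalities through the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequences in an interleaved manner and propose a novel position embedding approach named TMRoPE (Time-aligned Multimodal RoPE). To generate text and speech concurrently while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly uses the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and run for inference end to end. To decode audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial packet delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model whose end-to-end speech instruction following reaches a level of performance comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.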

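Notes

Use Qwen2_5OmniForConditionalGeneration to generate both audio and text output. To generate only one output type, use Qwen2_5OmniThinkerForConditionalGeneration for text-only output and Qwen2_5OmniTalkerForConditionalGeneration for audio-only output.
Audio generation with Qwen2_5OmniForConditionalGeneration currently supports only a batch size of 1.
If you hit out-of-memory errors when processing video input, decrease processor.max_pixels. By default the maximum is set to a very large value, and high-resolution visuals are not resized unless their resolution exceeds processor.max_pixels.
The processor has its own apply_chat_template() method to convert chat messages to model inputs.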

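Usage example

Qwen2.5-Omni can be found on the Hugging Face Hub.

Single Media inference

The model can accept text, images, audio, and videos as input. Here is example code for inference.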

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversations = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

# Generation params for audio or text can be different and have to be prefixed with `thinker_` or `talker_`
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, thinker_do_sample=False, talker_do_sample=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
print(text)
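
To make the `thinker_` / `talker_` prefix convention concrete, here is a minimal sketch of how prefixed keyword arguments could be separated into per-sub-model settings. The helper `split_generation_kwargs` is hypothetical and only illustrates the naming scheme; it is not the library's actual routing code.

```python
def split_generation_kwargs(kwargs):
    """Separate prefixed generation kwargs (hypothetical helper, for illustration)."""
    thinker, talker, shared = {}, {}, {}
    for key, value in kwargs.items():
        if key.startswith("thinker_"):
            # Strip the prefix so the Thinker sees e.g. `do_sample`
            thinker[key[len("thinker_"):]] = value
        elif key.startswith("talker_"):
            # Strip the prefix so the Talker sees e.g. `do_sample`
            talker[key[len("talker_"):]] = value
        else:
            # Unprefixed arguments are kept apart as shared settings
            shared[key] = value
    return thinker, talker, shared


thinker_kwargs, talker_kwargs, shared_kwargs = split_generation_kwargs(
    {"thinker_do_sample": False, "talker_do_sample": True, "use_audio_in_video": True}
)
print(thinker_kwargs, talker_kwargs, shared_kwargs)
```

Under this reading, the call above asks for greedy decoding of the text while sampling the audio tokens.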

Text-only generation

To generate only text output and save compute by not loading the audio generation model, use the Qwen2_5OmniThinkerForConditionalGeneration model.

from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversations = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)


text_ids = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

# The Thinker model returns text only, so there is no audio to save
print(text)

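Batch Mixed Media Inference

When using the Qwen2_5OmniThinkerForConditionalGeneration model, the model can batch inputs composed of mixed samples of various types, such as text, images, audio, and videos. Here is an example.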

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Conversation with video only
conversation1 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
        ]
    }
]

# Conversation with audio only
conversation2 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "/path/to/audio.wav"},
        ]
    }
]

# Conversation with pure text
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "who are you?"}],
    }
]


# Conversation with mixed media
conversation4 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image.jpg"},
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "audio", "path": "/path/to/audio.wav"},
            {"type": "text", "text": "What elements can you see and hear in these media?"},
        ],
    }
]

conversations = [conversation1, conversation2, conversation3, conversation4]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.thinker.device)

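# Audio generation supports only a batch size of 1, so request text output only
text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)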
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)

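Usage Tips

Image Resolution trade-off

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels to find the configuration that best fits their needs.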

from transformers import AutoProcessor

min_pixels = 128*28*28
max_pixels = 768*28*28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min_pixels, max_pixels=max_pixels)
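
As a rough guide to what these limits mean, the sketch below assumes that each visual token covers a 28×28 pixel area (which the `128*28*28` and `768*28*28` values above suggest) and approximates how an oversized image might be scaled down to fit `max_pixels`. The function `fit_to_pixel_budget` is hypothetical, and the processor's real resizing logic may differ in detail.

```python
import math

PATCH = 28  # assumed side length, in pixels, of the area covered by one visual token


def fit_to_pixel_budget(height, width, min_pixels, max_pixels, factor=PATCH):
    """Approximate a resize that keeps the aspect ratio within the pixel budget."""
    # Snap both sides to the nearest multiple of the patch size
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


min_pixels = 128 * 28 * 28
max_pixels = 768 * 28 * 28
h, w = fit_to_pixel_budget(1080, 1920, min_pixels, max_pixels)
tokens = (h * w) // (PATCH * PATCH)
print(h, w, tokens)  # resized height, resized width, approximate visual token count
```

Lowering `max_pixels` reduces the approximate token count per image or video frame, which is why it helps with out-of-memory errors.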

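Prompt for audio output

If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."; otherwise the audio output may not work as expected.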

{
    "role": "system",
    "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
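
One way to avoid retyping the prompt is a small helper that prepends it to every conversation. The sketch below is illustrative only: `with_audio_system_prompt` and `AUDIO_SYSTEM_PROMPT` are hypothetical names, and the message layout follows the list-of-parts form used in the inference examples above.

```python
AUDIO_SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)


def with_audio_system_prompt(user_content):
    """Build a conversation that starts with the system prompt required for audio output."""
    return [
        {"role": "system", "content": [{"type": "text", "text": AUDIO_SYSTEM_PROMPT}]},
        {"role": "user", "content": user_content},
    ]


conversation = with_audio_system_prompt([{"type": "text", "text": "who are you?"}])
print(conversation[0]["content"][0]["text"])
```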

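Use audio output or not

The model supports both text and audio output. If users do not need audio output, they can set enable_audio_output=False in the from_pretrained function. This option saves about 2GB of GPU memory, but the return_audio option of the generate function can then only be set to False.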

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=False,
)

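For a more flexible setup, we recommend setting enable_audio_output=True when initializing the model with from_pretrained, and then deciding whether to return audio when calling generate. When return_audio is set to False, the model returns only text output, which gives faster text responses.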

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=True,
)
...
text_ids = model.generate(**inputs, return_audio=False)
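
Because the examples in this document show `generate` returning a `(text_ids, audio)` pair when audio is produced and `text_ids` alone when it is not, calling code that toggles `return_audio` at run time may want to normalize the result. The following is a minimal sketch under that assumption; `split_generate_output` is a hypothetical helper, and the placeholder values stand in for real tensors.

```python
def split_generate_output(output):
    """Return (text_ids, audio), with audio set to None for text-only output."""
    if isinstance(output, tuple):
        # Audio was requested: the result is a (text_ids, audio) pair
        text_ids, audio = output
        return text_ids, audio
    # Text-only generation: the result is just the token IDs
    return output, None


# Placeholder values standing in for model.generate(...) results
text_ids, audio = split_generate_output(([[1, 2, 3]], [0.0, 0.1]))
text_only_ids, no_audio = split_generate_output([[1, 2, 3]])
print(audio is not None, no_audio is None)
```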

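Change voice type of output audio

Qwen2.5-Omni supports changing the voice of the output audio. Users can specify the voice type with the spk parameter of the generate function. The "Qwen/Qwen2.5-Omni-7B" checkpoint supports two voice types: Chelsie, a female voice, and Ethan, a male voice. If spk is not specified, the voice type defaults to Chelsie.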

text_ids, audio = model.generate(**inputs, spk="Chelsie")

text_ids, audio = model.generate(**inputs, spk="Ethan")
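
If the voice name comes from user input, it can be checked before calling `generate`. The sketch below is a hypothetical helper (`resolve_speaker` is not part of the library) that encodes the two voices and the default described above for the "Qwen/Qwen2.5-Omni-7B" checkpoint; other checkpoints may ship different voices.

```python
SUPPORTED_SPEAKERS = ("Chelsie", "Ethan")  # voices listed for Qwen/Qwen2.5-Omni-7B
DEFAULT_SPEAKER = "Chelsie"


def resolve_speaker(spk=None):
    """Return a valid voice name, falling back to the default when none is given."""
    if spk is None:
        return DEFAULT_SPEAKER
    if spk not in SUPPORTED_SPEAKERS:
        raise ValueError(
            f"Unknown voice {spk!r}; expected one of {', '.join(SUPPORTED_SPEAKERS)}"
        )
    return spk


print(resolve_speaker(), resolve_speaker("Ethan"))
```

The result can then be passed on, for example as `model.generate(**inputs, spk=resolve_speaker(name))`.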

Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

pip install -U flash-attn --no-build-isolation

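You also need hardware that is compatible with FlashAttention 2. See the official documentation of the Flash Attention repository for details. FlashAttention-2 can only be used when the model is loaded in torch.float16 or torch.bfloat16.

To load and run a model using FlashAttention-2, add attn_implementation="flash_attention_2" when loading the model: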

import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
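
When the same script has to run on machines with and without the `flash-attn` package, the choice can be made at run time. This is a minimal sketch, not an official recipe: `pick_attn_implementation` is a hypothetical helper, it falls back to `"sdpa"` (another value Transformers accepts for `attn_implementation`; treating it as the right fallback for this model is an assumption), and it does not check hardware compatibility.

```python
import importlib.util


def pick_attn_implementation(dtype_name):
    """Choose an attention backend from the installed packages and the load dtype."""
    # FlashAttention-2 requires the flash_attn package to be installed...
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    # ...and the model to be loaded in half precision
    half_precision = dtype_name in ("float16", "bfloat16")
    if has_flash_attn and half_precision:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation("bfloat16"))
```

The returned string can be passed as `attn_implementation=...` to `from_pretrained`.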

Qwen2_5OmniConfig

class transformers.Qwen2_5OmniConfig

( thinker_config = None talker_config = None token2wav_config = None enable_audio_output: bool = True **kwargs )

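Parameters

thinker_config (dict, optional) — Configuration of the underlying thinker sub-model.
talker_config (dict, optional) — Configuration of the underlying talker sub-model.
token2wav_config (dict, optional) — Configuration of the underlying codec sub-model.
enable_audio_output (bool, optional, defaults to True) — Whether to enable audio output and load the talker and token2wav modules.

This is the configuration class to store the configuration of a Qwen2_5OmniForConditionalGeneration. It is used to instantiate a Qwen2.5Omni model according to the specified sub-model configurations, defining the model architecture.

Instantiating a configuration with the defaults yields a configuration similar to that of the Qwen/Qwen2.5-Omni-7B architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example: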

>>> from transformers import (
...     Qwen2_5OmniThinkerConfig,
...     Qwen2_5OmniTalkerConfig,
...     Qwen2_5OmniToken2WavConfig,
...     Qwen2_5OmniForConditionalGeneration,
...     Qwen2_5OmniConfig,
... )

>>> # Initializing sub-modules configurations.
>>> thinker_config = Qwen2_5OmniThinkerConfig()
>>> talker_config = Qwen2_5OmniTalkerConfig()
>>> token2wav_config = Qwen2_5OmniToken2WavConfig()


>>> # Initializing a module style configuration
>>> configuration = Qwen2_5OmniConfig.from_sub_model_configs(
...     thinker_config, talker_config, token2wav_config
... )

>>> # Initializing a model (with random weights)
>>> model = Qwen2_5OmniForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Qwen2_5OmniProcessor

class transformers.Qwen2_5OmniProcessor

( image_processor = None video_processor = None feature_extractor = None tokenizer = None chat_template = None )

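Parameters

image_processor (Qwen2VLImageProcessor, optional) — The image processor.
video_processor (Qwen2VLVideoProcessor, optional) — The video processor.
feature_extractor (WhisperFeatureExtractor, optional) — The audio feature extractor.
tokenizer (Qwen2TokenizerFast, optional) — The text tokenizer.
chat_template (Optional[str], optional) — The Jinja template used to format conversations. If not provided, the default chat template is used.

Constructs a Qwen2.5Omni processor. Qwen2_5OmniProcessor offers all the functionality of Qwen2VLImageProcessor, WhisperFeatureExtractor, and Qwen2TokenizerFast. See __call__() and decode() for more information.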

batch_decode

( *args **kwargs )

This method forwards all its arguments to Qwen2TokenizerFast's batch_decode(). Please refer to the docstring of that method for more information.

decode

( *args **kwargs )

This method forwards all its arguments to Qwen2TokenizerFast's decode(). Please refer to the docstring of that method for more information.

get_chunked_index

( token_indices: ndarray tokens_per_chunk: int ) → list[tuple[int, int]]

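Parameters

token_indices (np.ndarray) — A monotonically increasing list of token index values.
tokens_per_chunk (int) — Number of tokens per chunk (used as the chunk size threshold).

Returns

list[tuple[int, int]]

A list of tuples, each representing the start (inclusive) and end (exclusive) indices of a chunk in token_indices.

Splits a list of token indices into chunks based on token value ranges.

Given a list of token indices, returns a list of (start, end) index tuples representing slices of the list where the token values fall within successive ranges of tokens_per_chunk.

For example, if tokens_per_chunk is 1000, the function creates chunks such that:

the first chunk contains token values < 1000,
the second chunk contains values >= 1000 and < 2000, and so on.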
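
The chunking rule described above can be sketched as a standalone function. This is an illustrative reimplementation written from the description, not the library's source; the name `chunk_by_value_range` is hypothetical, and a plain Python list stands in for the `np.ndarray` input.

```python
def chunk_by_value_range(token_indices, tokens_per_chunk):
    """Split monotonically increasing values into (start, end) slices per value range."""
    chunks = []
    start = 0
    current_chunk = 1  # the first chunk covers values below 1 * tokens_per_chunk
    for i, value in enumerate(token_indices):
        if value >= current_chunk * tokens_per_chunk:
            # The value crossed into the next range: close the current slice
            chunks.append((start, i))
            start = i
            current_chunk += 1
    # The remaining values form the final slice
    chunks.append((start, len(token_indices)))
    return chunks


values = [0, 400, 900, 1000, 1500, 1999, 2000, 2500]
chunks = chunk_by_value_range(values, 1000)
print(chunks)
```

Each tuple indexes into the input list, so `values[start:end]` recovers the token values of one chunk.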

Qwen2_5OmniForConditionalGeneration

class transformers.Qwen2_5OmniForConditionalGeneration

( config )

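Parameters

config (Qwen2_5OmniConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The full Qwen2.5Omni model, a multimodal model composed of 3 sub-models:

Qwen2_5OmniThinkerForConditionalGeneration: a causal autoregressive transformer that takes text, audio, image, and video as input and predicts text tokens.
Qwen2_5OmniTalkerForConditionalGeneration: a causal autoregressive transformer that takes the thinker's hidden states and response as input and predicts speech tokens.
Qwen2_5OmniToken2WavModel: a DiT model that takes speech tokens as input and predicts a mel spectrogram, and a BigVGAN vocoder that takes the mel spectrogram as input and predicts the waveform.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.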
