Gemma 3n is now fully available in the open-source ecosystem!

Published June 26, 2025

Aritra Roy Gosthipaty

Christopher Fleetwood

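Gemma 3n was announced as a *preview* at Google I/O. The on-device community was very excited, because this is a model designed from the ground up to **run locally** on your hardware. On top of that, it is natively **multimodal**, supporting image, text, audio, and video inputs 🤯

Today, Gemma 3n is finally available in the most widely used open-source libraries. This includes transformers and timm, MLX, llama.cpp (text inputs), transformers.js, ollama, Google AI Edge, and others.

This post uses practical code snippets to show how to run the model with these libraries, and how easy it is to fine-tune it for other domains.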

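Models released today

Here is the Gemma 3n release collection.

Two model sizes are released today, each in two variants (base and instruct). The model names follow a non-standard nomenclature: they are called `gemma-3n-E2B` and `gemma-3n-E4B`. The `E` preceding the parameter count stands for `Effective`. Their actual parameter counts are `5B` and `8B` respectively, but thanks to improvements in memory efficiency they need only as much VRAM (GPU memory) as 2B and 4B models.

These models therefore behave like 2B and 4B models in terms of hardware requirements, while exceeding 2B/4B models in quality. The `E2B` model can run in as little as 2GB of GPU RAM, and `E4B` in as little as 3GB.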
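To make the naming concrete, the short sketch below takes the total and effective parameter counts quoted above and computes what share of each model's parameters counts toward its effective size. The dictionary and function names are made up for this example; only the four parameter counts come from the text, and real memory use also depends on precision, activations, and cache size.

```python
# Parameter counts (in billions) as quoted in this post.
MODELS = {
    "gemma-3n-E2B": {"total_b": 5, "effective_b": 2},
    "gemma-3n-E4B": {"total_b": 8, "effective_b": 4},
}


def effective_share(name: str) -> float:
    """Fraction of the total parameters that counts toward the effective size."""
    model = MODELS[name]
    return model["effective_b"] / model["total_b"]


for name in MODELS:
    print(f"{name}: {effective_share(name):.0%} of parameters are effective")
```

Under these figures, less than half of the E2B parameters and half of the E4B parameters need to be accounted for on the accelerator.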

Size	Base	Instruct
2B	google/gemma-3n-e2b	google/gemma-3n-e2b-it
4B	google/gemma-3n-e4b	google/gemma-3n-e4b-it

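Model details

In addition to the language decoder, Gemma 3n uses an **audio encoder** and a **vision encoder**. We highlight their main features below and describe how they were added to `transformers` and `timm`, since these serve as the reference for other implementations.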

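Vision encoder (MobileNet-V5). Gemma 3n uses a new version of MobileNet, MobileNet-v5-300, which has been added to the new version of `timm` released today.
- Has 300M parameters.
- Supports resolutions of `256x256`, `512x512`, and `768x768`.
- Achieves 60 FPS on Google Pixel, outperforming ViT Giant while using 3x fewer parameters.

Audio encoder.
- Based on the Universal Speech Model (USM).
- Processes audio in `160ms` chunks.
- Enables speech-to-text and translation (for example, English to Spanish/French).

Gemma 3n architecture and language model. The architecture itself has been added to the new version of `transformers` released today. This implementation calls out to `timm` for image encoding, so there is a single reference implementation of the MobileNet architecture.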
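The `160ms` chunk size of the audio encoder translates into simple arithmetic on clip length. The sketch below is an illustration only: the function names are hypothetical, rounding the last partial chunk up is an assumption, and the 16 kHz sample rate is an assumed value that is not stated in this post.

```python
CHUNK_MS = 160  # chunk length quoted in this post


def num_chunks(duration_ms: int) -> int:
    """Number of 160 ms chunks needed to cover a clip (last chunk may be partial)."""
    return -(-duration_ms // CHUNK_MS)  # ceiling division on integers


def samples_per_chunk(sample_rate_hz: int = 16_000) -> int:
    """Samples in one chunk; the 16 kHz default is an assumption for illustration."""
    return sample_rate_hz * CHUNK_MS // 1000


print(num_chunks(30_000))    # chunks for a 30-second clip
print(samples_per_chunk())   # samples per chunk at the assumed sample rate
```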

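Architecture highlights

MatFormer architecture
- A nested transformer design, similar to Matryoshka embeddings, that allows different subsets of layers to be extracted as if they were standalone models.
- E2B and E4B were trained jointly, with E2B configured as a sub-model of E4B.
- Users can "mix and match" layers depending on their hardware characteristics and memory budget.
Per-Layer Embeddings (PLE): reduce accelerator memory usage by offloading embeddings to the CPU. This is why the E2B model, despite having 5B actual parameters, takes about as much GPU memory as a 2B-parameter model.
KV cache sharing: speeds up long-context processing for audio and video, with 2x faster prefill compared to Gemma 3 4B.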
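The idea of extracting a subset of layers as a standalone model can be pictured with a toy example. The sketch below is not how `transformers` or Gemma 3n implement MatFormer: each "layer" is a stand-in arithmetic function, and `make_layer`, `run`, and `extract_submodel` are hypothetical names chosen for this illustration. It only shows that a smaller model can be defined by selecting layers of a larger one while sharing the same weights.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def make_layer(scale: float, shift: float) -> Layer:
    """A stand-in layer; a real model would use a transformer block here."""
    return lambda x: x * scale + shift


def run(layers: Sequence[Layer], x: float) -> float:
    """Apply the layers in order, as a forward pass would."""
    for layer in layers:
        x = layer(x)
    return x


def extract_submodel(layers: Sequence[Layer], keep: Sequence[int]) -> List[Layer]:
    """Select layers by index and treat the result as a standalone model."""
    return [layers[i] for i in sorted(keep)]


# A toy "large" model with four layers.
full_model = [
    make_layer(2.0, 0.0),
    make_layer(1.0, 1.0),
    make_layer(0.5, 0.0),
    make_layer(1.0, -1.0),
]

# A hypothetical smaller configuration that keeps only layers 0 and 2.
small_model = extract_submodel(full_model, keep=[0, 2])

print(run(full_model, 3.0))
print(run(small_model, 3.0))
```

Because `small_model` holds references to the same layer objects as `full_model`, no weights are copied, which is the property that makes a memory-budget-driven "mix and match" cheap in this toy setting.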

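Performance and benchmarks

LMArena score: E4B is the first sub-10B model to score above 1300.
MMLU scores: Gemma 3n shows competitive performance across sizes (E4B, E2B, and several mix-and-match configurations).
Multilingual support: 140 languages for text interactions and 35 languages for multimodal interactions.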

Demo Space

The easiest way to try the model is the dedicated Hugging Face Space, where you can experiment with different prompts and different modalities.

📱 Space

Inference with transformers

Install the latest versions of timm (for the vision encoder) and transformers to run inference or to fine-tune the model.

pip install -U -q timm
pip install -U -q transformers

Inference with pipeline

The easiest way to get started with Gemma 3n is the pipeline abstraction in transformers:

import torch
from transformers import pipeline

pipe = pipeline(
   "image-text-to-text",
   model="google/gemma-3n-E4B-it", # or "google/gemma-3n-E2B-it"
   device="cuda",
   torch_dtype=torch.bfloat16
)

messages = [
   {
       "role": "user",
       "content": [
           {"type": "image", "url": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
           {"type": "text", "text": "Describe this image"}
       ]
   }
]

output = pipe(text=messages, max_new_tokens=32)
print(output[0]["generated_text"][-1]["content"])

Output

The image shows a futuristic, sleek aircraft soaring through the sky. It's designed with a distinctive, almost alien aesthetic, featuring a wide body and large
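The nested message structure is easy to get wrong by hand, so a small helper can be useful. The sketch below simply reproduces the format from the pipeline example above; `build_image_prompt` is a hypothetical helper name, not part of transformers.

```python
from typing import Any, Dict, List


def build_image_prompt(image_url: str, prompt: str) -> List[Dict[str, Any]]:
    """Build a single-turn chat in the format used by the pipeline example."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_image_prompt(
    "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg",
    "Describe this image",
)
# `messages` can then be passed on as `pipe(text=messages, max_new_tokens=32)`.
```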

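Detailed inference with transformers

Initialize the model and the processor from the Hub, and write a `model_generation` function that processes the prompts and runs inference on the model.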

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "google/gemma-3n-e4b-it" # google/gemma-3n-e2b-it
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).to(device)

def model_generation(model, messages):
    # Apply the chat template and tokenize the (possibly multimodal) messages.
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    # Remember the prompt length so it can be stripped from the output.
    input_len = inputs["input_ids"].shape[-1]

    # Move the inputs to the model's device and match its floating-point dtype.
    inputs = inputs.to(model.device, dtype=model.dtype)

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=32, disable_compile=False)
        # Keep only the newly generated tokens.
        generation = generation[:, input_len:]

    decoded = processor.batch_decode(generation, skip_special_tokens=True)
    print(decoded[0])
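The slice `generation[:, input_len:]` is there because the generated sequence starts with the prompt tokens, followed by the new ones. The sketch below mimics that step with plain Python lists and made-up token ids so it can be run without a model; `strip_prompt` is a hypothetical helper, not a transformers function.

```python
from typing import List


def strip_prompt(generated: List[List[int]], input_len: int) -> List[List[int]]:
    """Drop the first `input_len` token ids from every sequence in the batch."""
    return [seq[input_len:] for seq in generated]


# Toy token ids: the first three stand for the prompt, the rest for the reply.
prompt_ids = [101, 7, 42]
generated = [prompt_ids + [900, 901, 902]]

new_tokens = strip_prompt(generated, input_len=len(prompt_ids))
print(new_tokens)
```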

Since the model supports all modalities as input, here is a brief walkthrough of how to use each of them through transformers.

Text only

# Text Only

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"}
        ]
    }
]
model_generation(model, messages)

Output

The capital of France is **Paris**.

Interleaved with audio

# Interleaved with Audio

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English:"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
        ]
    }
]
model_generation(model, messages)

Output

Send a text to Mike. I'll be home late tomorrow.

Interleaved with image/video

Video is supported as a sequence of image frames.

# Interleaved with Image

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
model_generation(model, messages)

Output

The image shows a futuristic, sleek, white airplane against a backdrop of a clear blue sky transitioning into a cloudy, hazy landscape below. The airplane is tilted at
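Since video is handled as a sequence of image frames, one way to prepare a clip is to sample a few evenly spaced frames and add one image entry per frame. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the frame file names are placeholders, and passing several image entries in one message is assumed to follow the same format as the single-image example. How many frames fit in the context window is not covered by this post.

```python
from typing import Any, Dict, List


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick `num_frames` evenly spaced frame indices from a clip."""
    if num_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


def frames_to_messages(frame_urls: List[str], prompt: str) -> List[Dict[str, Any]]:
    """One image entry per frame, followed by the text prompt."""
    content: List[Dict[str, Any]] = [
        {"type": "image", "image": url} for url in frame_urls
    ]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


indices = sample_frame_indices(total_frames=100, num_frames=4)
# Placeholder file names; replace them with real frame paths or URLs.
frame_urls = [f"frame_{i:03d}.jpg" for i in indices]
messages = frames_to_messages(frame_urls, "Describe what happens in this clip.")
```

The resulting `messages` list has the same shape as the image example and could be passed to `model_generation(model, messages)`.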

Inference with MLX

Gemma 3n comes with day-0 support for MLX across all 3 modalities. Make sure to upgrade your mlx-vlm installation.

pip install -U mlx-vlm

Start with vision:

python -m mlx_vlm.generate --model google/gemma-3n-E4B-it --max-tokens 100 --temperature 0.5 --prompt "Describe this image in detail." --image https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg

And audio:

python -m mlx_vlm.generate --model google/gemma-3n-E4B-it --max-tokens 100 --temperature 0.0 --prompt "Transcribe the following speech segment in English:" --audio https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/audio-samples/jfk.wav
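When the same command has to be run over many files, it can help to build it programmatically. The sketch below only assembles the argument list from the flags shown in the two commands above; `mlx_generate_cmd` is a hypothetical helper, and nothing here is part of the mlx-vlm Python API.

```python
from typing import List, Optional


def mlx_generate_cmd(
    prompt: str,
    image: Optional[str] = None,
    audio: Optional[str] = None,
    model: str = "google/gemma-3n-E4B-it",
    max_tokens: int = 100,
    temperature: float = 0.0,
) -> List[str]:
    """Assemble the `mlx_vlm.generate` command line used in this post."""
    cmd = [
        "python", "-m", "mlx_vlm.generate",
        "--model", model,
        "--max-tokens", str(max_tokens),
        "--temperature", str(temperature),
        "--prompt", prompt,
    ]
    if image is not None:
        cmd += ["--image", image]
    if audio is not None:
        cmd += ["--audio", audio]
    return cmd


cmd = mlx_generate_cmd(
    "Describe this image in detail.",
    image="https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg",
    temperature=0.5,
)
# To execute it: `import subprocess; subprocess.run(cmd, check=True)`.
```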

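Inference with llama.cpp

In addition to MLX, Gemma 3n (text only) works out of the box with llama.cpp. Make sure to install llama.cpp/Ollama from source.

See the llama.cpp installation instructions here: https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md

You can run it like this: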

llama-server -hf ggml-org/gemma-3n-E4B-it-GGUF:Q8_0
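Once the server is running, it can be queried over HTTP through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on its default address, `http://localhost:8080`; adjust `BASE_URL` if you started it with a different host or port. The function names are chosen for this example.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Prepare a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("What is the capital of France?"))` should print the model's text reply.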

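Inference with Transformers.js and ONNXRuntime

Finally, we are also releasing ONNX weights for the gemma-3n-E2B-it model variant, enabling flexible deployment across diverse runtimes and platforms. For JavaScript developers, Gemma3n has been integrated into Transformers.js and is available as of version 3.6.0.

For more information on how to run the model with these libraries, see the usage section of the model card.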

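Fine-tune in a free Google Colab

Given the size of the model, it is quite convenient to fine-tune it for specific downstream tasks across modalities. To make fine-tuning easier, we have created a simple notebook that lets you experiment on a free Google Colab!

We also provide a dedicated notebook for fine-tuning on audio tasks, so you can easily adapt the model to your speech datasets and benchmarks!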

Hugging Face Gemma Recipes

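With this release, we are also introducing the Hugging Face Gemma Recipes repository, where you will find `notebooks` and `scripts` for running and fine-tuning the models.

We would love for you to use the Gemma family of models and add more recipes! Feel free to open Issues and create Pull Requests in the repository.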

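Conclusion

We are always excited to host Google and their Gemma family of models. We hope the community will come together and make the most of these models. Multimodal, small in size, and highly capable: this makes for a great model release!

If you want to discuss the models in more detail, go ahead and start a discussion right below this blog post. We will be more than happy to help!

A huge thanks to Arthur, Cyril, Raushan, Lysandre, and everyone at Hugging Face who took care of the integration and made it available to the community!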

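Community

mrdbourke

Jun 26

Epic release!

Will the MobileNet-v5 weights be added to https://huggingface.co/timm?

Are there any results available on the performance of this model (the vision encoder alone)?

Thanks for all the effort.

rishiraj

Jun 27

Great write-up. For anyone who wants to dig deeper into the architecture, read https://huggingface.co/blog/rishiraj/matformer-in-gemma-3n

evo42

Jun 27

Thanks for the day-0 MLX support 🙏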

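jobesu

Jun 27 (edited Jun 28)

Thanks a lot for this release!
The post mentions that the model "achieves 60 FPS on Google Pixel", so that is with images as input. If I remember correctly, on Google Pixel the model runs on the Google Tensor G4 chip. To run the model on a Qualcomm chip (for example, the QCS8550), my understanding is that we should use the llama.cpp library, but it does not seem to provide the ViT encoder (the post says "llama.cpp (text inputs)"). Is my understanding correct? Or is the ONNX Runtime version the best option on Qualcomm? Basically, my question is: what is the best way to use Gemma on-device with multimodal support?
Thanks.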
