Image tasks with IDEFICS
While individual tasks can be tackled by fine-tuning specialized models, an alternative approach that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. For instance, large language models can handle NLP tasks such as summarization, translation, classification, and more. This approach is no longer limited to a single modality, such as text, and in this guide we will illustrate how to solve image-text tasks with a large multimodal model called IDEFICS.
IDEFICS is an open-access vision and language model based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and so on. IDEFICS comes in two variants - 80 billion parameters and 9 billion parameters, both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed versions of the model adapted for conversational use cases.
This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether this approach suits your use case better than fine-tuning specialized models for each individual task.
In this guide, you'll learn how to:
- load IDEFICS and load the quantized version of the model
- use IDEFICS for image captioning, prompted image captioning, few-shot prompting, visual question answering, image classification, and image-guided text generation
- run inference in batch mode
- run IDEFICS instruct for conversational use
Before you begin, make sure you have all the necessary libraries installed.
pip install -q bitsandbytes sentencepiece accelerate transformers
Loading the model
Let's start by loading the model's 9 billion parameter checkpoint:
>>> checkpoint = "HuggingFaceM4/idefics-9b"
Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. The IDEFICS processor wraps a LlamaTokenizer and the IDEFICS image processor into a single processor to take care of preparing text and image inputs for the model.
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
Setting device_map to "auto" will automatically determine how to load and store the model weights in the most optimized manner given the existing devices.
Quantized model
If high-memory GPU availability is an issue, you can load the quantized version of the model. To load the model and the processor in 4-bit precision, pass a BitsAndBytesConfig to the from_pretrained method and the model will be compressed on the fly while loading.
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig
>>> quantization_config = BitsAndBytesConfig(
... load_in_4bit=True,
... bnb_4bit_compute_dtype=torch.float16,
... )
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> model = IdeficsForVisionText2Text.from_pretrained(
... checkpoint,
... quantization_config=quantization_config,
... device_map="auto"
... )
Now that you have the model loaded in one of the suggested ways, let's move on to exploring the tasks that you can use IDEFICS for.
Image captioning
Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired people navigate through different situations, for instance, explore image content online.
To illustrate the task, get an image to be captioned, e.g.:

Photo by Hendo Wang.
IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the model, only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token, thus creating a caption.
As image input to the model, you can use either an image object (PIL.Image) or a URL from which the image can be retrieved.
>>> prompt = [
... "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]
>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
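Alternatively, you can pass an image object instead of a URL. A minimal sketch, assuming the requests and Pillow libraries are available, that downloads the same example image and hands the PIL.Image to the processor:
>>> import requests
>>> from PIL import Image

>>> url = "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80"
>>> # download the image and open it as a PIL.Image object
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> # the processor accepts image objects in place of URLs
>>> inputs = processor([image], return_tensors="pt").to("cuda")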
It is a good idea to include the bad_words_ids in the call to generate to avoid errors arising when increasing max_new_tokens: the model will want to generate a new <image> or <fake_token_around_image> token when there is no image being generated by the model. You can set it on-the-fly as in this guide, or store it in the GenerationConfig as described in the Text generation strategies guide.
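For example, a minimal sketch of storing these settings in a GenerationConfig, reusing the bad_words_ids computed above (the specific values are illustrative assumptions):
>>> from transformers import GenerationConfig

>>> generation_config = GenerationConfig(
...     max_new_tokens=10,
...     bad_words_ids=bad_words_ids,
... )
>>> # pass the config instead of individual generation keyword arguments
>>> generated_ids = model.generate(**inputs, generation_config=generation_config)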
Prompted image captioning
You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:

Photo by Denys Nevozhai.
Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs.
>>> prompt = [
... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
... "This is an image of ",
... ]
>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.
Few-shot prompting
While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning. By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.
Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model that, in addition to learning what the object in an image is, we would also like to get some interesting information about it. Then, let's see if we can get the same response format for an image of the Statue of Liberty:

Photo by Juan Mayobre.
>>> prompt = ["User:",
... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
... "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
... "User:",
... "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
... "Describe this image.\nAssistant:"
... ]
>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.
Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.).
Visual question answering
Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.
Let's get a new image for this task:

Photo by Jarritos Mexican Soda.
You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:
>>> prompt = [
... "Instruction: Provide an answer to the question. Use the image to answer.\n",
... "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
... "Question: Where are these people and what's the weather like? Answer:"
... ]
>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
Image classification
IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories and using its image and text understanding capabilities, the model can infer which category the image likely belongs to.
Say, we have this image of a vegetable stand:

Photo by Peter Wendt.
We can instruct the model to classify the image into one of the categories that we have:
>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
... "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
... "Category: "
... ]
>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables
In the example above we instruct the model to classify the image into a single category; however, you can also prompt the model to do rank classification.
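As an illustration, a hypothetical ranking-style prompt (the exact wording below is an assumption for demonstration, not taken from the original example) could ask the model to order the candidate categories by relevance:
>>> # hypothetical prompt asking for a ranking rather than a single label
>>> prompt = [f"Instruction: Rank the following categories from most to least relevant to the image: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Ranking: "
... ]
>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])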
Image-guided text generation
For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful to create descriptions of products, ads, descriptions of a scene, etc.
Let's prompt IDEFICS to write a story based on a simple image of a red door:

Photo by Craig Tidball.
>>> prompt = ["Instruction: Use the image to write a story. \n",
... "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
... "Story: \n"]
>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Use the image to write a story.
Story:
Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world.
One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat.
The little girl ran inside and told her mother about the man.
Her mother said, “Don’t worry, honey. He’s just a friendly ghost.”
The little girl wasn’t sure if she believed her mother, but she went outside anyway.
When she got to the door, the man was gone.
The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.
He was wearing a long black coat and a top hat.
The little girl ran
Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.
For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out Text generation strategies to learn more.
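For instance, a minimal sketch of switching from beam search to sampling (the temperature and top_p values here are illustrative assumptions, not recommendations from the guide):
>>> # sample instead of searching; parameter values are for illustration only
>>> generated_ids = model.generate(
...     **inputs,
...     do_sample=True,
...     temperature=0.7,
...     top_p=0.9,
...     max_new_tokens=200,
...     bad_words_ids=bad_words_ids,
... )
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])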
Running inference in batch mode
All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts:
>>> prompts = [
... [ "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
... "This is an image of ",
... ],
... [ "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
... "This is an image of ",
... ],
... [ "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
... "This is an image of ",
... ],
... ]
>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
... print(f"{i}:\n{t}\n")
0:
This is an image of the Eiffel Tower in Paris, France.
1:
This is an image of a couple on a picnic blanket.
2:
This is an image of a vegetable stand.
IDEFICS instruct for conversational use
For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct.
These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.
The use and prompting for conversational use is very similar to using the base models:
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor
>>> from accelerate.test_utils.testing import get_backend
>>> device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> prompts = [
... [
... "User: What is in this image?",
... "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
... "<end_of_utterance>",
... "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
... "\nUser:",
... "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
... "And who is that?<end_of_utterance>",
... "\nAssistant:",
... ],
... ]
>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)
>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
... print(f"{i}:\n{t}\n")