Visual document retrieval

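Documents can contain multimodal data when they include charts, tables, and images alongside text. Retrieving information from these documents is challenging because text-only retrieval models cannot handle visual data, while image retrieval models lack the granularity and document-processing capabilities that documents require.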

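Visual document retrieval can help retrieve information from all types of documents, including as part of multimodal retrieval-augmented generation (RAG). These models accept documents (as images) and text, and compute similarity scores between them.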
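
To make "similarity score" concrete: ColPali-style models are usually described as scoring with late interaction (MaxSim), where every query token is matched against its best page patch and the matches are summed. The sketch below is a pure-Python toy under that assumption, not the library's implementation; the helper names and the 2-dimensional vectors are hypothetical and chosen only for illustration.

```python
# Toy late-interaction (MaxSim) scoring in pure Python.
# Assumption: each query token and each page patch is one small vector;
# real embeddings are much larger and come from the model.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vectors, page_vectors):
    """Sum, over query tokens, of the best dot product with any page patch."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)

# Hypothetical 2-dimensional embeddings, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # has a patch close to each query token
page_b = [[0.5, 0.5], [0.4, 0.4]]  # no patch matches either token well

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)  # page_a scores higher than page_b
```

Because the score is a sum over query tokens, longer queries tend to produce larger raw scores, so scores are best compared across pages for the same query.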

This guide demonstrates how to index and retrieve documents with ColPali.

For large-scale use cases, you may want to index and retrieve documents with a vector database.

Make sure Transformers and Datasets are installed.

pip install -q datasets transformers

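We will index a dataset of documents related to UFO sightings. It contains several columns; we are interested in the specific_detail_query column, which holds a short summary of each document, and the image column, which holds the documents themselves. We filter out examples where specific_detail_query is missing.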

from datasets import load_dataset

dataset = load_dataset("davanstrien/ufo-ColPali")
dataset = dataset["train"]
dataset = dataset.filter(lambda example: example["specific_detail_query"] is not None)
dataset

Dataset({
    features: ['image', 'raw_queries', 'broad_topical_query', 'broad_topical_explanation', 'specific_detail_query', 'specific_detail_explanation', 'visual_element_query', 'visual_element_explanation', 'parsed_into_json'],
    num_rows: 2172
})

Load the model and the processor.

import torch
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.2-hf"

processor = ColPaliProcessor.from_pretrained(model_name)

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()

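Pass the text query to the processor and get the text embeddings back from the model. For image-to-text search, pass images to ColPaliProcessor through the images parameter instead of the text parameter.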

# return_tensors belongs to the processor call, not to the model's forward pass.
inputs = processor(text="a document about Mars expedition", return_tensors="pt").to("cuda")
with torch.no_grad():
    text_embeds = model(**inputs).embeddings

Index the images offline. At inference time, embed the query text and retrieve the image embeddings closest to it.

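Store the images and their embeddings by writing them to the dataset with map, as shown below. This adds an embeddings column that contains the indexed embeddings. ColPali embeddings take up a lot of storage, so move them off the GPU and keep them on the CPU as NumPy arrays.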

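def embed_image(example):
    # Embed one document image and return it as a float32 NumPy array on the CPU.
    inputs = processor(images=example["image"], return_tensors="pt").to("cuda")
    with torch.no_grad():
        embeddings = model(**inputs).embeddings
    return {"embeddings": embeddings.to(torch.float32).cpu().numpy()}

ds_with_embeddings = dataset.map(embed_image)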
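
To get a feel for why storage matters, here is a back-of-the-envelope estimate in plain Python. The figures of roughly 1030 vectors per page and 128 dimensions per vector are assumptions, not values stated in this guide; inspect the shape of your own embeddings before relying on them. The helper name embedding_bytes is hypothetical.

```python
# Rough storage estimate for multi-vector page embeddings.
# Assumptions (not from this guide): about 1030 vectors per page and
# 128 dimensions per vector. Verify against your own embeddings' shape.

def embedding_bytes(num_pages, vectors_per_page, dim, bytes_per_value):
    """Raw size in bytes of dense embeddings for a whole collection."""
    return num_pages * vectors_per_page * dim * bytes_per_value

NUM_PAGES = 2172         # rows in the filtered dataset shown earlier
VECTORS_PER_PAGE = 1030  # assumed
DIM = 128                # assumed

float32_bytes = embedding_bytes(NUM_PAGES, VECTORS_PER_PAGE, DIM, 4)
bfloat16_bytes = embedding_bytes(NUM_PAGES, VECTORS_PER_PAGE, DIM, 2)

print(f"float32:  {float32_bytes / 2**30:.2f} GiB")
print(f"bfloat16: {bfloat16_bytes / 2**30:.2f} GiB")
```

Under these assumptions, even a collection of about two thousand pages needs on the order of a gigabyte in float32, which is one reason larger collections are usually moved to a dedicated index.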

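For online inference, create a function that searches the image embeddings in batches and retrieves the k most relevant images. Given the indexed dataset, a text embedding, the number of top results, and the batch size, the function below returns the indices in the dataset and their scores.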

def find_top_k_indices_batched(dataset, text_embedding, processor, k=10, batch_size=4):
    """Return the indices and scores of the k best-scoring images for one text query."""
    scores_and_indices = []
    # Move the query to the CPU in float32 once, to match the stored embeddings.
    query = text_embedding.to("cpu").to(torch.float32)

    for start_idx in range(0, len(dataset), batch_size):
        end_idx = min(start_idx + batch_size, len(dataset))
        batch = dataset[start_idx:end_idx]
        # Each stored embedding has a leading batch dimension of 1; emb[0] drops it.
        batch_embeddings = [torch.tensor(emb[0], dtype=torch.float32) for emb in batch["embeddings"]]
        scores = processor.score_retrieval(query, batch_embeddings)

        # Scores have one row per query; keep the row for the single query.
        if hasattr(scores, "tolist"):
            scores = scores.tolist()[0]

        for i, score in enumerate(scores):
            scores_and_indices.append((score, start_idx + i))

    # Sort by descending score and keep the top k.
    sorted_results = sorted(scores_and_indices, key=lambda x: -x[0])
    topk = sorted_results[:k]
    indices = [idx for _, idx in topk]
    scores = [score for score, _ in topk]

    return indices, scores
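
The function above collects every score and sorts the full list at the end. As a possible variation (not the method this guide uses), a bounded heap keeps only the current k best candidates while streaming over batches, which may matter for very large collections. The sketch below uses only the standard library; top_k_from_batches and the toy scores are hypothetical.

```python
import heapq

def top_k_from_batches(score_batches, k=3):
    """Keep the k best (score, index) pairs while streaming over score batches."""
    best = []  # min-heap holding at most k (score, index) pairs
    index = 0
    for batch in score_batches:
        for score in batch:
            if len(best) < k:
                heapq.heappush(best, (score, index))
            elif (score, index) > best[0]:
                # The new pair beats the weakest kept pair, so swap it in.
                heapq.heapreplace(best, (score, index))
            index += 1
    ranked = sorted(best, reverse=True)  # best score first
    return [i for _, i in ranked], [s for s, _ in ranked]

# Hypothetical scores for 7 pages, arriving in batches of up to 3.
toy_batches = [[2.0, 9.5, 1.0], [7.25, 3.0, 8.0], [0.5]]
toy_indices, toy_scores = top_k_from_batches(toy_batches, k=3)
print(toy_indices, toy_scores)  # [1, 5, 3] [9.5, 8.0, 7.25]
```

Memory for candidates stays proportional to k rather than to the dataset size. Note that pages with exactly equal scores may come back in a different order than with a stable full sort.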

Generate the text embeddings and pass them to the function above to get the dataset indices and scores.

query_inputs = processor(text="a document about Mars expedition", return_tensors="pt").to("cuda")
with torch.no_grad():
    text_embeds = model(**query_inputs).embeddings
indices, scores = find_top_k_indices_batched(ds_with_embeddings, text_embeds, processor, k=3, batch_size=4)
print(indices, scores)

[440, 442, 443] [14.370786666870117, 13.675487518310547, 12.9899320602417]

Display the images to view the Mars-related documents.

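from IPython.display import display  # already available in notebooks; imported explicitly for clarity

for i in indices:
    display(dataset[i]["image"])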
