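Document Question Answering

Document Question Answering, also referred to as Document Visual Question Answering, is a task that involves answering questions posed about document images. The input to models supporting this task is typically a combination of an image and a question, and the output is an answer expressed in natural language. These models use multiple modalities, including text, the positions of words (bounding boxes), and the image itself.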

This guide illustrates how to:

Fine-tune LayoutLMv2 on the DocVQA dataset.
Use your fine-tuned model for inference.

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

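LayoutLMv2 solves the document question-answering task by adding a question-answering head on top of the final hidden states of the tokens, to predict the positions of the start and end tokens of the answer. In other words, the problem is treated as extractive question answering: given the context, extract the piece of information that answers the question. The context comes from the output of an OCR engine, here Google's Tesseract.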
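
To make the idea of predicting start and end token positions concrete, here is a minimal sketch in plain Python. The token list and the scores are hand-written for illustration and are not real model output; only the slicing logic is the point.

```python
# Toy illustration of extractive QA: the head produces one start score and one
# end score per token, and the answer is the token span between the two maxima.
tokens = ["[CLS]", "who", "wrote", "it", "?", "[SEP]",
          "from", ":", "c", ".", "j", ".", "cook", "[SEP]"]

# Hypothetical scores, invented for this example (not produced by a model).
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.1, 3.5, 0.3, 0.4, 0.1, 0.9, 0.0]
end_scores = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.2, 0.1, 0.3, 0.2, 4.2, 0.5]


def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


start_idx = argmax(start_scores)
end_idx = argmax(end_scores)

# The end index is inclusive, hence the + 1 in the slice.
answer_tokens = tokens[start_idx : end_idx + 1]
print(start_idx, end_idx, answer_tokens)
```

The labels prepared later in this guide (`start_positions` and `end_positions`) are the training targets for exactly these two indices.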

Before you begin, make sure you have all the necessary libraries installed. LayoutLMv2 depends on detectron2, torchvision and tesseract.

pip install -q transformers datasets

pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install torchvision

sudo apt install tesseract-ocr
pip install -q pytesseract

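Once you have installed all of the dependencies, restart your runtime.

We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the 🤗 Hub. When prompted, enter your token to log in: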

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Let's define some global variables.

>>> model_checkpoint = "microsoft/layoutlmv2-base-uncased"
>>> batch_size = 4

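Load the data

In this guide we use a small sample of preprocessed DocVQA that you can find on the 🤗 Hub. If you'd like to use the full DocVQA dataset, you can register and download it from the DocVQA homepage. If you do so, check out how to load files into a 🤗 dataset before proceeding with this guide.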

>>> from datasets import load_dataset

>>> dataset = load_dataset("nielsr/docvqa_1200_examples")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 200
    })
})

As you can see, the dataset is already split into train and test sets. Inspect the features to familiarize yourself with them.

>>> dataset["train"].features

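Here's what the individual fields represent:

id: the example's ID
image: a PIL.Image.Image object containing the document image
query: the question string, a natural-language question available in several languages
answers: a list of correct answers provided by human annotators
words and bounding_boxes: the results of OCR, which we will not use here
answer: an answer matched by a different model, which we will not use here

Let's keep only the English questions and drop the `answer` feature, which contains predictions by another model. We'll also take the first answer from the set provided by the annotators. Alternatively, you can sample one at random.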

>>> updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"])
>>> updated_dataset = updated_dataset.map(
...     lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"]
... )
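
The random-sampling alternative mentioned above could look like the following sketch. The helper name `pick_answer` and the fixed seed are assumptions made for illustration; they are not part of the guide or of 🤗 Datasets.

```python
import random


def pick_answer(answers, rng=None):
    """Return the first answer when rng is None, otherwise a randomly chosen one."""
    if not answers:
        raise ValueError("expected at least one annotated answer")
    if rng is None:
        return answers[0]
    return rng.choice(answers)


# A seeded generator keeps the sampling reproducible across runs.
rng = random.Random(0)
answers = ["TRRF Vice President", "lee a. waller"]

first = pick_answer(answers)         # same behaviour as example["answers"][0]
sampled = pick_answer(answers, rng)  # one of the annotated answers
print(first, "|", sampled)
```

Under these assumptions, the helper could replace `example["answers"][0]` inside the second `map` call.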

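Note that the LayoutLMv2 checkpoint used in this guide has been trained with `max_position_embeddings = 512` (you can find this information in the checkpoint's `config.json` file). We could truncate the examples, but to avoid the situation where the answer sits at the end of a long document and ends up truncated, here we remove the few examples whose embedding is likely to be longer than 512. If most of the documents in your dataset are long, you can implement a sliding window strategy; check out this notebook for details.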

>>> updated_dataset = updated_dataset.filter(lambda x: len(x["words"]) + len(x["question"].split()) < 512)
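
For readers curious about the sliding window strategy, here is a minimal sketch over a plain list of token ids. It is not the notebook's implementation: the function name, the default sizes, and the convention that `stride` means the number of overlapping tokens are all assumptions of this example.

```python
def sliding_windows(token_ids, window_size=512, stride=128):
    """Split token_ids into overlapping windows.

    `stride` is the number of tokens shared by two consecutive windows, so an
    answer near a window boundary still appears whole in at least one window
    (as long as it is no longer than the overlap).
    Returns a list of (start_offset, window) pairs.
    """
    if not 0 <= stride < window_size:
        raise ValueError("stride must satisfy 0 <= stride < window_size")
    step = window_size - stride
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start : start + window_size]))
        if start + window_size >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


# Ten toy tokens, windows of four tokens overlapping by two.
windows = sliding_windows(list(range(10)), window_size=4, stride=2)
for offset, window in windows:
    print(offset, window)
```

Each window would then be encoded and scored separately, and the start offset lets you map a predicted span back to the full document.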

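At this point let's also remove the OCR features from this dataset. They are the result of OCR performed for fine-tuning a different model. They would still require some processing if we wanted to use them, as they do not match the input requirements of the model used in this guide. Instead, we can use the LayoutLMv2Processor on the original data for both OCR and tokenization. This way we get inputs that match what the model expects. If you want to process images manually, check out the `LayoutLMv2` model documentation to learn what input format the model expects.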

>>> updated_dataset = updated_dataset.remove_columns("words")
>>> updated_dataset = updated_dataset.remove_columns("bounding_boxes")

Finally, the data exploration won't be complete without a peek at an image example.

>>> updated_dataset["train"][11]["image"]

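Preprocess the data

Document Question Answering is a multimodal task, and you need to make sure that the inputs from each modality are preprocessed according to the model's expectations. Let's start by loading the LayoutLMv2Processor, which internally combines an image processor that can handle image data and a tokenizer that can encode text data.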

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained(model_checkpoint)

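Preprocessing document images

First, let's prepare the document images for the model with the help of the `image_processor` from the processor. By default, the image processor resizes the images to 224x224, makes sure they have the correct order of color channels, and applies OCR with Tesseract to get words and normalized bounding boxes. In this tutorial, all of these defaults are exactly what we need. Write a function that applies the default image processing to a batch of images and returns the results of OCR.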

>>> image_processor = processor.image_processor


>>> def get_ocr_words_and_boxes(examples):
...     images = [image.convert("RGB") for image in examples["image"]]
...     encoded_inputs = image_processor(images)

...     examples["image"] = encoded_inputs.pixel_values
...     examples["words"] = encoded_inputs.words
...     examples["boxes"] = encoded_inputs.boxes

...     return examples

To apply this preprocessing to the entire dataset quickly, use map.

>>> dataset_with_ocr = updated_dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2)
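
The image processor already returns normalized bounding boxes, so nothing extra is needed here. Purely as an illustration of what "normalized" means, the sketch below rescales a pixel-space box to the 0-1000 coordinate range that LayoutLM-family models expect. The function name and the integer rounding are choices made for this example, not the library's internal code.

```python
def normalize_box(box, width, height):
    """Rescale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range.

    Integer arithmetic is used so the result does not depend on
    floating-point rounding.
    """
    x0, y0, x1, y1 = box
    return [
        (1000 * x0) // width,
        (1000 * y0) // height,
        (1000 * x1) // width,
        (1000 * y1) // height,
    ]


# A hypothetical word box on a 500x1000 pixel page.
normalized = normalize_box((50, 100, 150, 200), width=500, height=1000)
print(normalized)
```

Because the coordinates are relative to the page size, the same layout yields the same boxes regardless of the scan resolution.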

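Preprocessing text data

Once we have applied OCR to the images, we need to encode the text part of the dataset to prepare it for the model. This involves converting the words and boxes obtained in the previous step to token-level `input_ids`, `attention_mask`, `token_type_ids` and `bbox`. For preprocessing text, we need the `tokenizer` from the processor.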

>>> tokenizer = processor.tokenizer

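On top of the preprocessing mentioned above, we also need to add the labels for the model. For `xxxForQuestionAnswering` models in 🤗 Transformers, the labels consist of `start_positions` and `end_positions`, indicating which token is at the start of the answer and which token is at the end.

Let's start with that. Define a helper function that can find a sublist (the answer split into words) in a larger list (the words list).

This function takes two lists as input, `words_list` and `answer_list`. It iterates over `words_list` and checks whether the current word in `words_list` (words_list[i]) is equal to the first word of `answer_list` (answer_list[0]) and whether the sublist of `words_list` that starts at the current word and has the same length as `answer_list` is equal to `answer_list`. If this condition is true, a match has been found, and the function records the match, its starting index (idx), and its ending index (idx + len(answer_list) - 1). If more than one match is found, the function returns only the first one. If no match is found, the function returns (`None`, 0, 0).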

>>> def subfinder(words_list, answer_list):
...     matches = []
...     start_indices = []
...     end_indices = []
...     for idx, i in enumerate(range(len(words_list))):
...         if words_list[i] == answer_list[0] and words_list[i : i + len(answer_list)] == answer_list:
...             matches.append(answer_list)
...             start_indices.append(idx)
...             end_indices.append(idx + len(answer_list) - 1)
...     if matches:
...         return matches[0], start_indices[0], end_indices[0]
...     else:
...         return None, 0, 0
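
As a quick sanity check of the two documented behaviours (first match wins, no match returns `None, 0, 0`), here is a compact restatement of the helper so that the snippet runs on its own. It is a sketch that returns the same results as the function above for non-empty answers; the guard for an empty answer is an addition of this example, and the word list is made up.

```python
def subfinder(words_list, answer_list):
    """Return (match, start_index, end_index) for the first occurrence of
    answer_list inside words_list, or (None, 0, 0) if there is none."""
    n = len(answer_list)
    if n == 0:
        return None, 0, 0  # guard added in this sketch
    for idx in range(len(words_list) - n + 1):
        if words_list[idx : idx + n] == answer_list:
            return answer_list, idx, idx + n - 1
    return None, 0, 0


# Hypothetical OCR words in which the answer appears twice.
words = ["to:", "r.", "h.", "honeycutt", "ce:", "t.f.", "riehl", "t.f.", "riehl"]

found = subfinder(words, ["t.f.", "riehl"])
missing = subfinder(words, ["c.j.", "cook"])
print(found)
print(missing)
```

Note that matching is exact, so an OCR error inside the answer words leads to the no-match result.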

To illustrate how this function finds the position of the answer, let's use it on an example:

>>> example = dataset_with_ocr["train"][1]
>>> words = [word.lower() for word in example["words"]]
>>> match, word_idx_start, word_idx_end = subfinder(words, example["answer"].lower().split())
>>> print("Question: ", example["question"])
>>> print("Words:", words)
>>> print("Answer: ", example["answer"])
>>> print("start_index", word_idx_start)
>>> print("end_index", word_idx_end)
Question:  Who is in  cc in this letter?
Words: ['wie', 'baw', 'brown', '&', 'williamson', 'tobacco', 'corporation', 'research', '&', 'development', 'internal', 'correspondence', 'to:', 'r.', 'h.', 'honeycutt', 'ce:', 't.f.', 'riehl', 'from:', '.', 'c.j.', 'cook', 'date:', 'may', '8,', '1995', 'subject:', 'review', 'of', 'existing', 'brainstorming', 'ideas/483', 'the', 'major', 'function', 'of', 'the', 'product', 'innovation', 'graup', 'is', 'to', 'develop', 'marketable', 'nove!', 'products', 'that', 'would', 'be', 'profitable', 'to', 'manufacture', 'and', 'sell.', 'novel', 'is', 'defined', 'as:', 'of', 'a', 'new', 'kind,', 'or', 'different', 'from', 'anything', 'seen', 'or', 'known', 'before.', 'innovation', 'is', 'defined', 'as:', 'something', 'new', 'or', 'different', 'introduced;', 'act', 'of', 'innovating;', 'introduction', 'of', 'new', 'things', 'or', 'methods.', 'the', 'products', 'may', 'incorporate', 'the', 'latest', 'technologies,', 'materials', 'and', 'know-how', 'available', 'to', 'give', 'then', 'a', 'unique', 'taste', 'or', 'look.', 'the', 'first', 'task', 'of', 'the', 'product', 'innovation', 'group', 'was', 'to', 'assemble,', 'review', 'and', 'categorize', 'a', 'list', 'of', 'existing', 'brainstorming', 'ideas.', 'ideas', 'were', 'grouped', 'into', 'two', 'major', 'categories', 'labeled', 'appearance', 'and', 'taste/aroma.', 'these', 'categories', 'are', 'used', 'for', 'novel', 'products', 'that', 'may', 'differ', 'from', 'a', 'visual', 'and/or', 'taste/aroma', 'point', 'of', 'view', 'compared', 'to', 'canventional', 'cigarettes.', 'other', 'categories', 'include', 'a', 'combination', 'of', 'the', 'above,', 'filters,', 'packaging', 'and', 'brand', 'extensions.', 'appearance', 'this', 'category', 'is', 'used', 'for', 'novel', 'cigarette', 'constructions', 'that', 'yield', 'visually', 'different', 'products', 'with', 'minimal', 'changes', 'in', 'smoke', 'chemistry', 'two', 'cigarettes', 'in', 'cne.', 'emulti-plug', 'te', 'build', 'yaur', 'awn', 'cigarette.', 'eswitchable', 'menthol', 'or', 
'non', 'menthol', 'cigarette.', '*cigarettes', 'with', 'interspaced', 'perforations', 'to', 'enable', 'smoker', 'to', 'separate', 'unburned', 'section', 'for', 'future', 'smoking.', '«short', 'cigarette,', 'tobacco', 'section', '30', 'mm.', '«extremely', 'fast', 'buming', 'cigarette.', '«novel', 'cigarette', 'constructions', 'that', 'permit', 'a', 'significant', 'reduction', 'iretobacco', 'weight', 'while', 'maintaining', 'smoking', 'mechanics', 'and', 'visual', 'characteristics.', 'higher', 'basis', 'weight', 'paper:', 'potential', 'reduction', 'in', 'tobacco', 'weight.', '«more', 'rigid', 'tobacco', 'column;', 'stiffing', 'agent', 'for', 'tobacco;', 'e.g.', 'starch', '*colored', 'tow', 'and', 'cigarette', 'papers;', 'seasonal', 'promotions,', 'e.g.', 'pastel', 'colored', 'cigarettes', 'for', 'easter', 'or', 'in', 'an', 'ebony', 'and', 'ivory', 'brand', 'containing', 'a', 'mixture', 'of', 'all', 'black', '(black', 'paper', 'and', 'tow)', 'and', 'ail', 'white', 'cigarettes.', '499150498']
Answer:  T.F. Riehl
start_index 17
end_index 18

Once examples are encoded, however, they will look like this:

>>> encoding = tokenizer(example["question"], example["words"], example["boxes"])
>>> tokenizer.decode(encoding["input_ids"])
[CLS] who is in cc in this letter? [SEP] wie baw brown & williamson tobacco corporation research & development ...

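We need to find the position of the answer in the encoded input.

`token_type_ids` tells us which tokens are part of the question and which are part of the document's words.
`tokenizer.cls_token_id` helps find the special token at the beginning of the input.
`word_ids` helps match the answer found in the original `words` to the same answer in the full encoded input, and determine the start/end position of the answer in the encoded input.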
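
The following sketch shows how these pieces can combine to turn a word-level answer span into a token-level one. The `word_ids` and `token_type_ids` lists are hand-written to imitate an encoded example (they are not real tokenizer output), and the helper name is an assumption of this example.

```python
def word_span_to_token_span(word_ids, token_type_ids, word_start, word_end):
    """Map an inclusive word-index span to an inclusive token-index span.

    Only tokens in the document segment (token_type_ids == 1) are considered,
    because question tokens carry their own word ids that start from 0 again.
    Returns None if the span cannot be located.
    """
    token_start = None
    token_end = None
    for pos, (word_id, segment) in enumerate(zip(word_ids, token_type_ids)):
        if segment != 1 or word_id is None:
            continue  # skip question tokens and special tokens
        if word_id == word_start and token_start is None:
            token_start = pos  # first token of the first answer word
        if word_id == word_end:
            token_end = pos  # keeps updating until the last token of the word
    if token_start is None or token_end is None:
        return None
    return token_start, token_end


# Imitation of: [CLS] who wrote [SEP] t . f . rie ##hl [SEP]
word_ids = [None, 0, 1, None, 0, 0, 0, 0, 1, 1, None]
token_type_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

span = word_span_to_token_span(word_ids, token_type_ids, word_start=0, word_end=1)
print(span)
```

One word can be split into several tokens, which is why the token span here is wider than the two-word answer span.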

With that in mind, let's create a function to encode a batch of examples in the dataset:

>>> def encode_dataset(examples, max_length=512):
...     questions = examples["question"]
...     words = examples["words"]
...     boxes = examples["boxes"]
...     answers = examples["answer"]

...     # encode the batch of examples and initialize the start_positions and end_positions
...     encoding = tokenizer(questions, words, boxes, max_length=max_length, padding="max_length", truncation=True)
...     start_positions = []
...     end_positions = []

...     # loop through the examples in the batch
...     for i in range(len(questions)):
...         cls_index = encoding["input_ids"][i].index(tokenizer.cls_token_id)

...         # find the position of the answer in example's words
...         words_example = [word.lower() for word in words[i]]
...         answer = answers[i]
...         match, word_idx_start, word_idx_end = subfinder(words_example, answer.lower().split())

...         if match:
...             # if match is found, use `token_type_ids` to find where words start in the encoding
...             token_type_ids = encoding["token_type_ids"][i]
...             token_start_index = 0
...             while token_type_ids[token_start_index] != 1:
...                 token_start_index += 1

...             token_end_index = len(encoding["input_ids"][i]) - 1
...             while token_type_ids[token_end_index] != 1:
...                 token_end_index -= 1

...             word_ids = encoding.word_ids(i)[token_start_index : token_end_index + 1]
...             start_position = cls_index
...             end_position = cls_index

...             # loop over word_ids and increase `token_start_index` until it matches the answer position in words
...             # once it matches, save the `token_start_index` as the `start_position` of the answer in the encoding
...             for id in word_ids:
...                 if id == word_idx_start:
...                     start_position = token_start_index
...                 else:
...                     token_start_index += 1

...             # similarly loop over `word_ids` starting from the end to find the `end_position` of the answer
...             for id in word_ids[::-1]:
...                 if id == word_idx_end:
...                     end_position = token_end_index
...                 else:
...                     token_end_index -= 1

...             start_positions.append(start_position)
...             end_positions.append(end_position)

...         else:
...             start_positions.append(cls_index)
...             end_positions.append(cls_index)

...     encoding["image"] = examples["image"]
...     encoding["start_positions"] = start_positions
...     encoding["end_positions"] = end_positions

...     return encoding

Now that we have this preprocessing function, we can encode the entire dataset:

>>> encoded_train_dataset = dataset_with_ocr["train"].map(
...     encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["train"].column_names
... )
>>> encoded_test_dataset = dataset_with_ocr["test"].map(
...     encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["test"].column_names
... )

Let's check what the features of the encoded dataset look like:

>>> encoded_train_dataset.features
{'image': Sequence(feature=Sequence(feature=Sequence(feature=Value(dtype='uint8', id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'bbox': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'start_positions': Value(dtype='int64', id=None),
 'end_positions': Value(dtype='int64', id=None)}

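Evaluation

Evaluation for document question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The Trainer still calculates the evaluation loss during training, so you're not completely in the dark about your model's performance. Extractive question answering is typically evaluated using F1/exact match. If you'd like to implement it yourself, check out the Question Answering chapter of the Hugging Face course for inspiration.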
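
As a starting point, here is a simplified sketch of exact match and token-level F1 on answer strings. It only lowercases and collapses whitespace; the official SQuAD-style metric also strips punctuation and articles, so treat this as an approximation. All function names are assumptions of this example.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (a simplified normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """Score against every annotated answer and keep the best score."""
    return max(metric(prediction, reference) for reference in references)


references = ["TRRF Vice President", "lee a. waller"]
em = best_over_references(exact_match, "Lee A. Waller", references)
f1 = best_over_references(f1_score, "lee waller", references)
print(em, f1)
```

Taking the maximum over the annotated answers reflects that any one of them counts as correct.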

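Train

Congratulations! You've successfully navigated the toughest part of this guide and are now ready to train your own model. Training involves the following steps:

Load the model with AutoModelForDocumentQuestionAnswering using the same checkpoint as in the preprocessing.
Define your training hyperparameters in TrainingArguments.
Define a function to batch examples together; here the DefaultDataCollator will do just fine.
Pass the training arguments to Trainer along with the model, dataset, and data collator.
Call train() to fine-tune your model.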

>>> from transformers import AutoModelForDocumentQuestionAnswering

>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained(model_checkpoint)

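In the TrainingArguments, use `output_dir` to specify where to save your model, and configure the hyperparameters as you see fit. If you wish to share your model with the community, set `push_to_hub` to `True` (you must be signed in to Hugging Face to upload your model). In this case, `output_dir` is also the name of the repository to which your model checkpoint is pushed.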

>>> from transformers import TrainingArguments

>>> # REPLACE THIS WITH YOUR REPO ID
>>> repo_id = "MariaK/layoutlmv2-base-uncased_finetuned_docvqa"

>>> training_args = TrainingArguments(
...     output_dir=repo_id,
...     per_device_train_batch_size=4,
...     num_train_epochs=20,
...     save_steps=200,
...     logging_steps=50,
...     eval_strategy="steps",
...     learning_rate=5e-5,
...     save_total_limit=2,
...     remove_unused_columns=False,
...     push_to_hub=True,
... )
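
To get a feel for how `num_train_epochs`, `save_steps` and `logging_steps` interact, the sketch below estimates the number of optimizer steps. It assumes a single device and no gradient accumulation, and it uses 1000 training examples as a hypothetical figure (the size of the train split before the length filter); your actual count will be somewhat lower.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs):
    """Estimate optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
    return steps_per_epoch * num_epochs


num_examples = 1000  # hypothetical: train split size before filtering

total_steps = total_optimizer_steps(num_examples, per_device_batch_size=4, num_epochs=20)
checkpoints_written = total_steps // 200  # save_steps=200
log_events = total_steps // 50            # logging_steps=50
print(total_steps, checkpoints_written, log_events)
```

With `save_total_limit=2`, only the two most recent of these checkpoints are kept on disk.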

Define a simple data collator to batch examples together.

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()
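
Conceptually, a collator turns a list of per-example dictionaries into one dictionary of batched values. The plain-Python sketch below shows only that grouping step; the real DefaultDataCollator additionally converts the values to framework tensors. The function name and the toy features are assumptions of this example.

```python
def simple_collate(features):
    """Group a list of per-example dicts into one dict of per-key lists."""
    if not features:
        return {}
    keys = features[0].keys()
    return {key: [example[key] for example in features] for key in keys}


# Two toy encoded examples with already-padded, equal-length inputs.
batch = simple_collate(
    [
        {"input_ids": [101, 2040, 102], "start_positions": 1, "end_positions": 1},
        {"input_ids": [101, 2054, 102], "start_positions": 0, "end_positions": 0},
    ]
)
print(batch)
```

This simple grouping is sufficient here because the encoding step already padded every example to the same length.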

Finally, bring everything together, and call train():

>>> from transformers import Trainer

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=encoded_train_dataset,
...     eval_dataset=encoded_test_dataset,
...     processing_class=processor,
... )

>>> trainer.train()

To add the final model to the 🤗 Hub, create a model card and call `push_to_hub`:

>>> trainer.create_model_card()
>>> trainer.push_to_hub()

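Inference

Now that you have fine-tuned a LayoutLMv2 model and uploaded it to the 🤗 Hub, you can use it for inference. The simplest way to try out your fine-tuned model for inference is to use it in a Pipeline.

Let's take an example: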

>>> example = dataset["test"][2]
>>> question = example["query"]["en"]
>>> image = example["image"]
>>> print(question)
>>> print(example["answers"])
'Who is ‘presiding’ TRRF GENERAL SESSION (PART 1)?'
['TRRF Vice President', 'lee a. waller']

Next, instantiate a pipeline for document question answering with your model, and pass the image + question combination to it.

>>> from transformers import pipeline

>>> qa_pipeline = pipeline("document-question-answering", model="MariaK/layoutlmv2-base-uncased_finetuned_docvqa")
>>> qa_pipeline(image, question)
[{'score': 0.9949808120727539,
  'answer': 'Lee A. Waller',
  'start': 55,
  'end': 57}]

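If you'd like, you can also replicate the results of the pipeline manually:

Take an image and a question, and prepare them for the model using the processor from your model.
Forward the result of the preprocessing through the model.
The model returns `start_logits` and `end_logits`, which indicate which token is at the start of the answer and which token is at the end. Both have shape (batch_size, sequence_length).
Take an argmax on the last dimension of both `start_logits` and `end_logits` to get the predicted `start_idx` and `end_idx`.
Decode the answer with the tokenizer.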

>>> import torch
>>> from transformers import AutoProcessor
>>> from transformers import AutoModelForDocumentQuestionAnswering

>>> processor = AutoProcessor.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa")
>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa")

>>> with torch.no_grad():
...     encoding = processor(image.convert("RGB"), question, return_tensors="pt")
...     outputs = model(**encoding)
...     start_logits = outputs.start_logits
...     end_logits = outputs.end_logits
...     predicted_start_idx = start_logits.argmax(-1).item()
...     predicted_end_idx = end_logits.argmax(-1).item()

>>> processor.tokenizer.decode(encoding.input_ids.squeeze()[predicted_start_idx : predicted_end_idx + 1])
'lee a. waller'
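
Taking the two argmaxes independently can occasionally yield an end index that precedes the start index. One common remedy, sketched here in plain Python on hand-written logits, is to search for the highest-scoring pair with start <= end and a bounded answer length. The function name and the default `max_answer_len` are assumptions of this example, and this is not necessarily what the pipeline does internally.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score,
    subject to start <= end and a span of at most max_answer_len tokens."""
    best_score = float("-inf")
    best_pair = (0, 0)
    for start, start_score in enumerate(start_logits):
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best_pair = (start, end)
    return best_pair


# Hypothetical logits where the independent argmaxes are (2, 1), an invalid span.
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [0.0, 4.0, 0.1, 3.0]

span = best_span(start_logits, end_logits)
print(span)
```

With real model outputs you would first convert each logits tensor to a Python list, for example with `tensor.squeeze().tolist()`.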
