Table Transformer

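Overview

The Table Transformer model was proposed in PubTables-1M: Towards comprehensive table extraction from unstructured documents by Brandon Smock, Rohith Pesala, and Robin Abraham. The authors introduce a new dataset, PubTables-1M, to benchmark progress in table extraction from unstructured documents, as well as table structure recognition and functional analysis. The authors train 2 DETR models, one for table detection and one for table structure recognition, dubbed Table Transformers.

The abstract from the paper is the following:

Recently, significant progress has been made applying machine learning to the problem of table structure inference and extraction from unstructured documents. However, one of the greatest challenges remains the creation of datasets with complete, unambiguous ground truth at scale. To address this, we develop a new, more comprehensive dataset for table extraction, called PubTables-1M. PubTables-1M contains nearly one million tables from scientific articles, supports multiple input modalities, and contains detailed header and location information for table structures, making it useful for a wide variety of modeling approaches. It also addresses a significant source of ground truth inconsistency observed in prior datasets called oversegmentation, using a novel canonicalization procedure. We demonstrate that these improvements lead to a significant increase in training performance and a more reliable estimate of model performance at evaluation for table structure recognition. Further, we show that transformer-based object detection models trained on PubTables-1M produce excellent results for all three tasks of detection, structure recognition, and functional analysis without the need for any special customization for these tasks.

Table detection and table structure recognition illustrated. Taken from the original paper.

The authors released 2 models, one for table detection in documents and one for table structure recognition (the task of recognizing the individual rows, columns, etc. in a table).

This model was contributed by nielsr. The original code can be found here.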

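Resources

Object Detection

A demo notebook for the Table Transformer can be found here.
It turns out that padding of images is quite important for detection. An interesting GitHub thread with replies from the authors can be found here.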
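
To make the padding remark concrete, here is a minimal sketch of the bookkeeping involved when a uniform border is added around a page image before detection. The helper names `pad_geometry` and `unpad_box` and the border of 40 pixels are hypothetical and chosen only for illustration; the actual pixel padding can be done with any image library, and the amount of padding that works best is not stated in this page.

```python
def pad_geometry(width, height, border):
    """Return the padded canvas size and the (x, y) offset of the original image."""
    if border < 0:
        raise ValueError("border must be non-negative")
    return (width + 2 * border, height + 2 * border), (border, border)


def unpad_box(box, offset, original_size):
    """Map an (xmin, ymin, xmax, ymax) box from padded to original coordinates."""
    off_x, off_y = offset
    width, height = original_size
    xmin, ymin, xmax, ymax = box
    # Shift back by the padding offset, then clamp to the original image bounds.
    return (
        min(max(xmin - off_x, 0), width),
        min(max(ymin - off_y, 0), height),
        min(max(xmax - off_x, 0), width),
        min(max(ymax - off_y, 0), height),
    )


# Hypothetical 800 x 600 page padded by 40 pixels on every side.
padded_size, offset = pad_geometry(width=800, height=600, border=40)
# A box predicted on the padded image, mapped back to the original page.
box_in_original = unpad_box((30, 50, 500, 700), offset, (800, 600))
```

Boxes predicted on the padded image live in padded coordinates, so they need to be shifted back (and clamped) before being drawn on or cropped from the original page.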

TableTransformerConfig

class transformers.TableTransformerConfig


( use_timm_backbone = True backbone_config = None num_channels = 3 num_queries = 100 encoder_layers = 6 encoder_ffn_dim = 2048 encoder_attention_heads = 8 decoder_layers = 6 decoder_ffn_dim = 2048 decoder_attention_heads = 8 encoder_layerdrop = 0.0 decoder_layerdrop = 0.0 is_encoder_decoder = True activation_function = 'relu' d_model = 256 dropout = 0.1 attention_dropout = 0.0 activation_dropout = 0.0 init_std = 0.02 init_xavier_std = 1.0 auxiliary_loss = False position_embedding_type = 'sine' backbone = 'resnet50' use_pretrained_backbone = True backbone_kwargs = None dilation = False class_cost = 1 bbox_cost = 5 giou_cost = 2 mask_loss_coefficient = 1 dice_loss_coefficient = 1 bbox_loss_coefficient = 5 giou_loss_coefficient = 2 eos_coefficient = 0.1 **kwargs )

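Parameters

use_timm_backbone (bool, optional, defaults to True) — Whether or not to use the timm library for the backbone. If set to False, the AutoBackbone API is used.
backbone_config (PretrainedConfig or dict, optional) — The configuration of the backbone model. Only used when use_timm_backbone is set to False, in which case it defaults to ResNetConfig().
num_channels (int, optional, defaults to 3) — The number of input channels.
num_queries (int, optional, defaults to 100) — Number of object queries, i.e. detection slots. This is the maximal number of objects TableTransformerModel can detect in a single image. For COCO, we recommend 100 queries.
d_model (int, optional, defaults to 256) — Dimension of the layers.
encoder_layers (int, optional, defaults to 6) — Number of encoder layers.
decoder_layers (int, optional, defaults to 6) — Number of decoder layers.
encoder_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (int, optional, defaults to 2048) — Dimension of the "intermediate" (often named feed-forward) layer in the decoder.
encoder_ffn_dim (int, optional, defaults to 2048) — Dimension of the "intermediate" (often named feed-forward) layer in the encoder.
activation_function (str or function, optional, defaults to "relu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
activation_dropout (float, optional, defaults to 0.0) — The dropout ratio for activations inside the fully connected layer.
init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
init_xavier_std (float, optional, defaults to 1) — The scaling factor used for the Xavier initialization gain in the HM Attention map module.
encoder_layerdrop (float, optional, defaults to 0.0) — The LayerDrop probability for the encoder. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
decoder_layerdrop (float, optional, defaults to 0.0) — The LayerDrop probability for the decoder. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
auxiliary_loss (bool, optional, defaults to False) — Whether auxiliary decoding losses (a loss at each decoder layer) are to be used.
position_embedding_type (str, optional, defaults to "sine") — Type of position embeddings to be used on top of the image features. One of "sine" or "learned".
backbone (str, optional, defaults to "resnet50") — Name of the backbone to use when backbone_config is None. If use_pretrained_backbone is True, the corresponding pretrained weights are loaded from the timm or transformers library. If use_pretrained_backbone is False, the backbone's config is loaded and used to initialize the backbone with random weights.
use_pretrained_backbone (bool, optional, defaults to True) — Whether to use pretrained weights for the backbone.
backbone_kwargs (dict, optional) — Keyword arguments to be passed to AutoBackbone when loading from a checkpoint, e.g. {'out_indices': (0, 1, 2, 3)}. Cannot be specified if backbone_config is set.
dilation (bool, optional, defaults to False) — Whether to replace stride with dilation in the last convolutional block (DC5). Only supported when use_timm_backbone = True.
class_cost (float, optional, defaults to 1) — Relative weight of the classification error in the Hungarian matching cost.
bbox_cost (float, optional, defaults to 5) — Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
giou_cost (float, optional, defaults to 2) — Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
mask_loss_coefficient (float, optional, defaults to 1) — Relative weight of the Focal loss in the panoptic segmentation loss.
dice_loss_coefficient (float, optional, defaults to 1) — Relative weight of the DICE/F-1 loss in the panoptic segmentation loss.
bbox_loss_coefficient (float, optional, defaults to 5) — Relative weight of the L1 bounding box loss in the object detection loss.
giou_loss_coefficient (float, optional, defaults to 2) — Relative weight of the generalized IoU loss in the object detection loss.
eos_coefficient (float, optional, defaults to 0.1) — Relative classification weight of the "no-object" class in the object detection loss.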
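
To illustrate how class_cost, bbox_cost and giou_cost interact, the following is a minimal pure-Python sketch of the matching cost for a single (prediction, target) pair. It assumes the DETR-style formulation, in which the classification term is the negated predicted probability of the target class and the GIoU term is the negated generalized IoU; these sign conventions and the helper names are assumptions for illustration, not the library's code. The real matcher works on batched tensors and then solves a Hungarian assignment over all pairs.

```python
def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def generalized_iou(box_a, box_b):
    """GIoU of two (xmin, ymin, xmax, ymax) boxes with positive area."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def matching_cost(class_prob, pred_box, target_box,
                  class_cost=1.0, bbox_cost=5.0, giou_cost=2.0):
    """Weighted cost of assigning one prediction to one target (lower is better)."""
    l1 = sum(abs(p - t) for p, t in zip(pred_box, target_box))
    giou = generalized_iou(cxcywh_to_xyxy(pred_box), cxcywh_to_xyxy(target_box))
    return class_cost * (-class_prob) + bbox_cost * l1 + giou_cost * (-giou)


target = (0.5, 0.5, 0.2, 0.2)
perfect = matching_cost(1.0, target, target)
shifted = matching_cost(1.0, (0.6, 0.5, 0.2, 0.2), target)
```

With the default weights, a one-unit L1 error on the box counts five times as much as a one-unit change in class probability, so box geometry tends to dominate which query is matched to which target.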

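This is the configuration class to store the configuration of a TableTransformerModel. It is used to instantiate a Table Transformer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Table Transformer microsoft/table-transformer-detection architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example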

>>> from transformers import TableTransformerModel, TableTransformerConfig

>>> # Initializing a Table Transformer microsoft/table-transformer-detection style configuration
>>> configuration = TableTransformerConfig()

>>> # Initializing a model from the microsoft/table-transformer-detection style configuration
>>> model = TableTransformerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

TableTransformerModel

class transformers.TableTransformerModel


( config: TableTransformerConfig )

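Parameters

config (TableTransformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Table Transformer Model (consisting of a backbone and an encoder-decoder Transformer) outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.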

forward


( pixel_values: FloatTensor pixel_mask: typing.Optional[torch.FloatTensor] = None decoder_attention_mask: typing.Optional[torch.FloatTensor] = None encoder_outputs: typing.Optional[torch.FloatTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.table_transformer.modeling_table_transformer.TableTransformerModelOutput or tuple(torch.FloatTensor)

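Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using the image processor; see the image processor's __call__ method for details.
pixel_mask (torch.FloatTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values are selected in [0, 1]:
- 1 for pixels that are real (i.e. not masked),
- 0 for pixels that are padding (i.e. masked).
What are attention masks?
decoder_attention_mask (torch.FloatTensor of shape (batch_size, num_queries), optional) — Not used by default. Can be used to mask object queries.
encoder_outputs (torch.FloatTensor, optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state, of shape (batch_size, sequence_length, hidden_size), is a sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing the flattened feature map (output of the backbone + projection layer), you can choose to directly pass a flattened representation of an image.
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, num_queries, hidden_size), optional) — Optionally, instead of initializing the queries with a tensor of zeros, you can choose to directly pass an embedded representation.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.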
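
The pixel_mask convention described above (1 for real pixels, 0 for padding) can be illustrated with a small pure-Python sketch. In practice the image processor is expected to build this mask when it pads a batch; the helper `pad_batch_with_mask` is hypothetical, and single-channel images stored as nested lists are a simplification chosen to keep the example dependency-free.

```python
def pad_batch_with_mask(images, pad_value=0.0):
    """Pad 2D single-channel images (lists of rows) to a common size.

    Returns (padded_images, pixel_masks), where a mask holds 1 for real
    pixels and 0 for padding.
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        # Pad each row on the right, then append fully padded rows at the bottom.
        padded.append(
            [list(row) + [pad_value] * (max_w - w) for row in img]
            + [[pad_value] * max_w for _ in range(max_h - h)]
        )
        masks.append(
            [[1] * w + [0] * (max_w - w) for _ in range(h)]
            + [[0] * max_w for _ in range(max_h - h)]
        )
    return padded, masks


small = [[0.5, 0.5]]                          # 1 x 2 image
large = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # 2 x 3 image
padded, masks = pad_batch_with_mask([small, large])
```

Here the smaller image receives one padded column and one padded row, and its mask marks exactly those positions with 0 so that attention can skip them.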

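transformers.models.table_transformer.modeling_table_transformer.TableTransformerModelOutput or tuple(torch.FloatTensor)

A transformers.models.table_transformer.modeling_table_transformer.TableTransformerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TableTransformerConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden states at the output of the last layer of the decoder of the model.
past_key_values (~cache_utils.EncoderDecoderCache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance. For more details, see our kv cache guide.

Contains pre-computed hidden states (keys and values in the self-attention blocks and, optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
intermediate_hidden_states (torch.FloatTensor of shape (config.decoder_layers, batch_size, sequence_length, hidden_size), optional, returned when config.auxiliary_loss=True) — Intermediate decoder activations, i.e. the output of each decoder layer, each of them passed through a layernorm.

The TableTransformerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example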

>>> from transformers import AutoImageProcessor, TableTransformerModel
>>> from huggingface_hub import hf_hub_download
>>> from PIL import Image

>>> file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="example_pdf.png")
>>> image = Image.open(file_path).convert("RGB")

>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
>>> model = TableTransformerModel.from_pretrained("microsoft/table-transformer-detection")

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> # forward pass
>>> outputs = model(**inputs)

>>> # the last hidden states are the final query embeddings of the Transformer decoder
>>> # these are of shape (batch_size, num_queries, hidden_size)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 15, 256]

TableTransformerForObjectDetection

class transformers.TableTransformerForObjectDetection


( config: TableTransformerConfig )

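Parameters

config (TableTransformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Table Transformer Model (consisting of a backbone and an encoder-decoder Transformer) with object detection heads on top, for tasks such as COCO detection.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.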

forward


( pixel_values: FloatTensor pixel_mask: typing.Optional[torch.FloatTensor] = None decoder_attention_mask: typing.Optional[torch.FloatTensor] = None encoder_outputs: typing.Optional[torch.FloatTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[list[dict]] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.table_transformer.modeling_table_transformer.TableTransformerObjectDetectionOutput or tuple(torch.FloatTensor)

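Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using the image processor; see the image processor's __call__ method for details.
pixel_mask (torch.FloatTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values are selected in [0, 1]:
- 1 for pixels that are real (i.e. not masked),
- 0 for pixels that are padding (i.e. masked).
What are attention masks?
decoder_attention_mask (torch.FloatTensor of shape (batch_size, num_queries), optional) — Not used by default. Can be used to mask object queries.
encoder_outputs (torch.FloatTensor, optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state, of shape (batch_size, sequence_length, hidden_size), is a sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing the flattened feature map (output of the backbone + projection layer), you can choose to directly pass a flattened representation of an image.
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, num_queries, hidden_size), optional) — Optionally, instead of initializing the queries with a tensor of zeros, you can choose to directly pass an embedded representation.
labels (list[Dict] of len (batch_size,), optional) — Labels for computing the bipartite matching loss. A list of dicts, each containing at least the following 2 keys: 'class_labels' and 'boxes' (the class labels and bounding boxes of an image in the batch, respectively). The class labels themselves should be a torch.LongTensor of len (number of bounding boxes in the image,) and the boxes a torch.FloatTensor of shape (number of bounding boxes in the image, 4).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.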
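
As an illustration of the labels structure, the sketch below builds one per-image dictionary from pixel-space boxes. It assumes that boxes are expected in the same normalized (center_x, center_y, width, height) form as the predicted boxes, following the DETR convention; this page does not state the box format for labels, so treat that as an assumption. The helper `make_label` is hypothetical, and plain lists are used to keep the example dependency-free.

```python
def make_label(class_ids, boxes_xyxy, image_width, image_height):
    """Build one per-image label dict from pixel-space (xmin, ymin, xmax, ymax) boxes."""
    if len(class_ids) != len(boxes_xyxy):
        raise ValueError("need exactly one class id per box")
    boxes = []
    for xmin, ymin, xmax, ymax in boxes_xyxy:
        # Normalize to (center_x, center_y, width, height) in [0, 1] (assumed format).
        boxes.append([
            (xmin + xmax) / 2 / image_width,
            (ymin + ymax) / 2 / image_height,
            (xmax - xmin) / image_width,
            (ymax - ymin) / image_height,
        ])
    return {"class_labels": list(class_ids), "boxes": boxes}


# One table (class id 0) on a hypothetical 400 x 200 image.
label = make_label([0], [(100, 50, 300, 150)], image_width=400, image_height=200)
labels = [label]  # one dict per image in the batch
```

Before passing such a list to the model, class_labels would need to be wrapped as a torch.LongTensor and boxes as a torch.FloatTensor, as the parameter description requires.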

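transformers.models.table_transformer.modeling_table_transformer.TableTransformerObjectDetectionOutput or tuple(torch.FloatTensor)

A transformers.models.table_transformer.modeling_table_transformer.TableTransformerObjectDetectionOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TableTransformerConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels are provided) — Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized scale-invariant IoU loss.
loss_dict (Dict, optional) — A dictionary containing the individual losses. Useful for logging.
logits (torch.FloatTensor of shape (batch_size, num_queries, num_classes + 1)) — Classification logits (including no-object) for all queries.
pred_boxes (torch.FloatTensor of shape (batch_size, num_queries, 4)) — Normalized box coordinates for all queries, represented as (center_x, center_y, width, height). These values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding possible padding). You can use the image processor's post_process_object_detection method to retrieve the unnormalized bounding boxes.
auxiliary_outputs (list[Dict], optional) — Optional, only returned when auxiliary losses are activated (i.e. config.auxiliary_loss is set to True) and labels are provided. It is a list of dictionaries containing the two above keys (logits and pred_boxes) for each decoder layer.
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the decoder of the model.
decoder_hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional, defaults to None) — Sequence of hidden states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.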
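
The logits and pred_boxes fields described above can be decoded by hand, which helps when reasoning about what post-processing has to do. The sketch below is a simplified pure-Python illustration for a single image, not the library's implementation: it assumes a softmax over the classes, drops the trailing no-object class, keeps queries whose best score exceeds a threshold, and rescales the normalized (center_x, center_y, width, height) boxes to pixel (xmin, ymin, xmax, ymax). The function names and the toy inputs are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decode_queries(logits, pred_boxes, image_width, image_height, threshold=0.9):
    """Decode per-query outputs for one image into (score, label, xyxy box) tuples."""
    detections = []
    for query_logits, (cx, cy, w, h) in zip(logits, pred_boxes):
        probs = softmax(query_logits)[:-1]  # drop the trailing "no object" class
        score = max(probs)
        if score <= threshold:
            continue
        label = probs.index(score)
        box = (
            (cx - w / 2) * image_width,
            (cy - h / 2) * image_height,
            (cx + w / 2) * image_width,
            (cy + h / 2) * image_height,
        )
        detections.append((score, label, box))
    return detections


# Two queries, two real classes plus "no object"; the second query predicts no object.
logits = [[8.0, 0.0, 0.0], [0.0, 0.0, 8.0]]
boxes = [(0.5, 0.5, 0.5, 0.25), (0.2, 0.2, 0.1, 0.1)]
detections = decode_queries(logits, boxes, image_width=1000, image_height=800)
```

Only the first query survives the threshold here, since the second one puts almost all of its probability mass on the no-object class.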

The TableTransformerForObjectDetection forward method overrides the __call__ special method.

Example

>>> from huggingface_hub import hf_hub_download
>>> from transformers import AutoImageProcessor, TableTransformerForObjectDetection
>>> import torch
>>> from PIL import Image

>>> file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="example_pdf.png")
>>> image = Image.open(file_path).convert("RGB")

>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
>>> model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
>>> target_sizes = torch.tensor([image.size[::-1]])
>>> results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[
...     0
... ]

>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
...     box = [round(i, 2) for i in box.tolist()]
...     print(
...         f"Detected {model.config.id2label[label.item()]} with confidence "
...         f"{round(score.item(), 3)} at location {box}"
...     )
Detected table with confidence 1.0 at location [202.1, 210.59, 1119.22, 385.09]
