TAPAS

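Overview

The TAPAS model was proposed in TAPAS: Weakly Supervised Table Parsing via Pre-training by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. It is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7 token types that encode tabular structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset comprising millions of tables from English Wikipedia and corresponding texts.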

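For question answering, TAPAS has 2 heads on top: a cell selection head and an aggregation head, which (optionally) performs aggregations (such as counting or summing) among the selected cells. TAPAS has been fine-tuned on several datasets:

SQA (Sequential Question Answering by Microsoft)
WTQ (Wiki Table Questions by Stanford University)
WikiSQL (by Salesforce)

It achieves state-of-the-art performance on both SQA and WTQ, and performance comparable to SOTA on WikiSQL, with a simpler architecture.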

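The abstract from the paper is the following:

Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.

In addition, the authors have further pre-trained TAPAS to recognize **table entailment**, by creating a balanced dataset of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning. The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first pre-trained on MLM, and then on another dataset). They found that intermediate pre-training further improves performance on SQA, achieving a new state-of-the-art, as well as state-of-the-art on TabFact, a large-scale dataset with 16k Wikipedia tables for table entailment (a binary classification task). For more details, see their follow-up paper: Understanding tables with intermediate pre-training by Julian Martin Eisenschlos, Syrine Krichene and Thomas Müller.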

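TAPAS architecture. Taken from the original blog post.

This model was contributed by nielsr. The TensorFlow version of this model was contributed by kamalkraj. The original code can be found here.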

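Usage tips

TAPAS uses relative position embeddings by default (restarting the position embeddings at every cell of the table). Note that this was added after the publication of the original TAPAS paper. According to the authors, this usually results in slightly better performance, and allows you to encode longer sequences without running out of embeddings. This is reflected in the reset_position_index_per_cell parameter of TapasConfig, which is set to True by default. The default versions of the models available on the hub all use relative position embeddings. You can still use the ones with absolute position embeddings by passing in an additional argument revision="no_reset" when calling the from_pretrained() method. Note that it's usually advised to pad the inputs on the right rather than the left.
TAPAS is based on BERT, so TAPAS-base for example corresponds to a BERT-base architecture. Of course, TAPAS-large will result in the best performance (the results reported in the paper are from TAPAS-large). Results of the various sized models are shown on the original GitHub repository.
TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the previous question. Note that the forward pass of TAPAS is a bit different in a conversational set-up: in that case, you have to feed every table-question pair one by one to the model, such that the prev_labels token type ids can be overwritten by the labels the model predicted for the previous question. See the "Usage" section for more info.
TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard. Note that TAPAS can be used as an encoder in the EncoderDecoderModel framework, to combine it with an autoregressive text decoder such as GPT-2.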
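The relative position embeddings mentioned in the first tip can be illustrated with a minimal sketch. It assumes a simplified view in which every token carries the (row, column) of the cell it belongs to, and it restarts the position counter whenever the cell changes. The function name `reset_position_ids` and the token layout are hypothetical and only meant to convey the idea; the library itself works on batched tensors.

```python
def reset_position_ids(cell_ids):
    """Return position ids that restart at 0 whenever the cell changes.

    cell_ids: one (row, column) tuple per token, in sequence order.
    """
    position_ids = []
    previous_cell = None
    counter = 0
    for cell in cell_ids:
        if cell != previous_cell:
            counter = 0  # a new cell starts: restart the position index
            previous_cell = cell
        position_ids.append(counter)
        counter += 1
    return position_ids


# Hypothetical token layout: 3 question tokens (cell (0, 0)),
# followed by a 2-token cell and a 3-token cell.
cells = [(0, 0)] * 3 + [(1, 1)] * 2 + [(1, 2)] * 3
absolute = list(range(len(cells)))
relative = reset_position_ids(cells)
print(absolute)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(relative)  # [0, 1, 2, 0, 1, 0, 1, 2]
```

With per-cell resets, the largest index depends on the longest cell rather than on the total sequence length, which is why longer tables can be encoded without running out of position embeddings.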

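Usage: fine-tuning

Here we explain how you can fine-tune TapasForQuestionAnswering on your own dataset.

Step 1: Choose one of the 3 ways in which you can use TAPAS - or experiment

Basically, there are 3 different ways in which one can fine-tune TapasForQuestionAnswering, corresponding to the different datasets on which TAPAS was fine-tuned:

SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example, if you first ask "what's the name of the first actor?", you can then ask a follow-up question such as "how old is he?". Here, questions do not involve any aggregation (all questions are cell selection questions).
WTQ: if you're not interested in asking questions in a conversational set-up, but rather just asking questions related to a table, which might involve aggregation, such as counting a number of rows, summing up cell values or averaging cell values. You can then for example ask "what's the total number of goals Cristiano Ronaldo made in his career?". This case is also called **weak supervision**, since the model itself must learn the appropriate aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer to the question as supervision.
WikiSQL-supervised: this dataset is based on WikiSQL, with the model being given the ground truth aggregation operator during training. This is also called **strong supervision**. Here, learning the appropriate aggregation operator is much easier.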

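To summarize:

Task	Example dataset	Description
Conversational	SQA	Conversational, only cell selection questions
Weak supervision for aggregation	WTQ	Questions might involve aggregation, and the model must learn this given only the answer as supervision
Strong supervision for aggregation	WikiSQL-supervised	Questions might involve aggregation, and the model must learn this given the gold aggregation operator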

PyTorch

Initializing a model with a pre-trained base and randomly initialized classification heads from the hub can be done as shown below.

>>> from transformers import TapasConfig, TapasForQuestionAnswering

>>> # for example, the base sized model with default SQA configuration
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base")

>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wtq")
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wikisql-supervised")
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

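Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters you want when initializing TapasConfig, and then create a TapasForQuestionAnswering based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, then you can do it this way. Here's an example: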

>>> from transformers import TapasConfig, TapasForQuestionAnswering

>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

TensorFlow

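Initializing a model with a pre-trained base and randomly initialized classification heads from the hub can be done as shown below. Be sure to have installed the tensorflow_probability dependency.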

>>> from transformers import TapasConfig, TFTapasForQuestionAnswering

>>> # for example, the base sized model with default SQA configuration
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base")

>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wtq")
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wikisql-supervised")
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

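Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters you want when initializing TapasConfig, and then create a TFTapasForQuestionAnswering based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, then you can do it this way. Here's an example: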

>>> from transformers import TapasConfig, TFTapasForQuestionAnswering

>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

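You can also start from an already fine-tuned checkpoint. Note that the checkpoint already fine-tuned on WTQ has some issues because its L2 loss is somewhat brittle. See here for more info.

For a list of all pre-trained and fine-tuned TAPAS checkpoints available on HuggingFace's hub, see here.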

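Step 2: Prepare your data in the SQA format

Second, no matter what you picked above, you should prepare your dataset in the SQA format. This format is a TSV/CSV file with the following columns:

id: optional, id of the table-question pair, for bookkeeping purposes.
annotator: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.
position: integer indicating whether the question is the first, second, third, and so on, related to the table. Only required in the conversational set-up (SQA). You don't need this column if you're going for WTQ/WikiSQL-supervised.
question: string
table_file: string, name of a csv file containing the tabular data
answer_coordinates: list of one or more tuples (each tuple being a cell coordinate, i.e. a row, column pair that is part of the answer)
answer_text: list of one or more strings (each string being a cell value that is part of the answer)
aggregation_label: index of the aggregation operator. Only required in case of strong supervision for aggregation (the WikiSQL-supervised case)
float_answer: the float answer to the question, if there is one (np.nan if there isn't). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)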

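The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ, WikiSQL) into the SQA format. The author explains this here. A conversion of this script that works with HuggingFace's implementation can be found here. Interestingly, these conversion scripts are not perfect (the answer_coordinates and float_answer fields are populated based on the answer_text), meaning that WTQ and WikiSQL results could actually be improved.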
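As a rough illustration of the format, the sketch below writes one row with the columns listed above and reads it back using only the standard library. The row contents, the file name `table_0.csv` and the choice to store the list-valued columns as Python literals are assumptions made for this example; the format description only fixes the column names.

```python
import ast
import csv
import io

COLUMNS = [
    "id", "annotator", "position", "question", "table_file",
    "answer_coordinates", "answer_text", "aggregation_label", "float_answer",
]

# One hypothetical row for the question "What is the total number of movies?"
row = {
    "id": "example-0",
    "annotator": "0",
    "position": "0",
    "question": "What is the total number of movies?",
    "table_file": "table_0.csv",
    "answer_coordinates": str([(0, 1), (1, 1), (2, 1)]),
    "answer_text": str(["209"]),
    "aggregation_label": "1",  # only for strong supervision; 1 = SUM in the WTQ label set
    "float_answer": "209.0",   # only for weak supervision
}

# Write the header and the row to an in-memory TSV "file"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t")
writer.writeheader()
writer.writerow(row)

# Read the row back: list-valued columns arrive as strings and must be parsed
buffer.seek(0)
parsed = next(csv.DictReader(buffer, delimiter="\t"))
answer_coordinates = ast.literal_eval(parsed["answer_coordinates"])
answer_text = ast.literal_eval(parsed["answer_text"])
float_answer = float(parsed["float_answer"])
```

Parsing with `ast.literal_eval` turns the stored strings back into a list of tuples and a list of strings, which is the shape the tokenizer examples in the next step pass as answer_coordinates and answer_text.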

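Step 3: Convert your data into tensors using TapasTokenizer

PyTorch

Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then use TapasTokenizer to convert table-question pairs into input_ids, attention_mask, token_type_ids and so on. Again, based on which of the three cases you picked above, TapasForQuestionAnswering requires different inputs to be fine-tuned:

Task	Required inputs
Conversational	`input_ids`, `attention_mask`, `token_type_ids`, `labels`
Weak supervision for aggregation	`input_ids`, `attention_mask`, `token_type_ids`, `labels`, `numeric_values`, `numeric_values_scale`, `float_answer`
Strong supervision for aggregation	`input_ids`, `attention_mask`, `token_type_ids`, `labels`, `aggregation_labels`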
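The table above can also be written down as a small lookup, which is handy for checking a batch before it is passed to the model. This is a minimal sketch: the task keys `conversational`, `weak_supervision` and `strong_supervision` and the helper `missing_inputs` are hypothetical names, not part of the library.

```python
COMMON = ["input_ids", "attention_mask", "token_type_ids", "labels"]

# Required model inputs per fine-tuning set-up, as listed in the table
REQUIRED_INPUTS = {
    "conversational": COMMON,
    "weak_supervision": COMMON + ["numeric_values", "numeric_values_scale", "float_answer"],
    "strong_supervision": COMMON + ["aggregation_labels"],
}


def missing_inputs(task, batch):
    """Return the required keys that are absent from `batch` for the given task."""
    return [key for key in REQUIRED_INPUTS[task] if key not in batch]


# A batch without float_answer is incomplete for weak supervision
batch = dict.fromkeys(COMMON + ["numeric_values", "numeric_values_scale"])
print(missing_inputs("weak_supervision", batch))  # ['float_answer']
print(missing_inputs("conversational", batch))    # []
```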

TapasTokenizer creates the labels, numeric_values and numeric_values_scale based on the answer_coordinates and answer_text columns of the TSV file. The float_answer and aggregation_labels are already in the TSV file of step 2. Here's an example:

>>> from transformers import TapasTokenizer
>>> import pandas as pd

>>> model_name = "google/tapas-base"
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> queries = [
...     "What is the name of the first actor?",
...     "How many movies has George Clooney played in?",
...     "What is the total number of movies?",
... ]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(
...     table=table,
...     queries=queries,
...     answer_coordinates=answer_coordinates,
...     answer_text=answer_text,
...     padding="max_length",
...     return_tensors="pt",
... )
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}

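Note that TapasTokenizer expects the data of the table to be **text-only**. You can use .astype(str) on a dataframe to turn it into text-only data. Of course, this only shows how to encode a single training example. It is advised to create a dataloader to iterate over batches: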

>>> import torch
>>> import pandas as pd

>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"


>>> class TableDataset(torch.utils.data.Dataset):
...     def __init__(self, data, tokenizer):
...         self.data = data
...         self.tokenizer = tokenizer

...     def __getitem__(self, idx):
...         item = self.data.iloc[idx]
...         table = pd.read_csv(table_csv_path + item.table_file).astype(
...             str
...         )  # be sure to make your table data text only
...         encoding = self.tokenizer(
...             table=table,
...             queries=item.question,
...             answer_coordinates=item.answer_coordinates,
...             answer_text=item.answer_text,
...             truncation=True,
...             padding="max_length",
...             return_tensors="pt",
...         )
...         # remove the batch dimension which the tokenizer adds by default
...         encoding = {key: val.squeeze(0) for key, val in encoding.items()}
...         # add the float_answer which is also required (weak supervision for aggregation case)
...         encoding["float_answer"] = torch.tensor(item.float_answer)
...         return encoding

...     def __len__(self):
...         return len(self.data)


>>> data = pd.read_csv(tsv_path, sep="\t")
>>> train_dataset = TableDataset(data, tokenizer)
>>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)

TensorFlow

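Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then use TapasTokenizer to convert table-question pairs into input_ids, attention_mask, token_type_ids and so on. Again, based on which of the three cases you picked above, TFTapasForQuestionAnswering requires different inputs to be fine-tuned:

Task	Required inputs
Conversational	`input_ids`, `attention_mask`, `token_type_ids`, `labels`
Weak supervision for aggregation	`input_ids`, `attention_mask`, `token_type_ids`, `labels`, `numeric_values`, `numeric_values_scale`, `float_answer`
Strong supervision for aggregation	`input_ids`, `attention_mask`, `token_type_ids`, `labels`, `aggregation_labels`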

>>> from transformers import TapasTokenizer
>>> import pandas as pd

>>> model_name = "google/tapas-base"
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> queries = [
...     "What is the name of the first actor?",
...     "How many movies has George Clooney played in?",
...     "What is the total number of movies?",
... ]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(
...     table=table,
...     queries=queries,
...     answer_coordinates=answer_coordinates,
...     answer_text=answer_text,
...     padding="max_length",
...     return_tensors="tf",
... )
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}

>>> import tensorflow as tf
>>> import pandas as pd

>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"


>>> class TableDataset:
...     def __init__(self, data, tokenizer):
...         self.data = data
...         self.tokenizer = tokenizer

...     def __iter__(self):
...         for idx in range(self.__len__()):
...             item = self.data.iloc[idx]
...             table = pd.read_csv(table_csv_path + item.table_file).astype(
...                 str
...             )  # be sure to make your table data text only
...             encoding = self.tokenizer(
...                 table=table,
...                 queries=item.question,
...                 answer_coordinates=item.answer_coordinates,
...                 answer_text=item.answer_text,
...                 truncation=True,
...                 padding="max_length",
...                 return_tensors="tf",
...             )
...             # remove the batch dimension which the tokenizer adds by default
...             encoding = {key: tf.squeeze(val, 0) for key, val in encoding.items()}
...             # add the float_answer which is also required (weak supervision for aggregation case)
...             encoding["float_answer"] = tf.convert_to_tensor(item.float_answer, dtype=tf.float32)
...             yield encoding["input_ids"], encoding["attention_mask"], encoding["numeric_values"], encoding[
...                 "numeric_values_scale"
...             ], encoding["token_type_ids"], encoding["labels"], encoding["float_answer"]

...     def __len__(self):
...         return len(self.data)


>>> data = pd.read_csv(tsv_path, sep="\t")
>>> train_dataset = TableDataset(data, tokenizer)
>>> output_signature = (
...     tf.TensorSpec(shape=(512,), dtype=tf.int32),
...     tf.TensorSpec(shape=(512,), dtype=tf.int32),
...     tf.TensorSpec(shape=(512,), dtype=tf.float32),
...     tf.TensorSpec(shape=(512,), dtype=tf.float32),
...     tf.TensorSpec(shape=(512, 7), dtype=tf.int32),
...     tf.TensorSpec(shape=(512,), dtype=tf.int32),
...     tf.TensorSpec(shape=(512,), dtype=tf.float32),
... )
>>> # from_generator expects a callable that returns an iterable, hence the lambda
>>> train_dataloader = tf.data.Dataset.from_generator(lambda: train_dataset, output_signature=output_signature).batch(32)

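Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not conversational**. If your dataset involves conversational questions (such as in SQA), then you should first group together the queries, answer_coordinates and answer_text per table (in the order of their position index) and batch encode each table with its questions. This will make sure that the prev_labels token types (see docs of TapasTokenizer) are set correctly. See this notebook for more info. See this notebook for more info regarding using the TensorFlow model.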
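A minimal sketch of that grouping step, using plain dictionaries in place of the TSV rows: questions are collected per table and sorted by position, so that each table can be encoded together with its ordered questions. The row contents and the helper name `group_by_table` are assumptions for illustration.

```python
from collections import defaultdict


def group_by_table(rows):
    """Group rows per table_file and order each group by its position index."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["table_file"]].append(row)
    return {
        table_file: sorted(items, key=lambda r: r["position"])
        for table_file, items in grouped.items()
    }


# Hypothetical rows, deliberately out of order
rows = [
    {"table_file": "table_0.csv", "position": 1, "question": "What is his age?"},
    {"table_file": "table_1.csv", "position": 0, "question": "Which team won?"},
    {"table_file": "table_0.csv", "position": 0, "question": "What is the name of the first actor?"},
]

grouped_rows = group_by_table(rows)
for table_file, items in grouped_rows.items():
    queries = [item["question"] for item in items]
    # `queries` (together with the matching answer_coordinates and answer_text
    # lists) would be passed to the tokenizer in a single call for this table
    print(table_file, queries)
```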

Step 4: Train (fine-tune) the model

PyTorch

You can then fine-tune TapasForQuestionAnswering as follows (shown here for the weak supervision for aggregation case):

>>> import torch
>>> from transformers import TapasConfig, TapasForQuestionAnswering

>>> # this is the default WTQ configuration
>>> config = TapasConfig(
...     num_aggregation_labels=4,
...     use_answer_as_supervision=True,
...     answer_loss_cutoff=0.664694,
...     cell_selection_preference=0.207951,
...     huber_loss_delta=0.121194,
...     init_cell_selection_weights_to_zero=True,
...     select_one_column=True,
...     allow_empty_column_selection=False,
...     temperature=0.0352513,
... )
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

>>> # use PyTorch's AdamW; the AdamW class formerly exported by transformers is deprecated
>>> optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

>>> model.train()
>>> for epoch in range(2):  # loop over the dataset multiple times
...     for batch in train_dataloader:
...         # get the inputs;
...         input_ids = batch["input_ids"]
...         attention_mask = batch["attention_mask"]
...         token_type_ids = batch["token_type_ids"]
...         labels = batch["labels"]
...         numeric_values = batch["numeric_values"]
...         numeric_values_scale = batch["numeric_values_scale"]
...         float_answer = batch["float_answer"]

...         # zero the parameter gradients
...         optimizer.zero_grad()

...         # forward + backward + optimize
...         outputs = model(
...             input_ids=input_ids,
...             attention_mask=attention_mask,
...             token_type_ids=token_type_ids,
...             labels=labels,
...             numeric_values=numeric_values,
...             numeric_values_scale=numeric_values_scale,
...             float_answer=float_answer,
...         )
...         loss = outputs.loss
...         loss.backward()
...         optimizer.step()

TensorFlow

You can then fine-tune TFTapasForQuestionAnswering as follows (shown here for the weak supervision for aggregation case):

>>> import tensorflow as tf
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering

>>> # this is the default WTQ configuration
>>> config = TapasConfig(
...     num_aggregation_labels=4,
...     use_answer_as_supervision=True,
...     answer_loss_cutoff=0.664694,
...     cell_selection_preference=0.207951,
...     huber_loss_delta=0.121194,
...     init_cell_selection_weights_to_zero=True,
...     select_one_column=True,
...     allow_empty_column_selection=False,
...     temperature=0.0352513,
... )
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

>>> optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

>>> for epoch in range(2):  # loop over the dataset multiple times
...     for batch in train_dataloader:
...         # get the inputs;
...         input_ids = batch[0]
...         attention_mask = batch[1]
...         token_type_ids = batch[4]
...         labels = batch[5]  # labels are the sixth element yielded by the generator
...         numeric_values = batch[2]
...         numeric_values_scale = batch[3]
...         float_answer = batch[6]

...         # forward + backward + optimize
...         with tf.GradientTape() as tape:
...             outputs = model(
...                 input_ids=input_ids,
...                 attention_mask=attention_mask,
...                 token_type_ids=token_type_ids,
...                 labels=labels,
...                 numeric_values=numeric_values,
...                 numeric_values_scale=numeric_values_scale,
...                 float_answer=float_answer,
...             )
...         grads = tape.gradient(outputs.loss, model.trainable_weights)
...         optimizer.apply_gradients(zip(grads, model.trainable_weights))

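Usage: inference

PyTorch

Here we explain how you can use TapasForQuestionAnswering or TFTapasForQuestionAnswering for inference (i.e. making predictions on new data). For inference, only input_ids, attention_mask and token_type_ids (which you can obtain using TapasTokenizer) have to be provided to the model to obtain the logits. Next, you can use the handy ~models.tapas.tokenization_tapas.convert_logits_to_predictions method to convert these into predicted coordinates and optional aggregation indices.

However, note that inference is **different** depending on whether or not the set-up is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that: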

>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd

>>> model_name = "google/tapas-base-finetuned-wtq"
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> queries = [
...     "What is the name of the first actor?",
...     "How many movies has George Clooney played in?",
...     "What is the total number of movies?",
... ]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
...     inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
... )

>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]

>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
...     if len(coordinates) == 1:
...         # only a single cell:
...         answers.append(table.iat[coordinates[0]])
...     else:
...         # multiple cells
...         cell_values = []
...         for coordinate in coordinates:
...             cell_values.append(table.iat[coordinate])
...         answers.append(", ".join(cell_values))

>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
...     print(query)
...     if predicted_agg == "NONE":
...         print("Predicted answer: " + answer)
...     else:
...         print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69

TensorFlow

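Here we explain how you can use TFTapasForQuestionAnswering for inference (i.e. making predictions on new data). For inference, only input_ids, attention_mask and token_type_ids (which you can obtain using TapasTokenizer) have to be provided to the model to obtain the logits. Next, you can use the handy ~models.tapas.tokenization_tapas.convert_logits_to_predictions method to convert these into predicted coordinates and optional aggregation indices.

However, note that inference is **different** depending on whether or not the set-up is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that: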

>>> from transformers import TapasTokenizer, TFTapasForQuestionAnswering
>>> import pandas as pd

>>> model_name = "google/tapas-base-finetuned-wtq"
>>> model = TFTapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> queries = [
...     "What is the name of the first actor?",
...     "How many movies has George Clooney played in?",
...     "What is the total number of movies?",
... ]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="tf")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
...     inputs, outputs.logits, outputs.logits_aggregation
... )

>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]

>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
...     if len(coordinates) == 1:
...         # only a single cell:
...         answers.append(table.iat[coordinates[0]])
...     else:
...         # multiple cells
...         cell_values = []
...         for coordinate in coordinates:
...             cell_values.append(table.iat[coordinate])
...         answers.append(", ".join(cell_values))

>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
...     print(query)
...     if predicted_agg == "NONE":
...         print("Predicted answer: " + answer)
...     else:
...         print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
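The model only predicts which cells to select and which operator to apply; the final numeric answer still has to be computed from them. A minimal sketch of that last step, assuming the four WTQ operators from id2aggregation and cell values that parse as floats (the helper `apply_aggregation` is a hypothetical name, not a library function):

```python
def apply_aggregation(operator, cell_values):
    """Combine the selected cell strings according to the predicted operator."""
    if operator == "NONE":
        return ", ".join(cell_values)
    if operator == "COUNT":
        return float(len(cell_values))
    numbers = [float(value) for value in cell_values]
    if operator == "SUM":
        return sum(numbers)
    if operator == "AVERAGE":
        return sum(numbers) / len(numbers)
    raise ValueError(f"unknown aggregation operator: {operator}")


print(apply_aggregation("SUM", ["87", "53", "69"]))  # 209.0
print(apply_aggregation("COUNT", ["69"]))            # 1.0
print(apply_aggregation("NONE", ["Brad Pitt"]))      # Brad Pitt
```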

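In a conversational set-up, each table-question pair must be provided **sequentially** to the model, such that the prev_labels token types can be overwritten by the predicted labels of the previous table-question pair. Again, more info can be found in this notebook (for PyTorch) and this notebook (for TensorFlow).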
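The control flow for the conversational case can be sketched as follows. `predict_cells` is a hypothetical stand-in for the real work (tokenizing one table-question pair, overwriting the prev_labels token types with the previous prediction, and running the model); here it is replaced by a stub so that the loop itself can be run.

```python
def run_conversation(questions, predict_cells):
    """Answer questions one by one, feeding each prediction into the next step."""
    previous_cells = []  # nothing was selected before the first question
    history = []
    for question in questions:
        cells = predict_cells(question, previous_cells)
        history.append((question, cells))
        previous_cells = cells  # becomes the prev_labels signal for the next question
    return history


def stub_predict_cells(question, previous_cells):
    # Stub only: selects (0, 0) first, then the cell one column to the right
    # of the previously selected cell, mimicking a follow-up question.
    if not previous_cells:
        return [(0, 0)]
    row, column = previous_cells[0]
    return [(row, column + 1)]


history = run_conversation(
    ["What is the name of the first actor?", "How many movies has he played in?"],
    stub_predict_cells,
)
print(history[1][1])  # [(0, 1)]
```

The point of the sketch is the data dependency: the second call cannot start before the first has finished, which is why conversational inference cannot be batched across the questions of one table.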

Resources

TAPAS specific outputs

class transformers.models.tapas.modeling_tapas.TableQuestionAnsweringOutput

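( loss: typing.Optional[torch.FloatTensor] = None logits: typing.Optional[torch.FloatTensor] = None logits_aggregation: typing.Optional[torch.FloatTensor] = None hidden_states: typing.Optional[tuple[torch.FloatTensor]] = None attentions: typing.Optional[tuple[torch.FloatTensor]] = None )

Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels (and possibly answer, aggregation_labels, numeric_values and numeric_values_scale) are provided): Total loss as the sum of the hierarchical cell selection log-likelihood loss and (optionally) the semi-supervised regression loss and (optionally) the supervised loss for aggregations.
logits (torch.FloatTensor of shape (batch_size, sequence_length)): Prediction scores of the cell selection head, for every token.
logits_aggregation (torch.FloatTensor, optional, of shape (batch_size, num_aggregation_labels)): Prediction scores of the aggregation head, for every aggregation operator.
hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Output type of TapasForQuestionAnswering.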

TapasConfig

class transformers.TapasConfig

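( vocab_size = 30522 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 1024 type_vocab_sizes = [3, 256, 256, 2, 256, 256, 10] initializer_range = 0.02 layer_norm_eps = 1e-12 pad_token_id = 0 positive_label_weight = 10.0 num_aggregation_labels = 0 aggregation_loss_weight = 1.0 use_answer_as_supervision = None answer_loss_importance = 1.0 use_normalized_answer_loss = False huber_loss_delta = None temperature = 1.0 aggregation_temperature = 1.0 use_gumbel_for_cells = False use_gumbel_for_aggregation = False average_approximation_function = 'ratio' cell_selection_preference = None answer_loss_cutoff = None max_num_rows = 64 max_num_columns = 32 average_logits_per_cell = False select_one_column = True allow_empty_column_selection = False init_cell_selection_weights_to_zero = False reset_position_index_per_cell = True disable_per_token_loss = False aggregation_labels = None no_aggregation_label_index = None **kwargs )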

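Parameters

vocab_size (int, optional, defaults to 30522): Vocabulary size of the TAPAS model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling TapasModel.
hidden_size (int, optional, defaults to 768): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072): Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (str or Callable, optional, defaults to "gelu"): The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.1): The dropout ratio for the attention probabilities.
max_position_embeddings (int, optional, defaults to 1024): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g. 512 or 1024 or 2048).
type_vocab_sizes (list[int], optional, defaults to [3, 256, 256, 2, 256, 256, 10]): The vocabulary sizes of the token_type_ids passed when calling TapasModel.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12): The epsilon used by the layer normalization layers.
positive_label_weight (float, optional, defaults to 10.0): Weight for positive labels.
num_aggregation_labels (int, optional, defaults to 0): The number of aggregation operators to predict.
aggregation_loss_weight (float, optional, defaults to 1.0): Importance weight for the aggregation loss.
use_answer_as_supervision (bool, optional): Whether to use the answer as the only supervision for aggregation examples.
answer_loss_importance (float, optional, defaults to 1.0): Importance weight for the regression loss.
use_normalized_answer_loss (bool, optional, defaults to False): Whether to normalize the answer loss by the maximum of the predicted and expected value.
huber_loss_delta (float, optional): Delta parameter used to calculate the regression loss.
temperature (float, optional, defaults to 1.0): Value used to control (or change) the skewness of cell logits probabilities.
aggregation_temperature (float, optional, defaults to 1.0): Scales aggregation logits to control the skewness of probabilities.
use_gumbel_for_cells (bool, optional, defaults to False): Whether to apply Gumbel-Softmax to cell selection.
use_gumbel_for_aggregation (bool, optional, defaults to False): Whether to apply Gumbel-Softmax to aggregation selection.
average_approximation_function (string, optional, defaults to "ratio"): Method to calculate the expected average of cells in the weak supervision case. Must be one of "ratio", "first_order" or "second_order".
cell_selection_preference (float, optional): Preference for cell selection in ambiguous cases. Only applicable in case of weak supervision for aggregation (WTQ, WikiSQL). If the total mass of the aggregation probabilities (excluding the "NONE" operator) is higher than this hyperparameter, then aggregation is predicted for an example.
answer_loss_cutoff (float, optional): Ignore examples with an answer loss larger than the cutoff.
max_num_rows (int, optional, defaults to 64): Maximum number of rows.
max_num_columns (int, optional, defaults to 32): Maximum number of columns.
average_logits_per_cell (bool, optional, defaults to False): Whether to average logits per cell.
select_one_column (bool, optional, defaults to True): Whether to constrain the model to only select cells from a single column.
allow_empty_column_selection (bool, optional, defaults to False): Whether to allow not selecting any column.
init_cell_selection_weights_to_zero (bool, optional, defaults to False): Whether to initialize cell selection weights to 0 so that the initial probabilities are 50%.
reset_position_index_per_cell (bool, optional, defaults to True): Whether to restart position indexes at every cell (i.e. use relative position embeddings).
disable_per_token_loss (bool, optional, defaults to False): Whether to disable any (strong or weak) supervision on cells.
aggregation_labels (dict[int, label], optional): The aggregation labels used to aggregate the results. For example, the WTQ models have the following aggregation labels: {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
no_aggregation_label_index (int, optional): If the aggregation labels are defined and one of these labels represents "No aggregation", this should be set to its index. For example, the WTQ models have the "NONE" aggregation label at index 0, so that value should be set to 0 for these models.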
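The rule described for cell_selection_preference can be illustrated with a few lines of plain Python. This is a simplified sketch of the description above, not the library's implementation: it applies a softmax to hypothetical aggregation logits, sums the probability mass of all operators except NONE, and compares it with the threshold. It assumes NONE sits at index 0, as in the WTQ label set; the function names are made up for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicts_aggregation(aggregation_logits, cell_selection_preference, none_index=0):
    """True if the non-NONE probability mass exceeds the preference threshold."""
    probs = softmax(aggregation_logits)
    aggregation_mass = sum(p for i, p in enumerate(probs) if i != none_index)
    return aggregation_mass > cell_selection_preference


# Hypothetical logits in the order NONE, SUM, AVERAGE, COUNT
print(predicts_aggregation([4.0, 0.0, 0.0, 0.0], 0.207951))  # False
print(predicts_aggregation([0.0, 2.0, 0.0, 0.0], 0.207951))  # True
```

In the first call almost all probability mass sits on NONE, so the example is treated as plain cell selection; in the second, SUM dominates and aggregation is predicted.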

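This is the configuration class to store the configuration of a TapasModel. It is used to instantiate a TAPAS model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the TAPAS google/tapas-base-finetuned-sqa architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Hyperparameters additional to BERT are taken from run_task_main.py and hparam_utils.py of the original implementation. The original implementation is available at https://github.com/google-research/tapas/tree/master.

Example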

>>> from transformers import TapasModel, TapasConfig

>>> # Initializing a default (SQA) Tapas configuration
>>> configuration = TapasConfig()
>>> # Initializing a model from the configuration
>>> model = TapasModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
