Use with PyArrow

This document is a quick introduction to using `datasets` with PyArrow. It focuses on how to process datasets with Arrow compute functions, and on how to convert a dataset to or from PyArrow.

This is particularly useful because `datasets` uses PyArrow under the hood, which allows fast zero-copy operations.

Dataset format

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get PyArrow Tables or Arrays instead, set the format of the dataset to `arrow` using Dataset.with_format():

>>> from datasets import Dataset
>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]}
>>> ds = Dataset.from_dict(data)
>>> ds = ds.with_format("arrow")
>>> ds[0]       # pa.Table
pyarrow.Table
col_0: string
col_1: double
----
col_0: [["a"]]
col_1: [[0]]
>>> ds[:2]      # pa.Table
pyarrow.Table
col_0: string
col_1: double
----
col_0: [["a","b"]]
col_1: [[0,0]]
>>> ds["col_0"]  # pa.ChunkedArray
<pyarrow.lib.ChunkedArray object at 0x1394312a0>
[
  [
    "a",
    "b",
    "c",
    "d"
  ]
]

This also works for `IterableDataset` objects, obtained for example with `load_dataset(..., streaming=True)`:

>>> ds = ds.with_format("arrow")
>>> for table in ds.iter(batch_size=2):
...     print(table)
...     break
pyarrow.Table
col_0: string
col_1: double
----
col_0: [["a","b"]]
col_1: [[0,0]]

Process data

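PyArrow functions are generally faster than hand-written regular Python functions, so they are a good option for optimizing data processing. You can use Arrow compute functions to process a dataset in Dataset.map() or Dataset.filter():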

>>> import pyarrow.compute as pc
>>> from datasets import Dataset
>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]}
>>> ds = Dataset.from_dict(data)
>>> ds = ds.with_format("arrow")
>>> ds = ds.map(lambda t: t.append_column("col_2", pc.add(t["col_1"], 1)), batched=True)
>>> ds[:2]
pyarrow.Table
col_0: string
col_1: double
col_2: double
----
col_0: [["a","b"]]
col_1: [[0,0]]
col_2: [[1,1]]
>>> ds = ds.filter(lambda t: pc.equal(t["col_0"], "b"), batched=True)
>>> ds[0]
pyarrow.Table
col_0: string
col_1: double
col_2: double
----
col_0: [["b"]]
col_1: [[0]]
col_2: [[1]]

We use `batched=True` because processing batches of data in PyArrow is faster than processing row by row. In `map()`, you can also use `batch_size=` to set the size of each `table`.

This also works for IterableDataset.map() and IterableDataset.filter().

Import or export from PyArrow

A Dataset is a wrapper around a PyArrow Table, and you can instantiate a Dataset directly from a Table:

ds = Dataset(table)

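You can access the PyArrow Table of a dataset using Dataset.data, which returns a `MemoryMappedTable`, an `InMemoryTable`, or a `ConcatenationTable`, depending on the origin of the Arrow data and the operations that were applied.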

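These objects wrap the underlying PyArrow table, accessible at `Dataset.data.table`. This table contains all the data of the dataset, but there may also be an indices mapping at `Dataset._indices`, which maps dataset row indices to PyArrow Table row indices. This can happen if the dataset has been shuffled with Dataset.shuffle(), or if only a subset of the rows is used (for example after Dataset.select()).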

In the general case, you can export a dataset to a PyArrow Table using `table = ds.with_format("arrow")[:]`.
