Use with Pandas

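This document is a quick introduction to using `datasets` with Pandas, with a particular focus on how to process datasets using Pandas functions and how to convert a dataset to or from Pandas.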

This is particularly useful because `datasets` uses PyArrow under the hood, and PyArrow integrates well with Pandas, so these operations are fast.

Dataset format

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get Pandas DataFrames or Series instead, you can set the format of the dataset to `pandas` using Dataset.with_format():

>>> from datasets import Dataset
>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]}
>>> ds = Dataset.from_dict(data)
>>> ds = ds.with_format("pandas")
>>> ds[0]       # pd.DataFrame
  col_0  col_1
0     a    0.0
>>> ds[:2]      # pd.DataFrame
  col_0  col_1
0     a    0.0
1     b    0.0
>>> ds["col_0"]  # pd.Series
0    a
1    b
2    c
3    d
Name: col_0, dtype: object

This also works for `IterableDataset` objects obtained, for example, with `load_dataset(..., streaming=True)`:

>>> ds = ds.with_format("pandas")
>>> for df in ds.iter(batch_size=2):
...     print(df)
...     break
  col_0  col_1
0     a    0.0
1     b    0.0

Process data

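Pandas functions are generally faster than regular hand-written Python functions, so they are a good option for optimizing data processing. You can use Pandas functions to process a dataset in Dataset.map() or Dataset.filter():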

>>> from datasets import Dataset
>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]}
>>> ds = Dataset.from_dict(data)
>>> ds = ds.with_format("pandas")
>>> ds = ds.map(lambda df: df.assign(col_2=df.col_1 + 1), batched=True)
>>> ds[:2]
  col_0  col_1  col_2
0     a    0.0    1.0
1     b    0.0    1.0
>>> ds = ds.filter(lambda df: df.col_0 == "b", batched=True)
>>> ds[0]
  col_0  col_1  col_2
0     b    0.0    1.0

We use `batched=True` because processing batches of data in Pandas is faster than processing row by row. You can also pass `batch_size=` to `map()` to set the size of each `df`.

This also works for IterableDataset.map() and IterableDataset.filter().

Import or export from Pandas

To import data from Pandas, you can use Dataset.from_pandas():

ds = Dataset.from_pandas(df)

You can use Dataset.to_pandas() to export a dataset to a Pandas DataFrame:

df = ds.to_pandas()
