Use with Polars

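This document is a quick introduction to using datasets with Polars, with a particular focus on how to process datasets with Polars functions and how to convert a dataset to or from Polars.

This is particularly useful because both datasets and Polars use Arrow under the hood, which allows fast zero-copy operations.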

Dataset format

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get Polars DataFrames or Series instead, set the format of the dataset to polars using Dataset.with_format():

>>> from datasets import Dataset
>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]}
>>> ds = Dataset.from_dict(data)
>>> ds = ds.with_format("polars")
>>> ds[0]       # pl.DataFrame
shape: (1, 2)
┌───────┬───────┐
│ col_0 ┆ col_1 │
│ ---   ┆ ---   │
│ str   ┆ f64   │
╞═══════╪═══════╡
│ a     ┆ 0.0   │
└───────┴───────┘
>>> ds[:2]      # pl.DataFrame
shape: (2, 2)
┌───────┬───────┐
│ col_0 ┆ col_1 │
│ ---   ┆ ---   │
│ str   ┆ f64   │
╞═══════╪═══════╡
│ a     ┆ 0.0   │
│ b     ┆ 0.0   │
└───────┴───────┘
>>> ds["col_0"]  # pl.Series
shape: (4,)
Series: 'col_0' [str]
[
        "a"
        "b"
        "c"
        "d"
]

This also works for `IterableDataset` objects obtained, for example, with `load_dataset(..., streaming=True)`:

>>> ds = ds.with_format("polars")
>>> for df in ds.iter(batch_size=2):
...     print(df)
...     break
shape: (2, 2)
┌───────┬───────┐
│ col_0 ┆ col_1 │
│ ---   ┆ ---   │
│ str   ┆ f64   │
╞═══════╪═══════╡
│ a     ┆ 0.0   │
│ b     ┆ 0.0   │
└───────┴───────┘

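Process data

Polars functions are generally faster than regular hand-written Python functions, which makes them a good option for optimizing data processing. You can use Polars functions to process a dataset in Dataset.map() or Dataset.filter():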

>>> import polars as pl
>>> from datasets import Dataset
>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]}
>>> ds = Dataset.from_dict(data)
>>> ds = ds.with_format("polars")
>>> ds = ds.map(lambda df: df.with_columns(pl.col("col_1").add(1).alias("col_2")), batched=True)
>>> ds[:2]
shape: (2, 3)
┌───────┬───────┬───────┐
│ col_0 ┆ col_1 ┆ col_2 │
│ ---   ┆ ---   ┆ ---   │
│ str   ┆ f64   ┆ f64   │
╞═══════╪═══════╪═══════╡
│ a     ┆ 0.0   ┆ 1.0   │
│ b     ┆ 0.0   ┆ 1.0   │
└───────┴───────┴───────┘
>>> ds = ds.filter(lambda df: df["col_0"] == "b", batched=True)
>>> ds[0]
shape: (1, 3)
┌───────┬───────┬───────┐
│ col_0 ┆ col_1 ┆ col_2 │
│ ---   ┆ ---   ┆ ---   │
│ str   ┆ f64   ┆ f64   │
╞═══════╪═══════╪═══════╡
│ b     ┆ 0.0   ┆ 1.0   │
└───────┴───────┴───────┘

We use batched=True because processing data in batches is faster in Polars than processing it row by row. You can also pass batch_size= to map() to set the size of each df.

This also works for IterableDataset.map() and IterableDataset.filter().

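Example: data extraction

Polars provides many functions for every data type: strings, floats, integers, etc. The full list is available in the Polars documentation. These functions are written in Rust and run on batches of data, which enables fast data processing.

Here is an example that extracts solutions from an LLM reasoning dataset, where using Polars gives a 5x speed-up over a regular Python function: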

import re

import polars as pl
from datasets import load_dataset

ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train")

# Using a regular python function
pattern = re.compile("boxed\\{(.*)\\}")
result_ds = ds.map(lambda x: {"value_solution": m.group(1) if (m:=pattern.search(x["solution"])) else None})
# Time: 10s

# Using a Polars function
expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
result_ds = ds.with_format("polars").map(lambda df: df.with_columns(expr), batched=True)
# Time: 2s

Import or export from Polars

To import data from Polars, you can use Dataset.from_polars():

ds = Dataset.from_polars(df)

You can also export a Dataset to a Polars DataFrame using Dataset.to_polars():

df = ds.to_polars()
