PyArrow

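Arrow is a columnar data format and toolbox for fast data interchange and in-memory analytics. Since PyArrow supports fsspec for reading and writing remote data, you can use Hugging Face paths (hf://) to read and write data on the Hub. This is especially useful for Parquet data, since Parquet is the most common file format on Hugging Face. Parquet is particularly efficient thanks to its structure, typing, metadata and compression.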

Load a Table

You can load data from local files or from remote storage such as Hugging Face Datasets. PyArrow supports many formats, including CSV, JSON and, most importantly, Parquet:

>>> import pyarrow.parquet as pq
>>> table = pq.read_table("path/to/data.parquet")

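To load a file from Hugging Face, the path needs to start with hf://. For example, the path to the stanfordnlp/imdb dataset repository is hf://datasets/stanfordnlp/imdb. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file plain_text/train-00000-of-00001.parquet as a pyarrow Table (requires pyarrow>=21.0):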

>>> import pyarrow.parquet as pq
>>> table = pq.read_table("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> table
pyarrow.Table
text: string
label: int64
----
text: [["I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it (... 1542 chars omitted)", ...],...,[..., "The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritan (... 221 chars omitted)"]]
label: [[0,0,0,0,0,...,0,0,0,0,0],...,[1,1,1,1,1,...,1,1,1,1,1]]

If you don't want to load the full Parquet data, you can read the Parquet metadata instead, or load the file one row group at a time:

>>> import pyarrow.parquet as pq
>>> pf = pq.ParquetFile("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> pf.metadata
<pyarrow._parquet.FileMetaData object at 0x1171b4090>
  created_by: parquet-cpp-arrow version 12.0.0
  num_columns: 2
  num_rows: 25000
  num_row_groups: 25
  format_version: 2.6
  serialized_size: 62036
>>> for i in range(pf.num_row_groups):
...     table = pf.read_row_group(i)
...     ...

For more information on Hugging Face paths and how they are implemented, see the client library's documentation on the HfFileSystem.

Save a Table

You can save a pyarrow Table with pyarrow.parquet.write_table, either to a local file or directly to Hugging Face.

To save the Table on Hugging Face, you first need to log in with your Hugging Face account, for example using:

hf auth login

Then you can create a dataset repository, for example using:

from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")

Finally, you can use Hugging Face paths in PyArrow:

import pyarrow.parquet as pq

pq.write_table(table, "hf://datasets/username/my_dataset/imdb.parquet", use_content_defined_chunking=True)

# or write in separate files if the dataset has train/validation/test splits
pq.write_table(table_train, "hf://datasets/username/my_dataset/train.parquet", use_content_defined_chunking=True)
pq.write_table(table_valid, "hf://datasets/username/my_dataset/validation.parquet", use_content_defined_chunking=True)
pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", use_content_defined_chunking=True)

We use use_content_defined_chunking=True to enable faster uploads and downloads from Hugging Face thanks to Xet deduplication (requires pyarrow>=21.0).

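Content-defined chunking (CDC) makes the Parquet writer split data pages at boundaries that depend on the data itself, so that duplicate data is chunked and compressed identically. Without CDC, pages are chunked arbitrarily, and compression then makes duplicate data impossible to detect. Thanks to CDC, Parquet uploads to and downloads from Hugging Face are faster, since duplicate data is uploaded or downloaded only once.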
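
To make the idea concrete, here is a toy, pure-Python illustration of content-defined chunking. It is not the algorithm used by PyArrow or Xet: the CRC32-over-a-sliding-window rule, the window size and the divisor are assumptions chosen for brevity (real implementations typically use a rolling hash with minimum and maximum chunk sizes). It compares how many chunks survive when a few bytes are prepended to the data:

```python
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, divisor: int = 64) -> list[bytes]:
    """Cut wherever a hash of the last `window` bytes is a multiple of `divisor`."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(20_000))
edited = b"new header" + data  # same content, shifted by a few bytes

# Chunks that appear in both versions could be stored or transferred only once.
fixed_shared = set(fixed_chunks(data)) & set(fixed_chunks(edited))
cdc_shared = set(cdc_chunks(data)) & set(cdc_chunks(edited))

print(f"fixed-size chunks shared: {len(fixed_shared)}")
print(f"content-defined chunks shared: {len(cdc_shared)}")
```

With fixed-size chunks, the small shift moves every boundary, so the two versions have almost no chunks in common. With content-defined boundaries, the cut points move with the data, so every chunk after the first boundary is unchanged.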

For more information about Xet, see the Xet documentation.

Use Images

You can load a folder that contains a metadata file with a field for the image names or paths, structured like this:

Example 1:            Example 2:
folder/               folder/
├── metadata.parquet  ├── metadata.parquet
├── img000.png        └── images
├── img001.png            ├── img000.png
...                       ...
└── imgNNN.png            └── imgNNN.png

You can iterate over the image paths like this:

from pathlib import Path
import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
# Join path components with "/" (Path does not support "+")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    image_path = folder_path / file_name
    ...

Since the dataset is in a supported structure (a metadata.parquet file with a file_name field), you can save it to Hugging Face, and the Dataset Viewer shows both the metadata and the images.

from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_image_dataset",
    repo_type="dataset",
)

Embed Images inside Parquet

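PyArrow has a binary type that lets you store image bytes in an Arrow table. This makes it possible to save the dataset as a single Parquet file that contains both the images (bytes and path) and the sample metadata: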

import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the image bytes in Arrow
image_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
# Tables are immutable: append_column returns a new Table
table = table.append_column("image", image_array)

# (Optional) Set the HF Image type for the Dataset Viewer and the `datasets` library
features = {"image": {"_type": "Image"}}  # or using datasets.Features(...).to_dict()
# Arrow schema metadata values must be strings or bytes, so the dict is serialized to JSON
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)

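Setting the Image type in the Arrow schema metadata lets other libraries and the Hugging Face Dataset Viewer know that "image" contains images and not just binary data.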

Use Audios

You can load a folder that contains a metadata file with a field for the audio names or paths, structured like this:

Example 1:            Example 2:
folder/               folder/
├── metadata.parquet  ├── metadata.parquet
├── rec000.wav        └── audios
├── rec001.wav            ├── rec000.wav
...                       ...
└── recNNN.wav            └── recNNN.wav

You can iterate over the audio paths like this:

from pathlib import Path
import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
# Join path components with "/" (Path does not support "+")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    audio_path = folder_path / file_name
    ...

Since the dataset is in a supported structure (a metadata.parquet file with a file_name field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and the audio.

from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_audio_dataset",
    repo_type="dataset",
)

Embed Audio inside Parquet

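PyArrow has a binary type that lets you store audio bytes in an Arrow table. This makes it possible to save the dataset as a single Parquet file that contains both the audio (bytes and path) and the sample metadata: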

import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the audio bytes in Arrow
audio_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
# Tables are immutable: append_column returns a new Table
table = table.append_column("audio", audio_array)

# (Optional) Set the HF Audio type for the Dataset Viewer and the `datasets` library
features = {"audio": {"_type": "Audio"}}  # or using datasets.Features(...).to_dict()
# Arrow schema metadata values must be strings or bytes, so the dict is serialized to JSON
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)

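Setting the Audio type in the Arrow schema metadata lets other libraries and the Hugging Face Dataset Viewer recognize that "audio" contains audio data and not just binary data.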
