Interact with the Hub through the Filesystem API

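In addition to HfApi, the huggingface_hub library provides HfFileSystem, a pythonic, fsspec-compatible file interface to the Hugging Face Hub. HfFileSystem builds on top of HfApi and offers typical filesystem-style operations such as cp, mv, ls, du, glob, get_file, and put_file.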

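HfFileSystem provides fsspec compatibility, which is useful for libraries that require it (for example, reading Hugging Face datasets directly with pandas). However, this compatibility layer introduces additional overhead. For better performance and reliability, use HfApi methods whenever possible.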

Usage

>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()

>>> # List all files in a directory
>>> fs.ls("datasets/my-username/my-dataset-repo/data", detail=False)
['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']

>>> # List all ".csv" files in a repo
>>> fs.glob("datasets/my-username/my-dataset-repo/**/*.csv")
['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']

>>> # Read a remote file
>>> with fs.open("datasets/my-username/my-dataset-repo/data/train.csv", "r") as f:
...     train_data = f.readlines()

>>> # Read the content of a remote file as a string
>>> train_data = fs.read_text("datasets/my-username/my-dataset-repo/data/train.csv", revision="dev")

>>> # Write a remote file
>>> with fs.open("datasets/my-username/my-dataset-repo/data/validation.csv", "w") as f:
...     # End each row with a newline so the two rows do not run together
...     f.write("text,label\n")
...     f.write("Fantastic movie!,good\n")

The optional revision argument can be passed to run an operation against a specific commit, identified by a branch, a tag name, or a commit hash.

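Unlike Python's built-in open, fsspec's open defaults to binary mode, "rb". This means you must explicitly set the mode to "r" to read text and "w" to write text. Appending to a file (modes "a" and "ab") is not supported yet.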
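
Because append modes are not supported, one possible workaround is to read the existing content and rewrite the whole file. The sketch below illustrates that idea under a few assumptions: append_text is a hypothetical helper (it is not part of huggingface_hub), fs is any fsspec-style filesystem object exposing exists and open (such as an HfFileSystem instance), and the file is small enough to hold in memory.

```python
def append_text(fs, path, text):
    """Emulate an append by reading the current content and rewriting the file.

    Hypothetical helper, not part of huggingface_hub. `fs` is assumed to be an
    fsspec-style filesystem exposing `exists` and `open`.
    """
    existing = ""
    if fs.exists(path):
        # Text mode must be requested explicitly; the fsspec default is "rb".
        with fs.open(path, "r") as f:
            existing = f.read()
    combined = existing + text
    # Modes "a" and "ab" are not supported, so the whole file is rewritten.
    with fs.open(path, "w") as f:
        f.write(combined)
    return combined
```

A call might look like append_text(fs, "datasets/my-username/my-dataset-repo/data/validation.csv", "Great plot,good\n"). The read and the write are two separate operations, so this is not safe against concurrent writers, and every call re-uploads the full file.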

Integrations

HfFileSystem can be used with any library that integrates fsspec, provided the URL follows this scheme:

hf://[<repo_type_prefix>]<repo_id>[@<revision>]/<path/in/repo>

The repo_type_prefix is datasets/ for datasets and spaces/ for Spaces; models need no prefix in the URL.
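
To make the scheme concrete, here is a minimal sketch that assembles such a URL from its parts. build_hf_url and its parameter names are hypothetical and chosen only for illustration; the sketch also assumes the revision contains no "/" character, since such revisions may need extra escaping.

```python
def build_hf_url(repo_id, path_in_repo, repo_type="model", revision=None):
    """Assemble hf://[<repo_type_prefix>]<repo_id>[@<revision>]/<path/in/repo>.

    Hypothetical helper, not part of huggingface_hub.
    """
    # Models take no prefix; datasets and Spaces do.
    prefixes = {"model": "", "dataset": "datasets/", "space": "spaces/"}
    if repo_type not in prefixes:
        raise ValueError(f"Unknown repo_type: {repo_type!r}")
    url = f"hf://{prefixes[repo_type]}{repo_id}"
    if revision is not None:
        # Assumption: the revision has no "/" in it (e.g. a branch name like "dev").
        url += f"@{revision}"
    return f"{url}/{path_in_repo.lstrip('/')}"


print(build_hf_url("my-username/my-dataset-repo", "data/train.csv",
                   repo_type="dataset", revision="dev"))
# prints hf://datasets/my-username/my-dataset-repo@dev/data/train.csv
```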

Some interesting integrations where HfFileSystem simplifies interacting with the Hub are listed below:

Reading/writing a Pandas DataFrame from/to a Hub repository:

>>> import pandas as pd

>>> # Read a remote CSV file into a dataframe
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

>>> # Write a dataframe to a remote CSV file
>>> df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")

The same workflow also applies to Dask and Polars DataFrames.

Querying (remote) Hub files with DuckDB:

>>> from huggingface_hub import HfFileSystem
>>> import duckdb

>>> fs = HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> # Query a remote file and get the result back as a dataframe
>>> fs_query_file = "hf://datasets/my-username/my-dataset-repo/data_dir/data.parquet"
>>> df = duckdb.query(f"SELECT * FROM '{fs_query_file}' LIMIT 10").df()

Using the Hub as an array store with Zarr:

>>> import numpy as np
>>> import zarr

>>> embeddings = np.random.randn(50000, 1000).astype("float32")

>>> # Write an array to a repo
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="w") as root:
...    foo = root.create_group("embeddings")
...    foobar = foo.zeros('experiment_0', shape=(50000, 1000), chunks=(10000, 1000), dtype='f4')
...    foobar[:] = embeddings

>>> # Read an array from a repo
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="r") as root:
...    first_row = root["embeddings/experiment_0"][0]

Authentication

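In many cases, you must be logged in to a Hugging Face account to interact with the Hub. Refer to the Authentication section of the documentation to learn more about authentication methods on the Hub.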

You can also log in programmatically by passing your token as an argument to HfFileSystem:

>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem(token=token)

If you log in this way, be careful not to accidentally leak the token when sharing your source code!
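
One common way to keep a token out of source code is to read it from an environment variable at run time. The sketch below assumes the token is stored in HF_TOKEN; token_from_env is a hypothetical helper, not part of huggingface_hub.

```python
import os


def token_from_env(env=None, var="HF_TOKEN"):
    """Return the token stored in an environment variable, or fail loudly.

    Hypothetical helper, not part of huggingface_hub. `env` defaults to
    os.environ and can be replaced by any mapping for testing.
    """
    env = os.environ if env is None else env
    token = env.get(var)
    if not token:
        raise RuntimeError(f"Set the {var} environment variable before running this script.")
    return token


# Usage (requires huggingface_hub to be installed):
# from huggingface_hub import HfFileSystem
# fs = HfFileSystem(token=token_from_env())
```

With this approach the script can be shared as-is, because the token value never appears in the file.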
