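Quickstart

In this quickstart, you'll learn how to use the dataset viewer's REST API to:

Check whether a dataset on the Hub is functional.
Return the subsets and splits of a dataset.
Preview the first 100 rows of a dataset.
Download slices of rows of a dataset.
Search for a word in a dataset.
Filter rows based on a query string.
Access the dataset as Parquet files.
Get the dataset size (in number of rows or bytes).
Get statistics about the dataset.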

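API endpoints

Each feature is served through an endpoint, summarized in the table below:

Endpoint	Method	Description	Query parameters
/is-valid	GET	Check whether a specific dataset is valid.	`dataset`: name of the dataset
/splits	GET	Get the list of subsets and splits of a dataset.	`dataset`: name of the dataset
/first-rows	GET	Get the first rows of a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split
/rows	GET	Get a slice of rows of a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split; `offset`: offset of the slice; `length`: length of the slice (maximum 100)
/search	GET	Search text in a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split; `query`: text to search for
/filter	GET	Filter rows in a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split; `where`: filter query; `orderby`: order-by clause; `offset`: offset of the slice; `length`: length of the slice (maximum 100)
/parquet	GET	Get the list of Parquet files of a dataset.	`dataset`: name of the dataset
/size	GET	Get the size of a dataset.	`dataset`: name of the dataset
/statistics	GET	Get statistics about a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split
/croissant	GET	Get Croissant metadata about a dataset.	`dataset`: name of the dataset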

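There is no installation or setup required to use the dataset viewer API.

Sign up for a Hugging Face account if you don't already have one! You can use the dataset viewer API without a Hugging Face account, but you won't be able to access gated datasets such as CommonVoice and ImageNet unless you provide a user token, which you can find in your user settings.

Feel free to try out the API in Postman, ReDoc, or RapidAPI. This quickstart shows you how to query the endpoints programmatically.

The base URL of the REST API is: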

https://datasets-server.huggingface.co

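Private and gated datasets

For private and gated datasets, you need to provide your user token in the headers of your query. Otherwise, you'll get an error message asking you to retry with authentication.

The dataset viewer supports private datasets owned by a PRO user or an Enterprise Hub organization.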
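
As a minimal sketch of an authenticated query, the snippet below uses only the Python standard library. It assumes the token is sent as a standard `Authorization: Bearer` header, as is usual for the Hugging Face Hub. The helper names `build_request` and `fetch_json` are illustrative and not part of any official client.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import urlencode

API_URL = "https://datasets-server.huggingface.co"


def build_request(endpoint, params, token=None):
    """Build a GET request, adding a Bearer token header when one is given."""
    url = f"{API_URL}{endpoint}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(url, headers=headers)


def fetch_json(request):
    """Send the request (network call) and decode the JSON body.

    Assumption: error responses also carry a JSON body, like the
    {'error': ...} message shown in this section.
    """
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        return json.load(err)
```

With a real token in place of the placeholder, `fetch_json(build_request("/is-valid", {"dataset": "<gated dataset name>"}, token="hf_..."))` would return the decoded response. Without a token, the same call is expected to return the error payload instead.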

If you try to access a gated dataset without providing your user token, you'll see the following error:

print(data)
{'error': 'The dataset does not exist, or is not accessible without authentication (private or gated). Please check the spelling of the dataset name or retry with authentication.'}

Check dataset validity

To check whether a specific dataset is valid, for example Rotten Tomatoes, use the /is-valid endpoint.

This returns whether the dataset provides a preview (see /first-rows), the viewer (see /rows), search (see /search), filtering (see /filter), and statistics (see /statistics):

{ "preview": true, "viewer": true, "search": true, "filter": true, "statistics": true }
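
One way to make this request and read the flags from Python is sketched below, using only the standard library. The function names are illustrative, and `check_dataset` performs a live network call.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_URL = "https://datasets-server.huggingface.co"


def is_valid_url(dataset):
    """Return the /is-valid URL for a dataset name."""
    return f"{API_URL}/is-valid?{urlencode({'dataset': dataset})}"


def enabled_features(payload):
    """List the features that the response reports as available."""
    return [name for name, enabled in payload.items() if enabled is True]


def check_dataset(dataset):
    """Query /is-valid (network call) and return the enabled features."""
    with urllib.request.urlopen(is_valid_url(dataset)) as response:
        return enabled_features(json.load(response))
```

For the response shown above, `enabled_features` would list all five features.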

List configurations and splits

The /splits endpoint returns a JSON list of the splits in a dataset.

This returns the subsets and splits available in the dataset:

{
  "splits": [
    { "dataset": "cornell-movie-review-data/rotten_tomatoes", "config": "default", "split": "train" },
    {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "config": "default",
      "split": "validation"
    },
    { "dataset": "cornell-movie-review-data/rotten_tomatoes", "config": "default", "split": "test" }
  ],
  "pending": [],
  "failed": []
}
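
Since each entry pairs a subset ("config") with a split, it can be handy to group the splits by subset before calling the other endpoints. The sketch below shows one way to do that. The helper names are illustrative, and `list_splits` performs a live network call.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_URL = "https://datasets-server.huggingface.co"


def splits_by_subset(payload):
    """Group split names under their subset ("config") name."""
    grouped = {}
    for item in payload["splits"]:
        grouped.setdefault(item["config"], []).append(item["split"])
    return grouped


def list_splits(dataset):
    """Query /splits (network call) and group the result by subset."""
    url = f"{API_URL}/splits?{urlencode({'dataset': dataset})}"
    with urllib.request.urlopen(url) as response:
        return splits_by_subset(json.load(response))
```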

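Preview a dataset

The /first-rows endpoint returns a JSON list of the first 100 rows of a dataset. It also returns the types of the data features (the "column" data types). You should specify the dataset name, the subset name (which you can get from the /splits endpoint), and the split name of the dataset you'd like to preview.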

This returns the first 100 rows of the dataset:

{
  "dataset": "cornell-movie-review-data/rotten_tomatoes",
  "config": "default",
  "split": "train",
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": { "dtype": "string", "_type": "Value" }
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": { "names": ["neg", "pos"], "_type": "ClassLabel" }
    }
  ],
  "rows": [
    {
      "row_idx": 0,
      "row": {
        "text": "the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .",
        "label": 1
      },
      "truncated_cells": []
    },
    {
      "row_idx": 1,
      "row": {
        "text": "the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .",
        "label": 1
      },
      "truncated_cells": []
    },
    ...,
    ...
  ]
}
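
The `features` list is what lets a client interpret the raw cell values. As an illustration, the sketch below builds the /first-rows URL and replaces `ClassLabel` integers with their names. It assumes `payload` is the decoded JSON response and that every label is a valid index into `names`. The helper names are hypothetical.

```python
from urllib.parse import urlencode

API_URL = "https://datasets-server.huggingface.co"


def first_rows_url(dataset, config, split):
    """Return the /first-rows URL for one split of a dataset."""
    params = {"dataset": dataset, "config": config, "split": split}
    return f"{API_URL}/first-rows?{urlencode(params)}"


def label_names(payload, column):
    """Return the ClassLabel names of a column, or None for other types."""
    for feature in payload["features"]:
        if feature["name"] == column and feature["type"].get("_type") == "ClassLabel":
            return feature["type"]["names"]
    return None


def decode_rows(payload, column="label"):
    """Yield (row_idx, row) pairs with ClassLabel integers replaced by names."""
    names = label_names(payload, column)
    for item in payload["rows"]:
        row = dict(item["row"])  # copy so the payload itself is left untouched
        if names is not None:
            row[column] = names[row[column]]
        yield item["row_idx"], row
```

Applied to the response above, the first two rows would carry the label `pos` instead of `1`.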

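Download slices of a dataset

The /rows endpoint returns a JSON list of a slice of rows of a dataset at a given location (offset). It also returns the types of the data features (the "column" data types). You should specify the dataset name, the subset name (which you can get from the /splits endpoint), the split name, and the offset and length of the slice you'd like to download.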

You can download slices of at most 100 rows at a time.

The response looks like this:

{
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": { "dtype": "string", "_type": "Value" }
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": { "names": ["neg", "pos"], "_type": "ClassLabel" }
    }
  ],
  "rows": [
    {
      "row_idx": 150,
      "row": {
        "text": "enormously likable , partly because it is aware of its own grasp of the absurd .",
        "label": 1
      },
      "truncated_cells": []
    },
    {
      "row_idx": 151,
      "row": {
        "text": "here's a british flick gleefully unconcerned with plausibility , yet just as determined to entertain you .",
        "label": 1
      },
      "truncated_cells": []
    },
    ...,
    ...
  ],
  "num_rows_total": 8530,
  "num_rows_per_page": 100,
  "partial": false
}
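
Because each request is capped at 100 rows, reading a whole split means walking it in slices. The sketch below computes the `offset` and `length` pairs needed to cover `num_rows_total` rows and builds the matching /rows URLs. The function names are illustrative, and no request is sent.

```python
from urllib.parse import urlencode

API_URL = "https://datasets-server.huggingface.co"
MAX_LENGTH = 100  # the largest slice /rows serves per request


def rows_url(dataset, config, split, offset, length):
    """Return the /rows URL for one slice of a split."""
    params = {"dataset": dataset, "config": config, "split": split,
              "offset": offset, "length": length}
    return f"{API_URL}/rows?{urlencode(params)}"


def slice_bounds(num_rows_total, length=MAX_LENGTH):
    """Yield (offset, length) pairs that cover every row exactly once."""
    if not 1 <= length <= MAX_LENGTH:
        raise ValueError("length must be between 1 and 100")
    for offset in range(0, num_rows_total, length):
        # The last slice is shortened so it does not run past the end.
        yield offset, min(length, num_rows_total - offset)
```

For the train split above (`num_rows_total` of 8530), this yields 86 slices, the last one starting at offset 8500 with length 30.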

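Search text in a dataset

The /search endpoint returns a JSON list of a slice of the dataset rows that match a text query. The text is searched in columns of type string, even if the values are nested in a dictionary. It also returns the types of the data features (the "column" data types). The response format is the same as for the /rows endpoint. You should specify the dataset name, the subset name (which you can get from the /splits endpoint), the split name, and the query to search for in the text columns.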

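You can get slices of at most 100 rows at a time, and you can request other slices with the offset and length parameters, as with the /rows endpoint.

The response looks like this: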

{
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": { "dtype": "string", "_type": "Value" }
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": { "dtype": "int64", "_type": "Value" }
    }
  ],
  "rows": [
    {
      "row_idx": 9,
      "row": {
        "text": "take care of my cat offers a refreshingly different slice of asian cinema .",
        "label": 1
      },
      "truncated_cells": []
    },
    {
      "row_idx": 472,
      "row": {
        "text": "[ \" take care of my cat \" ] is an honestly nice little film that takes us on an examination of young adult life in urban south korea through the hearts and minds of the five principals .",
        "label": 1
      },
      "truncated_cells": []
    },
    ...,
    ...
  ],
  "num_rows_total": 12,
  "num_rows_per_page": 100,
  "partial": false
}
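
A search request differs from a /rows request only by its `query` parameter, which must be URL-encoded. The sketch below builds such a URL and pulls the matching texts out of a decoded response. The helper names and the example query are illustrative assumptions.

```python
from urllib.parse import urlencode

API_URL = "https://datasets-server.huggingface.co"


def search_url(dataset, config, split, query, offset=0, length=100):
    """Return the /search URL; urlencode escapes spaces and symbols in the query."""
    params = {"dataset": dataset, "config": config, "split": split,
              "query": query, "offset": offset, "length": length}
    return f"{API_URL}/search?{urlencode(params)}"


def matching_texts(payload, column="text"):
    """Return (row_idx, text) pairs for the rows in a search response."""
    return [(item["row_idx"], item["row"][column]) for item in payload["rows"]]
```

Note that `row_idx` is the position of the row in the full split, not its rank among the matches, which is why the indices in the response above are not contiguous.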

Access Parquet files

The dataset viewer converts every dataset on the Hub to the Parquet format. The /parquet endpoint returns a JSON list of the Parquet URLs for a dataset.

This returns a URL to the Parquet file for each split:

{
  "parquet_files": [
    {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "config": "default",
      "split": "test",
      "url": "https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet",
      "filename": "0000.parquet",
      "size": 92206
    },
    {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "config": "default",
      "split": "train",
      "url": "https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet",
      "filename": "0000.parquet",
      "size": 698845
    },
    {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "config": "default",
      "split": "validation",
      "url": "https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/validation/0000.parquet",
      "filename": "0000.parquet",
      "size": 90001
    }
  ],
  "pending": [],
  "failed": [],
  "partial": false
}
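
To pick out the files for one split, or to estimate how much data a download involves, the response can be filtered and summed as in the sketch below. It assumes `payload` is the decoded JSON response. The function names are illustrative.

```python
def parquet_urls(payload, split=None):
    """Return the Parquet file URLs, optionally restricted to one split."""
    return [
        entry["url"]
        for entry in payload["parquet_files"]
        if split is None or entry["split"] == split
    ]


def total_parquet_bytes(payload):
    """Sum the reported sizes (in bytes) of all listed Parquet files."""
    return sum(entry["size"] for entry in payload["parquet_files"])
```

For the response above, `total_parquet_bytes` gives 881052, the sum of the three reported file sizes.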

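Get the size of the dataset

The /size endpoint returns a JSON object with the size of the dataset (number of rows and size in bytes), as well as the size of each subset and split.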

This returns the size of the dataset, and of each subset and split:

{
  "size": {
    "dataset": {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "num_bytes_original_files": 487770,
      "num_bytes_parquet_files": 881052,
      "num_bytes_memory": 1345449,
      "num_rows": 10662
    },
    "configs": [
      {
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "num_bytes_original_files": 487770,
        "num_bytes_parquet_files": 881052,
        "num_bytes_memory": 1345449,
        "num_rows": 10662,
        "num_columns": 2
      }
    ],
    "splits": [
      {
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "split": "train",
        "num_bytes_parquet_files": 698845,
        "num_bytes_memory": 1074806,
        "num_rows": 8530,
        "num_columns": 2
      },
      {
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "split": "validation",
        "num_bytes_parquet_files": 90001,
        "num_bytes_memory": 134675,
        "num_rows": 1066,
        "num_columns": 2
      },
      {
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "split": "test",
        "num_bytes_parquet_files": 92206,
        "num_bytes_memory": 135968,
        "num_rows": 1066,
        "num_columns": 2
      }
    ]
  },
  "pending": [],
  "failed": [],
  "partial": false
}
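
The nested layout can be flattened into a per-split summary, and the split counts can be checked against the dataset total, as sketched below. It assumes `payload` is the decoded JSON response. The function names are illustrative.

```python
def rows_per_split(payload):
    """Map (subset, split) to the number of rows reported for that split."""
    return {
        (entry["config"], entry["split"]): entry["num_rows"]
        for entry in payload["size"]["splits"]
    }


def totals_match(payload):
    """Check that the per-split row counts add up to the dataset total."""
    per_split = sum(rows_per_split(payload).values())
    return per_split == payload["size"]["dataset"]["num_rows"]
```

In the response above, the three splits hold 8530, 1066, and 1066 rows, which add up to the reported dataset total of 10662.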
