Dataset viewer documentation
Get the number of rows and the size in bytes
This guide shows you how to use the dataset viewer's /size endpoint to retrieve a dataset's size programmatically. You can also try it out with ReDoc.
The /size endpoint accepts the dataset name as its query parameter:
Python
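import os

import requests

# Read the access token from the environment (HF_TOKEN is an assumed variable name).
API_TOKEN = os.environ.get("HF_TOKEN")
headers = {"Authorization": f"Bearer {API_TOKEN}"} if API_TOKEN else {}
API_URL = "https://datasets-server.huggingface.co/size?dataset=ibm/duorc"

def query():
    # Send a GET request to the /size endpoint and decode the JSON body.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()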
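The request above hard-codes the dataset name in the URL. As a minimal sketch, assuming you want to query several datasets, the URL can be built from the dataset name instead. The helper name `build_size_url` is hypothetical and not part of any Hugging Face library; only the base URL and the `dataset` query parameter come from this guide.

```python
from urllib.parse import urlencode

# Base URL taken from the request shown in this guide.
SIZE_ENDPOINT = "https://datasets-server.huggingface.co/size"


def build_size_url(dataset: str) -> str:
    """Return the /size URL for a dataset (hypothetical helper)."""
    # urlencode percent-encodes the "/" in names such as "ibm/duorc".
    return f"{SIZE_ENDPOINT}?{urlencode({'dataset': dataset})}"


print(build_size_url("ibm/duorc"))
# prints https://datasets-server.huggingface.co/size?dataset=ibm%2Fduorc
```

The encoded form `ibm%2Fduorc` is the standard percent-encoding of `ibm/duorc`, so it identifies the same dataset once the query string is decoded.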
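The endpoint response is a JSON object containing the size of the dataset, as well as the size of each of its subsets and splits. It reports the number of rows, the number of columns (where applicable), and the size in bytes for the different forms of the data: original files, size in memory (RAM), and auto-converted Parquet files. For example, the ibm/duorc dataset has 187,213 rows across all its subsets and splits; its original files total 58,710,973 bytes (about 59 MB), and it takes about 1.06 GB in memory.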
{
  "size":{
    "dataset":{
      "dataset":"ibm/duorc",
      "num_bytes_original_files":58710973,
      "num_bytes_parquet_files":58710973,
      "num_bytes_memory":1060742354,
      "num_rows":187213
    },
    "configs":[
      {
        "dataset":"ibm/duorc",
        "config":"ParaphraseRC",
        "num_bytes_original_files":37709127,
        "num_bytes_parquet_files":37709127,
        "num_bytes_memory":704394283,
        "num_rows":100972,
        "num_columns":7
      },
      {
        "dataset":"ibm/duorc",
        "config":"SelfRC",
        "num_bytes_original_files":21001846,
        "num_bytes_parquet_files":21001846,
        "num_bytes_memory":356348071,
        "num_rows":86241,
        "num_columns":7
      }
    ],
    "splits":[
      {
        "dataset":"ibm/duorc",
        "config":"ParaphraseRC",
        "split":"train",
        "num_bytes_parquet_files":26005668,
        "num_bytes_memory":494389683,
        "num_rows":69524,
        "num_columns":7
      },
      {
        "dataset":"ibm/duorc",
        "config":"ParaphraseRC",
        "split":"validation",
        "num_bytes_parquet_files":5566868,
        "num_bytes_memory":106733319,
        "num_rows":15591,
        "num_columns":7
      },
      {
        "dataset":"ibm/duorc",
        "config":"ParaphraseRC",
        "split":"test",
        "num_bytes_parquet_files":6136591,
        "num_bytes_memory":103271281,
        "num_rows":15857,
        "num_columns":7
      },
      {
        "dataset":"ibm/duorc",
        "config":"SelfRC",
        "split":"train",
        "num_bytes_parquet_files":14851720,
        "num_bytes_memory":248966361,
        "num_rows":60721,
        "num_columns":7
      },
      {
        "dataset":"ibm/duorc",
        "config":"SelfRC",
        "split":"validation",
        "num_bytes_parquet_files":3114390,
        "num_bytes_memory":56359392,
        "num_rows":12961,
        "num_columns":7
      },
      {
        "dataset":"ibm/duorc",
        "config":"SelfRC",
        "split":"test",
        "num_bytes_parquet_files":3035736,
        "num_bytes_memory":51022318,
        "num_rows":12559,
        "num_columns":7
      }
    ]
  },
  "pending":[],
  "failed":[],
  "partial":false
}
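To illustrate how the `splits` entries relate to the dataset-level total, here is a minimal sketch that sums the split row counts per subset. The `sample` dictionary is a trimmed copy of the example response (only the fields used are kept), and `rows_per_config` is a hypothetical helper name, not part of any library.

```python
# Trimmed copy of the example response: only the fields used below are kept.
sample = {
    "size": {
        "dataset": {"dataset": "ibm/duorc", "num_rows": 187213},
        "splits": [
            {"config": "ParaphraseRC", "split": "train", "num_rows": 69524},
            {"config": "ParaphraseRC", "split": "validation", "num_rows": 15591},
            {"config": "ParaphraseRC", "split": "test", "num_rows": 15857},
            {"config": "SelfRC", "split": "train", "num_rows": 60721},
            {"config": "SelfRC", "split": "validation", "num_rows": 12961},
            {"config": "SelfRC", "split": "test", "num_rows": 12559},
        ],
    },
    "partial": False,
}


def rows_per_config(payload: dict) -> dict:
    """Sum split-level row counts for each subset (config)."""
    totals = {}
    for split in payload["size"]["splits"]:
        config = split["config"]
        totals[config] = totals.get(config, 0) + split["num_rows"]
    return totals


totals = rows_per_config(sample)
print(totals)
# prints {'ParaphraseRC': 100972, 'SelfRC': 86241}

# The per-subset sums add up to the dataset-level row count.
print(sum(totals.values()) == sample["size"]["dataset"]["num_rows"])
# prints True
```

The same function should work on the full response returned by the endpoint, since it only reads the `config` and `num_rows` fields of each split.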
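If the response has partial: true, the actual size of the dataset could not be determined because the dataset is too large. In that case, the reported numbers of rows and bytes may be lower than the actual values.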
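One way to act on the `partial` flag is to present the reported row count as a lower bound when it is set. The following is a minimal sketch under that assumption; `describe_rows` is a hypothetical helper, and the payloads in the usage lines are hand-written stand-ins for a real response.

```python
def describe_rows(payload: dict) -> str:
    """Format the dataset-level row count, marking it as a lower bound if partial."""
    rows = payload["size"]["dataset"]["num_rows"]
    # When "partial" is true, the reported count may be lower than the real one.
    prefix = "at least " if payload.get("partial", False) else ""
    return f"{prefix}{rows:,} rows"


# Hand-written stand-ins for a complete and a partial response.
complete = {"size": {"dataset": {"num_rows": 187213}}, "partial": False}
truncated = {"size": {"dataset": {"num_rows": 187213}}, "partial": True}

print(describe_rows(complete))
# prints 187,213 rows
print(describe_rows(truncated))
# prints at least 187,213 rows
```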