Dataset viewer documentation
Get Croissant metadata
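The dataset viewer automatically generates metadata in Croissant format (JSON-LD) for every dataset on the Hugging Face Hub. The metadata lists the dataset's name, description, and URL, along with the distribution of the dataset as Parquet files, including the metadata of each column. Croissant metadata is available for every dataset that can be converted to Parquet format.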
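What is Croissant?
Croissant is a metadata format built on top of schema.org. It describes datasets used for machine learning so that they can be indexed, searched, and loaded programmatically.
Get the metadata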
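This guide shows you how to use the Hugging Face /croissant endpoint to retrieve the Croissant metadata associated with a dataset.
The /croissant endpoint takes the dataset name in the URL, for example for the ibm/duorc dataset: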
Python
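import requests

# Placeholder: replace with your own Hugging Face access token before running.
API_TOKEN = "<your Hugging Face access token>"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://huggingface.co/api/datasets/ibm/duorc/croissant"

def query():
    # Send the GET request and parse the JSON-LD response body.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()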
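The request above hard-codes one dataset. As a minimal sketch of the same call for an arbitrary dataset name, the helpers below build the URL from the pattern shown in the example and turn HTTP errors into a readable exception. The names `BASE_URL`, `croissant_url`, and `fetch_croissant` are hypothetical, the standard library is used in place of `requests` only to keep the sketch dependency-free, and treating the token as optional is an assumption that may not hold for private or gated datasets.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

# URL prefix copied from the ibm/duorc example; the constant name is hypothetical.
BASE_URL = "https://huggingface.co/api/datasets"


def croissant_url(dataset: str) -> str:
    """Build the /croissant URL for a dataset name such as 'ibm/duorc'."""
    return f"{BASE_URL}/{dataset.strip('/')}/croissant"


def fetch_croissant(dataset: str, token: Optional[str] = None, timeout: float = 30.0) -> dict:
    """Fetch and parse the Croissant JSON-LD for one dataset.

    Assumption: the token can be omitted for public datasets.
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = urllib.request.Request(croissant_url(dataset), headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        # Surface the status code instead of a bare traceback.
        raise RuntimeError(f"Request for {dataset!r} failed with HTTP {err.code}") from err


print(croissant_url("ibm/duorc"))
```

Calling `fetch_croissant("ibm/duorc", token=API_TOKEN)` should return the same dictionary as `query()` in the example above, provided the network request succeeds.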
Under the hood, it uses the https://datasets-server.huggingface.co/croissant-crumbs endpoint and enriches the result with Hub metadata.
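The endpoint response is a JSON-LD document containing the metadata in Croissant format. For example, the ibm/duorc dataset has two subsets, ParaphraseRC and SelfRC (see the List splits and subsets guide for more details about splits and subsets). The metadata links to their Parquet files and describes the type of each of the six columns: plot_id, plot, title, question_id, question, and no_answer.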
{
"@context": {
"@language": "en",
"@vocab": "https://schema.org/",
"citeAs": "cr:citeAs",
"column": "cr:column",
"conformsTo": "dct:conformsTo",
"cr": "http://mlcommons.org/croissant/",
"data": {
"@id": "cr:data",
"@type": "@json"
},
"dataBiases": "cr:dataBiases",
"dataCollection": "cr:dataCollection",
"dataType": {
"@id": "cr:dataType",
"@type": "@vocab"
},
"dct": "http://purl.org/dc/terms/",
"extract": "cr:extract",
"field": "cr:field",
"fileProperty": "cr:fileProperty",
"fileObject": "cr:fileObject",
"fileSet": "cr:fileSet",
"format": "cr:format",
"includes": "cr:includes",
"isLiveDataset": "cr:isLiveDataset",
"jsonPath": "cr:jsonPath",
"key": "cr:key",
"md5": "cr:md5",
"parentField": "cr:parentField",
"path": "cr:path",
"personalSensitiveInformation": "cr:personalSensitiveInformation",
"recordSet": "cr:recordSet",
"references": "cr:references",
"regex": "cr:regex",
"repeated": "cr:repeated",
"replace": "cr:replace",
"sc": "https://schema.org/",
"separator": "cr:separator",
"source": "cr:source",
"subField": "cr:subField",
"transform": "cr:transform"
},
"@type": "sc:Dataset",
"distribution": [
{
"@type": "cr:FileObject",
"@id": "repo",
"name": "repo",
"description": "The Hugging Face git repository.",
"contentUrl": "https://huggingface.co/datasets/ibm/duorc/tree/refs%2Fconvert%2Fparquet",
"encodingFormat": "git+https",
"sha256": "https://github.com/mlcommons/croissant/issues/80"
},
{
"@type": "cr:FileSet",
"@id": "parquet-files-for-config-ParaphraseRC",
"name": "parquet-files-for-config-ParaphraseRC",
"description": "The underlying Parquet files as converted by Hugging Face (see: https://huggingface.co/docs/dataset-viewer/parquet).",
"containedIn": {
"@id": "repo"
},
"encodingFormat": "application/x-parquet",
"includes": "ParaphraseRC/*/*.parquet"
},
{
"@type": "cr:FileSet",
"@id": "parquet-files-for-config-SelfRC",
"name": "parquet-files-for-config-SelfRC",
"description": "The underlying Parquet files as converted by Hugging Face (see: https://huggingface.co/docs/dataset-viewer/parquet).",
"containedIn": {
"@id": "repo"
},
"encodingFormat": "application/x-parquet",
"includes": "SelfRC/*/*.parquet"
}
],
"recordSet": [
{
"@type": "cr:RecordSet",
"@id": "ParaphraseRC",
"name": "ParaphraseRC",
"description": "ibm/duorc - 'ParaphraseRC' subset\n\nAdditional information:\n- 3 splits: train, validation, test\n- 1 skipped column: answers",
"field": [
{
"@type": "cr:Field",
"@id": "ParaphraseRC/plot_id",
"name": "ParaphraseRC/plot_id",
"description": "Column 'plot_id' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-ParaphraseRC"
},
"extract": {
"column": "plot_id"
}
}
},
{
"@type": "cr:Field",
"@id": "ParaphraseRC/plot",
"name": "ParaphraseRC/plot",
"description": "Column 'plot' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-ParaphraseRC"
},
"extract": {
"column": "plot"
}
}
},
{
"@type": "cr:Field",
"@id": "ParaphraseRC/title",
"name": "ParaphraseRC/title",
"description": "Column 'title' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-ParaphraseRC"
},
"extract": {
"column": "title"
}
}
},
{
"@type": "cr:Field",
"@id": "ParaphraseRC/question_id",
"name": "ParaphraseRC/question_id",
"description": "Column 'question_id' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-ParaphraseRC"
},
"extract": {
"column": "question_id"
}
}
},
{
"@type": "cr:Field",
"@id": "ParaphraseRC/question",
"name": "ParaphraseRC/question",
"description": "Column 'question' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-ParaphraseRC"
},
"extract": {
"column": "question"
}
}
},
{
"@type": "cr:Field",
"@id": "ParaphraseRC/no_answer",
"name": "ParaphraseRC/no_answer",
"description": "Column 'no_answer' from the Hugging Face parquet file.",
"dataType": "sc:Boolean",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-ParaphraseRC"
},
"extract": {
"column": "no_answer"
}
}
}
]
},
{
"@type": "cr:RecordSet",
"@id": "SelfRC",
"name": "SelfRC",
"description": "ibm/duorc - 'SelfRC' subset\n\nAdditional information:\n- 3 splits: train, validation, test\n- 1 skipped column: answers",
"field": [
{
"@type": "cr:Field",
"@id": "SelfRC/plot_id",
"name": "SelfRC/plot_id",
"description": "Column 'plot_id' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-SelfRC"
},
"extract": {
"column": "plot_id"
}
}
},
{
"@type": "cr:Field",
"@id": "SelfRC/plot",
"name": "SelfRC/plot",
"description": "Column 'plot' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-SelfRC"
},
"extract": {
"column": "plot"
}
}
},
{
"@type": "cr:Field",
"@id": "SelfRC/title",
"name": "SelfRC/title",
"description": "Column 'title' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-SelfRC"
},
"extract": {
"column": "title"
}
}
},
{
"@type": "cr:Field",
"@id": "SelfRC/question_id",
"name": "SelfRC/question_id",
"description": "Column 'question_id' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-SelfRC"
},
"extract": {
"column": "question_id"
}
}
},
{
"@type": "cr:Field",
"@id": "SelfRC/question",
"name": "SelfRC/question",
"description": "Column 'question' from the Hugging Face parquet file.",
"dataType": "sc:Text",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-SelfRC"
},
"extract": {
"column": "question"
}
}
},
{
"@type": "cr:Field",
"@id": "SelfRC/no_answer",
"name": "SelfRC/no_answer",
"description": "Column 'no_answer' from the Hugging Face parquet file.",
"dataType": "sc:Boolean",
"source": {
"fileSet": {
"@id": "parquet-files-for-config-SelfRC"
},
"extract": {
"column": "no_answer"
}
}
}
]
}
],
"name": "duorc",
"description": "\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for duorc\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nThe DuoRC dataset is an English language dataset of questions and answers gathered from crowdsourced AMT workers on Wikipedia and IMDb movie plots. The workers were given freedom to pick answer from the plots or synthesize their own answers. It contains two sub-datasets - SelfRC and ParaphraseRC. SelfRC dataset is built on Wikipedia movie plots solely. ParaphraseRC has questions written from Wikipedia movie plots and the… See the full description on the dataset page: https://huggingface.co/datasets/ibm/duorc.",
"alternateName": [
"ibm/duorc",
"DuoRC"
],
"creator": {
"@type": "Organization",
"name": "IBM",
"url": "https://huggingface.co/ibm"
},
"keywords": [
"question-answering",
"text2text-generation",
"abstractive-qa",
"extractive-qa",
"crowdsourced",
"crowdsourced",
"monolingual",
"100K<n<1M",
"10K<n<100K",
"original",
"English",
"mit",
"Croissant",
"arxiv:1804.07927",
"🇺🇸 Region: US"
],
"license": "https://choosealicense.com/licenses/mit/",
"sameAs": "https://duorc.github.io/",
"url": "https://huggingface.co/datasets/ibm/duorc"
}
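The response above is long, so it can help to reduce it to the parts that describe the columns. The following is a minimal sketch, not part of any official client, that walks the `recordSet` and `field` entries and maps each subset to its column names and data types. The function name `summarize_record_sets` is hypothetical, and `sample` is a hand-trimmed excerpt of the response shown above rather than a live API result.

```python
def summarize_record_sets(metadata: dict) -> dict:
    """Map each record set name to {column name: Croissant data type}."""
    summary = {}
    for record_set in metadata.get("recordSet", []):
        columns = {}
        for field in record_set.get("field", []):
            # Prefer the Parquet column name; fall back to the field name.
            column = field.get("source", {}).get("extract", {}).get("column")
            columns[column or field.get("name")] = field.get("dataType")
        summary[record_set["name"]] = columns
    return summary


# Hand-trimmed excerpt of the ibm/duorc response (two of the six SelfRC fields).
sample = {
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "SelfRC",
            "field": [
                {
                    "name": "SelfRC/plot_id",
                    "dataType": "sc:Text",
                    "source": {"extract": {"column": "plot_id"}},
                },
                {
                    "name": "SelfRC/no_answer",
                    "dataType": "sc:Boolean",
                    "source": {"extract": {"column": "no_answer"}},
                },
            ],
        }
    ]
}

print(summarize_record_sets(sample))
# {'SelfRC': {'plot_id': 'sc:Text', 'no_answer': 'sc:Boolean'}}
```

Passing the full response to the same function would yield one entry per subset, here ParaphraseRC and SelfRC, each with its six described columns.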
Load the dataset
To load the dataset, you can use the mlcroissant library. It provides a simple way to load datasets from Croissant metadata.
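As a minimal sketch of what that can look like, assuming `mlcroissant` is installed (for example with `pip install mlcroissant`) and that the metadata has already been fetched into a dictionary, the helper below reads the first few records of one record set. The function name `first_records` and its `limit` parameter are hypothetical; `Dataset(jsonld=...)` and `records(record_set=...)` are the mlcroissant calls this sketch relies on, and their behavior may differ between library versions.

```python
import itertools


def first_records(jsonld, record_set: str, limit: int = 5) -> list:
    """Return up to `limit` records from one record set of a Croissant description."""
    # Imported lazily so this module can be imported without mlcroissant installed.
    from mlcroissant import Dataset

    dataset = Dataset(jsonld=jsonld)
    # records() yields records lazily; islice stops after `limit` of them.
    return list(itertools.islice(dataset.records(record_set=record_set), limit))
```

With the `data` dictionary returned by the request earlier in this guide, `first_records(data, "SelfRC")` would be expected to return up to five records from the SelfRC subset. Reading records makes mlcroissant access the underlying Parquet files, so the call needs network access.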