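Explore statistics over split data

The dataset viewer provides a /statistics endpoint for fetching basic statistics precomputed for a requested dataset. This gives you a quick view of how the data is distributed.

Currently, statistics are computed only for datasets that have a Parquet export.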

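The /statistics endpoint requires three query parameters:

dataset: the dataset name, for example nyu-mll/glue
config: the subset name, for example cola
split: the split name, for example train

Let's get some statistics for the train split of the cola subset of the nyu-mll/glue dataset.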
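A minimal sketch of such a request, using only the Python standard library. The base URL https://datasets-server.huggingface.co is not stated in the text above and is an assumption here, as are the helper names build_statistics_url and fetch_statistics and the optional token argument for gated or private datasets.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL of the dataset viewer API (not given in the text above).
API_URL = "https://datasets-server.huggingface.co/statistics"


def build_statistics_url(dataset, config, split):
    """Build the /statistics URL from the three required query parameters."""
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{API_URL}?{query}"


def fetch_statistics(dataset, config, split, token=None):
    """Send the GET request and return the decoded JSON response as a dict.

    `token` is a hypothetical optional access token, sent as a Bearer header.
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = Request(build_statistics_url(dataset, config, split), headers=headers)
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Example call (requires network access):
# data = fetch_statistics("nyu-mll/glue", "cola", "train")
```

Building the URL in a separate function keeps the parameter encoding easy to inspect without touching the network.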

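The response JSON contains three keys:

num_examples - the number of examples in the split, or in the first chunk of the data if the dataset is larger than 5 GB (see the partial field below).
statistics - a list of dictionaries with per-column statistics. Each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for details.
partial - true if the statistics were computed on the first 5 GB of data rather than on the full split, false otherwise.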

{
  "num_examples": 8551,
  "statistics": [
    {
      "column_name": "idx",
      "column_type": "int",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 0,
        "max": 8550,
        "mean": 4275,
        "median": 4275,
        "std": 2468.60541,
        "histogram": {
          "hist": [
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            847
          ],
          "bin_edges": [
            0,
            856,
            1712,
            2568,
            3424,
            4280,
            5136,
            5992,
            6848,
            7704,
            8550
          ]
        }
      }
    },
    {
      "column_name": "label",
      "column_type": "class_label",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "no_label_count": 0,
        "no_label_proportion": 0,
        "n_unique": 2,
        "frequencies": {
          "unacceptable": 2528,
          "acceptable": 6023
        }
      }
    },
    {
      "column_name": "sentence",
      "column_type": "string_text",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 6,
        "max": 231,
        "mean": 40.70074,
        "median": 37,
        "std": 19.14431,
        "histogram": {
          "hist": [
            2260,
            4512,
            1262,
            380,
            102,
            26,
            6,
            1,
            1,
            1
          ],
          "bin_edges": [
            6,
            29,
            52,
            75,
            98,
            121,
            144,
            167,
            190,
            213,
            231
          ]
        }
      }
    }
  ],
  "partial": false
}
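As a rough illustration of how the three keys fit together, the sketch below turns a response like the one above into a few summary lines. The function name summarize_statistics is hypothetical, and the code assumes the response has already been decoded into a Python dict.

```python
def summarize_statistics(response):
    """Return one summary line for the split and one line per column."""
    # `partial` tells us whether only the first 5 GB were analysed.
    scope = "first 5 GB only" if response["partial"] else "full split"
    lines = [f"{response['num_examples']} examples ({scope})"]
    for column in response["statistics"]:
        stats = column["column_statistics"]
        # Every example on this page reports nan_count; .get() guards
        # against a column type that might not.
        nulls = stats.get("nan_count")
        lines.append(
            f"{column['column_name']} [{column['column_type']}]: {nulls} nulls"
        )
    return lines
```

Checking partial first matters because num_examples then refers to the analysed chunk, not to the whole split.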

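Response structure by data type

Currently, statistics are supported for strings, floats and integers, booleans, lists, datetimes, audio, and image data, as well as for the special datasets.ClassLabel feature type of the datasets library.

The column_type in the response can be one of the following values:

class_label - for the datasets.ClassLabel feature, which represents categorical data
float - for float data types
int - for integer data types
bool - for boolean data types
string_label - for string data types that are treated as categories (see below)
string_text - for string data types that do not represent categories (see below)
list - for lists of any other data types (including lists)
audio - for audio data
image - for image data
datetime - for datetime data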
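Because the meaning of column_statistics changes with column_type, client code usually branches on it. The sketch below is one hypothetical way to do that; the lookup table and the describe_column helper are illustrative names, not part of the API, and the quantities are taken from the per-type sections that follow.

```python
# What the numeric summary of each column_type describes, according to the
# per-type sections of this page. This table is illustrative only.
MEASURED_QUANTITY = {
    "float": "column values",
    "int": "column values",
    "datetime": "datetime values",
    "string_text": "text length in characters",
    "list": "list length",
    "audio": "audio duration",
    "image": "image width",
}

# Column types that report value counts instead of a numeric summary.
FREQUENCY_TYPES = {"class_label", "string_label", "bool"}


def describe_column(column):
    """Return a short description of what a column's statistics measure."""
    name = column["column_name"]
    column_type = column["column_type"]
    if column_type in FREQUENCY_TYPES:
        return f"{name}: value counts per label"
    quantity = MEASURED_QUANTITY.get(column_type)
    if quantity is None:
        return f"{name}: unknown column_type {column_type!r}"
    return f"{name}: distribution of {quantity}"
```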

class_label

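This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:

number and proportion of null values
number and proportion of values with no label
number of unique values (excluding null and no-label values)
value counts for each label (excluding null and no-label values)

Example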

{
  "column_name": "label",
  "column_type": "class_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "no_label_count": 0,
    "no_label_proportion": 0,
    "n_unique": 2,
    "frequencies": {
      "unacceptable": 2528,
      "acceptable": 6023
    }
  }
}

float

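The following measures are returned for float data types:

minimum, maximum, mean, median, and standard deviation
number and proportion of null and NaN values (NaN values are treated as null)
histogram with 10 bins

Example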

{
  "column_name": "clarity",
  "column_type": "float",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 0,
    "max": 2,
    "mean": 1.67206,
    "median": 1.8,
    "std": 0.38714,
    "histogram": {
      "hist": [
        17,
        12,
        48,
        52,
        135,
        188,
        814,
        15,
        1628,
        2048
      ],
      "bin_edges": [
        0,
        0.2,
        0.4,
        0.6,
        0.8,
        1,
        1.2,
        1.4,
        1.6,
        1.8,
        2
      ]
    }
  }
}
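In the histogram object, bin_edges has one more entry than hist, so each count sits between two consecutive edges. A small sketch of pairing them up follows; histogram_bins is a hypothetical helper name, and the code makes no claim about whether an edge value belongs to the bin on its left or its right.

```python
def histogram_bins(histogram):
    """Pair each count in `hist` with its lower and upper bin edge."""
    counts = histogram["hist"]
    edges = histogram["bin_edges"]
    if len(edges) != len(counts) + 1:
        raise ValueError("expected exactly one more bin edge than counts")
    # Bin i spans edges[i] .. edges[i + 1].
    return [(edges[i], edges[i + 1], counts[i]) for i in range(len(counts))]
```

The same pairing applies to every column type on this page that returns a histogram, including those with fewer than 10 bins.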

int

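The following measures are returned for integer data types:

minimum, maximum, mean, median, and standard deviation
number and proportion of null values
histogram with 10 or fewer bins

Example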

{
    "column_name": "direction",
    "column_type": "int",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 0,
        "max": 1,
        "mean": 0.49925,
        "median": 0.0,
        "std": 0.5,
        "histogram": {
            "hist": [
                50075,
                49925
            ],
            "bin_edges": [
                0,
                1,
                1
            ]
        }
    }
}

bool

The following measures are returned for bool data types:

number and proportion of null values
value counts for 'True' and 'False' values

Example

{
  "column_name": "penalty",
  "column_type": "bool",
  "column_statistics": {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {
      "False": 7,
      "True": 10
    }
  }
}
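The numbers in this example are consistent with nan_proportion being nan_count divided by the total number of rows, nulls included: 3 / (3 + 7 + 10) = 0.15. The sketch below recomputes it under that assumption; null_proportion is a hypothetical helper, and the formula is inferred from the example rather than stated by the API.

```python
def null_proportion(column_statistics):
    """Recompute the null proportion from nan_count and the value counts.

    Assumes the total row count is nulls plus all counted values.
    """
    nan_count = column_statistics["nan_count"]
    non_null = sum(column_statistics["frequencies"].values())
    total = nan_count + non_null
    return nan_count / total if total else 0.0
```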

string_label

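A string column is treated as a category if, within the requested split, the proportion of unique values is less than or equal to 0.2 and the number of unique values is less than 1000, or if the number of unique values is less than or equal to 10 (regardless of the proportion). The following measures are returned:

number and proportion of null values
number of unique values (excluding null values)
value counts for each label (excluding null values)

Example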

{
  "column_name": "answerKey",
  "column_type": "string_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "n_unique": 4,
    "frequencies": {
      "D": 1221,
      "C": 1146,
      "A": 1378,
      "B": 1212
    }
  }
}
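The rule that separates string_label from string_text can be written as a small predicate. This is a sketch of the rule as described on this page, not the viewer's actual implementation; the function name is hypothetical, and taking the number of values in the split as the denominator of the proportion is an assumption.

```python
def is_string_label(n_unique, n_values):
    """Return True if a string column would count as a category.

    n_unique: number of distinct values in the column
    n_values: number of values in the split (assumed denominator)
    """
    # At most 10 unique values: always a category, whatever the proportion.
    if n_unique <= 10:
        return True
    if n_values <= 0:
        return False
    # Otherwise both conditions must hold.
    return n_unique < 1000 and n_unique / n_values <= 0.2
```

For the answerKey column above, 4 unique values is already below the threshold of 10, so the proportion is never consulted.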

string_text

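If a string column does not meet the conditions for string_label, it is treated as a column containing text, and the response contains statistics over text lengths, measured in number of characters. The following measures are computed:

minimum, maximum, mean, median, and standard deviation of text lengths
number and proportion of null values
histogram of text lengths with 10 bins

Example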

{
  "column_name": "sentence",
  "column_type": "string_text",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 6,
    "max": 231,
    "mean": 40.70074,
    "median": 37,
    "std": 19.14431,
    "histogram": {
      "hist": [
        2260,
        4512,
        1262,
        380,
        102,
        26,
        6,
        1,
        1,
        1
      ],
      "bin_edges": [
        6,
        29,
        52,
        75,
        98,
        121,
        144,
        167,
        190,
        213,
        231
      ]
    }
  }
}

list

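For lists, the distribution of their lengths is computed. The following measures are returned:

minimum, maximum, mean, median, and standard deviation of list lengths
number and proportion of null values
histogram of list lengths with up to 10 bins

Example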

{
    "column_name": "chat_history",
    "column_type": "list",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 1,
        "max": 3,
        "mean": 1.01741,
        "median": 1.0,
        "std": 0.13146,
        "histogram": {
            "hist": [
                11177,
                196,
                1
            ],
            "bin_edges": [
                1,
                2,
                3,
                3
            ]
        }
    }
}

Note that dictionaries of lists are not supported.

audio

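For audio data, the distribution of audio file durations is computed. The following measures are returned:

minimum, maximum, mean, median, and standard deviation of audio file durations
number and proportion of null values
histogram of audio file durations with 10 bins

Example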

{
    "column_name": "audio",
    "column_type": "audio",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 1.02,
        "max": 15,
        "mean": 13.93042,
        "median": 14.77,
        "std": 2.63734,
        "histogram": {
            "hist": [
                32,
                25,
                18,
                24,
                22,
                17,
                18,
                19,
                55,
                1770
            ],
            "bin_edges": [
                1.02,
                2.418,
                3.816,
                5.214,
                6.612,
                8.01,
                9.408,
                10.806,
                12.204,
                13.602,
                15
            ]
        }
    }
}

image

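For image data, the distribution of image widths is computed. The following measures are returned:

minimum, maximum, mean, median, and standard deviation of image widths
number and proportion of null values
histogram of image widths with 10 bins

Example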

{
    "column_name": "image",
    "column_type": "image",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 256,
        "max": 873,
        "mean": 327.99339,
        "median": 341.0,
        "std": 60.07286,
        "histogram": {
            "hist": [
                1734,
                1637,
                1326,
                121,
                10,
                3,
                1,
                3,
                1,
                2
            ],
            "bin_edges": [
                256,
                318,
                380,
                442,
                504,
                566,
                628,
                690,
                752,
                814,
                873
            ]
        }
    }
}

datetime

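The distribution of datetimes is computed. The following measures are returned:

minimum, maximum, mean, median, and standard deviation of datetimes, represented as strings with precision up to seconds
number and proportion of null values
histogram of datetimes with 10 bins

Example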

{
    "column_name": "date",
    "column_type": "datetime",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": "2013-05-18 04:54:11",
        "max": "2013-06-20 10:01:41",
        "mean": "2013-05-27 18:03:39",
        "median": "2013-05-23 11:55:50",
        "std": "11 days, 4:57:32.322450",
        "histogram": {
            "hist": [
                318776,
                393036,
                173904,
                0,
                0,
                0,
                0,
                0,
                0,
                206284
            ],
            "bin_edges": [
                "2013-05-18 04:54:11",
                "2013-05-21 12:36:57",
                "2013-05-24 20:19:43",
                "2013-05-28 04:02:29",
                "2013-05-31 11:45:15",
                "2013-06-03 19:28:01",
                "2013-06-07 03:10:47",
                "2013-06-10 10:53:33",
                "2013-06-13 18:36:19",
                "2013-06-17 02:19:05",
                "2013-06-20 10:01:41"
            ]
        }
    }
}
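Since datetime statistics arrive as strings, a client may want to parse them before doing arithmetic. The sketch below assumes the "YYYY-MM-DD HH:MM:SS" layout seen in this example; the format string and both helper names are assumptions, and other columns (for instance timezone-aware ones) might need a different format. The std field is a duration string and is not parsed here.

```python
from datetime import datetime

# Layout inferred from the example above; an assumption, not a documented format.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_datetime_statistics(column_statistics):
    """Parse the min, max, mean, and median strings into datetime objects."""
    return {
        key: datetime.strptime(column_statistics[key], DATETIME_FORMAT)
        for key in ("min", "max", "mean", "median")
    }


def datetime_span(column_statistics):
    """Return the time covered by the column as a timedelta (max - min)."""
    parsed = parse_datetime_statistics(column_statistics)
    return parsed["max"] - parsed["min"]
```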
