Know your dataset

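There are two types of dataset objects: a regular Dataset and an ✨ IterableDataset ✨. A Dataset provides fast random access to rows and is memory-mapped, so even loading a large dataset uses a relatively small amount of device memory. For very large datasets that won't fit on disk or in memory, an IterableDataset lets you access and use the dataset without waiting for it to download completely!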

This tutorial shows you how to load and access a Dataset and an IterableDataset.

Dataset

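When you load a dataset split, you get a Dataset object. You can do many things with a Dataset object, so it is important to learn how to manipulate and interact with the data stored inside it.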

This tutorial uses the rotten_tomatoes dataset, but feel free to load any dataset you like and follow along!

>>> from datasets import load_dataset

>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

Indexing

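A Dataset contains columns of data, and each column can be a different type of data. The *index*, or axis label, is used to access examples from the dataset. For example, indexing by row returns a dictionary containing one example from the dataset: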

# Get the first row in the dataset
>>> dataset[0]
{'label': 1,
 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

Use a negative index (the - operator) to count from the end of the dataset:

# Get the last row in the dataset
>>> dataset[-1]
{'label': 0,
 'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .'}

Indexing by column name returns all the values in that column:

>>> dataset["text"]
['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
 'effective but too-tepid biopic',
 ...,
 'things really get weird , though not particularly scary : the movie is all portent and no content .']

You can combine row and column name indexing to return the value at a specific position:

>>> dataset[0]["text"]
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

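Indexing order doesn't matter. Indexing by the column name first returns a Column object, which you can then index with row indices as usual. Both orders return the same value in a comparable amount of time: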

>>> import time

>>> start_time = time.time()
>>> text = dataset[0]["text"]
>>> end_time = time.time()
>>> print(f"Elapsed time: {end_time - start_time:.4f} seconds")
Elapsed time: 0.0031 seconds

>>> start_time = time.time()
>>> text = dataset["text"][0]
>>> end_time = time.time()
>>> print(f"Elapsed time: {end_time - start_time:.4f} seconds")
Elapsed time: 0.0042 seconds

Slicing

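Slicing returns a slice, or subset, of the dataset, which is useful for viewing several rows at once. To slice a dataset, use the : operator to specify a range of positions. The result is a dictionary that maps each column name to a list of values.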

# Get the first three rows
>>> dataset[:3]
{'label': [1, 1, 1],
 'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic']}

# Get the rows at indices 3, 4 and 5 (the end index is exclusive)
>>> dataset[3:6]
{'label': [1, 1, 1],
 'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
  'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .']}

IterableDataset

When you set the streaming parameter to True in load_dataset(), an IterableDataset is loaded:

>>> from datasets import load_dataset

>>> iterable_dataset = load_dataset("ethz/food101", split="train", streaming=True)
>>> for example in iterable_dataset:
...     print(example)
...     break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}

You can also create an IterableDataset from an *existing* Dataset. This is faster than streaming mode because the dataset is streamed from local files:

>>> from datasets import load_dataset

>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset()

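An IterableDataset iterates over the dataset one example at a time, so you don't have to wait for the whole dataset to download before using it. As you can imagine, this is very useful for large datasets you want to use right away!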

Indexing

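An IterableDataset behaves differently from a regular Dataset. You cannot randomly access examples in an IterableDataset. Instead, iterate over its elements, for example by calling next(iter()) or by using a for loop to return the next item from the IterableDataset: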

>>> next(iter(iterable_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F59B50>,
 'label': 6}

>>> for example in iterable_dataset:
...     print(example)
...     break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DE82B0>, 'label': 6}

However, an IterableDataset supports column indexing, which returns an iterable over the column's values:

>>> next(iter(iterable_dataset["label"]))
6

Creating a subset

You can use IterableDataset.take() to return a subset of the dataset containing a specific number of examples:

# Get first three examples
>>> list(iterable_dataset.take(3))
[{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DEE9D0>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F7479DE8190>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x383 at 0x7F7479DE8310>,
  'label': 6}]

But unlike slicing, IterableDataset.take() creates a new IterableDataset.

Next steps

Interested in learning more about the differences between these two types of dataset? See the Differences between Dataset and IterableDataset conceptual guide.

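To get more hands-on with these dataset types, see the Process guide to learn how to preprocess a Dataset, or the Stream guide to learn how to preprocess an IterableDataset.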
