Big data? 🤗 Datasets to the rescue!

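Nowadays it is not uncommon to find yourself working with multi-gigabyte datasets, especially if you plan to pretrain a Transformer like BERT or GPT-2 from scratch. In these cases, even loading the data can be a challenge. For example, the WebText corpus used to pretrain GPT-2 consists of over 8 million documents and 40 GB of text; loading this into your laptop's RAM is likely to give it a heart attack!

Fortunately, 🤗 Datasets is designed to overcome these limitations. It frees you from memory management problems by treating datasets as memory-mapped files, and from hard drive limits by streaming the entries in a corpus.

In this section we'll explore these features of 🤗 Datasets with a huge 825 GB corpus known as the Pile. Let's get started!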

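What is the Pile?

The Pile is an English text corpus created by EleutherAI for training large-scale language models. It includes a diverse range of datasets, spanning scientific articles, GitHub code repositories, and filtered web text. The training corpus is available in 14 GB chunks, and you can also download several of the individual components. Let's start by taking a look at the PubMed Abstracts dataset, a corpus of abstracts from 15 million biomedical publications on PubMed. The dataset is in JSON Lines format and is compressed using the zstandard library, so first we need to install that: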

!pip install zstandard

Next, we can load the dataset using the method for remote files that we learned in section 2:

from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

Dataset({
    features: ['meta', 'text'],
    num_rows: 15518009
})

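We can see that there are 15,518,009 rows and 2 columns in our dataset. That's a lot!

✎ By default, 🤗 Datasets will decompress the files needed to load a dataset. If you want to preserve hard drive space, you can pass DownloadConfig(delete_extracted=True) to the download_config argument of load_dataset(). See the documentation for more details.

Let's inspect the contents of the first example: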

pubmed_dataset[0]

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age ...'}

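Okay, this looks like the abstract of a medical article. Now let's see how much RAM we've used to load the dataset!

The magic of memory mapping

A simple way to measure memory usage in Python is with the psutil library, which can be installed with pip as follows: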

!pip install psutil

It provides a Process class that allows us to check the memory usage of the current process as follows:

import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 5678.33 MB

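Here the rss attribute refers to the resident set size, which is the fraction of memory that a process occupies in RAM. This measurement also includes the memory used by the Python interpreter and the libraries we've loaded, so the actual amount of memory used to load the dataset is a bit smaller. For comparison, let's see how large the dataset is on disk, using the dataset_size attribute. Since the result is expressed in bytes like before, we need to manually convert it to gigabytes: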

print(f"Dataset size in bytes : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

Dataset size in bytes : 20979437051
Dataset size (cache file) : 19.54 GB

Nice: despite it being almost 20 GB large, we're able to load and access the dataset with much less RAM!
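As a quick sanity check on the numbers above, the arithmetic can be reproduced by hand. The short sketch below hard-codes the byte count and RSS value reported in the outputs of this section (on your machine they will differ). Note that dividing by 1024**3 strictly gives gibibytes, while dividing by 1000**3 gives decimal gigabytes.

```python
# Values copied from the outputs shown in this section
size_bytes = 20_979_437_051
rss_mb = 5678.33

# Binary (GiB) and decimal (GB) conversions of the same byte count
size_gib = size_bytes / 1024**3
size_gb_decimal = size_bytes / 1000**3

# Fraction of the on-disk size that the whole process occupied in RAM
ram_fraction = rss_mb / (size_bytes / 1024**2)

print(f"{size_gib:.2f} GiB, {size_gb_decimal:.2f} GB (decimal)")
print(f"RSS is {ram_fraction:.0%} of the dataset size")
```

Either way you count it, the whole process (interpreter and libraries included) occupied well under a third of the dataset's size on disk.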

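✏️ **Try it out!** Pick one of the subsets from the Pile that is larger than your laptop or desktop's RAM, load it with 🤗 Datasets, and measure the amount of RAM used. Note that to get an accurate measurement, you'll want to do this in a new process. You can find the decompressed sizes of each subset in Table 1 of the Pile paper.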

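If you're familiar with Pandas, this result might come as a surprise because of Wes McKinney's famous rule of thumb that you typically need 5 to 10 times as much RAM as the size of your dataset. So how does 🤗 Datasets solve this memory management problem? 🤗 Datasets treats each dataset as a memory-mapped file, which provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset without needing to fully load it into memory.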
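To get a feel for what memory mapping means, here is a minimal sketch that uses only Python's standard mmap module. It is not how 🤗 Datasets or Apache Arrow are implemented; the fixed-width record layout and the read_record() helper are assumptions made purely for illustration. The point is that we can read any record from a file by slicing the mapping, without first reading the whole file into a Python object.

```python
import mmap
import os
import tempfile

# Toy "dataset": fixed-width records written to a temporary file on disk.
# Each record is "record-" (7 bytes) + 8 digits + newline = 16 bytes.
RECORD_SIZE = 16
NUM_RECORDS = 1000

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    for i in range(NUM_RECORDS):
        f.write(f"record-{i:08d}\n".encode("ascii"))


def read_record(mm, index):
    """Return record `index` by slicing the mapping (hypothetical helper)."""
    start = index * RECORD_SIZE
    return mm[start : start + RECORD_SIZE].decode("ascii").rstrip("\n")


with open(path, "rb") as f:
    # Map the whole file; the OS pages data in only as it is touched
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = read_record(mm, 0)
        last = read_record(mm, NUM_RECORDS - 1)

os.remove(path)
print(first, last)
```

Random access into the mapping looks like indexing a bytes object, but the operating system decides which parts of the file actually sit in RAM at any moment.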

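Memory-mapped files can also be shared across multiple processes, which enables methods like Dataset.map() to be parallelized without needing to move or copy the dataset. Under the hood, these capabilities are all realized by the Apache Arrow memory format and the pyarrow library, which make the data loading and processing lightning fast. (For more details about Apache Arrow and comparisons to Pandas, check out Dejan Simic's blog post.) To see this in action, let's run a little speed test by iterating over all the elements in the PubMed Abstracts dataset: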

import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""

time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
    f"{time:.1f}s, i.e. {size_gb/time:.3f} GB/s"
)

Iterated over 15518009 examples (about 19.5 GB) in 64.2s, i.e. 0.304 GB/s

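Here we've used Python's timeit module to measure the execution time taken by code_snippet. You'll typically be able to iterate over a dataset at speeds of a few tenths of a GB/s to several GB/s. This works great for the vast majority of applications, but sometimes you'll have to work with a dataset that is too large to even store on your laptop's hard drive. For example, if we tried to download the Pile in its entirety, we'd need 825 GB of free disk space! To handle these cases, 🤗 Datasets provides a streaming feature that allows us to download and access elements on the fly, without needing to download the whole dataset. Let's take a look at how this works.

💡 In Jupyter notebooks you can also time cells using the %%timeit magic function.

Streaming datasets

To enable dataset streaming you just need to pass the streaming=True argument to the load_dataset() function. For example, let's load the PubMed Abstracts dataset again, but in streaming mode: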

pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

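Instead of the familiar Dataset that we've encountered elsewhere in this chapter, the object returned with streaming=True is an IterableDataset. As the name suggests, to access the elements of an IterableDataset we need to iterate over it. We can access the first element of our streamed dataset as follows: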

next(iter(pubmed_dataset_streamed))

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age ...'}
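If the idea of an iterable dataset feels abstract, the following pure-Python sketch may help. It is only an analogy, not the library's implementation: stream_jsonl() is a hypothetical helper, and gzip stands in for zstandard so that the example needs nothing beyond the standard library. The generator decompresses and parses one line at a time, so asking for the first record never touches the rest of the file.

```python
import gzip
import json
import os
import tempfile


def stream_jsonl(path):
    """Lazily yield one parsed record per line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


# Build a tiny compressed JSON Lines file to stream from (toy data)
with tempfile.NamedTemporaryFile(suffix=".jsonl.gz", delete=False) as tmp:
    path = tmp.name
with gzip.open(path, "wt", encoding="utf-8") as f:
    for i in range(3):
        f.write(json.dumps({"meta": {"id": i}, "text": f"document {i}"}) + "\n")

records = stream_jsonl(path)
first = next(records)  # only the first line is decompressed and parsed
records.close()  # stop iterating early and release the file handle
os.remove(path)
print(first)
```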

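The elements from a streamed dataset can be processed on the fly using IterableDataset.map(), which is useful during training if you need to tokenize the inputs. The process is exactly the same as the one we used to tokenize our dataset in Chapter 3, with the only difference being that outputs are returned one by one: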

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))

{'input_ids': [101, 4958, 5178, 4328, 6779, ...], 'attention_mask': [1, 1, 1, 1, 1, ...]}

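💡 To speed up tokenization with streaming you can pass batched=True, as we saw in the last section. It will process the examples batch by batch; the default batch size is 1,000 and can be specified with the batch_size argument.

You can also shuffle a streamed dataset using IterableDataset.shuffle(), but unlike Dataset.shuffle() this only shuffles the elements in a predefined buffer_size: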
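To see why batching a stream helps, here is a rough pure-Python sketch of the idea. The names batched_map() and toy_tokenize() are made up for this illustration and are not part of 🤗 Datasets; in the real library the function receives a dictionary of lists rather than a plain list. The function is called once per batch instead of once per example, while the results are still yielded one by one.

```python
from itertools import islice

batch_sizes = []  # records how many examples each call received


def batched_map(examples, fn, batch_size=1000):
    """Apply fn to lists of up to batch_size examples, yielding results one by one."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # fn sees a whole batch and must return one output per input example
        yield from fn(batch)


def toy_tokenize(batch):
    """Hypothetical stand-in for a tokenizer: split each text on whitespace."""
    batch_sizes.append(len(batch))
    return [text.split() for text in batch]


texts = (f"abstract number {i}" for i in range(2500))
tokenized = list(batched_map(texts, toy_tokenize, batch_size=1000))
print(len(tokenized), batch_sizes)
```

Here 2,500 examples result in just three calls to the function, which is where the speedup comes from when the function (such as a fast tokenizer) handles a whole batch efficiently.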

shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

{'meta': {'pmid': 11410799, 'language': 'eng'},
 'text': 'Randomized study of dose or schedule modification of granulocyte colony-stimulating factor in platinum-based chemotherapy for elderly patients with lung cancer ...'}

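In this example, we selected a random example from the first 10,000 examples in the buffer. Once an example is accessed, its spot in the buffer is filled with the next example in the corpus (i.e., the 10,001st example in the case above). You can also select elements from a streamed dataset using the IterableDataset.take() and IterableDataset.skip() functions, which act in a similar way to Dataset.select(). For example, to select the first 5 examples in the PubMed Abstracts dataset we can do the following: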

dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection ...'},
 {'meta': {'pmid': 11409575, 'language': 'eng'},
  'text': 'Clinical signs of hypoxaemia in children with acute lower respiratory infection: indicators of oxygen therapy ...'},
 {'meta': {'pmid': 11409576, 'language': 'eng'},
  'text': "Hypoxaemia in children with severe pneumonia in Papua New Guinea ..."},
 {'meta': {'pmid': 11409577, 'language': 'eng'},
  'text': 'Oxygen concentrators and cylinders ...'},
 {'meta': {'pmid': 11409578, 'language': 'eng'},
  'text': 'Oxygen supply in rural africa: a personal experience ...'}]

Similarly, you can use the IterableDataset.skip() function to create training and validation splits from a shuffled dataset as follows:

# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)
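To make the buffer-based shuffling described earlier more concrete, here is a small pure-Python sketch of the mechanism. It is written from the description in this section, not taken from the library's source, and buffered_shuffle() is a hypothetical name. A buffer is filled with the first buffer_size items; each step then yields a random buffer entry and refills that slot with the next item from the stream.

```python
import random
from itertools import islice


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    iterator = iter(iterable)
    # Fill the buffer with the first buffer_size items
    buffer = list(islice(iterator, buffer_size))
    for item in iterator:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]  # emit a random buffered item...
        buffer[idx] = item  # ...and refill its slot with the next one
    # The stream is exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10, seed=42))
print(shuffled[:10])
```

One consequence of this scheme is that the k-th output can only come from the first k + buffer_size inputs, so a small buffer gives only a local shuffle. This is why a larger buffer_size gives better mixing at the cost of more memory.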

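Let's round out our exploration of dataset streaming with a common application: combining multiple datasets together to create a single corpus. 🤗 Datasets provides an interleave_datasets() function that converts a list of IterableDataset objects into a single IterableDataset, where the elements of the new dataset are obtained by alternating among the source examples. This function is especially useful when you're trying to combine large datasets, so as an example let's stream the FreeLaw subset of the Pile, which is a 51 GB dataset of legal opinions from US courts: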

law_dataset_streamed = load_dataset(
    "json",
    data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
    split="train",
    streaming=True,
)
next(iter(law_dataset_streamed))

{'meta': {'case_ID': '110921.json',
  'case_jurisdiction': 'scotus.tar.gz',
  'date_created': '2010-04-28T17:12:49Z'},
 'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General...'}

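This dataset is large enough to stress the RAM of most laptops, yet we've been able to load and access it without breaking a sweat! Let's now combine the examples from the FreeLaw and PubMed Abstracts datasets with the interleave_datasets() function: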

from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection ...'},
 {'meta': {'case_ID': '110921.json',
   'case_jurisdiction': 'scotus.tar.gz',
   'date_created': '2010-04-28T17:12:49Z'},
  'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General...'}]

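Here we've used the islice() function from Python's itertools module to select the first two examples from the combined dataset, and we can see that they match the first examples from each of the two source datasets.

Finally, if you want to stream the Pile in its 825 GB entirety, you can grab all the prepared files as follows: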
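The alternating behavior can be pictured with a few lines of plain Python. This is a sketch of the idea only, not the implementation of interleave_datasets(): interleave() is a hypothetical helper, and stopping as soon as any source runs out is an assumption chosen here for simplicity.

```python
def interleave(*iterables):
    """Yield items from each source in turn; stop when any source is exhausted."""
    iterators = [iter(it) for it in iterables]
    while True:
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                # Assumed stopping rule: end as soon as one source runs dry
                return


d1 = [0, 1, 2]
d2 = [10, 11, 12, 13]
d3 = [20, 21, 22, 23, 24]
combined = list(interleave(d1, d2, d3))
print(combined)
```

Because items are pulled from each source only when needed, the same pattern works for streams far too large to hold in memory.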

base_url = "https://the-eye.eu/public/AI/pile/"
data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}
pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
next(iter(pile_dataset["train"]))

{'meta': {'pile_set_name': 'Pile-CC'},
 'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web...'}

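✏️ Try it out! Use one of the large Common Crawl corpora like mc4 or oscar to create a streaming multilingual dataset that represents the spoken proportions of languages in a country of your choice. For example, the four national languages in Switzerland are German, French, Italian, and Romansh, so you could try creating a Swiss corpus by sampling the Oscar subsets according to their spoken proportion.

You now have all the tools you need to load and process datasets of all shapes and sizes. Unless you're exceptionally lucky, though, there will come a point in your NLP journey where you'll have to actually create a dataset to solve the problem at hand. That's the topic of the next section!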
