Big data? 🤗 Datasets to the rescue!

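Nowadays it is not uncommon to find yourself working with multi-gigabyte datasets, especially if you're planning to pretrain a Transformer like BERT or GPT-2 from scratch. In these cases, even loading the data can be a challenge. For example, the WebText corpus used to pretrain GPT-2 consists of over 8 million documents and 40 GB of text; loading this into your laptop's RAM is likely to give it a heart attack!

Fortunately, 🤗 Datasets has been designed to overcome these limitations. It frees you from memory management problems by treating datasets as memory-mapped files, and from hard drive limits by streaming the entries in a corpus.

In this section we'll explore these features of 🤗 Datasets with a huge 825 GB corpus known as the Pile. Let's get started!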

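What is the Pile?

The Pile is an English text corpus that was created by EleutherAI for training large-scale language models. It includes a diverse range of datasets, spanning scientific articles, GitHub code repositories, and filtered web text. The training corpus is available in 14 GB chunks, and you can also download several of the individual components. Let's start by taking a look at the PubMed Abstracts dataset, which is a corpus of abstracts from 15 million biomedical publications on PubMed. The dataset is in JSON Lines format and is compressed using the zstandard library, so first we need to install that: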

!pip install zstandard

Next, we can load the dataset using the method for remote files that we learned in section 2:

from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

Dataset({
    features: ['meta', 'text'],
    num_rows: 15518009
})

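We can see that there are 15,518,009 rows and 2 columns in our dataset. That's a lot!

✎ By default, 🤗 Datasets will decompress the files needed to load a dataset. If you want to preserve hard drive space, you can pass DownloadConfig(delete_extracted=True) to the download_config argument of load_dataset(). See the documentation for more details.

Let's inspect the contents of the first example: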

pubmed_dataset[0]

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age ...'}

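Okay, this looks like the abstract from a medical article. Now let's see how much RAM we've used to load the dataset!

The magic of memory mapping

A simple way to measure memory usage in Python is with the psutil library, which can be installed with pip as follows: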

!pip install psutil

It provides a Process class that allows us to check the memory usage of the current process as follows:

import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 5678.33 MB

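Here the rss attribute refers to the resident set size, which is the portion of a process's memory that is held in RAM. This measurement also includes the memory used by the Python interpreter and the libraries we've loaded, so the actual amount of memory used to load the dataset is a bit smaller. For comparison, let's see how large the dataset is on disk, using the dataset_size attribute. Since the result is expressed in bytes like before, we need to manually convert it to gigabytes: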

print(f"Dataset size in bytes: {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

Dataset size in bytes: 20979437051
Dataset size (cache file) : 19.54 GB
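The conversion above divides by 1024**3, so strictly speaking the result is in gibibytes. As a small worked check (using only the byte count printed above; the variable names are ours), here is the same figure under both the binary and the decimal convention:

```python
# Size reported by dataset_size in the output above, in bytes.
size_in_bytes = 20_979_437_051

# Binary convention (1 GiB = 1024**3 bytes), as used in the snippet above.
size_gib = size_in_bytes / (1024**3)
# Decimal convention (1 GB = 1000**3 bytes).
size_gb_decimal = size_in_bytes / (1000**3)

print(f"{size_gib:.2f} GiB (binary) vs {size_gb_decimal:.2f} GB (decimal)")
```

Either way, the dataset is roughly 20 GB on disk.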

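Nice! Despite it being almost 20 GB large, we're able to load and access the dataset with much less RAM!

✏️ Try it out! Pick one of the subsets from the Pile that is larger than your laptop or desktop's RAM, load it with 🤗 Datasets, and measure the amount of RAM used. Note that to get an accurate measurement, you'll want to do this in a new process. You can find the decompressed sizes of each subset in Table 1 of the Pile paper.

If you're familiar with Pandas, this result might come as a surprise because of Wes McKinney's famous rule of thumb that you typically need 5 to 10 times as much RAM as the size of your dataset. So how does 🤗 Datasets solve this memory management problem? 🤗 Datasets treats each dataset as a memory-mapped file, which provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset without needing to fully load it into memory.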
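To get a feel for what memory mapping means, here is a minimal sketch that uses Python's built-in mmap module rather than Arrow. It is not how 🤗 Datasets is implemented; the file name, record layout, and read_record() helper are all hypothetical and only illustrate reading a slice of a file without loading all of it:

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a large on-disk dataset: 1,000 fixed-width records.
RECORD_SIZE = 16
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(f"record-{i:06d}".ljust(RECORD_SIZE).encode("ascii"))


def read_record(mapped, index):
    """Return one record by slicing the mapping at the record's byte offset."""
    start = index * RECORD_SIZE
    return mapped[start:start + RECORD_SIZE].decode("ascii").strip()


with open(path, "rb") as f:
    # A length of 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first = read_record(mapped, 0)
        last = read_record(mapped, 999)

print(first, last)
```

The mapping behaves like a bytes object, but the operating system pages data in from disk on demand, so random access does not require reading the file from the start.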

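Memory-mapped files can also be shared across multiple processes, which enables methods like Dataset.map() to be parallelized without needing to move or copy the dataset. Under the hood, these capabilities are all realized by the Apache Arrow memory format and the pyarrow library, which make data loading and processing lightning fast. (For more details about Apache Arrow and comparisons to Pandas, check out Dejan Simic's blog post.) To see this in action, let's run a little speed test by iterating over all the elements in the PubMed Abstracts dataset: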

import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""

time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
    f"{time:.1f}s, i.e. {size_gb/time:.3f} GB/s"
)

'Iterated over 15518009 examples (about 19.5 GB) in 64.2s, i.e. 0.304 GB/s'

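Here we've used Python's timeit module to measure the execution time taken by code_snippet. You'll typically be able to iterate over a dataset at speeds of a few tenths of a GB/s to several GB/s. This works great for the vast majority of applications, but sometimes you'll have to work with a dataset that is too large to even store on your laptop's hard drive. For example, if we tried to download the Pile in its entirety, we'd need 825 GB of free disk space! To handle these cases, 🤗 Datasets provides a streaming feature that allows us to download and access elements on the fly, without needing to download the whole dataset. Let's take a look at how this works.

💡 In Jupyter notebooks you can also time cells using the %%timeit magic function.

Streaming datasets

To enable dataset streaming you just need to pass the streaming=True argument to the load_dataset() function. For example, let's load the PubMed Abstracts dataset again, but in streaming mode: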

pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

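Instead of the familiar Dataset that we've encountered elsewhere in this chapter, the object returned with streaming=True is an IterableDataset. As the name suggests, to access the elements of an IterableDataset we need to iterate over it. We can access the first element of our streamed dataset as follows: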

next(iter(pubmed_dataset_streamed))

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age ...'}
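If the idea of an iterable dataset is new to you, the following sketch may help. It uses a plain Python generator to read a JSON Lines file lazily, one record at a time. The tiny file and the stream_jsonl() helper are hypothetical stand-ins, not part of 🤗 Datasets, and there is no compression or network access involved:

```python
import json
import os
import tempfile

# Hypothetical miniature JSON Lines file standing in for a large corpus.
path = os.path.join(tempfile.mkdtemp(), "tiny_corpus.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for pmid in range(3):
        f.write(json.dumps({"meta": {"pmid": pmid}, "text": f"abstract {pmid}"}) + "\n")


def stream_jsonl(file_path):
    """Yield one parsed record at a time instead of loading the whole file."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


records = stream_jsonl(path)
first_record = next(iter(records))
print(first_record)
```

As with the streamed dataset above, nothing is read until we ask for an element, and there is no way to index into the stream directly.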

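The elements from a streamed dataset can be processed on the fly using IterableDataset.map(), which is useful during training if you need to tokenize the inputs. The process is exactly the same as the one we used to tokenize our dataset in Chapter 3, with the only difference being that outputs are returned one by one: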

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))

{'input_ids': [101, 4958, 5178, 4328, 6779, ...], 'attention_mask': [1, 1, 1, 1, 1, ...]}

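💡 To speed up tokenization with streaming you can pass batched=True, as we saw in the last section. It will process the examples batch by batch; the default batch size is 1,000 and can be specified with the batch_size argument.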
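To illustrate why batching helps, here is a rough pure-Python sketch of applying a function to a stream in batches. The batched_map() and toy_tokenize() names are hypothetical, and the sketch simplifies one detail: 🤗 Datasets passes each batch as a dictionary of column names to lists, whereas here a batch is just a list of examples:

```python
from itertools import islice


def batched_map(examples, function, batch_size=1000):
    """Apply `function` to lists of up to `batch_size` examples, yielding results one by one."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # `function` receives a whole batch and returns one output per input.
        yield from function(batch)


def toy_tokenize(batch):
    """Hypothetical stand-in for a tokenizer: whitespace-split each text."""
    return [{"tokens": example["text"].split()} for example in batch]


stream = ({"text": f"example number {i}"} for i in range(5))
tokenized = list(batched_map(stream, toy_tokenize, batch_size=2))
print(tokenized[0])
```

The function is called once per batch instead of once per example, which is where the speedup comes from when the function itself (such as a fast tokenizer) can process many inputs at once.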

You can also shuffle a streamed dataset using IterableDataset.shuffle(), but unlike Dataset.shuffle() this only shuffles the elements in a predefined buffer_size:

shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

{'meta': {'pmid': 11410799, 'language': 'eng'},
 'text': 'Randomized study of dose or schedule modification of granulocyte colony-stimulating factor in platinum-based chemotherapy for elderly patients with lung cancer ...'}
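To build some intuition for shuffling with a fixed-size buffer, here is a minimal pure-Python sketch. The buffered_shuffle() function is our own illustration of the general technique, not the library's implementation, and details such as how the buffer is drained at the end are assumptions:

```python
import random


def buffered_shuffle(examples, buffer_size, seed=None):
    """Yield examples in approximately shuffled order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    iterator = iter(examples)
    buffer = []
    # Fill the buffer with the first `buffer_size` examples.
    for example in iterator:
        buffer.append(example)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered example and refill its slot with the next one.
    for example in iterator:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        buffer[index] = example
    # The source is exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
print(shuffled)
```

Note that the first element returned can only come from the first buffer_size elements of the stream, so a small buffer gives a weaker shuffle than a large one.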

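In the shuffle() example above, we selected a random example from the first 10,000 examples in the buffer. Once an example is accessed, its spot in the buffer is filled with the next example in the corpus (that is, the 10,001st example in the case above). You can also select elements from a streamed dataset using the IterableDataset.take() and IterableDataset.skip() functions, which act in a similar way to Dataset.select(). For example, to select the first 5 examples in the PubMed Abstracts dataset we can do the following: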

dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection ...'},
 {'meta': {'pmid': 11409575, 'language': 'eng'},
  'text': 'Clinical signs of hypoxaemia in children with acute lower respiratory infection: indicators of oxygen therapy ...'},
 {'meta': {'pmid': 11409576, 'language': 'eng'},
  'text': "Hypoxaemia in children with severe pneumonia in Papua New Guinea ..."},
 {'meta': {'pmid': 11409577, 'language': 'eng'},
  'text': 'Oxygen concentrators and cylinders ...'},
 {'meta': {'pmid': 11409578, 'language': 'eng'},
  'text': 'Oxygen supply in rural africa: a personal experience ...'}]

Similarly, you can use the IterableDataset.skip() function to create training and validation splits from a shuffled dataset as follows:

# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)
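The same take/skip logic can be sketched on an ordinary Python iterator with itertools.islice(). The take(), skip(), and make_stream() helpers below are hypothetical and only mirror the idea; the point is that the two splits stay disjoint as long as the stream yields its elements in the same order each time it is iterated:

```python
from itertools import islice


def take(examples, n):
    """Lazily yield only the first `n` examples."""
    return islice(examples, n)


def skip(examples, n):
    """Lazily yield everything after the first `n` examples."""
    return islice(examples, n, None)


def make_stream():
    """Hypothetical stream of 10 examples; each call restarts from the beginning."""
    return ({"id": i} for i in range(10))


validation = list(take(make_stream(), 3))
train = list(skip(make_stream(), 3))
print(len(validation), len(train))
```

Both helpers are lazy, so skipping still has to read past the skipped elements before the first training example is produced.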

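Let's round out our exploration of dataset streaming with a common application: combining multiple datasets together to create a single corpus. 🤗 Datasets provides an interleave_datasets() function that converts a list of IterableDataset objects into a single IterableDataset, where the elements of the new dataset are obtained by alternating among the source examples. This function is especially useful when you're trying to combine large datasets, so as an example let's stream the FreeLaw subset of the Pile, which is a 51 GB dataset of legal opinions from US courts: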

law_dataset_streamed = load_dataset(
    "json",
    data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
    split="train",
    streaming=True,
)
next(iter(law_dataset_streamed))

{'meta': {'case_ID': '110921.json',
  'case_jurisdiction': 'scotus.tar.gz',
  'date_created': '2010-04-28T17:12:49Z'},
 'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General...'}

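This dataset is large enough to stress the RAM of most laptops, yet we've been able to load and access it without breaking a sweat! Let's now combine the examples from the FreeLaw and PubMed Abstracts datasets with the interleave_datasets() function: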

from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection ...'},
 {'meta': {'case_ID': '110921.json',
   'case_jurisdiction': 'scotus.tar.gz',
   'date_created': '2010-04-28T17:12:49Z'},
  'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General...'}]

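Here we've used the islice() function from Python's itertools module to select the first two examples from the combined dataset, and we can see that they match the first examples from each of the two source datasets.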
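The alternating behavior can be sketched with plain Python generators. The interleave() function below is a hypothetical illustration of round-robin interleaving that stops once one source runs out; the exact stopping behavior of interleave_datasets() may differ, and the toy records are made up:

```python
def interleave(*streams):
    """Alternate between streams, stopping as soon as any one of them runs out."""
    iterators = [iter(stream) for stream in streams]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                return


# Hypothetical toy streams standing in for the two streamed datasets.
pubmed_like = ({"source": "pubmed", "id": i} for i in range(2))
law_like = ({"source": "law", "id": i} for i in range(2))

combined = list(interleave(pubmed_like, law_like))
print([example["source"] for example in combined])
```

Because each source is only advanced when its turn comes, the combined stream never needs more than one example from each source in memory at a time.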

Finally, if you want to stream the Pile in its 825 GB entirety, you can grab all the prepared files as follows:

base_url = "https://the-eye.eu/public/AI/pile/"
data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}
pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
next(iter(pile_dataset["train"]))

{'meta': {'pile_set_name': 'Pile-CC'},
 'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web...'}
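If you want to double-check which training shards the list comprehension above refers to before starting a long streaming job, you can build the list on its own. This sketch makes no network request and only reuses the base URL and naming pattern shown above:

```python
base_url = "https://the-eye.eu/public/AI/pile/"
# The :02d format spec zero-pads the shard index to two digits (00, 01, ..., 29).
train_files = [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)]

print(len(train_files))
print(train_files[0])
print(train_files[-1])
```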

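✏️ Try it out! Use one of the large Common Crawl corpora like mc4 or oscar to create a streaming multilingual dataset that represents the spoken proportions of languages in a country of your choice. For example, the four national languages in Switzerland are German, French, Italian, and Romansh, so you could try creating a Swiss corpus by sampling the Oscar subsets according to their spoken proportion.

You now have all the tools you need to load and process datasets of all shapes and sizes, but unless you're exceptionally lucky, there will come a point in your NLP journey where you'll have to actually create a dataset to solve the problem at hand. That's the topic of the next section!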
