Load and explore an audio dataset

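In this course we will use the 🤗 Datasets library to work with audio datasets. 🤗 Datasets is an open-source library for downloading and preparing datasets from all modalities, including audio. The library offers easy access to an unparalleled selection of machine learning datasets publicly available on the Hugging Face Hub. In addition, 🤗 Datasets includes several features tailored to audio datasets that simplify working with them for both researchers and practitioners.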

To begin working with audio datasets, make sure you have the 🤗 Datasets library installed:

pip install datasets[audio]

One of the key features of 🤗 Datasets is the ability to download and prepare a dataset in a single line of Python code using the load_dataset() function.

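Let's load and explore an audio dataset called MINDS-14, which contains recordings of people asking an e-banking system questions in several languages and dialects.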

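To load the MINDS-14 dataset, we need to copy the dataset's identifier on the Hub (PolyAI/minds14) and pass it to the load_dataset function. We'll also specify that we are only interested in the Australian subset (en-AU) of the data, and limit it to the training split: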

from datasets import load_dataset

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds

Output:

Dataset(
    {
        features: [
            "path",
            "audio",
            "transcription",
            "english_transcription",
            "intent_class",
            "lang_id",
        ],
        num_rows: 654,
    }
)

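The dataset contains 654 audio files, each of which is accompanied by a transcription, an English translation, and a label indicating the intent behind the person's query. The audio column contains the raw audio data. Let's take a closer look at one of the examples: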

example = minds[0]
example

Output:

{
    "path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
    "audio": {
        "path": "/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-AU~PAY_BILL/response_4.wav",
        "array": array(
            [0.0, 0.00024414, -0.00024414, ..., -0.00024414, 0.00024414, 0.0012207],
            dtype=float32,
        ),
        "sampling_rate": 8000,
    },
    "transcription": "I would like to pay my electricity bill using my card can you please assist",
    "english_transcription": "I would like to pay my electricity bill using my card can you please assist",
    "intent_class": 13,
    "lang_id": 2,
}

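You may notice that the audio column contains several features. They are:

path: the path to the audio file (*.wav in this case).
array: the decoded audio data, represented as a 1-dimensional NumPy array.
sampling_rate: the sampling rate of the audio file (8,000 Hz in this example).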
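
As a small illustration of how array and sampling_rate relate, the sketch below computes a clip's duration as the number of samples divided by the sampling rate. The audio dictionary here is a synthetic stand-in shaped like the audio feature above (a plain list of zeros instead of a NumPy array), and the helper name duration_seconds is hypothetical, not part of 🤗 Datasets.

```python
# Minimal sketch: the values below are synthetic stand-ins, not real MINDS-14 data.


def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Return the clip length in seconds: samples divided by samples per second."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


# Hypothetical decoded audio entry, shaped like the "audio" feature above.
audio = {
    "path": "example.wav",    # placeholder path, not a real file
    "array": [0.0] * 16000,   # 16,000 silent samples
    "sampling_rate": 8000,    # 8 kHz, the rate reported for this example
}

print(duration_seconds(len(audio["array"]), audio["sampling_rate"]))  # 2.0
```

With the real dataset, the same calculation should apply to example["audio"]["array"] and example["audio"]["sampling_rate"].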

intent_class is the classification category of the audio recording. To convert this number into a meaningful string, we can use the int2str() method:

id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

Output:

"pay_bill"
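
Once integer labels can be mapped to strings, one natural exploration step is to count how many recordings fall into each intent. The sketch below does this with the standard library only. The label names and integer labels are toy values chosen for illustration (they do not match the real MINDS-14 label indices), and count_labels is a hypothetical helper.

```python
from collections import Counter

# Hypothetical label names and integer labels. In the real dataset the names
# come from minds.features["intent_class"] and the ids from the intent_class column.
label_names = ["balance", "freeze", "pay_bill"]
intent_ids = [2, 0, 2, 1, 2, 0]


def count_labels(ids, names):
    """Count examples per class and return a {label string: count} mapping."""
    counts = Counter(ids)
    # Iterate over sorted ids so the result is ordered by class index.
    return {names[i]: counts[i] for i in sorted(counts)}


label_counts = count_labels(intent_ids, label_names)
print(label_counts)  # {'balance': 2, 'freeze': 1, 'pay_bill': 3}
```

With the dataset loaded as above, the same idea would presumably work by counting the values of the intent_class column and converting each id with id2label.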

If you look at the transcription feature, you can see that the audio file indeed recorded a person asking a question about paying a bill.

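If you plan to train an audio classifier on this subset of the data, you may not need all of the features. For example, lang_id has the same value for all examples, so it is not useful. english_transcription likely duplicates transcription in this subset, so we can safely remove both.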
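
Before dropping columns, it can be worth checking these assumptions rather than taking them on faith. The sketch below uses a toy table of hypothetical rows (not real MINDS-14 data) to test whether a column is constant and whether one column duplicates another. The helpers is_constant and is_duplicate are illustrative names, not 🤗 Datasets functions.

```python
def is_constant(values):
    """True if every value in the column is identical (or the column is empty)."""
    return len(set(values)) <= 1


def is_duplicate(column_a, column_b):
    """True if both columns hold the same values in the same order."""
    return list(column_a) == list(column_b)


# Hypothetical columns standing in for a single-language subset.
columns = {
    "lang_id": [2, 2, 2],
    "transcription": ["pay my bill", "check my balance", "freeze my card"],
    "english_transcription": ["pay my bill", "check my balance", "freeze my card"],
}

print(is_constant(columns["lang_id"]))  # True
print(is_duplicate(columns["transcription"], columns["english_transcription"]))  # True
```

On the real dataset, the column values could be obtained with something like minds["lang_id"] and passed to the same checks.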

You can easily remove irrelevant features using the remove_columns method of 🤗 Datasets:

columns_to_remove = ["lang_id", "english_transcription"]
minds = minds.remove_columns(columns_to_remove)
minds

Output:

Dataset({features: ["path", "audio", "transcription", "intent_class"], num_rows: 654})

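Now that we have loaded and inspected the raw contents of the dataset, let's listen to a few examples! We'll use the Blocks and Audio features from Gradio to decode a few random samples from the dataset: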

import gradio as gr


def generate_audio():
    example = minds.shuffle()[0]
    audio = example["audio"]
    return (
        audio["sampling_rate"],
        audio["array"],
    ), id2label(example["intent_class"])


with gr.Blocks() as demo:
    with gr.Column():
        for _ in range(4):
            audio, label = generate_audio()
            output = gr.Audio(audio, label=label)

demo.launch(debug=True)

If you'd like to, you can also visualize some of the examples. Let's plot the waveform of the first example.

import librosa
import matplotlib.pyplot as plt
import librosa.display

array = example["audio"]["array"]
sampling_rate = example["audio"]["sampling_rate"]

plt.figure().set_figwidth(12)
librosa.display.waveshow(array, sr=sampling_rate)

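Try it out! Download another dialect or language of the MINDS-14 dataset, then listen to and visualize some examples to get a sense of the variation across the whole dataset. You can find the full list of available languages on the PolyAI/minds14 dataset page on the Hub.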
