Transcribe a meeting

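In this final section, we'll use the Whisper model to generate a transcription for a conversation or meeting between two or more speakers. We'll then pair it with a speaker diarization model to predict "who spoke when". By matching the timestamps from the Whisper transcription with the timestamps from the speaker diarization model, we can predict an end-to-end meeting transcription with fully formatted start/end times for each speaker. This is a basic version of the meeting transcription services you might have seen online from the likes of Otter.ai.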

Speaker Diarization

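Speaker diarization (or diarisation) is the task of taking an unlabelled audio input and predicting "who spoke when". In doing so, we can predict start/end timestamps for each speaker turn, corresponding to when each speaker starts speaking and when they finish.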

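🤗 Transformers does not currently include a model for speaker diarization, but there are checkpoints on the Hub that can be used with relative ease. In this example, we'll use the pre-trained speaker diarization model from pyannote.audio. Let's get started and pip install the package: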

pip install --upgrade pyannote.audio

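Great! The weights for this model are hosted on the Hugging Face Hub. To access them, we first have to agree to the speaker diarization model's terms of use: pyannote/speaker-diarization. After that, we also have to agree to the segmentation model's terms of use: pyannote/segmentation.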

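Once complete, we can load the pre-trained speaker diarization pipeline locally on our device. Note that passing use_auth_token=True assumes you are already logged in to the Hub with an access token on this machine: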

from pyannote.audio import Pipeline

diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1", use_auth_token=True
)

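Let's try it out on a sample audio file! For this, we'll load a sample of the LibriSpeech ASR dataset that consists of two different speakers concatenated together to give a single audio file: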

from datasets import load_dataset

concatenated_librispeech = load_dataset(
    "sanchit-gandhi/concatenated_librispeech", split="train", streaming=True
)
sample = next(iter(concatenated_librispeech))

We can listen to the audio to see what it sounds like:

from IPython.display import Audio

Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

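Cool! We can clearly hear two different speakers, with a transition roughly 15 seconds in. Let's pass this audio file to the diarization model to get the speaker start/end times. Note that pyannote.audio expects the audio input to be a PyTorch tensor of shape (channels, seq_len), so we need to perform this conversion before running the model: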

import torch

input_tensor = torch.from_numpy(sample["audio"]["array"][None, :]).float()
outputs = diarization_pipeline(
    {"waveform": input_tensor, "sample_rate": sample["audio"]["sampling_rate"]}
)

outputs.for_json()["content"]

[{'segment': {'start': 0.4978125, 'end': 14.520937500000002},
  'track': 'B',
  'label': 'SPEAKER_01'},
 {'segment': {'start': 15.364687500000002, 'end': 21.3721875},
  'track': 'A',
  'label': 'SPEAKER_00'}]

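This looks pretty good! We can see that the first speaker is predicted as speaking up until the 14.5 second mark, and the second speaker from 15.4 seconds onwards. Now we need to get our transcription!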
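Before moving on, a small sketch can make the structure of this output concrete. It works on a hard-coded copy of the list shown above, so it runs without pyannote.audio installed. The helper name `speaking_time_per_speaker` is hypothetical and not part of pyannote.audio; it simply sums the duration of every turn for each speaker label.

```python
from collections import defaultdict

# Hard-coded copy of the diarization output shown above.
diarization_segments = [
    {
        "segment": {"start": 0.4978125, "end": 14.520937500000002},
        "track": "B",
        "label": "SPEAKER_01",
    },
    {
        "segment": {"start": 15.364687500000002, "end": 21.3721875},
        "track": "A",
        "label": "SPEAKER_00",
    },
]


def speaking_time_per_speaker(segments):
    """Sum the duration (in seconds) of all turns for each speaker label.

    Hypothetical helper for illustration; not part of pyannote.audio.
    """
    totals = defaultdict(float)
    for entry in segments:
        start = entry["segment"]["start"]
        end = entry["segment"]["end"]
        totals[entry["label"]] += end - start
    return dict(totals)


totals = speaking_time_per_speaker(diarization_segments)
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.1f} s")
```

On a longer meeting with many turns per speaker, the same loop would accumulate each speaker's total talk time.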

Speech transcription

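For the third time in this Unit, we'll use the Whisper model for our speech transcription system. Specifically, we'll load the Whisper Base checkpoint, since it's small enough to give good inference speed with reasonable transcription accuracy. As before, feel free to use any speech recognition checkpoint on the Hub, including Wav2Vec2, MMS ASR or other Whisper checkpoints: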

from transformers import pipeline

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
)

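Let's get the transcription for our sample audio, returning the segment-level timestamps as well so that we know the start/end times for each segment. You'll remember from Unit 5 that we need to pass the argument return_timestamps=True to activate the timestamp prediction task for Whisper: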

asr_pipeline(
    sample["audio"].copy(),
    generate_kwargs={"max_new_tokens": 256},
    return_timestamps=True,
)

{
    "text": " The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight. He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.",
    "chunks": [
        {"timestamp": (0.0, 3.56), "text": " The second and importance is as follows."},
        {
            "timestamp": (3.56, 7.84),
            "text": " Sovereignty may be defined to be the right of making laws.",
        },
        {
            "timestamp": (7.84, 13.88),
            "text": " In France, the king really exercises a portion of the sovereign power, since the laws have",
        },
        {"timestamp": (13.88, 15.48), "text": " no weight."},
        {
            "timestamp": (15.48, 19.44),
            "text": " He was in a favored state of mind, owing to the blight his wife's action threatened to",
        },
        {"timestamp": (19.44, 21.28), "text": " cast upon his entire future."},
    ],
}

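Alright! We see that each segment of the transcript has a start and end time, with the speakers changing at the 15.48 second mark. We can now pair this transcription with the speaker timestamps that we got from our diarization model to get our final transcription.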

Speechbox

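To get the final transcription, we'll align the timestamps from the diarization model with those from the Whisper model. The diarization model predicted the first speaker to end at 14.5 seconds and the second speaker to start at 15.4 seconds, whereas Whisper predicted segment boundaries around the speaker change at 13.88, 15.48 and 19.44 seconds. Since the timestamps from Whisper don't match perfectly with those from the diarization model, we need to find which of these boundaries are closest to 14.5 and 15.4 seconds, and segment the transcription by speaker accordingly. Specifically, we'll find the closest alignment between diarization and transcription timestamps by minimising the absolute distance between the two.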
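The minimum-distance idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Speechbox source: the function name `align_chunks_to_speakers` is hypothetical, the inputs are hard-coded copies of the outputs shown earlier, and each speaker turn is assumed to last until the next speaker starts talking (so the boundary for the first speaker is 15.4 seconds rather than 14.5).

```python
# Whisper chunks as (start, end, text), copied from the transcription output above.
asr_chunks = [
    (0.0, 3.56, " The second and importance is as follows."),
    (3.56, 7.84, " Sovereignty may be defined to be the right of making laws."),
    (7.84, 13.88, " In France, the king really exercises a portion of the sovereign power, since the laws have"),
    (13.88, 15.48, " no weight."),
    (15.48, 19.44, " He was in a favored state of mind, owing to the blight his wife's action threatened to"),
    (19.44, 21.28, " cast upon his entire future."),
]

# Speaker turns as (label, start, end), copied from the diarization output above.
speaker_turns = [
    ("SPEAKER_01", 0.4978125, 14.520937500000002),
    ("SPEAKER_00", 15.364687500000002, 21.3721875),
]


def align_chunks_to_speakers(chunks, turns):
    """Assign ASR chunks to speakers by nearest end-timestamp matching.

    Hypothetical helper for illustration only.
    """
    aligned = []
    first = 0  # index of the first chunk not yet assigned to a speaker
    for i, (label, _start, end) in enumerate(turns):
        if first >= len(chunks):
            break  # no transcript left to assign
        # Assumption: a turn lasts until the next speaker starts talking.
        boundary = turns[i + 1][1] if i + 1 < len(turns) else end
        # Pick the remaining chunk whose end time is closest to the boundary.
        last = min(
            range(first, len(chunks)),
            key=lambda j: abs(chunks[j][1] - boundary),
        )
        group = chunks[first : last + 1]
        aligned.append(
            {
                "speaker": label,
                "text": "".join(text for _, _, text in group),
                "timestamp": (group[0][0], group[-1][1]),
            }
        )
        first = last + 1
    return aligned


aligned = align_chunks_to_speakers(asr_chunks, speaker_turns)
for segment in aligned:
    print(segment["speaker"], segment["timestamp"])
```

On this sample, the sketch assigns the first four Whisper chunks to SPEAKER_01 and the last two to SPEAKER_00, because 15.48 is the chunk boundary closest to the 15.4 second speaker change.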

Luckily for us, we can use the 🤗 Speechbox package to perform this alignment. First, let's pip install speechbox from main:

pip install git+https://github.com/huggingface/speechbox

We can now instantiate our combined diarization plus transcription pipeline by passing the diarization model and the ASR model to the ASRDiarizationPipeline class:

from speechbox import ASRDiarizationPipeline

pipeline = ASRDiarizationPipeline(
    asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline
)

You can also instantiate the ASRDiarizationPipeline directly from pre-trained models by specifying the model ID of an ASR model on the Hub:

pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-base")

Let's pass the audio file to the composite pipeline and see what we get out:

pipeline(sample["audio"].copy())

[{'speaker': 'SPEAKER_01',
  'text': ' The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight.',
  'timestamp': (0.0, 15.48)},
 {'speaker': 'SPEAKER_00',
  'text': " He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.",
  'timestamp': (15.48, 21.28)}]

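Excellent! The first speaker is segmented as speaking from 0 to 15.48 seconds, and the second speaker from 15.48 to 21.28 seconds, with the corresponding transcription for each.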

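We can format the timestamps a little more nicely by defining two helper functions. The first converts a tuple of timestamps to a string, rounded to a set number of decimal places. The second combines the speaker ID, timestamp and text information onto one line, and splits each speaker onto their own line for ease of reading: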

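def tuple_to_string(start_end_tuple, ndigits=1):
    """Render a (start, end) tuple as a string, rounded to `ndigits` decimals."""
    return str((round(start_end_tuple[0], ndigits), round(start_end_tuple[1], ndigits)))


def format_as_transcription(raw_segments):
    """Join speaker, timestamps and text per turn, one blank line between turns."""
    return "\n\n".join(
        [
            chunk["speaker"] + " " + tuple_to_string(chunk["timestamp"]) + chunk["text"]
            for chunk in raw_segments
        ]
    )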

Let's re-run the pipeline, this time formatting the transcription with the functions we've just defined:

outputs = pipeline(sample["audio"].copy())

print(format_as_transcription(outputs))

SPEAKER_01 (0.0, 15.5) The second and importance is as follows. Sovereignty may be defined to be the right of making laws.
In France, the king really exercises a portion of the sovereign power, since the laws have no weight.

SPEAKER_00 (15.5, 21.3) He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon
his entire future.
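For longer meetings, timestamps in raw seconds become hard to read. As an optional variation on the helpers above, the following sketch prints clock-style MM:SS times instead. The names `seconds_to_clock` and `format_with_clock_times` are hypothetical, and the input is a hard-coded copy of the pipeline output shown earlier, so the block runs on its own.

```python
def seconds_to_clock(seconds):
    """Convert a time in seconds to an MM:SS string, rounded to the nearest second."""
    total = int(round(seconds))
    minutes, secs = divmod(total, 60)
    return f"{minutes:02d}:{secs:02d}"


def format_with_clock_times(raw_segments):
    """Like format_as_transcription, but with MM:SS timestamps (illustrative only)."""
    lines = []
    for chunk in raw_segments:
        start, end = chunk["timestamp"]
        span = f"[{seconds_to_clock(start)} - {seconds_to_clock(end)}]"
        lines.append(f"{chunk['speaker']} {span}{chunk['text']}")
    return "\n\n".join(lines)


# Hard-coded copy of the combined pipeline output shown earlier in this section.
example_segments = [
    {
        "speaker": "SPEAKER_01",
        "text": " The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight.",
        "timestamp": (0.0, 15.48),
    },
    {
        "speaker": "SPEAKER_00",
        "text": " He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.",
        "timestamp": (15.48, 21.28),
    },
]

formatted = format_with_clock_times(example_segments)
print(formatted)
```

Rounding to whole seconds loses the sub-second precision of the original tuples, which is usually acceptable for a human-readable transcript but not for further alignment work.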

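There we go! With that, we've both diarized and transcribed our input audio and returned speaker-segmented transcriptions. While the minimum-distance algorithm for aligning the diarization timestamps with the transcription timestamps is simple, it works well in practice. If you want to explore more advanced methods for combining the timestamps, the source code for the ASRDiarizationPipeline is a good place to start: speechbox/diarize.py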
