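Video classification

Video classification is the task of assigning a single label or class to an entire video. Each video is expected to belong to exactly one class. A video classification model takes a video as input and returns a prediction of the class the video belongs to. Such models can be used to categorize what a video is about. One practical application is action/activity recognition, which is useful for fitness applications. It can also help visually impaired people, especially when they are commuting.

This guide will show you how to:

Fine-tune VideoMAE on a subset of the UCF101 dataset.
Use your fine-tuned model for inference.

To see all architectures and checkpoints compatible with this task, check the task page.

Before you begin, make sure you have all the necessary libraries installed: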

pip install -q pytorchvideo transformers evaluate

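You will use PyTorchVideo (the `pytorchvideo` package) to process and prepare the videos.

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: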

>>> from huggingface_hub import notebook_login

>>> notebook_login()

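Load the UCF101 dataset

Start by loading a subset of the UCF-101 dataset. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.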

>>> from huggingface_hub import hf_hub_download

>>> hf_dataset_identifier = "sayakpaul/ucf101-subset"
>>> filename = "UCF101_subset.tar.gz"
>>> file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset")

Once the subset has been downloaded, extract the compressed archive:

>>> import tarfile

>>> with tarfile.open(file_path) as t:
...      t.extractall(".")

Overall, the dataset is organized as follows:

UCF101_subset/
    train/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
    val/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
    test/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...

You can then count the total number of videos.

>>> import pathlib
>>> dataset_root_path = "UCF101_subset"
>>> dataset_root_path = pathlib.Path(dataset_root_path)

>>> video_count_train = len(list(dataset_root_path.glob("train/*/*.avi")))
>>> video_count_val = len(list(dataset_root_path.glob("val/*/*.avi")))
>>> video_count_test = len(list(dataset_root_path.glob("test/*/*.avi")))
>>> video_total = video_count_train + video_count_val + video_count_test
>>> print(f"Total videos: {video_total}")

>>> all_video_file_paths = (
...     list(dataset_root_path.glob("train/*/*.avi"))
...     + list(dataset_root_path.glob("val/*/*.avi"))
...     + list(dataset_root_path.glob("test/*/*.avi"))
...  )
>>> all_video_file_paths[:5]

The (sorted) video paths look like this:

...
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c02.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c06.avi'
...

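You will notice that some video clips belong to the same group/scene, where the group is denoted by g in the file path, for example v_ApplyEyeMakeup_g07_c04.avi and v_ApplyEyeMakeup_g07_c06.avi.

For the validation and evaluation splits, you don't want clips from a group/scene that also appears in the training split, as that would leak data. The subset used in this tutorial takes this into account.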
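If you build your own splits, one way to check for this kind of leakage is to parse the group id out of each file name and compare the sets across splits. The following is a minimal sketch that assumes the UCF-101 naming pattern `v_<Class>_gNN_cNN.avi`; the helper names `group_key` and `shared_groups` and the validation path are hypothetical and only serve as illustration.

```python
import re
from pathlib import Path

# Assumed UCF-101 file name pattern: v_<ClassName>_g<group>_c<clip>.avi
_NAME_PATTERN = re.compile(r"v_(?P<label>.+)_g(?P<group>\d+)_c(?P<clip>\d+)$")


def group_key(path):
    """Return (class name, group id) parsed from a UCF-101 style file name."""
    match = _NAME_PATTERN.match(Path(path).stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return match["label"], match["group"]


def shared_groups(split_a, split_b):
    """Return the (class, group) pairs that occur in both lists of paths."""
    return {group_key(p) for p in split_a} & {group_key(p) for p in split_b}


train_paths = [
    "UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi",
    "UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi",
]
# Hypothetical validation path, used only to show what a leak looks like.
leaky_val_paths = ["UCF101_subset/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c01.avi"]

print(shared_groups(train_paths, leaky_val_paths))
```

An empty set means the two splits share no group. Here the made-up validation clip reuses group 07 of ApplyEyeMakeup, so the check reports it.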

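Next, derive the set of labels present in the dataset. Also create two dictionaries that will be useful when initializing the model:

label2id: maps class names to integers.
id2label: maps integers to class names.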

>>> class_labels = sorted({str(path).split("/")[2] for path in all_video_file_paths})
>>> label2id = {label: i for i, label in enumerate(class_labels)}
>>> id2label = {i: label for label, i in label2id.items()}

>>> print(f"Unique classes: {list(label2id.keys())}.")

# Unique classes: ['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress'].

There are 10 unique classes. Each class has 30 videos in the training set.
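To confirm the per-class counts yourself, you can tally the files under each class folder. The sketch below assumes the `<split>/<class>/<file>.avi` layout described earlier; `count_videos_per_class` is a hypothetical helper, and the temporary directory is only a stand-in so the example runs without the dataset.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_videos_per_class(split_dir, pattern="*/*.avi"):
    """Count video files per class folder under <split_dir>/<class>/<file>."""
    return Counter(path.parent.name for path in Path(split_dir).glob(pattern))


# Build a tiny stand-in directory tree so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    for label, n_videos in {"Archery": 3, "BandMarching": 2}.items():
        class_dir = Path(tmp) / "train" / label
        class_dir.mkdir(parents=True)
        for i in range(n_videos):
            (class_dir / f"video_{i}.avi").touch()

    counts = count_videos_per_class(Path(tmp) / "train")

print(dict(sorted(counts.items())))
```

On the real data, calling `count_videos_per_class(dataset_root_path / "train")` should report the per-class totals for the training split.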

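Load a model to fine-tune

Instantiate a video classification model from a pretrained checkpoint and its associated image processor. The model's encoder comes with pretrained parameters, while the classification head is randomly initialized. The image processor will come in handy when writing the preprocessing pipeline for the dataset.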

>>> from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

>>> model_ckpt = "MCG-NJU/videomae-base"
>>> image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)
>>> model = VideoMAEForVideoClassification.from_pretrained(
...     model_ckpt,
...     label2id=label2id,
...     id2label=id2label,
...     ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
... )

While the model is loading, you might notice the following warning:

Some weights of the model checkpoint at MCG-NJU/videomae-base were not used when initializing VideoMAEForVideoClassification: [..., 'decoder.decoder_layers.1.attention.output.dense.bias', 'decoder.decoder_layers.2.attention.attention.key.weight']
- This IS expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of VideoMAEForVideoClassification were not initialized from the model checkpoint at MCG-NJU/videomae-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

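The warning tells us that we are throwing away some weights (here, the weights and biases of the pretraining decoder layers) and randomly initializing some others (the weights and bias of a new classifier layer). This is expected in this case, because we are adding a new head for which we have no pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

Note that this checkpoint leads to better performance on this task, as it was obtained by fine-tuning on a similar downstream task with considerable domain overlap. You can check out this checkpoint, which was obtained by fine-tuning MCG-NJU/videomae-base-finetuned-kinetics.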

Prepare the datasets for training

To preprocess the videos, you will use the PyTorchVideo library. Start by importing the required dependencies.

>>> import pytorchvideo.data

>>> from pytorchvideo.transforms import (
...     ApplyTransformToKey,
...     Normalize,
...     RandomShortSideScale,
...     RemoveKey,
...     ShortSideScale,
...     UniformTemporalSubsample,
... )

>>> from torchvision.transforms import (
...     Compose,
...     Lambda,
...     RandomCrop,
...     RandomHorizontalFlip,
...     Resize,
... )

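For the training dataset transformations, use a combination of uniform temporal subsampling, pixel normalization, random cropping, and random horizontal flipping. For the validation and evaluation dataset transformations, keep the same chain except for the random cropping and horizontal flipping. To learn more about these transformations, see the official PyTorchVideo documentation.

Use the image_processor associated with the pretrained model to obtain the following information:

The image mean and standard deviation with which the video frame pixels will be normalized.
The spatial resolution to which the video frames will be resized.

Start by defining some constants.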

>>> mean = image_processor.image_mean
>>> std = image_processor.image_std
>>> if "shortest_edge" in image_processor.size:
...     height = width = image_processor.size["shortest_edge"]
... else:
...     height = image_processor.size["height"]
...     width = image_processor.size["width"]
>>> resize_to = (height, width)

>>> num_frames_to_sample = model.config.num_frames
>>> sample_rate = 4
>>> fps = 30
>>> clip_duration = num_frames_to_sample * sample_rate / fps
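As a worked example of the clip duration formula above: sampling `num_frames_to_sample` frames while keeping one frame out of every `sample_rate` spans `num_frames_to_sample * sample_rate` source frames, which at `fps` frames per second gives the duration in seconds. The sketch below assumes `model.config.num_frames` is 16; check your own model config, since that value is not stated in this guide. `clip_duration_seconds` is a hypothetical helper name.

```python
def clip_duration_seconds(num_frames, sample_rate, fps):
    """Seconds of source video needed to draw `num_frames` frames,
    keeping one frame out of every `sample_rate` at `fps` frames per second."""
    return num_frames * sample_rate / fps


# 16 is an assumed value for model.config.num_frames; check your own config.
duration = clip_duration_seconds(num_frames=16, sample_rate=4, fps=30)
print(f"{duration:.3f} s")  # 2.133 s
```

Under that assumption, each sampled clip covers a little over two seconds of video.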

Now define the dataset-specific transformations and the datasets themselves. Start with the training set:

>>> train_transform = Compose(
...     [
...         ApplyTransformToKey(
...             key="video",
...             transform=Compose(
...                 [
...                     UniformTemporalSubsample(num_frames_to_sample),
...                     Lambda(lambda x: x / 255.0),
...                     Normalize(mean, std),
...                     RandomShortSideScale(min_size=256, max_size=320),
...                     RandomCrop(resize_to),
...                     RandomHorizontalFlip(p=0.5),
...                 ]
...             ),
...         ),
...     ]
... )

>>> import os

>>> train_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "train"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration),
...     decode_audio=False,
...     transform=train_transform,
... )

The same workflow can be applied to the validation and evaluation sets:

>>> val_transform = Compose(
...     [
...         ApplyTransformToKey(
...             key="video",
...             transform=Compose(
...                 [
...                     UniformTemporalSubsample(num_frames_to_sample),
...                     Lambda(lambda x: x / 255.0),
...                     Normalize(mean, std),
...                     Resize(resize_to),
...                 ]
...             ),
...         ),
...     ]
... )

>>> val_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "val"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
...     decode_audio=False,
...     transform=val_transform,
... )

>>> test_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "test"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
...     decode_audio=False,
...     transform=val_transform,
... )

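**Note**: The dataset pipelines above are taken from the official PyTorchVideo example. We use the pytorchvideo.data.Ucf101() function because it is tailored to the UCF-101 dataset. Under the hood, it returns a pytorchvideo.data.labeled_video_dataset.LabeledVideoDataset object. The LabeledVideoDataset class is the base class for all things video in PyTorchVideo datasets. So, if you want to use a custom dataset that PyTorchVideo does not support directly, you can extend the LabeledVideoDataset class accordingly. Refer to the data API documentation to learn more. Also, if your dataset follows a similar structure (as shown above), using pytorchvideo.data.Ucf101() should work just fine.

You can access the num_videos attribute to find out the number of videos in a dataset.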

>>> print(train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos)
# (300, 30, 75)

Visualize the preprocessed video for better debugging

>>> import imageio
>>> import numpy as np
>>> from IPython.display import Image

>>> def unnormalize_img(img):
...     """Un-normalizes the image pixels."""
...     img = (img * std) + mean
...     img = (img * 255).clip(0, 255).astype("uint8")  # clip before casting to avoid uint8 wrap-around
...     return img

>>> def create_gif(video_tensor, filename="sample.gif"):
...     """Prepares a GIF from a video tensor.
...
...     The video tensor is expected to have the following shape:
...     (num_frames, num_channels, height, width).
...     """
...     frames = []
...     for video_frame in video_tensor:
...         frame_unnormalized = unnormalize_img(video_frame.permute(1, 2, 0).numpy())
...         frames.append(frame_unnormalized)
...     kargs = {"duration": 0.25}
...     imageio.mimsave(filename, frames, "GIF", **kargs)
...     return filename

>>> def display_gif(video_tensor, gif_name="sample.gif"):
...     """Prepares and displays a GIF from a video tensor."""
...     video_tensor = video_tensor.permute(1, 0, 2, 3)
...     gif_filename = create_gif(video_tensor, gif_name)
...     return Image(filename=gif_filename)

>>> sample_video = next(iter(train_dataset))
>>> video_tensor = sample_video["video"]
>>> display_gif(video_tensor)

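Train the model

Use Trainer from 🤗 Transformers to train the model. To instantiate a Trainer, you need to define the training configuration and an evaluation metric. The most important piece is TrainingArguments, a class that holds all the attributes needed to configure training. It requires an output folder name, which is used to save the model checkpoints. It also helps sync all the information to the model repository on the 🤗 Hub.

Most of the training arguments are self-explanatory, but one that matters a lot here is `remove_unused_columns=False`. When this argument is `True`, which is the default, any features not used by the model's call function are dropped. That is usually ideal, because it makes it easier to unpack inputs into the model's call function. In this case, however, you need the unused features ("video" in particular) in order to create `pixel_values`, a mandatory key the model expects in its inputs.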

>>> from transformers import TrainingArguments, Trainer

>>> model_name = model_ckpt.split("/")[-1]
>>> new_model_name = f"{model_name}-finetuned-ucf101-subset"
>>> num_epochs = 4
>>> batch_size = 8  # assumed value, not given in this guide; adjust to your GPU memory

>>> args = TrainingArguments(
...     new_model_name,
...     remove_unused_columns=False,
...     eval_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=batch_size,
...     per_device_eval_batch_size=batch_size,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
...     max_steps=(train_dataset.num_videos // batch_size) * num_epochs,
... )

The dataset returned by pytorchvideo.data.Ucf101() does not implement the __len__ method, so you must define max_steps when instantiating TrainingArguments.
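To make the max_steps arithmetic concrete, the sketch below recomputes it from the dataset size. The 300 training videos and 4 epochs come from this guide, while `batch_size=8` is an assumed value chosen for illustration. `compute_max_steps` is a hypothetical helper, and it ignores multi-device training and gradient accumulation.

```python
def compute_max_steps(num_videos, batch_size, num_epochs):
    """Optimizer steps for `num_epochs` passes, dropping the last partial batch."""
    return (num_videos // batch_size) * num_epochs


# 300 training videos and 4 epochs come from this guide; batch_size=8 is assumed.
max_steps = compute_max_steps(num_videos=300, batch_size=8, num_epochs=4)
print(max_steps)  # 148
```

Because of the integer division, the last partial batch of each epoch is not counted: 300 // 8 gives 37 steps per epoch, so 4 epochs give 148 steps.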

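Next, define a function to compute metrics from the predictions, using the metric you are about to load. The only preprocessing needed is to take the argmax of the predicted logits: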

>>> import evaluate

>>> metric = evaluate.load("accuracy")


>>> def compute_metrics(eval_pred):
...     predictions = np.argmax(eval_pred.predictions, axis=1)
...     return metric.compute(predictions=predictions, references=eval_pred.label_ids)

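A note on evaluation:

In the VideoMAE paper, the authors use the following evaluation strategy: they evaluate the model on several clips from each test video, apply different crops to those clips, and report the aggregate score. For simplicity and brevity, this tutorial does not do that.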
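For readers who want to experiment with that kind of multi-view evaluation, one common way to aggregate is to average the per-view class probabilities and take the argmax of the result. The sketch below illustrates that idea only and is not necessarily the exact protocol used in the paper. The logits are made up, and `softmax` and `aggregate_views` are hypothetical helper names.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def aggregate_views(view_logits):
    """Average class probabilities over views.

    view_logits has shape (num_views, num_classes); one row per clip/crop.
    """
    return softmax(np.asarray(view_logits, dtype=np.float64)).mean(axis=0)


# Made-up logits for 3 views of one video over 4 classes.
view_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.2, 1.5, 0.0, -0.5],
    [1.8, 0.3, 0.2, -0.8],
]
video_probs = aggregate_views(view_logits)
predicted_class = int(video_probs.argmax())
print(predicted_class)
```

In this example two of the three views favor class 0, so the averaged prediction is class 0 even though the second view on its own would pick class 1.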

Also define a collate_fn, which will be used to batch examples together. Each batch consists of 2 keys, namely pixel_values and labels.

>>> import torch

>>> def collate_fn(examples):
...     # permute to (num_frames, num_channels, height, width)
...     pixel_values = torch.stack(
...         [example["video"].permute(1, 0, 2, 3) for example in examples]
...     )
...     labels = torch.tensor([example["label"] for example in examples])
...     return {"pixel_values": pixel_values, "labels": labels}

Then pass all of this, along with the datasets, to Trainer:

>>> trainer = Trainer(
...     model,
...     args,
...     train_dataset=train_dataset,
...     eval_dataset=val_dataset,
...     processing_class=image_processor,
...     compute_metrics=compute_metrics,
...     data_collator=collate_fn,
... )

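You might wonder why you pass image_processor as processing_class when the data has already been preprocessed. This is only to make sure the image processor configuration file (stored as JSON) is also uploaded to the repository on the Hub.

Now fine-tune the model by calling the train method: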

>>> train_results = trainer.train()

Once training is complete, share your model to the Hub with the push_to_hub() method so everyone can use it:

>>> trainer.push_to_hub()

Inference

Great, now that you have fine-tuned a model, you can use it for inference!

Load a video for inference:

>>> sample_test_video = next(iter(test_dataset))

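The simplest way to try out your fine-tuned model for inference is to use it in a pipeline. Instantiate a pipeline for video classification with your model, and pass your video to it: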

>>> from transformers import pipeline

>>> video_cls = pipeline(model="my_awesome_video_cls_model")
>>> video_cls("https://huggingface.co/datasets/sayakpaul/ucf101-subset/resolve/main/v_BasketballDunk_g14_c06.avi")
[{'score': 0.9272987842559814, 'label': 'BasketballDunk'},
 {'score': 0.017777055501937866, 'label': 'BabyCrawling'},
 {'score': 0.01663011871278286, 'label': 'BalanceBeam'},
 {'score': 0.009560945443809032, 'label': 'BandMarching'},
 {'score': 0.0068979403004050255, 'label': 'BaseballPitch'}]

You can also manually replicate the results of the pipeline if you like.

>>> def run_inference(model, video):
...     # permute to (num_frames, num_channels, height, width)
...     permuted_sample_test_video = video.permute(1, 0, 2, 3)
...     inputs = {
...         "pixel_values": permuted_sample_test_video.unsqueeze(0),
...         "labels": torch.tensor(
...             [sample_test_video["label"]]
...         ),  # this can be skipped if you don't have labels available.
...     }
...
...     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
...     inputs = {k: v.to(device) for k, v in inputs.items()}
...     model = model.to(device)
...
...     # forward pass
...     with torch.no_grad():
...         outputs = model(**inputs)
...         logits = outputs.logits
...
...     return logits

Now pass your input to the model and return the logits:

>>> logits = run_inference(model, sample_test_video["video"])  # model is the fine-tuned model from the Trainer above

Decoding the logits, we get:

>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
# Predicted class: BasketballDunk
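The pipeline output shown earlier lists several labels with scores, whereas the argmax above yields only the top class. To get a comparable ranked list from raw logits, you can apply a softmax and sort. The following is a minimal sketch under the assumption that the scores are softmax probabilities; `top_k_predictions` is a hypothetical helper, and the logits and label map are made up.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Turn raw logits for one video into the k most likely labels with scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]  # subtract max for stability
    total = sum(exps)
    scores = [value / total for value in exps]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return [{"score": scores[i], "label": id2label[i]} for i in ranked[:k]]


# Made-up logits and a three-class label map, for illustration only.
example_id2label = {0: "Archery", 1: "BandMarching", 2: "BasketballDunk"}
example_logits = [0.1, -1.2, 3.4]
top = top_k_predictions(example_logits, example_id2label, k=2)
print(top[0]["label"])
```

With the tensor returned by run_inference, you could pass `logits[0].tolist()` and `model.config.id2label` instead of the made-up values.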
