Video Dataset

This guide shows you how to configure a dataset repository that contains video files.

A dataset with a supported structure and supported file formats automatically gets a Dataset Viewer on its Hub page.

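Additional information about your videos, such as captions or bounding boxes for object detection, is loaded automatically as long as you include it in a metadata file (metadata.csv/metadata.jsonl/metadata.parquet).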

Alternatively, videos can be stored in Parquet files or in TAR archives that follow the WebDataset format.

Only videos

If your dataset consists of a single column of videos, you can store the video files directly at the root:

my_dataset_repository/
├── 1.mp4
├── 2.mp4
├── 3.mp4
└── 4.mp4

or in a subdirectory:

my_dataset_repository/
└── videos
    ├── 1.mp4
    ├── 2.mp4
    ├── 3.mp4
    └── 4.mp4

Multiple formats are supported at the same time, including MP4, MOV and AVI:

my_dataset_repository/
└── videos
    ├── 1.mp4
    ├── 2.mov
    └── 3.avi

If you have several splits, you can put your videos into directories named after each split:

my_dataset_repository/
├── train
│   ├── 1.mp4
│   └── 2.mp4
└── test
    ├── 3.mp4
    └── 4.mp4

See File names and splits for more information and for other ways to organize data by split.

Additional columns

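If you want to include additional information about your dataset, such as text captions or bounding boxes, add it to your repository as a metadata.csv file. This lets you quickly create datasets for different computer vision tasks, such as video generation or object detection.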

my_dataset_repository/
└── train
    ├── 1.mp4
    ├── 2.mp4
    ├── 3.mp4
    ├── 4.mp4
    └── metadata.csv

Your metadata.csv file must have a file_name column, which links each video file to its metadata:

file_name,text
1.mp4,an animation of a green pokemon with red eyes
2.mp4,a short video of a green and yellow toy with a red nose
3.mp4,a red and white ball shows an angry look on its face
4.mp4,a cartoon ball is smiling
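
A metadata.csv file like the one above can be generated with a short script rather than written by hand. The following is a minimal sketch using only the Python standard library; the function name `write_metadata_csv` and the `captions` mapping are hypothetical names chosen for this illustration, and `text` is simply the extra column used in this guide's example.

```python
import csv
from pathlib import Path


def write_metadata_csv(split_dir: Path, captions: dict[str, str]) -> Path:
    """Write metadata.csv next to the videos in split_dir.

    `captions` maps a video file name (e.g. "1.mp4") to its text caption.
    """
    metadata_path = split_dir / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # file_name is the required column; "text" is this example's extra column.
        writer.writerow(["file_name", "text"])
        for file_name, text in sorted(captions.items()):
            # csv.writer quotes captions that contain commas or quotes.
            writer.writerow([file_name, text])
    return metadata_path
```

For example, `write_metadata_csv(Path("my_dataset_repository/train"), {"1.mp4": "a cartoon ball is smiling"})` would write the file alongside the videos of the train split, assuming that directory already exists.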

You can also use a JSONL file, `metadata.jsonl`:

{"file_name": "1.mp4","text": "an animation of a green pokemon with red eyes"}
{"file_name": "2.mp4","text": "a short video of a green and yellow toy with a red nose"}
{"file_name": "3.mp4","text": "a red and white ball shows an angry look on its face"}
{"file_name": "4.mp4","text": "a cartoon ball is smiling"}
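
If the metadata already exists as a CSV file, it can be converted to JSONL mechanically. This is a minimal sketch using only the standard library; `csv_to_jsonl` is a hypothetical helper name, and every CSV value is kept as a string, which may not be what you want for numeric columns.

```python
import csv
import json
from pathlib import Path


def csv_to_jsonl(csv_path: Path) -> Path:
    """Convert a metadata CSV file into metadata.jsonl in the same directory."""
    jsonl_path = csv_path.with_name("metadata.jsonl")
    with csv_path.open(newline="", encoding="utf-8") as src, \
            jsonl_path.open("w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # One JSON object per line; ensure_ascii=False keeps
            # non-ASCII captions human-readable in the output file.
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
    return jsonl_path
```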

For larger datasets, or if you are interested in advanced data retrieval features, you can use a Parquet file, `metadata.parquet`.

Relative paths

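The metadata file must be located either in the same directory as the videos it links to or in any parent directory, as in this example: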

my_dataset_repository/
└── train
    ├── videos
    │   ├── 1.mp4
    │   ├── 2.mp4
    │   ├── 3.mp4
    │   └── 4.mp4
    └── metadata.csv

In this case, the file_name column must contain the full relative path to each video, not just the file name:

file_name,text
videos/1.mp4,an animation of a green pokemon with red eyes
videos/2.mp4,a short video of a green and yellow toy with a red nose
videos/3.mp4,a red and white ball shows an angry look on its face
videos/4.mp4,a cartoon ball is smiling

The metadata file cannot be placed in a subdirectory of the directory that contains the videos.

More generally, any column named file_name or *_file_name should contain the full relative path to the videos.
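
The relative paths for the file_name column can be computed from the directory that holds the metadata file. The sketch below is one way to do this with `pathlib`; the helper name `relative_file_names` and the `VIDEO_EXTENSIONS` set are assumptions made for this illustration, and the set only covers the formats named in this guide.

```python
from pathlib import Path

# Extensions named in this guide; extend the set if you use other formats.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi"}


def relative_file_names(metadata_dir: Path) -> list[str]:
    """List videos below metadata_dir as paths relative to that directory.

    Paths use forward slashes, matching the "videos/1.mp4" style shown above.
    """
    return sorted(
        path.relative_to(metadata_dir).as_posix()
        for path in metadata_dir.rglob("*")
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS
    )
```

Because the paths are computed relative to the metadata file's own directory, the result only ever points at the same directory or below it, which is consistent with the placement rule described above.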

Video classification

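For video classification datasets, you can also use a simple setup in which directories name the video classes. Store your video files in a directory structure like this: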

my_dataset_repository/
├── green
│   ├── 1.mp4
│   └── 2.mp4
└── red
    ├── 3.mp4
    └── 4.mp4

A dataset created with this structure contains two columns: video and label (with the values green and red).
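
To make the mapping from directories to columns concrete, the following sketch previews the rows such a layout would produce. It is only an illustration of the convention, not the code the Hub or any library actually runs; the function name `preview_labels` is hypothetical, and for brevity it only looks at `.mp4` files one level below the root.

```python
from pathlib import Path


def preview_labels(root: Path) -> list[dict[str, str]]:
    """Return one {"video", "label"} row per video, using the parent
    directory name as the label (e.g. green/1.mp4 -> label "green")."""
    rows = []
    for path in sorted(root.glob("*/*.mp4")):
        rows.append({
            "video": path.relative_to(root).as_posix(),
            "label": path.parent.name,
        })
    return rows
```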

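You can also provide multiple splits. To do so, your dataset directory should have the following structure (see File names and splits for more information):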

my_dataset_repository/
├── test
│   ├── green
│   │   └── 2.mp4
│   └── red
│       └── 4.mp4
└── train
    ├── green
    │   └── 1.mp4
    └── red
        └── 3.mp4

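You can disable the automatic addition of the `label` column in the YAML configuration. If your directory names have no special meaning, set `drop_labels: true` in the README header: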

configs:
  - config_name: default  # Name of the dataset subset, if applicable.
    drop_labels: true

Large scale datasets

WebDataset format

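The WebDataset format is well suited to large scale video datasets. It consists of TAR archives that contain videos and their metadata, and it is optimized for streaming. It is useful if you have a large number of videos and want a streaming data loader for large scale training.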

my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar

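To make a WebDataset TAR archive, create a directory containing the videos and metadata files to be archived, then create the TAR archive using, for example, the tar command. Each archive is usually around 1GB in size. Make sure each video and metadata pair shares the same file prefix, for example: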

train-0000/
├── 000.mp4
├── 000.json
├── 001.mp4
├── 001.json
├── ...
├── 999.mp4
└── 999.json
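
Besides the tar command, a shard can be built with Python's standard `tarfile` module. The sketch below is a minimal example under a few assumptions: `make_shard` is a hypothetical helper name, the source directory is flat like the one above, and the archive is written uncompressed. Files are added in sorted order so that the two files sharing a prefix end up next to each other in the archive, which WebDataset readers generally expect.

```python
import tarfile
from pathlib import Path


def make_shard(source_dir: Path, shard_path: Path) -> list[str]:
    """Pack every file in source_dir into one uncompressed TAR shard.

    Returns the member names in the order they were written.
    """
    # Sorting keeps 000.json and 000.mp4 adjacent, then 001.json, 001.mp4, ...
    members = sorted(p for p in source_dir.iterdir() if p.is_file())
    with tarfile.open(shard_path, "w") as tar:
        for path in members:
            # arcname drops the directory part so members are named
            # 000.mp4, 000.json, ... rather than train-0000/000.mp4.
            tar.add(path, arcname=path.name)
    return [p.name for p in members]
```

For example, `make_shard(Path("train-0000"), Path("train-0000.tar"))` would produce the first shard of the layout shown earlier; keep the output path outside the source directory so the archive does not try to include itself.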

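Note that, for user convenience and to enable the Dataset Viewer, every dataset hosted on the Hub is automatically converted to Parquet format, up to 5GB. Because videos can be very large, the converted Parquet data stores the URLs of the videos rather than the video bytes themselves. See the Parquet format documentation for more information.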
