FineVideo: behind the scenes

Published September 23, 2024

Lewis Tunstall

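Open video datasets are scarce, which slows the development of open-source video AI. To help, we built FineVideo, a dataset of 43K videos spanning 3.4K hours, annotated with rich descriptions, narrative details, scene splits, and Q&A pairs.

FineVideo contains a highly diverse collection of videos and metadata, which makes it a good ingredient for training models to understand video content, training diffusion models to generate videos from a text description, or training computer vision models that use its structured data as input.

Wait, you haven't seen FineVideo yet? Take a look at it through the dataset explorer page.

About this blog post

In this blog post, we share the technical details and code involved in developing FineVideo: a journey that starts with 1.9M videos in YouTube-Commons and ends with 44K videos with exhaustive annotations.

A good way to start is to look at the different steps of our process. These steps cover content filtering, annotation, and output structuring.

FineVideo video filtering and annotation pipeline

In the following sections we discuss each step and link to the relevant code. If you prefer to browse the code directly, take a look at our FineVideo repository on GitHub.

First, let's look at how we obtained the initial list of YouTube videos and how we applied some first filters.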

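Building the raw dataset

Our journey starts with YouTube-Commons: a collection of audio transcripts of videos shared on YouTube under a CC-BY license. The project was created and is maintained by PleIAs as part of their corpus collection projects.

Filtering YouTube-Commons

YouTube-Commons contains videos and transcripts in many languages, so our first task was to narrow its content down to a single language.

We filtered YouTube-Commons for English videos and gathered the relevant metadata at the same time. This initial filtering left us with 1.9M videos together with their closed captions and metadata.

Below are some details on the filters and the metadata fields we kept.

Filters

Field	Filter value	Description
original_language	en	Videos in English
transcription_language	en	Transcripts in English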
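
The two filters in the table can be sketched as a simple predicate over metadata records. This is only an illustration: the toy rows and the `filter_english` helper are hypothetical, and the real pipeline reads YouTube-Commons rather than an in-memory list.

```python
from typing import Iterable, Iterator

def filter_english(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only records whose audio and transcript are both English."""
    for record in records:
        if (record.get("original_language") == "en"
                and record.get("transcription_language") == "en"):
            yield record

# Toy rows standing in for YouTube-Commons entries (illustrative values only).
sample = [
    {"video_id": "a1", "original_language": "en", "transcription_language": "en"},
    {"video_id": "b2", "original_language": "fr", "transcription_language": "en"},
    {"video_id": "c3", "original_language": "en", "transcription_language": "es"},
]

kept = [r["video_id"] for r in filter_english(sample)]
print(kept)
```

Both conditions must hold, so a video with English audio but a transcript in another language is dropped.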

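Metadata fields

Field	Description
acodec	Audio codec
age_limit	YouTube age restrictions for the video
categories	YouTube video category
channel	YouTube channel
channel_follower_count	Number of subscribers of the channel
channel_id	YouTube channel identifier
character_count	Number of characters in the closed captions
comment_count	Number of comments on YouTube
description	YouTube video description
duration_string	Video duration in hh:mm:ss format
license	Video license
like_count	Number of likes of the video on YouTube
resolution	Pixel resolution of the video in the format width x height
tags	YouTube free-text tags associated with the video
text	Closed captions
title	YouTube video title
upload_date	YouTube upload date
vcodec	Video codec
video_id	YouTube video identifier
view_count	Number of views on YouTube
word_count	Number of words in the closed captions

Code for content filtering and metadata gathering is available here [link]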

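Downloading the videos

From the target list of 1.9M videos, we successfully downloaded 1.8M (some videos had been removed by the channel owners or had their permissions changed).

We explored two different approaches to distributed downloading.

Option 1: video2dataset

video2dataset is an open-source project [link] that focuses on distributed video download, transformation, and packaging into different dataset formats. It natively supports the Slurm Workload Manager, so we could run it on our CPU cluster.

Source: video2dataset GitHub page

Because all our cluster instances reach the internet through the same public IP, we contributed to the project the ability to specify a proxy to facilitate video downloads. The feature is not merged yet, but you can patch video2dataset with our PR [link] to use it.

Option 2: Cloud batch jobs

Most cloud providers let you run jobs by simply defining the instance type that executes each job, defining a queue, and providing a container with the code to execute.

We used Google Cloud and AWS to run a custom Docker container that downloads videos and metadata with yt-dlp and pushes the results to S3.

The files to build the Docker container can be found here [code].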
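
As a rough idea of what each batch job might run inside the container, the sketch below assembles a yt-dlp command line for one video. The helper name, output directory, and output template are assumptions made for illustration; the actual container code is in the linked repository, and the upload to S3 is left out here.

```python
import shlex
from typing import List, Optional

def build_ytdlp_command(video_id: str, out_dir: str,
                        proxy: Optional[str] = None) -> List[str]:
    """Build a yt-dlp invocation that saves the video and its metadata JSON."""
    cmd = [
        "yt-dlp",
        "--write-info-json",                        # store metadata next to the video
        "--output", f"{out_dir}/%(id)s.%(ext)s",   # name files after the video id
    ]
    if proxy:
        cmd += ["--proxy", proxy]                   # optional, e.g. behind a shared IP
    cmd.append(f"https://www.youtube.com/watch?v={video_id}")
    return cmd

cmd = build_ytdlp_command("VIDEO_ID", "/tmp/videos")
print(shlex.join(cmd))
```

A job would pass this list to `subprocess.run` and then push the resulting files to object storage.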

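Our conclusion

While video2dataset worked with a proxy and allowed us to run additional processing steps, the number of requests per second we could make to the proxy became a bottleneck. This made us pivot to cloud batch jobs.

Keeping dynamic content

In our search for the best videos, we narrowed the selection to content that has both visual action and people speaking at a medium to fast pace. We achieved this with word density filtering and visual dynamism filtering.

Word density filtering

We took the density of words in a video as a proxy for audio dynamism. Word density is defined as

Word density = number of words in the closed captions / total video length in seconds

By sampling and visually evaluating the quality of the content at different density thresholds, we decided to remove all videos with a word density lower than 0.5 words/second.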
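
The word density formula and the 0.5 words/second cut-off can be sketched as follows, using the `word_count` and `duration_string` metadata fields. The helper names are illustrative and not taken from the linked code.

```python
MIN_WORD_DENSITY = 0.5  # words per second, the threshold quoted in the text

def duration_to_seconds(duration_string: str) -> int:
    """Convert 'hh:mm:ss', 'mm:ss' or 'ss' into a number of seconds."""
    seconds = 0
    for part in duration_string.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def word_density(word_count: int, duration_string: str) -> float:
    """Words in the closed captions divided by the video length in seconds."""
    seconds = duration_to_seconds(duration_string)
    if seconds == 0:
        return 0.0  # avoid dividing by zero on broken metadata
    return word_count / seconds

def keep_video(word_count: int, duration_string: str) -> bool:
    """Keep videos at or above the threshold, drop the rest."""
    return word_density(word_count, duration_string) >= MIN_WORD_DENSITY

print(word_density(450, "00:10:00"))  # 450 words over 600 seconds
print(keep_video(100, "00:10:00"))
```

For example, 450 words over a 10-minute video gives 0.75 words/second and is kept, while 100 words over the same length falls below the threshold.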

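Examples

Word density (words/second)	Example
0.25
0.5
0.75
1.0

Code to filter by word density and explore examples can be found here [link]

Visual dynamism filtering

We repurposed FFmpeg's freezedetect filter to judge the dynamism of a video. The filter is designed to identify frozen sections of a video (several identical frames in a row), but by setting its noise parameter to a deliberately very high value, we could also identify segments with low movement.

Rather than running freezedetect across the full video, we analyzed the video by temporal segments and decided whether the video is static based on the number of segments categorized as static. Through manual evaluation we set a threshold: a video is discarded if 40% of the analyzed segments have too little motion.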
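
A minimal sketch of the segment-based decision is shown below. The noise value, the minimum freeze duration, and the rule that a segment counts as static as soon as freezedetect reports a freeze are all assumptions; only the 40% threshold comes from the text. The command is built but not executed here.

```python
from typing import List

NOISE = "0.1"            # assumed "very high" tolerance (FFmpeg's default is 0.001)
MIN_FREEZE_SECONDS = 2   # assumed minimum duration for a reported freeze
STATIC_THRESHOLD = 0.4   # share of static segments that discards a video

def freezedetect_command(path: str, start: float, length: float) -> List[str]:
    """ffmpeg command that runs freezedetect on one temporal segment."""
    return [
        "ffmpeg", "-hide_banner",
        "-ss", str(start), "-t", str(length), "-i", path,
        "-vf", f"freezedetect=n={NOISE}:d={MIN_FREEZE_SECONDS}",
        "-an", "-f", "null", "-",
    ]

def segment_is_static(ffmpeg_log: str) -> bool:
    """Assumption: a segment is static if the log reports any freeze start."""
    return "lavfi.freezedetect.freeze_start" in ffmpeg_log

def video_is_static(segment_flags: List[bool]) -> bool:
    """Discard the video when enough analyzed segments are static."""
    if not segment_flags:
        return False
    return sum(segment_flags) / len(segment_flags) >= STATIC_THRESHOLD

print(video_is_static([True, True, False, False, False]))
```

In practice each command would be run with `subprocess.run(..., capture_output=True, text=True)` and its stderr passed to `segment_is_static`.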

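Some types of content discarded by this filter

Type	Example
Static image with music
Presentation screen cast
Highly static people talking to camera

The Dockerfile and code to classify videos by dynamism can be found here [link]

Of the 1.8M videos analyzed, we kept 600K dynamic videos after this step. At this stage we dug into the content of the videos, which is key to ensuring diversity in the dataset.

Video categorization

To achieve the most diverse content selection, we categorized the 600K filtered assets using the closed captions and the YouTube metadata. To gain control over the categorization, we created a taxonomy and guided the annotation process to adhere to it.

Custom-built taxonomy

We bootstrapped the custom taxonomy using GPT-4o, and an information scientist reviewed and adjusted it. The taxonomy contains 126 fine categories aggregated into multiple levels. This multi-level approach lets users of FineVideo slice the dataset to fit their particular use case.

The taxonomy is also available in JSON [link]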
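
To show how a multi-level taxonomy can be reduced to its fine categories, here is a small sketch that walks a nested JSON object and collects the leaves. The taxonomy fragment, its category names, and its nesting layout are hypothetical; the real JSON file may be organized differently.

```python
import json
from typing import List

# Hypothetical taxonomy fragment; names and layout are for illustration only.
TAXONOMY_JSON = """
{
  "Sports": {"Team Sports": {}, "Water Sports": {}},
  "Education": {"Science": {"Physics": {}, "Biology": {}}}
}
"""

def collect_leaves(node: dict) -> List[str]:
    """Return the names of all categories that have no children."""
    leaves = []
    for name, children in node.items():
        if children:
            leaves.extend(collect_leaves(children))
        else:
            leaves.append(name)
    return leaves

leaves = collect_leaves(json.loads(TAXONOMY_JSON))
print(leaves)
```

Slicing the dataset at a coarser level then amounts to grouping leaves by any of their ancestors.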

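With an initial version of the taxonomy we started content annotation and, by looking at the annotation results and with the help of an information scientist, adjusted the taxonomy accordingly.

Content annotation

We categorized the videos using Llama 3.1 70B served through Text Generation Inference (TGI) [code].

The prompt required multiple iterations to ensure the answer is strictly a category in our taxonomy. During prompt evaluation we learned that removing the existing YouTube tags and categories from the prompt drastically improved the quality of the results: the YouTube metadata was biasing the text generated by Llama 3.1 towards one of the categories provided by YouTube.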

prompt_template = """
Given those categories: {leaves}
Classify a youtube video given its closed captioning and some metadata details. RETURN ONLY the selected category and nothing else!
Title: {title}
Description: {description}
Channel: {channel}
Closed Caption: {closed_caption}
"""
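
The sketch below fills the template above and checks that a model answer is strictly one of the taxonomy leaves. It does not call a model; `build_prompt` and `parse_category` are hypothetical helpers, and joining the leaves with commas is an assumption about how `{leaves}` is rendered.

```python
from typing import List, Optional

PROMPT_TEMPLATE = """
Given those categories: {leaves}
Classify a youtube video given its closed captioning and some metadata details. RETURN ONLY the selected category and nothing else!
Title: {title}
Description: {description}
Channel: {channel}
Closed Caption: {closed_caption}
"""

def build_prompt(leaves: List[str], title: str, description: str,
                 channel: str, closed_caption: str) -> str:
    """Fill the template; YouTube tags and categories are deliberately left out."""
    return PROMPT_TEMPLATE.format(
        leaves=", ".join(leaves), title=title, description=description,
        channel=channel, closed_caption=closed_caption,
    )

def parse_category(answer: str, leaves: List[str]) -> Optional[str]:
    """Return the matching leaf, or None if the answer is not in the taxonomy."""
    cleaned = answer.strip().strip('".').casefold()
    for leaf in leaves:
        if leaf.casefold() == cleaned:
            return leaf
    return None

leaves = ["Cooking", "Physics"]
prompt = build_prompt(leaves, "Pasta 101", "How to cook pasta", "ChefTV", "Boil water...")
print(parse_category(' "cooking". ', leaves))
print(parse_category("I think it is Cooking", leaves))
```

Answers that come back as `None` can be retried or routed to manual review, which keeps every stored label inside the taxonomy.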

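Taxonomy - content annotation feedback loop

Taxonomy adjustments during content categorization

One of the roles of information scientists is to curate taxonomies over time, adding new categories or extra degrees of differentiation when needed.

Using LLMs to categorize content shortens the cycle for adjusting a taxonomy from months or years to hours. Furthermore, in some cases we created categories specifically to discard sensitive videos, such as those falling under Firearms & Weapons and Substance Use & Drugs.

Contributing descriptive metadata

At this point of the process, we have three sources of video-level metadata:

Video category (inferred with Llama 3.1)
YouTube metadata (title, description)
Transcripts from YouTube-Commons

To contribute to the field of video understanding, we decided to go deeper into timecode-level metadata, such as activities, objects, narrative, and editing aspects. We considered human annotation as part of an active learning setup, where one or more models propose annotations and a human performs a QA step. However, as we discuss in the next sections, we found Gemini to be a good solution, especially when we constrained the input video length and the output format.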

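Long videos & Gemini 1.5 Pro

We dug deeper into Gemini 1.5 Pro, iterating on our prompt and testing it with content of different lengths.

Given its limit of 1M tokens, which is roughly equivalent to 1 hour of video, we were forced to drop videos longer than 1 hour. One idea to overcome this was to speed up videos longer than an hour so that they fit in Gemini's context.

Exploration: speeding up videos to fit more content in Gemini's context

While this seemed to work at a high level, when we looked at the details we realized that only the first minutes of the video were accurately annotated.

Finding that quality drops on long videos made us wonder: is this issue affecting the rest of our videos? By sampling videos of different lengths and inspecting how much of each video the annotations covered, we found that quality degraded for videos longer than 10 minutes.

In line with our goal of giving high-quality data back to the community, we dropped videos longer than 10 minutes.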

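Content selection

Given that each hour of video costs more than $5 to annotate with Gemini, we could not annotate all the videos that remained after filtering. We therefore wanted to ensure good coverage of all topics and find a good compromise between content diversity, late-stage pre-training / fine-tuning tasks, and budget. We set this size constraint to 4K hours of video.

To go from 600K videos to 4K hours of content, we prepared an algorithm that balances content categories, user engagement, and channel representation to reach the target duration.

Algorithm flow diagram

Some key parts of the content selection algorithm:

Activity score: We calculate an engagement metric for each video by combining comment, view, and like counts with different weights. This score helps prioritize videos that have resonated well with viewers.

Video selection: This step iteratively selects videos to meet the target duration while ensuring diversity. It balances highly engaging content against representation from different categories and channels, using a penalty system to avoid overrepresentation of any single channel.

Final adjustment: We adjust the selection to get as close as possible to the target duration without exceeding it. This step sorts the selected videos by duration and adds them to the final list until the total is as close as possible to the target.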
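
The three parts above can be sketched roughly as follows. This is not the repository's algorithm: the weights, the channel penalty factor, and the round-robin pass over categories are all assumptions chosen to keep the example short.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import List

@dataclass
class Video:
    video_id: str
    channel: str
    category: str
    duration: float  # seconds
    views: int
    likes: int
    comments: int

# Hypothetical weights and penalty; the text does not state the real values.
W_COMMENTS, W_LIKES, W_VIEWS = 1.0, 0.5, 0.01
CHANNEL_PENALTY = 0.5

def activity_score(v: Video) -> float:
    """Weighted combination of comment, like, and view counts."""
    return W_COMMENTS * v.comments + W_LIKES * v.likes + W_VIEWS * v.views

def select_videos(videos: List[Video], target_seconds: float) -> List[Video]:
    """Round-robin over categories, picking the best video per category,
    with scores discounted for channels that were already picked."""
    pools = defaultdict(list)
    for v in videos:
        pools[v.category].append(v)
    per_channel = Counter()
    picked, total = [], 0.0
    while total < target_seconds and any(pools.values()):
        for pool in pools.values():
            if not pool:
                continue
            best = max(pool, key=lambda v: activity_score(v)
                       * CHANNEL_PENALTY ** per_channel[v.channel])
            pool.remove(best)
            picked.append(best)
            per_channel[best.channel] += 1
            total += best.duration
            if total >= target_seconds:
                break
    return picked

def final_adjustment(picked: List[Video], target_seconds: float) -> List[Video]:
    """Sort by duration and add videos while the total stays within the target."""
    final, total = [], 0.0
    for v in sorted(picked, key=lambda v: v.duration):
        if total + v.duration > target_seconds:
            break
        final.append(v)
        total += v.duration
    return final

videos = [
    Video("A", "ch1", "x", 300, 1000, 100, 10),
    Video("B", "ch1", "x", 200, 900, 90, 9),
    Video("C", "ch2", "x", 250, 500, 60, 8),
    Video("D", "ch3", "y", 400, 100, 5, 1),
]
picked = select_videos(videos, target_seconds=900)
final = final_adjustment(picked, target_seconds=900)
print([v.video_id for v in picked], [v.video_id for v in final])
```

In this toy run, video B has a higher raw score than C but loses to it because its channel was already selected once, which is the effect the penalty is meant to have.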

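The code can be found in the repository [link].

Annotating with Gemini 1.5 Pro and structured output with GPT-4o

Why structured data?

One of our goals with FineVideo is to provide structured data as a way to empower our community: if you are working on multimodal LLMs, you can slice the data and decide which categories fit your pre-training or fine-tuning mix. If you are more into computer vision, you can directly use the dataset to train classifiers based on the numerical categories included in FineVideo, such as the dynamism score, scene boundaries, or audio/video correlation score.

Structured data and Gemini 1.5

Gemini 1.5 Pro supports JSON-based outputs when given a schema. We explored this feature and quickly ran into two issues:

Our schema is highly complex, and we could not fit it into Gemini in full.
When we tried slightly simpler (but still quite complex) schemas, the quality of the Gemini results dropped substantially: most of the scene-level data (characters, activities, props) was lost. We tried splitting the prompt into multiple prompts and matching each one to a different part of the schema, without much success.

What we observed matches what other researchers have reported: adding concrete schema constraints can decrease performance (Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models).

Our solution was to generate free text with Gemini 1.5 and add a second processing step that aligns Gemini's results with our schema.

The Gemini prompt we used is the following: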

Study the video and provide the following details about the video and the semantic scenes that compose it:

- characterList: a list of characters that appear in the whole video and a visual description that should allow me to identify them just seeing an image of them.
- scenes: a list of the scenes with the following properties:
  - start/end timestamps of the scene
  - list of all the characters that appear in the scene
  - list of all activities and their timestamps
  - list of all props and their timestamps
  - list of all video editing details and their start/end timestamps. Details include transitions, effects, music as well as suggestions like segments of the scene that could be removed and why 
  - scene mood with notes on how the visuals, audio and context contribute to it. Use the following taxonomy returning only the name in your answer {"moods":{"Positive":[{"name":"Happy","description":"Feeling joyful, content, or delighted."},{"name":"Excited","description":"Feeling enthusiastic, energetic, or eager."},{"name":"Calm","description":"Feeling peaceful, relaxed, or serene."},{"name":"Grateful","description":"Feeling appreciative or thankful."},{"name":"Proud","description":"Feeling satisfied with one's achievements or the achievements of others."}],"Negative":[{"name":"Sad","description":"Feeling down, unhappy, or sorrowful."},{"name":"Angry","description":"Feeling irritated, frustrated, or furious."},{"name":"Anxious","description":"Feeling nervous, worried, or uneasy."},{"name":"Lonely","description":"Feeling isolated, disconnected, or abandoned."},{"name":"Bored","description":"Feeling uninterested, disengaged, or restless."}],"Neutral":[{"name":"Indifferent","description":"Feeling neither particularly positive nor negative."},{"name":"Content","description":"Feeling satisfied but not overly excited."},{"name":"Curious","description":"Feeling interested or inquisitive without strong emotion."},{"name":"Confused","description":"Feeling uncertain or unclear but without strong negative feelings."},{"name":"Pensive","description":"Feeling thoughtful or reflective without strong emotional engagement."}]}}
    - specific  mood changing moments inside the scene, report the timestamp and what we transition from/to in any of the dimensions (visual / auditive)
  - scene narrative progression and plot development
    - specific narrative moments inside the scene. Report the timestamp and what happened
  - character interaction and dynamics descriptions and their start/end timestamps
  - specific thematic elements and descriptions
  - specific relevant happenings to create deeper meanings and subtexts not explicitly stated that contribute to the richness and depth of the content, timestamp and descriptions
  - dynamism score of the scene. Score between 0 and 1. 1 is highly dynamic
  - audio - visual correlation score. Score between 0 and 1. 0 what we see is not correlated with the speech and 1 is highly correlated

- storylines: a list of the different storylines found and which scenes belong to it. 
  - Specify where is the climax (scene and timestamp) and if the content is being presented a narrative story, or is it more like a collection of facts or non-narrative information
  - if there are scenes not matching storylines, explain how those scenes contribute to the video
- looking at the overall video and the storylines, which segments of the video could be trimmed to make it more dynamic?
- q&a: a list of 5 questions/answers about the video that focus on fine details (objects and or activities), overall story reasoning and mood. Focus on Q&A aspects captured on the audio and the video whenever possible difficult to get only by looking at the transcription.

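Adding Instructor

Once the results were processed by Gemini, we parsed them with Instructor: a library built on top of Pydantic to achieve structured outputs given a schema. See the example below.

Instructor allowed us to experiment with different models for converting Gemini's free text into the schema we defined in Pydantic. We tried Gemini and GPT-4o, and we stuck with GPT-4o given its higher success rate.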

Gemini output

CharacterList: Man Slim build, brown eyes, shaved sides, black hoodie with colorful logo, black pants. Scenes Scene 1 Start 0:00 End 0:55 Characters: [Man] Activities: Introduces bus Describes peaceful location with cows Props: Bus, cows, deck. Mood:Excited, adventure. Narrative Progression: Introduction to bus. Tour begins outside, highlighting nature and relaxation. Dynamism Score 0.7 Audio-Visual Correlation 1

Instructor output

{
  "title": "Bertie the Bus Tour",
  "description": "Guided tour of converted bus.",
  "characterList": [
    {
      "name": "Narrator",
      "description": "Slim build, brown eyes, shaved sides, black hoodie with colorful logo, black pants."
    }
  ],
  "scenes": [
    {
      "sceneId": 1,
      "title": "Introduction to Bus",
      "timestamps": {
        "start": "0:00",
        "end": "0:55"
      },
      "cast": ["Narrator"],
      "activities": [
        "Narrator speaks in front of bus",
        "Shows outdoor deck with chairs, cows nearby."
      ],
      "props": ["Bus", "Deck", "Cows"],
      "mood": "Excited, adventure."
    }
  ],
  "dynamismScore": 0.7,
  "audioVisualCorrelation": 1
}
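
To make the idea of aligning output to a schema concrete, the sketch below parses a trimmed copy of the JSON above into typed objects and rejects values that break simple constraints. It deliberately uses only the standard library (`dataclasses`) as a stand-in for the Pydantic models and Instructor calls used in the pipeline; the class names and the specific checks are assumptions.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class Scene:
    sceneId: int
    title: str
    start: str
    end: str
    cast: List[str]
    activities: List[str]
    props: List[str]
    mood: str

@dataclass
class VideoAnnotation:
    title: str
    characters: List[str]
    scenes: List[Scene]
    dynamismScore: float
    audioVisualCorrelation: float

def check_score(name: str, value) -> float:
    """Scores are expected to lie between 0 and 1."""
    score = float(value)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {score}")
    return score

def parse_annotation(raw: str) -> VideoAnnotation:
    data = json.loads(raw)
    characters = [c["name"] for c in data["characterList"]]
    scenes = []
    for s in data["scenes"]:
        unknown = set(s["cast"]) - set(characters)
        if unknown:  # every cast member must appear in characterList
            raise ValueError(f"unknown characters in scene: {sorted(unknown)}")
        scenes.append(Scene(
            sceneId=int(s["sceneId"]), title=s["title"],
            start=s["timestamps"]["start"], end=s["timestamps"]["end"],
            cast=list(s["cast"]), activities=list(s["activities"]),
            props=list(s["props"]), mood=s["mood"],
        ))
    return VideoAnnotation(
        title=data["title"], characters=characters, scenes=scenes,
        dynamismScore=check_score("dynamismScore", data["dynamismScore"]),
        audioVisualCorrelation=check_score(
            "audioVisualCorrelation", data["audioVisualCorrelation"]),
    )

RAW = """{"title": "Bertie the Bus Tour",
 "characterList": [{"name": "Narrator", "description": "Slim build"}],
 "scenes": [{"sceneId": 1, "title": "Introduction to Bus",
   "timestamps": {"start": "0:00", "end": "0:55"}, "cast": ["Narrator"],
   "activities": ["Narrator speaks in front of bus"],
   "props": ["Bus", "Deck", "Cows"], "mood": "Excited, adventure."}],
 "dynamismScore": 0.7, "audioVisualCorrelation": 1}"""

annotation = parse_annotation(RAW)
print(annotation.scenes[0].end, annotation.dynamismScore)
```

With Pydantic, the same constraints would live in the model definitions, and Instructor would ask the LLM to retry when validation fails.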

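It's worth highlighting that Gemini's content filtering dropped some videos, which can also happen to you if you use Gemini. In our case, given the amount of content we were targeting, the total minutes of content dropped by Gemini's filtering was negligible.

The full code to annotate the videos can be found here [link].

Fine alignment and anomaly filtering

With the videos annotated and the data properly aligned to our schema, we looked at the temporal domain of the data and made sure it was aligned with the video: Gemini 1.5 reads video at 1 frame per second, while videos often have 25-29 frames per second. In our fine alignment we make sure that the scene boundaries provided by Gemini 1.5 match the correct frames in the video.

We also use this temporal alignment to discard cases where Gemini stopped providing useful data and part of the video was wrongly annotated. Note that because we had dropped all content longer than 10 minutes earlier in the pipeline, the number of videos with bad-quality data was negligible (lower than 0.5%).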
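
One way to picture the fine alignment is to snap each scene boundary, reported at 1-second granularity, to the nearest shot cut detected at full frame rate, and to treat boundaries with no nearby cut as suspicious. The sketch below assumes the shot cuts are already available as a sorted list of seconds and uses a 1-second tolerance; both are assumptions, not details taken from the linked code.

```python
from bisect import bisect_left
from typing import List, Optional

def timestamp_to_seconds(ts: str) -> float:
    """Convert 'h:mm:ss' or 'm:ss' into seconds."""
    seconds = 0.0
    for part in ts.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

def snap_to_cut(boundary_s: float, cuts_s: List[float],
                tolerance_s: float = 1.0) -> Optional[float]:
    """Return the nearest shot cut within the tolerance, else None.
    cuts_s must be sorted in ascending order."""
    i = bisect_left(cuts_s, boundary_s)
    candidates = cuts_s[max(0, i - 1): i + 1]  # neighbours on both sides
    if not candidates:
        return None
    nearest = min(candidates, key=lambda c: abs(c - boundary_s))
    return nearest if abs(nearest - boundary_s) <= tolerance_s else None

def to_frame(seconds: float, fps: float) -> int:
    """Frame index of a timestamp at the video's native frame rate."""
    return round(seconds * fps)

cuts = [0.0, 54.6, 120.2]  # hypothetical shot cuts, in seconds
aligned = snap_to_cut(timestamp_to_seconds("0:55"), cuts)
print(aligned, to_frame(aligned, fps=25))
print(snap_to_cut(300.0, cuts))
```

A video whose boundaries repeatedly fail to snap is a candidate for the outlier filter described above.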

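Fine metadata - aligning video scene boundaries to shots as a mechanism to discard outliers

Link to the video alignment code here [link]

Future work

We are currently preparing the training of a multimodal LLM with FineVideo, and we plan to share the model weights and training recipe with the community as soon as it is completed.

We are also open to other extensions of FineVideo. Speak up and tell us what you would like to see!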
