Monocular depth estimation

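Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a single image. In other words, it is the process of estimating the distance of objects in a scene from a single camera viewpoint.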

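Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving, and robotics. It is a challenging task because the model must understand the complex relationships between objects in the scene and their corresponding depth, and these relationships can be affected by factors such as lighting conditions, occlusion, and texture.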

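There are two main depth estimation categories:

Absolute depth estimation: This task variant aims to provide exact depth measurements from the camera. The term is used interchangeably with metric depth estimation, where depth is given as precise measurements in meters or feet. Absolute depth estimation models output depth maps whose numerical values represent real-world distances.
Relative depth estimation: This task variant aims to predict the depth order of objects or points in a scene without providing precise measurements. These models output a depth map that indicates which parts of the scene are closer or farther relative to each other, without giving the actual distance to any two points A and B.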
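
The difference between the two categories can be sketched with a few numbers. The snippet below is only an illustration and makes a simplifying assumption: the relative map is related to the true depth by an unknown positive scale and shift. Real models may follow other conventions, such as predicting inverse depth. The depth values, the scale `0.1`, and the shift `0.3` are all made up for the example.

```python
def depth_ranking(depths):
    """Return point indices sorted from nearest to farthest."""
    return sorted(range(len(depths)), key=lambda i: depths[i])


# Hypothetical metric depths (in meters) of four points in a scene.
metric_depth = [2.0, 0.5, 7.5, 3.0]

# A relative depth map keeps the ordering but has an unknown scale and shift.
# The factor 0.1 and the offset 0.3 are arbitrary illustrative values.
relative_depth = [0.1 * d + 0.3 for d in metric_depth]

metric_order = depth_ranking(metric_depth)
relative_order = depth_ranking(relative_depth)

print(metric_order)    # nearest-to-farthest indices from metric depth
print(relative_order)  # the same ordering, recovered from relative depth
```

Both rankings agree, but only `metric_depth` can answer how far away a point actually is.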

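In this guide, we will see how to run inference with Depth Anything V2, a state-of-the-art zero-shot relative depth estimation model, and ZoeDepth, an absolute depth estimation model.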

Check the depth estimation task page to see all compatible architectures and checkpoints.

Before we begin, we need to install the latest version of Transformers:

pip install -q -U transformers

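Depth estimation pipeline

The simplest way to try out inference with a model that supports depth estimation is to use the corresponding pipeline(). Instantiate a pipeline from a checkpoint on the Hugging Face Hub: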

>>> from transformers import pipeline
>>> import torch
>>> from accelerate.test_utils.testing import get_backend
>>> # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
>>> device, _, _ = get_backend()
>>> checkpoint = "depth-anything/Depth-Anything-V2-base-hf"
>>> pipe = pipeline("depth-estimation", model=checkpoint, device=device)

Next, choose an image to analyze:

>>> from PIL import Image
>>> import requests

>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image

Pass the image to the pipeline.

>>> predictions = pipe(image)

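The pipeline returns a dictionary with two entries. The first one, called predicted_depth, is a tensor holding the predicted depth for each pixel. With a relative depth checkpoint such as the Depth Anything V2 one used here, these values encode relative depth rather than distances in meters. The second one, called depth, is a PIL image that visualizes the depth estimation result.

Let's take a look at the visualized result: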

>>> predictions["depth"]

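Depth estimation inference by hand

Now that you've seen how to use the depth estimation pipeline, let's see how to carry out the same steps by hand.

Start by loading the model and its associated processor from a checkpoint on the Hugging Face Hub. Here we switch to a ZoeDepth checkpoint instead of the one used in the pipeline above: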

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> checkpoint = "Intel/zoedepth-nyu-kitti"

>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint).to(device)

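Prepare the image input for the model using the image_processor, which takes care of the necessary image transformations such as resizing and normalization: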

>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)
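
As a rough illustration of what the normalization step does, the sketch below rescales an 8-bit pixel to [0, 1] and then standardizes each channel. The helper `normalize_pixel` is hypothetical, and the `MEAN` and `STD` values are placeholders rather than the checkpoint's actual statistics, which are read from its preprocessor configuration.

```python
def normalize_pixel(rgb, mean, std):
    """Rescale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


# Placeholder statistics for illustration only; the real values come from
# the checkpoint's preprocessor configuration.
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)

black = normalize_pixel((0, 0, 0), MEAN, STD)
white = normalize_pixel((255, 255, 255), MEAN, STD)
print(black, white)
```

With these placeholder statistics, pixel values end up in the range [-1, 1].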

Pass the prepared inputs through the model:

>>> import torch

>>> with torch.no_grad():
...     outputs = model(pixel_values)

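Let's post-process the results to remove any padding and resize the depth map to match the original image size. `post_process_depth_estimation` outputs a list of dictionaries, each containing a `"predicted_depth"` entry.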

>>> # ZoeDepth dynamically pads the input image. Thus we pass the original image size as argument
>>> # to `post_process_depth_estimation` to remove the padding and resize to original dimensions.
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     source_sizes=[(image.height, image.width)],
... )

>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = (predicted_depth - predicted_depth.min()) / (predicted_depth.max() - predicted_depth.min())
>>> depth = depth.detach().cpu().numpy() * 255
>>> depth = Image.fromarray(depth.astype("uint8"))
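
The min-max scaling used above for visualization divides by `max - min`, which is zero for a perfectly constant depth map. The following pure-Python sketch shows the same scaling with a guard for that edge case. The helper `depth_to_uint8` and the sample values are hypothetical and operate on nested lists rather than tensors.

```python
def depth_to_uint8(depth_rows):
    """Min-max scale a 2D depth map (list of rows) to integers in [0, 255]."""
    flat = [v for row in depth_rows for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # Constant depth map: avoid dividing by zero and return all zeros.
        return [[0 for _ in row] for row in depth_rows]
    # int() truncates, like casting a non-negative float array to uint8.
    return [[int((v - lo) / span * 255) for v in row] for row in depth_rows]


# Hypothetical 2x2 depth map in meters, and a constant one.
gray = depth_to_uint8([[1.0, 2.0], [3.0, 5.0]])
flat_gray = depth_to_uint8([[4.0, 4.0], [4.0, 4.0]])
print(gray)
print(flat_gray)
```

Note that this scaling discards the metric units: the resulting image is only suitable for viewing, not for measuring distances.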

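In the original implementation, the ZoeDepth model performs inference on both the original and the flipped image and averages the results. The `post_process_depth_estimation` function can handle this for us if we pass the flipped outputs to the optional `outputs_flipped` argument: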

>>> with torch.no_grad():
...     outputs = model(pixel_values)
...     # flip along the width axis (dim 3 of a [batch, channels, height, width] tensor)
...     outputs_flipped = model(pixel_values=torch.flip(pixel_values, dims=[3]))
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     source_sizes=[(image.height, image.width)],
...     outputs_flipped=outputs_flipped,
... )
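
The idea behind flip averaging can be sketched on plain nested lists: the prediction for the mirrored image is mirrored back so it lines up with the original, and the two maps are then averaged element-wise. This is only an illustration of the concept, not the library's implementation. The helpers `flip_horizontal` and `average_with_flipped` and the sample values are hypothetical.

```python
def flip_horizontal(depth_rows):
    """Mirror a 2D map (list of rows) left to right."""
    return [list(reversed(row)) for row in depth_rows]


def average_with_flipped(pred, pred_from_flipped):
    """Average a prediction with the realigned prediction of the mirrored image."""
    realigned = flip_horizontal(pred_from_flipped)
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(pred, realigned)
    ]


# Hypothetical 2x3 predictions: one for the image, one for its mirror image.
pred = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
pred_from_flipped = [[3.5, 2.5, 1.5], [6.5, 5.5, 4.5]]

fused = average_with_flipped(pred, pred_from_flipped)
print(fused)
```

Averaging the two views tends to smooth out errors that depend on which side of the image an object appears on.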
