Depth Anything V2

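Overview

Depth Anything V2 was introduced in the paper of the same name by Lihe Yang et al. It uses the same architecture as the original Depth Anything model, but leverages synthetic data and a larger-capacity teacher model to achieve much finer and more robust depth predictions.

The abstract from the paper is the following:

This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.

Depth Anything overview. Taken from the original paper.

The Depth Anything models were contributed by nielsr. The original code can be found here.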

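Usage example

There are two main ways to use Depth Anything V2: either with the pipeline API, which abstracts away all the complexity for you, or by using the DepthAnythingForDepthEstimation class yourself.

Pipeline API

The pipeline lets you use the model in a few lines of code: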

>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests

>>> # load pipe
>>> pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

>>> # load image
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # inference
>>> depth = pipe(image)["depth"]

Using the model yourself

If you want to do the pre-processing and post-processing yourself, proceed as follows:

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("depth-anything/Depth-Anything-V2-Small-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("depth-anything/Depth-Anything-V2-Small-hf")

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size and visualize the prediction
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = (predicted_depth - predicted_depth.min()) / (predicted_depth.max() - predicted_depth.min())
>>> depth = depth.detach().cpu().numpy() * 255
>>> depth = Image.fromarray(depth.astype("uint8"))
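The min-max scaling in the last lines above divides by `predicted_depth.max() - predicted_depth.min()`, which is zero for a perfectly flat prediction. As a minimal sketch, the same scaling can be wrapped in a NumPy helper that guards against that case. The helper name `depth_to_uint8` and the `eps` threshold are illustrative assumptions and are not part of the library.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Min-max scale a depth map to the 0..255 range (hypothetical helper)."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < eps:
        # A constant map has no contrast to stretch; return all zeros
        # instead of dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)  # values in [0, 1]
    # Truncate to uint8, as the snippet above does with astype("uint8").
    return (scaled * 255.0).astype(np.uint8)


demo = np.array([[0.5, 1.0], [1.5, 2.5]], dtype=np.float32)
gray = depth_to_uint8(demo)
flat = depth_to_uint8(np.full((2, 2), 3.0))
```

The resulting array can be passed to `Image.fromarray` in the same way as in the snippet above.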

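Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Depth Anything.

Monocular depth estimation task guide
Depth Anything V2 demo
A Jupyter notebook showcasing inference with DepthAnythingForDepthEstimation can be found here. 🌎
Core ML conversion of the small variant for use on Apple Silicon

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.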

DepthAnythingConfig

class transformers.DepthAnythingConfig

( backbone_config = None backbone = None use_pretrained_backbone = False use_timm_backbone = False backbone_kwargs = None patch_size = 14 initializer_range = 0.02 reassemble_hidden_size = 384 reassemble_factors = [4, 2, 1, 0.5] neck_hidden_sizes = [48, 96, 192, 384] fusion_hidden_size = 64 head_in_index = -1 head_hidden_size = 32 depth_estimation_type = 'relative' max_depth = None **kwargs )

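Parameters

backbone_config (Union[dict[str, Any], PretrainedConfig], optional) — The configuration of the backbone model. Only used in case is_hybrid is True or in case you want to leverage the AutoBackbone API.
backbone (str, optional) — Name of the backbone to use when backbone_config is None. If use_pretrained_backbone is True, this will load the corresponding pretrained weights from the timm or transformers library. If use_pretrained_backbone is False, this loads the backbone's config and uses it to initialize the backbone with random weights.
use_pretrained_backbone (bool, optional, defaults to False) — Whether to use pretrained weights for the backbone.
use_timm_backbone (bool, optional, defaults to False) — Whether to use the timm library for the backbone. If set to False, the AutoBackbone API is used.
backbone_kwargs (dict, optional) — Keyword arguments to be passed to AutoBackbone when loading from a checkpoint, e.g. {'out_indices': (0, 1, 2, 3)}. Cannot be specified if backbone_config is set.
patch_size (int, optional, defaults to 14) — The size of the patches to extract from the backbone features.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
reassemble_hidden_size (int, optional, defaults to 384) — The number of input channels of the reassemble layers.
reassemble_factors (list[int], optional, defaults to [4, 2, 1, 0.5]) — The up/downsampling factors of the reassemble layers.
neck_hidden_sizes (list[str], optional, defaults to [48, 96, 192, 384]) — The hidden sizes to project to for the feature maps of the backbone.
fusion_hidden_size (int, optional, defaults to 64) — The number of channels before fusion.
head_in_index (int, optional, defaults to -1) — The index of the features to use in the depth estimation head.
head_hidden_size (int, optional, defaults to 32) — The number of output channels in the second convolution of the depth estimation head.
depth_estimation_type (str, optional, defaults to "relative") — The type of depth estimation to use. Can be one of ["relative", "metric"].
max_depth (float, optional) — The maximum depth to use for the "metric" depth estimation head. 20 should be used for indoor models and 80 for outdoor models. For "relative" depth estimation, this value is ignored.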
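To give a feel for how max_depth bounds a metric prediction, here is a small sketch in plain Python. It assumes that the metric head squashes its raw output with a sigmoid and multiplies the result by max_depth. That behavior is an assumption made for illustration and is not stated on this page, and the function names `sigmoid` and `metric_depth` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def metric_depth(raw_output: float, max_depth: float) -> float:
    # Assumed behavior: a sigmoid keeps the value in (0, 1), and the
    # multiplication maps it to the open interval (0, max_depth).
    return sigmoid(raw_output) * max_depth


indoor_mid = metric_depth(0.0, max_depth=20.0)   # halfway point of an indoor range
outdoor_far = metric_depth(6.0, max_depth=80.0)  # close to, but below, 80
```

Under this assumption, the recommended values of 20 for indoor models and 80 for outdoor models act as an upper bound on any predicted depth.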

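This is the configuration class to store the configuration of a DepthAnythingModel. It is used to instantiate a DepthAnything model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the DepthAnything LiheYoung/depth-anything-small-hf architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example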

>>> from transformers import DepthAnythingConfig, DepthAnythingForDepthEstimation

>>> # Initializing a DepthAnything small style configuration
>>> configuration = DepthAnythingConfig()

>>> # Initializing a model from the DepthAnything small style configuration
>>> model = DepthAnythingForDepthEstimation(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

to_dict

( )

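Serializes this instance to a Python dictionary. Overrides the default to_dict(). Returns: dict[str, any]: Dictionary of all the attributes that make up this configuration instance.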

DepthAnythingForDepthEstimation

class transformers.DepthAnythingForDepthEstimation

( config )

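Parameters

config (DepthAnythingConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Depth Anything model with a depth estimation head on top (consisting of 3 convolutional layers), e.g. for KITTI, NYUv2.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.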

forward

( pixel_values: FloatTensor labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

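Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained with an image processor, for example one loaded through AutoImageProcessor. See the image processor's __call__ method for details.
labels (torch.LongTensor of shape (batch_size, height, width), optional) — Ground truth depth estimation maps for computing the loss.
output_attentions (bool, optional) — Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether to return a ModelOutput instead of a plain tuple.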
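Because the default patch_size is 14, the spatial size of pixel_values determines how many patches the backbone sees. The arithmetic can be sketched as follows. The helper name `patch_grid` is hypothetical, and the sketch assumes that the image processor has already resized the image so that both sides are multiples of the patch size.

```python
def patch_grid(height: int, width: int, patch_size: int = 14) -> tuple[int, int]:
    """Return the (rows, columns) of the patch grid for an input of the given size."""
    if height % patch_size or width % patch_size:
        # Assumption of this sketch: sizes are multiples of the patch size.
        raise ValueError(
            f"expected multiples of {patch_size}, got {height}x{width}"
        )
    return height // patch_size, width // patch_size


rows, cols = patch_grid(518, 686)
num_patches = rows * cols
print(rows, cols, num_patches)
```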

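transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.DepthEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DepthAnythingConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
predicted_depth (torch.FloatTensor of shape (batch_size, height, width)) — Predicted depth for each pixel.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, num_channels, height, width). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.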
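When the output is consumed as a tuple, the position of predicted_depth depends on whether a loss is present, because fields that are None are skipped. The sketch below illustrates this by constructing DepthEstimatorOutput objects by hand with random tensors. No model is run, and the tensor shapes are arbitrary.

```python
import torch
from transformers.modeling_outputs import DepthEstimatorOutput

depth = torch.rand(1, 4, 6)  # (batch_size, height, width)

# Without labels there is no loss, so the tuple starts with predicted_depth.
without_loss = DepthEstimatorOutput(predicted_depth=depth)
tuple_without_loss = without_loss.to_tuple()

# With a loss, predicted_depth moves to index 1.
with_loss = DepthEstimatorOutput(loss=torch.tensor(0.5), predicted_depth=depth)
tuple_with_loss = with_loss.to_tuple()

print(len(tuple_without_loss), len(tuple_with_loss))
```

Accessing the field by name, as in `outputs.predicted_depth`, avoids this dependence on position.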

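The DepthAnythingForDepthEstimation forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre-processing and post-processing steps while the latter silently ignores them.

Example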

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("LiheYoung/depth-anything-small-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("LiheYoung/depth-anything-small-hf")

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 255 / predicted_depth.max()
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint8"))
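For readers who want to see what the "interpolate to original size" step amounts to, here is a hedged sketch that resizes a low-resolution depth map with `torch.nn.functional.interpolate`. It assumes bicubic interpolation with align_corners=False, which may differ from the settings used by post_process_depth_estimation. The helper name `resize_depth` is hypothetical.

```python
import torch
import torch.nn.functional as F


def resize_depth(predicted_depth: torch.Tensor, target_size: tuple[int, int]) -> torch.Tensor:
    """Resize a (height, width) depth map to target_size = (height, width)."""
    # interpolate expects a (batch, channels, height, width) tensor.
    batched = predicted_depth[None, None]
    resized = F.interpolate(
        batched, size=target_size, mode="bicubic", align_corners=False
    )
    return resized[0, 0]


low_res = torch.arange(12, dtype=torch.float32).reshape(3, 4)
full = resize_depth(low_res, (6, 8))
print(full.shape)
```

Bicubic interpolation can overshoot slightly beyond the input range, so normalizing after resizing, as the example above does, keeps the final image within 0 to 255.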