Prompt Depth Anything

Overview

The Prompt Depth Anything model was proposed in Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation by Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang.

The abstract from the paper is the following:

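Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets a new state of the art on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.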

Prompt Depth Anything overview. Taken from the original paper.

Usage example

The Transformers library allows you to use the model with just a few lines of code:

>>> import torch
>>> import requests
>>> import numpy as np

>>> from PIL import Image
>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/image.jpg?raw=true"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")

>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)
>>> # the prompt depth can be None, and the model will output a monocular relative depth.

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt", prompt_depth=prompt_depth)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 1000 
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint16")) # mm
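
The final lines above scale the prediction from meters to millimeters and cast it to uint16, which can only hold values from 0 to 65535 (about 65.5 m). The following is a minimal sketch, in plain Python rather than NumPy, of rounding and clamping before the cast so that out-of-range values saturate instead of wrapping. The helper name meters_to_uint16_mm is hypothetical and not part of Transformers.

```python
UINT16_MAX = 65535  # largest value a 16-bit unsigned integer can hold


def meters_to_uint16_mm(depth_m):
    """Convert a 2D list of depths in meters to integer millimeters.

    Hypothetical helper for illustration: values are rounded and clamped
    to the uint16 range [0, 65535] so they saturate instead of wrapping.
    """
    return [
        [min(max(int(round(value * 1000.0)), 0), UINT16_MAX) for value in row]
        for row in depth_m
    ]


depth_m = [[0.5, 1.234], [70.0, -0.1]]
depth_mm = meters_to_uint16_mm(depth_m)
print(depth_mm)
```

Depths beyond roughly 65.5 m need a different unit or a wider integer type if they must be stored exactly.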

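Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Prompt Depth Anything.

If you are interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.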

PromptDepthAnythingConfig

class transformers.PromptDepthAnythingConfig

( backbone_config = None backbone = None use_pretrained_backbone = False use_timm_backbone = False backbone_kwargs = None patch_size = 14 initializer_range = 0.02 reassemble_hidden_size = 384 reassemble_factors = [4, 2, 1, 0.5] neck_hidden_sizes = [48, 96, 192, 384] fusion_hidden_size = 64 head_in_index = -1 head_hidden_size = 32 depth_estimation_type = 'relative' max_depth = None **kwargs )

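Parameters

backbone_config (Union[dict[str, Any], PretrainedConfig], optional) — The configuration of the backbone model. Only used if is_hybrid is True or if you want to leverage the AutoBackbone API.
backbone (str, optional) — Name of the backbone to use when backbone_config is None. If use_pretrained_backbone is True, this loads the corresponding pretrained weights from the timm or transformers library. If use_pretrained_backbone is False, this loads the backbone's configuration and uses it to initialize the backbone with random weights.
use_pretrained_backbone (bool, optional, defaults to False) — Whether to use pretrained weights for the backbone.
use_timm_backbone (bool, optional, defaults to False) — Whether to use the timm library for the backbone. If set to False, the AutoBackbone API is used.
backbone_kwargs (dict, optional) — Keyword arguments passed to AutoBackbone when loading from a checkpoint, e.g. {'out_indices': (0, 1, 2, 3)}. Cannot be specified if backbone_config is set.
patch_size (int, optional, defaults to 14) — The size of the patches to extract from the backbone features.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer used to initialize all weight matrices.
reassemble_hidden_size (int, optional, defaults to 384) — The number of input channels of the reassemble layers.
reassemble_factors (list[Union[int, float]], optional, defaults to [4, 2, 1, 0.5]) — The up/downsampling factors of the reassemble layers.
neck_hidden_sizes (list[int], optional, defaults to [48, 96, 192, 384]) — The hidden sizes to project the backbone feature maps to.
fusion_hidden_size (int, optional, defaults to 64) — The number of channels before fusion.
head_in_index (int, optional, defaults to -1) — The index of the features to use in the depth estimation head.
head_hidden_size (int, optional, defaults to 32) — The number of output channels of the second convolution in the depth estimation head.
depth_estimation_type (str, optional, defaults to "relative") — The type of depth estimation to use. Can be one of ["relative", "metric"].
max_depth (float, optional) — The maximum depth to use for the "metric" depth estimation head. Use 20 for indoor models and 80 for outdoor models. This value is ignored for "relative" depth estimation.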
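
To make patch_size and reassemble_factors more concrete, here is a rough sketch of the spatial sizes involved. It assumes that the backbone produces one token per non-overlapping patch, that factors of 1 or more multiply the patch-grid size, and that factors below 1 shrink it with ceiling division. These are assumptions about a DPT-style neck rather than a statement of the library's exact implementation, and the function name reassembled_sizes is hypothetical.

```python
import math


def reassembled_sizes(image_size, patch_size=14, factors=(4, 2, 1, 0.5)):
    """Return the patch-grid side and the approximate side of each reassembled map."""
    if image_size % patch_size != 0:
        raise ValueError("image_size should be a multiple of patch_size")
    grid = image_size // patch_size  # patches per side
    sizes = []
    for factor in factors:
        if factor >= 1:
            sizes.append(int(grid * factor))  # upsample (or keep) the grid
        else:
            sizes.append(math.ceil(grid * factor))  # downsample, rounding up
    return grid, sizes


grid, sizes = reassembled_sizes(518)
print(grid, sizes)
```

With a 518-pixel side and the default patch size of 14, the grid has 37 patches per side, which is why input sizes that are multiples of 14 are convenient for this family of models.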

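This is the configuration class to store the configuration of a PromptDepthAnythingModel. It is used to instantiate a PromptDepthAnything model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the PromptDepthAnything LiheYoung/depth-anything-small-hf architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example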

>>> from transformers import PromptDepthAnythingConfig, PromptDepthAnythingForDepthEstimation

>>> # Initializing a PromptDepthAnything small style configuration
>>> configuration = PromptDepthAnythingConfig()

>>> # Initializing a model from the PromptDepthAnything small style configuration
>>> model = PromptDepthAnythingForDepthEstimation(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

to_dict

( )

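Serializes this instance to a Python dictionary. Overrides the default to_dict(). Returns: dict[str, Any]: Dictionary of all the attributes that make up this configuration instance.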
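
Overriding to_dict() matters when a configuration holds another configuration object, such as backbone_config, that is not directly JSON-serializable. The sketch below shows the general pattern with two hypothetical stand-in classes, ToyBackboneConfig and ToyDepthConfig. It is an illustration of the idea, not the Transformers implementation.

```python
import copy
import json


class ToyBackboneConfig:
    """Hypothetical stand-in for a nested backbone configuration."""

    def __init__(self, hidden_size=384):
        self.hidden_size = hidden_size

    def to_dict(self):
        return copy.deepcopy(self.__dict__)


class ToyDepthConfig:
    """Hypothetical stand-in for a config that owns a backbone config."""

    def __init__(self, backbone_config=None, patch_size=14):
        self.backbone_config = backbone_config
        self.patch_size = patch_size

    def to_dict(self):
        # Copy plain attributes, then replace the nested object with its dict form.
        output = copy.deepcopy(self.__dict__)
        if self.backbone_config is not None:
            output["backbone_config"] = self.backbone_config.to_dict()
        return output


config = ToyDepthConfig(backbone_config=ToyBackboneConfig())
print(json.dumps(config.to_dict(), sort_keys=True))
```

Without the override, the nested object would remain in the dictionary as a Python instance and json.dumps would fail on it.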

PromptDepthAnythingForDepthEstimation

class transformers.PromptDepthAnythingForDepthEstimation

( config )

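Parameters

config (PromptDepthAnythingConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Prompt Depth Anything model with a depth estimation head on top (consisting of 3 convolutional layers), e.g. for KITTI, NYUv2.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module and refer to the PyTorch documentation for all matters related to general usage and behavior.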

forward

( pixel_values: FloatTensor prompt_depth: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

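Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using PromptDepthAnythingImageProcessor. See PromptDepthAnythingImageProcessor.__call__() for details.
prompt_depth (torch.FloatTensor of shape (batch_size, 1, height, width), optional) — The prompt depth is a sparse or low-resolution depth map obtained from multi-view geometry or a low-resolution depth sensor. It generally has shape (height, width), where height and width can be smaller than those of the image. It is optional and can be None, in which case no prompt depth is used and the output is a monocular relative depth. Values are recommended to be in meters, but this is not required.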
labels (torch.LongTensor of shape (batch_size, height, width), optional) — Ground truth depth estimation maps for computing the loss.
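output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.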

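transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.DepthEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PromptDepthAnythingConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
predicted_depth (torch.FloatTensor of shape (batch_size, height, width)) — Predicted depth for each pixel.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, num_channels, height, width).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The PromptDepthAnythingForDepthEstimation forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.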
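
The difference between calling the instance and calling forward directly can be illustrated with a toy class. ToyModule below is a hypothetical, heavily simplified stand-in for torch.nn.Module: its __call__ runs registered hooks around forward, so calling forward directly skips them.

```python
class ToyModule:
    """Hypothetical, simplified stand-in for torch.nn.Module."""

    def __init__(self):
        self.pre_hooks = []
        self.post_hooks = []

    def forward(self, x):
        return x * 2

    def __call__(self, x):
        for hook in self.pre_hooks:  # pre-processing steps
            x = hook(x)
        out = self.forward(x)
        for hook in self.post_hooks:  # post-processing steps
            out = hook(out)
        return out


module = ToyModule()
module.pre_hooks.append(lambda x: x + 1)
module.post_hooks.append(lambda out: out - 3)

via_call = module(5)             # hooks run: ((5 + 1) * 2) - 3
via_forward = module.forward(5)  # hooks skipped: 5 * 2
print(via_call, via_forward)
```

In practice this means writing model(**inputs), as in the examples on this page, rather than model.forward(**inputs).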

Example

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> import requests

>>> url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/image.jpg?raw=true"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")

>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt", prompt_depth=prompt_depth)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 1000.
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint16")) # mm

PromptDepthAnythingImageProcessor

class transformers.PromptDepthAnythingImageProcessor

( do_resize: bool = True size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BICUBIC: 3> keep_aspect_ratio: bool = False ensure_multiple_of: int = 1 do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_pad: bool = False size_divisor: typing.Optional[int] = None prompt_scale_to_meter: float = 0.001 **kwargs )

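Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions. Can be overridden by do_resize in preprocess.
size (dict[str, int], optional, defaults to {"height": 384, "width": 384}) — Size of the image after resizing. Can be overridden by size in preprocess.
resample (PILImageResampling, optional, defaults to Resampling.BICUBIC) — Resampling filter to use if resizing the image. Can be overridden by resample in preprocess.
keep_aspect_ratio (bool, optional, defaults to False) — If True, the image is resized to the largest possible size such that the aspect ratio is preserved. Can be overridden by keep_aspect_ratio in preprocess.
ensure_multiple_of (int, optional, defaults to 1) — If do_resize is True, the image is resized to a size that is a multiple of this value. Can be overridden by ensure_multiple_of in preprocess.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in preprocess.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in preprocess.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or list[float], optional, defaults to IMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to IMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to False) — Whether to apply center padding. This was introduced in the DINOv2 paper, which uses the model in combination with DPT.
size_divisor (int, optional) — If do_pad is True, pads the image dimensions to be divisible by this value. This was introduced in the DINOv2 paper, which uses the model in combination with DPT.
prompt_scale_to_meter (float, optional, defaults to 0.001) — Scale factor used to convert the prompt depth to meters.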
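
As a rough illustration of how size, keep_aspect_ratio and ensure_multiple_of interact, the sketch below computes an output size in plain Python. It follows a common DPT-style approach: when preserving the aspect ratio, pick the single scale that changes the image least, then round each side to the nearest multiple. The function names are hypothetical, and the library's actual rounding rules may differ.

```python
def round_to_multiple(value, multiple):
    """Round value to the nearest positive multiple of `multiple`."""
    return max(multiple, round(value / multiple) * multiple)


def resize_output_size(input_size, target_size, keep_aspect_ratio=False, multiple=1):
    """Return (height, width) after resizing; sizes are (height, width) tuples."""
    in_h, in_w = input_size
    out_h, out_w = target_size
    scale_h = out_h / in_h
    scale_w = out_w / in_w
    if keep_aspect_ratio:
        # Use the scale closest to 1 for both sides so the shape is preserved.
        if abs(1 - scale_w) < abs(1 - scale_h):
            scale_h = scale_w
        else:
            scale_w = scale_h
    return (
        round_to_multiple(scale_h * in_h, multiple),
        round_to_multiple(scale_w * in_w, multiple),
    )


print(resize_output_size((480, 640), (518, 518), keep_aspect_ratio=True, multiple=14))
print(resize_output_size((480, 640), (518, 518), keep_aspect_ratio=False, multiple=14))
```

In this sketch, a 480x640 image with a 518x518 target keeps its 3:4 shape only when keep_aspect_ratio is True; otherwise both sides are forced to the target.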

Constructs a PromptDepthAnything image processor.
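
When do_pad is enabled, each spatial dimension is padded up to a multiple of size_divisor. The sketch below shows one way to split that padding between the two sides for center padding. center_pad_amounts is a hypothetical helper, and placing any odd leftover pixel on the trailing side is an assumption.

```python
import math


def center_pad_amounts(size, size_divisor):
    """Return (before, after) padding that makes `size` divisible by `size_divisor`."""
    padded = math.ceil(size / size_divisor) * size_divisor
    total = padded - size
    before = total // 2
    after = total - before  # any odd remainder goes on the trailing side
    return before, after


for side in (500, 517, 518):
    print(side, center_pad_amounts(side, 14))
```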

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] prompt_depth: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None do_resize: typing.Optional[bool] = None size: typing.Optional[int] = None keep_aspect_ratio: typing.Optional[bool] = None ensure_multiple_of: typing.Optional[int] = None resample: typing.Optional[PIL.Image.Resampling] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_pad: typing.Optional[bool] = None size_divisor: typing.Optional[int] = None prompt_scale_to_meter: typing.Optional[float] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

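Parameters

images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
prompt_depth (ImageInput, optional) — Prompt depth to preprocess, which can be sparse depth obtained from multi-view geometry or low-resolution depth from a depth sensor. It generally has shape (height, width), where height and width can be smaller than those of the image. It is optional and can be None, in which case no prompt depth is used and the output depth is a monocular relative depth. It is recommended to provide a prompt_scale_to_meter value, the scale factor used to convert the prompt depth to meters. This is useful when the prompt depth is not in meters.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (dict[str, int], optional, defaults to self.size) — Size of the image after resizing. If keep_aspect_ratio is True, the image is resized to the largest possible size such that the aspect ratio is preserved. If ensure_multiple_of is set, the image is resized to a size that is a multiple of this value.
keep_aspect_ratio (bool, optional, defaults to self.keep_aspect_ratio) — Whether to keep the aspect ratio of the image. If False, the image is resized to (size, size). If True, the image is resized to keep the aspect ratio, and the size is the largest possible.
ensure_multiple_of (int, optional, defaults to self.ensure_multiple_of) — Ensures that the image size is a multiple of this value.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean) — Image mean.
image_std (float or list[float], optional, defaults to self.image_std) — Image standard deviation.
prompt_scale_to_meter (float, optional, defaults to self.prompt_scale_to_meter) — Scale factor used to convert the prompt depth to meters.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.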

Preprocess an image or a batch of images.
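
The value-level steps of preprocessing can be sketched in a few lines of plain Python: rescaling multiplies pixel values by rescale_factor, normalization subtracts the mean and divides by the standard deviation per channel, and the prompt depth is multiplied by prompt_scale_to_meter. The mean and standard deviation of 0.5 used here are illustrative assumptions, and the helper names are hypothetical.

```python
RESCALE_FACTOR = 1 / 255        # default rescale_factor
PROMPT_SCALE_TO_METER = 0.001   # default: prompt depth stored in millimeters
MEAN = (0.5, 0.5, 0.5)          # assumed per-channel mean (illustrative)
STD = (0.5, 0.5, 0.5)           # assumed per-channel std (illustrative)


def preprocess_pixel(rgb):
    """Rescale a 0-255 RGB pixel to [0, 1], then normalize each channel."""
    rescaled = [channel * RESCALE_FACTOR for channel in rgb]
    return [(value - mean) / std for value, mean, std in zip(rescaled, MEAN, STD)]


def prompt_depth_to_meters(depth_values, scale=PROMPT_SCALE_TO_METER):
    """Convert raw prompt depth readings to meters."""
    return [value * scale for value in depth_values]


pixel = preprocess_pixel((255, 0, 128))
meters = prompt_depth_to_meters([1500, 2750])
print(pixel)
print(meters)
```

If the depth sensor already reports meters, passing prompt_scale_to_meter=1.0 would leave the values unchanged under this scheme.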

post_process_depth_estimation

( outputs: DepthEstimatorOutput target_sizes: typing.Union[transformers.utils.generic.TensorType, list[tuple[int, int]], NoneType] = None ) → list[dict[str, TensorType]]

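Parameters

outputs (DepthEstimatorOutput) — Raw outputs of the model.
target_sizes (TensorType or list[tuple[int, int]], optional) — Tensor of shape (batch_size, 2) or list of tuples (tuple[int, int]) containing the target size (height, width) of each image in the batch. If left as None, predictions are not resized.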

list[dict[str, TensorType]]

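A list of dictionaries of tensors representing the processed depth predictions.

Converts the raw output of DepthEstimatorOutput into final depth predictions and depth PIL images. Only supports PyTorch.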
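
Passing target_sizes resizes each predicted depth map to the requested (height, width). As a self-contained illustration of that step, the sketch below resizes a small depth map with bilinear interpolation in plain Python. This is a swapped-in, simplified technique chosen for readability: the interpolation mode used by the library may differ, and resize_bilinear is a hypothetical helper.

```python
def resize_bilinear(depth, out_h, out_w):
    """Resize a 2D list of floats to (out_h, out_w) with bilinear interpolation."""
    in_h, in_w = len(depth), len(depth[0])
    resized = []
    for i in range(out_h):
        # Map the output pixel center back to input coordinates, then clamp.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = depth[y0][x0] * (1 - wx) + depth[y0][x1] * wx
            bottom = depth[y1][x0] * (1 - wx) + depth[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        resized.append(row)
    return resized


low_res = [[1.0, 2.0], [3.0, 4.0]]
target_height, target_width = 4, 4
full_res = resize_bilinear(low_res, target_height, target_width)
for row in full_res:
    print([round(value, 2) for value in row])
```

Because interpolation only blends neighboring values, the resized map stays within the range of the original prediction in this sketch.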
