Depth Anything

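Depth Anything is designed to be a foundation model for monocular depth estimation (MDE). It is jointly trained on labeled images and roughly 62 million unlabeled images, which greatly enlarges the training set. It uses a pretrained DINOv2 model as the image encoder to inherit its rich semantic priors, and DPT as the decoder. A teacher model, trained on the labeled images, generates pseudo-labels for the unlabeled images. The student model is then trained on a combination of the pseudo-labeled and labeled images. To improve the student's performance, strong perturbations are applied to the unlabeled images, which pushes the student to learn more visual knowledge from them.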
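The teacher and student scheme described above can be sketched on toy data. This is only an illustration under loud assumptions: the tiny convolutional network stands in for the DINOv2 encoder and DPT decoder, random tensors stand in for the datasets, an L1 loss stands in for the real training objective, and brightness scaling plus Gaussian noise stands in for the strong perturbations. None of the helper names come from the library.

```python
import torch
from torch import nn

torch.manual_seed(0)


def tiny_depth_net() -> nn.Module:
    # Hypothetical stand-in for the real DINOv2 encoder + DPT decoder.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 1, kernel_size=3, padding=1),
    )


def strong_perturbation(images: torch.Tensor) -> torch.Tensor:
    # Stand-in perturbation: random per-image brightness scaling plus Gaussian noise.
    scale = 0.5 + torch.rand(images.shape[0], 1, 1, 1)
    return images * scale + 0.1 * torch.randn_like(images)


def train_step(model, optimizer, images, depths) -> float:
    # One optimization step with an L1 loss (an assumption, chosen for brevity).
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(images), depths)
    loss.backward()
    optimizer.step()
    return loss.item()


# Random tensors stand in for the labeled and unlabeled datasets.
labeled_images = torch.rand(4, 3, 32, 32)
labeled_depths = torch.rand(4, 1, 32, 32)
unlabeled_images = torch.rand(8, 3, 32, 32)

# 1) Train the teacher on labeled data only.
teacher = tiny_depth_net()
teacher_opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(5):
    train_step(teacher, teacher_opt, labeled_images, labeled_depths)

# 2) The frozen teacher assigns pseudo-labels to the clean unlabeled images.
with torch.no_grad():
    pseudo_depths = teacher(unlabeled_images)

# 3) The student sees perturbed unlabeled images but must match the clean pseudo-labels.
student = tiny_depth_net()
student_opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(5):
    images = torch.cat([labeled_images, strong_perturbation(unlabeled_images)])
    depths = torch.cat([labeled_depths, pseudo_depths])
    student_loss = train_step(student, student_opt, images, depths)
```

The key design point is the asymmetry in step 3: pseudo-labels are computed on clean images, while the student is fed perturbed versions of the same images, so it cannot simply copy the teacher's behavior.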

You can find all the original Depth Anything checkpoints under the Depth Anything collection.

Click on the Depth Anything models in the right sidebar for more examples of how to apply Depth Anything to different vision tasks.

The examples below demonstrate how to obtain a depth map with Pipeline or the AutoModel class.

Pipeline
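A minimal sketch of the Pipeline route, assuming the `LiheYoung/depth-anything-small-hf` checkpoint referenced elsewhere on this page. The helper names `load_depth_pipeline` and `estimate_depth` are hypothetical wrappers introduced here for illustration; only `pipeline(task="depth-estimation", ...)` is the library call.

```python
from transformers import pipeline


def load_depth_pipeline(checkpoint: str = "LiheYoung/depth-anything-small-hf"):
    """Build a depth-estimation pipeline (downloads the checkpoint on first use)."""
    return pipeline(task="depth-estimation", model=checkpoint)


def estimate_depth(depth_pipeline, image):
    """Run the pipeline on a PIL image, a local path, or an image URL."""
    result = depth_pipeline(image)
    # "depth" is a PIL image meant for viewing; "predicted_depth" is the raw tensor.
    return result["depth"], result["predicted_depth"]
```

For example, `estimate_depth(load_depth_pipeline(), "http://images.cocodataset.org/val2017/000000039769.jpg")` should return the depth image and the raw prediction for the sample image used in the forward example on this page.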

AutoModel

Notes

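DepthAnythingV2, released in June 2024, uses the same architecture as Depth Anything and is compatible with all code examples and existing workflows. It uses synthetic data and a larger-capacity teacher model to achieve finer and more robust depth predictions.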
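Because the architecture is shared, switching to V2 should only require a different checkpoint name. The sketch below assumes the V2 small checkpoint is published as `depth-anything/Depth-Anything-V2-Small-hf`; that name and the helper `load_depth_anything` are assumptions for illustration, not part of the text above.

```python
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

V1_CHECKPOINT = "LiheYoung/depth-anything-small-hf"
# Assumed name of a V2 checkpoint on the Hub; replace it with the one you actually use.
V2_CHECKPOINT = "depth-anything/Depth-Anything-V2-Small-hf"


def load_depth_anything(checkpoint: str = V2_CHECKPOINT):
    """Load the image processor and model for a V1 or V2 checkpoint alike."""
    image_processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForDepthEstimation.from_pretrained(checkpoint)
    return image_processor, model
```

The returned pair can be used exactly like `image_processor` and `model` in the forward example on this page.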

DepthAnythingConfig

class transformers.DepthAnythingConfig

( backbone_config = None backbone = None use_pretrained_backbone = False use_timm_backbone = False backbone_kwargs = None patch_size = 14 initializer_range = 0.02 reassemble_hidden_size = 384 reassemble_factors = [4, 2, 1, 0.5] neck_hidden_sizes = [48, 96, 192, 384] fusion_hidden_size = 64 head_in_index = -1 head_hidden_size = 32 depth_estimation_type = 'relative' max_depth = None **kwargs )

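Parameters

backbone_config (Union[dict[str, Any], PretrainedConfig], optional): The configuration of the backbone model. Only used if is_hybrid is True or if you want to leverage the AutoBackbone API.
backbone (str, optional): Name of the backbone to use when backbone_config is None. If use_pretrained_backbone is True, the corresponding pretrained weights are loaded from the timm or transformers library. If use_pretrained_backbone is False, only the backbone's configuration is loaded and the backbone is initialized with random weights.
use_pretrained_backbone (bool, optional, defaults to False): Whether to use pretrained weights for the backbone.
use_timm_backbone (bool, optional, defaults to False): Whether to use the timm library for the backbone. If set to False, the AutoBackbone API is used.
backbone_kwargs (dict, optional): Keyword arguments passed to AutoBackbone when loading from a checkpoint, e.g. {'out_indices': (0, 1, 2, 3)}. Cannot be specified if backbone_config is set.
patch_size (int, optional, defaults to 14): The size of the patches to extract from the backbone features.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer used to initialize all weight matrices.
reassemble_hidden_size (int, optional, defaults to 384): The number of input channels of the reassemble layers.
reassemble_factors (list[float], optional, defaults to [4, 2, 1, 0.5]): The up/downsampling factors of the reassemble layers.
neck_hidden_sizes (list[int], optional, defaults to [48, 96, 192, 384]): The hidden sizes to project the backbone feature maps to.
fusion_hidden_size (int, optional, defaults to 64): The number of channels before fusion.
head_in_index (int, optional, defaults to -1): The index of the features to use in the depth estimation head.
head_hidden_size (int, optional, defaults to 32): The number of output channels of the second convolution in the depth estimation head.
depth_estimation_type (str, optional, defaults to "relative"): The type of depth estimation to use. Can be one of ["relative", "metric"].
max_depth (float, optional): The maximum depth to use for the "metric" depth estimation head. Use 20 for indoor models and 80 for outdoor models. This value is ignored for "relative" depth estimation.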

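This is the configuration class to store the configuration of a DepthAnythingModel. It is used to instantiate a DepthAnything model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the DepthAnything LiheYoung/depth-anything-small-hf architecture.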

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example

>>> from transformers import DepthAnythingConfig, DepthAnythingForDepthEstimation

>>> # Initializing a DepthAnything small style configuration
>>> configuration = DepthAnythingConfig()

>>> # Initializing a model from the DepthAnything small style configuration
>>> model = DepthAnythingForDepthEstimation(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

to_dict

( )

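Serializes this instance to a Python dictionary, overriding the default to_dict(). Returns: dict[str, Any]: a dictionary of all the attributes that make up this configuration instance.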

DepthAnythingForDepthEstimation

class transformers.DepthAnythingForDepthEstimation

( config )

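Parameters

config (DepthAnythingConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.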

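Depth Anything model with a depth estimation head on top (consisting of 3 convolutional layers), e.g. for KITTI and NYUv2.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module and refer to the PyTorch documentation for all matters related to general usage and behavior.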

forward

( pixel_values: FloatTensor labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

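Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)): The tensors corresponding to the input images. Pixel values can be obtained with the checkpoint's image processor, for example one loaded through AutoImageProcessor. See the image processor's __call__ method for details.
labels (torch.LongTensor of shape (batch_size, height, width), optional): Ground truth depth estimation maps for computing the loss.
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See attentions under the returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See hidden_states under the returned tensors for more detail.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.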

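Returns

transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.DepthEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DepthAnythingConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification (or regression if config.num_labels==1) loss.
predicted_depth (torch.FloatTensor of shape (batch_size, height, width)): Predicted depth for each pixel.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, num_channels, height, width). These are the hidden states of the model at the output of each layer, plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). These are the attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.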

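The DepthAnythingForDepthEstimation forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, you should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.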

Example

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("LiheYoung/depth-anything-small-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("LiheYoung/depth-anything-small-hf")

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 255 / predicted_depth.max()
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint8"))
