DepthPro

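Overview

The DepthPro model was proposed in Depth Pro: Sharp Monocular Metric Depth in Less Than a Second by Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun.

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained detail. It employs a multi-scale Vision Transformer (ViT)-based architecture in which images are downsampled, divided into patches, and processed by a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined by a DPT-like fusion stage, enabling precise depth estimation.

The abstract from the paper is the following:

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions.

DepthPro outputs. Taken from the official code.

This model was contributed by geetu040. The original code is available in the apple/ml-depth-pro repository.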

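Usage tips

The DepthPro model processes an input image by first downsampling it at multiple scales and splitting each scaled version into patches. These patches are then encoded by a shared Vision Transformer (ViT)-based Dinov2 patch encoder, while the full image is processed by a separate image encoder. The extracted patch features are merged into feature maps, upsampled, and fused by a DPT-like decoder to generate the final depth estimation. If enabled, an additional Field of View (FOV) encoder processes the image to estimate the camera's field of view, which helps improve depth accuracy.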

>>> import requests
>>> from PIL import Image
>>> import torch
>>> from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = DepthProImageProcessorFast.from_pretrained("apple/DepthPro-hf")
>>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf").to(device)

>>> inputs = image_processor(images=image, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs, target_sizes=[(image.height, image.width)],
... )

>>> # FOV outputs are available because this checkpoint uses the FOV encoder
>>> field_of_view = post_processed_output[0]["field_of_view"]
>>> focal_length = post_processed_output[0]["focal_length"]
>>> depth = post_processed_output[0]["predicted_depth"]
>>> # min-max normalize to [0, 1], then scale to 8-bit for visualization
>>> depth = (depth - depth.min()) / (depth.max() - depth.min())
>>> depth = depth * 255.
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint8"))

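Architecture and configuration

DepthPro architecture. Taken from the original paper.

The DepthProForDepthEstimation model uses a DepthProEncoder to encode the input image and a FeatureFusionStage to fuse the encoder's output features.

The DepthProEncoder in turn uses two encoders:

patch_encoder
- The input image is rescaled to multiple ratios, as specified by the scaled_images_ratios configuration.
- Each scaled image is split into smaller patches of size patch_size, with the overlap between patches determined by scaled_images_overlap_ratios.
- These patches are processed by the patch_encoder.
image_encoder
- The input image is also rescaled to patch_size and processed by the image_encoder.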
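The patch splitting described above can be sketched in plain Python. This is only an illustration: it assumes a 1536x1536 input (the image processor's default size), a sliding window whose stride is patch_size * (1 - overlap_ratio), and square images. The helper name patches_per_side is hypothetical and not part of the library.

```python
def patches_per_side(image_size: int, patch_size: int, overlap_ratio: float) -> int:
    """Number of sliding-window positions along one side of a square image."""
    if image_size == patch_size:
        return 1  # the whole image is a single patch
    # Assumption: consecutive patches are shifted by this stride.
    stride = int(patch_size * (1 - overlap_ratio))
    return (image_size - patch_size) // stride + 1


image_size = 1536  # assumed input resolution (default processor size)
patch_size = 384
scaled_images_ratios = [0.25, 0.5, 1]
scaled_images_overlap_ratios = [0.0, 0.5, 0.25]

patch_counts = []
for ratio, overlap in zip(scaled_images_ratios, scaled_images_overlap_ratios):
    scaled_size = int(image_size * ratio)
    per_side = patches_per_side(scaled_size, patch_size, overlap)
    patch_counts.append(per_side**2)

total_patches = sum(patch_counts)
print(patch_counts, total_patches)  # [1, 9, 25] 35
```

Under these assumptions the three scales contribute 1, 9, and 25 patches, for 35 in total, which agrees with the second dimension of the last_hidden_state shape printed in the DepthProModel example on this page.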

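Both encoders can be configured via patch_model_config and image_model_config, each of which defaults to a separate Dinov2Model.

The outputs of both encoders (last_hidden_state) and selected intermediate states (hidden_states) from the patch_encoder are fused by a DPT-based FeatureFusionStage for depth estimation.

Field of View (FOV) prediction

The network is supplemented with a focal length estimation head. A small convolutional head ingests frozen features from the depth estimation network and task-specific features from a separate ViT image encoder to predict the horizontal angular field of view.

The use_fov_model parameter in DepthProConfig controls whether FOV prediction is enabled. By default it is set to False to save memory and computation. When enabled, the FOV encoder is instantiated based on the fov_model_config parameter, which defaults to a Dinov2Model. The use_fov_model parameter can also be passed when initializing the DepthProForDepthEstimation model.

The pretrained model at the apple/DepthPro-hf checkpoint uses the FOV encoder. To use the pretrained model without the FOV encoder, pass use_fov_model=False when loading the model, which saves computation.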

>>> from transformers import DepthProForDepthEstimation
>>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=False)

To instantiate a new model with the FOV encoder, set use_fov_model=True in the config.

>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig(use_fov_model=True)
>>> model = DepthProForDepthEstimation(config)

Or set use_fov_model=True when initializing the model, which overrides the value in the config.

>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig()
>>> model = DepthProForDepthEstimation(config, use_fov_model=True)

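Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.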
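For intuition, the quantity that the SDPA operator computes, softmax(Q K^T / sqrt(d)) V, can be written out in plain Python. This is a minimal reference sketch for small lists of floats, not the fused PyTorch kernel; it ignores masking, dropout, batching, and multiple heads.

```python
import math


def scaled_dot_product_attention(query, key, value):
    """Compute softmax(Q K^T / sqrt(d)) V for 2-D lists of floats."""
    d = len(query[0])
    output = []
    for q in query:
        # Scaled similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in key]
        # Numerically stable softmax over the scores.
        max_score = max(scores)
        weights = [math.exp(s - max_score) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, value)) for j in range(len(value[0]))]
        )
    return output


# Two identical keys receive equal weight, so the result is the mean of the values.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
result = scaled_dot_product_attention(q, k, v)
print(result)
```

In practice the native operator is preferable because it can dispatch to fused, memory-efficient kernels; the sketch only shows the arithmetic.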

SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA.

import torch
from transformers import DepthProForDepthEstimation
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", attn_implementation="sdpa", torch_dtype=torch.float16)

For the best speedups, we recommend loading the model in half precision (e.g. `torch.float16` or `torch.bfloat16`).

On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with float32 and the google/vit-base-patch16-224 model, we saw the following speedups during inference.

Batch size	Average inference time (ms), eager mode	Average inference time (ms), SDPA mode	Speedup, eager / SDPA (x)
1	7	6	1.17
2	8	6	1.33
4	8	6	1.33
8	8	6	1.33

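Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DepthPro:

Research paper: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Official implementation: apple/ml-depth-pro
DepthPro inference notebook: DepthPro Inference
DepthPro for super resolution and image segmentation
- Read the blog on Medium: Depth Pro: Beyond Depth
- Code on GitHub: geetu040/depthpro-beyond-depth

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.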

DepthProConfig

class transformers.DepthProConfig

( fusion_hidden_size = 256 patch_size = 384 initializer_range = 0.02 intermediate_hook_ids = [11, 5] intermediate_feature_dims = [256, 256] scaled_images_ratios = [0.25, 0.5, 1] scaled_images_overlap_ratios = [0.0, 0.5, 0.25] scaled_images_feature_dims = [1024, 1024, 512] merge_padding_value = 3 use_batch_norm_in_fusion_residual = False use_bias_in_fusion_residual = True use_fov_model = False num_fov_head_layers = 2 image_model_config = None patch_model_config = None fov_model_config = None **kwargs )

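Parameters

fusion_hidden_size (int, optional, defaults to 256): The number of channels before fusion.
patch_size (int, optional, defaults to 384): The size (resolution) of each patch. This is also the image size of the backbone model.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
intermediate_hook_ids (list[int], optional, defaults to [11, 5]): Indices of the intermediate hidden states from the patch encoder to use for fusion.
intermediate_feature_dims (list[int], optional, defaults to [256, 256]): Hidden state dimensions during upsampling for each intermediate hidden state in intermediate_hook_ids.
scaled_images_ratios (list[float], optional, defaults to [0.25, 0.5, 1]): Ratios of the scaled images used by the patch encoder.
scaled_images_overlap_ratios (list[float], optional, defaults to [0.0, 0.5, 0.25]): Overlap ratios between patches for each scaled image in scaled_images_ratios.
scaled_images_feature_dims (list[int], optional, defaults to [1024, 1024, 512]): Hidden state dimensions during upsampling for each scaled image in scaled_images_ratios.
merge_padding_value (int, optional, defaults to 3): When merging smaller patches back to the image size, overlapping sections of this size are removed.
use_batch_norm_in_fusion_residual (bool, optional, defaults to False): Whether to use batch normalization in the pre-activate residual units of the fusion blocks.
use_bias_in_fusion_residual (bool, optional, defaults to True): Whether to use bias in the pre-activate residual units of the fusion blocks.
use_fov_model (bool, optional, defaults to False): Whether to use DepthProFovModel to generate the field of view.
num_fov_head_layers (int, optional, defaults to 2): Number of convolution layers in the head of DepthProFovModel.
image_model_config (Union[dict[str, Any], PretrainedConfig], optional): The configuration of the image encoder model, which is loaded using the AutoModel API. By default, a Dinov2 model is used as the backbone.
patch_model_config (Union[dict[str, Any], PretrainedConfig], optional): The configuration of the patch encoder model, which is loaded using the AutoModel API. By default, a Dinov2 model is used as the backbone.
fov_model_config (Union[dict[str, Any], PretrainedConfig], optional): The configuration of the FOV encoder model, which is loaded using the AutoModel API. By default, a Dinov2 model is used as the backbone.

This is the configuration class to store the configuration of a DepthProModel. It is used to instantiate a DepthPro model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the DepthPro apple/DepthPro architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.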
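Several of the list-valued parameters above are described as parallel lists (one entry per scaled image, or one per intermediate hook). The following sketch shows a sanity check one might run before building a custom configuration. The function check_depth_pro_lists is hypothetical and not part of the library, and the rule that overlap ratios lie in [0, 1) is an assumption made here so that the patch stride stays positive.

```python
def check_depth_pro_lists(
    scaled_images_ratios,
    scaled_images_overlap_ratios,
    scaled_images_feature_dims,
    intermediate_hook_ids,
    intermediate_feature_dims,
):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    n = len(scaled_images_ratios)
    if len(scaled_images_overlap_ratios) != n:
        problems.append("scaled_images_overlap_ratios needs one entry per ratio")
    if len(scaled_images_feature_dims) != n:
        problems.append("scaled_images_feature_dims needs one entry per ratio")
    if len(intermediate_hook_ids) != len(intermediate_feature_dims):
        problems.append("intermediate_feature_dims needs one entry per hook id")
    # Assumption: an overlap of 1.0 or more would leave no room to move the window.
    if any(not 0.0 <= r < 1.0 for r in scaled_images_overlap_ratios):
        problems.append("overlap ratios are expected to lie in [0, 1)")
    return problems


# The documented default values pass the check.
defaults = dict(
    scaled_images_ratios=[0.25, 0.5, 1],
    scaled_images_overlap_ratios=[0.0, 0.5, 0.25],
    scaled_images_feature_dims=[1024, 1024, 512],
    intermediate_hook_ids=[11, 5],
    intermediate_feature_dims=[256, 256],
)
default_problems = check_depth_pro_lists(**defaults)

# Dropping one feature dimension is reported as a mismatch.
broken = dict(defaults, scaled_images_feature_dims=[1024, 1024])
broken_problems = check_depth_pro_lists(**broken)
print(default_problems, broken_problems)
```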

Example

>>> from transformers import DepthProConfig, DepthProModel

>>> # Initializing a DepthPro apple/DepthPro style configuration
>>> configuration = DepthProConfig()

>>> # Initializing a model (with random weights) from the apple/DepthPro style configuration
>>> model = DepthProModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

DepthProImageProcessor

class transformers.DepthProImageProcessor

( do_resize: bool = True size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None **kwargs )

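Parameters

do_resize (bool, optional, defaults to True): Whether to resize the image's (height, width) dimensions to the specified (size["height"], size["width"]). Can be overridden by the do_resize parameter in the preprocess method.
size (dict, optional, defaults to {"height": 1536, "width": 1536}): Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR): Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_rescale (bool, optional, defaults to True): Whether to rescale the image values to the range [0, 1]. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255): Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True): Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or list[float], optional, defaults to IMAGENET_STANDARD_MEAN): Mean to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to IMAGENET_STANDARD_STD): Standard deviation to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

Constructs a DepthPro image processor.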
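The rescale and normalize steps listed above amount to simple per-channel arithmetic, sketched below for a single RGB pixel. The sketch assumes that IMAGENET_STANDARD_MEAN and IMAGENET_STANDARD_STD are both 0.5 per channel; the function name rescale_and_normalize is illustrative only, and the real processor operates on whole image arrays.

```python
# Assumed values of IMAGENET_STANDARD_MEAN / IMAGENET_STANDARD_STD.
IMAGE_MEAN = [0.5, 0.5, 0.5]
IMAGE_STD = [0.5, 0.5, 0.5]
RESCALE_FACTOR = 1 / 255  # documented default rescale_factor


def rescale_and_normalize(pixel):
    """Map one 0-255 RGB pixel to normalized model-input values."""
    return [
        (channel * RESCALE_FACTOR - mean) / std
        for channel, mean, std in zip(pixel, IMAGE_MEAN, IMAGE_STD)
    ]


black = rescale_and_normalize([0, 0, 0])        # lower end of the range, -1.0
white = rescale_and_normalize([255, 255, 255])  # upper end, approximately 1.0
print(black, white)
```

With these assumed statistics, 8-bit pixel values are mapped to roughly the range [-1, 1]. If an image is already in [0, 1], the rescale step should be skipped by passing do_rescale=False.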

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None resample: typing.Optional[PIL.Image.Resampling] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

Parameters

images (ImageInput): Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize): Whether to resize the image.
size (dict[str, int], optional, defaults to self.size): Dictionary in the format {"height": h, "width": w} specifying the size of the output image after resizing.
resample (PILImageResampling filter, optional, defaults to self.resample): PILImageResampling filter to use if resizing the image, e.g. PILImageResampling.BILINEAR. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale): Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor): Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize): Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean): Image mean to use if do_normalize is set to True.
image_std (float or list[float], optional, defaults to self.image_std): Image standard deviation to use if do_normalize is set to True.
return_tensors (str or TensorType, optional): The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST): The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional): The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or a batch of images.

post_process_depth_estimation

( outputs: DepthProDepthEstimatorOutput target_sizes: typing.Union[transformers.utils.generic.TensorType, list[tuple[int, int]], NoneType] = None ) → list[dict[str, TensorType]]

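Parameters

outputs (DepthProDepthEstimatorOutput): Raw outputs of the model.
target_sizes (Optional[Union[TensorType, list[tuple[int, int]], None]], optional, defaults to None): Target sizes to resize the depth predictions to. Can be a tensor of shape (batch_size, 2) or a list of tuples (height, width), one for each image in the batch. If None, no resizing is performed.

Returns

list[dict[str, TensorType]]

A list of dictionaries of tensors representing the processed depth predictions. If field_of_view is given in outputs, the field of view (in degrees) and the focal length (in pixels) are also included.

Raises

ValueError: If the lengths of predicted_depths, fovs, or target_sizes are mismatched.

Post-processes the raw depth predictions from the model to generate the final depth predictions, which are calibrated using the field of view if provided and resized to the specified target sizes if provided.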
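The relation between a horizontal field of view in degrees and a focal length in pixels can be illustrated with the standard pinhole-camera formula, f = 0.5 * W / tan(0.5 * fov). The sketch below assumes this relation and assumes the image width is the relevant dimension; the helper focal_length_from_fov is hypothetical, and the exact computation inside post_process_depth_estimation may differ.

```python
import math


def focal_length_from_fov(fov_degrees: float, image_width: int) -> float:
    """Pinhole relation: focal length in pixels from a horizontal FOV in degrees."""
    return 0.5 * image_width / math.tan(0.5 * math.radians(fov_degrees))


# A 90-degree horizontal FOV on a 640-pixel-wide image gives roughly 320 pixels.
focal_px = focal_length_from_fov(90.0, 640)
print(round(focal_px, 3))
```

A narrower field of view yields a longer focal length for the same image width, which is why the two quantities are reported together.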

DepthProImageProcessorFast

class transformers.DepthProImageProcessorFast

( **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs] )

Constructs a fast Depth Pro image processor.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] *args **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs] ) → <class 'transformers.image_processing_base.BatchFeature'>

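Parameters

images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]): Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional): Whether to resize the image.
size (dict[str, int], optional): Describes the maximum input dimensions to the model.
default_to_square (bool, optional): Whether to default to a square image when resizing, if size is an int.
resample (Union[PILImageResampling, F.InterpolationMode, NoneType]): Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional): Whether to center crop the image.
crop_size (dict[str, int], optional): Size of the output image after applying center_crop.
do_rescale (bool, optional): Whether to rescale the image.
rescale_factor (Union[int, float, NoneType]): Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional): Whether to normalize the image.
image_mean (Union[float, list[float], NoneType]): Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (Union[float, list[float], NoneType]): Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_convert_rgb (bool, optional): Whether to convert the image to RGB.
return_tensors (Union[str, ~utils.generic.TensorType, NoneType]): Returns stacked tensors if set to `pt`, otherwise returns a list of tensors.
data_format (~image_utils.ChannelDimension, optional): Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
input_data_format (Union[str, ~image_utils.ChannelDimension, NoneType]): The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
device (torch.device, optional): The device to process the images on. If unset, the device is inferred from the input images.
disable_grouping (bool, optional): Whether to disable grouping of images by size, so that they are processed individually rather than in batches. If None, it is set to True if the images are on CPU and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157

Returns

<class 'transformers.image_processing_base.BatchFeature'>

data (dict): Dictionary of lists/arrays/tensors returned by the call method ("pixel_values", etc.).
tensor_type (Union[None, str, TensorType], optional): You can give a `tensor_type` here to convert the lists of integers into PyTorch/TensorFlow/Numpy tensors at initialization.

post_process_depth_estimation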

( outputs: DepthProDepthEstimatorOutput target_sizes: typing.Union[transformers.utils.generic.TensorType, list[tuple[int, int]], NoneType] = None ) → list[dict[str, TensorType]]

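Parameters

outputs (DepthProDepthEstimatorOutput): Raw outputs of the model.
target_sizes (Optional[Union[TensorType, list[tuple[int, int]], None]], optional, defaults to None): Target sizes to resize the depth predictions to. Can be a tensor of shape (batch_size, 2) or a list of tuples (height, width), one for each image in the batch. If None, no resizing is performed.

Returns

list[dict[str, TensorType]]

A list of dictionaries of tensors representing the processed depth predictions. If field_of_view is given in outputs, the field of view (in degrees) and the focal length (in pixels) are also included.

Raises

ValueError: If the lengths of predicted_depths, fovs, or target_sizes are mismatched.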

DepthProModel

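class transformers.DepthProModel

( config )

Parameters

config (DepthProConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Depth Pro model outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward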

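( pixel_values: FloatTensor head_mask: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.depth_pro.modeling_depth_pro.DepthProOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)): The tensors corresponding to the input images. Pixel values can be obtained using an image processor such as DepthProImageProcessor; see its __call__ method for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.models.depth_pro.modeling_depth_pro.DepthProOutput or tuple(torch.FloatTensor)

A transformers.models.depth_pro.modeling_depth_pro.DepthProOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DepthProConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, n_patches_per_batch, sequence_length, hidden_size)): Sequence of hidden states at the output of the last layer of the model.
features (Union[torch.FloatTensor, List[torch.FloatTensor]], optional): Features from the encoders. Can be a single feature or a list of features.
hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The DepthProModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example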

>>> import torch
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, DepthProModel

>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> checkpoint = "apple/DepthPro-hf"
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> model = DepthProModel.from_pretrained(checkpoint)

>>> # prepare image for the model
>>> inputs = processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     output = model(**inputs)

>>> output.last_hidden_state.shape
torch.Size([1, 35, 577, 1024])

DepthProForDepthEstimation

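class transformers.DepthProForDepthEstimation

( config use_fov_model = None )

Parameters

config (DepthProConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
use_fov_model (bool, optional): Whether to use the field of view model.

The DepthPro model with a depth estimation head on top (consisting of 3 convolutional layers).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward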

( pixel_values: FloatTensor head_mask: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.depth_pro.modeling_depth_pro.DepthProDepthEstimatorOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)): The tensors corresponding to the input images. Pixel values can be obtained using an image processor such as DepthProImageProcessor; see its __call__ method for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
labels (torch.LongTensor of shape (batch_size, height, width), optional): Ground truth depth estimation maps for computing the loss.
output_attentions (bool, optional): Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.models.depth_pro.modeling_depth_pro.DepthProDepthEstimatorOutput or tuple(torch.FloatTensor)

A transformers.models.depth_pro.modeling_depth_pro.DepthProDepthEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DepthProConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification (or regression if config.num_labels==1) loss.
predicted_depth (torch.FloatTensor of shape (batch_size, height, width), optional, defaults to None): Predicted depth for each pixel.
field_of_view (torch.FloatTensor of shape (batch_size,), optional, returned when use_fov_model is provided): Field of view scaler.
hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The DepthProForDepthEstimation forward method overrides the __call__ special method.

Example

>>> from transformers import AutoImageProcessor, DepthProForDepthEstimation
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> checkpoint = "apple/DepthPro-hf"
>>> processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = DepthProForDepthEstimation.from_pretrained(checkpoint)

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> model.to(device)

>>> # prepare image for the model
>>> inputs = processor(images=image, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = processor.post_process_depth_estimation(
...     outputs, target_sizes=[(image.height, image.width)],
... )

>>> # get the field of view (fov) predictions
>>> field_of_view = post_processed_output[0]["field_of_view"]
>>> focal_length = post_processed_output[0]["focal_length"]

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 255 / predicted_depth.max()
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint8"))
