GLPN

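This is a recently introduced model, so the API has not been tested extensively. There may be some bugs, or slight breaking changes to fix them in the future. If you see something strange, file a GitHub issue.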

Overview

The GLPN model was proposed in "Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth" by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, and Junmo Kim. GLPN combines SegFormer's hierarchical Mix Transformer with a lightweight decoder for monocular depth estimation. The proposed decoder performs better than previously proposed decoders, with considerably lower computational complexity.

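The abstract from the paper is the following:

Depth estimation from a single image is an important task that can be applied to various fields in computer vision, and has grown rapidly with the development of convolutional neural networks. In this paper, we propose a novel structure and training strategy for monocular depth estimation to further improve the prediction accuracy of the network. We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder to generate an estimated depth map while considering local connectivity. By constructing connected paths between multi-scale local features and the global decoding stream with our proposed selective feature fusion module, the network can integrate both representations and recover fine details. In addition, the proposed decoder shows better performance than the previously proposed decoders, with considerably less computational complexity. Furthermore, we improve the depth-specific augmentation method by utilizing an important observation in depth estimation to enhance the model. Our network achieves state-of-the-art performance over the challenging depth dataset NYU Depth V2. Extensive experiments have been conducted to validate and show the effectiveness of the proposed approach. Finally, our model shows better generalisation ability and robustness than other comparative models.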

Summary of the approach. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.

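Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GLPN.

Demo notebooks for GLPNForDepthEstimation can be found here.
Monocular depth estimation task guide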

GLPNConfig

class transformers.GLPNConfig

< source >

( num_channels = 3 num_encoder_blocks = 4 depths = [2, 2, 2, 2] sr_ratios = [8, 4, 2, 1] hidden_sizes = [32, 64, 160, 256] patch_sizes = [7, 3, 3, 3] strides = [4, 2, 2, 2] num_attention_heads = [1, 2, 5, 8] mlp_ratios = [4, 4, 4, 4] hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 drop_path_rate = 0.1 layer_norm_eps = 1e-06 decoder_hidden_size = 64 max_depth = 10 head_in_index = -1 **kwargs )

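Parameters

num_channels (int, optional, defaults to 3): The number of input channels.
num_encoder_blocks (int, optional, defaults to 4): The number of encoder blocks (i.e. stages in the Mix Transformer encoder).
depths (list[int], optional, defaults to `[2, 2, 2, 2]`): The number of layers in each encoder block.
sr_ratios (list[int], optional, defaults to `[8, 4, 2, 1]`): Sequence reduction ratios in each encoder block.
hidden_sizes (list[int], optional, defaults to `[32, 64, 160, 256]`): Dimension of each of the encoder blocks.
patch_sizes (list[int], optional, defaults to `[7, 3, 3, 3]`): Patch size before each encoder block.
strides (list[int], optional, defaults to `[4, 2, 2, 2]`): Stride before each encoder block.
num_attention_heads (list[int], optional, defaults to `[1, 2, 5, 8]`): Number of attention heads for each attention layer in each block of the Transformer encoder.
mlp_ratios (list[int], optional, defaults to `[4, 4, 4, 4]`): Ratio of the size of the hidden layer to the size of the input layer of the Mix FFNs in the encoder blocks.
hidden_act (str or function, optional, defaults to "gelu"): The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.0): The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
drop_path_rate (float, optional, defaults to 0.1): The dropout probability for stochastic depth, used in the blocks of the Transformer encoder.
layer_norm_eps (float, optional, defaults to 1e-06): The epsilon used by the layer normalization layers.
decoder_hidden_size (int, optional, defaults to 64): The dimension of the decoder.
max_depth (int, optional, defaults to 10): The maximum depth of the decoder.
head_in_index (int, optional, defaults to -1): The index of the features to use in the head.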

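This is the configuration class to store the configuration of a GLPNModel. It is used to instantiate a GLPN model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the GLPN vinvino02/glpn-kitti architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example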

>>> from transformers import GLPNModel, GLPNConfig

>>> # Initializing a GLPN vinvino02/glpn-kitti style configuration
>>> configuration = GLPNConfig()

>>> # Initializing a model from the vinvino02/glpn-kitti style configuration
>>> model = GLPNModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
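
The interplay of `patch_sizes`, `strides`, `hidden_sizes` and `sr_ratios` is easier to see with numbers. The following is a minimal sketch in plain Python, not library code: the helper names `conv_output_size` and `encoder_stage_shapes` are hypothetical. It assumes that each patch embedding is a convolution with padding `patch_size // 2`, and that keys and values are spatially reduced by a factor of `sr_ratio` along each axis, as in SegFormer-style encoders.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_stage_shapes(
    height: int,
    width: int,
    hidden_sizes=(32, 64, 160, 256),
    patch_sizes=(7, 3, 3, 3),
    strides=(4, 2, 2, 2),
    sr_ratios=(8, 4, 2, 1),
):
    """Estimate the feature-map shape and attention sizes of each encoder block."""
    stages = []
    for channels, patch, stride, sr in zip(hidden_sizes, patch_sizes, strides, sr_ratios):
        # Assumption: the patch embedding is a conv with padding = patch // 2.
        height = conv_output_size(height, patch, stride, patch // 2)
        width = conv_output_size(width, patch, stride, patch // 2)
        stages.append(
            {
                "channels": channels,
                "height": height,
                "width": width,
                # One query token per spatial position.
                "query_tokens": height * width,
                # Assumption: keys/values are downsampled by sr along each axis.
                "kv_tokens": (height // sr) * (width // sr),
            }
        )
    return stages


# Default configuration values on a 480 x 640 (height x width) input.
for stage in encoder_stage_shapes(480, 640):
    print(stage)
```

Under these assumptions, the default strides downsample the input by 4, 8, 16 and 32 in total, which is consistent with the default `size_divisor` of 32 in GLPNImageProcessor.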

GLPNFeatureExtractor

class transformers.GLPNFeatureExtractor

< source >

( *args **kwargs )

__call__

< source >

( images **kwargs )

Preprocess an image or a batch of images.

GLPNImageProcessor

class transformers.GLPNImageProcessor

< source >

( do_resize: bool = True size_divisor: int = 32 resample = <Resampling.BILINEAR: 2> do_rescale: bool = True **kwargs )

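Parameters

do_resize (bool, optional, defaults to True): Whether to resize the image's (height, width) dimensions, rounding them down to the closest multiple of `size_divisor`. Can be overridden by `do_resize` in `preprocess`.
size_divisor (int, optional, defaults to 32): When `do_resize` is `True`, images are resized so their height and width are rounded down to the closest multiple of `size_divisor`. Can be overridden by `size_divisor` in `preprocess`.
resample (`PIL.Image` resampling filter, optional, defaults to `Resampling.BILINEAR`): Resampling filter to use if resizing the image. Can be overridden by `resample` in `preprocess`.
do_rescale (bool, optional, defaults to True): Whether to apply the scaling factor (to make pixel values floats between 0 and 1). Can be overridden by `do_rescale` in `preprocess`.

Constructs a GLPN image processor.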
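
The resizing and rescaling rules described above come down to simple arithmetic. The sketch below is plain Python rather than the processor's own code. The helper names are hypothetical, and the scale factor of 1/255 is an assumption inferred from the stated input range (0 to 255) and output range (0 to 1).

```python
def round_down_to_multiple(value: int, divisor: int = 32) -> int:
    """Round value down to the closest multiple of divisor."""
    return (value // divisor) * divisor


def target_size(height: int, width: int, size_divisor: int = 32) -> tuple:
    """Size an image would be resized to when do_resize is True."""
    return (
        round_down_to_multiple(height, size_divisor),
        round_down_to_multiple(width, size_divisor),
    )


def rescale_pixel(value: int, scale: float = 1 / 255) -> float:
    """Map an 8-bit pixel value to a float between 0 and 1 (assumed scale)."""
    return value * scale


print(target_size(427, 640))  # height is rounded down, width is already a multiple
print(target_size(480, 640))  # both dimensions are already multiples of 32
print(rescale_pixel(128))
```

An image whose sides are already multiples of `size_divisor` keeps its size; otherwise each side shrinks by fewer than `size_divisor` pixels.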

preprocess

< source >

( images: typing.Union[ForwardRef('PIL.Image.Image'), transformers.utils.generic.TensorType, list['PIL.Image.Image'], list[transformers.utils.generic.TensorType]] do_resize: typing.Optional[bool] = None size_divisor: typing.Optional[int] = None resample = None do_rescale: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

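Parameters

images (PIL.Image.Image or TensorType or list[np.ndarray] or list[TensorType]): Images to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
do_resize (bool, optional, defaults to self.do_resize): Whether to resize the input so that its (height, width) dimensions are multiples of `size_divisor`.
size_divisor (int, optional, defaults to self.size_divisor): When `do_resize` is `True`, images are resized so their height and width are rounded down to the closest multiple of `size_divisor`.
resample (`PIL.Image` resampling filter, optional, defaults to self.resample): `PIL.Image` resampling filter to use if resizing the image, e.g. `PILImageResampling.BILINEAR`. Only has an effect if `do_resize` is set to `True`.
do_rescale (bool, optional, defaults to self.do_rescale): Whether to apply the scaling factor (to make pixel values floats between 0 and 1).
return_tensors (str or TensorType, optional): The type of tensors to return. Can be one of:
- None: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST): The channel dimension format for the output image. Can be one of:
- `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `ChannelDimension.LAST`: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional): The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess the given images.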

GLPNModel

class transformers.GLPNModel

< source >

( config )

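Parameters

config (GLPNConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare GLPN model outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.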

forward

< source >

( pixel_values: FloatTensor output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)): The tensors corresponding to the input images. Pixel values can be obtained using GLPNImageProcessor. See GLPNImageProcessor.__call__ for details.
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.

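transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GLPNConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.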

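The GLPNModel forward method overrides the `__call__` special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example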

GLPNForDepthEstimation

class transformers.GLPNForDepthEstimation

< source >

( config )

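Parameters

config (GLPNConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

GLPN Transformer model with a lightweight depth estimation head on top, e.g. for the KITTI and NYUv2 datasets.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.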

forward

< source >

( pixel_values: FloatTensor labels: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)): The tensors corresponding to the input images. Pixel values can be obtained using GLPNImageProcessor. See GLPNImageProcessor.__call__ for details.
labels (torch.FloatTensor of shape (batch_size, height, width), optional): Ground truth depth estimation maps for computing the loss.
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.

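transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.DepthEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GLPNConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification (or regression if config.num_labels==1) loss.
predicted_depth (torch.FloatTensor of shape (batch_size, height, width)): Predicted depth for each pixel.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, num_channels, height, width). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The GLPNForDepthEstimation forward method overrides the `__call__` special method.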
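
This page does not spell out how the loss is computed from `labels`. A common choice for monocular depth estimation is a scale-invariant log loss, and the sketch below illustrates that idea in plain Python on flattened depth values. Treat all of it as an assumption rather than the library's confirmed behavior: the function name `silog_loss`, the weight `lambd=0.5`, and the masking of non-positive targets are illustrative.

```python
import math


def silog_loss(pred, target, lambd: float = 0.5) -> float:
    """Scale-invariant log loss over flattened depth values (illustrative sketch)."""
    # Assumption: pixels without valid ground truth are marked by depth <= 0
    # and are excluded from the loss.
    diffs = [math.log(t) - math.log(p) for p, t in zip(pred, target) if t > 0]
    if not diffs:
        raise ValueError("no valid target pixels")
    mean_of_squares = sum(d * d for d in diffs) / len(diffs)
    mean = sum(diffs) / len(diffs)
    # The second term discounts a log-offset shared by all valid pixels.
    return math.sqrt(mean_of_squares - lambd * mean ** 2)


print(silog_loss([1.0, 2.0, 4.0], [1.0, 2.0, 4.0]))  # perfect prediction
print(silog_loss([1.0, 2.0, 4.0], [1.5, 2.5, 0.0]))  # last pixel is masked out
```

Predicted depths must be strictly positive for the logarithm to be defined.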

Examples

>>> from transformers import AutoImageProcessor, GLPNForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("vinvino02/glpn-kitti")
>>> model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-kitti")

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 255 / predicted_depth.max()
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint8"))
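
The last lines of the example scale the depth map so that its largest value becomes 255, then truncate to 8-bit integers for display. As a rough illustration of that arithmetic only, here is the same mapping in plain Python on a nested list. The helper name `depth_to_uint8` is hypothetical and not part of the library.

```python
def depth_to_uint8(depth_map):
    """Scale a 2D depth map (list of rows) so that its maximum maps to 255."""
    max_depth = max(max(row) for row in depth_map)
    if max_depth <= 0:
        raise ValueError("depth map must contain a positive value")
    # int() truncates toward zero, like casting non-negative floats to uint8.
    return [[int(value * 255 / max_depth) for value in row] for row in depth_map]


print(depth_to_uint8([[0.0, 5.0], [10.0, 2.5]]))
```

Because the scaling is relative to each image's own maximum, the resulting gray levels are useful for visualization but are not comparable across images.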
