Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)

Introduction

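The VGG architecture was developed in 2014 by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group (hence the name VGG) at the University of Oxford. The model showed a significant improvement over earlier models at the time, specifically in the 2014 ImageNet challenge, also known as ILSVRC.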

VGG Network Architecture

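The input is a 224x224 image.
Convolution kernels have shape (3,3) and max-pooling windows have shape (2,2).
Number of channels in each convolutional block: 64 -> 128 -> 256 -> 512 -> 512.
VGG16 has 16 weight layers (13 convolutional and 3 fully connected).
VGG19 has 19 weight layers (16 convolutional and 3 fully connected).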
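
The layer counts and the shrinking spatial size can be checked with a little arithmetic. The sketch below is illustrative only: the names `VGG_BLOCKS` and `summarize` are made up for this example, and it assumes each 3x3 convolution uses padding 1 (so height and width are preserved) and that every block ends with a 2x2 max pool of stride 2.

```python
# (number of 3x3 conv layers, output channels) for each of the five blocks.
# Illustrative layout; assumes padding=1 convs and a 2x2/stride-2 pool per block.
VGG_BLOCKS = {
    "VGG16": [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)],
    "VGG19": [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)],
}
NUM_FC_LAYERS = 3  # three fully connected layers in both variants


def summarize(name, input_size=224):
    """Return layer counts and the feature-map size after the last block."""
    blocks = VGG_BLOCKS[name]
    conv_layers = sum(num_convs for num_convs, _ in blocks)
    size = input_size
    for _ in blocks:
        # Padded 3x3 convs keep the size; each max pool halves it.
        size //= 2
    return {
        "conv_layers": conv_layers,
        "weight_layers": conv_layers + NUM_FC_LAYERS,
        "final_size": size,
        "final_channels": blocks[-1][1],
    }


for variant in VGG_BLOCKS:
    print(variant, summarize(variant))
```

With a 224x224 input, five halvings leave a 7x7 feature map with 512 channels, which is why the first fully connected layer in the example further down expects 512 * 7 * 7 inputs.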

Key Comparisons

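VGG (16 or 19 layers) was considerably deeper than other state-of-the-art networks of its time. AlexNet, the winning model of ILSVRC 2012, had only 8 layers.
Stacking several small (3x3) receptive-field filters with ReLU activations, instead of using a single large (7x7 or 11x11) filter, learns complex features better. Smaller filters also mean fewer parameters per layer and introduce additional non-linearity between layers.
Multi-scale training and inference. Each image is trained on at several different scales over multiple rounds so that similar features are captured at different sizes.
The uniformity and simplicity of the VGG network make it easy to extend or modify for future improvements.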
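
The trade-off between small and large filters can be made concrete with a short calculation. This is a minimal sketch under simplifying assumptions: stride-1 convolutions, the same number of input and output channels `C` in every layer, and biases ignored. The helper names `conv_weights` and `stacked_receptive_field` are hypothetical and exist only for this example.

```python
def conv_weights(kernel_size, channels, num_layers=1):
    """Weights in a stack of square convs with `channels` in and out (biases ignored)."""
    return num_layers * kernel_size * kernel_size * channels * channels


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of `num_layers` stride-1 convs stacked on top of each other."""
    return 1 + num_layers * (kernel_size - 1)


C = 512  # example channel width, taken from the deepest VGG blocks

three_3x3 = conv_weights(3, C, num_layers=3)  # 3 * 9 * C^2 = 27 C^2
one_7x7 = conv_weights(7, C)                  # 49 C^2

print("receptive field of three 3x3 convs:", stacked_receptive_field(3, 3))
print("receptive field of one 7x7 conv:   ", stacked_receptive_field(7, 1))
print("weights, three 3x3 convs:", three_3x3)
print("weights, one 7x7 conv:   ", one_7x7)
```

Three stacked 3x3 layers cover the same 7x7 region as one 7x7 layer with roughly 45% fewer weights, and they apply three ReLU non-linearities instead of one.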

PyTorch Example

Below is a PyTorch implementation of VGG19.

import torch.nn as nn


class VGG19(nn.Module):
    def __init__(self, num_classes=1000):
        super(VGG19, self).__init__()

        # Feature extraction layers: Convolutional and pooling layers
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(
                3, 64, kernel_size=3, padding=1
            ),  # 3 input channels, 64 output channels, 3x3 kernel, 1 padding
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(
                kernel_size=2, stride=2
            ),  # Max pooling with 2x2 kernel and stride 2
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # Pooling Layer
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(7, 7))

        # Fully connected layers for classification
        self.classifier = nn.Sequential(
            nn.Linear(
                512 * 7 * 7, 4096
            ),  # 512 channels, 7x7 spatial dimensions after adaptive average pooling
            nn.ReLU(),
            nn.Dropout(0.5),  # Dropout layer with 0.5 dropout probability
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),  # Output layer with 'num_classes' output units
        )

    def forward(self, x):
        x = self.feature_extractor(x)  # Pass input through the feature extractor layers
        x = self.avgpool(x)  # Adaptive average pooling to a fixed 7x7 spatial size
        x = x.view(x.size(0), -1)  # Flatten the output for the fully connected layers
        x = self.classifier(x)  # Pass flattened output through the classifier layers
        return x
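
As a rough sanity check on the model above, its parameter count can be worked out by hand from the layer shapes. The sketch below uses plain Python and does not import the class; `CONV_CHANNELS` and `count_vgg19_parameters` are illustrative names, and the numbers assume the default `num_classes=1000` with a bias term in every layer.

```python
# Output channels of the 16 conv layers, in the order they appear in the model above.
CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256, 256,
                 512, 512, 512, 512, 512, 512, 512, 512]


def count_vgg19_parameters(num_classes=1000, in_channels=3, kernel_size=3):
    """Count weights and biases of the conv and fully connected layers."""
    conv_params = 0
    prev = in_channels
    for out in CONV_CHANNELS:
        # Weights: out * prev * k * k, plus one bias per output channel.
        conv_params += out * prev * kernel_size * kernel_size + out
        prev = out

    # Flattened 512 x 7 x 7 feature map, two hidden layers, then the class scores.
    fc_sizes = [512 * 7 * 7, 4096, 4096, num_classes]
    fc_params = 0
    for fan_in, fan_out in zip(fc_sizes, fc_sizes[1:]):
        fc_params += fan_in * fan_out + fan_out

    return conv_params, fc_params


conv_params, fc_params = count_vgg19_parameters()
print(f"conv: {conv_params:,}  fc: {fc_params:,}  total: {conv_params + fc_params:,}")
```

Most of the parameters sit in the fully connected classifier, chiefly in the first `nn.Linear` layer that maps the flattened 512 * 7 * 7 features to 4096 units.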
