Lecture 18: Object Detection: Find It and Box It


Contents

  1. Object Detection: From Classification to Localization
  2. Two-Stage Detection: Faster R-CNN
  3. One-Stage Detection: YOLO and SSD
  4. The Anchor Mechanism: Predefined Candidate Boxes
  5. NMS: Deduplication and Filtering
  6. Evaluation Metric: mAP
  7. Hands-On: Training on a Custom Dataset
  8. Summary

1. Object Detection: From Classification to Localization

1.1 Defining the Detection Task

Object detection = image classification + object localization

  • Classification: what is inside the box? (cat / dog / person)
  • Localization: where is the box? (top-left x, y, width w, height h)

Input image
    ↓
[Detection model]
    ↓
Output: [N, 6] detections
      [x1, y1, x2, y2, class, confidence]

      For example:
      [100, 200, 300, 400, "cat", 0.95]  ← top-left (100,200), bottom-right (300,400), cat, confidence 95%
      [500, 100, 700, 600, "dog", 0.88]  ← a dog elsewhere in the image
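
As a minimal sketch of handling this output format (toy values, with integer class ids in place of the strings above, which is what real models emit):

import torch

# [N, 6] detections: x1, y1, x2, y2, class_id, confidence (toy values)
detections = torch.tensor([
    [100, 200, 300, 400, 0, 0.95],   # cat
    [500, 100, 700, 600, 1, 0.88],   # dog
    [ 50,  60,  80,  90, 0, 0.20],   # low-confidence noise
])

conf_thresh = 0.5
kept = detections[detections[:, 5] >= conf_thresh]  # keep confident boxes only
print(kept.shape)  # torch.Size([2, 6])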

1.2 What Makes Detection Hard

Difficulty      | Description                                                  | Common solutions
Multi-scale     | Object sizes vary widely (a distant person vs. a nearby car) | Feature pyramid (FPN), multi-scale training
Dense scenes    | Many overlapping, occluded objects                           | NMS, Soft-NMS, DIoU-NMS
Real-time       | Autonomous driving needs 30+ FPS                             | One-stage detectors, lightweight models
Small objects   | Distant, tiny objects are hard to detect                     | High-resolution features, context fusion
Class imbalance | Background boxes far outnumber object boxes                  | Focal Loss (sketched below), OHEM
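
Focal Loss, listed in the last row, is worth seeing concretely. A minimal sketch of the binary form from the RetinaNet paper, FL(p_t) = -α(1-p_t)^γ · CE; α=0.25 and γ=2.0 are the paper's defaults, not tuned for any particular dataset:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """
    Binary focal loss: down-weights easy (well-classified) examples so the
    many easy background anchors do not dominate the gradient.
    logits: [N] raw scores; targets: [N] in {0, 1}
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Toy usage: one positive among several easy negatives
logits = torch.tensor([2.0, -3.0, -4.0, -5.0])
targets = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(focal_loss(logits, targets))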

2. Two-Stage Detection: Faster R-CNN

2.1 How It Evolved

R-CNN (2014)             Fast R-CNN (2015)           Faster R-CNN (2015)
    │                        │                           │
    ▼                        ▼                           ▼
1. Selective Search      1. Run the whole image      1. CNN extracts features
   finds ~2000 proposals    through the CNN once     2. RPN generates proposals
2. Each proposal runs    2. ROI Pooling extracts     3. ROI Pooling + classification
   through the CNN          per-ROI features            (trained end to end)
3. SVM classification    3. Joint classification
4. Box regression           + box regression

Speed:    ~50 s/image        ~2 s/image                  ~0.2 s/image
Accuracy: mAP 53.3%          mAP 68.8%                   mAP 73.2%

2.2 The Faster R-CNN Architecture in Detail

import numpy as np  # used by AnchorGenerator below
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision.ops import RoIPool, nms

class FasterRCNN(nn.Module):
    """
    Simplified Faster R-CNN.
    Core components: Backbone + RPN + ROI Pooling + Head
    """
    def __init__(self, num_classes=21):
        super().__init__()
        
        # 1. Backbone: feature extraction
        resnet = models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3
        )
        # Output features: 1/16 resolution, 1024 channels
        
        # 2. RPN: Region Proposal Network
        self.rpn = RegionProposalNetwork(in_channels=1024)
        
        # 3. ROI Pooling: bring every ROI to a fixed size
        self.roi_pool = RoIPool(output_size=(7, 7), spatial_scale=1/16)
        
        # 4. Head: classification + box regression
        self.fc = nn.Sequential(
            nn.Linear(1024 * 7 * 7, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5)
        )
        self.classifier = nn.Linear(4096, num_classes)           # classification
        self.bbox_regressor = nn.Linear(4096, num_classes * 4)   # regression (one box per class)
    
    def forward(self, images, targets=None):
        """
        Training: needs targets to compute the RPN and Head losses.
        Inference: returns detections only.
        """
        # Extract features
        features = self.backbone(images)  # [N, 1024, H/16, W/16]
        
        # RPN generates proposals
        proposals, rpn_losses = self.rpn(features, images.shape[2:], targets)
        
        # ROI Pooling
        if self.training:
            # Training: mix sampled proposals with ground-truth boxes
            proposals = self._sample_proposals(proposals, targets)
        
        pooled_features = self.roi_pool(features, proposals)  # [num_rois, 1024, 7, 7]
        pooled_features = pooled_features.view(pooled_features.size(0), -1)
        
        # Head
        fc_out = self.fc(pooled_features)
        class_logits = self.classifier(fc_out)        # [num_rois, num_classes]
        bbox_deltas = self.bbox_regressor(fc_out)     # [num_rois, num_classes*4]
        
        if self.training:
            # Compute losses
            losses = self._compute_loss(class_logits, bbox_deltas, proposals, targets)
            losses.update(rpn_losses)
            return losses
        
        # Inference: post-process into final detections
        detections = self._postprocess(class_logits, bbox_deltas, proposals)
        return detections

class RegionProposalNetwork(nn.Module):
    """
    RPN: slide a small network over the feature map to predict proposals.
    """
    def __init__(self, in_channels=1024, num_anchors=9):
        super().__init__()
        
        # 3x3 conv to extract RPN features
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)
        
        # Classification branch: foreground / background (2 classes)
        self.cls_logits = nn.Conv2d(512, num_anchors * 2, 1)
        
        # Regression branch: box offsets (4 parameters)
        self.bbox_pred = nn.Conv2d(512, num_anchors * 4, 1)
        
        # Anchor generator
        self.anchor_generator = AnchorGenerator(
            sizes=[128, 256, 512],
            aspect_ratios=[0.5, 1.0, 2.0]
        )
    
    def forward(self, features, image_size, targets=None):
        # features: [N, 1024, H, W]
        x = torch.relu(self.conv(features))
        
        # Classification: is each anchor foreground or background?
        cls_logits = self.cls_logits(x)  # [N, num_anchors*2, H, W]
        
        # Regression: offsets for each anchor
        bbox_pred = self.bbox_pred(x)    # [N, num_anchors*4, H, W]
        
        # Generate anchors
        anchors = self.anchor_generator(features, image_size)
        
        # Training: compute the RPN loss
        if self.training and targets is not None:
            rpn_losses = self._compute_rpn_loss(cls_logits, bbox_pred, anchors, targets)
            return anchors, rpn_losses
        
        # Inference: filter proposals by score (see the sketch in Section 2.3)
        proposals = self._filter_proposals(cls_logits, bbox_pred, anchors)
        return proposals, {}

class AnchorGenerator:
    """
    Generates anchors at multiple scales and aspect ratios.
    """
    def __init__(self, sizes=[128, 256, 512], aspect_ratios=[0.5, 1.0, 2.0]):
        self.sizes = sizes
        self.aspect_ratios = aspect_ratios
        self.num_anchors = len(sizes) * len(aspect_ratios)
        
        # Precompute the anchor templates
        self.cell_anchors = self._generate_cell_anchors()
    
    def _generate_cell_anchors(self):
        anchors = []
        for size in self.sizes:
            for ratio in self.aspect_ratios:
                w = size * np.sqrt(ratio)
                h = size / np.sqrt(ratio)
                anchors.append([-w/2, -h/2, w/2, h/2])
        return torch.tensor(anchors)  # [num_anchors, 4]
    
    def __call__(self, features, image_size):
        # Place the anchor templates at every feature-map location.
        # A full implementation maps them back to image coordinates;
        # see map_anchors_to_image in Section 4.2 for the idea.
        pass

2.3 The Core Idea Behind the RPN

Analogy: the RPN casts fishing nets across a map.

  • First cast nets of different sizes and shapes (the anchors)
  • Judge which nets contain fish (foreground/background classification)
  • Adjust each net's position and size (box regression)
  • Finally keep only the nets with fish (NMS filtering)

At every feature-map location:
    ↓
Generate K anchors (e.g. K=9: 3 sizes × 3 aspect ratios)
    ↓
Classification branch: is each anchor foreground or background?
    ↓
Regression branch: refine each anchor's position and size
    ↓
Filter: keep only high-confidence foreground anchors as proposals (sketched below)
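
The filtering step (the _filter_proposals hook left abstract in the RPN code above) can be sketched as follows. The top-k sizes and the 0.7 IoU threshold are common Faster R-CNN settings, assumed here rather than taken from any specific implementation; torchvision.ops.nms does the suppression:

import torch
from torchvision.ops import nms

def filter_proposals(scores, boxes, pre_nms_topk=2000, post_nms_topk=300, iou_thresh=0.7):
    """
    scores: [A] objectness score per anchor (foreground probability)
    boxes:  [A, 4] anchors already shifted by the predicted offsets (x1, y1, x2, y2)
    Returns up to post_nms_topk proposals.
    """
    # 1. Keep the top-k highest-scoring boxes before NMS
    k = min(pre_nms_topk, scores.numel())
    scores, order = scores.topk(k)
    boxes = boxes[order]
    
    # 2. NMS to remove near-duplicates
    keep = nms(boxes, scores, iou_thresh)
    
    # 3. Cap the number of surviving proposals
    keep = keep[:post_nms_topk]
    return boxes[keep], scores[keep]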

3. One-Stage Detection: YOLO and SSD

3.1 YOLO: You Only Look Once

Core idea: treat detection as a regression problem, where a single forward pass directly outputs every box's position and class.

class YOLOv1(nn.Module):
    """
    Simplified YOLOv1 (the original grid-prediction formulation).
    Split the image into an S×S grid; each cell predicts B boxes.
    """
    def __init__(self, S=7, B=2, num_classes=20):
        super().__init__()
        self.S = S  # grid size
        self.B = B  # boxes per cell
        self.num_classes = num_classes
        
        # Backbone (VGG-like)
        self.backbone = nn.Sequential(
            # many conv layers for feature extraction
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
            
            nn.Conv2d(64, 192, 3, padding=1),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
            
            # ... more conv layers ...
            
            nn.Conv2d(512, 1024, 3, padding=1),
            nn.LeakyReLU(0.1),
        )
        
        # Fully connected output
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),
            nn.Linear(4096, S * S * (B * 5 + num_classes))
            # each cell outputs: B*(x, y, w, h, confidence) + num_classes
        )
    
    def forward(self, x):
        # x: [N, 3, 448, 448]
        features = self.backbone(x)  # [N, 1024, S, S]
        out = self.fc(features)      # [N, S*S*(B*5+C)]
        out = out.view(-1, self.S, self.S, self.B * 5 + self.num_classes)
        
        # Interpreting the output:
        # out[..., :B*5] = box parameters (5 numbers per box: x, y, w, h, conf)
        # out[..., B*5:] = class probabilities
        
        return out

# Decoding the YOLO output
def decode_yolo_output(output, S=7, B=2, num_classes=20, conf_thresh=0.5):
    """
    Parse raw YOLO output into detection results.
    """
    batch_size = output.size(0)
    predictions = []
    
    for b in range(batch_size):
        for i in range(S):
            for j in range(S):
                grid_output = output[b, i, j]  # [B*5 + C]
                
                for box_idx in range(B):
                    # Extract box parameters
                    start = box_idx * 5
                    x = grid_output[start]
                    y = grid_output[start + 1]
                    w = grid_output[start + 2]
                    h = grid_output[start + 3]
                    conf = torch.sigmoid(grid_output[start + 4])
                    
                    if conf > conf_thresh:
                        # Convert to image coordinates
                        x = (j + x) / S  # grid cell + offset, normalized
                        y = (i + y) / S
                        w = w ** 2       # YOLOv1 predicts sqrt(w), sqrt(h); square to recover
                        h = h ** 2
                        
                        # Class
                        class_scores = grid_output[B*5:]
                        class_id = class_scores.argmax()
                        class_conf = torch.softmax(class_scores, dim=0)[class_id]
                        
                        predictions.append({
                            'bbox': [x - w/2, y - h/2, x + w/2, y + h/2],
                            'confidence': conf.item(),
                            'class_id': class_id.item(),
                            'class_conf': class_conf.item()
                        })
    
    return predictions

3.2 SSD: Single Shot MultiBox Detector

class SSD(nn.Module):
    """
    SSD: detection from multi-scale feature maps.
    Feature maps at different resolutions handle objects of different sizes.
    """
    def __init__(self, num_classes=21):
        super().__init__()
        self.num_classes = num_classes  # needed when reshaping predictions in forward
        
        # Backbone (VGG)
        self.backbone = models.vgg16(pretrained=True).features
        
        # Extra feature layers (progressively smaller maps, for larger objects)
        self.extras = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(512, 1024, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(1024, 1024, 1),
                nn.ReLU()
            ),
            nn.Sequential(
                nn.Conv2d(1024, 256, 1),
                nn.ReLU(),
                nn.Conv2d(256, 512, 3, stride=2, padding=1),
                nn.ReLU()
            ),
            # ... more layers ...
        ])
        
        # Detection heads: predict boxes and classes at every feature-map location
        self.loc_layers = nn.ModuleList()   # regression branch
        self.conf_layers = nn.ModuleList()  # classification branch
        
        # One pair of heads per source feature map
        channels = [512, 1024, 512, 256, 256, 256]  # channels of each source map
        num_priors = [4, 6, 6, 6, 4, 4]             # anchors per location on each map
        
        for in_channels, num_prior in zip(channels, num_priors):
            self.loc_layers.append(
                nn.Conv2d(in_channels, num_prior * 4, 3, padding=1)
            )
            self.conf_layers.append(
                nn.Conv2d(in_channels, num_prior * num_classes, 3, padding=1)
            )
        
        # Prior (default) box generation
        self.prior_boxes = self._generate_priors()
    
    def forward(self, x):
        # Backbone features
        sources = []
        for i, layer in enumerate(self.backbone):
            x = layer(x)
            if i == 22:  # conv4_3 output (after its ReLU) in torchvision's vgg16.features
                sources.append(x)
        
        # Extra feature layers
        for extra in self.extras:
            x = extra(x)
            sources.append(x)
        
        # Predict on every feature map
        loc = []    # [batch, num_priors*4, H, W]
        conf = []   # [batch, num_priors*num_classes, H, W]
        
        for source, loc_layer, conf_layer in zip(sources, self.loc_layers, self.conf_layers):
            loc.append(loc_layer(source).permute(0, 2, 3, 1).contiguous())
            conf.append(conf_layer(source).permute(0, 2, 3, 1).contiguous())
        
        # Concatenate all predictions
        loc = torch.cat([o.view(o.size(0), -1) for o in loc], dim=1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], dim=1)
        
        # Reshape
        loc = loc.view(loc.size(0), -1, 4)
        conf = conf.view(conf.size(0), -1, self.num_classes)
        
        return loc, conf, self.prior_boxes

    def _generate_priors(self):
        # Generate prior boxes of different scales and aspect ratios
        # (a standalone sketch follows below)
        pass
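
The _generate_priors stub above can be fleshed out per feature map. A minimal sketch in normalized coordinates; the feature-map size and scale values below are illustrative assumptions (real SSD interpolates per-layer scales between roughly s_min = 0.2 and s_max = 0.9):

import torch

def generate_priors_for_map(feat_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """
    Priors for one feature map, in normalized [cx, cy, w, h] coordinates.
    feat_size: spatial size of the (square) feature map
    scale:     prior size relative to the image, e.g. 0.2
    """
    priors = []
    for i in range(feat_size):
        for j in range(feat_size):
            cx = (j + 0.5) / feat_size  # cell center
            cy = (i + 0.5) / feat_size
            for ar in aspect_ratios:
                w = scale * (ar ** 0.5)
                h = scale / (ar ** 0.5)
                priors.append([cx, cy, w, h])
    return torch.tensor(priors)  # [feat_size*feat_size*len(aspect_ratios), 4]

priors = generate_priors_for_map(feat_size=5, scale=0.4)
print(priors.shape)  # torch.Size([75, 4])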

3.3 YOLO vs SSD vs Faster R-CNN

Aspect              | YOLO (v3/v5/v8)        | SSD                              | Faster R-CNN
Speed               | fastest (~140 FPS)     | fast (~60 FPS)                   | slow (~10 FPS)
Accuracy            | high (mAP ~50)         | medium-high (mAP ~46)            | highest (mAP ~42-75)
Small objects       | fair (improved in v8)  | weak (relies on shallow layers)  | good (multi-scale ROIs)
Localization        | medium                 | medium                           | high (two-stage refinement)
Training difficulty | easy                   | moderate                         | complex
Best suited for     | real-time detection    | balanced scenarios               | accuracy-first settings

4. The Anchor Mechanism: Predefined Candidate Boxes

4.1 What Is an Anchor?

Analogy: anchors are like fish hooks tied before you go fishing; you prepare hooks of various sizes and shapes in advance, then just check which hook has a fish on it.

import numpy as np
import matplotlib.pyplot as plt

def generate_anchors(base_size=16, ratios=[0.5, 1, 2], scales=[8, 16, 32]):
    """
    Generate anchor templates.
    base_size: base grid size on the feature map
    ratios:    aspect ratios (w/h)
    scales:    scaling factors
    """
    anchors = []
    
    for scale in scales:
        for ratio in ratios:
            # Compute the anchor's width and height:
            # area = (base_size * scale)^2
            # w = sqrt(area * ratio), h = sqrt(area / ratio)
            area = (base_size * scale) ** 2
            w = np.sqrt(area * ratio)
            h = np.sqrt(area / ratio)
            
            # Centered at the origin
            x1 = -w / 2
            y1 = -h / 2
            x2 = w / 2
            y2 = h / 2
            
            anchors.append([x1, y1, x2, y2])
    
    return np.array(anchors)

# Visualize the anchors
anchors = generate_anchors()

fig, ax = plt.subplots(figsize=(10, 10))
colors = plt.cm.viridis(np.linspace(0, 1, len(anchors)))

for i, anchor in enumerate(anchors):
    x1, y1, x2, y2 = anchor
    width = x2 - x1
    height = y2 - y1
    
    rect = plt.Rectangle((x1, y1), width, height, 
                        fill=False, edgecolor=colors[i], linewidth=2,
                        label=f'Ratio={["0.5","1.0","2.0"][i%3]}, Scale={[8,16,32][i//3]}')
    ax.add_patch(rect)

ax.set_xlim(-400, 400)
ax.set_ylim(-400, 400)
ax.set_aspect('equal')
ax.axhline(y=0, color='k', linestyle='-', alpha=0.3)
ax.axvline(x=0, color='k', linestyle='-', alpha=0.3)
ax.set_title('Anchor Templates (Centered at Origin)')
ax.legend(loc='upper right', fontsize=8)
plt.show()

print(f"Generated {len(anchors)} anchor templates")
print("Shape:", anchors.shape)

4.2 Mapping Anchors onto the Feature Map

def map_anchors_to_image(anchors, feature_map_size, image_size):
    """
    Tile the anchor templates over every feature-map position,
    in image coordinates.
    """
    feat_h, feat_w = feature_map_size
    img_h, img_w = image_size
    
    # Stride: how many image pixels one feature-map cell covers
    stride_h = img_h / feat_h
    stride_w = img_w / feat_w
    
    all_anchors = []
    
    for i in range(feat_h):
        for j in range(feat_w):
            # Center of the current cell in image coordinates
            center_y = i * stride_h + stride_h / 2
            center_x = j * stride_w + stride_w / 2
            
            for anchor in anchors:
                # Shift the template to the current center
                x1 = center_x + anchor[0]
                y1 = center_y + anchor[1]
                x2 = center_x + anchor[2]
                y2 = center_y + anchor[3]
                
                all_anchors.append([x1, y1, x2, y2])
    
    return np.array(all_anchors)

# Example: 32x32 feature map, 512x512 image
anchors = generate_anchors(base_size=16)
mapped_anchors = map_anchors_to_image(anchors, (32, 32), (512, 512))

print(f"Feature map size: 32x32")
print(f"Anchor templates: {len(anchors)}")
print(f"Total anchors after mapping: {len(mapped_anchors)}")  # 32*32*9 = 9216
print(f"First anchor: {mapped_anchors[0]}")

4.3 Anchor-Free: The Next Generation of Detectors

Trend: modern detectors such as YOLOv8 and FCOS drop anchors entirely and directly predict center points and box extents.

class AnchorFreeDetector(nn.Module):
    """
    Anchor-free detector (FCOS-style).
    Directly predicts, per location: class, distances to the box edges, center-ness.
    """
    def __init__(self, num_classes=80):
        super().__init__()
        
        # Backbone + FPN (sketch only: FPN is assumed to be defined elsewhere, and the
        # backbone would need to expose intermediate feature maps rather than class
        # logits, e.g. via torchvision's IntermediateLayerGetter)
        self.backbone = models.resnet50(pretrained=True)
        self.fpn = FPN()
        
        # Detection heads (predictions at every location)
        self.cls_head = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, num_classes, 3, padding=1)  # classification
        )
        
        self.reg_head = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 4, 3, padding=1)  # regression (distances to box edges: l, t, r, b)
        )
        
        self.centerness_head = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 1, 3, padding=1)  # center-ness
        )
    
    def forward(self, x):
        features = self.backbone(x)
        fpn_features = self.fpn(features)
        
        cls_scores = []
        bbox_preds = []
        centernesses = []
        
        for feat in fpn_features:
            cls_scores.append(self.cls_head(feat))
            bbox_preds.append(self.reg_head(feat))
            centernesses.append(self.centerness_head(feat))
        
        return cls_scores, bbox_preds, centernesses
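
For reference, FCOS derives its center-ness target from the regression targets (l, t, r, b) as sqrt((min(l,r)/max(l,r)) * (min(t,b)/max(t,b))); a one-function sketch:

import torch

def centerness_target(ltrb):
    """
    ltrb: [N, 4] distances from a location to the box's left/top/right/bottom edges.
    Returns [N] in (0, 1]: 1 at the box center, approaching 0 near the edges.
    """
    l, t, r, b = ltrb.unbind(dim=1)
    lr = torch.min(l, r) / torch.max(l, r)
    tb = torch.min(t, b) / torch.max(t, b)
    return torch.sqrt(lr * tb)

print(centerness_target(torch.tensor([[10., 10., 10., 10.],     # exact center -> 1.0
                                      [ 2., 10., 18., 10.]])))  # off-center -> ~0.33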

5. NMS: Deduplication and Filtering

5.1 Why NMS Is Needed

Problem: the same object is often detected by several anchors, producing many overlapping boxes.

Detections (before deduplication):
┌─────────────────┐
│  ┌───┐          │
│  │cat│ 0.95     │  ← correct box
│  └───┘          │
│    ┌───┐        │
│    │cat│ 0.92   │  ← overlapping box (should be removed)
│    └───┘        │
│      ┌───┐      │
│      │cat│ 0.88 │  ← overlapping box (should be removed)
│      └───┘      │
└─────────────────┘

5.2 The NMS Algorithm

def nms(boxes, scores, iou_threshold=0.5):
    """
    Non-Maximum Suppression.
    
    boxes: [N, 4] (x1, y1, x2, y2)
    scores: [N] confidence scores
    iou_threshold: boxes with IoU above this are suppressed
    """
    # Sort by score, descending
    indices = scores.argsort(descending=True)
    
    keep = []  # kept boxes
    
    while len(indices) > 0:
        # Keep the current highest-scoring box
        current = indices[0]
        keep.append(current.item())
        
        if len(indices) == 1:
            break
        
        # IoU between the current box and the remaining boxes
        current_box = boxes[current].unsqueeze(0)  # [1, 4]
        other_boxes = boxes[indices[1:]]           # [M, 4]
        
        ious = compute_iou(current_box, other_boxes).squeeze(0)  # [M]
        
        # Keep only boxes whose IoU is at or below the threshold
        mask = ious <= iou_threshold
        indices = indices[1:][mask]
    
    return torch.tensor(keep)

def compute_iou(box1, box2):
    """
    IoU (intersection over union) between two sets of boxes.
    box1: [N, 4], box2: [M, 4]; returns [N, M].
    """
    # Intersection
    x1 = torch.max(box1[:, 0].unsqueeze(1), box2[:, 0])  # [N, M]
    y1 = torch.max(box1[:, 1].unsqueeze(1), box2[:, 1])
    x2 = torch.min(box1[:, 2].unsqueeze(1), box2[:, 2])
    y2 = torch.min(box1[:, 3].unsqueeze(1), box2[:, 3])
    
    intersection = torch.clamp(x2 - x1, min=0) * torch.clamp(y2 - y1, min=0)
    
    # Union
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # [N]
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # [M]
    
    union = area1.unsqueeze(1) + area2 - intersection
    
    return intersection / (union + 1e-6)

# Test NMS
boxes = torch.tensor([
    [100, 100, 200, 200],  # cat, 0.95
    [105, 105, 205, 205],  # cat, 0.92 (high IoU with the first box)
    [300, 300, 400, 400],  # dog, 0.90 (low IoU with the first box)
    [110, 110, 210, 210],  # cat, 0.88
], dtype=torch.float32)

scores = torch.tensor([0.95, 0.92, 0.90, 0.88])

keep = nms(boxes, scores, iou_threshold=0.5)
print(f"Kept box indices: {keep.numpy()}")  # [0, 2] (overlapping cat boxes suppressed, dog box kept)

5.3 Improvements on NMS

class SoftNMS:
    """
    Soft-NMS: decay the scores of overlapping boxes instead of deleting them.
    Better suited to crowded scenes.
    """
    def __call__(self, boxes, scores, sigma=0.5, score_thresh=0.001, method='gaussian'):
        scores = scores.clone()  # scores are modified in place below
        remaining = scores.argsort(descending=True)
        keep = []
        
        while len(remaining) > 0:
            # Keep the highest-scoring remaining box
            top = scores[remaining].argmax()
            current = remaining[top]
            keep.append(current.item())
            remaining = torch.cat([remaining[:top], remaining[top + 1:]])
            
            if len(remaining) == 0:
                break
            
            ious = compute_iou(boxes[current].unsqueeze(0), boxes[remaining]).squeeze(0)
            
            # Decay scores by IoU (instead of deleting outright)
            if method == 'linear':
                weight = torch.where(ious > 0.5, 1 - ious, torch.ones_like(ious))
            else:  # gaussian
                weight = torch.exp(-(ious ** 2) / sigma)
            
            scores[remaining] *= weight
            
            # Drop boxes whose decayed score falls below the threshold
            remaining = remaining[scores[remaining] > score_thresh]
        
        return torch.tensor(keep)

# DIoU-NMS: also considers the distance between box centers
def diou_nms(boxes, scores, iou_threshold=0.5):
    """
    DIoU = IoU - (center distance)^2 / (enclosing-box diagonal)^2
    Handles boxes that overlap without one containing the other.
    """
    # Same loop as nms, but with DIoU in place of IoU (a sketch follows below)
    pass
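
A minimal sketch of the DIoU computation and the corresponding suppression loop, reusing compute_iou from above; this follows the DIoU-NMS formula rather than any specific library implementation:

def compute_diou(box1, box2):
    """DIoU between box1 [N, 4] and box2 [M, 4]; returns [N, M]."""
    iou = compute_iou(box1, box2)
    # Squared distance between box centers
    cx1 = (box1[:, 0] + box1[:, 2]) / 2
    cy1 = (box1[:, 1] + box1[:, 3]) / 2
    cx2 = (box2[:, 0] + box2[:, 2]) / 2
    cy2 = (box2[:, 1] + box2[:, 3]) / 2
    d2 = (cx1.unsqueeze(1) - cx2) ** 2 + (cy1.unsqueeze(1) - cy2) ** 2
    # Squared diagonal of the smallest enclosing box
    ex1 = torch.min(box1[:, 0].unsqueeze(1), box2[:, 0])
    ey1 = torch.min(box1[:, 1].unsqueeze(1), box2[:, 1])
    ex2 = torch.max(box1[:, 2].unsqueeze(1), box2[:, 2])
    ey2 = torch.max(box1[:, 3].unsqueeze(1), box2[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return iou - d2 / (c2 + 1e-6)

def diou_nms_sketch(boxes, scores, diou_threshold=0.5):
    indices = scores.argsort(descending=True)
    keep = []
    while len(indices) > 0:
        current = indices[0]
        keep.append(current.item())
        if len(indices) == 1:
            break
        dious = compute_diou(boxes[current].unsqueeze(0), boxes[indices[1:]]).squeeze(0)
        indices = indices[1:][dious <= diou_threshold]
    return torch.tensor(keep)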

6. Evaluation Metric: mAP

6.1 The Basics: Precision and Recall

def compute_precision_recall(pred_boxes, pred_scores, gt_boxes, iou_thresh=0.5):
    """
    Precision and recall for a single image.
    
    pred_boxes: [N, 4] predicted boxes
    pred_scores: [N] prediction scores
    gt_boxes: [M, 4] ground-truth boxes
    """
    # Sort predictions by score
    sorted_indices = pred_scores.argsort(descending=True)
    pred_boxes = pred_boxes[sorted_indices]
    
    tp = 0  # true positives
    fp = 0  # false positives
    matched_gt = set()
    
    for pred in pred_boxes:
        # Find the best-matching ground-truth box
        best_iou = 0
        best_gt = -1
        
        for i, gt in enumerate(gt_boxes):
            if i in matched_gt:
                continue
            iou = compute_iou(pred.unsqueeze(0), gt.unsqueeze(0)).item()
            if iou > best_iou:
                best_iou = iou
                best_gt = i
        
        if best_iou >= iou_thresh and best_gt not in matched_gt:
            tp += 1
            matched_gt.add(best_gt)
        else:
            fp += 1
    
    fn = len(gt_boxes) - len(matched_gt)  # false negatives
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    return precision, recall

6.2 AP (Average Precision)

def compute_ap(precisions, recalls):
    """
    AP: area under the precision-recall curve,
    via 11-point interpolation (all-point interpolation sketched below).
    """
    # 11-point interpolation
    ap = 0
    for t in np.arange(0, 1.1, 0.1):
        # Max precision among points with recall >= t
        if np.sum(recalls >= t) == 0:
            p = 0
        else:
            p = np.max(precisions[recalls >= t])
        ap += p / 11
    
    return ap
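
The docstring above also mentions all-point interpolation (the PASCAL VOC 2010+ style; COCO uses a 101-point variant of the same idea); a sketch:

def compute_ap_all_points(precisions, recalls):
    """
    All-point interpolation: make the precision envelope monotonically
    non-increasing, then integrate it over recall.
    """
    # Add sentinel values at both ends
    p = np.concatenate(([0.0], precisions, [0.0]))
    r = np.concatenate(([0.0], recalls, [1.0]))
    
    # Precision envelope: p[i] = max(p[i], p[i+1], ...)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    
    # Sum rectangle areas where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])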

# Full mAP computation
def compute_map(all_predictions, all_ground_truths, num_classes, iou_threshold=0.5):
    """
    mAP (mean Average Precision).
    
    all_predictions: list of dicts, one per image: {class_id: [(bbox, score), ...]}
    all_ground_truths: list of dicts, one per image: {class_id: [bbox, ...]}
    """
    aps = []
    
    for class_id in range(num_classes):
        # Gather all predictions and ground truths for this class
        class_preds = []  # [(image_id, bbox, score), ...]
        class_gts = {}    # {image_id: [bbox, ...], ...}
        
        for img_id, preds in enumerate(all_predictions):
            if class_id in preds:
                for bbox, score in preds[class_id]:
                    class_preds.append((img_id, bbox, score))
            
            if class_id in all_ground_truths[img_id]:
                class_gts[img_id] = all_ground_truths[img_id][class_id]
        
        # Sort by score
        class_preds.sort(key=lambda x: x[2], reverse=True)
        
        # Build the precision-recall curve
        tp_cumsum = []
        fp_cumsum = []
        total_gt = sum(len(gts) for gts in class_gts.values())
        
        matched = {img_id: set() for img_id in class_gts.keys()}
        
        for img_id, bbox, score in class_preds:
            if img_id not in class_gts:
                fp_cumsum.append(1)
                tp_cumsum.append(0)
                continue
            
            gt_boxes = class_gts[img_id]
            best_iou = 0
            best_gt = -1
            
            for i, gt in enumerate(gt_boxes):
                if i in matched[img_id]:
                    continue
                iou = compute_iou(bbox.unsqueeze(0), gt.unsqueeze(0)).item()
                if iou > best_iou:
                    best_iou = iou
                    best_gt = i
            
            if best_iou >= iou_threshold and best_gt != -1:
                tp_cumsum.append(1)
                fp_cumsum.append(0)
                matched[img_id].add(best_gt)
            else:
                tp_cumsum.append(0)
                fp_cumsum.append(1)
        
        # Cumulative precision and recall
        tp_cumsum = np.cumsum(tp_cumsum)
        fp_cumsum = np.cumsum(fp_cumsum)
        
        precisions = tp_cumsum / (tp_cumsum + fp_cumsum)
        recalls = tp_cumsum / max(total_gt, 1)  # guard against classes with no ground truth
        
        # AP for this class
        ap = compute_ap(precisions, recalls)
        aps.append(ap)
    
    return np.mean(aps)  # mAP = mean of the per-class APs

6.3 COCO Evaluation Metrics

Metric       | Meaning                              | Use
mAP@0.5      | mAP at IoU = 0.5                     | lenient standard, quick evaluation
mAP@0.75     | mAP at IoU = 0.75                    | strict standard, localization quality
mAP@[.5:.95] | mAP averaged over IoU 0.5 to 0.95    | the COCO standard, overall evaluation
APs/APm/APl  | AP on small / medium / large objects | scale analysis
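
In practice these numbers come from pycocotools rather than hand-rolled code. A minimal usage sketch; the file paths are placeholders, and detections.json must hold the standard COCO results format ([{image_id, category_id, bbox, score}, ...]):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations/instances_val.json')  # ground truth (placeholder path)
coco_dt = coco_gt.loadRes('detections.json')      # detections (placeholder path)

evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints mAP@[.5:.95], mAP@0.5, mAP@0.75, APs/APm/APl, ...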

7. Hands-On: Training on a Custom Dataset

7.1 Dataset Format: COCO

# Example of the COCO annotation format
coco_format = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "height": 480, "width": 640},
        # ...
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,  # 1=猫
            "bbox": [100, 200, 50, 80],  # [x, y, width, height]
            "area": 4000,
            "iscrowd": 0
        },
        # ...
    ],
    "categories": [
        {"id": 1, "name": "cat"},
        {"id": 2, "name": "dog"},
        # ...
    ]
}

7.2 A Custom Dataset Class

from PIL import Image
from torchvision import transforms  # used by DetectionTransform below
import json
import os

class COCODetectionDataset(torch.utils.data.Dataset):
    """
    Object-detection dataset in COCO format.
    """
    def __init__(self, root_dir, annotation_file, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        
        # Load the annotations
        with open(annotation_file) as f:
            self.coco_data = json.load(f)
        
        self.images = {img['id']: img for img in self.coco_data['images']}
        self.categories = {cat['id']: cat for cat in self.coco_data['categories']}
        
        # Group annotations by image
        self.annotations = {}
        for ann in self.coco_data['annotations']:
            img_id = ann['image_id']
            if img_id not in self.annotations:
                self.annotations[img_id] = []
            self.annotations[img_id].append(ann)
        
        self.image_ids = list(self.images.keys())
    
    def __len__(self):
        return len(self.image_ids)
    
    def __getitem__(self, idx):
        img_id = self.image_ids[idx]
        img_info = self.images[img_id]
        
        # Load the image
        img_path = os.path.join(self.root_dir, img_info['file_name'])
        image = Image.open(img_path).convert('RGB')
        
        # Load its annotations
        anns = self.annotations.get(img_id, [])
        
        boxes = []
        labels = []
        
        for ann in anns:
            # COCO format: [x, y, width, height] → [x1, y1, x2, y2]
            x, y, w, h = ann['bbox']
            boxes.append([x, y, x + w, y + h])
            labels.append(ann['category_id'])
        
        boxes = torch.tensor(boxes, dtype=torch.float32)
        labels = torch.tensor(labels, dtype=torch.long)
        
        # Build the target dict
        target = {
            'boxes': boxes,
            'labels': labels,
            'image_id': torch.tensor([img_id]),
            'area': torch.tensor([ann['area'] for ann in anns]),
            'iscrowd': torch.tensor([ann.get('iscrowd', 0) for ann in anns])
        }
        
        if self.transform:
            image, target = self.transform(image, target)
        
        return image, target

# Data augmentation (transform the image and its boxes together)
class DetectionTransform:
    def __init__(self, image_size=800):
        self.image_size = image_size
    
    def __call__(self, image, target):
        # Record the original size before resizing (PIL gives (width, height))
        orig_w, orig_h = image.size
        
        # Image transforms
        image = transforms.Resize((self.image_size, self.image_size))(image)
        image = transforms.ToTensor()(image)
        image = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])(image)
        
        # Rescale the box coordinates to match
        scale_x = self.image_size / orig_w
        scale_y = self.image_size / orig_h
        
        boxes = target['boxes']
        boxes[:, [0, 2]] *= scale_x
        boxes[:, [1, 3]] *= scale_y
        target['boxes'] = boxes
        
        return image, target
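
One practical detail before training: images in a batch carry different numbers of boxes, so PyTorch's default collation fails. torchvision's detection models instead take lists of images and targets; a minimal DataLoader setup (paths are placeholders):

def detection_collate(batch):
    # Keep images and targets as lists instead of stacking into one tensor,
    # since each image has a different number of boxes
    return tuple(zip(*batch))

dataset = COCODetectionDataset(
    root_dir='data/images',                   # placeholder path
    annotation_file='data/annotations.json',  # placeholder path
    transform=DetectionTransform(image_size=800)
)

train_loader = torch.utils.data.DataLoader(
    dataset, batch_size=4, shuffle=True,
    collate_fn=detection_collate
)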

7.3 Training Faster R-CNN with torchvision

from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_detection_model(num_classes):
    """
    Load a pretrained Faster R-CNN and replace its classification head.
    """
    # Pretrained on COCO (91 classes)
    model = fasterrcnn_resnet50_fpn(pretrained=True)
    
    # Replace the box predictor
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    
    return model

# Training setup
def train_detection(model, train_loader, num_epochs=10):
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    
    # Optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
    
    # Learning-rate schedule
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
    
    for epoch in range(num_epochs):
        model.train()
        
        for images, targets in train_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            
            # Forward pass (returns a dict of losses in training mode)
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
            
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
        
        lr_scheduler.step()
        
        print(f"Epoch {epoch}: loss={losses.item():.4f}")  # loss of the last batch
    
    return model

# Inference
def detect(model, image, device, conf_thresh=0.5):
    model.eval()
    with torch.no_grad():
        prediction = model([image.to(device)])
    
    # Post-processing
    boxes = prediction[0]['boxes'].cpu()
    scores = prediction[0]['scores'].cpu()
    labels = prediction[0]['labels'].cpu()
    
    # Confidence filtering
    keep = scores >= conf_thresh
    boxes = boxes[keep]
    scores = scores[keep]
    labels = labels[keep]
    
    # Extra class-agnostic NMS (torchvision's detector already runs per-class NMS internally)
    keep_indices = nms(boxes, scores, iou_threshold=0.5)
    
    return boxes[keep_indices], scores[keep_indices], labels[keep_indices]

7.4 Visualizing Detections

import matplotlib.patches as patches

def visualize_detection(image, boxes, scores, labels, class_names, threshold=0.5):
    """
    Draw detection results on the image.
    """
    fig, ax = plt.subplots(1, figsize=(12, 8))
    ax.imshow(image.permute(1, 2, 0).cpu().numpy())
    
    colors = plt.cm.tab20(np.linspace(0, 1, len(class_names)))
    
    for box, score, label in zip(boxes, scores, labels):
        if score < threshold:
            continue
        
        x1, y1, x2, y2 = box.tolist()  # plain floats for matplotlib
        width = x2 - x1
        height = y2 - y1
        
        label = int(label)
        color = colors[label % len(colors)]
        
        # Box
        rect = patches.Rectangle((x1, y1), width, height, 
                                linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        
        # Label
        label_text = f"{class_names[label]}: {score:.2f}"
        ax.text(x1, y1 - 5, label_text, color='white', fontsize=10,
                bbox=dict(facecolor=color, alpha=0.7, edgecolor='none'))
    
    ax.axis('off')
    plt.tight_layout()
    plt.show()

# Usage
# class_names = ['background', 'cat', 'dog', 'person', ...]
# visualize_detection(image, boxes, scores, labels, class_names)

8. Summary

Topic               | Key points
Two-stage detection | RPN proposals + ROI classification; high accuracy, lower speed
One-stage detection | Regress boxes and classes directly; fast, slightly lower accuracy
Anchors             | Predefined multi-scale candidate boxes; foreground/background classification + offset regression
Anchor-free         | Predict center point and box extent directly; simpler
RPN                 | Small network slid over the feature map; outputs proposals
ROI Pooling         | Brings ROIs of different sizes to one fixed size
NMS                 | Deduplication: keep the highest-scoring box, suppress overlapping ones
Soft-NMS            | Decays scores instead of deleting; suits crowded scenes
mAP                 | Combines precision and recall; the COCO standard metric
COCO format         | Images + annotations (bbox stored as [x, y, w, h])

Exercises

  1. Anchor design: design anchor sizes and aspect ratios tuned to your dataset (e.g. one dominated by small objects) and compare against the default settings.

  2. NMS tuning: compare NMS, Soft-NMS, and DIoU-NMS on a crowded scene (e.g. pedestrians) and find the best IoU threshold.

  3. YOLOv8 training: train Ultralytics YOLOv8 on a custom dataset and compare the speed-accuracy trade-off across model sizes (n/s/m/l/x).

  4. mAP computation: implement mAP by hand and validate it against pycocotools.

  5. Challenge: implement TTFNet (Training-Time-Friendly Network) and study how it balances training speed against inference speed.

