Lecture 18: Object Detection: Find It and Box It
1. Object Detection: From Classification to Localization
1.1 Defining the Detection Task
Object detection = image classification + object localization
- Classification: what is inside the box? (cat / dog / person)
- Localization: where is the box? (top-left x, y, width w, height h)
Input image
↓
[Detection model]
↓
Output: [N, 6] detection results
[x1, y1, x2, y2, class, confidence]
For example:
[100, 200, 300, 400, "cat", 0.95] ← top-left (100,200), bottom-right (300,400), cat, 95% confidence
[500, 100, 700, 600, "dog", 0.88] ← a second detection: a dog
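A minimal sketch of the usual first post-processing step on such an [N, 6] tensor, filtering by confidence (the boxes and class ids below are made-up numbers for illustration):

import torch

# Hypothetical [N, 6] detections: (x1, y1, x2, y2, class_id, confidence)
dets = torch.tensor([
    [100., 200., 300., 400., 0., 0.95],  # cat
    [500., 100., 700., 600., 1., 0.88],  # dog
    [ 50.,  60.,  80.,  90., 0., 0.30],  # low-confidence cat
])
keep = dets[dets[:, 5] >= 0.5]  # drop low-confidence detections
print(keep.shape)               # torch.Size([2, 6])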
1.2 Why Detection Is Hard
| Challenge | Description | Common remedies |
|---|---|---|
| Multi-scale | Object sizes vary widely (a distant person vs. a nearby car) | Feature pyramids (FPN), multi-scale training |
| Dense scenes | Many overlapping, occluded objects | NMS, Soft-NMS, DIoU-NMS |
| Real-time constraints | Autonomous driving needs 30+ FPS | One-stage detectors, lightweight models |
| Small objects | Distant small objects are hard to detect | High-resolution features, context fusion |
| Class imbalance | Background boxes vastly outnumber object boxes | Focal Loss (sketched right after this table), OHEM |
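Since the last row points at Focal Loss for class imbalance, here is a minimal sketch of the idea: down-weight easy, mostly-background examples so they do not dominate the loss. The alpha/gamma defaults follow the RetinaNet paper.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits: [N] raw foreground/background scores; targets: [N] in {0, 1}
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss on well-classified (easy) examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()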
2. Two-Stage Detection: Faster R-CNN
2.1 How the Family Evolved
| | R-CNN (2014) | Fast R-CNN (2015) | Faster R-CNN (2015) |
|---|---|---|---|
| Pipeline | 1. Selective Search yields ~2000 region proposals 2. Each region runs through the CNN separately 3. SVM classification 4. Bounding-box regression | 1. The whole image runs through the CNN once 2. ROI Pooling extracts per-ROI features 3. Joint classification + regression | 1. CNN extracts features 2. RPN generates proposals 3. ROI Pooling + heads (trained end to end) |
| Speed | ~50 s/image | ~2 s/image | ~0.2 s/image |
| Accuracy | mAP 53.3% | mAP 68.8% | mAP 73.2% |
2.2 The Faster R-CNN Architecture in Detail
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision.ops import RoIPool, nms
class FasterRCNN(nn.Module):
    """
    Simplified Faster R-CNN sketch.
    Core components: Backbone + RPN + ROI Pooling + Head.
    Helper methods (_sample_proposals, _compute_loss, _postprocess) are elided.
    """
    def __init__(self, num_classes=21):
        super().__init__()
        # 1. Backbone: feature extraction
        resnet = models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3
        )
        # Output features: 1/16 resolution, 1024 channels
        # 2. RPN: Region Proposal Network
        self.rpn = RegionProposalNetwork(in_channels=1024)
        # 3. ROI Pooling: pool every ROI to a fixed size
        self.roi_pool = RoIPool(output_size=(7, 7), spatial_scale=1/16)
        # 4. Head: classification + regression
        self.fc = nn.Sequential(
            nn.Linear(1024 * 7 * 7, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5)
        )
        self.classifier = nn.Linear(4096, num_classes)          # classification
        self.bbox_regressor = nn.Linear(4096, num_classes * 4)  # regression (one box per class)

    def forward(self, images, targets=None):
        """
        Training: targets are required to compute the RPN and head losses.
        Inference: only the detections are returned.
        """
        # Extract features
        features = self.backbone(images)  # [N, 1024, H/16, W/16]
        # RPN generates proposals
        proposals, rpn_losses = self.rpn(features, images.shape[2:], targets)
        # ROI Pooling (RoIPool expects a list of per-image [K, 4] box tensors)
        if self.training:
            # Training: sample a mix of proposals and ground-truth boxes
            proposals = self._sample_proposals(proposals, targets)
        pooled_features = self.roi_pool(features, proposals)  # [num_rois, 1024, 7, 7]
        pooled_features = pooled_features.view(pooled_features.size(0), -1)
        # Head
        fc_out = self.fc(pooled_features)
        class_logits = self.classifier(fc_out)     # [num_rois, num_classes]
        bbox_deltas = self.bbox_regressor(fc_out)  # [num_rois, num_classes*4]
        if self.training:
            # Compute losses
            losses = self._compute_loss(class_logits, bbox_deltas, proposals, targets)
            losses.update(rpn_losses)
            return losses
        # Inference: post-process into final detections
        detections = self._postprocess(class_logits, bbox_deltas, proposals)
        return detections
class RegionProposalNetwork(nn.Module):
    """
    RPN: slide a small network over the feature map to predict proposals.
    (_compute_rpn_loss and _filter_proposals are elided in this sketch.)
    """
    def __init__(self, in_channels=1024, num_anchors=9):
        super().__init__()
        # 3x3 conv to extract RPN features
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)
        # Classification branch: foreground / background (2 classes)
        self.cls_logits = nn.Conv2d(512, num_anchors * 2, 1)
        # Regression branch: box offsets (4 parameters)
        self.bbox_pred = nn.Conv2d(512, num_anchors * 4, 1)
        # Anchor generator
        self.anchor_generator = AnchorGenerator(
            sizes=[128, 256, 512],
            aspect_ratios=[0.5, 1.0, 2.0]
        )

    def forward(self, features, image_size, targets=None):
        # features: [N, 1024, H, W]
        x = torch.relu(self.conv(features))
        # Classification: is each anchor foreground or background?
        cls_logits = self.cls_logits(x)  # [N, num_anchors*2, H, W]
        # Regression: offsets for each anchor
        bbox_pred = self.bbox_pred(x)    # [N, num_anchors*4, H, W]
        # Generate anchors
        anchors = self.anchor_generator(features, image_size)
        # Compute the RPN loss during training
        rpn_losses = {}
        if self.training and targets is not None:
            rpn_losses = self._compute_rpn_loss(cls_logits, bbox_pred, anchors, targets)
        # Filter proposals by score (needed in both training and inference)
        proposals = self._filter_proposals(cls_logits, bbox_pred, anchors)
        return proposals, rpn_losses
class AnchorGenerator:
    """
    Generates multi-scale, multi-aspect-ratio anchor boxes.
    """
    def __init__(self, sizes=[128, 256, 512], aspect_ratios=[0.5, 1.0, 2.0]):
        self.sizes = sizes
        self.aspect_ratios = aspect_ratios
        self.num_anchors = len(sizes) * len(aspect_ratios)
        # Precompute the anchor templates
        self.cell_anchors = self._generate_cell_anchors()

    def _generate_cell_anchors(self):
        anchors = []
        for size in self.sizes:
            for ratio in self.aspect_ratios:
                w = size * np.sqrt(ratio)
                h = size / np.sqrt(ratio)
                anchors.append([-w/2, -h/2, w/2, h/2])
        return torch.tensor(anchors, dtype=torch.float32)  # [num_anchors, 4]

    def __call__(self, features, image_size):
        # Place the templates at every feature-map location, shifted to
        # image coordinates via the stride (a minimal dense-grid version)
        feat_h, feat_w = features.shape[-2:]
        stride_h = image_size[0] / feat_h
        stride_w = image_size[1] / feat_w
        centers_y = (torch.arange(feat_h) + 0.5) * stride_h
        centers_x = (torch.arange(feat_w) + 0.5) * stride_w
        cy, cx = torch.meshgrid(centers_y, centers_x, indexing='ij')
        shifts = torch.stack([cx, cy, cx, cy], dim=-1).reshape(-1, 1, 4)
        return (shifts + self.cell_anchors).reshape(-1, 4)  # [H*W*num_anchors, 4]
2.3 The Core Idea Behind the RPN
Analogy: the RPN is like casting fishing nets over a map.
- First cast nets of different sizes and shapes (anchors)
- Decide which nets contain fish (foreground/background classification)
- Adjust each net's position and size (box regression)
- Finally keep only the nets that caught fish (NMS filtering)
At every position on the feature map:
↓
Generate K anchors (e.g. K = 9: 3 sizes × 3 aspect ratios)
↓
Classification branch: is each anchor foreground or background?
↓
Regression branch: refine the anchor's position and size
↓
Filtering: keep only high-confidence foreground anchors as proposals
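To make the foreground/background step concrete, here is a minimal sketch of the IoU-based labeling rule commonly used to train the RPN. The 0.7/0.3 thresholds follow the Faster R-CNN paper; compute_iou is the helper defined in section 5.2 below.

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    # anchors: [A, 4], gt_boxes: [G, 4]
    # returns per-anchor labels: 1 = foreground, 0 = background, -1 = ignore
    ious = compute_iou(anchors, gt_boxes)  # [A, G]
    max_iou, _ = ious.max(dim=1)           # best overlap per anchor
    labels = torch.full((len(anchors),), -1, dtype=torch.long)
    labels[max_iou < neg_thresh] = 0       # clearly background
    labels[max_iou >= pos_thresh] = 1      # clearly foreground
    labels[ious.argmax(dim=0)] = 1         # best anchor per GT is always positive
    return labels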
3. One-Stage Detection: YOLO and SSD
3.1 YOLO: You Only Look Once
Core idea: treat detection as a regression problem. A single forward pass directly outputs the location and class of every box.
class YOLOv1(nn.Module):
    """
    Simplified YOLOv1 (the original grid-prediction formulation).
    The image is divided into an S×S grid; each cell predicts B boxes.
    """
    def __init__(self, S=7, B=2, num_classes=20):
        super().__init__()
        self.S = S  # grid size
        self.B = B  # boxes per cell
        self.num_classes = num_classes
        # Backbone (VGG-like); most conv layers are elided here, so the
        # channel counts below only line up once they are filled in
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 192, 3, padding=1),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
            # ... more conv layers ...
            nn.Conv2d(512, 1024, 3, padding=1),
            nn.LeakyReLU(0.1),
        )
        # Fully connected output layers
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),
            nn.Linear(4096, S * S * (B * 5 + num_classes))
            # Each cell outputs: B*(x, y, w, h, confidence) + num_classes
        )

    def forward(self, x):
        # x: [N, 3, 448, 448]
        features = self.backbone(x)  # [N, 1024, S, S]
        out = self.fc(features)      # [N, S*S*(B*5+C)]
        out = out.view(-1, self.S, self.S, self.B * 5 + self.num_classes)
        # Output layout:
        # out[..., :B*5] = box parameters (5 numbers per box: x, y, w, h, conf)
        # out[..., B*5:] = class probabilities
        return out
# Decoding the YOLO output
def decode_yolo_output(output, S=7, B=2, num_classes=20, conf_thresh=0.5):
    """
    Parse the raw YOLO output tensor into a list of detections.
    """
    batch_size = output.size(0)
    predictions = []
    for b in range(batch_size):
        for i in range(S):
            for j in range(S):
                grid_output = output[b, i, j]  # [B*5 + C]
                for box_idx in range(B):
                    # Extract the box parameters
                    start = box_idx * 5
                    x = grid_output[start]
                    y = grid_output[start + 1]
                    w = grid_output[start + 2]
                    h = grid_output[start + 3]
                    conf = torch.sigmoid(grid_output[start + 4])
                    if conf > conf_thresh:
                        # Convert to normalized image coordinates
                        x = (j + x) / S  # cell index + offset, normalized
                        y = (i + y) / S
                        w = w ** 2  # YOLOv1 regresses sqrt(w), sqrt(h); square to recover
                        h = h ** 2
                        # Class prediction
                        class_scores = grid_output[B*5:]
                        class_id = class_scores.argmax()
                        class_conf = torch.softmax(class_scores, dim=0)[class_id]
                        predictions.append({
                            'bbox': [x - w/2, y - h/2, x + w/2, y + h/2],
                            'confidence': conf.item(),
                            'class_id': class_id.item(),
                            'class_conf': class_conf.item()
                        })
    return predictions
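For completeness, here is a heavily simplified YOLOv1-style loss sketch. It assumes B = 1 and that targets are pre-encoded into the same S×S×(5+C) grid layout as the output; the paper's "responsible box" selection among the B predictors is omitted, so treat this as an illustration of the loss terms, not a faithful reimplementation.

import torch
import torch.nn.functional as F

def yolo_v1_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
    # pred / target: [N, S, S, 5 + C] with layout (x, y, w, h, conf, classes...)
    obj_mask = target[..., 4] == 1  # cells that contain an object
    noobj_mask = ~obj_mask
    # Localization: MSE on x, y and on sqrt(w), sqrt(h), object cells only
    xy_loss = F.mse_loss(pred[..., :2][obj_mask], target[..., :2][obj_mask], reduction='sum')
    wh_loss = F.mse_loss(pred[..., 2:4][obj_mask].clamp(min=0).sqrt(),
                         target[..., 2:4][obj_mask].sqrt(), reduction='sum')
    # Confidence: push object cells toward 1, empty cells toward 0
    obj_conf = F.mse_loss(pred[..., 4][obj_mask], target[..., 4][obj_mask], reduction='sum')
    noobj_conf = F.mse_loss(pred[..., 4][noobj_mask], target[..., 4][noobj_mask], reduction='sum')
    # Classification, object cells only
    cls_loss = F.mse_loss(pred[..., 5:][obj_mask], target[..., 5:][obj_mask], reduction='sum')
    return lambda_coord * (xy_loss + wh_loss) + obj_conf + lambda_noobj * noobj_conf + cls_loss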
3.2 SSD: Single Shot MultiBox Detector
class SSD(nn.Module):
    """
    SSD: prediction from multi-scale feature maps.
    Feature maps at different resolutions detect objects of different sizes.
    """
    def __init__(self, num_classes=21):
        super().__init__()
        self.num_classes = num_classes
        # Backbone (VGG)
        self.backbone = models.vgg16(pretrained=True).features
        # Extra feature layers (progressively smaller, for larger objects)
        self.extras = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(512, 1024, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(1024, 1024, 1),
                nn.ReLU()
            ),
            nn.Sequential(
                nn.Conv2d(1024, 256, 1),
                nn.ReLU(),
                nn.Conv2d(256, 512, 3, stride=2, padding=1),
                nn.ReLU()
            ),
            # ... more layers ...
        ])
        # Detection heads: predict boxes and classes at every feature-map position
        self.loc_layers = nn.ModuleList()   # regression branch
        self.conf_layers = nn.ModuleList()  # classification branch
        # Configure one head pair per source feature map
        # (assumes the elided extra layers complete all six source maps)
        channels = [512, 1024, 512, 256, 256, 256]  # channels per source layer
        num_priors = [4, 6, 6, 6, 4, 4]             # anchors per position per layer
        for in_channels, num_prior in zip(channels, num_priors):
            self.loc_layers.append(
                nn.Conv2d(in_channels, num_prior * 4, 3, padding=1)
            )
            self.conf_layers.append(
                nn.Conv2d(in_channels, num_prior * num_classes, 3, padding=1)
            )
        # Anchor (prior box) generation
        self.prior_boxes = self._generate_priors()

    def forward(self, x):
        # Backbone features
        sources = []
        for i, layer in enumerate(self.backbone):
            x = layer(x)
            if i in [22, 29]:  # conv4_3 and conv5_3 ReLU outputs in VGG16
                sources.append(x)
        # Extra feature layers
        for extra in self.extras:
            x = extra(x)
            sources.append(x)
        # Predict on every feature map
        loc = []   # per map: [batch, H, W, num_priors*4] after permute
        conf = []  # per map: [batch, H, W, num_priors*num_classes] after permute
        for source, loc_layer, conf_layer in zip(sources, self.loc_layers, self.conf_layers):
            loc.append(loc_layer(source).permute(0, 2, 3, 1).contiguous())
            conf.append(conf_layer(source).permute(0, 2, 3, 1).contiguous())
        # Concatenate all predictions
        loc = torch.cat([o.view(o.size(0), -1) for o in loc], dim=1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], dim=1)
        # Reshape
        loc = loc.view(loc.size(0), -1, 4)
        conf = conf.view(conf.size(0), -1, self.num_classes)
        return loc, conf, self.prior_boxes

    def _generate_priors(self):
        # Minimal SSD300-style prior boxes in normalized (cx, cy, w, h);
        # the feature-map sizes and scales below are illustrative defaults
        feat_sizes = [38, 19, 10, 5, 3, 1]
        scales = [0.1, 0.2, 0.375, 0.55, 0.725, 0.9, 1.05]
        ratio_sets = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]  # yields 4 or 6 priors per cell
        priors = []
        for k, f in enumerate(feat_sizes):
            s, s_next = scales[k], scales[k + 1]
            for i in range(f):
                for j in range(f):
                    cx, cy = (j + 0.5) / f, (i + 0.5) / f
                    priors.append([cx, cy, s, s])  # aspect ratio 1
                    priors.append([cx, cy, (s * s_next) ** 0.5, (s * s_next) ** 0.5])  # extra scale
                    for r in ratio_sets[k]:
                        priors.append([cx, cy, s * r ** 0.5, s / r ** 0.5])
                        priors.append([cx, cy, s / r ** 0.5, s * r ** 0.5])
        return torch.tensor(priors).clamp(0, 1)
3.3 YOLO vs. SSD vs. Faster R-CNN
| Property | YOLO (v3/v5/v8) | SSD | Faster R-CNN |
|---|---|---|---|
| Speed | ⚡ Fastest (~140 FPS) | Fast (~60 FPS) | Slow (~10 FPS) |
| Accuracy | High (mAP ~50) | Medium-high (mAP ~46) | Highest (mAP ~42-75, depending on backbone) |
| Small objects | Fair (improved in v8) | Poor (relies on shallow layers) | Good (multi-scale ROIs) |
| Localization | Medium-high | Medium | High (two-stage refinement) |
| Training difficulty | Easy | Moderate | Complex |
| Best fit | Real-time detection | Balanced scenarios | Accuracy-first |
4. The Anchor Mechanism: Pre-Defined Candidate Boxes
4.1 What Is an Anchor?
Analogy: anchors are like fish hooks tied before you go fishing. A variety of sizes and shapes is prepared in advance, so all that remains is deciding which hook has a fish on it.
import numpy as np
import matplotlib.pyplot as plt

def generate_anchors(base_size=16, ratios=[0.5, 1, 2], scales=[8, 16, 32]):
    """
    Generate anchor templates.
    base_size: base grid size on the feature map
    ratios: aspect ratios
    scales: scale multipliers
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Compute the anchor's width and height:
            # area = (base_size * scale)^2
            # w = sqrt(area * ratio), h = sqrt(area / ratio)
            area = (base_size * scale) ** 2
            w = np.sqrt(area * ratio)
            h = np.sqrt(area / ratio)
            # Centered at the origin
            x1 = -w / 2
            y1 = -h / 2
            x2 = w / 2
            y2 = h / 2
            anchors.append([x1, y1, x2, y2])
    return np.array(anchors)

# Visualize the anchors
anchors = generate_anchors()
fig, ax = plt.subplots(figsize=(10, 10))
colors = plt.cm.viridis(np.linspace(0, 1, len(anchors)))
for i, anchor in enumerate(anchors):
    x1, y1, x2, y2 = anchor
    width = x2 - x1
    height = y2 - y1
    rect = plt.Rectangle((x1, y1), width, height,
                         fill=False, edgecolor=colors[i], linewidth=2,
                         label=f'Ratio={["0.5","1.0","2.0"][i%3]}, Scale={[8,16,32][i//3]}')
    ax.add_patch(rect)
ax.set_xlim(-400, 400)
ax.set_ylim(-400, 400)
ax.set_aspect('equal')
ax.axhline(y=0, color='k', linestyle='-', alpha=0.3)
ax.axvline(x=0, color='k', linestyle='-', alpha=0.3)
ax.set_title('Anchor Templates (Centered at Origin)')
ax.legend(loc='upper right', fontsize=8)
plt.show()
print(f"Generated {len(anchors)} anchor templates")
print("Shape:", anchors.shape)
4.2 Mapping Anchors onto the Image

def map_anchors_to_image(anchors, feature_map_size, image_size):
    """
    Place the anchor templates at every feature-map position in image coordinates.
    """
    feat_h, feat_w = feature_map_size
    img_h, img_w = image_size
    # The stride: how many image pixels one feature-map cell covers
    stride_h = img_h / feat_h
    stride_w = img_w / feat_w
    all_anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Center of the current cell in image coordinates
            center_y = i * stride_h + stride_h / 2
            center_x = j * stride_w + stride_w / 2
            for anchor in anchors:
                # Shift the anchor template to this position
                x1 = center_x + anchor[0]
                y1 = center_y + anchor[1]
                x2 = center_x + anchor[2]
                y2 = center_y + anchor[3]
                all_anchors.append([x1, y1, x2, y2])
    return np.array(all_anchors)

# Example: 32x32 feature map, 512x512 image
anchors = generate_anchors(base_size=16)
mapped_anchors = map_anchors_to_image(anchors, (32, 32), (512, 512))
print(f"Feature map size: 32x32")
print(f"Anchor templates: {len(anchors)}")
print(f"Total anchors after mapping: {len(mapped_anchors)}")  # 32*32*9 = 9216
print(f"First anchor: {mapped_anchors[0]}")
4.3 Anchor-Free: The Next Generation of Detectors
Trend: modern detectors such as YOLOv8 and FCOS drop anchors entirely and directly predict center points and box extents.
class AnchorFreeDetector(nn.Module):
    """
    Anchor-free detector (FCOS-style sketch).
    Directly predicts class, distances to the box edges, and center-ness.
    Assumes an FPN module is defined elsewhere and that the backbone is
    wired to expose multi-scale feature maps (e.g. via torchvision's
    feature-extraction utilities) rather than classification logits.
    """
    def __init__(self, num_classes=80):
        super().__init__()
        # Backbone + FPN
        self.backbone = models.resnet50(pretrained=True)
        self.fpn = FPN()
        # Detection heads (predicted at every position)
        self.cls_head = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, num_classes, 3, padding=1)  # classification
        )
        self.reg_head = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 4, 3, padding=1)  # regression (distances to edges: l, t, r, b)
        )
        self.centerness_head = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 1, 3, padding=1)  # center-ness
        )

    def forward(self, x):
        features = self.backbone(x)
        fpn_features = self.fpn(features)
        cls_scores = []
        bbox_preds = []
        centernesses = []
        for feat in fpn_features:
            cls_scores.append(self.cls_head(feat))
            bbox_preds.append(self.reg_head(feat))
            centernesses.append(self.centerness_head(feat))
        return cls_scores, bbox_preds, centernesses
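FCOS trains the center-ness head to predict how centered a location is inside its box, and the training target comes directly from the regression distances (l, t, r, b):

def centerness_target(ltrb):
    # ltrb: [N, 4] distances from each location to the box's left/top/right/bottom edges
    l, t, r, b = ltrb.unbind(dim=1)
    lr = torch.min(l, r) / torch.max(l, r)
    tb = torch.min(t, b) / torch.max(t, b)
    return (lr * tb).sqrt()  # 1 at the box center, approaching 0 toward the edges

At inference time, center-ness is multiplied into the classification score, which suppresses low-quality boxes predicted far from object centers.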
5. NMS: Deduplication and Filtering
5.1 Why NMS Is Needed
Problem: the same object is usually detected by several anchors, producing many overlapping boxes.
Raw detections (before deduplication):
┌─────────────────┐
│ ┌───┐           │
│ │cat│ 0.95      │ ← correct box
│ └───┘           │
│ ┌───┐           │
│ │cat│ 0.92      │ ← overlapping box (should be removed)
│ └───┘           │
│ ┌───┐           │
│ │cat│ 0.88      │ ← overlapping box (should be removed)
│ └───┘           │
└─────────────────┘
5.2 The NMS Algorithm

def nms(boxes, scores, iou_threshold=0.5):
    """
    Non-Maximum Suppression.
    boxes: [N, 4] (x1, y1, x2, y2)
    scores: [N] confidences
    iou_threshold: boxes with IoU above this are suppressed
    (Note: this definition shadows the nms imported from torchvision.ops in 2.2.)
    """
    # Sort by score, descending
    indices = scores.argsort(descending=True)
    keep = []  # boxes to keep
    while len(indices) > 0:
        # Keep the current highest-scoring box
        current = indices[0]
        keep.append(current.item())
        if len(indices) == 1:
            break
        # IoU of the current box against the rest
        current_box = boxes[current].unsqueeze(0)  # [1, 4]
        other_boxes = boxes[indices[1:]]           # [M, 4]
        ious = compute_iou(current_box, other_boxes)  # [1, M]
        # Keep only boxes whose IoU is below the threshold
        mask = (ious <= iou_threshold).squeeze(0)
        indices = indices[1:][mask]
    return torch.tensor(keep)

def compute_iou(box1, box2):
    """
    IoU (intersection over union) between two sets of boxes.
    box1: [N, 4], box2: [M, 4]; returns [N, M].
    """
    # Intersection
    x1 = torch.max(box1[:, 0].unsqueeze(1), box2[:, 0])  # [N, M]
    y1 = torch.max(box1[:, 1].unsqueeze(1), box2[:, 1])
    x2 = torch.min(box1[:, 2].unsqueeze(1), box2[:, 2])
    y2 = torch.min(box1[:, 3].unsqueeze(1), box2[:, 3])
    intersection = torch.clamp(x2 - x1, min=0) * torch.clamp(y2 - y1, min=0)
    # Union
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # [N]
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # [M]
    union = area1.unsqueeze(1) + area2 - intersection
    return intersection / (union + 1e-6)

# Testing NMS
boxes = torch.tensor([
    [100, 100, 200, 200],  # cat, 0.95
    [105, 105, 205, 205],  # cat, 0.92 (high IoU with the first box)
    [300, 300, 400, 400],  # dog, 0.90 (low IoU with the first box)
    [110, 110, 210, 210],  # cat, 0.88
], dtype=torch.float32)
scores = torch.tensor([0.95, 0.92, 0.90, 0.88])
keep = nms(boxes, scores, iou_threshold=0.5)
print(f"Kept box indices: {keep.numpy()}")  # [0, 2] (duplicate cat boxes removed, dog kept)
5.3 Improvements on NMS

class SoftNMS:
    """
    Soft-NMS: decay the scores of overlapping boxes instead of deleting them.
    Better suited to dense scenes.
    """
    def __call__(self, boxes, scores, sigma=0.5, score_thresh=0.001, method='gaussian'):
        scores = scores.clone()
        remaining = torch.arange(len(scores))
        keep = []
        while len(remaining) > 0:
            # Pick the highest-scoring box still in play
            top = scores[remaining].argmax()
            current = remaining[top]
            keep.append(current.item())
            remaining = torch.cat([remaining[:top], remaining[top + 1:]])
            if len(remaining) == 0:
                break
            ious = compute_iou(boxes[current].unsqueeze(0), boxes[remaining]).squeeze(0)
            # Decay scores based on IoU (instead of hard deletion)
            if method == 'linear':
                weight = torch.where(ious > 0.5, 1 - ious, torch.ones_like(ious))
            else:  # gaussian
                weight = torch.exp(-(ious ** 2) / sigma)
            scores[remaining] *= weight
            # Drop boxes whose score has decayed below the threshold
            remaining = remaining[scores[remaining] > score_thresh]
        return torch.tensor(keep)

# DIoU-NMS: also considers center-point distance
def diou_nms(boxes, scores, iou_threshold=0.5):
    """
    DIoU = IoU - (center distance)^2 / (enclosing-box diagonal)^2
    Handles boxes that overlap without one containing the other better than plain IoU.
    """
    indices = scores.argsort(descending=True)
    keep = []
    while len(indices) > 0:
        current = indices[0]
        keep.append(current.item())
        if len(indices) == 1:
            break
        cur = boxes[current].unsqueeze(0)
        others = boxes[indices[1:]]
        ious = compute_iou(cur, others).squeeze(0)
        # Squared distance between box centers
        cur_c = (cur[:, :2] + cur[:, 2:]) / 2
        oth_c = (others[:, :2] + others[:, 2:]) / 2
        d2 = ((cur_c - oth_c) ** 2).sum(dim=1)
        # Squared diagonal of the smallest enclosing box
        tl = torch.min(cur[:, :2], others[:, :2])
        br = torch.max(cur[:, 2:], others[:, 2:])
        c2 = ((br - tl) ** 2).sum(dim=1) + 1e-6
        diou = ious - d2 / c2
        indices = indices[1:][diou <= iou_threshold]
    return torch.tensor(keep)
6. Evaluation: mAP
6.1 Basic Metrics: Precision & Recall

def compute_precision_recall(pred_boxes, pred_scores, gt_boxes, iou_thresh=0.5):
    """
    Precision and recall for a single image.
    pred_boxes: [N, 4] predicted boxes
    pred_scores: [N] prediction scores
    gt_boxes: [M, 4] ground-truth boxes
    """
    # Sort predictions by score
    sorted_indices = pred_scores.argsort(descending=True)
    pred_boxes = pred_boxes[sorted_indices]
    tp = 0  # true positives
    fp = 0  # false positives
    matched_gt = set()
    for pred in pred_boxes:
        # Find the best-matching ground-truth box
        best_iou = 0
        best_gt = -1
        for i, gt in enumerate(gt_boxes):
            if i in matched_gt:
                continue
            iou = compute_iou(pred.unsqueeze(0), gt.unsqueeze(0)).item()
            if iou > best_iou:
                best_iou = iou
                best_gt = i
        if best_iou >= iou_thresh and best_gt not in matched_gt:
            tp += 1
            matched_gt.add(best_gt)
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched_gt)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    return precision, recall
6.2 AP (Average Precision)

def compute_ap(precisions, recalls):
    """
    AP: the area under the precision-recall curve,
    computed here with 11-point interpolation.
    """
    ap = 0
    for t in np.arange(0, 1.1, 0.1):
        # Maximum precision among points with recall >= t
        if np.sum(recalls >= t) == 0:
            p = 0
        else:
            p = np.max(precisions[recalls >= t])
        ap += p / 11
    return ap
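A quick sanity check on compute_ap with made-up numbers (three predictions, two ground truths; the second prediction was a false positive, the third recovered the second ground truth):

precisions = np.array([1.0, 0.5, 0.67])
recalls = np.array([0.5, 0.5, 1.0])
print(f"AP = {compute_ap(precisions, recalls):.3f}")  # ~0.850 under 11-point interpolation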
# Full mAP computation
def compute_map(all_predictions, all_ground_truths, num_classes, iou_threshold=0.5):
    """
    mAP (mean Average Precision).
    all_predictions: list of dicts, one per image: {class_id: [(bbox, score), ...]}
    all_ground_truths: list of dicts, one per image: {class_id: [bbox, ...]}
    """
    aps = []
    for class_id in range(num_classes):
        # Gather all predictions and ground truths for this class
        class_preds = []  # [(image_id, bbox, score), ...]
        class_gts = {}    # {image_id: [bbox, ...], ...}
        for img_id, preds in enumerate(all_predictions):
            if class_id in preds:
                for bbox, score in preds[class_id]:
                    class_preds.append((img_id, bbox, score))
            if class_id in all_ground_truths[img_id]:
                class_gts[img_id] = all_ground_truths[img_id][class_id]
        # Sort by score
        class_preds.sort(key=lambda x: x[2], reverse=True)
        # Build the precision-recall curve
        tp_list = []
        fp_list = []
        total_gt = sum(len(gts) for gts in class_gts.values())
        matched = {img_id: set() for img_id in class_gts.keys()}
        for img_id, bbox, score in class_preds:
            if img_id not in class_gts:
                fp_list.append(1)
                tp_list.append(0)
                continue
            gt_boxes = class_gts[img_id]
            best_iou = 0
            best_gt = -1
            for i, gt in enumerate(gt_boxes):
                if i in matched[img_id]:
                    continue
                iou = compute_iou(bbox.unsqueeze(0), gt.unsqueeze(0)).item()
                if iou > best_iou:
                    best_iou = iou
                    best_gt = i
            if best_iou >= iou_threshold and best_gt != -1:
                tp_list.append(1)
                fp_list.append(0)
                matched[img_id].add(best_gt)
            else:
                tp_list.append(0)
                fp_list.append(1)
        # Cumulative precision and recall
        tp_cumsum = np.cumsum(tp_list)
        fp_cumsum = np.cumsum(fp_list)
        precisions = tp_cumsum / np.maximum(tp_cumsum + fp_cumsum, 1)
        recalls = tp_cumsum / max(total_gt, 1)
        # AP for this class
        ap = compute_ap(precisions, recalls)
        aps.append(ap)
    return np.mean(aps)  # mAP = mean over the per-class APs
6.3 COCO Evaluation Metrics
| Metric | Meaning | Use |
|---|---|---|
| mAP@0.5 | mAP at IoU = 0.5 | Lenient standard, quick evaluation |
| mAP@0.75 | mAP at IoU = 0.75 | Strict standard, localization quality |
| mAP@[.5:.95] | Average over IoU 0.5 to 0.95 | The COCO standard, overall quality |
| APs/APm/APl | AP on small / medium / large objects | Scale analysis |
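In practice these metrics are rarely computed by hand: pycocotools derives them from a ground-truth JSON and a detection-results JSON. A minimal usage sketch (the file names are placeholders):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations/instances_val.json')  # ground truth (placeholder path)
coco_dt = coco_gt.loadRes('detections.json')      # [{image_id, category_id, bbox, score}, ...]
evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.5:.95], AP@0.5, AP@0.75, APs/APm/APl, ...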
7. Hands-On: Training on a Custom Dataset
7.1 Dataset Format: COCO

# Example of the COCO annotation format
coco_format = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "height": 480, "width": 640},
        # ...
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,  # 1 = cat
            "bbox": [100, 200, 50, 80],  # [x, y, width, height]
            "area": 4000,
            "iscrowd": 0
        },
        # ...
    ],
    "categories": [
        {"id": 1, "name": "cat"},
        {"id": 2, "name": "dog"},
        # ...
    ]
}
7.2 A Custom Dataset

from PIL import Image
from torchvision import transforms
import json
import os

class COCODetectionDataset(torch.utils.data.Dataset):
    """
    Object-detection dataset in COCO format.
    """
    def __init__(self, root_dir, annotation_file, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        # Load the annotations
        with open(annotation_file) as f:
            self.coco_data = json.load(f)
        self.images = {img['id']: img for img in self.coco_data['images']}
        self.categories = {cat['id']: cat for cat in self.coco_data['categories']}
        # Group annotations by image
        self.annotations = {}
        for ann in self.coco_data['annotations']:
            img_id = ann['image_id']
            if img_id not in self.annotations:
                self.annotations[img_id] = []
            self.annotations[img_id].append(ann)
        self.image_ids = list(self.images.keys())

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        img_id = self.image_ids[idx]
        img_info = self.images[img_id]
        # Load the image
        img_path = os.path.join(self.root_dir, img_info['file_name'])
        image = Image.open(img_path).convert('RGB')
        # Load the annotations
        anns = self.annotations.get(img_id, [])
        boxes = []
        labels = []
        for ann in anns:
            # COCO format: [x, y, width, height] → [x1, y1, x2, y2]
            x, y, w, h = ann['bbox']
            boxes.append([x, y, x + w, y + h])
            labels.append(ann['category_id'])
        boxes = torch.tensor(boxes, dtype=torch.float32)
        labels = torch.tensor(labels, dtype=torch.long)
        # Build the target dict
        target = {
            'boxes': boxes,
            'labels': labels,
            'image_id': torch.tensor([img_id]),
            'area': torch.tensor([ann['area'] for ann in anns]),
            'iscrowd': torch.tensor([ann.get('iscrowd', 0) for ann in anns])
        }
        if self.transform:
            image, target = self.transform(image, target)
        return image, target
# Data augmentation (transforms image and boxes in sync)
class DetectionTransform:
    def __init__(self, image_size=800):
        self.image_size = image_size

    def __call__(self, image, target):
        # Record the original size BEFORE resizing so the boxes
        # can be rescaled with the correct factors
        orig_w, orig_h = image.size  # PIL images report (width, height)
        # Image transforms
        image = transforms.Resize((self.image_size, self.image_size))(image)
        image = transforms.ToTensor()(image)
        image = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])(image)
        # Rescale the box coordinates in sync
        scale_x = self.image_size / orig_w
        scale_y = self.image_size / orig_h
        boxes = target['boxes']
        boxes[:, [0, 2]] *= scale_x
        boxes[:, [1, 3]] *= scale_y
        target['boxes'] = boxes
        return image, target
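One wrinkle worth calling out: detection batches carry a variable number of boxes per image, so PyTorch's default collate (which stacks tensors) fails. The standard workaround is a tuple-of-lists collate; the paths below are placeholders.

def collate_fn(batch):
    # Turn [(img1, tgt1), (img2, tgt2), ...] into ((img1, img2, ...), (tgt1, tgt2, ...))
    return tuple(zip(*batch))

dataset = COCODetectionDataset('images/', 'annotations.json',
                               transform=DetectionTransform(800))
train_loader = torch.utils.data.DataLoader(
    dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)

This is exactly the loader shape that train_detection in section 7.3 expects: lists of images and lists of target dicts.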
7.3 Training Faster R-CNN with torchvision

from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_detection_model(num_classes):
    """
    Load a pretrained Faster R-CNN and replace its classification head.
    """
    # Pretrained on COCO (91 classes)
    model = fasterrcnn_resnet50_fpn(pretrained=True)
    # Swap in a new predictor for our classes
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# Training setup
def train_detection(model, train_loader, num_epochs=10):
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    # Optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
    # Learning-rate schedule (stepped once per epoch)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
    for epoch in range(num_epochs):
        model.train()
        for images, targets in train_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            # Forward pass (returns a dict of losses in train mode)
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
        lr_scheduler.step()
        print(f"Epoch {epoch}: loss={losses.item():.4f}")
    return model

# Inference
def detect(model, image, device, conf_thresh=0.5):
    model.eval()
    with torch.no_grad():
        prediction = model([image.to(device)])
    # Post-processing
    boxes = prediction[0]['boxes'].cpu()
    scores = prediction[0]['scores'].cpu()
    labels = prediction[0]['labels'].cpu()
    # Confidence filtering
    keep = scores >= conf_thresh
    boxes = boxes[keep]
    scores = scores[keep]
    labels = labels[keep]
    # NMS (torchvision already applies per-class NMS internally;
    # this extra pass is class-agnostic)
    keep_indices = nms(boxes, scores, iou_threshold=0.5)
    return boxes[keep_indices], scores[keep_indices], labels[keep_indices]
7.4 Visualizing the Detections

import matplotlib.patches as patches

def visualize_detection(image, boxes, scores, labels, class_names, threshold=0.5):
    """
    Draw the detection results on the image.
    """
    fig, ax = plt.subplots(1, figsize=(12, 8))
    ax.imshow(image.permute(1, 2, 0).cpu().numpy())
    colors = plt.cm.tab20(np.linspace(0, 1, len(class_names)))
    for box, score, label in zip(boxes, scores, labels):
        if score < threshold:
            continue
        x1, y1, x2, y2 = box
        width = x2 - x1
        height = y2 - y1
        color = colors[label % len(colors)]
        # Draw the box
        rect = patches.Rectangle((x1, y1), width, height,
                                 linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        # Label text
        label_text = f"{class_names[label]}: {score:.2f}"
        ax.text(x1, y1 - 5, label_text, color='white', fontsize=10,
                bbox=dict(facecolor=color, alpha=0.7, edgecolor='none'))
    ax.axis('off')
    plt.tight_layout()
    plt.show()

# Usage
# class_names = ['background', 'cat', 'dog', 'person', ...]
# visualize_detection(image, boxes, scores, labels, class_names)
8. Summary
| Topic | Key points |
|---|---|
| Two-stage detection | RPN proposals + ROI classification; high accuracy, slower |
| One-stage detection | Regress boxes and classes directly; fast, slightly lower accuracy |
| Anchors | Pre-set multi-scale candidate boxes; classify fg/bg + regress refinements |
| Anchor-free | Predict center point and extents directly; simpler |
| RPN | A small network slid over the feature map, outputs proposals |
| ROI Pooling | Pools ROIs of any size to a fixed spatial size |
| NMS | Deduplication: keep high-scoring boxes, suppress overlaps |
| Soft-NMS | Decay scores instead of deleting; good for dense scenes |
| mAP | Combines precision and recall; the COCO standard metric |
| COCO format | Images + annotations (bbox stored as [x, y, w, h]) |
Exercises
- Anchor design: design anchor sizes and aspect ratios tailored to your dataset (e.g. one dominated by small objects) and compare against the defaults.
- NMS tuning: compare NMS, Soft-NMS, and DIoU-NMS on dense scenes (e.g. crowds) and find the best IoU threshold.
- YOLOv8 training: train Ultralytics YOLOv8 on a custom dataset and compare the speed-accuracy trade-off across model sizes (n/s/m/l/x).
- mAP computation: implement mAP by hand and verify it against pycocotools.
- Challenge: implement TTFNet (Training-Time-Friendly Network) and study how it balances training speed against inference speed.