
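This paper, "SpatialMosaic: A Multiview VLM Dataset for Partial Visibility", addresses the weak 3D spatial reasoning of current multimodal large language models in realistic multi-view scenes with partial visibility, heavy occlusion, and low overlap between views.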


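I. What problem does the paper address?

Core problem:

For 3D spatial reasoning tasks, existing multimodal large language models usually rely on:

  • pre-built 3D representations (such as point clouds or meshes)
  • or off-the-shelf 3D reconstruction pipelines

In real-world settings, however, these approaches face three challenges that degrade model performance:

  1. Partial visibility: an object appears in only some of the views rather than in all of them.
  2. Occlusion: within a single view, an object is partly hidden by other objects or cut off by the image boundary.
  3. Low overlap: the region visible in common across views is very small, which makes it hard to establish correspondences with traditional matching methods.

Current multi-view datasets leave these conditions largely unexplored. Humans can reason in 3D by integrating incomplete visual information, but existing models do this poorly.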
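One way to make the low-overlap condition concrete is sketched below. It is a minimal illustration, not the paper's procedure: it assumes each view is summarized by the set of scene point IDs it sees, measures overlap as intersection-over-union, and keeps only view combinations whose pairwise overlap does not exceed a cap. The names `view_overlap` and `low_overlap_combos`, the IoU definition, and the threshold are all assumptions.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def view_overlap(a: Set[int], b: Set[int]) -> float:
    """Intersection-over-union of the scene points visible in two views."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def low_overlap_combos(views: Dict[str, Set[int]], k: int,
                       max_overlap: float) -> List[Tuple[str, ...]]:
    """Return every k-view combination whose pairwise overlap never exceeds max_overlap."""
    selected = []
    for combo in combinations(sorted(views), k):
        if all(view_overlap(views[u], views[v]) <= max_overlap
               for u, v in combinations(combo, 2)):
            selected.append(combo)
    return selected


# Toy example: f0 and f2 share most of their points, so they never appear together.
toy_views = {
    "f0": {1, 2, 3, 4},
    "f1": {4, 5, 6, 7},
    "f2": {1, 2, 3, 8},
    "f3": {9, 10},
}
print(low_overlap_combos(toy_views, k=3, max_overlap=0.2))
# -> [('f0', 'f1', 'f3'), ('f1', 'f2', 'f3')]
```

Enumerating all combinations is only practical for a handful of frames; a real pipeline would more likely sample candidates at random and reject those above the cap.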


II. How does the paper solve it?

The paper approaches the problem from two angles, data and model, and proposes a complete solution:

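1. Data: a high-quality, challenging multi-view dataset

Proposed method:
  • An automated data generation and annotation pipeline built on ScanNet++, an existing high-quality 3D scene dataset
  • The pipeline can:
    • compute an occlusion ratio for each object (object-level occlusion plus truncation at the field-of-view boundary)
    • sample multi-view combinations with low overlap
    • automatically generate QA pairs for 6 types of spatial reasoning tasks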
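The two occlusion components named above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes an object is given as the set of pixels its full projection would cover, that `visible` holds the pixels that survive depth testing against other objects, and that both ratios are fractions of projected pixels. The function names and these exact definitions are hypothetical.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (x, y) image coordinates


def _inside(p: Pixel, width: int, height: int) -> bool:
    x, y = p
    return 0 <= x < width and 0 <= y < height


def fov_truncation_ratio(projected: Set[Pixel], width: int, height: int) -> float:
    """Fraction of the object's projected pixels that fall outside the image."""
    if not projected:
        return 0.0
    outside = sum(1 for p in projected if not _inside(p, width, height))
    return outside / len(projected)


def object_occlusion_ratio(projected: Set[Pixel], visible: Set[Pixel],
                           width: int, height: int) -> float:
    """Fraction of the in-frame projected pixels hidden behind other objects."""
    in_frame = {p for p in projected if _inside(p, width, height)}
    if not in_frame:
        return 0.0
    hidden = in_frame - visible
    return len(hidden) / len(in_frame)


# Toy example: a 4x4 object whose right half lies outside a 4x4 image,
# and whose in-frame half is itself half covered by another object.
projected = {(x, y) for x in range(2, 6) for y in range(4)}
visible = {(2, y) for y in range(4)}
print(fov_truncation_ratio(projected, 4, 4))             # -> 0.5
print(object_occlusion_ratio(projected, visible, 4, 4))  # -> 0.5
```

Keeping the two ratios separate makes it possible to tell an object that is cut off by the frame from one that is hidden behind furniture, which matches the distinction the pipeline description draws.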
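Resulting datasets:
  • SpatialMosaic: an instruction-tuning dataset with 2 million QA pairs
  • SpatialMosaic-Bench: an evaluation benchmark with 1 million QA pairs covering 6 task types:
    • object counting
    • best-view selection
    • object localization
    • occlusion-aware existence
    • occlusion-aware attribute
    • occlusion-aware spatial relation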
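For the object counting and best-view selection tasks listed above, a QA record could be assembled roughly as follows. This is a hypothetical sketch: the function names, the distractor scheme (consecutive integers around the true count), and the tie-breaking rule for the best frame are assumptions, and the field names simply follow the JSON sample in this post.

```python
from typing import Dict, List, Set


def build_count_qa(category: str, visible_ids_per_frame: List[Set[str]]) -> Dict:
    """Build a multiple-choice counting question over several frames."""
    # An instance seen in several frames must be counted once, so take the union.
    unique_ids = set().union(*visible_ids_per_frame)
    count = len(unique_ids)
    # Four consecutive non-negative integers that include the true count.
    start = max(0, count - 2)
    options = [str(n) for n in range(start, start + 4)]
    return {
        "task_type": "object_count",
        "question": f"How many {category}(s) are visible across these frames?",
        "frames": list(range(len(visible_ids_per_frame))),
        "answer": str(count),
        "options": options,
        "correct_option_index": options.index(str(count)),
    }


def best_frame(visible_counts: List[int], visible_pixels: List[int]) -> int:
    """Pick the frame with the most visible instances, breaking ties by pixel area."""
    return max(range(len(visible_counts)),
               key=lambda i: (visible_counts[i], visible_pixels[i]))


frames = [{"chair_01"}, {"chair_01", "chair_03"}, {"chair_03", "chair_05"}, set()]
qa = build_count_qa("chair", frames)
print(qa["answer"], qa["options"], qa["correct_option_index"])
# -> 3 ['1', '2', '3', '4'] 2
print(best_frame([1, 2, 3, 1], [12450, 28760, 45320, 9870]))
# -> 2
```

The union step is where partial visibility matters: no single frame shows all three chairs, so summing or taking the maximum of per-frame counts would give the wrong answer.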

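2. Model: a vision-language model that fuses 3D geometric information

Proposed model: SpatialMosaicVLM

Architecture:

  • VGGT serves as the geometry encoder and extracts 3D structural features from the multi-view images
  • CLIP ViT serves as the visual encoder and extracts 2D appearance features from each image
  • A cross-attention mechanism fuses the geometric features with the visual features
  • The fused features are fed, together with the question, into the large language model to generate the answer
Advantages:
  • Does not depend on explicit 3D reconstruction
  • Handles multi-view input with occlusion, partial visibility, and low overlap effectively
  • Learns during training to assemble 3D spatial structure from fragmented visual cues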
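The cross-attention fusion step in the architecture can be illustrated with a toy, dependency-free sketch. It is not the SpatialMosaicVLM implementation: it assumes visual tokens act as queries and geometry tokens as keys and values, uses a single head with no learned projections, and adds a residual connection. All of these choices, and the name `cross_attention_fuse`, are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(visual: List[Vector], geometry: List[Vector]) -> List[Vector]:
    """Each visual token attends over all geometry tokens; the result is added back."""
    dim = len(visual[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in visual:
        # Scaled dot-product score between this visual token and each geometry token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in geometry]
        weights = softmax(scores)
        # Weighted sum of geometry tokens (used as values here).
        attended = [sum(w * g[d] for w, g in zip(weights, geometry)) for d in range(dim)]
        # Residual connection keeps the original appearance feature.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


visual_tokens = [[0.0, 0.0], [1.0, 0.0]]
geometry_tokens = [[1.0, 2.0]]
print(cross_attention_fuse(visual_tokens, geometry_tokens))
# -> [[1.0, 2.0], [2.0, 2.0]]
```

The output keeps one token per visual token, so the fused sequence can be passed to a language model in place of the plain visual features without changing its length.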

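III. What do the experiments show?

Main results:

1. On SpatialMosaic-Bench:
  • Existing open-source multimodal models (such as LLaVA and InternVL) degrade sharply under partial visibility, occlusion, and low overlap
  • VLM-3R and SpatialMosaicVLM, both fine-tuned on SpatialMosaic, clearly outperform all baseline models
  • SpatialMosaicVLM outperforms LLaVA-NeXT-Video-7B by 34%
2. On VSTI-Bench (spatio-temporal reasoning):
  • The model is evaluated zero-shot, transferring directly to task types it has not seen (such as camera displacement and relative object distance)
  • Even without training on these tasks, SpatialMosaicVLM still surpasses all open-source models, including LLaVA-NeXT-Video-72B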
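Comparisons like those above are typically reported per task. The sketch below assumes the benchmark is scored as multiple-choice accuracy, which this post does not state explicitly, and that each record carries the `task_type` and `correct_option_index` fields used in the JSON sample in this post. The function name `accuracy_by_task` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_task(samples: List[Dict], predictions: List[int]) -> Dict[str, float]:
    """Per-task fraction of questions where the predicted option index is correct."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, predicted_index in zip(samples, predictions):
        task = sample["task_type"]
        total[task] += 1
        if predicted_index == sample["correct_option_index"]:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


samples = [
    {"task_type": "object_count", "correct_option_index": 2},
    {"task_type": "object_count", "correct_option_index": 1},
    {"task_type": "object_localization", "correct_option_index": 0},
]
print(accuracy_by_task(samples, [2, 0, 0]))
# -> {'object_count': 0.5, 'object_localization': 1.0}
```

Reporting per task rather than a single pooled number helps show whether a gain comes from the occlusion-aware tasks or from the simpler counting and localization ones.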

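IV. Summary of the paper's core contributions

Contribution | Description
Problem definition | Identifies three challenges in multi-view spatial reasoning: partial visibility, occlusion, and low overlap
Data pipeline | Proposes a scalable automatic annotation and QA generation method that can be applied to existing 3D scene datasets
Datasets | Builds SpatialMosaic and SpatialMosaic-Bench, which are large in scale and challenging
Model architecture | Proposes SpatialMosaicVLM, which fuses geometric and visual features to improve multi-view spatial reasoning
Experimental validation | Shows on multiple benchmarks that the model performs better under occlusion, low overlap, and zero-shot transfer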

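V. Limitations and future directions (points for further thought)

  • The data currently comes from the indoor scenes of ScanNet++; whether the approach generalizes to outdoor and dynamic scenes remains to be verified
  • The model relies on VGGT as its geometry encoder, so inference efficiency may be limited by the speed of the 3D reconstruction model
  • Future work could explore lighter geometry encoders or end-to-end learning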

JSON sample

{
  "dataset": "SpatialMosaic",
  "version": "1.0",
  "description": "Multiview VLM dataset for partial visibility, occlusion, and low-overlap conditions",
  "source_scene": "ScanNet++",
  "scene_id": "scene_0042",
  "qa_sample": [
    {
      "task_type": "object_count",
      "question": "How many chair(s) are visible across these frames?",
      "frames": [0, 1, 2, 3],
      "answer": "3",
      "options": ["1", "2", "3", "4"],
      "correct_option_index": 2,
      "metadata": {
        "target_category": "chair",
        "visible_instance_ids": ["chair_01", "chair_03", "chair_05"],
        "total_instances_in_scene": 5,
        "visibility_scenario": "partially_visible",
        "gt_scenario": "partial_coverage"
      }
    },
    {
      "task_type": "best_view_selection",
      "question": "How many chair(s) are visible across these frames? And tell me which frame provides the most informative view of the chair(s).",
      "frames": [0, 1, 2, 3],
      "answer": "3, Frame 2",
      "options": [
        "2, Frame 0",
        "3, Frame 1",
        "3, Frame 2",
        "4, Frame 3"
      ],
      "correct_option_index": 2,
      "metadata": {
        "target_category": "chair",
        "count_ground_truth": 3,
        "best_frame_id": 2,
        "best_frame_reason": "highest_visible_count_and_pixel_area",
        "per_frame_visible_counts": [1, 2, 3, 1],
        "per_frame_visible_pixels": [12450, 28760, 45320, 9870]
      }
    },
    {
      "task_type": "object_localization",
      "question": "Is there a(n) monitor in Frame 1? If so, what is the bounding box center coordinates?",
      "frames": [0, 1, 2, 3],
      "query_frame_id": 1,
      "answer": "Yes; (512, 384)",
      "options": [
        "Yes; (512, 384)",
        "Yes; (256, 512)",
        "Yes; (768, 256)",
        "No"
      ],
      "correct_option_index": 0,
      "metadata": {
        "target_instance": "monitor_01",
        "is_visible_in_query_frame": true,
        "bbox_center": [512, 384],
        "bbox_2d": [384, 256, 640, 512],
        "occlusion_ratio_in_frame": 0.12,
        "fov_occlusion_ratio": 0.0
      }
    },
    {
      "task_type": "occlusion_aware_existence",
      "subtask": "left_right",
      "question": "In Frame 3, is the mouse to the right of the laptop in this viewpoint?",
      "frames": [0, 1, 2, 3],
      "query_frame_id": 3,
      "answer": "Yes",
      "options": ["Yes", "No"],
      "correct_option_index": 0,
      "metadata": {
        "source_instance": "laptop_01",
        "target_instance": "mouse_01",
        "source_visible_in_frame": true,
        "target_visible_in_frame": false,
        "target_occlusion_ratio": 0.67,
        "ground_truth_relation": "right",
        "evaluated_axis": "x",
        "relation_camera_frame": {
          "source_bbox_3d": [[-0.5, 0.2, 1.2], [0.5, 0.1, 1.0]],
          "target_bbox_3d": [[0.6, 0.15, 1.15], [0.9, 0.05, 1.05]]
        }
      }
    },
    {
      "task_type": "occlusion_aware_existence",
      "subtask": "farther_closer",
      "question": "In Frame 2, is the plant farther from the camera than the container in this viewpoint?",
      "frames": [0, 1, 2, 3],
      "query_frame_id": 2,
      "answer": "No",
      "options": ["Yes", "No"],
      "correct_option_index": 1,
      "metadata": {
        "source_instance": "container_01",
        "target_instance": "plant_01",
        "source_visible_in_frame": true,
        "target_visible_in_frame": false,
        "target_occlusion_ratio": 0.52,
        "ground_truth_relation": "closer",
        "evaluated_axis": "z",
        "depth_source": 2.3,
        "depth_target": 1.8
      }
    },
    {
      "task_type": "occlusion_aware_attribute",
      "subtask": "higher_lower",
      "question": "In Frame 4, which object appears higher than the table in this viewpoint?",
      "frames": [0, 1, 2, 3, 4],
      "query_frame_id": 4,
      "answer": "Lamp",
      "options": ["Lamp", "Chair", "Book", "Cup"],
      "correct_option_index": 0,
      "metadata": {
        "source_instance": "table_01",
        "target_instance": "lamp_01",
        "source_visible_in_frame": true,
        "target_visible_in_frame": false,
        "target_occlusion_ratio": 0.73,
        "ground_truth_relation": "higher",
        "evaluated_axis": "y",
        "distractor_objects": [
          {"instance": "chair_02", "relation_to_source": "lower"},
          {"instance": "book_03", "relation_to_source": "lower"},
          {"instance": "cup_01", "relation_to_source": "lower"}
        ]
      }
    },
    {
      "task_type": "occlusion_aware_spatial_relation",
      "subtask": "full_3d",
      "question": "In Frame 1, where does the pillow appear in this view relative to the sofa?",
      "frames": [0, 1, 2],
      "query_frame_id": 1,
      "answer": "To the left of the sofa",
      "options": [
        "To the left of the sofa",
        "To the right of the sofa",
        "Higher than the sofa",
        "Farther from the camera than the sofa"
      ],
      "correct_option_index": 0,
      "metadata": {
        "source_instance": "sofa_01",
        "target_instance": "pillow_01",
        "source_visible_in_frame": true,
        "target_visible_in_frame": false,
        "target_occlusion_ratio": 0.81,
        "ground_truth_relations": {
          "x_axis": "left",
          "y_axis": "higher",
          "z_axis": "closer"
        },
        "evaluated_axis": "x",
        "distractor_relations": [
          "right",
          "lower",
          "farther"
        ]
      }
    }
  ],
  "difficulty_metadata": {
    "overlap_ratio_between_frames": [0.12, 0.08, 0.15, 0.10],
    "max_overlap": 0.15,
    "avg_occlusion_ratio_targets": 0.68,
    "visibility_scenario": "partially_visible",
    "difficulty_level": "high"
  }
}
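The relation labels in the sample above can be related to the 3D boxes with a small sketch. It is an assumption-laden illustration, not the dataset's labeling code: it assumes camera-frame coordinates with +x to the right, +y up, and +z away from the camera, compares box centers along one axis, and treats differences below `min_gap` as ambiguous. The function names, the axis convention, and `min_gap` are all hypothetical.

```python
from typing import List, Optional, Tuple

# (label when the target is on the negative side, label when on the positive side)
AXIS_LABELS = {
    "x": ("left", "right"),
    "y": ("lower", "higher"),
    "z": ("closer", "farther"),
}


def bbox_center(bbox: List[List[float]]) -> Tuple[float, float, float]:
    """Center of a box given as two opposite corner points."""
    (x0, y0, z0), (x1, y1, z1) = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2, (z0 + z1) / 2)


def relation(source_bbox: List[List[float]], target_bbox: List[List[float]],
             axis: str, min_gap: float = 1e-6) -> Optional[str]:
    """Relation of the target to the source along one camera-frame axis."""
    index = "xyz".index(axis)
    delta = bbox_center(target_bbox)[index] - bbox_center(source_bbox)[index]
    if abs(delta) < min_gap:
        return None  # centers coincide along this axis: no clear relation
    negative, positive = AXIS_LABELS[axis]
    return positive if delta > 0 else negative


# Boxes copied from the laptop/mouse record in the sample.
laptop = [[-0.5, 0.2, 1.2], [0.5, 0.1, 1.0]]
mouse = [[0.6, 0.15, 1.15], [0.9, 0.05, 1.05]]
print(relation(laptop, mouse, "x"))
# -> right
```

A larger `min_gap` would be a natural way to drop questions whose answer hinges on a difference too small for a viewer to judge.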