Robot learning, autonomous science, and new interfaces as instances of an emerging paradigm for physical AI

https://www.a16z.news/p/frontier-systems-for-the-physical


The dominant paradigm in AI today, insofar as it is used in production-ready settings, is organized around language and code. The scaling laws governing large language models are well-characterized, the commercial flywheel of data, compute, and algorithmic improvement is spinning, and the returns to incremental capability gains remain large and mostly legible. This paradigm has earned the capital and attention it commands.

But a set of adjacent and related fields has been making meaningful strides through its gestation phase. These areas of activity include vision-language-action models (VLAs), world action models (WAMs), and other approaches to generalist robotics models; physical and scientific reasoning in the pursuit of AI scientists; and novel interfaces for human-computer interaction (including BCIs and neurotech) that take advantage of advances in AI to rethink how we interact with machines. Beyond technical progress, each of these areas has seen the beginnings of an influx of talent, capital, and founder activity. The technical primitives for extending frontier AI into the physical world are maturing concurrently, and the pace of progress over the past eighteen months suggests that these fields could soon enter a scaling regime of their own.

In a given technology paradigm, the areas with the greatest delta between their current perceived capabilities and medium-term upside potential tend to be those that benefit from the same scaling dynamics driving the current frontier, but sit one step removed from the incumbent paradigm — close enough to inherit its infrastructure and research momentum, but distant enough to require non-trivial additional work. This distance serves a dual function: it creates a natural moat against fast-following, and it defines a problem space that is richer, less explored, and more likely to produce emergent capabilities precisely because the easy paths have not already been taken.

Three domains fit this description today: robot learning, autonomous science (particularly in materials and life sciences), and new human-machine interfaces (including brain-computer interfaces, silent speech, neural wearables, and novel sensory modalities like digitized olfaction). These are not entirely separate efforts, and thematically are part of a group of emerging frontier systems for the physical world. They share a common substrate of technical primitives, like learned representations of physical dynamics, architectures for embodied action, simulation and synthetic data infrastructure, an expanding sensory manifold, and closed-loop agentic orchestration. They are mutually reinforcing in ways that create compounding dynamics across domains. And they are the areas where qualitatively new AI capabilities are most likely to emerge from the interaction of model scale, physical grounding, and novel data modalities.

This essay surveys the technical primitives underlying these systems, examines why these three domains specifically represent frontier opportunities, and proposes that their mutual reinforcement constitutes a structural flywheel for extending AI into the physical world.

Primitives

Before examining specific application domains, it’s worth understanding the shared technical foundations that make these frontier systems possible. Five main primitives underpin the advance of frontier AI into the physical world. These technologies aren’t necessarily specific to any particular application area; rather, they are the building blocks that enable the creation of systems that extend AI to the physical world. Their concurrent maturation is what makes the emerging moment distinctive.

Learned Representations of Physical Dynamics

The most fundamental primitive is the ability to learn compressed, general-purpose representations of how the physical world behaves — how objects move, deform, collide, and respond to force. Without this, every physical world AI system must learn the physics of its domain from scratch, a prohibitively expensive proposition.

Multiple architectural families are converging on this capability from different directions. Vision-Language-Action models (VLAs) approach it from above: they take pretrained vision-language models—already rich with semantic understanding of objects, spatial relations, and language—and extend them with action decoders that output motor commands. The key insight is that the enormous cost of learning to see and understand the world can be amortized across internet-scale image-text pretraining. Models like π₀ from Physical Intelligence, Google DeepMind’s Gemini Robotics, or NVIDIA’s GR00T N1 have demonstrated this architecture at increasing scale.
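
To make that division of labor concrete, here is a minimal sketch of the VLA pattern in PyTorch. Everything in it is illustrative: the dummy backbone stands in for a pretrained vision-language model, and the dimensions, vocabulary size, and 16-step action chunk are arbitrary assumptions rather than the configuration of π₀, Gemini Robotics, or GR00T N1.

```python
import torch
import torch.nn as nn

class DummyVLM(nn.Module):
    """Placeholder for a pretrained vision-language backbone (an assumption, not a real model)."""
    def __init__(self, feat_dim=512, vocab=1000):
        super().__init__()
        self.vision = nn.Linear(3 * 224 * 224, feat_dim)   # stands in for a ViT encoder
        self.text = nn.Embedding(vocab, feat_dim)          # stands in for a language tower
    def forward(self, image, tokens):
        v = self.vision(image.flatten(1))                  # (B, feat_dim)
        t = self.text(tokens).mean(dim=1)                  # (B, feat_dim)
        return v + t                                       # fused semantic features

class ActionDecoder(nn.Module):
    """Maps fused features to a chunk of continuous motor commands."""
    def __init__(self, feat_dim=512, action_dim=7, chunk_len=16):
        super().__init__()
        self.chunk_len, self.action_dim = chunk_len, action_dim
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.GELU(),
            nn.Linear(1024, chunk_len * action_dim),
        )
    def forward(self, feats):
        return self.net(feats).view(-1, self.chunk_len, self.action_dim)

class ToyVLA(nn.Module):
    """The VLA recipe: start from pretrained semantics, add an action head."""
    def __init__(self, backbone, feat_dim=512):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False   # frozen here for simplicity; real VLAs often fine-tune
        self.action_head = ActionDecoder(feat_dim)
    def forward(self, image, tokens):
        return self.action_head(self.backbone(image, tokens))

vla = ToyVLA(DummyVLM())
actions = vla(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
print(actions.shape)  # torch.Size([2, 16, 7]): a 16-step chunk of 7-DoF commands
```

The point of the sketch is the asymmetry: the backbone carries the expensive semantic understanding, while the comparatively small action decoder is all that must be trained on robot data.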

World Action Models (WAMs) approach the same capability from below: they build on video diffusion transformers pretrained on internet-scale video, inheriting rich priors about physical dynamics—how objects fall, occlude, and interact under force—and coupling these priors with action generation. NVIDIA’s DreamZero demonstrates zero-shot generalization to entirely new tasks and environments, achieving a meaningful improvement in real-world generalization while enabling cross-embodiment transfer from human video demonstrations with only small amounts of adaptation data.

A third path, and one that may be the most instructive for understanding where this field is heading, dispenses with both pretrained VLMs and video diffusion backbones entirely. Generalist’s GEN-1 is a native embodied foundation model trained from scratch on over half a million hours of real-world physical interaction data, collected primarily through low-cost wearable devices on humans performing everyday manipulation tasks. It is not a VLA in the standard sense (as there is no vision-language backbone being fine-tuned), nor is it a WAM. It is instead a first-class foundation model for physical interaction, designed from the ground up to learn representations of dynamics from the statistics of human-object contact rather than from internet images, text, or video.

Spatial intelligence, such as that being built by companies like World Labs, is valuable for this primitive because it addresses a representation gap that VLAs, WAMs, and native embodied models all share: none of them explicitly model the three-dimensional structure of the scenes they operate in. VLAs inherit 2D visual features from image-text pretraining. WAMs learn dynamics from video, which is a 2D projection of 3D reality. Models that learn from wearable sensor data capture forces and kinematics, but not scene geometry. Spatial intelligence models can help fill this gap by learning to reconstruct, generate, and reason about the full 3D structure of physical environments: geometry, lighting, occlusion, object relationships, and spatial layout.

Convergence between approaches here is the point. Whether the representations are inherited from VLMs, learned through video co-training, or built natively from physical interaction data, the underlying primitive is the same: compressed, transferable models of how the physical world behaves. The data flywheel for these representations is enormous and largely untapped — encompassing not just internet video and robot trajectories, but the vast corpus of human physical experience that wearable devices are now beginning to capture at scale. The same representations serve a robot learning to fold towels, a self-driving laboratory predicting reaction outcomes, and a neural decoder interpreting the motor cortex’s plan for grasping.

Architectures for Embodied Action

Representations of physics are necessary but insufficient. Translating understanding into reliable physical action requires architectures that solve several interrelated problems: mapping high-level intent to continuous motor commands, maintaining coherence over long action horizons, operating within real-time latency constraints, and improving with experience.

The dual-system hierarchical architecture, which separates a slow, powerful vision-language model for scene understanding and task reasoning (System 2) from a fast, lightweight visuomotor policy for real-time control (System 1), has become the standard design pattern for complex embodiments. GR00T N1, Gemini Robotics, and Figure’s Helix all adopt variants of this approach, resolving the fundamental tension between the rich reasoning that large models provide and the millisecond-scale control frequencies that physical tasks demand. Alternatively, Generalist takes an approach of harmonic reasoning for simultaneous thinking and action.
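
A stylized sketch of the two-rate pattern is shown below; the rates, dimensions, and both policies are invented stand-ins, not the design of any named system. The point is structural: the slow planner refreshes a latent plan a few times per second, and the fast policy acts on the latest plan at control frequency.

```python
import numpy as np

CONTROL_HZ = 50     # System 1: fast visuomotor policy (illustrative rate)
PLAN_HZ = 2         # System 2: slow VLM-based reasoner (illustrative rate)

def system2_plan(observation, instruction):
    """Stand-in for a large VLM: scene understanding -> latent plan embedding."""
    return np.tanh(np.random.randn(64))

def system1_act(observation, latent_plan):
    """Stand-in for a lightweight policy: latent plan + state -> motor command."""
    return 0.1 * latent_plan[:7]          # a 7-DoF command, purely illustrative

latent = None
for tick in range(10 * CONTROL_HZ):       # ten simulated seconds
    obs = np.random.randn(32)             # placeholder camera/proprioception state
    if tick % (CONTROL_HZ // PLAN_HZ) == 0:
        latent = system2_plan(obs, "fold the towel")   # slow loop: refresh the plan
    command = system1_act(obs, latent)    # fast loop: always has a plan to act on
```

In deployed systems the two loops typically run asynchronously on separate compute, with the fast policy never blocking on the slow model; the single loop above is just the simplest way to show the rate separation.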

The action generation mechanism itself is evolving rapidly. Flow matching and diffusion-based action heads, pioneered by π₀, have emerged as the dominant approach for producing smooth, high-frequency continuous actions, displacing the discrete tokenization methods borrowed from language modeling. These methods treat action generation as a denoising process analogous to image synthesis, yielding trajectories that are physically smoother and more robust to compounding errors than autoregressive token prediction.
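
A compact sketch of that recipe follows, under stated assumptions: a toy MLP stands in for the velocity field, a placeholder context vector stands in for the VLM features a real system would inject, and all dimensions are arbitrary. This is the generic linear-interpolation flow-matching formulation, not π₀’s exact head.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """v_theta(noisy_chunk, t, context): predicts the denoising direction."""
    def __init__(self, act_dim=7, chunk=16, ctx_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * chunk + 1 + ctx_dim, 256), nn.GELU(),
            nn.Linear(256, act_dim * chunk),
        )
    def forward(self, a_t, t, ctx):
        x = torch.cat([a_t.flatten(1), t[:, None], ctx], dim=-1)
        return self.net(x).view_as(a_t)

def flow_matching_loss(model, expert_chunk, ctx):
    """Regress the constant velocity (a1 - a0) along the noise-to-data line."""
    a1 = expert_chunk
    a0 = torch.randn_like(a1)                       # pure-noise endpoint
    t = torch.rand(a1.shape[0])                     # random interpolation time
    a_t = (1 - t[:, None, None]) * a0 + t[:, None, None] * a1
    return ((model(a_t, t, ctx) - (a1 - a0)) ** 2).mean()

@torch.no_grad()
def sample_actions(model, ctx, steps=10, act_dim=7, chunk=16):
    """Euler-integrate the learned flow from noise to a full action chunk."""
    a = torch.randn(ctx.shape[0], chunk, act_dim)
    for i in range(steps):
        t = torch.full((ctx.shape[0],), i / steps)
        a = a + model(a, t, ctx) / steps
    return a

model = VelocityField()
loss = flow_matching_loss(model, torch.randn(8, 16, 7), torch.randn(8, 64))
chunk = sample_actions(model, torch.randn(8, 64))   # (8, 16, 7) joint-space chunk
```

The chunk is denoised jointly rather than emitted token by token, which is where the physical smoothness and the robustness to compounding errors come from.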

But the most consequential architectural development may be the extension of reinforcement learning to pretrained VLAs — the idea that a foundation model trained on demonstrations can then improve through its own autonomous practice, much as a person refines a skill through repetition and self-correction. Physical Intelligence’s work on π*₀.₆ represents the clearest demonstration of this principle at scale. Their method, RECAP (RL with Experience and Corrections via Advantage-conditioned Policies), addresses a problem that pure imitation learning cannot solve: credit assignment over long task horizons. If a robot grasps an espresso machine’s portafilter at a slightly wrong angle, the failure may not manifest until several steps later when insertion fails. Imitation learning has no mechanism to attribute the failure to the earlier grasp; RL does. RECAP trains a value function that estimates the probability of success from any intermediate state, then conditions the VLA to select high-advantage actions. Critically, it incorporates heterogeneous data (demonstrations, on-policy autonomous experience, expert teleoperated corrections provided during execution, etc.) into a unified training pipeline.
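
In sketch form, the advantage-conditioning idea can be phrased as below. This is a loose paraphrase for intuition, not Physical Intelligence’s implementation: the state encoding, the binarized advantage, and all shapes are assumptions.

```python
import torch
import torch.nn as nn

# Value function: probability of eventual task success from an intermediate state.
value_net = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 1), nn.Sigmoid())

def value_loss(states, success_labels):
    """Fit V(s) on heterogeneous rollouts labeled with eventual success/failure."""
    return nn.functional.binary_cross_entropy(value_net(states).squeeze(-1), success_labels)

def advantage_condition(batch):
    """Tag each transition by whether it raised the estimated success probability.
    A bad grasp several steps before a failed insertion lowers V immediately,
    which is the credit assignment that imitation learning lacks."""
    with torch.no_grad():
        adv = value_net(batch["next_state"]) - value_net(batch["state"])
    batch["advantage_bit"] = (adv.squeeze(-1) > 0).float()
    return batch
```

The policy is then trained on demonstrations, autonomous rollouts, and teleoperated corrections alike, each transition conditioned on its advantage bit; at deployment the model is conditioned on advantage_bit = 1, steering it toward actions the value function credits with increasing the probability of success.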

The results of this approach are encouraging for the future of RL for physical action. π*₀.₆ folds laundry across 50 novel garment types in real homes, reliably assembles boxes, and prepares espresso drinks on a professional machine, running continuously for hours without human intervention. On the most difficult tasks, RECAP more than doubles throughput and cuts failure rates by half or more compared to the imitation-only baseline. The system also demonstrates that RL post-training yields qualitatively different behaviors from imitation: smoother recoveries, more efficient grasp strategies, and adaptive error correction that were not present in the demonstration data.

These gains suggest that the same compute-scaling dynamics that drove LLMs from GPT-2 to GPT-4 are beginning to operate in the embodied domain — just earlier on the curve, and with an action space that is continuous, high-dimensional, and subject to the unforgiving constraints of real-world physics.

Simulation and Synthetic Data as Scaling Infrastructure

In language, the data problem was solved by the internet: trillions of tokens of naturally occurring text, freely available. In the physical world, the data problem is orders of magnitude harder, as is now well understood and evidenced by the rapid increase in startups aiming to become data vendors for the physical world. Real-world robot trajectories are expensive to collect, dangerous to scale, and limited in diversity. A language model can learn from a billion conversations; a robot cannot have a billion physical interactions (yet).

Simulation and synthetic data generation are the infrastructure layer that resolves this constraint, and their maturation is one of the key reasons physical world AI is accelerating now rather than five years ago.

The modern simulation stack combines physics-based simulation engines, photorealistic rendering via ray tracing, procedural environment generation, and world foundation models that bridge the sim-to-real gap by generating photorealistic video from simulation inputs. The pipeline runs from neural reconstruction of real environments (using only a smartphone), through population with physically accurate 3D assets, to large-scale synthetic data generation with automatic annotation.
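
The shape of that pipeline can be written down as a skeleton. Every function and parameter here is invented for illustration; no vendor’s API looks exactly like this.

```python
import copy
import random

def reconstruct_scene(phone_scan_path):
    """Neural reconstruction: a smartphone capture becomes a textured 3D scene (stub)."""
    return {"source": phone_scan_path, "lighting": "estimated", "objects": []}

def populate(scene, asset_library, n_objects=5):
    """Place physically accurate 3D assets into the reconstructed scene."""
    scene["objects"] = copy.deepcopy(random.sample(asset_library, n_objects))
    return scene

def randomize(scene):
    """Domain randomization: vary lighting and object poses to widen coverage."""
    scene["lighting"] = random.choice(["dawn", "noon", "overhead", "dim"])
    for obj in scene["objects"]:
        obj["pose"] = [random.uniform(-1.0, 1.0) for _ in range(6)]
    return scene

def render_episode(scene):
    """Physics step + photorealistic render + automatic labels (all stubbed)."""
    return {"video": None, "actions": None, "segmentation": "auto", "scene": scene}

assets = [{"name": f"asset_{i}"} for i in range(100)]
base = populate(reconstruct_scene("kitchen_scan.ply"), assets)
dataset = [render_episode(randomize(copy.deepcopy(base))) for _ in range(10_000)]
# The loop bound is the only cost knob: data scales with compute, not labor.
```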

The significance of these improvements in the simulation stack is that they change the economic assumptions underpinning physical world AI. If the bottleneck in physical AI shifts from collecting real data to designing diverse virtual environments, the cost curve collapses. Simulation scales with compute, not with human labor or physical hardware. This transforms the economics of training physical world AI systems in the same way that internet-scale text data transformed the economics of training language models, and it means that investment in simulation infrastructure has outsized leverage on the entire ecosystem.

Simulation, however, is not only a robotics primitive. The same infrastructure serves autonomous science (digital twins of laboratory equipment, simulated reaction environments for hypothesis pre-screening), new interfaces (simulated neural environments for training BCI decoders, synthetic sensory data for calibrating novel sensors), and other domains where AI interacts with the physical world. Simulation is the universal data engine for physical world AI.

Expanding the Sensory Manifold

The physical world communicates through a far richer set of signals than vision and language. Touch conveys information about material properties, grip stability, and contact geometry that is invisible to cameras. Neural signals encode motor intent, cognitive state, and perceptual experience at bandwidths that dwarf any current human-computer interface. Subvocal muscle activity encodes speech intention before any sound is produced. The fourth primitive is the rapid expansion of AI’s sensory access to these previously inaccessible modalities, driven not only by research, but by an ecosystem building the devices, software, and infrastructure to capture and process these signals at consumer scale.

The most visible indicator of this expansion is the emergence of new device categories. These include AR devices, which have improved massively in user experience and form factor in recent years (with companies building applications on this platform for both consumer and industrial use cases), and voice-first AI wearables, which provide more comprehensive context for language-based AI by accompanying users into the physical world. Longer term, neural interfaces may open even more comprehensive modalities of interaction. AI represents a shift in computing that creates an opportunity to dramatically advance the way humans interact with computers, and companies like Sesame are building new modalities and devices to do exactly that.

More dominant modalities like voice create tailwinds for emerging means of interacting with computers. As products like Wispr Flow push voice into more of a primary input modality (an advantage given its high information density), the market dynamics around silent speech interfaces also become more favorable. Silent speech devices, which use various sensors to detect tongue and vocal cord movements to decipher speech without sound, represent an even higher information density modality for interacting with computers and AI.

Brain-computer interfaces, both invasive and non-invasive, represent the deeper frontier, and the commercial ecosystem around them continues to progress. The signal to watch is the convergence of clinical validation, regulatory clearance, platform integration, and institutional capital around a technology category that was purely academic a few years ago.

Tactile sensing is entering embodied AI architectures, as some models in robot learning begin to explicitly include touch as a first-class part of their approach. Olfactory interfaces are becoming real engineering artifacts: wearable displays using miniaturized odor generators with millisecond response times have been demonstrated for mixed-reality applications, while smell models are being built to pair with visual AI systems for chemical process monitoring.

The pattern across all of these developments is that they converge on each other in the limit. AR glasses generate continuous visual and spatial data about how users interact with physical environments. EMG wristbands capture the statistics of human motor intent. Silent speech interfaces capture the mapping between subvocal articulation and linguistic output. BCIs capture neural activity at the highest resolution available. Tactile sensors capture the contact dynamics of physical manipulation. Each new device category is also a data-generation platform that feeds the models underlying multiple application domains. A robot trained on EMG-derived motor intent data learns different grasping strategies than one trained only on teleoperation. A laboratory interface that responds to subvocal commands enables a different kind of scientist-machine interaction than a keyboard. A neural decoder trained on high-density BCI data produces representations of motor planning that are inaccessible through any other channel.

The proliferation of these devices is expanding the effective dimensionality of the data manifold available for training frontier physical world AI systems — and the fact that much of this expansion is being driven by well-capitalized consumer product companies, not just academic labs, means that the data flywheel can scale with market adoption.

Closed-Loop Agentic Systems

The final primitive is more architectural. It is the ability to orchestrate perception, reasoning, and action into sustained, autonomous, closed-loop systems that operate over long time horizons without human intervention.

In language models, the analogous development was the emergence of agentic systems — multi-step reasoning chains, tool use, and self-correcting workflows that advanced models from single-turn question-answerers into autonomous problem-solvers. In the physical world, the same transition is underway, but the requirements are far more demanding. A language agent that makes an error can backtrack costlessly, whereas a physical agent that drops a beaker of reagent cannot.

Three properties distinguish physical world agentic systems from their digital counterparts. First, they require embodiment in the experimental or operational loop: direct interfaces to raw instrument streams, physical state sensors, and actuation primitives that ground reasoning in physical reality rather than text descriptions of it. Second, they require long-horizon persistence: memory, provenance tracking, safety monitoring, and recovery behaviors that maintain continuity across operational cycles rather than treating each task as a standalone episode. Third, they require closed-loop adaptation: the ability to revise strategies based on physical outcomes, not just textual feedback.
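
As a sketch, the three properties translate into a control structure like the following; the planner and instrument interfaces are hypothetical stand-ins, and the safety check is reduced to a single threshold for brevity.

```python
import time

class ClosedLoopAgent:
    """Sketch of the three properties above; all interfaces are hypothetical stand-ins."""
    def __init__(self, execute, propose, revise):
        self.execute = execute        # embodiment: direct actuation/instrument interface
        self.propose = propose        # planner: goal -> list of candidate actions
        self.revise = revise          # adaptation: (plan, outcome) -> updated plan
        self.memory = []              # persistence: provenance log across cycles

    def safe(self, action):
        # Safety monitor: physical errors cannot be backtracked, so gate before acting.
        return action.get("force", 0.0) < 10.0

    def run(self, goal, max_cycles=100):
        plan = self.propose(goal)
        for cycle in range(max_cycles):
            if not plan:
                break                                  # goal reached or abandoned
            action = plan[0]
            if not self.safe(action):
                plan = plan[1:]                        # skip the unsafe step, keep going
                continue
            outcome = self.execute(action)             # grounded in physical state
            self.memory.append({"cycle": cycle, "action": action,
                                "outcome": outcome, "t": time.time()})
            plan = self.revise(plan[1:], outcome)      # adapt to physical outcomes
        return self.memory

# Toy usage with stubbed physicality:
agent = ClosedLoopAgent(
    execute=lambda a: {"ok": True, "reading": 0.42},
    propose=lambda goal: [{"step": i, "force": 2.0} for i in range(3)],
    revise=lambda plan, outcome: plan if outcome["ok"] else plan + [{"step": "retry", "force": 2.0}],
)
log = agent.run("titrate sample")
```

The essential asymmetry with digital agents sits in the safety gate and the provenance log: a physical action, once executed, cannot be undone, so the system must filter before acting and remember everything afterward.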

This primitive is what transforms individual capabilities (a good world model, a reliable action architecture, a rich sensor suite) into functioning systems that can operate autonomously in the physical world. It is the integration layer, and its maturation is what makes the three application domains described below possible as real-world deployments rather than isolated research demonstrations.

Three Domains

The primitives described above are general-purpose enabling layers. They do not, by themselves, specify where the most important applications will emerge. Many domains involve physical action, physical measurement, or physical sensing. What distinguishes a frontier system from a merely improved existing system is the degree to which increasing model capabilities and scaling infrastructure compound within the domain — creating not just better performance, but qualitatively new capabilities that were previously impossible.

Robotics, AI-driven science, and new human-machine interfaces are the three domains where this compounding is strongest. Each one assembles the primitives in a distinct configuration. Each one is bottlenecked by limitations that the primitives discussed are lifting. And each one generates, as a byproduct of its operation, exactly the kind of structured physical data that makes the primitives themselves better, closing a feedback loop that accelerates the entire system. They are not the only physical AI domains worth watching, but they are the ones where the interaction between frontier AI capabilities and physical reality is densest, and where the distance from the current language/code paradigm creates the most space for emergence while remaining highly complementary to, and benefitting from, those capabilities.

Robotics

Robotics is the most literal embodiment of the thesis: a domain that requires an AI system to perceive, reason about, and physically act upon the material world in real time. It is also the domain that most directly stress-tests every primitive simultaneously.

Consider what a general-purpose robot must do to fold a towel. It needs a learned representation of how deformable materials behave under force—a physics prior that no amount of language pretraining provides. It needs an action architecture that can translate a high-level instruction into a sequence of continuous motor commands at control frequencies of 20Hz or more. It needs simulation-generated training data, because no one has collected millions of real-world towel-folding demonstrations. It needs tactile feedback to detect slip and adjust grip force, because vision alone cannot distinguish a firm grasp from one about to fail. And it needs a closed-loop controller that can detect when the fold has gone wrong and recover, rather than blindly executing a memorized trajectory.

This is why robotics is a frontier system rather than a mature engineering discipline with better tools. The primitives do not merely improve existing robotic capabilities; they unlock categories of manipulation, locomotion, and interaction that were previously impossible outside of narrowly controlled industrial settings.

The frontier has advanced meaningfully in recent years, as we’ve previously written. The first generation of VLAs demonstrated that foundation models can control robots across diverse tasks. Architectural advancements have made progress on bridging high-level reasoning and low-level control in robotic systems. On-device inference is becoming feasible, and cross-embodiment transfer means a model can adapt to an entirely new robot platform with limited amounts of data. The central remaining challenge is reliability at scale, which remains the bottleneck to deployments. Even 95% per-step success yields only about 60% on a 10-step task chain, and production environments demand far better. This is where RL post-training holds high potential, and can help move us toward the capabilities and robustness that would indicate a domain entering its scaling regime.
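
The arithmetic behind that reliability claim is worth making explicit, since it drives the whole deployment story; the snippet below simply restates the numbers in the text under an independence assumption between steps.

```python
def chain_success(per_step, steps=10):
    """Probability a task chain succeeds if each step succeeds independently."""
    return per_step ** steps

print(chain_success(0.95))    # ~0.599: 95% per step is only ~60% over ten steps
print(chain_success(0.99))    # ~0.904: small per-step gains compound dramatically
print(chain_success(0.999))   # ~0.990: production chains need near-perfect steps
```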

These advances have implications for market structure. For decades, value in robotics accrued to the mechanical system itself, and while that remains a key part of the stack, as learned policies become more standard, value migrates to models, training infrastructure, and data flywheels. But robotics also feeds back into the primitives previously discussed: every real-world trajectory is training data for better world models, every deployment failure reveals gaps in simulation coverage, and every new embodiment tested expands the diversity of physical experience available for pretraining. Robotics is both the most demanding consumer of the primitives and one of their most important sources of improvement signal.

Autonomous Science

If robotics tests the primitives against the demands of real-time physical action, autonomous science tests them against something slightly different – sustained, multi-step reasoning about causally complex physical systems, over time horizons measured in hours or days, with experimental outcomes that must be interpreted, contextualized, and used to revise strategy.

AI-driven science is the domain where the primitives combine most completely. A self-driving laboratory (SDL) requires learned representations of physical and chemical dynamics to predict what an experiment will produce. It requires embodied action to pipette reagents, position samples, and operate analytical instruments. It requires simulation to pre-screen candidate experiments and allocate scarce instrument time. It requires expanded sensing, such as spectroscopy, chromatography, mass spectrometry, and increasingly novel chemical and biological sensors, to characterize outcomes. And it requires the closed-loop agentic orchestration primitive more than any other domain: the ability to sustain multi-cycle hypothesis-experiment-analysis-revision workflows without human intervention, maintaining provenance, monitoring safety, and adapting strategy based on what each cycle reveals.

No other domain draws on these primitives this deeply. This is what makes autonomous science a frontier system rather than simply laboratory automation with better software. Companies like Periodic Labs and Medra unify scientific reasoning capabilities with the physical capability to test that reasoning, in materials science and life sciences respectively, enabling scientific iteration and generating experimental training data along the way.

The value of such systems is fairly intuitive. Traditional materials discovery takes several years from concept to commercialization, and AI-accelerated workflows can potentially compress this process to far less. The binding constraint is shifting from hypothesis generation, which foundation models can assist with readily, to fabrication and validation, which requires physical instrumentation, robotic execution, and closed-loop optimization. SDLs aim to address exactly this bottleneck.

An additional important property of autonomous science, within the broader landscape of systems for the physical world, is its role as a data engine. Every experiment an SDL runs produces not just a scientific result, but a physically grounded, experimentally validated training signal. A measurement of how a polymer crystallizes under specific conditions enriches the world model’s understanding of material dynamics. A validated synthesis route becomes training data for physical reasoning. A characterized failure teaches the agentic system where its predictions break down. The data produced by an AI scientist conducting real experiments is qualitatively different from internet-scraped text or simulation output, in that it is structured, causal, and empirically verified. It is the kind of data that physical reasoning models need most and can get from no other source. Autonomous science is the domain that directly converts physical reality into the structured knowledge that improves the entire ecosystem of physical world AI.

New Interfaces

Robotics extends AI into physical action, and autonomous science extends it into physical investigation. New interfaces extend it into the direct coupling of artificial intelligence with human perception, sensory experience, and the body’s own signals, through devices that range from AR glasses and EMG wristbands to implantable brain-computer interfaces. What unifies this category is not a single technology but a shared function of expanding the bandwidth and modality of the channel between human intelligence and AI systems, and in the process generating data about human-world interaction that is directly useful for building physical world AI.

The distance from the incumbent paradigm is the source of both the challenge and the potential in this domain. Language models know about these modalities conceptually, but have no native representation of the movement patterns of silent speech, the geometry of olfactory receptor binding, or the temporal dynamics of EMG signals. The representations that decode these signals must be learned from the expanding sensory manifold. There is no internet-scale pretraining corpus for many of these modalities, and the data often must come from the interfaces themselves, which means the systems and their training data co-evolve in a way that has no analogue in language AI.

The near-term expression of this domain is the rapid emergence of AI wearables as a consumer product category. AR glasses are perhaps the most visible instance of this category, along with other wearable consumer devices that take a voice- or vision-first input modality.

This ecosystem of consumer devices creates new hardware platforms for AI to extend into the physical world, while also serving as infrastructure for physical world data. A person wearing AI glasses can produce a continuous first-person video stream of how humans navigate physical environments, manipulate objects, and interact with the world. Other wearables capture continuous biometric and movement data. Taken together, the installed base of AI wearables is becoming a distributed data-collection network for physical-world AI, instrumenting human physical experience at a scale that was previously impossible. Consider the scale of the smartphone as a consumer device: the proliferation of a new type of consumer device that enables new modalities for computers to sense the world at that scale also creates a massive new channel for AI to interact with the physical world.

Brain-computer interfaces represent the deeper frontier. Neuralink has implanted multiple patients and is iterating on its surgical robotics and decoder software. Synchron’s endovascular Stentrode has been used to give paralyzed users control over digital and physical environments. Echo Neurotechnologies is developing a BCI system for speech restoration that builds on their work in high-resolution cortical speech decoding. Moreover, new companies like Nudge have been formed to aggregate talent and capital to build new neural interfaces and platforms for interacting with the brain. The technical milestones in the research sphere are also noteworthy. The BISC chip demonstrated wireless neural recording at a density of 65,536 electrodes on a single chip, and the BrainGate team decoded inner speech directly from motor cortex.

The through-line connecting everything from AR glasses, AI wearables, silent speech devices, and implantable BCIs is not just that they are all interfaces. It is that they collectively constitute a spectrum of increasingly high-bandwidth channels between human physical experience and AI systems, and every point on that spectrum feeds the primitives underlying all three domains in this essay. A robot trained on high quality egocentric video from millions of AI glasses wearers learns different manipulation priors than one trained on curated teleoperation datasets; a laboratory AI that responds to subvocal commands operates with a different latency and fluidity than one controlled by a keyboard; a neural decoder trained on high-density BCI data produces representations of motor planning that are inaccessible through any other channel.

New interfaces are the mechanism by which the sensory manifold itself grows, opening data channels between the physical world and AI that did not previously exist. And the fact that this expansion is being driven by consumer device companies that seek to deploy products at scale means the data flywheel will accelerate with consumer adoption.

Systems for the Physical World

The reason to view robotics, autonomous science, and new interfaces as different instances of frontier systems combining the same primitives is that they are mutually enabling in ways that compound.

Robotics enables autonomous science. Self-driving laboratories are, at their core, robotic systems. The manipulation capabilities developed for general-purpose robotics, such as dexterous grasping, liquid handling, precise positioning, and multi-step task execution, are directly transferable to laboratory automation. As robotics models improve in generality and robustness, the range of experimental protocols that SDLs can execute autonomously expands. Every advance in robot learning lowers the cost and raises the throughput of autonomous experimentation.

Autonomous science enables robotics. The scientific data generated by self-driving labs, such as validated physical measurements, causal experimental results, and materials property databases, can provide the structured, grounded training data that world models and physical reasoning engines need to improve. Moreover, the materials and devices that next-generation robots require (e.g., better actuators, more sensitive tactile sensors, higher-density batteries) are themselves products of materials science. Autonomous discovery platforms that accelerate materials innovation can directly improve the hardware substrate on which robot learning operates.

New interfaces enable robotics. AR devices are a scalable way of gathering data on perceiving and interacting with the physical environment. Neural interfaces generate data about human motor intent, cognitive planning, and sensory processing. These data are invaluable for training robot learning systems, particularly for tasks that involve human-robot collaboration or teleoperation.

There is a deeper point here about the nature of frontier AI progress itself. The language/code paradigm has achieved extraordinary results and continues to show strong improvement in the scaling era. The physical world offers an almost unbounded supply of novel problems, data types, feedback signals, and evaluation criteria. By grounding AI systems in physical reality (through robots that manipulate objects, laboratories that synthesize materials, and interfaces that connect to the biological and physical world) we open new scaling axes that are complementary to the existing digital frontier – and likely mutually improving.

The emergent behaviors we should expect from these systems are difficult to predict precisely, because emergence by definition arises from the interaction of capabilities that are individually well-understood but collectively novel. But the historical pattern is certainly encouraging. When AI systems gain access to new modalities of interaction with the world — when they can see (computer vision), when they can speak (speech recognition), when they can read and write (language models) — the resulting capabilities are qualitatively larger than the sum of the constituent improvements. The transition to physical world systems represents the next such phase transition. In this sense, the primitives discussed here are being built now, and could enable frontier AI systems to perceive, reason about, and interact with the physical world, unlocking significant amounts of value and progress in the physical realm.
