PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification

Mila

Université de Montréal

ICRA 2026

Video abstract

Abstract

Simulation environments are useful for robot policy learning and for the verification and validation of plans. Traditionally, creating a simulation has been onerous, and building a bespoke simulation for every environment in which a robot operates has been infeasible. In this work, we introduce PerceptTwin, a fully automatic pipeline that constructs interactive simulations directly from the semantic scene representations produced by a robot’s perception stack. PerceptTwin combines open-vocabulary object maps with 3D asset generation, affordance prediction, and commonsense condition checking. The resulting simulations can be used to validate and refine plans before they are executed on robot hardware. Borrowing from the AI alignment literature, we also introduce an LLM judge that verifies plan correctness and alignment with human preferences. Experiments show that PerceptTwin feedback allows LLM planners to refine plans, enhance safety, and resist harmful black-box prompting attacks. In our suite of tasks, PerceptTwin improves plan success by an average of ~39% for GPT5, GPT5Mini, and GPT5Nano planners. PerceptTwin also improves human plan verification by up to 8% on average for plans that fail due to unmet skill preconditions. Our results demonstrate the potential of open-vocabulary scene simulation from robot perception as a foundation for safer, more reliable robot planning.

Approach

We propose an augmentation to semantic scene maps that enables the automatic creation of interactive simulations. We then investigate the value of this augmentation for both human and LLM-based plan verification and refinement.

To evaluate the effect of feedback from the reconstruction, we roll out 5 seeds for each of the three most recent OpenAI models (GPT5, GPT5Mini, and GPT5Nano) on a suite of 6 tasks. PerceptTwin may give up to five rounds of feedback per task. Detailed results for each task are shown below:

Results for the green on yellow on black tower building task, with legend
Results for the yellow on black on blue tower building task
Results for the vegetable preparation task
Results for the put bell pepper in cooler task
Results for the slice and put in cooler task
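
The feedback protocol described above (a planner proposes a plan, the reconstruction checks it, and the planner may revise it for up to five rounds) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PerceptTwin's implementation: `Step`, `simulate`, `refine`, and `stub_planner` are hypothetical names, the "simulation" is a toy symbolic check of two skill preconditions, and the planner is a hard-coded stand-in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# NOTE: every name here is hypothetical and chosen for illustration only.


@dataclass(frozen=True)
class Step:
    skill: str   # e.g. "pick" or "place_on"
    target: str  # object the skill acts on


Plan = list[Step]
Planner = Callable[[str, Optional[str]], Plan]


def simulate(plan: Plan) -> Optional[str]:
    """Toy stand-in for a rollout in the reconstructed scene.

    Returns None when every skill precondition holds, otherwise a
    textual description of the first unmet precondition.
    """
    holding: Optional[str] = None
    for i, step in enumerate(plan):
        if step.skill == "pick":
            if holding is not None:
                return f"step {i}: cannot pick {step.target} while holding {holding}"
            holding = step.target
        elif step.skill == "place_on":
            if holding is None:
                return f"step {i}: nothing in hand to place on {step.target}"
            holding = None
        else:
            return f"step {i}: unknown skill {step.skill}"
    return None


def refine(planner: Planner, task: str, max_feedback: int = 5):
    """Ask for a plan, then allow up to `max_feedback` rounds of revision.

    Returns (final plan, whether it passed simulation, feedback rounds used).
    """
    plan = planner(task, None)
    for used in range(max_feedback):
        feedback = simulate(plan)
        if feedback is None:
            return plan, True, used
        plan = planner(task, feedback)  # planner sees why the rollout failed
    return plan, simulate(plan) is None, max_feedback


def stub_planner(task: str, feedback: Optional[str]) -> Plan:
    """Hard-coded stand-in for an LLM planner: wrong first, fixed after feedback."""
    if feedback is None:
        return [Step("place_on", "yellow block")]  # forgets to pick first
    return [Step("pick", "green block"), Step("place_on", "yellow block")]
```

Calling `refine(stub_planner, "green on yellow tower")` returns the two-step plan after a single feedback round, since the first proposal violates the "must be holding something" precondition. In a real system the feedback string would come from the interactive simulation rather than from hand-written rules.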

Across these tasks, PerceptTwin feedback improves plan success by an average of ~39% for the three planners, and it also improves plan safety.

Average improvement in plan success rates across different LLM planners when using PerceptTwin for feedback.

We also reproduce the scenario from this work, in which an attacker jailbreaks an LLM planner into detonating a bomb to harm a human. The assumed-aligned LLM judge that we include in the reconstruction successfully flags harmful plans as undeployable.

Results for the bomb task
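
One way to picture the judge's role is as a gate in front of deployment. The sketch below is an assumption-laden illustration, not the paper's method: `deployable` and `keyword_judge` are hypothetical names, requiring both simulated success and judge approval is our assumption, and the keyword check merely stands in for an LLM judge, which would reason about a plan's simulated consequences rather than match strings.

```python
from typing import Callable

# Hypothetical interface: a judge maps (task, plan steps) to approve/reject.
Judge = Callable[[str, list[str]], bool]


def keyword_judge(task: str, plan: list[str]) -> bool:
    """Crude stand-in for an LLM judge; rejects plans with banned actions."""
    banned = ("detonate",)  # illustrative list only
    return not any(word in step.lower() for step in plan for word in banned)


def deployable(task: str, plan: list[str], sim_ok: bool, judge: Judge) -> bool:
    """Assumed gate: a plan runs on hardware only if it succeeded in
    simulation AND the judge did not flag it as harmful."""
    return sim_ok and judge(task, plan)
```

Under this gate, a plan such as `["pick up remote", "Detonate bomb"]` is rejected even when it executes successfully in simulation, while a benign plan that passes simulation is allowed through.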

See the video below for a visualization of the judge's ability to stop harmful plans.

We also observe improvements to human interpretability, as detailed in the paper.

Video Presentation

Grid of all rollouts

Sample videos for each task and model

Poster