Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

Jens U. Kreber, Lukas Mack, Joerg Stueckler
Intelligent Perception in Technical Systems Group
University of Augsburg, Germany

Abstract

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose the Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. Representing the scene with object-centric Gaussians lets us handle arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows object motion to be described as a rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires it to handle partial observations of objects due to occlusions. We analyze the prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions with a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.

Method Overview

Method overview. Left: per-object anchors are obtained by object-centric Gaussian splatting. Right: our proposed spatio-temporal transformer.

Left: The proposed scene representation. Object-centric Gaussian splatting yields per-object anchors or Gaussians, which are rigidly transformed according to the object and end-effector poses to encode the history of scene states.
Right: The proposed spatio-temporal transformer. Spatial grid pooling and spatial attention blocks are combined with temporal attention between points and a newly proposed spatio-temporal attention layer. Given a history and future end-effector poses, the model predicts future object poses. \(\hat{\tau}^{\text{obj}}_i\) denotes the predicted object poses and \(\tau^{\text{obj}}_\emptyset\) a fixed dummy pose.
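
The rigid transformation of canonical-frame Gaussians mentioned in the left panel can be illustrated with a minimal NumPy sketch. It uses the standard rule for moving a Gaussian under a rigid motion: means map as \(R\mu + t\) and covariances as \(R \Sigma R^\top\). The function name `transform_gaussians` and the array layout are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np

def transform_gaussians(means, covs, R, t):
    """Move canonical-frame Gaussians with a rigid transform (hypothetical helper).

    means: (N, 3) Gaussian centers in the object's canonical frame
    covs:  (N, 3, 3) Gaussian covariances in the canonical frame
    R:     (3, 3) rotation matrix of the object pose
    t:     (3,) translation of the object pose
    """
    # Each mean is rotated and then translated: mu' = R mu + t.
    means_world = means @ R.T + t
    # Covariances are only rotated: Sigma' = R Sigma R^T (broadcast over N).
    covs_world = R @ covs @ R.T
    return means_world, covs_world

# Usage: rotate one elongated Gaussian by 90 degrees about the z-axis.
R_z90 = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
mu, sigma = transform_gaussians(
    np.array([[1.0, 0.0, 0.0]]),
    np.diag([4.0, 1.0, 1.0])[None],
    R_z90,
    np.array([0.0, 0.0, 0.5]),
)
```

Because the shape of an object is stored once in its canonical frame, a whole trajectory can be replayed by applying one pose per time step in this way.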

Predictions

Predicted sequences over a 2.4 s horizon (right part of each image) and ground truth (left part), replayed by rigidly transforming the splatted Gaussian representation. Examples are drawn from the top 10% quantile of pose changes and selected according to their combined prediction error rank.

Smallest error rank

Median error rank

Second smallest error rank

Worst error rank

Planning

Episodes from planning with our model in a model-predictive control setting. For each of the two tasks, we show successful episodes on scenes with 5 objects, chosen as those with the largest initial objective value.

Task 1 "push object to position": The center of the screwdriver is successfully aligned with the target position indicated by the green sphere.

Task 2 "clear middle": All object centers are successfully pushed out of the red region.
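
To make the planning setup more concrete, the sketch below shows one possible way to phrase the two task objectives and a single model-predictive control step. Everything in it is an assumption for illustration: the page does not state which optimizer is used, so simple random shooting stands in for it; the red region is assumed to be an axis-aligned rectangle in the table plane; and `predict` is a hypothetical stand-in for the learned world model that maps current object centers and an action sequence to predicted object centers.

```python
import numpy as np

def push_to_position_cost(centers, obj_idx, target):
    """Task 1 (assumed form): distance of one object's center to the target."""
    return float(np.linalg.norm(centers[obj_idx] - target))

def clear_middle_cost(centers, region_min, region_max):
    """Task 2 (assumed form): number of object centers inside the region."""
    inside = np.all((centers >= region_min) & (centers <= region_max), axis=-1)
    return int(inside.sum())

def plan_random_shooting(centers, predict, cost, horizon=8, n_samples=256,
                         max_step=0.05, seed=0):
    """One MPC step: sample action sequences, roll them out, keep the cheapest."""
    rng = np.random.default_rng(seed)
    # Each candidate is a sequence of planar end-effector displacements.
    candidates = rng.uniform(-max_step, max_step, size=(n_samples, horizon, 2))
    # Score every candidate by the cost of its predicted outcome.
    costs = np.array([cost(predict(centers, seq)) for seq in candidates])
    best = int(np.argmin(costs))
    # Only the first action is executed; planning repeats at the next step.
    return candidates[best, 0], float(costs[best])
```

In a closed loop, the returned first action would be executed, the scene re-observed, and the planner called again, which is what distinguishes model-predictive control from open-loop planning.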

BibTeX

@article{MRO-GWM,
  title={Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects},
  author={Jens U. Kreber and Lukas Mack and Joerg Stueckler},
  journal={arXiv preprint arXiv:2606.01950},
  year={2026},
  url={https://arxiv.org/abs/2606.01950}
}