Abstract
Vision-Language-Action (VLA) models trained on large robot datasets promise general-purpose, robust control across diverse domains and embodiments. However, existing approaches often fail out-of-the-box when deployed in novel environments, embodiments, or tasks. We introduce Mixture of Skills VLA (MoS-VLA), a framework that represents robot manipulation policies as linear combinations of a finite set of learned basis functions. During pretraining, MoS-VLA jointly learns these basis functions across datasets from the Open X-Embodiment project, producing a structured skill space. At test time, adapting to a new task requires only a single expert demonstration. The corresponding skill representation is then inferred via a lightweight convex optimization problem that minimizes the L1 action error, without requiring gradient updates. This gradient-free adaptation incurs minimal overhead while enabling rapid instantiation of new skills. Empirically, MoS-VLA achieves lower action-prediction error on five out of five unseen datasets and succeeds in both simulation and real-robot tasks where a pretrained VLA model fails outright.
The Selling Point

Our model adapts to a new lab setting using only one expert demonstration. To do so, we solve a simple least-absolute-error optimization problem that projects the expert demonstration onto a set of learned basis functions. This step takes only a few seconds even on an RTX 3090 and requires no gradient updates. We then execute the adapted policy in the new environment, yielding a model that can be deployed in new settings with minimal overhead.
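
To make the calibration step concrete, here is a minimal sketch of the one-shot projection, assuming the basis functions have already been evaluated on the expert demonstration. The array shapes, variable names (basis_preds, expert_actions), and the use of SciPy's linprog are illustrative assumptions rather than the exact implementation; only the least-absolute-error objective follows the description above.

```python
# Minimal sketch of the one-shot calibration: project a single expert
# demonstration onto k learned basis functions by minimizing the L1 action
# error. Interfaces and shapes are assumptions for illustration.
import numpy as np
from scipy.optimize import linprog

def calibrate_weights(basis_preds: np.ndarray, expert_actions: np.ndarray) -> np.ndarray:
    """Solve min_w || Phi w - y ||_1 as a linear program.

    basis_preds:    (T, A, k) actions predicted by each of the k basis heads
    expert_actions: (T, A)    actions from the single expert demonstration
    Returns the mixture coefficients w of shape (k,).
    """
    T, A, k = basis_preds.shape
    Phi = basis_preds.reshape(T * A, k)   # stack timesteps and action dimensions
    y = expert_actions.reshape(T * A)
    m = T * A

    # Standard LP reformulation of least absolute error:
    #   minimize sum(t)  subject to  -t <= Phi w - y <= t,  t >= 0
    c = np.concatenate([np.zeros(k), np.ones(m)])
    A_ub = np.block([[Phi, -np.eye(m)],
                     [-Phi, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * k + [(0, None)] * m

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k]
```

Because the problem has only k + T*A variables, an off-the-shelf LP solver finishes in seconds on commodity hardware, which is what makes the gradient-free adaptation so cheap.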
The Architecture

We adapt OpenVLA by introducing k basis action heads, each acting as a basis function conditioned on the task description and image observation. The expert policy for each context (domain) is then represented as a linear combination of these basis functions, with coefficients calibrated from a single expert demonstration. This calibration step amounts to solving a small linear program for least-absolute-error minimization. Crucially, it requires no gradient updates, making it extremely fast.
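
The sketch below illustrates the "k basis heads + linear combination" structure in PyTorch, assuming the shared OpenVLA backbone exposes a feature vector per (image, instruction) pair. The backbone interface, hidden sizes, and class names are assumptions for illustration, not the released implementation.

```python
# Minimal PyTorch sketch of a mixture-of-skills action head: k basis heads on
# top of shared backbone features, combined with calibrated coefficients.
import torch
import torch.nn as nn

class MixtureOfSkillsHead(nn.Module):
    def __init__(self, feat_dim: int, action_dim: int, k: int):
        super().__init__()
        # k basis action heads, each a small MLP mapping backbone features to actions
        self.basis_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
            for _ in range(k)
        )

    def basis_actions(self, features: torch.Tensor) -> torch.Tensor:
        # (batch, k, action_dim): one action prediction per basis head
        return torch.stack([head(features) for head in self.basis_heads], dim=1)

    def forward(self, features: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # Adapted policy: linear combination of the basis predictions using the
        # coefficients calibrated from the single expert demonstration.
        return torch.einsum("bka,k->ba", self.basis_actions(features), weights)
```

At adaptation time, the basis heads are evaluated on the expert demonstration, the coefficients are obtained from the linear program above, and the weighted head is then used for rollout with no further training.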
The Results

By calibrating our model on a single expert trajectory, we substantially outperform the baseline OpenVLA model: we achieve lower mean absolute error on 18 of 27 in-distribution datasets and on all 5 out-of-distribution datasets. These gains are not limited to prediction accuracy. In both simulated and on-robot experiments, our model achieves a 70% to 100% success rate on tasks where the baseline OpenVLA model fails outright.


BibTeX
@article{MoS-VLA2025,
  title={MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation},
  author={Ruihan Zhao and Tyler Ingebrand and Sandeep Chinchali and Ufuk Topcu},
  journal={Under Review},
  year={2025}
}