MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation

The University of Texas at Austin
Under Review, 2025

*Equal Contribution

Abstract

Vision-Language-Action (VLA) models trained on large robot datasets promise general-purpose, robust control across diverse domains and embodiments. However, existing approaches often fail out-of-the-box when deployed in novel environments, embodiments, or tasks. We introduce Mixture of Skills VLA (MoS-VLA), a framework that represents robot manipulation policies as linear combinations of a finite set of learned basis functions. During pretraining, MoS-VLA jointly learns these basis functions across datasets from the Open X-Embodiment project, producing a structured skill space. At test time, adapting to a new task requires only a single expert demonstration. The corresponding skill representation is then inferred via a lightweight convex optimization problem that minimizes the L1 action error, without requiring gradient updates. This gradient-free adaptation incurs minimal overhead while enabling rapid instantiation of new skills. Empirically, MoS-VLA achieves lower action-prediction error on five out of five unseen datasets and succeeds in both simulation and real-robot tasks where a pretrained VLA model fails outright.

The Selling Point

Our method uses one expert demonstration to calibrate the VLA model to a new setting. This process takes only a few seconds.

Our model adapts to new lab settings using only one expert demonstration. To do so, we solve a simple least-absolute-error optimization problem that projects the expert demonstration onto a set of learned basis functions. This step takes only a few seconds even on an RTX 3090 and requires no gradient updates. We then execute the adapted policy in the new environment, yielding a model that can be deployed in new settings with minimal overhead.
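To make the calibration step concrete, below is a minimal sketch of one-shot least-absolute-error fitting, written as the standard linear-program reformulation of L1 regression. The shapes, names (`basis_preds`, `expert_actions`, `calibrate_weights`), and use of SciPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog


def calibrate_weights(basis_preds: np.ndarray, expert_actions: np.ndarray) -> np.ndarray:
    """Solve min_w || B w - y ||_1 as a linear program (illustrative sketch).

    basis_preds:    (T, k, d) basis-head action predictions along the demonstration.
    expert_actions: (T, d)    actions from the single expert demonstration.
    Returns w: (k,) mixing coefficients for the adapted policy.
    """
    T, k, d = basis_preds.shape
    B = basis_preds.transpose(0, 2, 1).reshape(T * d, k)  # stack timesteps and action dims
    y = expert_actions.reshape(T * d)

    # Standard LP reformulation of L1 regression with slack variables s >= 0:
    #   minimize sum(s)  subject to  -s <= B w - y <= s
    m = T * d
    c = np.concatenate([np.zeros(k), np.ones(m)])          # objective: sum of slacks
    A_ub = np.block([[B, -np.eye(m)], [-B, -np.eye(m)]])   # two-sided absolute-value constraints
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * k + [(0, None)] * m          # w free, slacks nonnegative

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k]
```

Because the problem has only k unknown coefficients plus slack variables, solving it is far cheaper than any gradient-based finetuning, which is what makes adaptation take only seconds.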

The Architecture

Our architecture simply adds k output heads to any VLA model, where k is the number of basis functions. The adapted policy is a linear combination of these basis functions.

We adapt OpenVLA by introducing k basis action heads, each acting as a basis function conditioned on the task description and image observation. The expert policy for each context (domain) is then represented as a linear combination of these basis functions, where the coefficients are calibrated from a single expert demonstration. This calibration step involves solving a simple linear program corresponding to least absolute error minimization. Crucially, this process does not require any gradient updates, making it extremely fast.
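The sketch below shows one way to realize the k-head idea, under assumed interfaces: a VLA backbone produces a feature for each (image, instruction) context, k lightweight heads map it to candidate actions, and the adapted policy mixes them with the calibrated weights. The class name, the linear heads, and the feature dimension are placeholders rather than the OpenVLA API.

```python
import torch
import torch.nn as nn


class MixtureOfSkillsHead(nn.Module):
    """k basis action heads on top of a shared VLA feature (illustrative sketch)."""

    def __init__(self, feature_dim: int, action_dim: int, num_basis: int):
        super().__init__()
        # One action head per basis function; each is a small linear map here.
        self.heads = nn.ModuleList(
            [nn.Linear(feature_dim, action_dim) for _ in range(num_basis)]
        )

    def basis_actions(self, features: torch.Tensor) -> torch.Tensor:
        # (batch, k, action_dim): stacked predictions from all basis heads.
        return torch.stack([head(features) for head in self.heads], dim=1)

    def forward(self, features: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # Adapted policy: linear combination of the basis actions with calibrated w.
        basis = self.basis_actions(features)                # (batch, k, action_dim)
        return torch.einsum("k,bkd->bd", weights, basis)    # (batch, action_dim)
```

At deployment, `weights` is the coefficient vector returned by the calibration step above, so switching to a new domain only changes a k-dimensional vector, not the network parameters.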

The Results


By calibrating our model on a single expert trajectory, we substantially outperform the baseline OpenVLA model: we achieve lower mean absolute error on 18 of 27 in-distribution datasets and on all 5 out-of-distribution datasets. The gains are not limited to prediction accuracy. In both simulated and on-robot experiments, our model achieves a 70% to 100% success rate while the baseline OpenVLA model fails outright.

In simulated and on-robot experiments, our model achieves a 70-100% success rate while the base OpenVLA model fails outright.
We evaluate on five tasks: simulated block lifting, simulated door opening, on-robot goal reaching, on-robot block lifting, and on-robot pen insertion.

BibTeX

@article{MoS-VLA2025,
        title={MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation},
        author={Ruihan Zhao and Tyler Ingebrand and Sandeep Chinchali and Ufuk Topcu},
        journal={Under Review},
        year={2025}
}