MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation

The University of Texas at Austin
Under Review, 2025

*Equal Contribution

Abstract

Vision-Language-Action (VLA) models trained on large robot datasets promise general-purpose, robust control across diverse domains and embodiments. However, existing approaches often fail out-of-the-box when deployed in novel environments, embodiments, or tasks. We introduce Mixture of Skills VLA (MoS-VLA), a framework that represents robot manipulation policies as linear combinations of a finite set of learned basis functions. During pretraining, MoS-VLA jointly learns these basis functions across datasets from the Open X-Embodiment project, producing a structured skill space. At test time, adapting to a new task requires only a single expert demonstration. The corresponding skill representation is then inferred via a lightweight convex optimization problem that minimizes the L1 action error, without requiring gradient updates. This gradient-free adaptation incurs minimal overhead while enabling rapid instantiation of new skills. Empirically, MoS-VLA achieves lower action-prediction error on five out of five unseen datasets and succeeds in both simulation and real-robot tasks where a pretrained VLA model fails outright.

The Selling Point

Our method uses one expert demonstration to calibrate the VLA model to a new setting. This process takes only a few seconds.

Our model adapts to new lab settings using only one expert demonstration. To do so, we solve a simple least-absolute-error optimization problem that projects the expert demonstration onto a set of learned basis functions. This step takes only a few seconds even on an RTX 3090 and requires no gradient updates. We then execute the adapted policy in the new environment, yielding a model that can be deployed in new settings with minimal overhead.
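To make the calibration step concrete, below is a minimal sketch of one-shot least-absolute-error fitting, written as the standard linear-program reformulation of L1 regression. The shapes, names (`basis_preds`, `expert_actions`, `calibrate_weights`), and use of SciPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog


def calibrate_weights(basis_preds: np.ndarray, expert_actions: np.ndarray) -> np.ndarray:
    """Solve min_w || B w - y ||_1 as a linear program (illustrative sketch).

    basis_preds:    (T, k, d) basis-head action predictions along the demonstration.
    expert_actions: (T, d)    actions from the single expert demonstration.
    Returns w: (k,) mixing coefficients for the adapted policy.
    """
    T, k, d = basis_preds.shape
    B = basis_preds.transpose(0, 2, 1).reshape(T * d, k)  # stack timesteps and action dims
    y = expert_actions.reshape(T * d)

    # Standard LP reformulation of L1 regression with slack variables s >= 0:
    #   minimize sum(s)  subject to  -s <= B w - y <= s
    m = T * d
    c = np.concatenate([np.zeros(k), np.ones(m)])          # objective: sum of slacks
    A_ub = np.block([[B, -np.eye(m)], [-B, -np.eye(m)]])   # two-sided absolute-value constraints
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * k + [(0, None)] * m          # w free, slacks nonnegative

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k]
```

Because the problem has only k unknown coefficients plus slack variables, solving it is far cheaper than any gradient-based finetuning, which is what makes adaptation take only seconds.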

The Architecture

Our architecture simply adds k output heads to any VLA model, where k is the number of basis functions. The adapted policy is a linear combination of these basis functions.

We adapt OpenVLA by introducing k basis action heads, each acting as a basis function conditioned on the task description and image observation. The expert policy for each context (domain) is then represented as a linear combination of these basis functions, where the coefficients are calibrated from a single expert demonstration. This calibration step involves solving a simple linear program corresponding to least absolute error minimization. Crucially, this process does not require any gradient updates, making it extremely fast.
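The sketch below shows one way to realize the k-head idea, under assumed interfaces: a VLA backbone produces a feature for each (image, instruction) context, k lightweight heads map it to candidate actions, and the adapted policy mixes them with the calibrated weights. The class name, the linear heads, and the feature dimension are placeholders rather than the OpenVLA API.

```python
import torch
import torch.nn as nn


class MixtureOfSkillsHead(nn.Module):
    """k basis action heads on top of a shared VLA feature (illustrative sketch)."""

    def __init__(self, feature_dim: int, action_dim: int, num_basis: int):
        super().__init__()
        # One action head per basis function; each is a small linear map here.
        self.heads = nn.ModuleList(
            [nn.Linear(feature_dim, action_dim) for _ in range(num_basis)]
        )

    def basis_actions(self, features: torch.Tensor) -> torch.Tensor:
        # (batch, k, action_dim): stacked predictions from all basis heads.
        return torch.stack([head(features) for head in self.heads], dim=1)

    def forward(self, features: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # Adapted policy: linear combination of the basis actions with calibrated w.
        basis = self.basis_actions(features)                # (batch, k, action_dim)
        return torch.einsum("k,bkd->bd", weights, basis)    # (batch, action_dim)
```

At deployment, `weights` is the coefficient vector returned by the calibration step above, so switching to a new domain only changes a k-dimensional vector, not the network parameters.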

The Results


By calibrating our model on a single expert trajectory, we substantially outperform the baseline OpenVLA model: we achieve lower mean absolute error on 18 of 27 in-distribution datasets and on all 5 out-of-distribution datasets. The gains are not limited to prediction accuracy. In both simulated and on-robot experiments, our model achieves a 70% to 100% success rate while the baseline OpenVLA model fails outright.

In simulated and on-robot experiments, our model achieves a 70-100% success rate while the base OpenVLA model fails outright.
We evaluate on five tasks: simulated block lifting, simulated door opening, on-robot goal reaching, on-robot block lifting, and on-robot pen insertion.

BibTeX

@article{MoS-VLA2025,
        title={MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation},
        author={Ruihan Zhao and Tyler Ingebrand and Sandeep Chinchali and Ufuk Topcu},
        journal={Under Review},
        year={2025}
}