Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization
Vision-Language-Action (VLA) models have achieved remarkable results, driven by the rich implicit knowledge of their vision-language components. However, building generalist robotic agents demands grounding this knowledge in physical interaction, especially in contact-rich scenarios where fine-grained force control is essential. This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. The framework incorporates a hybrid position-force controller that translates the model's intentions into precise physical actions, and a reasoning module that allows the robot to adapt its strategy based on tactile feedback. A key finding is that the prior knowledge of the vision-language model (VLM) already contains a semantic understanding of physical interaction; by connecting it to the robot's tactile sensors with only a few demonstrations, we can activate this prior knowledge to achieve zero-shot generalization in contact-rich tasks.
Figure 1: Tactile-VLA architecture. Vision, language, tactile, and proprioceptive inputs are encoded and fused via a pre-trained VLM. The tactile-aware action expert generates target position and force for hybrid position-force control.
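To make the hybrid position-force control idea concrete, here is a minimal sketch of one common formulation (our own illustration with assumed gains and a per-axis selection matrix, not the paper's exact controller): position-controlled axes track the target pose, while the complementary axes regulate contact force.

```python
import numpy as np

def hybrid_position_force_control(x, x_target, f, f_target, S,
                                  kp=50.0, kf=0.1):
    """Illustrative hybrid position-force control law (assumed form).

    x, x_target : current / desired end-effector position (3,)
    f, f_target : measured / desired contact force (3,)
    S           : per-axis selection vector; 1 = position-controlled axis,
                  0 = force-controlled axis
    Returns a commanded Cartesian displacement for the low-level controller.
    """
    S = np.diag(S)
    position_term = kp * (x_target - x)   # track position on selected axes
    force_term = kf * (f_target - f)      # regulate force on the other axes
    return S @ position_term + (np.eye(3) - S) @ force_term

# Example: track position in x/y, regulate a 2 N pushing force along z.
cmd = hybrid_position_force_control(
    x=np.zeros(3), x_target=np.array([0.1, 0.0, 0.0]),
    f=np.array([0.0, 0.0, 0.5]), f_target=np.array([0.0, 0.0, 2.0]),
    S=np.array([1.0, 1.0, 0.0]))
```

In Tactile-VLA, the target position and target force fed to such a controller are produced by the tactile-aware action expert shown in Figure 1.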
Tactile-VLA unlocks the physical knowledge in Vision-Language-Action models, translating abstract understanding into precise force control. Its token-level fusion integrates visual, linguistic, tactile, and proprioceptive information, allowing all modalities to cross-attend freely during contact-rich interaction.
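As a rough illustration of what token-level fusion can look like (module names, dimensions, and the encoder choice are our assumptions, not the released architecture), each modality is projected into a shared token space and concatenated, so that self-attention can mix all modalities freely:

```python
import torch
import torch.nn as nn

class TokenLevelFusion(nn.Module):
    """Illustrative token-level fusion: project each modality into a shared
    token space, concatenate, and let self-attention mix all modalities."""

    def __init__(self, d_model=512, n_heads=8, n_layers=2,
                 d_vision=768, d_text=768, d_tactile=128, d_proprio=32):
        super().__init__()
        self.proj = nn.ModuleDict({
            "vision":  nn.Linear(d_vision,  d_model),
            "text":    nn.Linear(d_text,    d_model),
            "tactile": nn.Linear(d_tactile, d_model),
            "proprio": nn.Linear(d_proprio, d_model),
        })
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, vision, text, tactile, proprio):
        # Each input: (batch, n_tokens_for_modality, d_modality)
        tokens = torch.cat([
            self.proj["vision"](vision),
            self.proj["text"](text),
            self.proj["tactile"](tactile),
            self.proj["proprio"](proprio),
        ], dim=1)                      # (batch, total_tokens, d_model)
        return self.encoder(tokens)    # every token attends to every modality
```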
Figure 2: Data collection setup. We augmented the UMI gripper with dual high-resolution tactile sensors and a GoPro camera to capture synchronized multimodal demonstrations.
Figure 3: Key capabilities of Tactile-VLA. (a) Tactile-aware instruction following: generalizes force-related language to new tasks. (b) Utilizing tactile-relevant common sense: applies appropriate grasps to different objects. (c) Adaptive tactile-involved reasoning: recovers from failures through reasoning.
Task Success Rate Evaluation: We evaluate our model on USB and charger insertion-and-extraction tasks, comparing against the π₀-base and π₀-fast baselines. Tactile-VLA achieves substantially higher success rates on both tasks: 35% versus 5% and 0% on the USB task, and 90% versus 40% and 25% on the charger task, demonstrating the critical importance of tactile feedback.
| Model | USB (%) | Charger (%) |
|---|---|---|
| π₀-base | 5 | 40 |
| π₀-fast | 0 | 25 |
| Tactile-VLA | 35 | 90 |
Force Control Generalization: We test the model's ability to interpret force-related language commands. Trained on the USB task, the model applies 0.51 N for 'softly' and 2.57 N for 'hard', and generalizes to novel force adverbs ('gently', 'firmly') as well as to the unseen charger task. Remarkably, Tactile-VLA shows clear force differentiation (4.68 N vs. 9.13 N) in the zero-shot setting, while the baselines show no correlation between the language command and the applied force.
| Model | 'softly' (N) | 'hard' (N) | 'gently' (N) | 'firmly' (N) | Zero-shot 'softly' (N) | Zero-shot 'hard' (N) |
|---|---|---|---|---|---|---|
| π₀-base | 2.41 | 2.68 | 2.35 | 2.72 | 6.61 | 5.69 |
| π₀-fast | 2.61 | 2.33 | 2.79 | 2.45 | 7.37 | 6.42 |
| Tactile-VLA | 0.51 | 2.57 | 0.75 | 1.98 | 4.68 | 9.13 |
Chain-of-Thought Reasoning: We demonstrate that Tactile-VLA-CoT can reason about tactile feedback to recover from failures. When wiping fails due to insufficient force, the model analyzes the situation and autonomously increases force for successful completion.
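A schematic of such a feedback-driven recovery loop is sketched below; the helper callables (policy, execute, check_success, reason) are hypothetical stand-ins for illustration, not the paper's implementation.

```python
def wipe_with_tactile_recovery(policy, execute, check_success, reason,
                               target_force=1.0, max_attempts=3):
    """Illustrative recovery loop; all helper callables are hypothetical.

    policy        : maps (instruction, target_force) -> action plan
    execute       : runs the plan on the robot, returns a tactile/force log
    check_success : inspects the outcome for task success
    reason        : chain-of-thought step proposing a new target force
    """
    for _ in range(max_attempts):
        plan = policy(instruction="wipe the board", target_force=target_force)
        tactile_log = execute(plan)
        if check_success(tactile_log):
            return True
        # Reasoning step: e.g. "marks remain and the measured contact force
        # was low, so increase the commanded force and retry."
        target_force = reason(tactile_log, target_force)
    return False
```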
Reasoning Performance Evaluation: We test the model's ability to adapt from whiteboard (in-domain) to blackboard (out-of-domain) wiping tasks. While the baselines completely fail on the novel blackboard task (0% success), Tactile-VLA-CoT achieves 80% success through reasoning-based adaptation, on par with its 75% in-domain success rate.
| Model | In-Domain: Whiteboard (%) | Out-of-Domain: Blackboard (%) |
|---|---|---|
| π₀-base | 40 | 0 |
| π₀-fast | 45 | 0 |
| Tactile-VLA | 80 | 15 |
| Tactile-VLA-CoT | 75 | 80 |
If you find our work useful, please consider citing:
@article{Huang2025TactileVLA,
  author  = {Jialei Huang and Shuo Wang and Fanqi Lin and Yihang Hu and Chuan Wen and Yang Gao},
  title   = {Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization},
  journal = {arXiv preprint arXiv:2507.09160},
  year    = {2025}
}