Learning to Fold: prizewinning solution at LeHome Challenge 2026 (1st place online, 2nd offline)
Abstract
A vision-language-action policy improved with reinforcement learning uses shared network predictions for success estimation and advantage calculation in bimanual garment folding, employing established RL techniques with novel optimization and deployment strategies.
I describe my solution to the LeHome Challenge 2026, an ICRA 2026 competition on bimanual garment folding. The system placed 1st of 62 teams in the online (simulation) round and 2nd in the real-world final. It improves a vision-language-action (VLA) policy with a reinforcement-learning loop. The policy is its own value function: the same network that predicts actions also predicts success, progress, and a few task-relevant future quantities, and those predictions drive advantage estimation, live failure detection, and candidate selection. The work mostly recombines existing RL ideas with engineering and optimization contributions that can be used together as one recipe or individually: AWR + RECAP combined for a flow-matching VLA; an asynchronous distributed training / rollout pipeline coordinated through the HuggingFace Hub; inference-time hyperparameter optimization via Thompson sampling; and a sim-to-real recipe with camera-alignment tooling, heavy augmentation, and DAgger-like human-in-the-loop (HIL) data collection.
Community
Earlier this month I placed 🥈 2nd in the LeHome Challenge at ICRA 2026, and 🥇 1st of 62 teams in the simulation round before that. This paper explains my solution.
The task: teach a cheap two-armed robot to fold different garments (long tops, short tops, long pants, and shorts) both in simulation and on a real robot. The robot only sees three cameras and never gets told which garment it's folding, so it has to figure that out on its own.
Here's the short version of how it works:
🧠 The policy is its own value function. From the same forward pass that picks the next action chunk, cheap heads predict success probability, task completion %, garment type, and future keypoint distances + a Q-residual. Those become the advantage signal for RL, with no separate critic.
A fully asynchronous RL loop coordinated only through the HF Hub: 1 trainer (H200) ships a fresh checkpoint roughly every 40 min while N rollout workers (and a human doing teleop / DAgger corrections) collect data in parallel. Nobody waits; the loop exploits its off-policy nature to the fullest.
Binary success is too sparse, so I densify it by combining several signals at once: objective keypoint distances, the success-probability value baseline, completion %, and the relative success rate of different garments.
🎛️ The RL combines AWR (sample good actions more often) + RECAP (feed advantage as a conditioning input, then ask for good actions only at inference, with CFG). I also tune the inference knobs (execution length, playback speed, inpainting overlap, CFG scale, best-of-N) with a per-parameter Thompson-sampling bandit folded into rollout collection.
Round 1: 1st of 62 teams, 79.6% success (+6.1 over 2nd place, top score on 3 of 4 garments). Round 2: 2nd place, with only ~1 week and no access to the eval robot, so the pipeline was sim → my robot → their robot, leaning on heavy augmentation to make the policy more robust.
💡 Biggest win left on the table: I ran full RL in sim and BC + human corrections on real separately. They're very complementary, and combining them should push success much higher.
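To make the "policy is its own value function" and AWR + RECAP points above more concrete, here is a minimal sketch in plain Python. It is not the competition code. It assumes the success-probability head is used as a value estimate, that advantages are n-step differences of that estimate (with the real episode outcome substituted past the end), that AWR uses clipped exponential weights, and that the RECAP conditioning input is a binarized "good action" flag. The function names, `horizon`, `beta`, `max_weight`, and `threshold` are all hypothetical.

```python
import math


def n_step_advantages(success_probs, final_success, horizon=3):
    """A_t = V(s_{t+horizon}) - V(s_t), using predicted success probability as V.

    Past the end of the episode, V is replaced by the real outcome (1.0 or 0.0).
    This estimator is an assumption made for illustration.
    """
    outcome = 1.0 if final_success else 0.0
    steps = len(success_probs)
    advantages = []
    for t in range(steps):
        future = success_probs[t + horizon] if t + horizon < steps else outcome
        advantages.append(future - success_probs[t])
    return advantages


def awr_weights(advantages, beta=0.2, max_weight=20.0):
    """AWR-style sampling weights: exp(A / beta), clipped to avoid blow-ups."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def recap_labels(advantages, threshold=0.0):
    """Binarized advantage flag fed to the policy as a conditioning input (assumed form)."""
    return [a > threshold for a in advantages]


def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: push the unconditioned prediction toward the
    'good action' conditioned one. scale=1.0 returns the conditioned prediction."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Hypothetical per-step success predictions for one successful episode.
    probs = [0.5, 0.6, 0.4, 0.7]
    adv = n_step_advantages(probs, final_success=True, horizon=2)
    print("advantages:", adv)
    print("AWR weights:", awr_weights(adv))
    print("RECAP flags:", recap_labels(adv))
    print("guided output:", cfg_combine([0.0, 1.0], [1.0, 1.0], scale=2.0))
```

During training, the weights would bias which samples are drawn and the flags would be passed in as conditioning; at inference, the flag is fixed to "good" and `cfg_combine` sharpens the effect, with the CFG scale being one of the tunable inference knobs.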
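The per-parameter Thompson-sampling bandit mentioned above can be illustrated with a small standard-library sketch. It assumes each inference knob gets its own independent Beta-Bernoulli bandit over a discrete set of candidate values, and that every rollout's binary success updates the arm chosen for each knob. The candidate values, the class and function names, and the toy success model are hypothetical and only serve the demo.

```python
import random


class BetaArm:
    """Beta posterior over the success rate of one candidate value."""

    def __init__(self):
        self.wins = 0
        self.losses = 0

    def sample(self, rng):
        # Uniform Beta(1, 1) prior updated with the observed episode outcomes.
        return rng.betavariate(1 + self.wins, 1 + self.losses)


class ParamBandit:
    """Independent Thompson-sampling bandit for a single inference knob."""

    def __init__(self, values, rng):
        self.arms = {v: BetaArm() for v in values}
        self.rng = rng

    def choose(self):
        # Draw one posterior sample per arm and play the arm with the largest draw.
        return max(self.arms, key=lambda v: self.arms[v].sample(self.rng))

    def update(self, value, success):
        arm = self.arms[value]
        if success:
            arm.wins += 1
        else:
            arm.losses += 1


# Hypothetical search space; the real knobs and ranges are not given in the text.
SEARCH_SPACE = {
    "cfg_scale": [1.0, 1.5, 2.0],
    "execution_length": [10, 20, 30],
}


def toy_success_prob(config):
    """Made-up environment: cfg_scale=1.5 and execution_length=20 are best."""
    p = 0.3
    if config["cfg_scale"] == 1.5:
        p += 0.3
    if config["execution_length"] == 20:
        p += 0.3
    return p


def run_demo(episodes=2000, seed=0):
    rng = random.Random(seed)
    env_rng = random.Random(seed + 1)
    bandits = {name: ParamBandit(vals, rng) for name, vals in SEARCH_SPACE.items()}
    for _ in range(episodes):
        # One rollout: every knob picks its value independently.
        config = {name: b.choose() for name, b in bandits.items()}
        success = env_rng.random() < toy_success_prob(config)
        for name, b in bandits.items():
            b.update(config[name], success)
    return bandits


if __name__ == "__main__":
    for name, bandit in run_demo().items():
        pulls = {v: a.wins + a.losses for v, a in bandit.arms.items()}
        print(name, pulls)
```

Because exploration happens inside ordinary rollouts, tuning of this kind costs no extra episodes: every collected episode is both training data and a bandit observation. Treating the knobs as independent ignores interactions between them, which is a simplification of this sketch.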