00 Summary
A fully fine-tuned π0.5 reaches 94.2% wrist-only success on LIBERO-Spatial when the third-person camera is physically removed from attention, close to the 96.6% both-camera ceiling, while the identical recipe that instead feeds a black (zeroed) frame collapses to 27.4%.
The headline finding is that the apparent difficulty of wrist-only π0.5 was largely an artifact of the masking mechanism, not a capacity ceiling. Frozen-backbone LoRA collapses to a ~5% behavior-cloning floor regardless of dropout schedule (issue #5); unfreezing the backbone (full-FT) is necessary but not sufficient — how the view is removed matters as much as whether the backbone is trainable. This report covers the five full-FT wrist-only baselines (two evaluated, three in-flight self-distillation ablations under issue #10), how each is trained and wired into the openpi pipeline, and the line of code where every mechanism lives.
01 Background
Wrist-only deployment
LIBERO[3] exposes two cameras: a fixed third-person agentview and a wrist-mounted eye_in_hand. Many real deployments lack a reliable exterior view (occlusion, mounting cost, calibration drift), so a policy that runs wrist-only is operationally valuable. The question: can a model pretrained with both cameras be adapted to ignore the exterior view without collapsing?
π0.5, in one paragraph
π0.5[2] (a successor to π0[1]) is a vision-language-action flow-matching model: a PaliGemma-style backbone (gemma_2b) encodes images + language, and a smaller action expert (gemma_300m) denoises an action chunk via conditional flow matching[4]. Each camera is tokenized into image tokens that the transformer attends over jointly with the language and action tokens. The training target is the flow-matching velocity u_t = noise − actions; the loss is mean((v_θ − u_t)²).
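The target and loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the openpi implementation: the straight-line interpolation `x_t = t * noise + (1 - t) * actions` is assumed because its time derivative is exactly the stated target `u_t = noise - actions`, and the uniform time sampling, the action dimension of 7, and the function names are all hypothetical.

```python
import numpy as np

def flow_matching_batch(actions, rng):
    """Build (x_t, t, u_t) for one conditional flow-matching step (sketch)."""
    noise = rng.standard_normal(actions.shape)
    # One time value per sample; uniform sampling is an assumption of this sketch.
    t = rng.uniform(0.001, 0.999, size=(actions.shape[0], 1, 1))
    # Straight-line path: actions at t=0, pure noise at t=1 (assumed convention).
    x_t = t * noise + (1.0 - t) * actions
    # Target velocity as stated in the text: d(x_t)/dt = noise - actions.
    u_t = noise - actions
    return x_t, t, u_t

def flow_matching_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))

rng = np.random.default_rng(0)
# batch=4, action_horizon=10 (from the recipe), action_dim=7 (assumed)
actions = rng.standard_normal((4, 10, 7))
x_t, t, u_t = flow_matching_batch(actions, rng)

perfect = flow_matching_loss(u_t, u_t)                    # a perfect predictor
zero_pred = flow_matching_loss(np.zeros_like(u_t), u_t)   # a trivial predictor
```

A model that predicts `u_t` exactly drives the loss to zero; any other prediction, such as the all-zeros velocity, leaves a positive residual.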
The LoRA floor (issue #5) — why this report exists
The first wrist-only batch adapted π0.5 with LoRA[5] (~1.5% trainable params, frozen backbone). Both dropout schedules collapsed to a ~5% floor:
- LoRA, agentview dropped 50% of the time (p=0.5): 4.4% (22/500)
- LoRA, agentview dropped always (p=1.0): 5.0% (25/500, reproduced from a 4.6% r1)
A ~1.5%-trainable adapter on an agentview-pretrained backbone cannot rewire onto a wrist-only distribution. Issue #5's TODO — "re-test the hypothesis out of the floor regime: unfrozen backbone" — is the starting point for the full-FT baselines below. Reference anchors for the wrist-only task:
- Both-camera ceiling: 96.6%, from the full-obs π0.5 LoRA (PI05_LORA_r2), also the distillation teacher.
- From-scratch wrist-only (diffusion policy): 50.8% (p=1.0) / 30.0% (p=0.5); a small from-scratch policy trivially ignores a dead channel.
02 Hypothesis
Three falsifiable claims drive the full-FT batch:
- H1 — capacity: the LoRA collapse is a frozen-backbone capacity artifact, not a wrist-only ceiling. Unfreezing the whole backbone (full-FT) should escape the ~5% floor.
- H2 — mechanism: a from-scratch diffusion policy reaches 50.8% but full-FT π0.5 zero-mask only 27.4%. The gap is suspected to come from the zero-mask substitution — a pretrained image tower still encodes a dead black frame and the transformer still attends to it. Physically excluding the agentview tokens from attention should beat the zero-mask.
- H3 — privileged signal: a teacher that sees both cameras can transfer its behavior to a wrist-only student at the flow-matching velocity level, lifting it toward the both-cam ceiling. Self-distillation (and a curriculum that gradually withdraws the exterior view) should beat plain behavior cloning under the same physical-removal condition.
03 Approach
Shared recipe
All five baselines inherit the canonical upstream pi05_libero recipe (configs/experiment/pi05/libero_original.yaml): pi05_base loader, gemma_2b backbone + gemma_300m action expert, no LoRA (the freeze filter returns Nothing ⇒ all params trainable), model-side EMA 0.999, action_horizon=10, cosine LR with peak == decay 5e-5 (≈ constant after 10k warmup), 30k steps, seed 42. Upstream batch is 256 (multi-GPU FSDP); single-H200 runs override batch_size=32 (dry-run-confirmed: ~108 GB peak, no OOM). All run via the openpi trainer — the AiroPi 8-GPU FSDP backend has no view-masking support.
System diagram — where each mechanism is injected
flowchart LR
  DS["LeRobot LIBERO libero_spatial"]:::data
  DS --> RP["repack_transforms"]:::stage
  RP --> DTI["data_transforms.inputs after LiberoInputs"]:::stage
  DTI --> M["pi0.5 model forward : embed_prefix to make_attn_mask"]:::model
  M --> L["BC flow-matching loss : mean of (v - u_t) squared"]:::loss
  RP -. "M1 view_dropout zeros, TRAIN-only, _ViewDropoutRepack :219" .-> M
  DTI -. "M2 view_remove image_mask=False, TRAIN+EVAL, _ViewMaskOutTransform :274" .-> M
  TC["both-cam teacher self or checkpoint"]:::teacher -. "M3 distill in-loss mask, compute_distill_loss :208" .-> L
  classDef data fill:#E3DACC,stroke:#B85C3E,color:#141413;
  classDef stage fill:#F0EEE6,stroke:#87867F,color:#3D3D3A;
  classDef model fill:#FAF9F5,stroke:#D97757,color:#141413,stroke-width:2px;
  classDef loss fill:#FAF9F5,stroke:#788C5D,color:#141413,stroke-width:2px;
  classDef teacher fill:#F0EEE6,stroke:#788C5D,color:#3D3D3A;
Fig. 1 — Three mutually-exclusive masking seams. M1 (view_dropout) is repack-level ⇒ train-only. M2 (view_remove) is in data_transforms.inputs ⇒ applied at train and served eval. M3 (distill) leaves the data pipeline both-cam and masks inside the loss. Exclusivity is enforced loudly in run_openpi_train.py.
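The exclusivity rule in Fig. 1 can be pictured as a small config check. The sketch below is hypothetical and is not the code in run_openpi_train.py: the function name, the `MASKING_BLOCKS` tuple, and the assumption that every block carries an `enabled` flag (only `view_remove.enabled` appears in this report) are introduced purely for illustration.

```python
from typing import Optional

# Root-level config blocks that each select one masking seam (M1, M2, M3).
MASKING_BLOCKS = ("view_dropout", "view_remove", "distill")

def active_masking_mechanism(cfg: dict) -> Optional[str]:
    """Return the single enabled masking block, or None; fail loudly on conflicts."""
    enabled = [
        name for name in MASKING_BLOCKS
        if (cfg.get(name) or {}).get("enabled", False)  # 'enabled' key is assumed
    ]
    if len(enabled) > 1:
        raise ValueError(f"masking mechanisms are mutually exclusive, got: {enabled}")
    return enabled[0] if enabled else None

# A dict shaped like the B2 physical-removal config.
b2_cfg = {
    "view_remove": {
        "enabled": True,
        "seed": 42,
        "train": {"agentview": 1.0},
        "eval": {"agentview": 1.0},
    }
}
print(active_masking_mechanism(b2_cfg))
```

Raising instead of silently picking one block keeps a mis-merged config from training with two overlapping masks.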
Mechanism diagram — how image_mask=False removes a camera
flowchart TD
  A["image_masks base_0_rgb = False (agentview)"]:::set --> B["embed_prefix pi0.py:106 : repeat mask over agentview tokens, input_mask = False"]:::stage
  B --> C["make_attn_mask pi0.py:19 : valid = input_mask outer-product"]:::stage
  C --> D["agentview rows AND cols zeroed in attention"]:::out
  D --> E["tokens excluded from attention = treated like padding"]:::win
  Z["view_dropout zeros : image_mask stays True"]:::bad -. "pixels=0 but tokens still VALID, model still attends" .-> Y["wasted capacity, OOD frame = 27.4%"]:::bad
  classDef set fill:#E3DACC,stroke:#B85C3E,color:#141413;
  classDef stage fill:#F0EEE6,stroke:#87867F,color:#3D3D3A;
  classDef out fill:#FAF9F5,stroke:#D97757,color:#141413;
  classDef win fill:#FAF9F5,stroke:#788C5D,color:#141413,stroke-width:2px;
  classDef bad fill:#FAF9F5,stroke:#B85C3E,color:#141413,stroke-dasharray:4 3;
Fig. 2 — Physical removal vs zero-mask. The outer product in make_attn_mask means any token with input_mask=False has its entire row and column zeroed: it can neither attend nor be attended to. This is openpi's own padding-image convention (the unused right_wrist_0_rgb slot), so num_images and the pi05_base weight contract stay intact.
make_attn_mask — the two lines that matter
import jax.numpy as jnp

def make_attn_mask(input_mask, mask_ar):  # third_party/openpi/src/openpi/models/pi0.py:19
    # Broadcast the autoregressive-block flags to one value per token.
    mask_ar = jnp.broadcast_to(mask_ar, input_mask.shape)
    cumsum = jnp.cumsum(mask_ar, axis=1)
    # A query may attend to keys whose block index is <= its own.
    attn_mask = cumsum[:, None, :] <= cumsum[:, :, None]
    valid_mask = input_mask[:, None, :] * input_mask[:, :, None]  # ← False row/col => dropped
    return jnp.logical_and(attn_mask, valid_mask)
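To make the row-and-column effect concrete, here is a NumPy mirror of the function above applied to a toy sequence. The token counts (two agentview, two wrist, one language token) and the all-zero `mask_ar` are illustrative assumptions, and `make_attn_mask_np` is a hypothetical name for this sketch.

```python
import numpy as np

def make_attn_mask_np(input_mask, mask_ar):
    """NumPy mirror of make_attn_mask, for inspection only."""
    mask_ar = np.broadcast_to(mask_ar, input_mask.shape)
    cumsum = np.cumsum(mask_ar, axis=1)
    attn_mask = cumsum[:, None, :] <= cumsum[:, :, None]
    valid_mask = input_mask[:, None, :] & input_mask[:, :, None]
    return np.logical_and(attn_mask, valid_mask)

# Toy prefix: [agentview, agentview, wrist, wrist, language]
mask_ar = np.zeros((1, 5), dtype=np.int32)  # all zeros => one bidirectional block

# Physical removal: agentview tokens carry input_mask=False.
removed = np.array([[False, False, True, True, True]])
mask_removed = make_attn_mask_np(removed, mask_ar)

# Zero-mask: pixels are black but every token is still marked valid.
zeroed = np.array([[True, True, True, True, True]])
mask_zeroed = make_attn_mask_np(zeroed, mask_ar)

print(mask_removed[0].astype(int))
print(mask_zeroed[0].astype(int))
```

With physical removal the first two rows and columns are all zero, so the agentview tokens neither attend nor are attended to. Under the zero-mask the matrix stays fully dense, so the black-frame tokens still take part in attention.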
The four masking mechanisms, contrasted
| Mechanism | Pixels | image_mask | Attention | Injection seam | Phase |
|---|---|---|---|---|---|
| view_dropout (zeros) | zeroed | True | tokens still attended (black frame) | repack_transforms | train-only |
| view_remove (physical) | zeroed (cosmetic) | False | tokens excluded entirely | data_transforms.inputs | train + eval |
| distill in-loss mask | both-cam batch | False on student forward | student excludes agentview; teacher keeps it | inside compute_distill_loss | train (eval via view_remove) |
| curriculum | both-cam batch | False with prob p(step) | ramps 0→1 over first 75% then holds | curriculum_p × in-loss mask | train |
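The difference between the first two rows of the table comes down to one flag on the observation. The sketch below is a hypothetical stand-in for the repository's transforms: the `image` / `image_mask` dictionary layout, the 224×224 frame size, and the function names are assumptions, while the `base_0_rgb` and `left_wrist_0_rgb` keys come from the key map cited in this report.

```python
import numpy as np

AGENTVIEW = "base_0_rgb"  # openpi slot for the third-person camera

def zero_mask_view(obs, key=AGENTVIEW):
    """view_dropout-style: black out the pixels, leave the mask True."""
    out = {"image": dict(obs["image"]), "image_mask": dict(obs["image_mask"])}
    out["image"][key] = np.zeros_like(obs["image"][key])
    return out

def remove_view(obs, key=AGENTVIEW):
    """view_remove-style: zero the pixels (cosmetic) and set the mask False."""
    out = zero_mask_view(obs, key)
    out["image_mask"][key] = np.False_
    return out

rng = np.random.default_rng(0)
obs = {
    "image": {
        "base_0_rgb": rng.integers(1, 255, size=(224, 224, 3), dtype=np.uint8),
        "left_wrist_0_rgb": rng.integers(1, 255, size=(224, 224, 3), dtype=np.uint8),
    },
    "image_mask": {"base_0_rgb": np.True_, "left_wrist_0_rgb": np.True_},
}

zeroed = zero_mask_view(obs)
removed = remove_view(obs)
print(bool(zeroed["image_mask"][AGENTVIEW]), bool(removed["image_mask"][AGENTVIEW]))
```

Both variants hand the model identical pixels. Only the mask differs, and the mask alone decides whether the agentview tokens survive in the attention mask.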
Privileged distillation loss
The self-distillation variants add a velocity-matching term against a stop-gradient teacher that sees both cameras, plus a both-cam BC anchor that keeps the teacher pathway alive:
loss = BC(v_student_wrist, u_t) # always coefficient 1.0
+ lambda * || v_student_wrist - sg(v_teacher_bothcam) ||^2 # velocity distillation
+ anchor_frac * BC(v_teacher_bothcam, u_t) # both-cam anchor
curriculum: p(step) = 0.5 * (1 - cos(pi * progress)), progress = clip(step / (0.75*N), 0, 1)
The teacher shares the student's graphdef + params (mode=self, live, ema_decay=null); the only difference between the two forwards is the input observation. This adds activations but no extra parameter memory.
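The three-term objective can be written out as a small NumPy function to show how λ and anchor_frac combine. This is a sketch under stated assumptions, not `compute_distill_loss`: the velocities are passed in as plain arrays instead of coming from model forwards, the squared norm is averaged like the BC term, and NumPy has no gradients, so the stop-gradient on the teacher exists only as a comment.

```python
import numpy as np

def distill_loss_sketch(v_student_wrist, v_teacher_bothcam, u_t,
                        lam=1.0, anchor_frac=0.15):
    """BC + lam * velocity-match + anchor_frac * both-cam BC anchor (sketch)."""
    # Wrist-only behavior cloning, always with coefficient 1.0.
    bc_student = np.mean((v_student_wrist - u_t) ** 2)
    # Velocity distillation. In an autodiff framework the teacher output would be
    # wrapped in a stop-gradient here so this term only updates the student pass.
    velocity_match = np.mean((v_student_wrist - v_teacher_bothcam) ** 2)
    # Both-cam anchor: ordinary BC on the teacher forward (gradients flow here).
    bc_anchor = np.mean((v_teacher_bothcam - u_t) ** 2)
    total = bc_student + lam * velocity_match + anchor_frac * bc_anchor
    parts = {"bc": float(bc_student), "distill": float(velocity_match),
             "anchor": float(bc_anchor)}
    return float(total), parts

rng = np.random.default_rng(0)
u_t = rng.standard_normal((4, 10, 7))                       # target velocities
v_teacher = u_t + 0.1 * rng.standard_normal(u_t.shape)      # close to the target
v_student = u_t + 0.5 * rng.standard_normal(u_t.shape)      # further away

total, parts = distill_loss_sketch(v_student, v_teacher, u_t)
# lam=0 and anchor_frac=0 reduce the objective to plain wrist-only BC.
bc_only, _ = distill_loss_sketch(v_student, v_teacher, u_t, lam=0.0, anchor_frac=0.0)
```

Setting `lam=0` and `anchor_frac=0` recovers the plain wrist-only BC loss, which is why the teacher forward can run without contributing anything in that configuration.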
04 The five full-FT baselines
Each card: how it is trained (config, knobs, slurm) and how it is implemented (the line of code that does the masking).
B1 · view_dropout (zeros), p=1.0 27.4% · 137/500
Full-FT control. agentview RGB is replaced by a zeros tensor while its image_mask stays True, so the image tower encodes a black frame and the transformer still attends to it. The injection is repack-level (dataset-only ⇒ train-only); standard eval is already effectively wrist-only because the trained model ignores the dead view.
- config: configs/experiment/pi05/libero_wristonly_full.yaml (view_dropout.train.agentview=1.0)
- train: slurm 3067991 · outputs/20260521_165625_PI05_FULL_WRISTONLY_FULL_r1_j3067991
- eval: n=500 · outputs/eval_logs/PI05_FULL_WRISTONLY_FULL_EVAL_FULL
- impl: _ViewDropoutRepack (run_openpi_train.py:219) → apply_view_dropout (view_mask.py:209); block read at run_openpi_train.py:837
B2 · view_remove (physical), p=1.0 94.2% · 471/500
Matched control for B1 — byte-identical recipe, the only changed variable is the drop mechanism. _ViewMaskOutTransform flips image_mask['base_0_rgb']=False so the agentview tokens are excluded from make_attn_mask entirely (Fig. 2). Because it lives in data_transforms.inputs (run at both the data loader and the served Policy), the same physical removal applies at eval with no client-side hook.
- config: configs/experiment/pi05/libero_wristonly_physdrop_full.yaml (view_remove.{train,eval}.agentview=1.0)
- train: slurm 3071544 · outputs/20260523_024012_PI05_FULL_WRISTONLY_PHYSDROP_r1_j3071544
- eval: n=500 · outputs/eval_logs/PI05_FULL_WRISTONLY_PHYSDROP_EVAL_FULL; removal verified in server.log: eval physical view-removal {agentview=1}
- impl: _ViewMaskOutTransform (run_openpi_train.py:274) + LeRobotLiberoViewRemoveDataConfig (:328) → apply_view_image_mask (view_mask.py:257); block read at :874
server.log "eval physical view-removal" line — a suspiciously high SR usually means the view was not actually removed. B2's 94.2% passed both checks.
B3 · curriculum-only eval pending
Issue-#10 ablation 1/3. The student's agentview mask probability ramps cosine 0→1 over the first 75% of steps then holds at 1.0; λ=0 (no distillation), anchor_frac=0. Isolates whether gradually withdrawing the privileged view (vs withdrawing it from step 0, as B2 does) helps the full-FT model land a better wrist-only policy. No data-level transform — the student masks agentview in-loss; the teacher forward runs but contributes nothing.
- config: configs/experiment/pi05/distill/curriculum_only.yaml
- train: slurm 3071609 (RUNNING) → eval afterok 3071617 (n=50)
- impl: curriculum_p (distill.py:165) × _mask_view_per_sample inside compute_distill_loss (distill.py:208); built by _build_distill (config_factory.py:190)
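The schedule used by this ablation follows the formula p(step) = 0.5 · (1 − cos(π · progress)) with progress = clip(step / (0.75·N), 0, 1). A minimal sketch follows, assuming N = 30k steps from the shared recipe. The function names are hypothetical and this is not the code at distill.py:165; the per-sample Bernoulli draw is likewise an assumption about how p is applied.

```python
import math
import numpy as np

def curriculum_p_sketch(step, total_steps, curriculum_frac=0.75):
    """Cosine ramp of the agentview-mask probability from 0 to 1, then hold at 1."""
    progress = min(max(step / (curriculum_frac * total_steps), 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * progress))

def draw_student_mask(batch_size, p, rng):
    """True where a sample's student forward excludes agentview (assumed per-sample draw)."""
    return rng.random(batch_size) < p

total = 30_000  # training steps in the shared recipe
schedule = {s: curriculum_p_sketch(s, total) for s in (0, 11_250, 22_500, 30_000)}
for step, p in schedule.items():
    print(f"step {step:>6}: p = {p:.3f}")

rng = np.random.default_rng(42)
mask_mid = draw_student_mask(32, schedule[11_250], rng)  # roughly half masked
mask_end = draw_student_mask(32, schedule[30_000], rng)  # every sample masked
```

The ramp starts at zero, passes one half at the midpoint of the ramp window, and stays at one for the final quarter of training, so the last 7.5k steps see the same always-removed condition as B2.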
B4 · self-distill only eval pending
Issue-#10 ablation 2/3. Student is always wrist-only (curriculum off, p=1.0 constant); a live self-teacher (the same model M, both cameras, stop-gradient) supplies the velocity-matching signal at λ=1.0 with an anchor_frac=0.15 both-cam anchor. Isolates the distillation term without the curriculum: does matching M's both-cam velocities lift the wrist-only student above the plain physical-removal floor (B2)?
- config: configs/experiment/pi05/distill/selfdistill_only.yaml
- train: slurm 3071612 (RUNNING) → eval afterok 3071618 (n=50)
- impl: live teacher: mode=self, ema_decay=null in compute_distill_loss (distill.py:208); checkpoint path uses load_teacher_state (distill.py:329)
B5 · curriculum + self-distill eval pending · headline run
Issue-#10 ablation 3/3 — the full method. Curriculum-masked student (ramp 0→1@75%) + live both-cam self-teacher (λ=1.0, anchor_frac=0.15). Early in training (small p) the student mostly matches a both-cam M; as p→1 it specialises wrist-only while still matching the privileged teacher's velocities. Tests whether the two mechanisms compound beyond each alone (B3, B4) and beyond the physical-removal floor (B2).
- config: configs/experiment/pi05/distill/curric_selfdistill.yaml
- train: slurm 3071611 (RUNNING) → eval afterok 3071619 (n=50)
- impl: curriculum + distill both active in compute_distill_loss (distill.py:208); eval served wrist-only via VLA_ZOO_VIEW_REMOVE_EVAL=agentview (serve_pi0_libero.py:132/167)
05 Results
LIBERO-Spatial, agentview removed at eval, full 500-episode suite unless noted. Bars are scaled to the 96.6% both-cam ceiling. Reference rows are not full-FT baselines — they frame the result.
| Variant | Backbone | Mechanism | SR | Eval evidence |
|---|---|---|---|---|
| B2 physdrop p=1.0 | full-FT | view_remove (physical) | 94.2% | PI05_FULL_WRISTONLY_PHYSDROP_EVAL_FULL |
| B1 zero-mask p=1.0 | full-FT | view_dropout (zeros) | 27.4% | PI05_FULL_WRISTONLY_FULL_EVAL_FULL |
| B3 curriculum-only | full-FT | curriculum (in-loss) | pending | slurm 3071609 → 3071617 |
| B4 self-distill only | full-FT | distill λ=1.0 | pending | slurm 3071612 → 3071618 |
| B5 curric + self-distill | full-FT | distill + curriculum | pending | slurm 3071611 → 3071619 |
| both-cam ceiling (ref) | LoRA, full-obs | none | 96.6% | PI05_LORA_r2 (teacher) |
| LoRA distill (ref) | LoRA | privileged distill (2-stage) | 91.0% | PI05_LORA_WRISTONLY_DISTILL_EVAL_FULL |
| dp wrist-only (ref) | from-scratch | view_dropout p=1.0 | 50.8% | DP_WRISTONLY_FULL |
| LoRA floor (ref) | LoRA (frozen) | view_dropout p=1.0 | 5.0% | PI05_LORA_WRISTONLY_FULL_R2_EVAL_FULL |
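A Wilson score interval helps judge how much of a gap between rows is sampling noise, since success rates here are binomial proportions. The sketch below uses only the two episode counts reported for the evaluated baselines (137/500 and 471/500). The 47/50 count is hypothetical, included only to show how much wider the interval becomes at n=50, and the 95% level (z = 1.96) is an assumption.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

b1_lo, b1_hi = wilson_interval(137, 500)   # B1 zero-mask, reported count
b2_lo, b2_hi = wilson_interval(471, 500)   # B2 physical removal, reported count
# Hypothetical n=50 count at a similar rate, only to illustrate interval width.
n50_lo, n50_hi = wilson_interval(47, 50)

print(f"B1 137/500: [{b1_lo:.3f}, {b1_hi:.3f}]")
print(f"B2 471/500: [{b2_lo:.3f}, {b2_hi:.3f}]")
print(f"hypothetical 47/50 width: {n50_hi - n50_lo:.3f} vs n=500 width: {b2_hi - b2_lo:.3f}")
```

The B1 and B2 intervals do not overlap, so the mechanism effect is far outside sampling noise. The n=50 interval is several times wider, which is why a small preliminary eval can only rank configurations coarsely.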
06 Discussion
The dominant variable in wrist-only π0.5 is not the backbone's trainability but how the exterior view is removed. Zero-masking keeps the agentview tokens "valid", so a pretrained image tower spends capacity encoding a black frame and the transformer is obligated to attend to an out-of-distribution dead view — dragging full-FT to 27.4%, below even a from-scratch diffusion policy (50.8%). Physically excluding the tokens from attention (B2) restores near-ceiling performance (94.2%), and a frozen-LoRA adapter (5%) simply lacks the capacity to compensate for either deficiency.
Use image_mask=False for any "ablate a modality" experiment on π0/π0.5.
image_mask=False, full-FT." Conversely, if B3 (curriculum-only, no teacher) already matches B5, the privileged signal is redundant and only the curriculum schedule matters.
3071617/18/19, n=50) to populate the B3–B5 rows; promote the winning configuration to a full n=500 eval. Owner: tohkawa25. The ablation cleanly disentangles curriculum (B3) vs distillation (B4) vs both (B5), against the B2 physical-removal floor. Then re-run B2's recipe at view_remove p=0.5 to test the FULL-vs-P05 direction at full-FT capacity (still untested on π0.5).
07 Implementation details: code anchors
Every masking mechanism in this report, and the line of code where it lives.
| Module | file:line | Role |
|---|---|---|
| OPENPI_IMAGE_KEY_MAP | src/vla_zoo/data/view_mask.py:88 | canonical agentview → openpi base_0_rgb, eye_in_hand → left_wrist_0_rgb |
| apply_view_dropout | src/vla_zoo/data/view_mask.py:209 | zeros the RGB, mask stays True (blake2b-reproducible per-sample draw) |
| apply_view_image_mask | src/vla_zoo/data/view_mask.py:257 | sets image_masks[key]=np.False_ (physical removal) |
| _ViewDropoutRepack | scripts/run_openpi_train.py:219 | repack-level transform (train-only) for view_dropout |
| _ViewMaskOutTransform | scripts/run_openpi_train.py:274 | data_transforms.inputs transform (train+eval) for view_remove |
| LeRobotLiberoViewRemoveDataConfig | scripts/run_openpi_train.py:328 | data-config subclass that injects M2 after LiberoInputs |
| block readers (vd / vr / distill) | run_openpi_train.py:837 / 874 / 918 | read root cfg blocks; enforce mutual exclusivity (loud) |
| embed_prefix | third_party/openpi/.../models/pi0.py:106 | repeats image_masks[name] over image tokens → input_mask |
| make_attn_mask | third_party/openpi/.../models/pi0.py:19 | outer product drops False rows/cols from attention |
| compute_distill_loss | third_party/openpi/.../training/distill.py:208 | BC + λ·velocity-match + anchor; per-sample student mask |
| curriculum_p | third_party/openpi/.../training/distill.py:165 | cosine ramp 0→1 over first curriculum_frac of steps |
| load_teacher_state | third_party/openpi/.../training/distill.py:329 | loads + freezes a checkpoint teacher (mode=checkpoint) |
| _build_distill | src/vla_zoo/openpi/config_factory.py:190 | builds DistillConfig, fails loud on missing keys (no silent defaults) |
| eval view-removal hook | scripts/serve_pi0_libero.py:132 / 167 | VLA_ZOO_VIEW_REMOVE_EVAL env → server-side physical removal + log line |
Example: physical-removal config (B2)
_base: libero_original.yaml
wandb_run_name: pi05_wristonly_physdrop_full
# PHYSICAL camera removal (image_mask=False) - distinct from view_dropout zeros.
view_remove:
  enabled: true
  seed: 42
  train: { agentview: 1.0 }   # exclude third-person view from attention in training
  eval: { agentview: 1.0 }    # same physical removal at served eval
Reproduce
# Train (single H200, openpi trainer)
uv run python scripts/train.py \
--config configs/experiment/pi05/libero_wristonly_physdrop_full.yaml \
batch_size=32
# Eval n=500 (10 LIBERO-Spatial tasks x NUM_TRIALS=50), server-side physical removal
PI0_CKPT_DIR=<orbax .../29999> OPENPI_CONFIG=pi05_libero NUM_TRIALS=50 \
VLA_ZOO_VIEW_REMOVE_EVAL=agentview \
LOG_DIR=outputs/eval_logs/PI05_FULL_WRISTONLY_PHYSDROP_EVAL_FULL \
bash scripts_sh/eval_pi0_trained.sh libero_spatial
08 References
Papers
- [1] Black et al. π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence, 2024. arXiv:2410.24164
- [2] Physical Intelligence. π0.5: a VLA with Open-World Generalization. 2025. arXiv:2504.16054
- [3] Liu et al. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. NeurIPS 2023. arXiv:2306.03310
- [4] Lipman et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
- [5] Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
- [6] Chen et al. Learning by Cheating (privileged teacher → sensorimotor student). CoRL 2019. arXiv:1912.12294
- [7] Rusu et al. Policy Distillation. ICLR 2016. arXiv:1511.06295
Internal
- GitHub issues: #5 (wrist-only floor re-test) · #10 (curriculum + self-distillation ablation)
- Configs: configs/experiment/pi05/libero_wristonly_full.yaml, …/libero_wristonly_physdrop_full.yaml, …/distill/{curriculum_only,selfdistill_only,curric_selfdistill}.yaml
- Job ledger: configs/jobs.yaml, configs/jobs_distill.yaml
- Eval logs: outputs/eval_logs/PI05_FULL_WRISTONLY_*_EVAL_FULL/