π0.5 wrist-only — full-FT baselines

00 Summary

A fully fine-tuned π0.5 reaches 94.2% wrist-only success on LIBERO-Spatial when the third-person camera's tokens are excluded from attention (physical removal), close to the 96.6% both-camera ceiling, while the identical recipe that instead feeds a black (zeroed) frame collapses to 27.4%.

The headline finding is that the apparent difficulty of wrist-only π0.5 was largely an artifact of the masking mechanism, not a capacity ceiling. Frozen-backbone LoRA collapses to a ~5% behavior-cloning floor regardless of dropout schedule (issue #5); unfreezing the backbone (full-FT) is necessary but not sufficient — how the view is removed matters as much as whether the backbone is trainable. This report covers the five full-FT wrist-only baselines (two evaluated, three in-flight self-distillation ablations under issue #10), how each is trained and wired into the openpi pipeline, and the line of code where every mechanism lives.

94.2%: full-FT, physical removal (471/500)

27.4%: full-FT, zero-mask (137/500)

96.6%: both-cam ceiling (ref)

5.0%: frozen LoRA floor (ref)

01 Background

Wrist-only deployment

LIBERO^[3] exposes two cameras: a fixed third-person agentview and a wrist-mounted eye_in_hand. Many real deployments lack a reliable exterior view (occlusion, mounting cost, calibration drift), so a policy that runs wrist-only is operationally valuable. The question: can a model pretrained with both cameras be adapted to ignore the exterior view without collapsing?

π0.5, in one paragraph

π0.5^[2] (a successor to π0^[1]) is a vision-language-action flow-matching model: a PaliGemma-style backbone (gemma_2b) encodes images + language, and a smaller action expert (gemma_300m) denoises an action chunk via conditional flow matching^[4]. Each camera is tokenized into image tokens that the transformer attends over jointly with the language and action tokens. The training target is the flow-matching velocity u_t = noise − actions; the loss is mean((v_θ − u_t)²).
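The target and loss in the paragraph above can be made concrete with a minimal pure-Python sketch. It assumes the linear interpolation x_t = t·noise + (1 − t)·actions, which this report does not state, and the function names are hypothetical. A real implementation would use batched JAX arrays rather than lists.

```python
import random

def flow_matching_pair(actions, noise, t):
    """Return (x_t, u_t) for one flattened action chunk.

    Assumption (not stated in the report): x_t interpolates linearly
    between actions (t=0) and noise (t=1).
    """
    x_t = [t * n + (1.0 - t) * a for a, n in zip(actions, noise)]
    u_t = [n - a for a, n in zip(actions, noise)]  # target velocity: noise - actions
    return x_t, u_t

def flow_matching_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return sum((v - u) ** 2 for v, u in zip(v_pred, u_t)) / len(u_t)

rng = random.Random(0)
actions = [0.1, -0.2, 0.3]
noise = [rng.gauss(0.0, 1.0) for _ in actions]
x_t, u_t = flow_matching_pair(actions, noise, t=0.25)

perfect = flow_matching_loss(u_t, u_t)           # a perfect predictor has zero loss
zero_pred = flow_matching_loss([0.0] * 3, u_t)   # predicting zero velocity does not
```

The sketch only fixes the sign convention (noise minus actions); it says nothing about how the action expert is conditioned on the image and language tokens.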

The LoRA floor (issue #5) — why this report exists

The first wrist-only batch adapted π0.5 with LoRA^[5] (~1.5% trainable params, frozen backbone). Both dropout schedules collapsed to a ~5% floor:

LoRA, agentview dropped 50% of the time (p=0.5): 4.4% (22/500)
LoRA, agentview dropped always (p=1.0): 5.0% (25/500, reproduced from a 4.6% r1)

Under view_dropout, a ~1.5%-trainable adapter on an agentview-pretrained backbone did not rewire onto a wrist-only distribution. Issue #5's TODO, "re-test the hypothesis out of the floor regime: unfrozen backbone", is the starting point for the full-FT baselines below. Reference anchors for the wrist-only task:

Both-camera ceiling: 96.6% — the full-obs π0.5 LoRA (PI05_LORA_r2), also the distillation teacher.
From-scratch wrist-only (diffusion policy): 50.8% (p=1.0) / 30.0% (p=0.5) — a small from-scratch policy trivially ignores a dead channel.

02 Hypothesis

Three falsifiable claims drive the full-FT batch:

H1 — capacity: the LoRA collapse is a frozen-backbone capacity artifact, not a wrist-only ceiling. Unfreezing the whole backbone (full-FT) should escape the ~5% floor.
H2 — mechanism: a from-scratch diffusion policy reaches 50.8% but full-FT π0.5 zero-mask only 27.4%. The gap is suspected to come from the zero-mask substitution — a pretrained image tower still encodes a dead black frame and the transformer still attends to it. Physically excluding the agentview tokens from attention should beat the zero-mask.
H3 — privileged signal: a teacher that sees both cameras can transfer its behavior to a wrist-only student at the flow-matching velocity level, lifting it toward the both-cam ceiling. Self-distillation (and a curriculum that gradually withdraws the exterior view) should beat plain behavior cloning under the same physical-removal condition.

Insight (confirmed): H1 and H2 both hold. Full-FT zero-mask = 27.4% (escapes the 5% LoRA floor → H1 ✓); full-FT physical removal = 94.2% (beats zero-mask by +66.8pp and from-scratch dp by +43.4pp → H2 ✓). The wrist-only "ceiling" was never a ceiling; it was a dead black frame the model was forced to attend to. H3 is the open question the issue-#10 ablation tests.

03 Approach

Shared recipe

All five baselines inherit the canonical upstream pi05_libero recipe (configs/experiment/pi05/libero_original.yaml): pi05_base loader, gemma_2b backbone + gemma_300m action expert, no LoRA (the freeze filter returns Nothing ⇒ all params trainable), model-side EMA 0.999, action_horizon=10, cosine LR with peak == decay 5e-5 (≈ constant after 10k warmup), 30k steps, seed 42. Upstream batch is 256 (multi-GPU FSDP); single-H200 runs override batch_size=32 (dry-run-confirmed: ~108 GB peak, no OOM). All run via the openpi trainer — the AiroPi 8-GPU FSDP backend has no view-masking support.
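Why "peak == decay" makes the schedule roughly constant after warmup can be checked with a small stand-alone sketch. The function below is a hypothetical re-implementation of a linear-warmup plus cosine-decay schedule, not the openpi code. The starting value `init_lr=0.0` is an assumption, since the recipe only gives the peak, the decay value, the warmup length, and the step count.

```python
import math

def warmup_cosine_lr(step, *, warmup_steps=10_000, total_steps=30_000,
                     peak_lr=5e-5, end_lr=5e-5, init_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay from peak_lr to end_lr.

    init_lr=0.0 is an assumption; the report does not give the start value.
    """
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    frac = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * frac))  # goes 1 -> 0 over the decay phase
    return end_lr + (peak_lr - end_lr) * cosine

# With end_lr == peak_lr the cosine term is multiplied by zero, so the rate is flat.
schedule = {s: warmup_cosine_lr(s) for s in (0, 5_000, 10_000, 20_000, 30_000)}
# For contrast, a schedule that really decays (illustrative end value).
decayed_end = warmup_cosine_lr(30_000, end_lr=5e-6)
```

Under these assumptions the learning rate sits at 5e-5 for the last 20k of the 30k steps, which is what "≈ constant after 10k warmup" describes.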

System diagram — where each mechanism is injected

flowchart LR
  DS["LeRobot LIBERO libero_spatial"]:::data
  DS --> RP["repack_transforms"]:::stage
  RP --> DTI["data_transforms.inputs after LiberoInputs"]:::stage
  DTI --> M["pi0.5 model forward : embed_prefix to make_attn_mask"]:::model
  M --> L["BC flow-matching loss : mean of (v - u_t) squared"]:::loss

  RP -. "M1 view_dropout zeros, TRAIN-only, _ViewDropoutRepack :219" .-> M
  DTI -. "M2 view_remove image_mask=False, TRAIN+EVAL, _ViewMaskOutTransform :274" .-> M
  TC["both-cam teacher self or checkpoint"]:::teacher -. "M3 distill in-loss mask, compute_distill_loss :208" .-> L

  classDef data fill:#E3DACC,stroke:#B85C3E,color:#141413;
  classDef stage fill:#F0EEE6,stroke:#87867F,color:#3D3D3A;
  classDef model fill:#FAF9F5,stroke:#D97757,color:#141413,stroke-width:2px;
  classDef loss fill:#FAF9F5,stroke:#788C5D,color:#141413,stroke-width:2px;
  classDef teacher fill:#F0EEE6,stroke:#788C5D,color:#3D3D3A;

Fig. 1 — Three mutually-exclusive masking seams. M1 (view_dropout) is repack-level ⇒ train-only. M2 (view_remove) is in data_transforms.inputs ⇒ applied at train and served eval. M3 (distill) leaves the data pipeline both-cam and masks inside the loss. Exclusivity is enforced loudly in run_openpi_train.py.
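The "loud" exclusivity check can be pictured with a short sketch. This is not the code in run_openpi_train.py: the function names are hypothetical, and the block name `distill` and the presence of an `enabled` key on every block are assumptions. The report only shows `enabled` for `view_remove`.

```python
def enabled_masking_blocks(cfg):
    """Return the names of the masking blocks that are switched on."""
    names = ("view_dropout", "view_remove", "distill")  # "distill" key name is assumed
    return [n for n in names if cfg.get(n, {}).get("enabled", False)]

def check_exclusive(cfg):
    """Raise instead of silently picking one when several blocks are enabled."""
    active = enabled_masking_blocks(cfg)
    if len(active) > 1:
        raise ValueError(f"masking blocks are mutually exclusive, got {active}")
    return active[0] if active else None

ok = check_exclusive({"view_remove": {"enabled": True, "train": {"agentview": 1.0}}})

try:
    check_exclusive({"view_dropout": {"enabled": True}, "view_remove": {"enabled": True}})
    clash = None
except ValueError as err:
    clash = str(err)
```

Failing at config-read time matters here because M1 and M2 act at different pipeline stages, so enabling both would otherwise change the experiment without any visible error.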

Mechanism diagram — how `image_mask=False` removes a camera

flowchart TD
  A["image_masks base_0_rgb = False (agentview)"]:::set --> B["embed_prefix pi0.py:106 : repeat mask over agentview tokens, input_mask = False"]:::stage
  B --> C["make_attn_mask pi0.py:19 : valid = input_mask outer-product"]:::stage
  C --> D["agentview rows AND cols zeroed in attention"]:::out
  D --> E["tokens excluded from attention = treated like padding"]:::win

  Z["view_dropout zeros : image_mask stays True"]:::bad -. "pixels=0 but tokens still VALID, model still attends" .-> Y["wasted capacity, OOD frame = 27.4%"]:::bad

  classDef set fill:#E3DACC,stroke:#B85C3E,color:#141413;
  classDef stage fill:#F0EEE6,stroke:#87867F,color:#3D3D3A;
  classDef out fill:#FAF9F5,stroke:#D97757,color:#141413;
  classDef win fill:#FAF9F5,stroke:#788C5D,color:#141413,stroke-width:2px;
  classDef bad fill:#FAF9F5,stroke:#B85C3E,color:#141413,stroke-dasharray:4 3;

Fig. 2 — Physical removal vs zero-mask. The outer product in make_attn_mask means any token with input_mask=False has its entire row and column zeroed: it can neither attend nor be attended to. This is openpi's own padding-image convention (the unused right_wrist_0_rgb slot), so num_images and the pi05_base weight contract stay intact.

make_attn_mask — the two lines that matter

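import jax.numpy as jnp

def make_attn_mask(input_mask, mask_ar):           # third_party/openpi/src/openpi/models/pi0.py:19
    mask_ar = jnp.broadcast_to(mask_ar, input_mask.shape)
    cumsum = jnp.cumsum(mask_ar, axis=1)
    # Query q may attend to key k only if k's cumulative block index does not exceed q's.
    attn_mask = cumsum[:, None, :] <= cumsum[:, :, None]
    # Outer product of validity flags: a False token loses its whole row and column.
    valid_mask = input_mask[:, None, :] * input_mask[:, :, None]   # ← False row/col => dropped
    return jnp.logical_and(attn_mask, valid_mask)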
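A toy, unbatched mirror of this function in plain Python shows the row-and-column effect directly. The helper names and token counts are illustrative assumptions (real image towers emit far more tokens per camera), and `repeat_image_mask` only imitates the role the report attributes to embed_prefix.

```python
from itertools import accumulate

def repeat_image_mask(image_mask, tokens_per_image):
    """Expand one per-image flag into per-token flags."""
    return [image_mask] * tokens_per_image

def make_attn_mask_1d(input_mask, mask_ar):
    """Unbatched mirror of make_attn_mask: entry [q][k] is True if q may attend to k."""
    cumsum = list(accumulate(int(a) for a in mask_ar))
    n = len(input_mask)
    return [
        [cumsum[k] <= cumsum[q] and input_mask[q] and input_mask[k] for k in range(n)]
        for q in range(n)
    ]

def prefix_mask(agentview_valid):
    """Toy prefix: 2 agentview tokens, 2 wrist tokens, 1 language token."""
    input_mask = (repeat_image_mask(agentview_valid, 2)
                  + repeat_image_mask(True, 2)
                  + [True])
    mask_ar = [0] * len(input_mask)   # all zeros: one fully bidirectional block
    return make_attn_mask_1d(input_mask, mask_ar)

zero_masked = prefix_mask(agentview_valid=True)   # black frame, tokens still valid
removed = prefix_mask(agentview_valid=False)      # image_mask=False
```

In `zero_masked` every token attends to every other, including the two agentview tokens that carry a black frame. In `removed`, rows 0 to 1 and columns 0 to 1 are entirely False, while the wrist and language tokens still attend to each other.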

The four masking mechanisms, contrasted

Mechanism	Pixels	image_mask	Attention	Injection seam	Phase
view_dropout (zeros)	zeroed	True	tokens still attended (black frame)	`repack_transforms`	train-only
view_remove (physical)	zeroed (cosmetic)	False	tokens excluded entirely	`data_transforms.inputs`	train + eval
distill in-loss mask	both-cam batch	False on student forward	student excludes agentview; teacher keeps it	inside `compute_distill_loss`	train (eval via view_remove)
curriculum	both-cam batch	False with prob `p(step)`	ramps 0→1 over first 75% then holds	`curriculum_p` × in-loss mask	train
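The first two rows differ only in what happens to the mask flag, which a small sketch on a toy observation makes explicit. The dict layout (`image` and `image_mask` keyed by openpi camera name) and the helper names are assumptions for illustration. They are not the signatures of apply_view_dropout or apply_view_image_mask.

```python
import copy

AGENTVIEW_KEY = "base_0_rgb"   # openpi key for the agentview camera, per the report

def zero_mask_view(obs, key=AGENTVIEW_KEY):
    """view_dropout-style: black out the pixels, leave image_mask True."""
    out = copy.deepcopy(obs)
    out["image"][key] = [[0 for _ in row] for row in out["image"][key]]
    return out

def remove_view(obs, key=AGENTVIEW_KEY):
    """view_remove-style: flag the camera invalid so its tokens leave attention.

    Pixels are zeroed as well, but only cosmetically; the mask does the work.
    """
    out = zero_mask_view(obs, key)
    out["image_mask"][key] = False
    return out

obs = {
    "image": {"base_0_rgb": [[12, 34], [56, 78]], "left_wrist_0_rgb": [[1, 2], [3, 4]]},
    "image_mask": {"base_0_rgb": True, "left_wrist_0_rgb": True},
}
b1_style = zero_mask_view(obs)   # what B1 trains on
b2_style = remove_view(obs)      # what B2 trains and evaluates on
```

Both outputs contain the same black agentview frame. Only `b2_style` tells the model to treat it as padding.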

Privileged distillation loss

The self-distillation variants add a velocity-matching term against a stop-gradient teacher that sees both cameras, plus a both-cam BC anchor that keeps the teacher pathway alive:

loss = BC(v_student_wrist, u_t)                          # always coefficient 1.0
     + lambda * || v_student_wrist - sg(v_teacher_bothcam) ||^2   # velocity distillation
     + anchor_frac * BC(v_teacher_bothcam, u_t)           # both-cam anchor

curriculum:  p(step) = 0.5 * (1 - cos(pi * progress)),  progress = clip(step / (0.75*N), 0, 1)

The teacher shares the student's graphdef + params (mode=self, live, ema_decay=null); the only difference between the two forwards is the input observation. This adds activations but no extra parameter memory.
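How the three terms combine can be sketched on plain lists of velocities. This is an illustrative stand-in for compute_distill_loss, not its implementation: the function names are hypothetical, and using a mean rather than a sum for the squared norm is an assumption.

```python
def mse(a, b):
    """Mean squared difference between two equal-length velocity vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(v_student_wrist, v_teacher_bothcam, u_t, lam=1.0, anchor_frac=0.15):
    """BC on the wrist-only forward + velocity matching + both-cam anchor.

    In an autodiff framework the teacher velocity inside the matching term
    would be wrapped in a stop-gradient. Plain floats carry no gradient,
    so that is only noted here.
    """
    bc_student = mse(v_student_wrist, u_t)
    match = mse(v_student_wrist, v_teacher_bothcam)   # teacher treated as a constant
    bc_anchor = mse(v_teacher_bothcam, u_t)
    total = bc_student + lam * match + anchor_frac * bc_anchor
    return total, {"bc": bc_student, "match": match, "anchor": bc_anchor}

u_t = [1.0, -1.0]
total, parts = distill_loss([0.5, -1.0], [1.0, -0.5], u_t)

# lam=0 and anchor_frac=0 reduce the objective to plain behavior cloning.
bc_only, _ = distill_loss([0.5, -1.0], [1.0, -0.5], u_t, lam=0.0, anchor_frac=0.0)
```

Because teacher and student share parameters in mode=self, the anchor term is the only place where the both-camera forward receives a gradient in this formulation.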

04 The five full-FT baselines

Each card: how it is trained (config, knobs, slurm) and how it is implemented (the line of code that does the masking).

B1 · view_dropout (zeros), p=1.0 · 27.4% (137/500)

Full-FT control. agentview RGB is replaced by a zeros tensor while its image_mask stays True, so the image tower encodes a black frame and the transformer still attends to it. The injection is repack-level (dataset-only ⇒ train-only); standard eval is already effectively wrist-only because the trained model ignores the dead view.

config: configs/experiment/pi05/libero_wristonly_full.yaml (view_dropout.train.agentview=1.0)
train: slurm 3067991 · outputs/20260521_165625_PI05_FULL_WRISTONLY_FULL_r1_j3067991
eval: n=500 · outputs/eval_logs/PI05_FULL_WRISTONLY_FULL_EVAL_FULL
impl: _ViewDropoutRepack (run_openpi_train.py:219) → apply_view_dropout (view_mask.py:209); block read at run_openpi_train.py:837

B2 · view_remove (physical), p=1.0 · 94.2% (471/500)

Matched control for B1: the recipe is byte-identical, and the only changed variable is the drop mechanism. _ViewMaskOutTransform flips image_mask['base_0_rgb']=False so the agentview tokens are excluded from make_attn_mask entirely (Fig. 2). Because it lives in data_transforms.inputs (run at both the data loader and the served Policy), the same physical removal applies at eval with no client-side hook.

config: configs/experiment/pi05/libero_wristonly_physdrop_full.yaml (view_remove.{train,eval}.agentview=1.0)
train: slurm 3071544 · outputs/20260523_024012_PI05_FULL_WRISTONLY_PHYSDROP_r1_j3071544
eval: n=500 · outputs/eval_logs/PI05_FULL_WRISTONLY_PHYSDROP_EVAL_FULL — removal verified in server.log: eval physical view-removal {agentview=1}
impl: _ViewMaskOutTransform (run_openpi_train.py:274) + LeRobotLiberoViewRemoveDataConfig (:328) → apply_view_image_mask (view_mask.py:257); block read at :874

Watch the partial SR: The LoRA floor eval read 15.2% at episode 164 but ended at 5.0% (front-loaded easy tasks). Always wait for the full-500 number, and confirm wrist-only evals via the server.log "eval physical view-removal" line; a suspiciously high SR usually means the view was not actually removed. B2's 94.2% passed both checks.

B3 · curriculum-only · eval pending

Issue-#10 ablation 1/3. The student's agentview mask probability ramps cosine 0→1 over the first 75% of steps then holds at 1.0; λ=0 (no distillation), anchor_frac=0. Isolates whether gradually withdrawing the privileged view (vs withdrawing it from step 0, as B2 does) helps the full-FT model land a better wrist-only policy. No data-level transform — the student masks agentview in-loss; the teacher forward runs but contributes nothing.

config: configs/experiment/pi05/distill/curriculum_only.yaml
train: slurm 3071609 (RUNNING) → eval afterok 3071617 (n=50)
impl: curriculum_p (distill.py:165) × _mask_view_per_sample inside compute_distill_loss (distill.py:208); built by _build_distill (config_factory.py:190)
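The ramp and the per-sample masking draw can be sketched as follows. Both functions are hypothetical stand-ins written from the formula in this report, not the code at distill.py:165. The Bernoulli draw with Python's `random` module is an assumption about how a per-sample mask might be sampled.

```python
import math
import random

def curriculum_p_sketch(step, total_steps, curriculum_frac=0.75):
    """Cosine ramp 0 -> 1 over the first curriculum_frac of training, then hold at 1."""
    progress = min(1.0, max(0.0, step / (curriculum_frac * total_steps)))
    return 0.5 * (1.0 - math.cos(math.pi * progress))

def draw_student_masks(batch_size, p, rng):
    """Per-sample Bernoulli(p) draw: True means 'mask agentview for this sample'."""
    return [rng.random() < p for _ in range(batch_size)]

N = 30_000
ramp = {s: curriculum_p_sketch(s, N) for s in (0, 11_250, 22_500, 30_000)}

masks_early = draw_student_masks(32, curriculum_p_sketch(0, N), random.Random(42))
masks_late = draw_student_masks(32, curriculum_p_sketch(N, N), random.Random(42))
```

At step 0 no sample is masked, halfway through the ramp (step 11,250) about half are, and from step 22,500 onward every sample is wrist-only, which matches B2's condition for the final quarter of training.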

B4 · self-distill only · eval pending

Issue-#10 ablation 2/3. Student is always wrist-only (curriculum off, p=1.0 constant); a live self-teacher (the same model M, both cameras, stop-gradient) supplies the velocity-matching signal at λ=1.0 with an anchor_frac=0.15 both-cam anchor. Isolates the distillation term without the curriculum: does matching M's both-cam velocities lift the wrist-only student above the plain physical-removal floor (B2)?

config: configs/experiment/pi05/distill/selfdistill_only.yaml
train: slurm 3071612 (RUNNING) → eval afterok 3071618 (n=50)
impl: live teacher: mode=self, ema_decay=null in compute_distill_loss (distill.py:208); checkpoint path uses load_teacher_state (distill.py:329)

B5 · curriculum + self-distill · eval pending · headline run

Issue-#10 ablation 3/3 — the full method. Curriculum-masked student (ramp 0→1@75%) + live both-cam self-teacher (λ=1.0, anchor_frac=0.15). Early in training (small p) the student mostly matches a both-cam M; as p→1 it specialises wrist-only while still matching the privileged teacher's velocities. Tests whether the two mechanisms compound beyond each alone (B3, B4) and beyond the physical-removal floor (B2).

config: configs/experiment/pi05/distill/curric_selfdistill.yaml
train: slurm 3071611 (RUNNING) → eval afterok 3071619 (n=50)
impl: curriculum + distill both active in compute_distill_loss (distill.py:208); eval served wrist-only via VLA_ZOO_VIEW_REMOVE_EVAL=agentview (serve_pi0_libero.py:132/167)

05 Results

LIBERO-Spatial, agentview removed at eval, full 500-episode suite unless noted. Bars are scaled to the 96.6% both-cam ceiling. Reference rows are not full-FT baselines — they frame the result.

Variant	Backbone	Mechanism	SR	Eval evidence
B2 physdrop p=1.0	full-FT	view_remove (physical)	94.2%	PI05_FULL_WRISTONLY_PHYSDROP_EVAL_FULL
B1 zero-mask p=1.0	full-FT	view_dropout (zeros)	27.4%	PI05_FULL_WRISTONLY_FULL_EVAL_FULL
B3 curriculum-only	full-FT	curriculum (in-loss)	pending	slurm 3071609 → 3071617
B4 self-distill only	full-FT	distill λ=1.0	pending	slurm 3071612 → 3071618
B5 curric + self-distill	full-FT	distill + curriculum	pending	slurm 3071611 → 3071619
both-cam ceiling (ref)	LoRA, full-obs	none	96.6%	PI05_LORA_r2 (teacher)
LoRA distill (ref)	LoRA	privileged distill (2-stage)	91.0%	PI05_LORA_WRISTONLY_DISTILL_EVAL_FULL
dp wrist-only (ref)	from-scratch	view_dropout p=1.0	50.8%	DP_WRISTONLY_FULL
LoRA floor (ref)	LoRA (frozen)	view_dropout p=1.0	5.0%	PI05_LORA_WRISTONLY_FULL_R2_EVAL_FULL
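The percentage-point gaps quoted in this report follow from the episode counts. A short check, using only the counts and percentages stated here (the helper name is hypothetical):

```python
def success_rate(successes, episodes):
    """Success rate in percent, rounded to one decimal like the table."""
    return round(100.0 * successes / episodes, 1)

b2 = success_rate(471, 500)          # full-FT, physical removal
b1 = success_rate(137, 500)          # full-FT, zero-mask
lora_floor = success_rate(25, 500)   # frozen LoRA, view_dropout p=1.0

# Reference rows are given only as percentages, so they enter as literals.
gaps_pp = {
    "B2 - B1": round(b2 - b1, 1),
    "B2 - dp wrist-only": round(b2 - 50.8, 1),
    "both-cam ceiling - B2": round(96.6 - b2, 1),
    "B1 - LoRA floor": round(b1 - lora_floor, 1),
}
```

The mechanism change (B2 vs B1) is worth 66.8pp, while unfreezing the backbone under zero-masking (B1 vs the LoRA floor) is worth 22.4pp. B2 ends 2.4pp short of the both-camera ceiling.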

06 Discussion

Once the backbone is unfrozen, the dominant variable in wrist-only π0.5 is how the exterior view is removed. Zero-masking keeps the agentview tokens "valid", so a pretrained image tower spends capacity encoding a black frame and the transformer is obligated to attend to an out-of-distribution dead view. That drags full-FT down to 27.4%, below even a from-scratch diffusion policy (50.8%). Physically excluding the tokens from attention (B2) restores near-ceiling performance (94.2%). The frozen-LoRA adapter under view_dropout (5.0%) sits below both, consistent with it lacking the capacity to adapt around the dead view.

Insight: Zero-masking a camera (pixels=0, mask=True) is not equivalent to removing it. For a model with a pretrained image tower it was far worse than physical token removal in this batch: the 66.8pp gap (27.4% vs 94.2%) under an otherwise byte-identical full-FT recipe is the cleanest evidence here. Prefer image_mask=False for any "ablate a modality" experiment on π0/π0.5.

Limitation: Single benchmark and suite (LIBERO-Spatial), single seed (42). B3–B5 evals are still pending, so H3 (distillation beats plain physical removal) is unconfirmed here. The 91.0% LoRA-distill reference and the 94.2% full-FT physdrop are close enough that distillation's marginal value on top of physical removal at full-FT capacity is not yet established.

Falsifier: If B4 (self-distill only, physical-removal eval) lands at or below B2's 94.2%, then distillation adds nothing once the view is physically removed at full-FT capacity. H3 would be falsified and the headline reduces to "use image_mask=False, full-FT." Conversely, if B3 (curriculum-only, no teacher) already matches B5, the privileged signal is redundant and only the curriculum schedule matters.

Action plan: Wait for the three issue-#10 evals (afterok 3071617/18/19, n=50) to populate the B3–B5 rows, then promote the winning configuration to a full n=500 eval. Owner: tohkawa25. The ablation separates curriculum (B3), distillation (B4), and both (B5) against the B2 physical-removal floor. After that, re-run B2's recipe at view_remove p=0.5 to test the FULL-vs-P05 direction (p=1.0 vs p=0.5 removal) at full-FT capacity, which is still untested on π0.5.
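One reason to promote the winner to n=500 is the sampling uncertainty of an n=50 eval. The sketch below uses the standard Wilson score interval for a binomial proportion. It treats episodes as independent, which is an assumption (episodes cluster by task), and the 47/50 outcome is a hypothetical example, not a result.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

b2_lo, b2_hi = wilson_interval(471, 500)      # B2 at n=500

# Hypothetical n=50 outcome at a similar rate (47/50 = 94%), for scale only.
small_lo, small_hi = wilson_interval(47, 50)

width_500 = b2_hi - b2_lo
width_50 = small_hi - small_lo
```

Under these assumptions the n=50 interval is several times wider than the n=500 one, so a B3 to B5 reading within a few points of 94.2% would not by itself separate a variant from B2.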

07 Implementation details — code anchors

Every masking mechanism in this report, and the line of code where it lives.

Module	file:line	Role
`OPENPI_IMAGE_KEY_MAP`	`src/vla_zoo/data/view_mask.py:88`	canonical `agentview` → openpi `base_0_rgb`, `eye_in_hand` → `left_wrist_0_rgb`
`apply_view_dropout`	`src/vla_zoo/data/view_mask.py:209`	zeros the RGB, mask stays True (blake2b-reproducible per-sample draw)
`apply_view_image_mask`	`src/vla_zoo/data/view_mask.py:257`	sets `image_masks[key]=np.False_` — physical removal
`_ViewDropoutRepack`	`scripts/run_openpi_train.py:219`	repack-level transform (train-only) for view_dropout
`_ViewMaskOutTransform`	`scripts/run_openpi_train.py:274`	`data_transforms.inputs` transform (train+eval) for view_remove
`LeRobotLiberoViewRemoveDataConfig`	`scripts/run_openpi_train.py:328`	data-config subclass that injects M2 after `LiberoInputs`
block readers (vd / vr / distill)	`run_openpi_train.py:837 / 874 / 918`	read root cfg blocks; enforce mutual exclusivity (loud)
`embed_prefix`	`third_party/openpi/.../models/pi0.py:106`	repeats `image_masks[name]` over image tokens → `input_mask`
`make_attn_mask`	`third_party/openpi/.../models/pi0.py:19`	outer product drops False rows/cols from attention
`compute_distill_loss`	`third_party/openpi/.../training/distill.py:208`	BC + λ·velocity-match + anchor; per-sample student mask
`curriculum_p`	`third_party/openpi/.../training/distill.py:165`	cosine ramp 0→1 over first `curriculum_frac` of steps
`load_teacher_state`	`third_party/openpi/.../training/distill.py:329`	loads + freezes a checkpoint teacher (mode=checkpoint)
`_build_distill`	`src/vla_zoo/openpi/config_factory.py:190`	builds `DistillConfig`, fails loud on missing keys (no silent defaults)
eval view-removal hook	`scripts/serve_pi0_libero.py:132 / 167`	`VLA_ZOO_VIEW_REMOVE_EVAL` env → server-side physical removal + log line
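The last row, the env-driven eval hook, can be pictured with a short sketch. The key mapping is the one listed for `OPENPI_IMAGE_KEY_MAP` in the table. The function names are hypothetical, and the comma-separated parsing is an assumption, since the report only shows the single-view value `agentview`. This is not the code in serve_pi0_libero.py.

```python
OPENPI_IMAGE_KEY_MAP = {           # canonical view name -> openpi image key
    "agentview": "base_0_rgb",
    "eye_in_hand": "left_wrist_0_rgb",
}

def views_to_remove(env):
    """Parse VLA_ZOO_VIEW_REMOVE_EVAL into openpi image keys.

    Assumption: the value is a comma-separated list of canonical view names.
    """
    raw = env.get("VLA_ZOO_VIEW_REMOVE_EVAL", "")
    names = [n.strip() for n in raw.split(",") if n.strip()]
    unknown = [n for n in names if n not in OPENPI_IMAGE_KEY_MAP]
    if unknown:
        raise ValueError(f"unknown view name(s): {unknown}")
    return [OPENPI_IMAGE_KEY_MAP[n] for n in names]

def apply_eval_removal(image_mask, env):
    """Return a copy of image_mask with the requested views set to False."""
    out = dict(image_mask)
    for key in views_to_remove(env):
        out[key] = False
    return out

served = apply_eval_removal(
    {"base_0_rgb": True, "left_wrist_0_rgb": True},
    {"VLA_ZOO_VIEW_REMOVE_EVAL": "agentview"},
)
```

Rejecting unknown names rather than ignoring them follows the same fail-loud stance the table attributes to `_build_distill`: a typo in the env var would otherwise produce a both-camera eval that looks like a wrist-only one.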

Example: physical-removal config (B2)

_base: libero_original.yaml
wandb_run_name: pi05_wristonly_physdrop_full

# PHYSICAL camera removal (image_mask=False) - distinct from view_dropout zeros.
view_remove:
  enabled: true
  seed: 42
  train: { agentview: 1.0 }   # exclude third-person view from attention in training
  eval:  { agentview: 1.0 }   # same physical removal at served eval

Reproduce

# Train (single H200, openpi trainer)
uv run python scripts/train.py \
  --config configs/experiment/pi05/libero_wristonly_physdrop_full.yaml \
  batch_size=32

# Eval n=500, server-side physical removal
PI0_CKPT_DIR=<orbax .../29999> OPENPI_CONFIG=pi05_libero NUM_TRIALS=50 \
  VLA_ZOO_VIEW_REMOVE_EVAL=agentview \
  LOG_DIR=outputs/eval_logs/PI05_FULL_WRISTONLY_PHYSDROP_EVAL_FULL \
  bash scripts_sh/eval_pi0_trained.sh libero_spatial

08 References

Papers

[1] Black et al. π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence, 2024. arXiv:2410.24164
[2] Physical Intelligence. π0.5: a VLA with Open-World Generalization. 2025. arXiv:2504.16054
[3] Liu et al. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. NeurIPS 2023. arXiv:2306.03310
[4] Lipman et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
[5] Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
[6] Chen et al. Learning by Cheating (privileged teacher → sensorimotor student). CoRL 2019. arXiv:1912.12294
[7] Rusu et al. Policy Distillation. ICLR 2016. arXiv:1511.06295

Citation note: The privileged-distillation design (a teacher with extra observations supervising a deployable student) follows the "learning by cheating" line [6] adapted to flow-matching velocities; the velocity-matching term is policy distillation [7] at the denoiser output rather than at action logits. arXiv IDs should be verified before lifting into a paper.

Internal

GitHub issues #5 (wrist-only floor re-test) · #10 (curriculum + self-distillation ablation)
Configs: configs/experiment/pi05/libero_wristonly_full.yaml, …/libero_wristonly_physdrop_full.yaml, …/distill/{curriculum_only,selfdistill_only,curric_selfdistill}.yaml
Job ledger: configs/jobs.yaml, configs/jobs_distill.yaml
Eval logs: outputs/eval_logs/PI05_FULL_WRISTONLY_*_EVAL_FULL/
