π0.5 wrist-only: full-FT baselines on LIBERO

doc-type: report · status: ready · updated: 2026-05-23 · owner: tohkawa25 · branch: dev/distill @ 66a1233 · issues: #5, #10

00 Summary

A fully fine-tuned π0.5 reaches 94.2% wrist-only success on LIBERO-Spatial when the third-person camera's tokens are excluded from attention (physical removal), 2.4pp below the 96.6% both-camera ceiling. The identical recipe fed a black (zeroed) frame instead collapses to 27.4%.

The headline finding is that the apparent difficulty of wrist-only π0.5 was largely an artifact of the masking mechanism, not a capacity ceiling. Frozen-backbone LoRA collapses to a ~5% behavior-cloning floor regardless of dropout schedule (issue #5); unfreezing the backbone (full-FT) is necessary but not sufficient — how the view is removed matters as much as whether the backbone is trainable. This report covers the five full-FT wrist-only baselines (two evaluated, three in-flight self-distillation ablations under issue #10), how each is trained and wired into the openpi pipeline, and the line of code where every mechanism lives.

94.2% · full-FT, physical removal (471/500)
27.4% · full-FT, zero-mask (137/500)
96.6% · both-cam ceiling (ref)
5.0% · frozen LoRA floor (ref)

01 Background

Wrist-only deployment

LIBERO[3] exposes two cameras: a fixed third-person agentview and a wrist-mounted eye_in_hand. Many real deployments lack a reliable exterior view (occlusion, mounting cost, calibration drift), so a policy that runs wrist-only is operationally valuable. The question: can a model pretrained with both cameras be adapted to ignore the exterior view without collapsing?

π0.5, in one paragraph

π0.5[2] (a successor to π0[1]) is a vision-language-action flow-matching model: a PaliGemma-style backbone (gemma_2b) encodes images + language, and a smaller action expert (gemma_300m) denoises an action chunk via conditional flow matching[4]. Each camera is tokenized into image tokens that the transformer attends over jointly with the language and action tokens. The training target is the flow-matching velocity u_t = noise − actions; the loss is mean((v_θ − u_t)²).
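
To make the training target concrete, here is a minimal pure-Python sketch of the flow-matching loss for one flattened action chunk. The target u_t = noise − actions and the mean-squared loss come from the paragraph above. The linear path x_t = t·noise + (1 − t)·actions and all helper names are assumptions for illustration, not the openpi implementation.

```python
import random

def flow_matching_target(actions, noise):
    """Velocity target u_t = noise - actions, elementwise over a flat action chunk."""
    return [n - a for n, a in zip(noise, actions)]

def noisy_actions(actions, noise, t):
    """Assumed linear path: x_t = t * noise + (1 - t) * actions."""
    return [t * n + (1.0 - t) * a for n, a in zip(noise, actions)]

def flow_matching_loss(v_pred, u_t):
    """mean((v_theta - u_t)^2) over all action dimensions."""
    return sum((v - u) ** 2 for v, u in zip(v_pred, u_t)) / len(u_t)

rng = random.Random(0)
actions = [0.1, -0.2, 0.3, 0.0]
noise = [rng.gauss(0.0, 1.0) for _ in actions]

u_t = flow_matching_target(actions, noise)
x_t = noisy_actions(actions, noise, t=0.5)   # what the action expert would be fed

perfect_loss = flow_matching_loss(u_t, u_t)               # a perfect velocity predictor
zero_loss = flow_matching_loss([0.0] * len(u_t), u_t)     # always predicting zero velocity
```

A perfect predictor drives the loss to zero; any other prediction is penalized by its squared distance from the straight-line velocity.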

The LoRA floor (issue #5) — why this report exists

The first wrist-only batch adapted π0.5 with LoRA[5] (~1.5% trainable params, frozen backbone). Both dropout schedules collapsed to a ~5% floor (5.0% at view_dropout p=1.0; see the LoRA floor row in Results).

A ~1.5%-trainable adapter on an agentview-pretrained backbone cannot rewire onto a wrist-only distribution. Issue #5's TODO ("re-test the hypothesis out of the floor regime: unfrozen backbone") is the starting point for the full-FT baselines below. Reference anchors for the wrist-only task appear as the (ref) rows of the Results table.

02 Hypothesis

Three falsifiable claims drive the full-FT batch:

  1. H1 — capacity: the LoRA collapse is a frozen-backbone capacity artifact, not a wrist-only ceiling. Unfreezing the whole backbone (full-FT) should escape the ~5% floor.
  2. H2 — mechanism: a from-scratch diffusion policy reaches 50.8% but full-FT π0.5 zero-mask only 27.4%. The gap is suspected to come from the zero-mask substitution — a pretrained image tower still encodes a dead black frame and the transformer still attends to it. Physically excluding the agentview tokens from attention should beat the zero-mask.
  3. H3 — privileged signal: a teacher that sees both cameras can transfer its behavior to a wrist-only student at the flow-matching velocity level, lifting it toward the both-cam ceiling. Self-distillation (and a curriculum that gradually withdraws the exterior view) should beat plain behavior cloning under the same physical-removal condition.
Insight (confirmed): H1 and H2 both hold. Full-FT zero-mask = 27.4% (escapes the 5% LoRA floor → H1 ✓); full-FT physical removal = 94.2% (beats zero-mask by +66.8pp and from-scratch dp by +43.4pp → H2 ✓). The wrist-only "ceiling" was never a ceiling; it was a dead black frame the model was forced to attend to. H3 is the open question the issue-#10 ablation tests.

03 Approach

Shared recipe

All five baselines inherit the canonical upstream pi05_libero recipe (configs/experiment/pi05/libero_original.yaml): pi05_base loader, gemma_2b backbone + gemma_300m action expert, no LoRA (the freeze filter returns Nothing ⇒ all params trainable), model-side EMA 0.999, action_horizon=10, cosine LR with peak == decay 5e-5 (≈ constant after 10k warmup), 30k steps, seed 42. Upstream batch is 256 (multi-GPU FSDP); single-H200 runs override batch_size=32 (dry-run-confirmed: ~108 GB peak, no OOM). All run via the openpi trainer — the AiroPi 8-GPU FSDP backend has no view-masking support.
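
The remark that the cosine schedule is "≈ constant after 10k warmup" follows from setting the peak and the decay target to the same value. The sketch below is a generic warmup-plus-cosine schedule in pure Python, not the openpi or optax code; `init_lr=0.0` and `decay_steps=30_000` are assumptions chosen for illustration.

```python
import math

def warmup_cosine_lr(step, peak_lr=5e-5, end_lr=5e-5,
                     warmup_steps=10_000, decay_steps=30_000, init_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay from peak_lr down to end_lr."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    progress = min((step - warmup_steps) / (decay_steps - warmup_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # When peak_lr == end_lr the cosine term is multiplied by zero.
    return end_lr + (peak_lr - end_lr) * cosine

lrs_after_warmup = [warmup_cosine_lr(s) for s in (10_000, 20_000, 29_999)]
lr_mid_warmup = warmup_cosine_lr(5_000)
```

With peak == decay the cosine factor has nothing to interpolate between, so every step after warmup sees 5e-5 regardless of how long the decay horizon is.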

System diagram — where each mechanism is injected

flowchart LR
  DS["LeRobot LIBERO libero_spatial"]:::data
  DS --> RP["repack_transforms"]:::stage
  RP --> DTI["data_transforms.inputs after LiberoInputs"]:::stage
  DTI --> M["pi0.5 model forward : embed_prefix to make_attn_mask"]:::model
  M --> L["BC flow-matching loss : mean of (v - u_t) squared"]:::loss

  RP -. "M1 view_dropout zeros, TRAIN-only, _ViewDropoutRepack :219" .-> M
  DTI -. "M2 view_remove image_mask=False, TRAIN+EVAL, _ViewMaskOutTransform :274" .-> M
  TC["both-cam teacher self or checkpoint"]:::teacher -. "M3 distill in-loss mask, compute_distill_loss :208" .-> L

  classDef data fill:#E3DACC,stroke:#B85C3E,color:#141413;
  classDef stage fill:#F0EEE6,stroke:#87867F,color:#3D3D3A;
  classDef model fill:#FAF9F5,stroke:#D97757,color:#141413,stroke-width:2px;
  classDef loss fill:#FAF9F5,stroke:#788C5D,color:#141413,stroke-width:2px;
  classDef teacher fill:#F0EEE6,stroke:#788C5D,color:#3D3D3A;
  

Fig. 1 — Three mutually-exclusive masking seams. M1 (view_dropout) is repack-level ⇒ train-only. M2 (view_remove) is in data_transforms.inputs ⇒ applied at train and served eval. M3 (distill) leaves the data pipeline both-cam and masks inside the loss. Exclusivity is enforced loudly in run_openpi_train.py.
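
A loud mutual-exclusivity check of the kind described in the caption could look like the sketch below. It is hypothetical: the function names are invented for illustration, and the `enabled` key is only confirmed for the `view_remove` block in the B2 config, so its presence on `view_dropout` and `distill` is an assumption.

```python
def active_masking_blocks(cfg):
    """Return the names of the masking blocks enabled in a root config dict."""
    names = ("view_dropout", "view_remove", "distill")
    return [n for n in names if cfg.get(n, {}).get("enabled", False)]

def check_mutually_exclusive(cfg):
    """Raise instead of silently picking one seam when several are enabled."""
    active = active_masking_blocks(cfg)
    if len(active) > 1:
        raise ValueError(f"masking blocks are mutually exclusive, got: {active}")
    return active[0] if active else None

b2_like = {"view_remove": {"enabled": True, "train": {"agentview": 1.0}}}
selected = check_mutually_exclusive(b2_like)

conflicting = {"view_remove": {"enabled": True}, "distill": {"enabled": True}}
try:
    check_mutually_exclusive(conflicting)
    conflict_raised = False
except ValueError:
    conflict_raised = True
```

Failing at config-read time keeps a run from training with two overlapping masks whose effects could not be separated afterwards.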

Mechanism diagram — how image_mask=False removes a camera

flowchart TD
  A["image_masks base_0_rgb = False (agentview)"]:::set --> B["embed_prefix pi0.py:106 : repeat mask over agentview tokens, input_mask = False"]:::stage
  B --> C["make_attn_mask pi0.py:19 : valid = input_mask outer-product"]:::stage
  C --> D["agentview rows AND cols zeroed in attention"]:::out
  D --> E["tokens excluded from attention = treated like padding"]:::win

  Z["view_dropout zeros : image_mask stays True"]:::bad -. "pixels=0 but tokens still VALID, model still attends" .-> Y["wasted capacity, OOD frame = 27.4%"]:::bad

  classDef set fill:#E3DACC,stroke:#B85C3E,color:#141413;
  classDef stage fill:#F0EEE6,stroke:#87867F,color:#3D3D3A;
  classDef out fill:#FAF9F5,stroke:#D97757,color:#141413;
  classDef win fill:#FAF9F5,stroke:#788C5D,color:#141413,stroke-width:2px;
  classDef bad fill:#FAF9F5,stroke:#B85C3E,color:#141413,stroke-dasharray:4 3;
  

Fig. 2 — Physical removal vs zero-mask. The outer product in make_attn_mask means any token with input_mask=False has its entire row and column zeroed: it can neither attend nor be attended to. This is openpi's own padding-image convention (the unused right_wrist_0_rgb slot), so num_images and the pi05_base weight contract stay intact.

make_attn_mask — the two lines that matter

def make_attn_mask(input_mask, mask_ar):           # third_party/openpi/src/openpi/models/pi0.py:19
    mask_ar = jnp.broadcast_to(mask_ar, input_mask.shape)
    cumsum = jnp.cumsum(mask_ar, axis=1)
    attn_mask = cumsum[:, None, :] <= cumsum[:, :, None]
    valid_mask = input_mask[:, None, :] * input_mask[:, :, None]   # ← False row/col => dropped
    return jnp.logical_and(attn_mask, valid_mask)
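
The effect of the `valid_mask` line is easiest to see on a toy sequence. The following is a pure-Python, single-sequence analogue of the function above (no batch axis, no JAX), written only for illustration; the five-token layout is hypothetical.

```python
from itertools import accumulate

def make_attn_mask_1d(input_mask, mask_ar):
    """Single-sequence analogue of make_attn_mask using nested lists."""
    cumsum = list(accumulate(int(a) for a in mask_ar))
    n = len(input_mask)
    return [
        [
            # Query i may see key j if j is not in a later block
            # AND both tokens are marked valid in input_mask.
            cumsum[j] <= cumsum[i] and bool(input_mask[i]) and bool(input_mask[j])
            for j in range(n)
        ]
        for i in range(n)
    ]

# Hypothetical 5-token prefix: [agentview, agentview, wrist, wrist, language]
mask_ar = [0, 0, 0, 0, 0]                                  # one bidirectional block
both_cam = make_attn_mask_1d([1, 1, 1, 1, 1], mask_ar)
wrist_only = make_attn_mask_1d([0, 0, 1, 1, 1], mask_ar)   # image_mask=False on agentview
```

In `wrist_only`, rows 0 and 1 and columns 0 and 1 are all False, while the 3×3 block over the wrist and language tokens is unchanged from `both_cam`.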

The four masking mechanisms, contrasted

| Mechanism | Pixels | image_mask | Attention | Injection seam | Phase |
|---|---|---|---|---|---|
| view_dropout (zeros) | zeroed | True | tokens still attended (black frame) | repack_transforms | train-only |
| view_remove (physical) | zeroed (cosmetic) | False | tokens excluded entirely | data_transforms.inputs | train + eval |
| distill in-loss mask | both-cam batch | False on student forward | student excludes agentview; teacher keeps it | inside compute_distill_loss | train (eval via view_remove) |
| curriculum | both-cam batch | False with prob p(step) | ramps 0→1 over first 75% then holds | curriculum_p × in-loss mask | train |
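
At the observation level, the first two mechanisms differ by a single boolean. The sketch below contrasts them on a toy observation; the dict layout (`image` / `image_mask`) and both function names are assumptions for illustration, not the signatures of `apply_view_dropout` or `apply_view_image_mask`.

```python
import copy

AGENTVIEW_KEY = "base_0_rgb"  # openpi slot the report maps agentview onto

def zero_mask_view(obs, key=AGENTVIEW_KEY):
    """view_dropout-style: black out the pixels, leave image_mask True."""
    out = copy.deepcopy(obs)
    out["image"][key] = [[0 for _ in row] for row in out["image"][key]]
    return out

def remove_view(obs, key=AGENTVIEW_KEY):
    """view_remove-style: black out pixels (cosmetic) and set image_mask False."""
    out = zero_mask_view(obs, key)
    out["image_mask"][key] = False
    return out

obs = {
    "image": {"base_0_rgb": [[7, 7], [7, 7]], "left_wrist_0_rgb": [[3, 3], [3, 3]]},
    "image_mask": {"base_0_rgb": True, "left_wrist_0_rgb": True},
}
zeroed = zero_mask_view(obs)
removed = remove_view(obs)
```

Both outputs carry identical pixels; only `removed` tells the model that the agentview tokens are padding, which is the difference the attention mask acts on.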

Privileged distillation loss

The self-distillation variants add a velocity-matching term against a stop-gradient teacher that sees both cameras, plus a both-cam BC anchor that keeps the teacher pathway alive:

loss = BC(v_student_wrist, u_t)                          # always coefficient 1.0
     + lambda * || v_student_wrist - sg(v_teacher_bothcam) ||^2   # velocity distillation
     + anchor_frac * BC(v_teacher_bothcam, u_t)           # both-cam anchor

curriculum:  p(step) = 0.5 * (1 - cos(pi * progress)),  progress = clip(step / (0.75*N), 0, 1)

The teacher shares the student's graphdef + params (mode=self, live, ema_decay=null); the only difference between the two forwards is the input observation. This adds activations but no extra parameter memory.
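
The loss and curriculum above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the contents of `distill.py`: every term is averaged over dimensions (the report writes the distillation term as a squared norm), velocities are plain float lists, and stop-gradient is omitted because nothing here carries gradients.

```python
import math

def curriculum_p(step, total_steps, frac=0.75):
    """Cosine ramp 0 -> 1 over the first `frac` of training, then hold at 1."""
    progress = min(max(step / (frac * total_steps), 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * progress))

def mse(a, b):
    """Mean squared difference between two equal-length float lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(v_student_wrist, v_teacher_bothcam, u_t, lam=1.0, anchor_frac=0.15):
    """BC + lam * velocity distillation + anchor_frac * both-cam BC anchor."""
    bc = mse(v_student_wrist, u_t)                  # coefficient fixed at 1.0
    distill = mse(v_student_wrist, v_teacher_bothcam)
    anchor = mse(v_teacher_bothcam, u_t)
    return bc + lam * distill + anchor_frac * anchor

N = 30_000
p_start = curriculum_p(0, N)
p_mid = curriculum_p(11_250, N)     # halfway through the ramp
p_end = curriculum_p(22_500, N)     # ramp finishes at 75% of training
p_final = curriculum_p(N, N)

bc_only = distill_loss([1.0], [0.0], [0.0], lam=0.0, anchor_frac=0.0)
full = distill_loss([1.0], [0.0], [0.0])
```

Setting `lam=0` and `anchor_frac=0` reduces the loss to plain behavior cloning, which is the configuration B3 uses to isolate the curriculum.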

04 The five full-FT baselines

Each card: how it is trained (config, knobs, slurm) and how it is implemented (the line of code that does the masking).

B1 · view_dropout (zeros), p=1.0 · 27.4% (137/500)

Full-FT control. agentview RGB is replaced by a zeros tensor while its image_mask stays True, so the image tower encodes a black frame and the transformer still attends to it. The injection is repack-level (dataset-only ⇒ train-only); standard eval is already effectively wrist-only because the trained model ignores the dead view.

config: configs/experiment/pi05/libero_wristonly_full.yaml (view_dropout.train.agentview=1.0)
train: slurm 3067991 · outputs/20260521_165625_PI05_FULL_WRISTONLY_FULL_r1_j3067991
eval: n=500 · outputs/eval_logs/PI05_FULL_WRISTONLY_FULL_EVAL_FULL
impl: _ViewDropoutRepack (run_openpi_train.py:219) → apply_view_dropout (view_mask.py:209); block read at run_openpi_train.py:837

B2 · view_remove (physical), p=1.0 · 94.2% (471/500)

Matched control for B1: a byte-identical recipe with the drop mechanism as the only changed variable. _ViewMaskOutTransform sets image_mask['base_0_rgb']=False, so make_attn_mask excludes the agentview tokens from attention entirely (Fig. 2). Because the transform lives in data_transforms.inputs (run at both the data loader and the served Policy), the same physical removal applies at eval with no client-side hook.

config: configs/experiment/pi05/libero_wristonly_physdrop_full.yaml (view_remove.{train,eval}.agentview=1.0)
train: slurm 3071544 · outputs/20260523_024012_PI05_FULL_WRISTONLY_PHYSDROP_r1_j3071544
eval: n=500 · outputs/eval_logs/PI05_FULL_WRISTONLY_PHYSDROP_EVAL_FULL; removal verified in server.log: eval physical view-removal {agentview=1}
impl: _ViewMaskOutTransform (run_openpi_train.py:274) + LeRobotLiberoViewRemoveDataConfig (:328) → apply_view_image_mask (view_mask.py:257); block read at :874
Watch the partial SR: The LoRA floor eval read 15.2% at episode 164 but ended at 5.0% (front-loaded easy tasks). Always wait for the full-500 number, and confirm wrist-only evals via the server.log "eval physical view-removal" line; a suspiciously high SR usually means the view was not actually removed. B2's 94.2% passed both checks.
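
The server.log check can be automated with a small guard before an eval number is recorded. The helper below is hypothetical; it assumes only the log fragment quoted in the B2 card (`eval physical view-removal {agentview=1}`) and makes no claim about the rest of the log format.

```python
def removal_confirmed(log_text, view="agentview"):
    """True if some log line reports eval physical view-removal for `view`."""
    marker = "eval physical view-removal"
    return any(
        marker in line and f"{view}=1" in line
        for line in log_text.splitlines()
    )

# Toy log contents; only the second line mirrors text quoted in this report.
good_log = "server starting\neval physical view-removal {agentview=1}\nready\n"
bad_log = "server starting\nready\n"

good_ok = removal_confirmed(good_log)
bad_ok = removal_confirmed(bad_log)
```

An eval whose log fails this check should be treated as a both-camera run, whatever its success rate.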

B3 · curriculum-only · eval pending

Issue-#10 ablation 1/3. The student's agentview mask probability ramps cosine 0→1 over the first 75% of steps then holds at 1.0; λ=0 (no distillation), anchor_frac=0. Isolates whether gradually withdrawing the privileged view (vs withdrawing it from step 0, as B2 does) helps the full-FT model land a better wrist-only policy. No data-level transform — the student masks agentview in-loss; the teacher forward runs but contributes nothing.

config: configs/experiment/pi05/distill/curriculum_only.yaml
train: slurm 3071609 (RUNNING) → eval afterok 3071617 (n=50)
impl: curriculum_p (distill.py:165) × _mask_view_per_sample inside compute_distill_loss (distill.py:208); built by _build_distill (config_factory.py:190)

B4 · self-distill only · eval pending

Issue-#10 ablation 2/3. Student is always wrist-only (curriculum off, p=1.0 constant); a live self-teacher (the same model M, both cameras, stop-gradient) supplies the velocity-matching signal at λ=1.0 with an anchor_frac=0.15 both-cam anchor. Isolates the distillation term without the curriculum: does matching M's both-cam velocities lift the wrist-only student above the plain physical-removal floor (B2)?

config: configs/experiment/pi05/distill/selfdistill_only.yaml
train: slurm 3071612 (RUNNING) → eval afterok 3071618 (n=50)
impl: live teacher: mode=self, ema_decay=null in compute_distill_loss (distill.py:208); checkpoint path uses load_teacher_state (distill.py:329)

B5 · curriculum + self-distill · eval pending · headline run

Issue-#10 ablation 3/3 — the full method. Curriculum-masked student (ramp 0→1@75%) + live both-cam self-teacher (λ=1.0, anchor_frac=0.15). Early in training (small p) the student mostly matches a both-cam M; as p→1 it specialises wrist-only while still matching the privileged teacher's velocities. Tests whether the two mechanisms compound beyond each alone (B3, B4) and beyond the physical-removal floor (B2).

config: configs/experiment/pi05/distill/curric_selfdistill.yaml
train: slurm 3071611 (RUNNING) → eval afterok 3071619 (n=50)
impl: curriculum + distill both active in compute_distill_loss (distill.py:208); eval served wrist-only via VLA_ZOO_VIEW_REMOVE_EVAL=agentview (serve_pi0_libero.py:132/167)

05 Results

LIBERO-Spatial, agentview removed at eval, full 500-episode suite unless noted. Bars are scaled to the 96.6% both-cam ceiling. Reference rows are not full-FT baselines — they frame the result.

| Variant | Backbone | Mechanism | SR | Eval evidence |
|---|---|---|---|---|
| B2 physdrop p=1.0 | full-FT | view_remove (physical) | 94.2% | PI05_FULL_WRISTONLY_PHYSDROP_EVAL_FULL |
| B1 zero-mask p=1.0 | full-FT | view_dropout (zeros) | 27.4% | PI05_FULL_WRISTONLY_FULL_EVAL_FULL |
| B3 curriculum-only | full-FT | curriculum (in-loss) | pending | slurm 3071609 → 3071617 |
| B4 self-distill only | full-FT | distill λ=1.0 | pending | slurm 3071612 → 3071618 |
| B5 curric + self-distill | full-FT | distill + curriculum | pending | slurm 3071611 → 3071619 |
| both-cam ceiling (ref) | LoRA, full-obs | none | 96.6% | PI05_LORA_r2 (teacher) |
| LoRA distill (ref) | LoRA | privileged distill (2-stage) | 91.0% | PI05_LORA_WRISTONLY_DISTILL_EVAL_FULL |
| dp wrist-only (ref) | from-scratch | view_dropout p=1.0 | 50.8% | DP_WRISTONLY_FULL |
| LoRA floor (ref) | LoRA (frozen) | view_dropout p=1.0 | 5.0% | PI05_LORA_WRISTONLY_FULL_R2_EVAL_FULL |
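
The percentage-point gaps quoted throughout the report follow directly from the episode counts and reference rows. A short arithmetic check, using only numbers stated in this report (variable names are illustrative):

```python
def success_rate(successes, episodes):
    """Success rate in percent."""
    return 100.0 * successes / episodes

b2 = success_rate(471, 500)   # full-FT, physical removal
b1 = success_rate(137, 500)   # full-FT, zero-mask
dp_ref = 50.8                 # from-scratch diffusion policy (ref row)
ceiling_ref = 96.6            # both-cam ceiling (ref row)

gap_removal_vs_zero = b2 - b1        # physical removal over zero-mask
gap_removal_vs_dp = b2 - dp_ref      # physical removal over from-scratch dp
gap_to_ceiling = ceiling_ref - b2    # remaining distance to the both-cam ceiling
```

These reproduce the 66.8pp and 43.4pp gaps cited in the hypothesis section and leave 2.4pp between B2 and the both-cam ceiling.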

06 Discussion

The dominant variable in wrist-only π0.5 is not the backbone's trainability but how the exterior view is removed. Zero-masking keeps the agentview tokens "valid", so a pretrained image tower spends capacity encoding a black frame and the transformer is obligated to attend to an out-of-distribution dead view — dragging full-FT to 27.4%, below even a from-scratch diffusion policy (50.8%). Physically excluding the tokens from attention (B2) restores near-ceiling performance (94.2%), and a frozen-LoRA adapter (5%) simply lacks the capacity to compensate for either deficiency.

Insight: Zero-masking a camera (pixels=0, mask=True) is not equivalent to removing it. For a model with a pretrained image tower it performed far worse than physical token removal in this batch: the 66.8pp gap (27.4% zero-mask vs 94.2% removal) under an otherwise byte-identical full-FT recipe is the cleanest evidence here. Prefer image_mask=False for any "ablate a modality" experiment on π0/π0.5.
Limitation: Single benchmark (LIBERO-Spatial), single seed (42), single suite. B3–B5 evals are still pending, so H3 (distillation beats plain physical removal) is unconfirmed here. The 91.0% LoRA-distill reference and the 94.2% full-FT physdrop are close enough that distillation's marginal value on top of physical removal at full-FT capacity is not yet established.
Falsifier: If B4 (self-distill only, physical-removal eval) lands at or below B2's 94.2%, then distillation adds nothing once the view is physically removed at full-FT capacity; H3 would be falsified and the headline reduces to "use image_mask=False, full-FT." Conversely, if B3 (curriculum-only, no teacher) already matches B5, the privileged signal is redundant and only the curriculum schedule matters.
Action plan: Wait for the three issue-#10 evals (afterok 3071617/18/19, n=50) to populate the B3–B5 rows; promote the winning configuration to a full n=500 eval. Owner: tohkawa25. The ablation cleanly disentangles curriculum (B3) vs distillation (B4) vs both (B5), against the B2 physical-removal floor. Then re-run B2's recipe at view_remove p=0.5 to test the FULL-vs-P05 direction at full-FT capacity (still untested on π0.5).
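
One reason to promote the winner to n=500 before comparing against B2 is sampling noise. The sketch below computes a Wilson score interval for a binomial success rate. It is a generic statistical aid rather than part of the eval pipeline, and the 47/50 case is a hypothetical outcome, not a reported result.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo500, hi500 = wilson_interval(471, 500)   # B2 as reported (471/500)
lo50, hi50 = wilson_interval(47, 50)       # hypothetical n=50 run at 94%

width500 = hi500 - lo500
width50 = hi50 - lo50
```

The n=50 interval is several times wider than the n=500 one, so a B3 to B5 result near 94% at n=50 cannot by itself confirm or falsify H3 against B2.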

07 Implementation details: code anchors

Every masking mechanism in this report, and the line of code where it lives.

| Module | file:line | Role |
|---|---|---|
| OPENPI_IMAGE_KEY_MAP | src/vla_zoo/data/view_mask.py:88 | canonical agentview → openpi base_0_rgb, eye_in_hand → left_wrist_0_rgb |
| apply_view_dropout | src/vla_zoo/data/view_mask.py:209 | zeros the RGB, mask stays True (blake2b-reproducible per-sample draw) |
| apply_view_image_mask | src/vla_zoo/data/view_mask.py:257 | sets image_masks[key]=np.False_ (physical removal) |
| _ViewDropoutRepack | scripts/run_openpi_train.py:219 | repack-level transform (train-only) for view_dropout |
| _ViewMaskOutTransform | scripts/run_openpi_train.py:274 | data_transforms.inputs transform (train+eval) for view_remove |
| LeRobotLiberoViewRemoveDataConfig | scripts/run_openpi_train.py:328 | data-config subclass that injects M2 after LiberoInputs |
| block readers (vd / vr / distill) | run_openpi_train.py:837 / 874 / 918 | read root cfg blocks; enforce mutual exclusivity (loud) |
| embed_prefix | third_party/openpi/.../models/pi0.py:106 | repeats image_masks[name] over image tokens → input_mask |
| make_attn_mask | third_party/openpi/.../models/pi0.py:19 | outer product drops False rows/cols from attention |
| compute_distill_loss | third_party/openpi/.../training/distill.py:208 | BC + λ·velocity-match + anchor; per-sample student mask |
| curriculum_p | third_party/openpi/.../training/distill.py:165 | cosine ramp 0→1 over first curriculum_frac of steps |
| load_teacher_state | third_party/openpi/.../training/distill.py:329 | loads + freezes a checkpoint teacher (mode=checkpoint) |
| _build_distill | src/vla_zoo/openpi/config_factory.py:190 | builds DistillConfig, fails loud on missing keys (no silent defaults) |
| eval view-removal hook | scripts/serve_pi0_libero.py:132 / 167 | VLA_ZOO_VIEW_REMOVE_EVAL env → server-side physical removal + log line |

Example: physical-removal config (B2)

_base: libero_original.yaml
wandb_run_name: pi05_wristonly_physdrop_full

# PHYSICAL camera removal (image_mask=False) - distinct from view_dropout zeros.
view_remove:
  enabled: true
  seed: 42
  train: { agentview: 1.0 }   # exclude third-person view from attention in training
  eval:  { agentview: 1.0 }   # same physical removal at served eval

Reproduce

# Train (single H200, openpi trainer)
uv run python scripts/train.py \
  --config configs/experiment/pi05/libero_wristonly_physdrop_full.yaml \
  batch_size=32

# Eval n=500 (10 LIBERO-Spatial tasks × NUM_TRIALS=50), server-side physical removal
PI0_CKPT_DIR=<orbax .../29999> OPENPI_CONFIG=pi05_libero NUM_TRIALS=50 \
  VLA_ZOO_VIEW_REMOVE_EVAL=agentview \
  LOG_DIR=outputs/eval_logs/PI05_FULL_WRISTONLY_PHYSDROP_EVAL_FULL \
  bash scripts_sh/eval_pi0_trained.sh libero_spatial

08 References

Papers

  [1] Black et al. π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence, 2024. arXiv:2410.24164
  [2] Physical Intelligence. π0.5: a Vision-Language-Action Model with Open-World Generalization. 2025. arXiv:2504.16054
  [3] Liu et al. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. NeurIPS 2023. arXiv:2306.03310
  [4] Lipman et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
  [5] Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
  [6] Chen et al. Learning by Cheating (privileged teacher → sensorimotor student). CoRL 2020. arXiv:1912.12294
  [7] Rusu et al. Policy Distillation. ICLR 2016. arXiv:1511.06295
Citation note: The privileged-distillation design (a teacher with extra observations supervising a deployable student) follows the "learning by cheating" line [6] adapted to flow-matching velocities; the velocity-matching term is policy distillation [7] at the denoiser output rather than at action logits. arXiv IDs should be verified before lifting into a paper.

Internal