VLA-Zoo · Model card · architecture

π0.5: architecture, tensor IO & ablation knobs

doc-type model_card config pi05_libero updated 2026-05-23 owner tohkawa25 code third_party/openpi @ pi0.py

00Summary

π0.5 is a dual-expert flow-matching VLA: a single 18-layer Gemma transformer where a 2B-width PaliGemma expert encodes vision + language (the "prefix") and a 300M-width action expert denoises an action chunk (the "suffix"), the two attending jointly. This card pins every module's input/output tensor shape for the pi05_libero config and maps each GitHub-issue ablation knob to the exact module and tensor it touches.

32
action_dim (padded)
10
action_horizon (chunk)
2048/1024
VLM / action-expert width
256×3
image tokens (3 slots)
200
max_token_len (lang)
bf16
dtype

01Architecture

Dual-expert Gemma (mixture-of-transformers)

The core is a single Gemma stack of 18 layers instantiated with two per-layer weight sets (_gemma.Module(configs=[paligemma, action_expert]), pi0.py:74). Tokens are routed to an expert by which sequence they belong to, not by a gate: the VLM prefix (images + language) flows through the gemma_2b expert (width 2048), the action suffix through the gemma_300m expert (width 1024). Both experts share num_heads=8, head_dim=256, depth=18 — the matched depth + head geometry is what lets the two token streams attend to each other in every layer through one joint attention mask. Only width and mlp_dim (16384 vs 4096) differ.

SigLIP image tower

Each camera image is encoded by a shared SigLIP So400m/14 tower (_siglip.Module(variant="So400m/14", pool_type="none"), pi0.py:82): a 224×224×3 frame is cut into 14×14 patches → 16×16 = 256 patch tokens, run through the 27-layer So400m encoder (internal width 1152), and projected to the PaliGemma width (num_classes=2048). The tower is weight-tied across cameras — applied independently per view inside the for name in obs.images loop (pi0.py:113).

Image input contract — how many views?

pi0/π0.5 expects exactly three named image slots, fixed by Pi0Config.inputs_spec (pi0_config.py:61):

Slot keyRoleLIBEROtensor
base_0_rgbexterior / third-personpopulated (agentview)[b,224,224,3]
left_wrist_0_rgbwrist camerapopulated (eye_in_hand)[b,224,224,3]
right_wrist_0_rgb2nd wrist (bimanual)zeros, image_mask=False[b,224,224,3]
Pitfall The model is not arbitrary-view. The 3-slot layout is baked into the pi05_base weight contract and the data transforms (LiberoInputs, libero_policy.py:59). An unused slot uses the padding-image convention: zeros + image_mask=False so its tokens are excluded from attention (the same mechanism wrist-only ablation uses — see pi05-wrist-only.html). Because the SigLIP tower is weight-tied and embed_prefix just concatenates one 256-token block per view, more views are architecturally expressible, but doing so departs from the pretrained contract and the standard repack pipeline.

π0 vs π0.5 — what the pi05 flag changes

π0π0.5 (pi05=True)
statecontinuous state token in suffix (state_proj)no state token; discrete state in prompt (discrete_state_input; off for LIBERO)
timestepconcatenated with action tokens → action_time_mlpadaRMS conditioning via time_mlpadarms_cond (not a token)
max_token_len48200
adaRMS experts[False, False][False, True] (action expert only)

02Model diagram

flowchart TD
  IMG["images x3 : [b,224,224,3]"]:::data
  TOK["tokenized_prompt : [b,200]"]:::data
  IMG --> SIG["SigLIP So400m/14 per cam : [b,224,224,3] to [b,256,2048]"]:::vlm
  TOK --> LEMB["llm.embed : [b,200] to [b,200,2048]"]:::vlm
  SIG --> PFX["embed_prefix pi0.py:106 : prefix [b, up to 712, 2048]"]:::vlm
  LEMB --> PFX
  XT["noisy actions x_t : [b,10,32]"]:::act --> AIP["action_in_proj : [b,10,32] to [b,10,1024]"]:::act
  TSN["timestep t : [b]"]:::act --> TM["posemb_sincos + time_mlp : adarms_cond [b,1024]"]:::act
  AIP --> SFX["embed_suffix pi05 pi0.py:140 : [b,10,1024]"]:::act
  PFX --> LLM["Gemma dual-expert LLM 18 layers : prefix 2048-w expert, suffix 1024-w expert, joint attention"]:::core
  SFX --> LLM
  TM -. "adaRMS suffix only" .-> LLM
  LLM --> AOP["action_out_proj : suffix_out last-10 [b,10,1024] to [b,10,32]"]:::act
  AOP --> V["velocity v_t : [b,10,32]"]:::out

  classDef data fill:#E3DACC,stroke:#B85C3E,color:#141413;
  classDef vlm fill:#F0EEE6,stroke:#87867F,color:#3D3D3A;
  classDef act fill:#FAF9F5,stroke:#788C5D,color:#141413;
  classDef core fill:#FAF9F5,stroke:#D97757,color:#141413,stroke-width:2px;
  classDef out fill:#FAF9F5,stroke:#141413,color:#141413,stroke-width:2px;
  

Fig. 1 — Forward pass with tensor shapes (pi05_libero, batch b). Prefix length ≤ 712 = 2 active cameras × 256 + ≤ 200 language tokens. The action expert's timestep enters as adaRMS conditioning, not as a sequence token.

03Tensor IO — per module

All shapes for pi05_libero, batch b; ah=action_horizon=10, ad=action_dim=32. N = prefix_len + suffix_len.

Module (file:line)InputOutput
Observation.images[k] (model.py:91)[b,224,224,3] f32 ×3
Observation.image_masks[k][b] bool ×3
Observation.state[b,32] f32
tokenized_prompt / _mask[b,200] i32 / bool
target actions[b,10,32] f32
PaliGemma.img SigLIP (per cam)[b,224,224,3][b,256,2048]
PaliGemma.llm(method="embed")[b,200][b,200,2048]
embed_prefix (pi0.py:106)images + langtokens [b,≤712,2048], masks [b,≤712]
action_in_projx_t [b,10,32][b,10,1024]
posemb_sincos + time_mlpt [b]adarms_cond [b,1024]
embed_suffix pi05 (pi0.py:140)x_t, ttokens [b,10,1024], cond [b,1024]
PaliGemma.llm (MoE, 18 layers)[[b,≤712,2048], [b,10,1024]], mask [b,N,N](prefix [b,≤712,2048], suffix [b,10,1024])
action_out_projsuffix_out[:,-10:] [b,10,1024]v_t [b,10,32]
compute_loss (pi0.py:219)v_t, u_t=noise−a [b,10,32]per-elem MSE [b,10]
sample_actions (pi0.py:237)noise [b,10,32]actions [b,10,32]
Invariant action_dim=32 is the padded model dim, fixed by the pi05_base contract. The real env dim (LIBERO 7, yubi bimanual 14) is zero-padded up to 32 by PadStatesAndActions (config.py:124) and sliced back at the env boundary. state and both action projections all carry the padded 32.

04Flow-matching objective

Conditional flow matching[4] over the action chunk. Convention: t=1 is noise, t=0 is the data (note: opposite of the π0 paper).

# compute_loss (pi0.py:219)
noise = N(0, I)                          # [b,10,32]
t     = Beta(1.5, 1) * 0.999 + 0.001     # [b]   (biases samples toward t->0/data)
x_t   = t*noise + (1 - t)*actions        # [b,10,32]  interpolant
u_t   = noise - actions                  # [b,10,32]  target velocity
v_t   = model._predict_velocity(obs, x_t, t)         # [b,10,32]
loss  = mean( (v_t - u_t)^2 )            # over action_dim -> [b,10]
# sample_actions (pi0.py:237): Euler integration, num_steps=10
x = noise                                # [b,10,32],  t = 1.0,  dt = -1/num_steps
# prefix KV-cache filled once (vision+language); suffix re-run each step
while t >= -dt/2:
    v_t = action_out_proj( llm([None, suffix(x,t)], kv_cache=prefix)[-10:] )
    x   = x + dt * v_t ;  t = t + dt
return x                                  # [b,10,32]  ~ clean action chunk

05Implementation details — code anchors

Componentfile:lineRole
Pi0 modelthird_party/openpi/.../models/pi0.py:66top-level module; builds img tower + dual-expert llm + projections
__init__ (module wiring)pi0.py:67action_in_proj / action_out_proj / time_mlp (pi05) or state_proj+action_time_mlp (pi0)
embed_prefixpi0.py:106per-cam SigLIP + language embed → prefix tokens (bidirectional)
embed_suffixpi0.py:140action_in_proj + timestep; pi05 adaRMS branch vs pi0 state/time-MLP branch
_predict_velocitypi0.py:189one joint prefix+suffix forward → v_t (shared by loss + distillation)
compute_losspi0.py:219samples x_t/t, returns flow-matching MSE
sample_actionspi0.py:237Euler integration with prefix KV-cache
make_attn_maskpi0.py:19prefix-LM + causal block mask; image_mask=False drops a view
Pi0Config / get_freeze_filterpi0_config.py:19 / 79dims, variants, max_token_len; LoRA freeze filter
gemma.get_configgemma.py:58gemma_2b (w2048) / gemma_300m (w1024); +_lora variants
siglip.Modulesiglip.py:293So400m/14, patch_size 14, pool none
Observation / IMAGE_RESOLUTIONmodel.py:83 / 47input dataclass; 224×224
VLA-Zoo configconfigs/model/pi05.yamlmirrors upstream pi05_libero (config.py:751)

06Ablation knobs (GitHub issues)

Each parameter ablated across the issue tracker, mapped to the architecture: the module/tensor it touches, the range tested, the governing issue, and what it isolates. Results live in the issues themselves.

KnobModule / tensorRangeIssueIsolates
action_horizon H embed_suffix length & action_out_proj slice → output [b,H,32] 10 (libero) · 16 · 32 · 50 #6 #9 #12 #4 replan frequency vs temporal smoothness; long H over-runs short/low-fps episodes → boundary clamp-padding (#4)
action_dim padding action_in_proj in / action_out_proj out / state real 7 (libero), 14 (yubi) → padded 32 #2 #8 whether padded/real channels carry live signal (dead left-hand stream)
LoRA vs full-FT paligemma_variant + action_expert_variant; get_freeze_filter gemma_2b / _lora (r16,α16) · gemma_300m / _lora (r32,α32) #7 #9 ~1.5% trainable adapter vs 100% backbone — adaptation capacity vs compute/overfit
action-head pretrain weight_loader: full pi05_base vs VLM-only (action expert reinit) pretrained · from-scratch head #3 does the flow-matching action head need pretrained init, or learn from the VLM alone?
view masking images / image_masks dict; in-loss mask + curriculum full · view_dropout (zeros) · view_remove (mask=False) · distill #5 #10 modality availability & how a view is removed — see pi05-wrist-only.html
fps / codec (data) upstream of the model: frames backing each H-step chunk fps05 · fps10 · fps30 · AV1 / h264 #1 #2 #4 #8 temporal density, decode speed, and chunk coverage vs episode length
multi-task vs per-task same arch; training data mixture (fix_actions_tf / frame_source_table) 1 multi-task · N per-task specialists · ah16/ah32 #12 specialization vs generalization at fixed capacity
Insight Most ablation knobs touch the action expert side (action_horizon, action_dim, the action-head init) or the input contract (views, fps) — the 2B PaliGemma backbone and its 3-slot image interface stay fixed. LoRA-vs-full-FT is the one knob that changes which parameters of the backbone move at all.

07References

  1. [1] Black et al. π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence, 2024. arXiv:2410.24164
  2. [2] Physical Intelligence. π0.5: a VLA with Open-World Generalization. 2025. arXiv:2504.16054
  3. [3] Beyer et al. PaliGemma: A versatile 3B VLM for transfer. 2024. arXiv:2407.07726 · SigLIP: Zhai et al. arXiv:2303.15343
  4. [4] Lipman et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
  5. [5] Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
  6. [6] Liu et al. LIBERO. NeurIPS 2023. arXiv:2306.03310
Citation note The dual-expert ("mixture-of-transformers") + flow-matching action head is the π0 design [1]; π0.5 [2] swaps the state token + time-MLP for discrete-state prompt input + adaRMS time conditioning. arXiv IDs should be verified before lifting into a paper.

Internal