00Summary
π0.5 is a dual-expert flow-matching VLA: a single 18-layer Gemma transformer where a 2B-width PaliGemma expert encodes vision + language (the "prefix") and a 300M-width action expert denoises an action chunk (the "suffix"), the two attending jointly. This card pins every module's input/output tensor shape for the pi05_libero config and maps each GitHub-issue ablation knob to the exact module and tensor it touches.
01Architecture
Dual-expert Gemma (mixture-of-transformers)
The core is a single Gemma stack of 18 layers instantiated with two per-layer weight sets (_gemma.Module(configs=[paligemma, action_expert]), pi0.py:74). Tokens are routed to an expert by which sequence they belong to, not by a gate: the VLM prefix (images + language) flows through the gemma_2b expert (width 2048), the action suffix through the gemma_300m expert (width 1024). Both experts share num_heads=8, head_dim=256, depth=18 — the matched depth + head geometry is what lets the two token streams attend to each other in every layer through one joint attention mask. Only width and mlp_dim (16384 vs 4096) differ.
SigLIP image tower
Each camera image is encoded by a shared SigLIP So400m/14 tower (_siglip.Module(variant="So400m/14", pool_type="none"), pi0.py:82): a 224×224×3 frame is cut into 14×14 patches → 16×16 = 256 patch tokens, run through the 27-layer So400m encoder (internal width 1152), and projected to the PaliGemma width (num_classes=2048). The tower is weight-tied across cameras — applied independently per view inside the for name in obs.images loop (pi0.py:113).
Image input contract — how many views?
pi0/π0.5 expects exactly three named image slots, fixed by Pi0Config.inputs_spec (pi0_config.py:61):
| Slot key | Role | LIBERO | tensor |
|---|---|---|---|
base_0_rgb | exterior / third-person | populated (agentview) | [b,224,224,3] |
left_wrist_0_rgb | wrist camera | populated (eye_in_hand) | [b,224,224,3] |
right_wrist_0_rgb | 2nd wrist (bimanual) | zeros, image_mask=False | [b,224,224,3] |
pi05_base weight contract and the data transforms (LiberoInputs, libero_policy.py:59). An unused slot uses the padding-image convention: zeros + image_mask=False so its tokens are excluded from attention (the same mechanism wrist-only ablation uses — see pi05-wrist-only.html). Because the SigLIP tower is weight-tied and embed_prefix just concatenates one 256-token block per view, more views are architecturally expressible, but doing so departs from the pretrained contract and the standard repack pipeline.
π0 vs π0.5 — what the pi05 flag changes
| π0 | π0.5 (pi05=True) | |
|---|---|---|
| state | continuous state token in suffix (state_proj) | no state token; discrete state in prompt (discrete_state_input; off for LIBERO) |
| timestep | concatenated with action tokens → action_time_mlp | adaRMS conditioning via time_mlp → adarms_cond (not a token) |
max_token_len | 48 | 200 |
| adaRMS experts | [False, False] | [False, True] (action expert only) |
02Model diagram
flowchart TD IMG["images x3 : [b,224,224,3]"]:::data TOK["tokenized_prompt : [b,200]"]:::data IMG --> SIG["SigLIP So400m/14 per cam : [b,224,224,3] to [b,256,2048]"]:::vlm TOK --> LEMB["llm.embed : [b,200] to [b,200,2048]"]:::vlm SIG --> PFX["embed_prefix pi0.py:106 : prefix [b, up to 712, 2048]"]:::vlm LEMB --> PFX XT["noisy actions x_t : [b,10,32]"]:::act --> AIP["action_in_proj : [b,10,32] to [b,10,1024]"]:::act TSN["timestep t : [b]"]:::act --> TM["posemb_sincos + time_mlp : adarms_cond [b,1024]"]:::act AIP --> SFX["embed_suffix pi05 pi0.py:140 : [b,10,1024]"]:::act PFX --> LLM["Gemma dual-expert LLM 18 layers : prefix 2048-w expert, suffix 1024-w expert, joint attention"]:::core SFX --> LLM TM -. "adaRMS suffix only" .-> LLM LLM --> AOP["action_out_proj : suffix_out last-10 [b,10,1024] to [b,10,32]"]:::act AOP --> V["velocity v_t : [b,10,32]"]:::out classDef data fill:#E3DACC,stroke:#B85C3E,color:#141413; classDef vlm fill:#F0EEE6,stroke:#87867F,color:#3D3D3A; classDef act fill:#FAF9F5,stroke:#788C5D,color:#141413; classDef core fill:#FAF9F5,stroke:#D97757,color:#141413,stroke-width:2px; classDef out fill:#FAF9F5,stroke:#141413,color:#141413,stroke-width:2px;
Fig. 1 — Forward pass with tensor shapes (pi05_libero, batch b). Prefix length ≤ 712 = 2 active cameras × 256 + ≤ 200 language tokens. The action expert's timestep enters as adaRMS conditioning, not as a sequence token.
03Tensor IO — per module
All shapes for pi05_libero, batch b; ah=action_horizon=10, ad=action_dim=32. N = prefix_len + suffix_len.
| Module (file:line) | Input | Output |
|---|---|---|
Observation.images[k] (model.py:91) | — | [b,224,224,3] f32 ×3 |
Observation.image_masks[k] | — | [b] bool ×3 |
Observation.state | — | [b,32] f32 |
tokenized_prompt / _mask | — | [b,200] i32 / bool |
target actions | — | [b,10,32] f32 |
PaliGemma.img SigLIP (per cam) | [b,224,224,3] | [b,256,2048] |
PaliGemma.llm(method="embed") | [b,200] | [b,200,2048] |
embed_prefix (pi0.py:106) | images + lang | tokens [b,≤712,2048], masks [b,≤712] |
action_in_proj | x_t [b,10,32] | [b,10,1024] |
posemb_sincos + time_mlp | t [b] | adarms_cond [b,1024] |
embed_suffix pi05 (pi0.py:140) | x_t, t | tokens [b,10,1024], cond [b,1024] |
PaliGemma.llm (MoE, 18 layers) | [[b,≤712,2048], [b,10,1024]], mask [b,N,N] | (prefix [b,≤712,2048], suffix [b,10,1024]) |
action_out_proj | suffix_out[:,-10:] [b,10,1024] | v_t [b,10,32] |
compute_loss (pi0.py:219) | v_t, u_t=noise−a [b,10,32] | per-elem MSE [b,10] |
sample_actions (pi0.py:237) | noise [b,10,32] | actions [b,10,32] |
action_dim=32 is the padded model dim, fixed by the pi05_base contract. The real env dim (LIBERO 7, yubi bimanual 14) is zero-padded up to 32 by PadStatesAndActions (config.py:124) and sliced back at the env boundary. state and both action projections all carry the padded 32.
04Flow-matching objective
Conditional flow matching[4] over the action chunk. Convention: t=1 is noise, t=0 is the data (note: opposite of the π0 paper).
# compute_loss (pi0.py:219)
noise = N(0, I) # [b,10,32]
t = Beta(1.5, 1) * 0.999 + 0.001 # [b] (biases samples toward t->0/data)
x_t = t*noise + (1 - t)*actions # [b,10,32] interpolant
u_t = noise - actions # [b,10,32] target velocity
v_t = model._predict_velocity(obs, x_t, t) # [b,10,32]
loss = mean( (v_t - u_t)^2 ) # over action_dim -> [b,10]
# sample_actions (pi0.py:237): Euler integration, num_steps=10
x = noise # [b,10,32], t = 1.0, dt = -1/num_steps
# prefix KV-cache filled once (vision+language); suffix re-run each step
while t >= -dt/2:
v_t = action_out_proj( llm([None, suffix(x,t)], kv_cache=prefix)[-10:] )
x = x + dt * v_t ; t = t + dt
return x # [b,10,32] ~ clean action chunk
05Implementation details — code anchors
| Component | file:line | Role |
|---|---|---|
Pi0 model | third_party/openpi/.../models/pi0.py:66 | top-level module; builds img tower + dual-expert llm + projections |
__init__ (module wiring) | pi0.py:67 | action_in_proj / action_out_proj / time_mlp (pi05) or state_proj+action_time_mlp (pi0) |
embed_prefix | pi0.py:106 | per-cam SigLIP + language embed → prefix tokens (bidirectional) |
embed_suffix | pi0.py:140 | action_in_proj + timestep; pi05 adaRMS branch vs pi0 state/time-MLP branch |
_predict_velocity | pi0.py:189 | one joint prefix+suffix forward → v_t (shared by loss + distillation) |
compute_loss | pi0.py:219 | samples x_t/t, returns flow-matching MSE |
sample_actions | pi0.py:237 | Euler integration with prefix KV-cache |
make_attn_mask | pi0.py:19 | prefix-LM + causal block mask; image_mask=False drops a view |
Pi0Config / get_freeze_filter | pi0_config.py:19 / 79 | dims, variants, max_token_len; LoRA freeze filter |
gemma.get_config | gemma.py:58 | gemma_2b (w2048) / gemma_300m (w1024); +_lora variants |
siglip.Module | siglip.py:293 | So400m/14, patch_size 14, pool none |
Observation / IMAGE_RESOLUTION | model.py:83 / 47 | input dataclass; 224×224 |
| VLA-Zoo config | configs/model/pi05.yaml | mirrors upstream pi05_libero (config.py:751) |
06Ablation knobs (GitHub issues)
Each parameter ablated across the issue tracker, mapped to the architecture: the module/tensor it touches, the range tested, the governing issue, and what it isolates. Results live in the issues themselves.
| Knob | Module / tensor | Range | Issue | Isolates |
|---|---|---|---|---|
| action_horizon H | embed_suffix length & action_out_proj slice → output [b,H,32] |
10 (libero) · 16 · 32 · 50 | #6 #9 #12 #4 | replan frequency vs temporal smoothness; long H over-runs short/low-fps episodes → boundary clamp-padding (#4) |
| action_dim padding | action_in_proj in / action_out_proj out / state |
real 7 (libero), 14 (yubi) → padded 32 | #2 #8 | whether padded/real channels carry live signal (dead left-hand stream) |
| LoRA vs full-FT | paligemma_variant + action_expert_variant; get_freeze_filter |
gemma_2b / _lora (r16,α16) · gemma_300m / _lora (r32,α32) | #7 #9 | ~1.5% trainable adapter vs 100% backbone — adaptation capacity vs compute/overfit |
| action-head pretrain | weight_loader: full pi05_base vs VLM-only (action expert reinit) |
pretrained · from-scratch head | #3 | does the flow-matching action head need pretrained init, or learn from the VLM alone? |
| view masking | images / image_masks dict; in-loss mask + curriculum |
full · view_dropout (zeros) · view_remove (mask=False) · distill | #5 #10 | modality availability & how a view is removed — see pi05-wrist-only.html |
| fps / codec (data) | upstream of the model: frames backing each H-step chunk | fps05 · fps10 · fps30 · AV1 / h264 | #1 #2 #4 #8 | temporal density, decode speed, and chunk coverage vs episode length |
| multi-task vs per-task | same arch; training data mixture (fix_actions_tf / frame_source_table) |
1 multi-task · N per-task specialists · ah16/ah32 | #12 | specialization vs generalization at fixed capacity |
07References
- [1] Black et al. π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence, 2024. arXiv:2410.24164
- [2] Physical Intelligence. π0.5: a VLA with Open-World Generalization. 2025. arXiv:2504.16054
- [3] Beyer et al. PaliGemma: A versatile 3B VLM for transfer. 2024. arXiv:2407.07726 · SigLIP: Zhai et al. arXiv:2303.15343
- [4] Lipman et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
- [5] Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
- [6] Liu et al. LIBERO. NeurIPS 2023. arXiv:2306.03310
Internal
- Issues: #3 (head pretrain) · #4 (low-fps clamp) · #5 (wrist-only) · #6 (action_horizon) · #7 (LoRA vs full-FT) · #9 (fps30 grid) · #10 (distillation) · #12 (multi vs per-task) · #2/#8 (data audits)
- Config:
configs/model/pi05.yaml· upstreamthird_party/openpi/src/openpi/training/config.py:751(pi05_libero) - Companion report: pi05-wrist-only.html (view-masking ablations)