π0.5 — architecture & tensor IO

00Summary

π0.5 is a dual-expert flow-matching VLA: a single 18-layer Gemma transformer where a 2B-width PaliGemma expert encodes vision + language (the "prefix") and a 300M-width action expert denoises an action chunk (the "suffix"), the two attending jointly. This card pins every module's input/output tensor shape for the pi05_libero config and maps each GitHub-issue ablation knob to the exact module and tensor it touches.

action_dim (padded)

action_horizon (chunk)

2048/1024

VLM / action-expert width

256×3

image tokens (3 slots)

200

max_token_len (lang)

bf16

dtype

01Architecture

Dual-expert Gemma (mixture-of-transformers)

The core is a single Gemma stack of 18 layers instantiated with two per-layer weight sets (_gemma.Module(configs=[paligemma, action_expert]), pi0.py:74). Tokens are routed to an expert by which sequence they belong to, not by a gate: the VLM prefix (images + language) flows through the gemma_2b expert (width 2048), the action suffix through the gemma_300m expert (width 1024). Both experts share num_heads=8, head_dim=256, depth=18 — the matched depth + head geometry is what lets the two token streams attend to each other in every layer through one joint attention mask. Only width and mlp_dim (16384 vs 4096) differ.

SigLIP image tower

Each camera image is encoded by a shared SigLIP So400m/14 tower (_siglip.Module(variant="So400m/14", pool_type="none"), pi0.py:82): a 224×224×3 frame is cut into 14×14 patches → 16×16 = 256 patch tokens, run through the 27-layer So400m encoder (internal width 1152), and projected to the PaliGemma width (num_classes=2048). The tower is weight-tied across cameras — applied independently per view inside the for name in obs.images loop (pi0.py:113).

Image input contract — how many views?

pi0/π0.5 expects exactly three named image slots, fixed by Pi0Config.inputs_spec (pi0_config.py:61):

Slot key	Role	LIBERO	tensor
`base_0_rgb`	exterior / third-person	populated (`agentview`)	`[b,224,224,3]`
`left_wrist_0_rgb`	wrist camera	populated (`eye_in_hand`)	`[b,224,224,3]`
`right_wrist_0_rgb`	2nd wrist (bimanual)	zeros, `image_mask=False`	`[b,224,224,3]`

Pitfall The model is not arbitrary-view. The 3-slot layout is baked into the pi05_base weight contract and the data transforms (LiberoInputs, libero_policy.py:59). An unused slot uses the padding-image convention: zeros + image_mask=False so its tokens are excluded from attention (the same mechanism wrist-only ablation uses — see pi05-wrist-only.html). Because the SigLIP tower is weight-tied and embed_prefix just concatenates one 256-token block per view, more views are architecturally expressible, but doing so departs from the pretrained contract and the standard repack pipeline.

π0 vs π0.5 — what the `pi05` flag changes

	π0	π0.5 (`pi05=True`)
state	continuous state token in suffix (`state_proj`)	no state token; discrete state in prompt (`discrete_state_input`; off for LIBERO)
timestep	concatenated with action tokens → `action_time_mlp`	adaRMS conditioning via `time_mlp` → `adarms_cond` (not a token)
`max_token_len`	48	200
adaRMS experts	`[False, False]`	`[False, True]` (action expert only)

02Model diagram

flowchart TD
  IMG["images x3<br/>[b,224,224,3]"]:::data
  TOK["tokenized_prompt<br/>[b,200]"]:::data
  IMG --> SIG["SigLIP So400m/14 per cam<br/>[b,224,224,3] to [b,256,2048]"]:::vlm
  TOK --> LEMB["llm.embed<br/>[b,200] to [b,200,2048]"]:::vlm
  SIG --> PFX["embed_prefix (pi0.py:106)<br/>prefix [b, up to 712, 2048]"]:::vlm
  LEMB --> PFX
  XT["noisy actions x_t<br/>[b,10,32]"]:::act --> AIP["action_in_proj<br/>[b,10,32] to [b,10,1024]"]:::act
  TSN["timestep t [b]"]:::act --> TM["posemb_sincos + time_mlp<br/>adarms_cond [b,1024]"]:::act
  AIP --> SFX["embed_suffix pi05 (pi0.py:140)<br/>[b,10,1024]"]:::act
  PFX --> LLM["Gemma dual-expert LLM (18 layers)<br/>prefix 2048-w expert<br/>suffix 1024-w expert<br/>joint attention"]:::core
  SFX --> LLM
  TM -. "adaRMS (suffix only)" .-> LLM
  LLM --> AOP["action_out_proj<br/>suffix_out last-10<br/>[b,10,1024] to [b,10,32]"]:::act
  AOP --> V["velocity v_t<br/>[b,10,32]"]:::out

  classDef data fill:#E3DACC,stroke:#B85C3E,color:#141413;
  classDef vlm fill:#F0EEE6,stroke:#87867F,color:#3D3D3A;
  classDef act fill:#FAF9F5,stroke:#788C5D,color:#141413;
  classDef core fill:#FAF9F5,stroke:#D97757,color:#141413,stroke-width:2px;
  classDef out fill:#FAF9F5,stroke:#141413,color:#141413,stroke-width:2px;

Fig. 1 — Forward pass with tensor shapes (pi05_libero, batch b). Prefix length ≤ 712 = 2 active cameras × 256 + ≤ 200 language tokens. The action expert's timestep enters as adaRMS conditioning, not as a sequence token.

03Tensor IO — per module

All shapes for pi05_libero, batch b; ah=action_horizon=10, ad=action_dim=32. N = prefix_len + suffix_len.

Module (file:line)	Input	Output
`Observation.images[k]` (model.py:91)	—	`[b,224,224,3]` f32 ×3
`Observation.image_masks[k]`	—	`[b]` bool ×3
`Observation.state`	—	`[b,32]` f32
`tokenized_prompt` / `_mask`	—	`[b,200]` i32 / bool
target `actions`	—	`[b,10,32]` f32
`PaliGemma.img` SigLIP (per cam)	`[b,224,224,3]`	`[b,256,2048]`
`PaliGemma.llm(method="embed")`	`[b,200]`	`[b,200,2048]`
`embed_prefix` (pi0.py:106)	images + lang	tokens `[b,≤712,2048]`, masks `[b,≤712]`
`action_in_proj`	x_t `[b,10,32]`	`[b,10,1024]`
`posemb_sincos` + `time_mlp`	t `[b]`	adarms_cond `[b,1024]`
`embed_suffix` pi05 (pi0.py:140)	x_t, t	tokens `[b,10,1024]`, cond `[b,1024]`
`PaliGemma.llm` (MoE, 18 layers)	[`[b,≤712,2048]`, `[b,10,1024]`], mask `[b,N,N]`	(prefix `[b,≤712,2048]`, suffix `[b,10,1024]`)
`action_out_proj`	suffix_out[:,-10:] `[b,10,1024]`	v_t `[b,10,32]`
`compute_loss` (pi0.py:219)	v_t, u_t=noise−a `[b,10,32]`	per-elem MSE `[b,10]`
`sample_actions` (pi0.py:237)	noise `[b,10,32]`	actions `[b,10,32]`

Invariant action_dim=32 is the padded model dim, fixed by the pi05_base contract. The real env dim (LIBERO 7, yubi bimanual 14) is zero-padded up to 32 by PadStatesAndActions (config.py:124) and sliced back at the env boundary. state and both action projections all carry the padded 32.

04Flow-matching objective

Conditional flow matching^[4] over the action chunk. Convention: t=1 is noise, t=0 is the data (note: opposite of the π0 paper).

# compute_loss (pi0.py:219)
noise = N(0, I)                          # [b,10,32]
t     = Beta(1.5, 1) * 0.999 + 0.001     # [b]   (biases samples toward t->0/data)
x_t   = t*noise + (1 - t)*actions        # [b,10,32]  interpolant
u_t   = noise - actions                  # [b,10,32]  target velocity
v_t   = model._predict_velocity(obs, x_t, t)         # [b,10,32]
loss  = mean( (v_t - u_t)^2 )            # over action_dim -> [b,10]

# sample_actions (pi0.py:237): Euler integration, num_steps=10
x = noise                                # [b,10,32],  t = 1.0,  dt = -1/num_steps
# prefix KV-cache filled once (vision+language); suffix re-run each step
while t >= -dt/2:
    v_t = action_out_proj( llm([None, suffix(x,t)], kv_cache=prefix)[-10:] )
    x   = x + dt * v_t ;  t = t + dt
return x                                  # [b,10,32]  ~ clean action chunk

05Implementation details — code anchors

Component	file:line	Role
`Pi0` model	`third_party/openpi/.../models/pi0.py:66`	top-level module; builds img tower + dual-expert llm + projections
`__init__` (module wiring)	`pi0.py:67`	`action_in_proj` / `action_out_proj` / `time_mlp` (pi05) or `state_proj`+`action_time_mlp` (pi0)
`embed_prefix`	`pi0.py:106`	per-cam SigLIP + language embed → prefix tokens (bidirectional)
`embed_suffix`	`pi0.py:140`	action_in_proj + timestep; pi05 adaRMS branch vs pi0 state/time-MLP branch
`_predict_velocity`	`pi0.py:189`	one joint prefix+suffix forward → v_t (shared by loss + distillation)
`compute_loss`	`pi0.py:219`	samples x_t/t, returns flow-matching MSE
`sample_actions`	`pi0.py:237`	Euler integration with prefix KV-cache
`make_attn_mask`	`pi0.py:19`	prefix-LM + causal block mask; `image_mask=False` drops a view
`Pi0Config` / `get_freeze_filter`	`pi0_config.py:19 / 79`	dims, variants, `max_token_len`; LoRA freeze filter
`gemma.get_config`	`gemma.py:58`	gemma_2b (w2048) / gemma_300m (w1024); +_lora variants
`siglip.Module`	`siglip.py:293`	So400m/14, patch_size 14, pool none
`Observation` / `IMAGE_RESOLUTION`	`model.py:83 / 47`	input dataclass; 224×224
VLA-Zoo config	`configs/model/pi05.yaml`	mirrors upstream `pi05_libero` (config.py:751)

06Ablation knobs (GitHub issues)

Each parameter ablated across the issue tracker, mapped to the architecture: the module/tensor it touches, the range tested, the governing issue, and what it isolates. Results live in the issues themselves.

Knob	Module / tensor	Range	Issue	Isolates
action_horizon H	`embed_suffix` length & `action_out_proj` slice → output `[b,H,32]`	10 (libero) · 16 · 32 · 50	#6 #9 #12 #4	replan frequency vs temporal smoothness; long H over-runs short/low-fps episodes → boundary clamp-padding (#4)
action_dim padding	`action_in_proj` in / `action_out_proj` out / `state`	real 7 (libero), 14 (yubi) → padded 32	#2 #8	whether padded/real channels carry live signal (dead left-hand stream)
LoRA vs full-FT	`paligemma_variant` + `action_expert_variant`; `get_freeze_filter`	gemma_2b / _lora (r16,α16) · gemma_300m / _lora (r32,α32)	#7 #9	~1.5% trainable adapter vs 100% backbone — adaptation capacity vs compute/overfit
action-head pretrain	`weight_loader`: full `pi05_base` vs VLM-only (action expert reinit)	pretrained · from-scratch head	#3	does the flow-matching action head need pretrained init, or learn from the VLM alone?
view masking	`images` / `image_masks` dict; in-loss mask + curriculum	full · view_dropout (zeros) · view_attn_drop (mask=False) · distill	#5 #10	modality availability & how a view is removed — see pi05-wrist-only.html
fps / codec (data)	upstream of the model: frames backing each H-step chunk	fps05 · fps10 · fps30 · AV1 / h264	#1 #2 #4 #8	temporal density, decode speed, and chunk coverage vs episode length
multi-task vs per-task	same arch; training data mixture (`fix_actions_tf` / `frame_source_table`)	1 multi-task · N per-task specialists · ah16/ah32	#12	specialization vs generalization at fixed capacity

Insight Most ablation knobs touch the action expert side (action_horizon, action_dim, the action-head init) or the input contract (views, fps) — the 2B PaliGemma backbone and its 3-slot image interface stay fixed. LoRA-vs-full-FT is the one knob that changes which parameters of the backbone move at all.

07References

[1] Black et al. π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence, 2024. arXiv:2410.24164
[2] Physical Intelligence. π0.5: a VLA with Open-World Generalization. 2025. arXiv:2504.16054
[3] Beyer et al. PaliGemma: A versatile 3B VLM for transfer. 2024. arXiv:2407.07726 · SigLIP: Zhai et al. arXiv:2303.15343
[4] Lipman et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
[5] Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
[6] Liu et al. LIBERO. NeurIPS 2023. arXiv:2306.03310

Citation note The dual-expert ("mixture-of-transformers") + flow-matching action head is the π0 design [1]; π0.5 [2] swaps the state token + time-MLP for discrete-state prompt input + adaRMS time conditioning. arXiv IDs should be verified before lifting into a paper.

Internal

Issues: #3 (head pretrain) · #4 (low-fps clamp) · #5 (wrist-only) · #6 (action_horizon) · #7 (LoRA vs full-FT) · #9 (fps30 grid) · #10 (distillation) · #12 (multi vs per-task) · #2/#8 (data audits)
Config: configs/model/pi05.yaml · upstream third_party/openpi/src/openpi/training/config.py:751 (pi05_libero)
Companion report: pi05-wrist-only.html (view-masking ablations)

00Summary

01Architecture

Dual-expert Gemma (mixture-of-transformers)

SigLIP image tower

Image input contract — how many views?

π0 vs π0.5 — what the pi05 flag changes

02Model diagram

03Tensor IO — per module

04Flow-matching objective

05Implementation details — code anchors

06Ablation knobs (GitHub issues)

07References

Internal

π0 vs π0.5 — what the `pi05` flag changes