Overview
pi0 is a vision-language-action model that combines a PaliGemma-style image and language prefix with a Gemma action expert. In PhyAI, the ws1 path runs the full end-to-end inference loop on one GPU: encode cameras and task text, cache the prefix, condition the expert on robot state, and integrate the action chunk with flow matching.
This page describes the single-card implementation. There is no tensor parallelism, no continuous batching, and no preemption in PI0WS1Scheduler.
pi0 differs from pi0.5 in how robot state enters the model. pi0 keeps state as a numeric expert-side token. pi0.5 folds discretized state bins into the language prompt.
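
The two state paths can be contrasted with a toy sketch. Everything below is assumed for illustration only: the helper names, the 256-bin count, the [-1, 1] normalized range, and the space-separated text format are hypothetical and are not taken from PhyAI.

```python
from typing import List

MAX_STATE_DIM = 32  # padded robot-state width (default listed on this page)


def pi0_state_token(state: List[float], max_state_dim: int = MAX_STATE_DIM) -> List[float]:
    """pi0 style: keep the state numeric and zero-pad it to a fixed width."""
    if len(state) > max_state_dim:
        raise ValueError("state is wider than max_state_dim")
    return list(state) + [0.0] * (max_state_dim - len(state))


def pi05_state_text(state: List[float], num_bins: int = 256) -> str:
    """pi0.5 style: discretize each value into a bin id and render it as text.

    Assumes values are already normalized to [-1, 1]; num_bins is hypothetical.
    """
    bins = []
    for value in state:
        clipped = min(max(value, -1.0), 1.0)
        # Map [-1, 1] onto integer bins 0 .. num_bins - 1.
        bins.append(min(int((clipped + 1.0) / 2.0 * num_bins), num_bins - 1))
    return " ".join(str(b) for b in bins)
```

In the first form the state stays a numeric vector that the expert can project into one token; in the second it becomes extra prompt text that the tokenizer handles like any other language input.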
Architecture
The pi0 integration follows the same layout as the other PhyAI model integrations. The pi0 path is split across configuration, model modules, runners, and a scheduler under phyai/src/phyai/models/pi0:
- main_pi0.py
- scheduler_ws1_pi0.py
- model_runner_pi0.py
- modeling_pi0.py
- configuration_pi0.py

| Component | Responsibility |
|---|---|
| PI0Entry | Registers the "pi0" engine plugin, builds PI0Model, loads weights, and creates the scheduler |
| PI0Config | Stores vision, text, expert, action chunk, tokenizer, and camera-count geometry |
| PI0Model | Owns the SigLIP/PaliGemma vision tower, PaliGemma text stack, Gemma expert stack, RoPE, and action/time heads |
| PI0VisionRunner | Runs the vision tower, with optional CUDA graph capture |
| PI0LLMRunner | Runs the PaliGemma prefix pass and writes prefix K/V into the shared cache |
| PI0ExpertRunner | Runs the expert state/action passes for each flow-matching step |
| PI0WS1Scheduler | Orchestrates one complete inference request on a single GPU |
Model layout
PI0Model is built from three major stacks:
| Stack | Default shape | Notes |
|---|---|---|
| Vision | SigLIP, 27 layers, 224×224 images, 14×14 patches | Produces image tokens projected into the PaliGemma text width |
| Text | PaliGemma/Gemma, 18 layers, hidden size 2048 | Processes image + language prefix and writes prefix K/V |
| Expert | Gemma action expert, 18 layers, hidden size 1024 | Processes one state token plus the full action chunk |

Key configuration defaults:

| Field | Default | Meaning |
|---|---|---|
| chunk_size | 50 | Number of action tokens returned per engine step |
| max_state_dim | 32 | Padded robot-state width |
| max_action_dim | 32 | Padded action width |
| num_inference_steps | 10 | Flow-matching Euler steps |
| tokenizer_max_length | 48 | Right-padded PaliGemma task prompt length |
| empty_cameras | 0 | num_images = 3 - empty_cameras; pi0 supports 2 or 3 cameras |

params_dtype sets the dtype of the language and expert stacks. The vision tower has a separate vision_params_dtype, which defaults to fp32 for reference parity. Set PI0Args(vision_params_dtype=torch.bfloat16) only when you intentionally want bf16 vision execution.
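
The defaults above imply fixed token counts, which the following sketch works out. Pi0Geometry is a hypothetical stand-in, not the real PI0Config class, and it assumes one image token per non-overlapping patch and a prefix made of all image tokens followed by the padded prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pi0Geometry:
    """Hypothetical stand-in for the geometry fields of PI0Config."""

    image_size: int = 224
    patch_size: int = 14
    chunk_size: int = 50
    tokenizer_max_length: int = 48
    empty_cameras: int = 0

    @property
    def num_images(self) -> int:
        n = 3 - self.empty_cameras
        if n not in (2, 3):
            raise ValueError("pi0 supports 2 or 3 cameras")
        return n

    @property
    def tokens_per_image(self) -> int:
        # Assumption: one token per non-overlapping patch.
        return (self.image_size // self.patch_size) ** 2

    @property
    def max_prefix_len(self) -> int:
        # Assumption: prefix = all image tokens + right-padded prompt.
        return self.num_images * self.tokens_per_image + self.tokenizer_max_length

    @property
    def suffix_len(self) -> int:
        # One state token followed by the action chunk.
        return 1 + self.chunk_size
```

Under these assumptions the default geometry gives 256 tokens per image, a maximum prefix of 816 tokens with three cameras (560 with two), and an expert suffix of 51 tokens.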
Request contract
PI0Request is the scheduler’s canonical input:
| Field | Shape | Notes |
|---|---|---|
| pixel_values | (B, num_images, 3, image_size, image_size) | Already resized and normalized camera tensors |
| input_ids | (B, tokenizer_max_length) int64 | Right-padded PaliGemma token ids |
| lang_lens | (B,) int64 | Real task-prompt length for each sample |
| state | (B, max_state_dim) | Numeric robot state, padded before the expert |
| noise | (B, chunk_size, max_action_dim) or None | Optional initial action noise; when None, the scheduler samples Gaussian noise |

B can be any value in [1, max_batch_size]. The scheduler pads smaller batches to max_batch_size internally and slices the result back to actual_B before returning.
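
The pad-then-slice behavior can be sketched with plain lists. The helper names are hypothetical, and zero-filled padding rows are an assumption; this page does not say what the scheduler writes into the padded slots.

```python
from typing import Callable, List, Sequence, Tuple

Batch = List[List[float]]


def pad_batch(rows: Sequence[Sequence[float]], max_batch_size: int) -> Tuple[Batch, int]:
    """Pad a batch with zero rows up to max_batch_size; return it with actual_B."""
    actual_b = len(rows)
    if not 1 <= actual_b <= max_batch_size:
        raise ValueError("B must be in [1, max_batch_size]")
    width = len(rows[0])
    padding = [[0.0] * width for _ in range(max_batch_size - actual_b)]
    return [list(r) for r in rows] + padding, actual_b


def run_padded(rows: Sequence[Sequence[float]], max_batch_size: int,
               fixed_shape_fn: Callable[[Batch], Batch]) -> Batch:
    """Run a fixed-batch function on a smaller batch and slice the result back."""
    padded, actual_b = pad_batch(rows, max_batch_size)
    out = fixed_shape_fn(padded)  # always sees max_batch_size rows
    return out[:actual_b]         # drop results for the padding rows
```

The caller only ever sees actual_B results, while the inner function always sees the same batch size, which is what keeps captured shapes stable.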
Scheduler phases
One engine.step(request) maps to the following scheduler phases:
| Phase | Work |
|---|---|
| pi0.vision_loop | Move camera tensors to the vision dtype and run PI0VisionRunner once per real batch item |
| pi0.lang_pack | Embed language ids, then pack image tokens and language tokens into the per-sample prefix buffer |
| pi0.llm_prefix_plan | Reset static caches and prepare ragged prefix attention metadata |
| pi0.llm_prefix_fwd | Run the PaliGemma text stack and write prefix K/V into KVCachePool |
| pi0.expert_plan | Prepare state and action expert attention metadata over prefix + suffix slots |
| pi0.expert_loop | Initialize or copy action noise and run flow-matching integration |
| pi0.expert_step | One expert velocity prediction and Euler update inside pi0.expert_loop |
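
Because prompts have different real lengths, the packed prefix is ragged across samples. One common way to describe a ragged batch to a variable-length attention kernel is a list of cumulative offsets; the sketch below shows that idea only. The function names are hypothetical, 256 tokens per image is an assumption, and the metadata format PhyAI actually uses is not described on this page.

```python
from itertools import accumulate
from typing import List


def prefix_lengths(lang_lens: List[int], num_images: int = 3,
                   tokens_per_image: int = 256) -> List[int]:
    """Real prefix length per sample: all image tokens plus the unpadded prompt."""
    return [num_images * tokens_per_image + n for n in lang_lens]


def cumulative_offsets(lengths: List[int]) -> List[int]:
    """Start offset of each sample in a flat packed buffer, plus the total."""
    return [0] + list(accumulate(lengths))


lengths = prefix_lengths([5, 12])       # two samples with 5 and 12 real prompt tokens
offsets = cumulative_offsets(lengths)   # sample i occupies offsets[i]:offsets[i + 1]
```
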
The expert suffix has 1 + chunk_size tokens: one state token followed by the action tokens.
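
The expert loop is a fixed-step Euler integration of a learned velocity field. The sketch below shows the shape of that loop on plain lists. It assumes the convention that t = 1 is pure noise and t = 0 is the final action chunk, and it replaces the expert with a toy velocity function; neither detail is confirmed by this page.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow_matching(noise: Vector,
                        velocity_fn: Callable[[Vector, float], Vector],
                        num_steps: int = 10) -> Vector:
    """Integrate from t = 1 (noise) to t = 0 (actions) with fixed Euler steps."""
    dt = -1.0 / num_steps
    x_t = list(noise)
    for step in range(num_steps):
        t = 1.0 + step * dt            # 1.0, 0.9, ..., 0.1 for ten steps
        v_t = velocity_fn(x_t, t)      # stand-in for one expert forward pass
        x_t = [x + dt * v for x, v in zip(x_t, v_t)]
    return x_t


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy field for the straight path x_t = t * noise + (1 - t) * target."""
    def velocity(x_t: Vector, t: float) -> Vector:
        return [(x - tgt) / t for x, tgt in zip(x_t, target)]
    return velocity


actions = euler_flow_matching([0.3, -1.2, 2.0], make_toy_velocity([1.0, 0.0, -0.5]))
```

With this toy field the path is a straight line, so Euler integration lands on the target up to floating-point error. A learned velocity field is only approximately straight, which is why the step count (num_inference_steps) is a quality and latency trade-off.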
CUDA graphs
When RuntimeConfig(use_cuda_graph=True) is set, the pi0 runners capture CUDA graphs during scheduler.setup():
| Runner | Captured shape |
|---|---|
| PI0VisionRunner | (num_images, 3, image_size, image_size) |
| PI0LLMRunner | (max_batch_size * n_per_sample, text_hidden_size) |
| PI0ExpertRunner | state, x_t, and time buffers at fixed max_batch_size |

During scheduler.step(), the runners update static graph input buffers and replay the captured graphs. Attention metadata is staged outside the captured region through the attention backend’s capture-aware metadata buffers.
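
The update-then-replay pattern can be illustrated without a GPU. The class below is a CPU-only analogy, not CUDA graph code and not a PhyAI API: it allocates an input buffer once, refuses inputs of a different size, and copies new data into the same buffer before re-running the recorded function.

```python
from typing import Callable, List


class ReplayableStep:
    """CPU-only analogy of capture-then-replay with a static input buffer."""

    def __init__(self, size: int, fn: Callable[[List[float]], List[float]]):
        self.static_input = [0.0] * size            # allocated once, at "capture" time
        self._fn = fn
        self.static_output = fn(self.static_input)  # warm-up run standing in for capture

    def replay(self, new_input: List[float]) -> List[float]:
        if len(new_input) != len(self.static_input):
            raise ValueError("shape is fixed at capture time; rebuild to change it")
        # Copy in place so the buffer object the "graph" reads from never changes.
        self.static_input[:] = new_input
        self.static_output = self._fn(self.static_input)
        return self.static_output
```

A real captured graph replays recorded kernels against fixed device addresses, which is why inputs must be copied into the existing buffers rather than swapped for new tensors, and why a shape change requires a new capture.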
Running pi0
Prepare weights
Prepare an HF-style pi0 PyTorch checkpoint directory with config.json and model.safetensors files. You can also omit --checkpoint for random-weight smoke tests.
Construct the engine
The plugin name is "pi0". The engine handles setup, optional weight loading, runner setup, and CUDA graph capture. max_batch_size fixes the captured graph shapes. Rebuild the engine if you need a different maximum batch.
Build a request
Use PI0Processor to convert raw robot observations into model-ready tensors. The processor lives outside the engine in phyai-utils-tools.
Run one step
Call engine.step(request) to run one inference step. When you need the real action_dim, call processor.postprocess(actions) to slice the padded action width and unnormalize actions when dataset stats are available.
End-to-end example
examples/pi0/run_pi0.py exercises both raw and processor-backed request paths.
Pass --checkpoint with a checkpoint directory to load real weights.
Pass --num-images 2 when your checkpoint uses one empty camera.
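
As a rough picture of what the postprocess step under "Run one step" does, the sketch below slices padded actions down to the real action width and optionally undoes normalization. It is not the PI0Processor implementation: the function name is hypothetical, and mean/std normalization is an assumption about the dataset stats.

```python
from typing import List, Optional, Sequence


def postprocess_actions(actions: Sequence[Sequence[float]], action_dim: int,
                        mean: Optional[Sequence[float]] = None,
                        std: Optional[Sequence[float]] = None) -> List[List[float]]:
    """Slice each padded action to action_dim, then undo mean/std normalization."""
    out = []
    for step in actions:
        real = list(step[:action_dim])  # drop the padded tail
        if mean is not None and std is not None:
            # Assumed inverse of (a - mean) / std normalization.
            real = [a * s + m for a, s, m in zip(real, std, mean)]
        out.append(real)
    return out
```

When no stats are passed, the function only slices, which mirrors the "when dataset stats are available" condition described above.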
Benchmarking and profiling
benchmark/bench_n_batch_ws1_pi0.py sweeps batch sizes and can open a tight profile window for Nsight Systems.
Use --vision-dtype bfloat16 only when you intentionally want bf16 vision timing. The default keeps the vision tower in fp32.
Current limitations
- This path is single-GPU only.
- max_batch_size is fixed at engine construction.
- The vision tower is replayed once per real batch item.
- The scheduler expects already preprocessed tensors. Image resize, tokenization, state padding, and action unnormalization belong to PI0Processor.
- CUDA graph capture is shape-fixed. Change camera count, image size, tokenizer length, or max batch by rebuilding the engine.

