Overview

pi0 is a vision-language-action model that combines a PaliGemma-style image and language prefix with a Gemma action expert. In PhyAI, the ws1 path runs the full end-to-end inference loop on one GPU: encode cameras and task text, cache the prefix, condition the expert on robot state, and integrate the action chunk with flow matching. This page describes the single-card implementation. There is no tensor parallelism, no continuous batching, and no preemption in PI0WS1Scheduler.

pi0 differs from pi0.5 in how robot state enters the model. pi0 keeps state as a numeric expert-side token. pi0.5 folds discretized state bins into the language prompt.
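The contrast between the two state paths can be illustrated with a small sketch. This is not PhyAI code: `pad_state`, `discretize_state`, the zero padding, the bin count of 256, and the `[-1, 1]` range are all assumptions chosen for illustration.

```python
from typing import List


def pad_state(state: List[float], max_state_dim: int = 32) -> List[float]:
    """pi0 style: keep the state numeric and pad it to a fixed width.

    Zero padding is an assumption made for this sketch.
    """
    if len(state) > max_state_dim:
        raise ValueError("state is wider than max_state_dim")
    return list(state) + [0.0] * (max_state_dim - len(state))


def discretize_state(
    state: List[float],
    num_bins: int = 256,  # hypothetical bin count
    low: float = -1.0,    # hypothetical value range
    high: float = 1.0,
) -> List[int]:
    """pi0.5 style: map each value to an integer bin that can be written into the prompt."""
    bins = []
    for value in state:
        clipped = min(max(value, low), high)
        # Scale [low, high] onto [0, num_bins] and clamp the top edge into the last bin.
        index = int((clipped - low) / (high - low) * num_bins)
        bins.append(min(index, num_bins - 1))
    return bins


state = [0.0, 0.5, -1.0]
padded = pad_state(state)        # numeric vector of width 32 for the expert
bins = discretize_state(state)   # integer bins that would be rendered as prompt text
```

In the first case the expert receives a numeric vector; in the second the state has become discrete symbols before the language stack ever sees it.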

Architecture

PhyAI uses the same layout for pi0 as for its other model integrations. The pi0 path is split across configuration, model modules, runners, and a scheduler:

phyai/src/phyai/models/pi0/
    main_pi0.py
    scheduler_ws1_pi0.py
    model_runner_pi0.py
    modeling_pi0.py
    configuration_pi0.py

Component	Responsibility
`PI0Entry`	Registers the `"pi0"` engine plugin, builds `PI0Model`, loads weights, and creates the scheduler
`PI0Config`	Stores vision, text, expert, action chunk, tokenizer, and camera-count geometry
`PI0Model`	Owns the SigLIP/PaliGemma vision tower, PaliGemma text stack, Gemma expert stack, RoPE, and action/time heads
`PI0VisionRunner`	Runs the vision tower, with optional CUDA graph capture
`PI0LLMRunner`	Runs the PaliGemma prefix pass and writes prefix K/V into the shared cache
`PI0ExpertRunner`	Runs the expert state/action passes for each flow-matching step
`PI0WS1Scheduler`	Orchestrates one complete inference request on a single GPU

Model layout

PI0Model is built from three major stacks:

Stack	Default shape	Notes
Vision	SigLIP, 27 layers, 224×224 images, 14×14 patches	Produces image tokens projected into the PaliGemma text width
Text	PaliGemma/Gemma, 18 layers, hidden size 2048	Processes image + language prefix and writes prefix K/V
Expert	Gemma action expert, 18 layers, hidden size 1024	Processes one state token plus the full action chunk

The top-level config defaults are:

Field	Default	Meaning
`chunk_size`	`50`	Number of action tokens returned per engine step
`max_state_dim`	`32`	Padded robot-state width
`max_action_dim`	`32`	Padded action width
`num_inference_steps`	`10`	Flow-matching Euler steps
`tokenizer_max_length`	`48`	Right-padded PaliGemma task prompt length
`empty_cameras`	`0`	`num_images = 3 - empty_cameras`; pi0 supports 2 or 3 cameras
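The defaults above imply fixed per-sample token counts. The following sketch works through the arithmetic. It assumes the vision tower emits one token per patch with no pooling and that the language part of the prefix is bounded by `tokenizer_max_length`; `token_geometry` is a hypothetical helper, not part of PhyAI.

```python
def token_geometry(
    image_size: int = 224,
    patch_size: int = 14,
    empty_cameras: int = 0,
    tokenizer_max_length: int = 48,
    chunk_size: int = 50,
) -> dict:
    """Derive per-sample token counts from the documented defaults."""
    num_images = 3 - empty_cameras
    if num_images not in (2, 3):
        raise ValueError("pi0 supports 2 or 3 cameras")
    patches_per_side = image_size // patch_size              # 224 // 14 = 16
    tokens_per_image = patches_per_side * patches_per_side   # assumption: one token per patch
    image_tokens = num_images * tokens_per_image
    return {
        "num_images": num_images,
        "tokens_per_image": tokens_per_image,
        # Upper bound: all image tokens plus a full-length task prompt.
        "max_prefix_len": image_tokens + tokenizer_max_length,
        # One state token followed by the action chunk.
        "suffix_len": 1 + chunk_size,
    }


geometry = token_geometry()
print(geometry)
```

Under these assumptions, three cameras give 768 image tokens and a prefix of at most 816 tokens per sample; with one empty camera the bound drops to 560.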

The model uses params_dtype for the language and expert stacks. The vision tower has a separate vision_params_dtype, which defaults to fp32 for reference parity. Set PI0Args(vision_params_dtype=torch.bfloat16) only when you intentionally want bf16 vision execution.

Request contract

PI0Request is the scheduler’s canonical input:

Field	Shape	Notes
`pixel_values`	`(B, num_images, 3, image_size, image_size)`	Already resized and normalized camera tensors
`input_ids`	`(B, tokenizer_max_length)` int64	Right-padded PaliGemma token ids
`lang_lens`	`(B,)` int64	Real task-prompt length for each sample
`state`	`(B, max_state_dim)`	Numeric robot state, padded before the expert
`noise`	`(B, chunk_size, max_action_dim)` or `None`	Optional initial action noise; when `None`, the scheduler samples Gaussian noise

B can be any integer in [1, max_batch_size]. The scheduler pads smaller batches to max_batch_size internally and slices the result back to the actual batch size (actual_B) before returning.
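The pad-then-slice behaviour can be sketched in plain Python. This is a minimal illustration, not the scheduler's code: `pad_batch`, `run_padded`, the filler value, and `step_fn` are hypothetical stand-ins, and the real scheduler works on tensors rather than lists.

```python
from typing import Callable, List, Sequence, Tuple


def pad_batch(samples: Sequence, max_batch_size: int, filler) -> Tuple[List, int]:
    """Pad a batch up to max_batch_size and remember the real size."""
    actual_b = len(samples)
    if not 1 <= actual_b <= max_batch_size:
        raise ValueError(f"batch size {actual_b} is outside [1, {max_batch_size}]")
    padded = list(samples) + [filler] * (max_batch_size - actual_b)
    return padded, actual_b


def run_padded(samples: Sequence, max_batch_size: int, filler, step_fn: Callable) -> List:
    """Run step_fn on a fixed-size batch and return only the real rows."""
    padded, actual_b = pad_batch(samples, max_batch_size, filler)
    outputs = step_fn(padded)  # step_fn always sees max_batch_size rows
    return outputs[:actual_b]


seen_sizes = []


def toy_step(batch):
    # Stand-in for the model: records the batch size it saw and doubles each row.
    seen_sizes.append(len(batch))
    return [2 * x for x in batch]


result = run_padded([1, 2, 3], max_batch_size=4, filler=0, step_fn=toy_step)
```

Keeping the executed shape constant is what lets a shape-fixed capture be reused for any real batch size up to the maximum.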

Scheduler phases

One engine.step(request) maps to the following scheduler phases:

Phase	Work
`pi0.vision_loop`	Move camera tensors to the vision dtype and run `PI0VisionRunner` once per real batch item
`pi0.lang_pack`	Embed language ids, then pack image tokens and language tokens into the per-sample prefix buffer
`pi0.llm_prefix_plan`	Reset static caches and prepare ragged prefix attention metadata
`pi0.llm_prefix_fwd`	Run the PaliGemma text stack and write prefix K/V into `KVCachePool`
`pi0.expert_plan`	Prepare state and action expert attention metadata over prefix + suffix slots
`pi0.expert_loop`	Initialize or copy action noise and run flow-matching integration
`pi0.expert_step`	One expert velocity prediction and Euler update inside `pi0.expert_loop`
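The `pi0.expert_loop` and `pi0.expert_step` phases amount to fixed-step Euler integration of a predicted velocity field. The sketch below shows that loop on plain Python lists. The time direction (starting from noise at t = 1 and stepping to t = 0) is an assumption, and `velocity_fn` is a hypothetical stand-in for the expert forward pass.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow_matching(
    noise: Vector,
    velocity_fn: Callable[[Vector, float], Vector],
    num_inference_steps: int = 10,
) -> Vector:
    """Integrate x_t from noise to actions with fixed-step Euler updates."""
    x_t = list(noise)
    dt = -1.0 / num_inference_steps  # assumption: time runs from 1 down to 0
    t = 1.0
    for _ in range(num_inference_steps):
        v_t = velocity_fn(x_t, t)                     # one velocity prediction
        x_t = [x + dt * v for x, v in zip(x_t, v_t)]  # one Euler update
        t += dt
    return x_t


# Toy check: for a straight path x_t = t * noise + (1 - t) * target,
# the true velocity is the constant (noise - target).
target = [0.25, -0.5, 1.0]
noise = [1.0, 2.0, -3.0]


def toy_velocity(x_t: Vector, t: float) -> Vector:
    return [n - a for n, a in zip(noise, target)]


actions = euler_flow_matching(noise, toy_velocity)
```

With an exact constant velocity the integration recovers `target` up to floating-point error; a learned expert only approximates this field, which is why the number of steps is configurable.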

The prefix tokens are cached once per request. The expert then attends over:

state query  -> prefix + state
action query -> prefix + state + action chunk

This is why pi0’s suffix length is 1 + chunk_size: one state token followed by the action tokens.
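The visibility pattern above can be written out as a dense boolean mask for a single sample. This is only an illustration: the scheduler stages attention metadata through its attention backend rather than building a mask like this, and `suffix_attention_mask` is a hypothetical helper.

```python
from typing import List


def suffix_attention_mask(prefix_len: int, chunk_size: int = 50) -> List[List[bool]]:
    """Rows are suffix queries (state, then actions); columns are prefix + suffix keys."""
    suffix_len = 1 + chunk_size
    total_keys = prefix_len + suffix_len
    # Keys 0..prefix_len-1 are the cached prefix; key prefix_len is the state token.
    state_row = [key <= prefix_len for key in range(total_keys)]
    # Each action query sees the prefix, the state token, and the whole action chunk.
    action_rows = [[True] * total_keys for _ in range(chunk_size)]
    return [state_row] + action_rows


mask = suffix_attention_mask(prefix_len=4, chunk_size=3)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

The first printed row stops at the state key, while every action row covers all keys, matching the two lines of the diagram.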

CUDA graphs

When RuntimeConfig(use_cuda_graph=True) is set, the pi0 runners capture CUDA graphs during scheduler.setup():

Runner	Captured shape
`PI0VisionRunner`	`(num_images, 3, image_size, image_size)`
`PI0LLMRunner`	`(max_batch_size * n_per_sample, text_hidden_size)`
`PI0ExpertRunner`	`state`, `x_t`, and `time` buffers at fixed `max_batch_size`

During scheduler.step(), the runners update static graph input buffers and replay the captured graphs. Attention metadata is staged outside the captured region through the attention backend’s capture-aware metadata buffers.

Disable CUDA graphs when you want the Nsight Systems trace to show the individual kernel launches instead of graph replays:

uv run python benchmark/bench_n_batch_ws1_pi0.py \
    --batch-sizes 4 \
    --no-cuda-graph

Running pi0

Prepare weights

Prepare a Hugging Face-style pi0 PyTorch checkpoint directory containing config.json and model.safetensors. For a random-weight smoke test, run examples/pi0/run_pi0.py without --checkpoint.

Construct the engine

The plugin name is "pi0". Constructing the engine handles plugin setup, optional weight loading, runner setup, and CUDA graph capture.

import torch
from pathlib import Path

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.pi0.main_pi0 import PI0Args

engine = Engine(
    EngineArgs(
        plugin="pi0",
        plugin_args=PI0Args(
            checkpoint_dir=Path("/path/to/pi0_pytorch"),
            max_batch_size=4,
            vision_params_dtype=torch.float32,
        ),
        config=EngineConfig(
            device=DeviceConfig(target="cuda", params_dtype=torch.bfloat16),
            runtime=RuntimeConfig(use_cuda_graph=True),
        ),
    )
)

max_batch_size fixes the captured graph shapes. Rebuild the engine if you need a different maximum batch.

Build a request

Use PI0Processor to convert raw robot observations into model-ready tensors. The processor lives outside the engine in phyai-utils-tools.

from phyai.models.pi0.scheduler_ws1_pi0 import PI0Request
from phyai_utils_tools.models.pi0 import PI0Processor

processor = PI0Processor(
    image_size=224,
    num_channels=3,
    num_images=3,
    tokenizer_max_length=48,
    max_state_dim=32,
    action_dim=7,
    device="cuda",
    params_dtype=torch.bfloat16,
)

# cam0, cam1, cam2, and state are placeholders for your raw camera frames and robot state.
processed = processor.preprocess(
    {
        "images": [cam0, cam1, cam2],
        "task": ["pick up the object"],
        "state": state,
    }
)

request = PI0Request(
    pixel_values=processed.pixel_values,
    input_ids=processed.input_ids,
    lang_lens=processed.lang_lens,
    state=processed.state,
)

Run one step

actions = engine.step(request)  # (B, chunk_size, max_action_dim)

If you constructed a processor with action_dim, call processor.postprocess(actions) to slice the padded action width and unnormalize actions when dataset stats are available.
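Conceptually, postprocessing undoes the padding and normalization applied on the way in. The sketch below shows one way this could look on nested lists. It is not the `PI0Processor.postprocess` implementation: the function name, the mean/std form of the dataset stats, and the list-based layout are assumptions.

```python
from typing import List, Optional, Sequence

Actions = List[List[List[float]]]  # (B, chunk_size, width)


def postprocess_actions(
    actions: Actions,
    action_dim: int,
    mean: Optional[Sequence[float]] = None,  # hypothetical per-dimension stats
    std: Optional[Sequence[float]] = None,
) -> Actions:
    """Slice the padded action width and optionally undo mean/std normalization."""
    result = []
    for chunk in actions:
        out_chunk = []
        for step in chunk:
            real = list(step[:action_dim])  # drop the padded dimensions
            if mean is not None and std is not None:
                real = [v * s + m for v, m, s in zip(real, mean, std)]
            out_chunk.append(real)
        result.append(out_chunk)
    return result


padded_actions = [[[1.0, 2.0, 0.0, 0.0], [0.0, -2.0, 0.0, 0.0]]]  # B=1, 2 steps, width 4
robot_actions = postprocess_actions(
    padded_actions, action_dim=2, mean=[10.0, 20.0], std=[2.0, 0.5]
)
```

When no stats are passed, the function only slices, which mirrors the "when dataset stats are available" condition described above.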

Close the engine

engine.close()

End-to-end example

examples/pi0/run_pi0.py exercises both raw and processor-backed request paths:

uv run python examples/pi0/run_pi0.py \
    --checkpoint /path/to/pi0_pytorch \
    --batch-size 1

For a random-weight smoke test, omit --checkpoint:

uv run python examples/pi0/run_pi0.py --raw --batch-size 1

Use --num-images 2 when your checkpoint uses one empty camera:

uv run python examples/pi0/run_pi0.py \
    --checkpoint /path/to/pi0_pytorch \
    --num-images 2

Benchmarking and profiling

benchmark/bench_n_batch_ws1_pi0.py sweeps batch sizes and can open a tight profile window for Nsight Systems:

uv run python benchmark/bench_n_batch_ws1_pi0.py \
    --batch-sizes 1 2 4 \
    --n-warmup 5 \
    --n-timed 30 \
    --result-file ./pi0_ws1_results.jsonl

Nsight Systems capture:

nsys profile \
    --capture-range=cudaProfilerApi \
    --capture-range-end=stop \
    -o ./prof/pi0_ws1 \
    uv run python benchmark/bench_n_batch_ws1_pi0.py \
        --batch-sizes 4 \
        --profile-backend nsys \
        --profile-start-step 5 \
        --profile-num-steps 3

Set --vision-dtype bfloat16 only when you intentionally want bf16 vision timing. The default keeps the vision tower in fp32.

Current limitations

This path is single-GPU only.
max_batch_size is fixed at engine construction.
The vision tower is replayed once per real batch item.
The scheduler expects already preprocessed tensors. Image resize, tokenization, state padding, and action unnormalization belong to PI0Processor.
CUDA graph capture is shape-fixed. To change the camera count, image size, tokenizer length, or maximum batch size, rebuild the engine.

Full example

from pathlib import Path

import torch

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.pi0.configuration_pi0 import PI0Config
from phyai.models.pi0.main_pi0 import PI0Args
from phyai.models.pi0.scheduler_ws1_pi0 import PI0Request
from phyai.utils import load_config

CHECKPOINT_DIR = Path("/path/to/pi0_pytorch")
BATCH_SIZE = 1

cfg = load_config(CHECKPOINT_DIR, PI0Config)
device = torch.device("cuda")
dtype = torch.bfloat16

engine = Engine(
    EngineArgs(
        plugin="pi0",
        plugin_args=PI0Args(
            checkpoint_dir=CHECKPOINT_DIR,
            max_batch_size=BATCH_SIZE,
            vision_params_dtype=torch.float32,
        ),
        config=EngineConfig(
            device=DeviceConfig(target="cuda", params_dtype=dtype),
            runtime=RuntimeConfig(use_cuda_graph=True),
        ),
    )
)

try:
    input_ids = torch.zeros(
        BATCH_SIZE, cfg.tokenizer_max_length, dtype=torch.int64, device=device
    )
    # One real token per sample (lang_lens is 1 below); id 2 is assumed to be the Gemma BOS token.
    input_ids[:, 0] = 2

    request = PI0Request(
        pixel_values=torch.rand(
            BATCH_SIZE,
            cfg.num_images,
            cfg.vision.num_channels,
            cfg.vision.image_size,
            cfg.vision.image_size,
            dtype=torch.float32,
            device=device,
        ),
        input_ids=input_ids,
        lang_lens=torch.ones(BATCH_SIZE, dtype=torch.int64, device=device),
        state=torch.rand(BATCH_SIZE, cfg.max_state_dim, dtype=dtype, device=device),
    )

    actions = engine.step(request)
    print(f"action chunk shape={tuple(actions.shape)}, dtype={actions.dtype}")
finally:
    engine.close()
