Overview

π0.5 (pi0.5) is a vision-language-action (VLA) model from Physical Intelligence, jointly trained on robot demonstration data and large-scale multimodal data. It can perform long-horizon tasks in real-world open environments it has not seen during training. This page covers ws1, i.e. world_size=1: a single rank with no distributed setup. Everything here targets this single-GPU configuration. The entry point is PI05WS1Scheduler.

Architecture

PhyAI decomposes pi0.5 inference into four cooperating components:

phyai/src/phyai/models/pi05/
    main_pi05.py
    scheduler_ws1_pi05.py
    model_runner_pi05.py
    modeling_pi05.py
    configuration_pi05.py
    img_preprocess_pi05.py
    tokenization_pi05.py

The diagram below illustrates how PhyAI’s three model runners cooperate with the scheduler, and how the engine bootstrap hands off to scheduler.setup() and scheduler.step().

Figure: PhyAI Engine ↔ Scheduler ↔ 3 Runners lifecycle

Running pi0.5

Get the weights

Prepare a pi05_base safetensors checkpoint. You can download it from Hugging Face:

https://huggingface.co/lerobot/pi05_base

Construct the engine

The plugin name is "pi05". The engine handles setup, weight loading, and graph capture in one shot.

import torch
from pathlib import Path
from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.pi05.main_pi05 import PI05Args

engine = Engine(
    EngineArgs(
        plugin="pi05",
        plugin_args=PI05Args(
            checkpoint_dir=Path("/path/to/pi05_base/"),
            max_batch_size=4,
        ),
        config=EngineConfig(
            device=DeviceConfig(target="cuda", params_dtype=torch.bfloat16),
            runtime=RuntimeConfig(use_cuda_graph=True),
        ),
    )
)

max_batch_size fixes the batch dimension of the captured CUDA graph. Pick it based on the largest batch you’ll submit; smaller batches are padded internally.

CUDA graph batch bucketing is not enabled when world_size=1 (ws1).

Build a request

PI05Request carries the per-step inference inputs:

| Field | Shape / dtype | Notes |
| --- | --- | --- |
| `pixel_values` | `(B, 3, 3, H, W)` | 3 cameras × 3 channels per robot, `H = W = image_size` |
| `input_ids` | `(B, tokenizer_max_length)` int64 | Right-padded with zeros |
| `lang_lens` | `(B,)` int64 | Real (un-padded) length per sample |
| `noise` | `(B, chunk_size, max_action_dim)` or `None` | Optional; when `None`, the scheduler samples fresh Gaussian noise internally |

B can be any value in [1, max_batch_size]. Build the tensors on the engine’s device; the scheduler validates shapes and raises immediately on mismatch.
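The `input_ids` / `lang_lens` convention from the table can be sketched without any PhyAI imports. The helper below is hypothetical (`pad_prompts` is not part of PhyAI) and uses plain Python lists in place of tensors. It assumes a pad id of 0, as the table states, and it assumes that over-long prompts should be rejected rather than truncated.

```python
from typing import List, Tuple


def pad_prompts(
    token_ids: List[List[int]], tokenizer_max_length: int, max_batch_size: int
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad each prompt with zeros and record its real length."""
    batch = len(token_ids)
    if not 1 <= batch <= max_batch_size:
        raise ValueError(f"batch size {batch} not in [1, {max_batch_size}]")
    input_ids, lang_lens = [], []
    for ids in token_ids:
        if not 0 < len(ids) <= tokenizer_max_length:
            raise ValueError(f"prompt length {len(ids)} out of range")
        # Zeros fill the tail so every row has tokenizer_max_length entries.
        input_ids.append(ids + [0] * (tokenizer_max_length - len(ids)))
        lang_lens.append(len(ids))
    return input_ids, lang_lens


# Toy token ids; a real prompt would come from the pi0.5 tokenizer.
input_ids, lang_lens = pad_prompts(
    [[2, 17, 9], [2, 5]], tokenizer_max_length=6, max_batch_size=4
)
```

Wrapping both results with `torch.tensor(..., dtype=torch.int64, device=device)` would give tensors of shape `(B, tokenizer_max_length)` and `(B,)` as listed in the table.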

Step the engine

actions = engine.step(request)  # (actual_B, chunk_size, max_action_dim)

The padding is sliced off before returning, so the tensor you get back has a leading dimension equal to the real batch size.
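The pad-then-slice behaviour can be pictured with a toy stand-in. This is only a sketch of the pattern described here, not the scheduler's code: `run_fixed_batch` and `fake_model` are hypothetical names, rows are plain lists instead of tensors, and zero-filled rows are an assumed padding value.

```python
from typing import Callable, List

Row = List[float]


def run_fixed_batch(
    rows: List[Row], max_batch_size: int, model: Callable[[List[Row]], List[Row]]
) -> List[Row]:
    """Pad `rows` to exactly `max_batch_size`, run `model`, drop padded outputs."""
    actual_b = len(rows)
    if not 1 <= actual_b <= max_batch_size:
        raise ValueError(f"batch size {actual_b} not in [1, {max_batch_size}]")
    width = len(rows[0])
    padded = rows + [[0.0] * width for _ in range(max_batch_size - actual_b)]
    out = model(padded)    # the model always sees max_batch_size rows
    return out[:actual_b]  # only the real samples are returned


def fake_model(batch: List[Row]) -> List[Row]:
    # Stand-in for a graph captured at a fixed batch size of 4.
    assert len(batch) == 4
    return [[2.0 * x for x in row] for row in batch]


actions = run_fixed_batch([[1.0, 2.0], [3.0, 4.0]], max_batch_size=4, model=fake_model)
```

Here two real rows go in, `fake_model` runs on four, and two rows come back.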

Close the engine

engine.close()

This releases the scheduler’s buffers and tears down the captured CUDA graphs.

End-to-end example

examples/pi05/run_pi05.py exercises the full path with deterministic dummy inputs at max_batch_size ∈ {1, 4} and includes a multi-batch equivalence check. To run it:

uv run python examples/pi05/run_pi05.py --checkpoint /path/to/pi0.5

The script prints per-phase latency stats (mean / median / std / min / max over 3 warmups + 30 timed runs) and a PASS line for the equivalence check. Replace the path after --checkpoint with your local checkpoint path.
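For readers who want similar numbers around their own calls, the warmup-then-measure scheme can be sketched with the standard library. `time_phase` is a hypothetical helper, not taken from run_pi05.py, and the use of the sample standard deviation is an assumption about what "std" means there.

```python
import statistics
import time
from typing import Callable, Dict


def time_phase(
    fn: Callable[[], object], warmups: int = 3, runs: int = 30
) -> Dict[str, float]:
    """Call `fn` `warmups` times untimed, then `runs` times timed; stats in ms."""
    for _ in range(warmups):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "std": statistics.stdev(samples),  # sample standard deviation
        "min": min(samples),
        "max": max(samples),
    }


# CPU-only placeholder workload; swap in your own callable.
stats = time_phase(lambda: sum(range(10_000)))
print({name: round(value, 4) for name, value in stats.items()})
```

When the callable launches GPU work, call `torch.cuda.synchronize()` inside it before returning; CUDA kernels run asynchronously, so the wall-clock reading would otherwise capture only the launch time.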

Current limitations

- Single GPU only. Tensor parallel, continuous batching, and preemption are all out of scope for PI05WS1Scheduler.
- max_batch_size is fixed at engine construction. To change it, you must tear down and rebuild the engine.
- The vision tower replays sequentially per real robot; it doesn’t batch along the camera dimension.
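Because max_batch_size is fixed, a caller holding more samples than that has to split them across several steps. One possible approach is sketched below; `chunk_for_engine` is a hypothetical helper and the string samples are placeholders for real per-robot inputs.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunk_for_engine(samples: Sequence[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `max_batch_size` samples."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(samples), max_batch_size):
        yield list(samples[start:start + max_batch_size])


# Six placeholder samples against an engine built with max_batch_size=4.
batches = list(
    chunk_for_engine(["s0", "s1", "s2", "s3", "s4", "s5"], max_batch_size=4)
)
```

Each yielded group would then be turned into one request and passed to a separate `engine.step` call; the last group is simply smaller and is padded internally.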

Full example

from pathlib import Path

import torch

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.pi05.configuration_pi05 import PI05Config
from phyai.models.pi05.main_pi05 import PI05Args
from phyai.models.pi05.scheduler_ws1_pi05 import PI05Request
from phyai.utils import load_config

CHECKPOINT_DIR = Path("/path/to/pi05_base/")  # change to your local checkpoint folder
BATCH_SIZE = 1

cfg = load_config(CHECKPOINT_DIR, PI05Config)
device = torch.device("cuda")
dtype = torch.bfloat16

# 1. Construct the Engine — runs setup, weight loading, and CUDA graph capture in one shot.
engine = Engine(
    EngineArgs(
        plugin="pi05",
        plugin_args=PI05Args(
            checkpoint_dir=CHECKPOINT_DIR,
            max_batch_size=BATCH_SIZE,
        ),
        config=EngineConfig(
            device=DeviceConfig(target="cuda", params_dtype=dtype),
            runtime=RuntimeConfig(use_cuda_graph=True),
        ),
    )
)
try:
    # 2. Build a dummy request: random pixels + single-token prompt.
    input_ids = torch.zeros(
        BATCH_SIZE, cfg.tokenizer_max_length, dtype=torch.int64, device=device
    )
    input_ids[:, 0] = 2  # any non-pad token id
    request = PI05Request(
        pixel_values=torch.rand(
            BATCH_SIZE,
            3,
            3,
            cfg.vision.image_size,
            cfg.vision.image_size,
            dtype=dtype,
            device=device,
        ),
        input_ids=input_ids,
        lang_lens=torch.ones(BATCH_SIZE, dtype=torch.int64, device=device),
    )

    # 3. Run one inference step.
    actions = engine.step(request)
    print(f"action chunk shape={tuple(actions.shape)}, dtype={actions.dtype}")
finally:
    # 4. Release scheduler buffers and tear down captured CUDA graphs.
    engine.close()
