# Single-GPU Inference for PI0.5

> How PhyAI runs pi0.5 inference on a single GPU

| Field              | Value                                             |
| ------------------ | ------------------------------------------------- |
| Checkpoint         | huggingface.co/lerobot/pi05_base                  |
| Tags               | VLA, flow-matching, PaliGemma, SigLIP, single-GPU |
| Image Input        | 3-camera RGB · 224×224                            |
| Tokenizer Length   | 200                                               |
| Entry Point        | `PI05WS1Scheduler`                                |
| Param Precision    | bf16                                              |
| Paper              | pi.website/blog/pi05                              |

# Overview

π0.5 is a vision-language-action (VLA) model from Physical Intelligence, jointly trained on robot demonstration data and large-scale multimodal data. It can perform long-horizon tasks in real-world open environments that it has not seen before.

This page focuses on `ws1`, i.e. `world_size=1`: a single rank with no distributed setup. Everything on this page targets this single-card configuration. The entry point is `PI05WS1Scheduler`.

Figure: PI0.5 model execution pipeline

# Architecture

PhyAI's engine + plugin contract decomposes pi0.5 inference into four cooperating components. The diagram below illustrates how PhyAI's three model runners cooperate with the scheduler, and how the engine bootstrap hands off to `scheduler.setup()` and `scheduler.step()`.

Figure: PhyAI Engine ↔ Scheduler ↔ 3 Runners lifecycle

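The hand-off between the engine and the scheduler can be pictured with a toy sketch. This is not PhyAI's code: `ToyEngine` and `ToyScheduler` are hypothetical classes that only mirror the call order. The method names `setup()` and `step()` come from this page; `scheduler.close()` is an assumption made for symmetry with `engine.close()`.

```python
# Hypothetical illustration of the engine -> scheduler hand-off.
# Neither class exists in PhyAI; they only mirror the call order.


class ToyScheduler:
    """Stands in for a plugin scheduler that owns the model runners."""

    def __init__(self):
        self.ready = False
        self.calls = []

    def setup(self):
        # A real scheduler would load weights and capture graphs here.
        self.calls.append("setup")
        self.ready = True

    def step(self, request):
        if not self.ready:
            raise RuntimeError("setup() must run before step()")
        self.calls.append("step")
        # Echo the request back; a real scheduler would return actions.
        return {"echo": request}

    def close(self):
        # Assumed teardown hook, named after engine.close().
        self.calls.append("close")
        self.ready = False


class ToyEngine:
    """Stands in for the engine: bootstraps once, then forwards each step."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.scheduler.setup()  # one-shot bootstrap at construction

    def step(self, request):
        return self.scheduler.step(request)

    def close(self):
        self.scheduler.close()


engine = ToyEngine(ToyScheduler())
try:
    result = engine.step("dummy request")
finally:
    engine.close()

print(engine.scheduler.calls)  # ['setup', 'step', 'close']
```

The point of the pattern is that expensive work happens once at construction, so each later `step()` call only forwards a request.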

# Running pi0.5

Prepare a `pi05_base` safetensors checkpoint. You can download it from Hugging Face:

```
https://huggingface.co/lerobot/pi05_base
```

The plugin name is `"pi05"`. The engine handles setup, weight loading, and graph capture in one shot.

```python
import torch
from pathlib import Path

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.pi05.main_pi05 import PI05Args

engine = Engine(
    EngineArgs(
        plugin="pi05",
        plugin_args=PI05Args(
            checkpoint_dir=Path("/path/to/pi05_base/"),
            max_batch_size=4,
        ),
        config=EngineConfig(
            device=DeviceConfig(target="cuda", params_dtype=torch.bfloat16),
            runtime=RuntimeConfig(use_cuda_graph=True),
        ),
    )
)
```

`max_batch_size` fixes the batch dimension of the captured graph. Pick it based on the largest batch you will submit; smaller batches are padded internally. The CUDA graph batch bucketing optimization is not enabled when `world_size=1`.

`PI05Request` carries the per-step inference inputs:

| Field          | Shape                                       | Notes                                                                    |
| -------------- | ------------------------------------------- | ------------------------------------------------------------------------ |
| `pixel_values` | `(B, 3, 3, H, W)`                           | 3 cameras × 3 channels per robot, `H = W = image_size`                   |
| `input_ids`    | `(B, tokenizer_max_length)` int64           | Right-padded with zeros                                                  |
| `lang_lens`    | `(B,)` int64                                | Real (un-padded) length per sample                                       |
| `noise`        | `(B, chunk_size, max_action_dim)` or `None` | Optional; when `None`, the scheduler samples a fresh Gaussian internally |

`B` can be any integer in `[1, max_batch_size]`. Build the tensors on the engine's device; the scheduler validates shapes and raises immediately on mismatch.

```python
actions = engine.step(request)  # (actual_B, chunk_size, max_action_dim)
```

The padding is sliced off before returning, so the leading dimension of the returned tensor equals the real batch size.
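For the `input_ids` and `lang_lens` fields of the request, the right-padding can be sketched in plain Python. This is a minimal sketch under stated assumptions: `pad_prompts` is a hypothetical helper (not part of PhyAI), the token ids are made up, and the length of 200 is the tokenizer length listed at the top of this page. Real code should read `tokenizer_max_length` from the loaded config. Raising on over-long prompts is also an assumption; truncating is another option.

```python
# Sketch: right-pad token ids with zeros and record the real lengths.
# TOKENIZER_MAX_LENGTH mirrors the value listed on this page; prefer the
# value from the loaded config in real code.
TOKENIZER_MAX_LENGTH = 200
PAD_ID = 0  # input_ids are documented as right-padded with zeros


def pad_prompts(token_lists, max_length=TOKENIZER_MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, lang_lens) as nested Python lists."""
    input_ids, lang_lens = [], []
    for tokens in token_lists:
        if len(tokens) > max_length:
            raise ValueError(
                f"prompt has {len(tokens)} tokens, limit is {max_length}"
            )
        lang_lens.append(len(tokens))
        input_ids.append(list(tokens) + [pad_id] * (max_length - len(tokens)))
    return input_ids, lang_lens


# Made-up token ids for two prompts of different lengths.
input_ids, lang_lens = pad_prompts([[2, 17, 45], [2, 9]])
print(lang_lens)          # [3, 2]
print(len(input_ids[0]))  # 200
print(input_ids[1][:4])   # [2, 9, 0, 0]
```

The nested lists can then be moved to the engine's device with `torch.tensor(input_ids, dtype=torch.int64, device=device)`, and likewise for `lang_lens`.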
```python
engine.close()
```

This releases the scheduler's buffers and tears down the captured CUDA graphs.

# End-to-end example

`examples/pi05/run_pi05.py` exercises the full path with deterministic dummy inputs at `max_batch_size ∈ {1, 4}` and includes a multi-batch equivalence check. To run it:

```bash
uv run python examples/pi05/run_pi05.py --checkpoint /path/to/pi0.5
```

Replace the path after `--checkpoint` with your local checkpoint path. The script prints per-phase latency stats (mean / median / std / min / max over 30 timed runs, after 3 warmup runs) and a `PASS` line for the equivalence check.

# Current limitations

* Single GPU only. Tensor parallel, continuous batching, and preemption are all out of scope for `PI05WS1Scheduler`.
* `max_batch_size` is fixed at engine construction. To change it, you must tear down and rebuild the engine.
* The vision tower replays sequentially per real robot; it doesn't batch along the camera dimension.

# Full example

```python
from pathlib import Path

import torch

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.pi05.configuration_pi05 import PI05Config
from phyai.models.pi05.main_pi05 import PI05Args
from phyai.models.pi05.scheduler_ws1_pi05 import PI05Request
from phyai.utils import load_config

CHECKPOINT_DIR = Path("/path/to/pi05_base/")  # change to your local checkpoint folder
BATCH_SIZE = 1

cfg = load_config(CHECKPOINT_DIR, PI05Config)
device = torch.device("cuda")
dtype = torch.bfloat16

# 1. Construct the Engine: runs setup, weight loading, and CUDA graph capture in one shot.
engine = Engine(
    EngineArgs(
        plugin="pi05",
        plugin_args=PI05Args(
            checkpoint_dir=CHECKPOINT_DIR,
            max_batch_size=BATCH_SIZE,
        ),
        config=EngineConfig(
            device=DeviceConfig(target="cuda", params_dtype=dtype),
            runtime=RuntimeConfig(use_cuda_graph=True),
        ),
    )
)

try:
    # 2. Build a dummy request: random pixels + single-token prompt.
    input_ids = torch.zeros(
        BATCH_SIZE, cfg.tokenizer_max_length, dtype=torch.int64, device=device
    )
    input_ids[:, 0] = 2  # any non-pad token id

    request = PI05Request(
        pixel_values=torch.rand(
            BATCH_SIZE,
            3,
            3,
            cfg.vision.image_size,
            cfg.vision.image_size,
            dtype=dtype,
            device=device,
        ),
        input_ids=input_ids,
        lang_lens=torch.ones(BATCH_SIZE, dtype=torch.int64, device=device),
    )

    # 3. Run one inference step.
    actions = engine.step(request)
    print(f"action chunk shape={tuple(actions.shape)}, dtype={actions.dtype}")
finally:
    # 4. Release scheduler buffers and tear down captured CUDA graphs.
    engine.close()
```
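The latency summary described under "End-to-end example" (3 warmups, 30 timed runs, then mean / median / std / min / max) can be approximated with the standard library. This is a rough sketch, not the code in `examples/pi05/run_pi05.py`: `time_call` is a hypothetical helper, the workload is a stand-in, and the use of the sample standard deviation is an assumption.

```python
import statistics
import time


def time_call(fn, warmups=3, runs=30):
    """Time fn() and return summary stats in milliseconds."""
    for _ in range(warmups):
        fn()  # warmup runs are executed but not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "std": statistics.stdev(samples),  # sample std; an assumption
        "min": min(samples),
        "max": max(samples),
        "n": len(samples),
    }


# Stand-in workload; with a real engine this would wrap engine.step(request).
stats = time_call(lambda: sum(range(10_000)))
print({key: round(value, 4) for key, value in stats.items()})
```

When timing GPU work this way, call `torch.cuda.synchronize()` before reading the clock on each side of the call. Otherwise the timer may only capture the kernel launch.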