Overview

PI05Processor lives in phyai_utils_tools.models.pi05. It converts robot-side data into the canonical tensors required by PI05Request, and converts the model’s action chunk back to the dataset’s real action dimension. The PhyAI pi0.5 scheduler does not resize images, tokenize text, discretize state, or unnormalize actions. Those steps are handled by the processor:
| Stage | Input | Output |
| --- | --- | --- |
| preprocess | images, task, state | PI05ProcessedInputs(pixel_values, input_ids, lang_lens) |
| engine.step | PI05Request | (B, chunk_size, max_action_dim) |
| postprocess | Raw action chunk | (B, chunk_size, action_dim) |
The public pi05_base checkpoint has empty normalizer features, so state/action normalization is a no-op by default. If your lerobot checkpoint includes dataset stats, from_pretrained loads the stats sidecar files and applies them in both preprocess and postprocess.
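To make the no-op behavior concrete, here is a minimal sketch of a normalizer that only transforms values when stats are present. The function names, the stats layout, and the mean/std mode are assumptions for illustration; the real NormalizerStep and PI05_NORM_MAP may use a different mode.

```python
def normalize(values, stats=None, eps=1e-8):
    """Normalize a flat list of floats with mean/std stats, if any exist."""
    if not stats:  # empty or missing stats: values pass through unchanged
        return list(values)
    return [(v - m) / (s + eps) for v, m, s in zip(values, stats["mean"], stats["std"])]


def unnormalize(values, stats=None, eps=1e-8):
    """Inverse of normalize(); also a no-op without stats."""
    if not stats:
        return list(values)
    return [v * (s + eps) + m for v, m, s in zip(values, stats["mean"], stats["std"])]


state = [0.25, -0.5]
untouched = normalize(state)  # no stats, so this equals the input

# Hypothetical dataset stats, as a checkpoint sidecar might provide them.
stats = {"mean": [0.0, -1.0], "std": [0.5, 0.25]}
round_trip = unnormalize(normalize(state, stats), stats)
```

With stats present, preprocess and postprocess apply inverse transforms, so a value that is normalized and then unnormalized returns to its original scale.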

Input contract

preprocess accepts a transition dict. The common fields are:
| Field | Type | Notes |
| --- | --- | --- |
| images | list[torch.Tensor] or torch.Tensor | Each camera is (B, C, H, W); stacked (B, num_images, C, H, W) is also accepted |
| task | list[str] or str | One task string per batch sample |
| state | torch.Tensor | (B, state_dim), with state values in the [-1, 1] range for the pi0.5 prompt |
The output PI05ProcessedInputs fields map directly into PI05Request:
| Field | Shape | Notes |
| --- | --- | --- |
| pixel_values | (B, num_images, C, image_size, image_size) | Defaults: num_images=3, C=3, image_size=224 |
| input_ids | (B, tokenizer_max_length) int64 | Default tokenizer is google/paligemma-3b-pt-224, right-padded |
| lang_lens | (B,) int64 | Real token length for each prompt |
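The relationship between input_ids and lang_lens can be sketched in a few lines. This is an illustration only: the helper name, the pad id of 0, the truncation of over-long prompts, and the token ids are assumptions, not the tokenizer step's actual implementation.

```python
def right_pad(token_ids, max_length, pad_id=0):
    """Right-pad (or truncate) each prompt and record its real token count."""
    input_ids, lang_lens = [], []
    for ids in token_ids:
        ids = ids[:max_length]  # assumption: over-long prompts are truncated
        lang_lens.append(len(ids))
        input_ids.append(ids + [pad_id] * (max_length - len(ids)))
    return input_ids, lang_lens


# Hypothetical token ids for a batch of two prompts.
batch = [[5, 17, 9], [5, 17, 9, 42, 8]]
input_ids, lang_lens = right_pad(batch, max_length=6)
print(input_ids)  # every row has length 6
print(lang_lens)  # real lengths before padding
```

Every row of input_ids has the same length, while lang_lens tells the scheduler how many leading tokens in each row are real.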
Each image is resized with its aspect ratio preserved so that it fits inside an image_size × image_size square, padded to fill the square, then stacked into the (B, num_images, C, H, W) layout expected by the scheduler. When normalize_pixels=True, the processor maps [0, 1] pixels to [-1, 1].
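The geometry of a resize-with-pad step can be worked out without any tensors. The sketch below computes the resized size and padding for one camera; the function names and the choice to split padding evenly between both sides are assumptions, and ResizeWithPadStep may place the padding differently.

```python
def resize_with_pad_geometry(height, width, image_size=224):
    """Return the resized (h, w) and the (top, bottom, left, right) padding."""
    scale = image_size / max(height, width)  # the longer side maps to image_size
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))
    pad_h = image_size - new_h
    pad_w = image_size - new_w
    # Assumption: padding is split evenly between the two sides.
    top, left = pad_h // 2, pad_w // 2
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


def normalize_pixel(value):
    """Map a pixel value from [0, 1] to [-1, 1]."""
    return value * 2.0 - 1.0


size, pad = resize_with_pad_geometry(480, 640)
print(size, pad)
```

For a 480 × 640 camera and image_size=224, the image shrinks to 168 × 224 and the remaining 56 rows are padding.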

Construct from a checkpoint

If your checkpoint directory contains lerobot-format policy_preprocessor.json and policy_postprocessor.json, prefer from_pretrained. This path preserves the processor steps, normalizer configuration, and stats sidecars recorded in the checkpoint, then adds the vision resize and action slice needed by PhyAI inference.
from pathlib import Path

import torch

from phyai_utils_tools.models.pi05 import PI05Processor

processor = PI05Processor.from_pretrained(
    Path("/path/to/pi05_base"),
    image_size=224,
    num_channels=3,
    num_images=3,
    action_dim=7,
    device="cuda",
    params_dtype=torch.bfloat16,
)
This construction path:
  • Loads policy_preprocessor.json and policy_postprocessor.json.
  • Injects a HuggingFace tokenizer object into the tokenizer step.
  • Points the preprocess device_processor at device, so model inputs land on the inference device.
  • Leaves postprocess device behavior as configured by the checkpoint; the pi05_base postprocessor returns CPU tensors.
  • Prepends resize / optional pixel normalization to the loaded preprocessor.
  • Appends SliceActionStep(action_dim=action_dim) to the loaded postprocessor.

Manual construction

If you do not have processor JSON files, or you only need the default pi05_base behavior, construct PI05Processor directly:
import torch

from phyai_utils_tools.models.pi05 import PI05Processor

processor = PI05Processor(
    image_size=224,
    num_channels=3,
    num_images=3,
    tokenizer_max_length=200,
    action_dim=7,
    device="cuda",
    params_dtype=torch.bfloat16,
)
The manually constructed preprocess pipeline runs in this order:
1. Resize cameras. ResizeWithPadStep reads images, validates the camera count and channel count, then resizes/pads each camera to image_size × image_size.
2. Normalize state. NormalizerStep processes state using dataset_stats and PI05_NORM_MAP. Without stats, this is a no-op.
3. Build prompt. StateTokenizerPrepareStep discretizes state into 256 bins and builds Task: <task>, State: <bins>;\nAction: .
4. Tokenize. TokenizerStep uses the PaliGemma tokenizer to encode the prompt into input_ids and lang_lens.
5. Move tensors. DeviceStep moves tensors to device and casts floating-point tensors to params_dtype.
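The prompt-building step can be sketched as follows. The helper names, the uniform binning formula, the clamping of out-of-range values, and the space separator between bins are assumptions for illustration; only the 256-bin count and the overall prompt template come from the description above.

```python
import math


def discretize(value, num_bins=256):
    """Map a value in [-1, 1] to an integer bin in [0, num_bins - 1]."""
    index = math.floor((value + 1.0) / 2.0 * num_bins)
    # Assumption: values outside [-1, 1] are clamped to the edge bins.
    return min(max(index, 0), num_bins - 1)


def build_prompt(task, state):
    """Render the pi0.5-style prompt for one sample."""
    # Assumption: bins are joined with single spaces.
    bins = " ".join(str(discretize(v)) for v in state)
    return f"Task: {task}, State: {bins};\nAction: "


prompt = build_prompt("pick up the cup", [-1.0, 0.0, 1.0])
print(repr(prompt))
```

This also shows why state is expected in the [-1, 1] range: values outside it would all collapse into the first or last bin.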
The postprocess pipeline first unnormalizes actions, slices the model’s padded internal action dimension down to action_dim, then moves the result back to CPU.
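The slicing part of postprocess is simple to picture with nested lists standing in for tensors. The helper name and the sizes are hypothetical, and the sketch assumes the real action values occupy the leading entries of the padded dimension.

```python
def slice_actions(chunk, action_dim):
    """Keep the first action_dim values of every step in a (B, chunk_size, D) nested list."""
    return [[step[:action_dim] for step in sample] for sample in chunk]


# Hypothetical sizes: the model pads actions to 32 dims, the robot uses 7.
batch, chunk_size, max_action_dim, action_dim = 1, 2, 32, 7
raw = [
    [[float(i) for i in range(max_action_dim)] for _ in range(chunk_size)]
    for _ in range(batch)
]

actions = slice_actions(raw, action_dim)
shape = (len(actions), len(actions[0]), len(actions[0][0]))
print(shape)
```

On a torch tensor the same operation would be raw_actions[..., :action_dim].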

Connect to Engine

The example below shows how raw cameras, task text, and state flow through PI05Processor into PI05Request, then into Engine inference.
from pathlib import Path

import torch

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.pi05.configuration_pi05 import PI05Config
from phyai.models.pi05.main_pi05 import PI05Args
from phyai.models.pi05.scheduler_ws1_pi05 import PI05Request
from phyai.utils import load_config
from phyai_utils_tools.models.pi05 import PI05Processor

checkpoint_dir = Path("/path/to/pi05_base")
cfg = load_config(checkpoint_dir, PI05Config)
device = torch.device("cuda")
dtype = torch.bfloat16
batch_size = 1
action_dim = 7

processor = PI05Processor.from_pretrained(
    checkpoint_dir,
    image_size=cfg.vision.image_size,
    num_channels=cfg.vision.num_channels,
    num_images=3,
    action_dim=action_dim,
    device=device,
    params_dtype=dtype,
)

engine = Engine(
    EngineArgs(
        plugin="pi05",
        plugin_args=PI05Args(
            checkpoint_dir=checkpoint_dir,
            max_batch_size=batch_size,
        ),
        config=EngineConfig(
            device=DeviceConfig(target="cuda", params_dtype=dtype),
            runtime=RuntimeConfig(use_cuda_graph=True),
        ),
    )
)

try:
    raw = {
        "images": [
            torch.rand(batch_size, 3, 480, 640, device=device),
            torch.rand(batch_size, 3, 480, 640, device=device),
            torch.rand(batch_size, 3, 480, 640, device=device),
        ],
        "task": ["pick up the cup"],
        "state": torch.rand(batch_size, 7, device=device) * 2 - 1,
    }

    processed = processor.preprocess(raw)
    request = PI05Request(
        pixel_values=processed.pixel_values,
        input_ids=processed.input_ids,
        lang_lens=processed.lang_lens,
    )

    raw_actions = engine.step(request)
    actions = processor.postprocess(raw_actions)
    print(actions.shape)
finally:
    engine.close()
If you only want to measure engine latency, skip the processor and build an already resized/tokenized PI05Request directly. examples/pi05/run_pi05.py --raw uses that path.
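For latency measurements, a small timing helper with warm-up runs is usually enough. The sketch below is generic and not part of PhyAI; measure_latency and its parameters are hypothetical names, and a stand-in workload replaces the engine call so the snippet runs anywhere.

```python
import statistics
import time


def measure_latency(fn, warmup=3, iters=20, sync=None):
    """Return (median, max) wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up runs are executed but not timed
        fn()
    samples = []
    for _ in range(iters):
        if sync is not None:
            sync()  # drain pending asynchronous work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work launched by fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples), max(samples)


# Stand-in workload; replace with the engine call when benchmarking.
median_ms, worst_ms = measure_latency(lambda: sum(range(1000)))
print(f"median {median_ms:.3f} ms, max {worst_ms:.3f} ms")
```

When timing the engine, fn would be something like lambda: engine.step(request), and passing sync=torch.cuda.synchronize keeps asynchronous GPU work from being left out of the measurement.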

Save and load

A manually constructed processor can be saved as lerobot-compatible JSON:
processor.save_pretrained("/tmp/pi05_processor")
The saved directory contains:
| File | Contents |
| --- | --- |
| policy_preprocessor.json | Normalizer, pi0.5 prompt step, tokenizer, device step |
| policy_postprocessor.json | Unnormalizer and device step |
| *.safetensors | Generated only when the normalizer / unnormalizer has stats |
PhyAI-side vision resize, optional pixel normalization, and action slicing are not written into JSON. PI05Processor.from_pretrained(...) adds them back from constructor arguments. This matches the lerobot boundary: image resize and action slicing are inference-side model glue, not part of the checkpoint JSON’s generic processor core.

FAQ

images shape mismatch

The number of cameras and the number of channels in images must match the num_images and num_channels passed to the processor constructor. The default pi05_base setup uses 3 RGB cameras, so list input needs 3 tensors shaped (B, 3, H, W), and stacked input needs (B, 3, 3, H, W).
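A quick shape check before calling preprocess can surface these mismatches early. The helper below is a hypothetical sketch that works on shape tuples (for example tuple(t.shape)); it is not the validation code that ResizeWithPadStep runs.

```python
def check_image_shapes(shapes, num_images=3, num_channels=3):
    """Validate camera shapes and return the batch size.

    shapes is either a list of per-camera (B, C, H, W) tuples or one
    stacked (B, num_images, C, H, W) tuple.
    """
    if isinstance(shapes, tuple):  # stacked input
        if len(shapes) != 5:
            raise ValueError(f"stacked input must be 5-D, got {shapes}")
        batch, n, c, h, w = shapes
        per_camera = [(batch, c, h, w)] * n
    else:  # list input, one shape per camera
        per_camera = list(shapes)

    if len(per_camera) != num_images:
        raise ValueError(f"expected {num_images} cameras, got {len(per_camera)}")
    for i, shape in enumerate(per_camera):
        if len(shape) != 4 or shape[1] != num_channels:
            raise ValueError(f"camera {i}: expected (B, {num_channels}, H, W), got {shape}")
    batch_sizes = {shape[0] for shape in per_camera}
    if len(batch_sizes) != 1:
        raise ValueError(f"cameras disagree on batch size: {sorted(batch_sizes)}")
    return batch_sizes.pop()


print(check_image_shapes([(2, 3, 480, 640)] * 3))  # list input
print(check_image_shapes((2, 3, 3, 480, 640)))     # stacked input
```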

Is state required?

Not strictly. StateTokenizerPrepareStep also handles inputs without state; the prompt then contains only the task text and no state bins. For normal pi0.5 robot inference, pass the proprioceptive state.

Why are actions returned on CPU?

PI05Processor.from_pretrained does not override the checkpoint postprocessor’s device_processor. The pi05_base postprocessor configuration returns actions to CPU so they are ready for robot control or evaluation code.

Does the tokenizer require network access?

The default tokenizer name is google/paligemma-3b-pt-224. If this tokenizer is not already in the local HuggingFace cache, the first processor construction may trigger a download. In offline environments, pass a prepared tokenizer object:
processor = PI05Processor(
    tokenizer=my_tokenizer,
    image_size=224,
    num_images=3,
    tokenizer_max_length=200,
)
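One way to prepare such a tokenizer object is to load it from files already on disk with the HuggingFace transformers library. This is a sketch under assumptions: load_local_tokenizer is a hypothetical helper, the path is a placeholder, and transformers must be installed for the call itself to work.

```python
import os

# Ask the HuggingFace libraries not to touch the network. Set this before
# transformers / huggingface_hub are imported.
os.environ["HF_HUB_OFFLINE"] = "1"


def load_local_tokenizer(tokenizer_dir):
    """Load a tokenizer from local files only, without any download attempt."""
    # Imported lazily so the environment variable above is already in place.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(tokenizer_dir, local_files_only=True)


# Placeholder path; point it at a directory that holds the tokenizer files.
# my_tokenizer = load_local_tokenizer("/path/to/paligemma-3b-pt-224")
```

With local_files_only=True, a missing file raises an error immediately instead of triggering a download.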