Overview
PI05Processor lives in phyai_utils_tools.models.pi05. It converts robot-side data into the canonical tensors required by PI05Request, and converts the model’s action chunk back to the dataset’s real action dimension.
The PhyAI pi0.5 scheduler does not resize images, tokenize text, discretize state, or unnormalize actions. Those steps are handled by the processor:
| Stage | Input | Output |
|---|---|---|
| preprocess | images, task, state | PI05ProcessedInputs(pixel_values, input_ids, lang_lens) |
| engine.step | PI05Request | (B, chunk_size, max_action_dim) |
| postprocess | Raw action chunk | (B, chunk_size, action_dim) |
The public pi05_base checkpoint has empty normalizer features, so state/action normalization is a no-op by default. If your lerobot checkpoint includes dataset stats, from_pretrained loads those stats sidecars and uses them in pre/postprocess.
Input contract
preprocess accepts a transition dict. The common fields are:
| Field | Type | Notes |
|---|---|---|
| images | list[torch.Tensor] or torch.Tensor | Each camera is (B, C, H, W); stacked (B, num_images, C, H, W) is also accepted |
| task | list[str] or str | One task string per batch sample |
| state | torch.Tensor | (B, state_dim), with state values in the [-1, 1] range for the pi0.5 prompt |
PI05ProcessedInputs fields map directly into PI05Request:
| Field | Shape | Notes |
|---|---|---|
| pixel_values | (B, num_images, C, image_size, image_size) | Defaults: num_images=3, C=3, image_size=224 |
| input_ids | (B, tokenizer_max_length) int64 | Default tokenizer is google/paligemma-3b-pt-224, right-padded |
| lang_lens | (B,) int64 | Real token length for each prompt |
pixel_values uses the (B, num_images, C, H, W) layout expected by the scheduler. When normalize_pixels=True, the processor maps [0, 1] pixels to [-1, 1].
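The two conventions above, right-padding with a separate lang_lens vector and the [0, 1] to [-1, 1] pixel mapping, can be sketched in plain Python. This is a minimal illustration on lists rather than tensors. The helper names, the pad id of 0, and the truncation of overly long prompts are assumptions made for the sketch, not the processor's confirmed behavior.

```python
def right_pad(token_ids_batch, max_length, pad_id=0):
    """Right-pad each prompt to max_length and record its real length.

    pad_id=0 and truncation at max_length are assumptions for this sketch.
    """
    input_ids, lang_lens = [], []
    for ids in token_ids_batch:
        ids = list(ids)[:max_length]          # keep at most max_length tokens
        lang_lens.append(len(ids))            # real (unpadded) token count
        input_ids.append(ids + [pad_id] * (max_length - len(ids)))
    return input_ids, lang_lens


def to_signed_unit_range(pixels):
    """Map pixel values from [0, 1] to [-1, 1]."""
    return [p * 2.0 - 1.0 for p in pixels]


input_ids, lang_lens = right_pad([[5, 6, 7], [8]], max_length=4)
print(input_ids)   # [[5, 6, 7, 0], [8, 0, 0, 0]]
print(lang_lens)   # [3, 1]
print(to_signed_unit_range([0.0, 0.5, 1.0]))  # [-1.0, 0.0, 1.0]
```

Keeping lang_lens alongside the padded ids lets a consumer tell real tokens from padding without needing to know the pad id.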
Construct from a checkpoint
If your checkpoint directory contains lerobot-format policy_preprocessor.json and policy_postprocessor.json, prefer from_pretrained. This path preserves the processor steps, normalizer configuration, and stats sidecars recorded in the checkpoint, then adds the vision resize and action slice needed by PhyAI inference.
- Loads policy_preprocessor.json and policy_postprocessor.json.
- Injects a HuggingFace tokenizer object into the tokenizer step.
- Points the preprocess device_processor at device, so model inputs land on the inference device.
- Leaves postprocess device behavior as configured by the checkpoint; the pi05_base postprocessor returns CPU tensors.
- Prepends resize / optional pixel normalization to the loaded preprocessor.
- Appends SliceActionStep(action_dim=action_dim) to the loaded postprocessor.
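To make the final slicing step concrete, here is a minimal sketch of cutting a padded action chunk of shape (B, chunk_size, max_action_dim) down to (B, chunk_size, action_dim). It uses nested lists instead of tensors, and the function name slice_actions is hypothetical. It also assumes the real action dimensions come first and the padding sits at the tail of the last axis. With a torch tensor, the equivalent operation would be chunk[..., :action_dim].

```python
def slice_actions(chunk, action_dim):
    """Keep the first action_dim entries of every action vector.

    chunk is a nested list shaped (B, chunk_size, max_action_dim).
    Assumes the real dimensions are the leading ones.
    """
    for sample in chunk:
        for step in sample:
            if action_dim > len(step):
                raise ValueError("action_dim exceeds the padded action width")
    return [[step[:action_dim] for step in sample] for sample in chunk]


# One sample, two time steps, padded from 3 real dims to 4.
raw = [[[0.1, 0.2, 0.3, 0.0], [0.4, 0.5, 0.6, 0.0]]]
print(slice_actions(raw, action_dim=3))
# [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
```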
Manual construction
If you do not have processor JSON files, or you only need the default pi05_base behavior, construct PI05Processor directly.
Resize cameras
ResizeWithPadStep reads images, validates the camera count and channel count, then resizes/pads each camera to image_size × image_size.
Normalize state
NormalizerStep processes state using dataset_stats and PI05_NORM_MAP. Without stats, this is a no-op.
Build prompt
StateTokenizerPrepareStep discretizes state into 256 bins and builds Task: <task>, State: <bins>;\nAction: .
Tokenize
TokenizerStep uses the PaliGemma tokenizer to encode the prompt into input_ids and lang_lens.
postprocess unnormalizes the action chunk, slices it to action_dim, then moves the result back to CPU.
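The prompt-building step can be illustrated with a small, self-contained sketch. It assumes uniform bins over [-1, 1], clamping of out-of-range values, and space-separated bin indices. Those details, along with the function names, are assumptions for illustration; only the 256-bin count and the overall prompt template come from the description above.

```python
import math

NUM_BINS = 256  # bin count stated for the pi0.5 prompt


def discretize_state(state, num_bins=NUM_BINS):
    """Map each value in [-1, 1] to an integer bin in [0, num_bins - 1].

    Uniform bin width and clamping at the edges are assumptions.
    """
    bins = []
    for value in state:
        index = math.floor((value + 1.0) / 2.0 * num_bins)
        bins.append(min(max(index, 0), num_bins - 1))
    return bins


def build_prompt(task, state):
    """Assemble 'Task: <task>, State: <bins>;\\nAction: ' for one sample."""
    state_str = " ".join(str(b) for b in discretize_state(state))
    return f"Task: {task}, State: {state_str};\nAction: "


print(discretize_state([-1.0, 0.0, 1.0]))  # [0, 128, 255]
print(repr(build_prompt("pick up the cup", [-1.0, 0.0, 1.0])))
# 'Task: pick up the cup, State: 0 128 255;\nAction: '
```

Because binning assumes values in [-1, 1], state that has not been normalized into that range would collapse into the first or last bin under this sketch.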
Connect to Engine
Raw cameras, task text, and state flow through PI05Processor into PI05Request, then into Engine inference.
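The shape bookkeeping for this flow can be traced with a toy stand-in that uses no PhyAI code at all. Everything in the sketch is hypothetical: ToyConfig and the three functions only mimic the shapes listed in the stage table, and the values chosen for tokenizer_max_length, chunk_size, max_action_dim, and action_dim are illustrative placeholders rather than confirmed defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyConfig:
    """Illustrative sizes; only num_images, channels, image_size are documented defaults."""
    num_images: int = 3
    channels: int = 3
    image_size: int = 224
    tokenizer_max_length: int = 200  # placeholder
    chunk_size: int = 50             # placeholder
    max_action_dim: int = 32         # placeholder
    action_dim: int = 7              # placeholder


def preprocess_shapes(batch, cfg):
    """Shapes of the three PI05ProcessedInputs fields."""
    return {
        "pixel_values": (batch, cfg.num_images, cfg.channels,
                         cfg.image_size, cfg.image_size),
        "input_ids": (batch, cfg.tokenizer_max_length),
        "lang_lens": (batch,),
    }


def engine_step_shape(inputs, cfg):
    """Shape of the raw action chunk produced by the engine step."""
    batch = inputs["lang_lens"][0]
    return (batch, cfg.chunk_size, cfg.max_action_dim)


def postprocess_shape(raw, cfg):
    """Shape after slicing the padded action axis down to action_dim."""
    batch, chunk_size, _ = raw
    return (batch, chunk_size, cfg.action_dim)


cfg = ToyConfig()
inputs = preprocess_shapes(batch=2, cfg=cfg)
raw = engine_step_shape(inputs, cfg)
actions = postprocess_shape(raw, cfg)
print(inputs["pixel_values"], raw, actions)
# (2, 3, 3, 224, 224) (2, 50, 32) (2, 50, 7)
```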
Save and load
A manually constructed processor can be saved as lerobot-compatible JSON:
| File | Contents |
|---|---|
| policy_preprocessor.json | Normalizer, pi0.5 prompt step, tokenizer, device step |
| policy_postprocessor.json | Unnormalizer and device step |
| *.safetensors | Generated only when the normalizer / unnormalizer has stats |
The saved files omit the image resize and action slice steps; PI05Processor.from_pretrained(...) adds them back from constructor arguments. This matches the lerobot boundary: image resize and action slicing are inference-side model glue, not part of the checkpoint JSON’s generic processor core.
FAQ
images shape mismatch
num_images and num_channels must match the processor constructor arguments. The default pi05_base setup uses 3 RGB cameras, so list input needs 3 tensors shaped (B, 3, H, W), and stacked input needs (B, 3, 3, H, W).
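A quick pre-flight check along these lines can catch the mismatch before calling preprocess. The sketch below works on shape tuples only (for example tuple(t.shape) for each tensor). The function name check_images and its error messages are hypothetical and do not reflect the validation code inside ResizeWithPadStep.

```python
def check_images(shapes, num_images=3, num_channels=3):
    """Validate camera shapes and return the batch size B.

    shapes is either a list of per-camera (B, C, H, W) tuples
    or a single stacked (B, num_images, C, H, W) tuple.
    """
    if isinstance(shapes, tuple):  # stacked input
        if len(shapes) != 5:
            raise ValueError(f"stacked images must be 5-D, got {len(shapes)}-D")
        batch, n, c = shapes[0], shapes[1], shapes[2]
        if n != num_images:
            raise ValueError(f"expected {num_images} cameras, got {n}")
        if c != num_channels:
            raise ValueError(f"expected {num_channels} channels, got {c}")
        return batch

    if len(shapes) != num_images:  # list input, one entry per camera
        raise ValueError(f"expected {num_images} cameras, got {len(shapes)}")
    batches = set()
    for shape in shapes:
        if len(shape) != 4 or shape[1] != num_channels:
            raise ValueError(f"each camera must be (B, {num_channels}, H, W), got {shape}")
        batches.add(shape[0])
    if len(batches) != 1:
        raise ValueError("all cameras must share the same batch size")
    return batches.pop()


print(check_images([(2, 3, 480, 640)] * 3))  # 2
print(check_images((2, 3, 3, 224, 224)))     # 2
```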
Is state required?
StateTokenizerPrepareStep supports the path where state is absent. In that case, the prompt only contains task text and no state bins. The normal pi0.5 robot inference path should pass proprioceptive state.
Why does action output return to CPU?
PI05Processor.from_pretrained does not override the checkpoint postprocessor’s device_processor. The pi05_base postprocessor configuration returns actions to CPU so they are ready for robot control or evaluation code.
Does the tokenizer require network access?
The default tokenizer name is google/paligemma-3b-pt-224. If this tokenizer is not already in the local HuggingFace cache, the first processor construction may trigger a download. In offline environments, pass a prepared tokenizer object.

