Overview

Tensor dump records the output activation of every leaf operator your model runs during inference, writing one file per Engine.step(). It is the tool to reach for when you need to answer “what does layer X actually produce at runtime?” — debugging a numerical regression, comparing two backends (e.g. flashinfer vs eager, or bf16 vs an FP8 build), or checking a new model port against a reference implementation. It is built on PyTorch forward hooks: every selected nn.Module with no children gets a hook that captures its return value, moves it to CPU, and accumulates it under the module’s dotted name (model.expert_stack.layers.0.o_proj). Weights are not dumped — those are static and already live in your checkpoint; what you capture here are the intermediate tensors that change with the input.
Tensor dump runs eager-only. A captured CUDA graph replays its kernels without re-entering Python, so forward hooks never fire during graph replay. When you set a dump directory, the engine forces use_cuda_graph=False (with a warning) so the hooks actually run. Expect eager-mode speed while dumping — this is a debugging path, not a production one.
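The hook mechanism described above can be sketched in a few lines of plain PyTorch. This is an illustrative sketch, not the engine's own code: `attach_leaf_hooks` and the toy `nn.Sequential` model are hypothetical names introduced here, and only the standard `register_forward_hook` API is assumed.

```python
import torch
from torch import nn


def attach_leaf_hooks(model: nn.Module, store: dict) -> list:
    """Hook every module with no children and record its output on CPU."""
    handles = []
    for name, module in model.named_modules():
        if next(module.children(), None) is not None:
            continue  # has children, so it is not a leaf operator

        def hook(mod, args, output, name=name):
            # Only plain tensor outputs are recorded in this sketch.
            if isinstance(output, torch.Tensor):
                store.setdefault(name, []).append(output.detach().cpu())

        handles.append(module.register_forward_hook(hook))
    return handles


# Toy model standing in for a real network (hypothetical).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
captured = {}
handles = attach_leaf_hooks(model, captured)
with torch.no_grad():
    model(torch.randn(3, 4))
for handle in handles:
    handle.remove()  # detach the hooks once the pass is recorded

for name, outputs in captured.items():
    print(name, tuple(outputs[0].shape))
```

Because the hooks are ordinary Python callbacks, they only fire when the forward pass actually goes through Python, which is why graph replay has to be disabled while dumping.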

Enabling

Tensor dump is off by default. Turn it on through the runtime config or purely through environment variables, whichever fits your workflow.
Environment variables are the lightest way to switch dumping on for a single run without touching the caller. PHYAI_* variables overlay whatever config the program passes, so this works even for a script that builds its own EngineConfig:
PHYAI_DEBUG_TENSOR_DUMP_DIR=/tmp/dump \
    uv run python examples/pi05/run_pi05.py --checkpoint /path/to/pi05_base --raw
Restrict what is captured with a JSON array of regexes (matched against each operator’s full dotted name):
PHYAI_DEBUG_TENSOR_DUMP_DIR=/tmp/dump \
PHYAI_DEBUG_TENSOR_DUMP_FILTER='["expert_stack\\.layers\\.0\\.", "\\.heads\\."]' \
    uv run python examples/pi05/run_pi05.py --checkpoint /path/to/pi05_base --raw
| Variable | Meaning |
| --- | --- |
| PHYAI_DEBUG_TENSOR_DUMP_DIR | Output directory. Setting it enables dumping. |
| PHYAI_DEBUG_TENSOR_DUMP_FILTER | JSON array of regexes (or a single bare pattern). Records an operator if any pattern matches. |
| PHYAI_DEBUG_TENSOR_DUMP_FILTER_FN | "pkg.module:func" or "/path/file.py:func" predicate. Mutually exclusive with _FILTER. |
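To make the two accepted shapes of PHYAI_DEBUG_TENSOR_DUMP_FILTER concrete, here is a minimal sketch of how such a value could be interpreted. `parse_filter` is a hypothetical helper written for illustration; the engine's real parsing may differ in its details.

```python
import json
import re


def parse_filter(raw: str) -> list[re.Pattern]:
    """Hypothetical parser: a JSON array of regexes, or one bare pattern."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        value = None  # not JSON at all, so fall back to a bare pattern
    if isinstance(value, list) and all(isinstance(p, str) for p in value):
        patterns = value
    else:
        patterns = [raw]  # treat the whole string as a single regex
    return [re.compile(p) for p in patterns]


# The value exactly as the shell passes it when single-quoted.
array_form = parse_filter(r'["expert_stack\\.layers\\.0\\.", "\\.heads\\."]')
bare_form = parse_filter(r"o_proj$")

print([p.pattern for p in array_form])
print([p.pattern for p in bare_form])
```

Note the doubled backslashes in the array form: JSON decoding consumes one level of escaping, so `\\.` in the environment variable becomes the regex `\.`.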

Selecting what to dump

VLA models are not a single homogeneous decoder stack: pi0.5 alone has three layers.<int> stacks (vision encoder, PaliGemma language model, action expert) plus components with no layer index at all (heads, rope, embeddings, projectors). The filter selects operators by their full dotted name, which lets you target any of them precisely. filter accepts three forms: no filter, a list of regexes, or a predicate function.
With no filter, every leaf operator is captured. For pi0.5 this is ~1500 tensors per step, so prefer a narrower filter once you know what you are after.
With a list of regexes, patterns are re.search-matched against the operator name and OR-ed together. Examples:
| Goal | Regex |
| --- | --- |
| One stack’s first layer | r"expert_stack\.layers\.0\." |
| First layer of two stacks | r"expert_stack\.layers\.0\.", r"paligemma_lm\.layers\.0\." |
| Every output projection | r"o_proj$" |
| The action/time heads (no layer index) | r"\.heads\." |
| The whole vision tower | r"\.vision\." |
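The matching rule for regex filters (re.search against the full name, any pattern wins) can be checked offline before a run. The sketch below uses only the standard library; `matches_any` is a hypothetical helper, and the operator name `model.heads.action_out` is made up for the example.

```python
import re


def matches_any(name: str, patterns: list[str]) -> bool:
    """True if any pattern is found anywhere in the operator name."""
    return any(re.search(p, name) for p in patterns)


names = [
    "model.expert_stack.layers.0.o_proj",
    "model.expert_stack.layers.10.o_proj",
    "model.paligemma_lm.layers.0.o_proj",
    "model.heads.action_out",  # hypothetical name, for illustration only
]
patterns = [r"expert_stack\.layers\.0\.", r"\.heads\."]

selected = [n for n in names if matches_any(n, patterns)]
for name in selected:
    print(name)
```

The escaped trailing dot matters: it keeps a layer-index pattern anchored to exactly one index, so a pattern for layer 1 written as `layers\.1\.` would not also pick up layers 10 through 19.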
For logic a regex cannot express, pass a (name: str, module: nn.Module) -> bool predicate. It receives the module too, so you can dispatch on type:
from torch import nn

def keep(name: str, module: nn.Module) -> bool:
    # Every output projection except the vision tower's.
    return name.endswith("o_proj") and ".vision." not in name
Point the config or env var at it as "my_pkg.filters:keep" (import path) or "/tmp/myfilter.py:keep" (file path — convenient for ad-hoc debugging without installing anything).
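For readers curious how a "module:func" or "file.py:func" string can be turned into a callable, here is a minimal sketch using importlib. `load_predicate` is a hypothetical name; this shows one common way to resolve such a spec and is not taken from the engine's source.

```python
import importlib
import importlib.util
import tempfile
from pathlib import Path


def load_predicate(spec: str):
    """Hypothetical resolver for 'pkg.module:func' or '/path/file.py:func'."""
    target, sep, func_name = spec.rpartition(":")
    if not sep or not target or not func_name:
        raise ValueError(f"expected 'module:func' or 'file.py:func', got {spec!r}")
    if target.endswith(".py"):
        # File path: load the file as a standalone module, no install needed.
        module_spec = importlib.util.spec_from_file_location(Path(target).stem, target)
        module = importlib.util.module_from_spec(module_spec)
        module_spec.loader.exec_module(module)
    else:
        # Import path: the module must already be importable.
        module = importlib.import_module(target)
    return getattr(module, func_name)


# Usage: write an ad-hoc filter file and resolve it by path.
source = (
    "def keep(name, module):\n"
    "    return name.endswith('o_proj') and '.vision.' not in name\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "myfilter.py"
    path.write_text(source)
    keep_fn = load_predicate(f"{path}:keep")

print(keep_fn("model.expert_stack.layers.0.o_proj", None))
```

Splitting on the last colon keeps the sketch working for paths that themselves contain a colon.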

Output layout

Each rank writes to its own subdirectory so concurrent processes never collide; each Engine.step() produces one numbered pass file:
/tmp/dump
    rank0_pid3069569/
        pass00000.pt
        pass00001.pt
        pass00002.pt
Each .pt file is a dict of {operator_name: cpu_tensor}. When one operator fires multiple times in a single step — the vision tower runs once per camera, the action expert runs once per Euler denoise step — every invocation is preserved: the first is keyed by the bare name, later ones get a ::callN suffix.
model.paligemma_lm.layers.0.o_proj
model.expert_stack.layers.0.attn          # Euler step 0
model.expert_stack.layers.0.attn::call1   # Euler step 1
model.expert_stack.layers.0.attn::call2   # Euler step 2
...
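When analysing a dump it is often handy to regroup the keys so that all invocations of one operator sit together in call order. The following sketch relies only on the ::callN naming described above; `group_calls` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict


def group_calls(keys) -> dict[str, list[str]]:
    """Map each bare operator name to its keys, ordered by invocation."""
    grouped = defaultdict(list)
    for key in keys:
        base, sep, suffix = key.partition("::call")
        index = int(suffix) if sep else 0  # the bare name is invocation 0
        grouped[base].append((index, key))
    return {base: [k for _, k in sorted(items)] for base, items in grouped.items()}


keys = [
    "model.expert_stack.layers.0.attn::call2",
    "model.paligemma_lm.layers.0.o_proj",
    "model.expert_stack.layers.0.attn",
    "model.expert_stack.layers.0.attn::call1",
]
calls = group_calls(keys)
for base, ordered in calls.items():
    print(base, len(ordered))
```

With a loaded pass dict, `group_calls(tensors.keys())` gives, for the action expert, one entry per Euler step in order.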

Loading a dump

Use load_pass to read one pass file back:
from phyai.runtime.tensor_dump import load_pass

tensors = load_pass("/tmp/dump/rank0_pid3069569/pass00000.pt")

# Keys are operator names; values are CPU tensors.
print(tensors["model.expert_stack.layers.0.o_proj"].shape)

# Compare two runs (e.g. two backends) operator-by-operator.
a = load_pass("/tmp/dump_a/rank0_pid111/pass00000.pt")
b = load_pass("/tmp/dump_b/rank0_pid222/pass00000.pt")
for name in sorted(a.keys() & b.keys()):  # sorted for a stable report order
    diff = (a[name].float() - b[name].float()).abs().max().item()
    if diff > 1e-3:
        print(f"{name}: max_abs_diff={diff:.6f}")
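To walk every pass of a run rather than a single file, the directory layout can be enumerated with pathlib. This is a sketch under the assumption that directories and files follow the rankN_pidM/passNNNNN.pt naming shown in the layout section; `list_passes` is a hypothetical helper, and the temporary tree below only stands in for a real dump directory.

```python
import re
import tempfile
from pathlib import Path

PASS_RE = re.compile(r"pass(\d+)\.pt$")


def list_passes(root) -> dict[str, list[Path]]:
    """Return {rank directory name: pass files sorted by pass number}."""
    result = {}
    for rank_dir in sorted(Path(root).glob("rank*_pid*")):
        if not rank_dir.is_dir():
            continue
        files = [p for p in rank_dir.iterdir() if PASS_RE.search(p.name)]
        files.sort(key=lambda p: int(PASS_RE.search(p.name).group(1)))
        result[rank_dir.name] = files
    return result


# Build an empty stand-in tree, deliberately created out of order.
with tempfile.TemporaryDirectory() as tmp:
    rank_dir = Path(tmp) / "rank0_pid3069569"
    rank_dir.mkdir()
    for filename in ("pass00002.pt", "pass00000.pt", "pass00001.pt"):
        (rank_dir / filename).touch()
    names = {rank: [p.name for p in paths] for rank, paths in list_passes(tmp).items()}

print(names)
```

Each returned path can then be handed to load_pass, for example to repeat the comparison loop across all passes of two runs.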