Overview
Tensor dump records the output activation of every leaf operator your model runs during inference, writing one file per Engine.step(). It is the
tool to reach for when you need to answer “what does layer X actually
produce at runtime?” — debugging a numerical regression, comparing two
backends (e.g. flashinfer vs eager, or bf16 vs an FP8 build), or checking a
new model port against a reference implementation.
It is built on PyTorch forward hooks: every selected nn.Module with no
children gets a hook that captures its return value, moves it to CPU, and
accumulates it under the module’s dotted name
(model.expert_stack.layers.0.o_proj). Weights are not dumped — those are
static and already live in your checkpoint; what you capture here are the
intermediate tensors that change with the input.
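The hook mechanism described above can be illustrated with a minimal sketch in plain PyTorch. This is not the tool's own implementation: the helper name attach_leaf_hooks, the toy model, and the in-memory dict are hypothetical, and only named_modules, children, and register_forward_hook are real PyTorch calls.

```python
import torch
from torch import nn


def attach_leaf_hooks(model: nn.Module, store: dict) -> list:
    """Hook every leaf module (no children) and keep its output on CPU.

    Hypothetical helper for illustration only.
    """
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # containers are skipped; only leaf operators are hooked

        def hook(mod, args, output, name=name):
            # This sketch only handles operators that return a single tensor.
            if isinstance(output, torch.Tensor):
                store[name] = output.detach().cpu()

        handles.append(module.register_forward_hook(hook))
    return handles


class Toy(nn.Module):
    """Tiny stand-in model whose leaves are named layers.0 and layers.1."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


model = Toy()
captured = {}
handles = attach_leaf_hooks(model, captured)
with torch.no_grad():
    model(torch.randn(1, 4))
for handle in handles:
    handle.remove()  # detach the hooks once the pass has been recorded

print(sorted(captured))  # ['layers.0', 'layers.1']
```

Binding name as a default argument gives each hook its own dotted name; without it, every closure would see the last name the loop produced.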
Enabling
Tensor dump is off by default. Turn it on through the runtime config, or purely through environment variables, whichever fits your workflow.
Environment variables
Setting PHYAI_DEBUG_TENSOR_DUMP_DIR is the lightest way to switch dumping
on for a single run without touching the caller. To restrict what is
captured, set PHYAI_DEBUG_TENSOR_DUMP_FILTER to a JSON array of regexes,
matched against each operator’s full dotted name.
PHYAI_* variables overlay whatever config the program passes, so this
works even for a script that builds its own EngineConfig.
| Variable | Meaning |
|---|---|
| PHYAI_DEBUG_TENSOR_DUMP_DIR | Output directory. Setting it enables dumping. |
| PHYAI_DEBUG_TENSOR_DUMP_FILTER | JSON array of regexes (or a single bare pattern). Records an operator if any pattern matches. |
| PHYAI_DEBUG_TENSOR_DUMP_FILTER_FN | "pkg.module:func" or "/path/file.py:func" predicate. Mutually exclusive with _FILTER. |
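As a sketch of how the variables in the table might be set from a Python launcher script: the variable names come from the table, while the directory and patterns are example values. The snippet assumes the variables are set before the engine is constructed; exactly when the runtime reads them is not stated here.

```python
import json
import os

# Example patterns only; any regexes over the operator's dotted name work.
patterns = [r"expert_stack\.layers\.0\.", r"o_proj$"]

# Setting the directory is what enables dumping.
os.environ["PHYAI_DEBUG_TENSOR_DUMP_DIR"] = "/tmp/dump"

# json.dumps takes care of the backslash escaping that a hand-written
# JSON string would need.
os.environ["PHYAI_DEBUG_TENSOR_DUMP_FILTER"] = json.dumps(patterns)

# _FILTER and _FILTER_FN are mutually exclusive, so make sure only one is set.
os.environ.pop("PHYAI_DEBUG_TENSOR_DUMP_FILTER_FN", None)

print(os.environ["PHYAI_DEBUG_TENSOR_DUMP_FILTER"])
```

Generating the JSON with json.dumps avoids the most common mistake with this variable, which is forgetting that each regex backslash must itself be escaped inside a JSON string.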
Selecting what to dump
VLA models are not a single homogeneous decoder stack: pi0.5 alone has three layers.<int> stacks (vision encoder, PaliGemma language model, action
expert) plus components with no layer index at all (heads, rope,
embeddings, projectors). The filter selects operators by their full
dotted name, which lets you target any of them precisely.
The filter accepts three forms:
None — record everything (default)
Every leaf operator is captured. For pi0.5 this is ~1500 tensors per
step, so prefer a narrower filter once you know what you are after.
A list of regexes — record if any matches
Patterns are matched against the operator name with re.search and OR-ed
together. Examples:
| Goal | Regex |
|---|---|
| One stack’s first layer | r"expert_stack\.layers\.0\." |
| First layer of two stacks | r"expert_stack\.layers\.0\.", r"paligemma_lm\.layers\.0\." |
| Every output projection | r"o_proj$" |
| The action/time heads (no layer index) | r"\.heads\." |
| The whole vision tower | r"\.vision\." |
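The matching rule can be tried out offline with a few lines of standard-library Python. This is a sketch of the semantics as described (re.search, any match wins, None records everything); the function name should_record and all operator names except the first are made up for illustration.

```python
import re
from typing import Optional, Sequence


def should_record(name: str, patterns: Optional[Sequence[str]]) -> bool:
    """Return True if the operator name passes the filter.

    None records everything; otherwise a single re.search hit is enough.
    """
    if patterns is None:
        return True
    return any(re.search(p, name) for p in patterns)


# The first name appears in this document; the others are hypothetical.
names = [
    "model.expert_stack.layers.0.o_proj",
    "model.expert_stack.layers.10.o_proj",
    "model.paligemma_lm.layers.0.q_proj",
    "model.heads.action_out",
]

first_layers = [r"expert_stack\.layers\.0\.", r"paligemma_lm\.layers\.0\."]
selected = [n for n in names if should_record(n, first_layers)]
print(selected)

# An unanchored layer index is a common pitfall: "layers\.1" also hits layer 10.
print(should_record(names[1], [r"layers\.1"]))    # True
print(should_record(names[1], [r"layers\.1\."]))  # False
```

Checking patterns against a list of names this way is cheaper than running the model, and the trailing `\.` in the table's patterns is what keeps layer 1 from also selecting layers 10 through 19.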
A callable — record if it returns True
For logic a regex cannot express, pass a
(name: str, module: nn.Module) -> bool predicate. It receives the
module too, so you can dispatch on type.
Point the config or env var at it as "my_pkg.filters:keep" (import path)
or "/tmp/myfilter.py:keep" (file path, convenient for ad-hoc debugging
without installing anything).
Output layout
Each rank writes to its own subdirectory so concurrent processes never collide; each Engine.step() produces one numbered pass file:
/tmp/dump
  rank0_pid3069569
    pass00000.pt
    pass00001.pt
    pass00002.pt
Each .pt file is a dict of {operator_name: cpu_tensor}. When one operator
fires multiple times in a single step — the vision tower runs once per
camera, the action expert runs once per Euler denoise step — every
invocation is preserved: the first is keyed by the bare name, later ones get
a ::callN suffix.
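When comparing repeated invocations, it can help to regroup a loaded pass dict by operator. The sketch below assumes that N in ::callN is an integer that increases with invocation order; the helper name group_calls and the example keys are hypothetical, and plain strings stand in for tensors.

```python
from collections import defaultdict


def group_calls(pass_dict: dict) -> dict:
    """Group a pass dict's entries by operator name, in invocation order.

    Assumes the bare key is the first call and "::callN" carries an integer N.
    """
    grouped = defaultdict(list)
    for key, value in pass_dict.items():
        base, sep, suffix = key.partition("::call")
        order = int(suffix) if sep else -1  # the bare name sorts first
        grouped[base].append((order, value))
    # Sort on the call index only, so the values themselves are never compared.
    return {
        base: [value for _, value in sorted(items, key=lambda item: item[0])]
        for base, items in grouped.items()
    }


# Hypothetical keys; the values would be CPU tensors in a real pass file.
example = {
    "model.vision.proj::call2": "camera 2",
    "model.vision.proj": "camera 0",
    "model.vision.proj::call1": "camera 1",
    "model.heads.out": "head",
}

calls = group_calls(example)
print(calls["model.vision.proj"])  # ['camera 0', 'camera 1', 'camera 2']
```

With real tensors, the per-operator lists can be stacked or diffed call by call, for example to see how the action expert's output evolves across denoise steps.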
Loading a dump
Use load_pass to read one pass file back:

