Overview
Cosmos3’s policy path is not the text-to-video path. It is the part of the model that looks at an observation, reads a task, and predicts what to do next. Give it an observation and a prompt, and it can predict an action chunk. Give it an action, and it can roll out a possible future. Give it a transition that already happened, and it can infer the action that explains it. This page uses Cosmos3-Nano-Policy-DROID by default. If your goal is action output, do not substitute the general Cosmos3-Nano generation checkpoint. The T2V/T2AV path is documented separately in /models/cosmos/ws1.
This is the ws1 path, meaning single-GPU inference. It covers three modes:
| Mode | Input | Output |
|---|---|---|
| policy | Observation image/video + prompt | Action chunk, optionally rollout video |
| forward_dynamics | Observation + prompt + known action | Rollout video, with action preserved in the output |
| inverse_dynamics | Observation video + prompt | Action chunk explaining the transition |

examples/cosmos3/run_cosmos3_policy.py already wires these three modes together. The script enables decode_video=True, so it saves a rollout mp4 whenever the scheduler returns pixels. It always saves action as JSON.
Architecture
The policy path uses the cosmos3_policy plugin. It shares the Cosmos3 transformer with the T2V/T2AV generation path, but its request adds an action latent, a domain id, and a mode. Video and action move through the same denoising loop; each mode only changes which parts are clean conditions and which parts must be generated.
phyai/src/phyai/models/cosmos3
main_cosmos3_policy.py
scheduler_ws1_cosmos3_policy.py
model_runner_policy_cosmos3.py
model_runner_vae_cosmos3.py
modeling_cosmos3.py
vae_wan.py
sampler_unipc.py
| Component | Responsibility |
|---|---|
| Cosmos3PolicyEntry | Loads the transformer; also loads VAE when decode_video=True |
| Cosmos3PolicyScheduler | Builds video/action clean and noised masks for each mode, then runs UniPC |
| Cosmos3ActionRunner | Calls the policy transformer and returns video velocity plus action velocity |
| Cosmos3PolicyProcessor | Handles observation, prompt, action padding, domain id, and output action postprocessing |

How to read the three modes
policy
policy is the robot-control-shaped path. You provide an observation and a task, and the model predicts an action chunk. By default, the first observation frame is the clean condition; later video latent and all action latent are generated from noise.
Use it when the question is: “given this scene, what should the robot do?”
forward_dynamics
forward_dynamics gives the model an observation and a known action, then asks it to roll out video. Here action is the clean condition, and video is the generated target.
Use it when the question is: “if the robot takes this action, what happens next?”
This mode requires --action-file.
inverse_dynamics
inverse_dynamics works in the other direction. You provide an observation video, and the model infers an action chunk that can explain the transition. By default, the whole video is clean condition, and action is recovered from noise.
Use it when the question is: “what action likely moved the scene from A to B?”
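The three modes differ only in which latents are held clean. The bookkeeping can be sketched as below, assuming boolean masks where True means "clean condition". The function name build_clean_masks, the per-frame mask layout, and the choice to keep the observation frame clean in forward_dynamics are illustrative assumptions, not taken from Cosmos3PolicyScheduler.

```python
def build_clean_masks(mode, num_frames, num_action_steps, condition_frames=(0,)):
    """Return (video_clean, action_clean); True marks a clean condition.

    Hypothetical helper for illustration only. The real scheduler may
    index latent frames rather than pixel frames and may lay masks out
    differently.
    """
    video_clean = [False] * num_frames
    action_clean = [False] * num_action_steps

    if mode == "policy":
        # Observation frame(s) are clean; later video and all action are generated.
        for i in condition_frames:
            video_clean[i] = True
    elif mode == "forward_dynamics":
        # Known action is clean; video after the observation is generated.
        # Keeping the observation frame clean here is an assumption.
        for i in condition_frames:
            video_clean[i] = True
        action_clean = [True] * num_action_steps
    elif mode == "inverse_dynamics":
        # Whole video is clean; action is recovered from noise.
        video_clean = [True] * num_frames
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return video_clean, action_clean


if __name__ == "__main__":
    for m in ("policy", "forward_dynamics", "inverse_dynamics"):
        print(m, build_clean_masks(m, num_frames=4, num_action_steps=3))
```

Everything marked False is what the denoising loop has to produce; everything marked True is passed through as a condition.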
Input contract
Cosmos3PolicyProcessor.preprocess() accepts a dict. The example script turns CLI arguments into this shape:
| Field | Type | Notes |
|---|---|---|
| images | Image path, PIL image, numpy array, torch tensor, or a list of those objects | A single image becomes 1 frame; a list is treated as a multi-frame observation |
| task / prompt | str or list[str] | Task text; when a list is provided, the first item is used |
| cond_action / action | list, numpy array, or torch.Tensor | Required only for forward_dynamics |
| domain_name / domain_id | str or int | Overrides the processor constructor value |
| mode | str | Overrides the processor constructor value |

The observation is passed as a tensor of shape (1, 3, T, H, W) with values in [-1, 1]. When you pass --video, the script reads the first action_chunk_size + 1 frames. If the clip is too short, it repeats the last frame to fill the sequence.
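The frame-selection rule for --video can be sketched in a few lines. This is a minimal illustration, not the script's code: select_observation_frames and to_unit_range are hypothetical names, and the 8-bit to [-1, 1] mapping is an assumption about how pixel values are scaled.

```python
def select_observation_frames(frames, action_chunk_size):
    """Keep the first action_chunk_size + 1 frames; repeat the last frame if short."""
    if not frames:
        raise ValueError("need at least one frame")
    target = action_chunk_size + 1
    selected = list(frames[:target])
    # Pad by repeating the final available frame until the target length is reached.
    selected += [selected[-1]] * (target - len(selected))
    return selected


def to_unit_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1] (assumed scaling)."""
    return pixel / 127.5 - 1.0


if __name__ == "__main__":
    print(select_observation_frames(["f0", "f1"], action_chunk_size=3))
    print(to_unit_range(0), to_unit_range(255))
```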
Domain and action dimensions
Cosmos3 action output has two widths:

| Name | Meaning |
|---|---|
| action_dim | Internal model action width; default 64 |
| raw_action_dim | Real action width for the robot embodiment |

The processor pads input action to action_dim. After engine output, it slices action back to raw_action_dim.
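The relationship between the two widths can be shown with a small round trip. This is a sketch under assumptions: pad_action and slice_action are hypothetical names, and zero is assumed as the padding value, which the page does not specify.

```python
def pad_action(chunk, action_dim=64, pad_value=0.0):
    """Pad each step of a [steps][raw_action_dim] chunk out to action_dim."""
    padded = []
    for step in chunk:
        if len(step) > action_dim:
            raise ValueError("step is wider than action_dim")
        padded.append(list(step) + [pad_value] * (action_dim - len(step)))
    return padded


def slice_action(chunk, raw_action_dim):
    """Drop the padded tail so each step is raw_action_dim wide again."""
    return [list(step[:raw_action_dim]) for step in chunk]


if __name__ == "__main__":
    raw = [[0.1] * 10, [0.2] * 10]          # two steps, DROID-sized width
    model_in = pad_action(raw)              # each step is now 64 wide
    print(len(model_in[0]), slice_action(model_in, 10) == raw)
```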
Common domains:

| domain_name | domain_id | raw_action_dim |
|---|---|---|
| bridge_orig_lerobot | 7 | 10 |
| droid_lerobot | 8 | 10 |
| agibotworld | 15 | 29 |
| fractal | 20 | 10 |

If you pass only domain_id, the processor cannot infer raw_action_dim from a name. Pass --raw-action-dim explicitly in that case.
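The lookup rule can be sketched as a small resolver built from the domains listed on this page. The resolve_domain function is hypothetical, and letting an explicit raw_action_dim override the table value is an assumption rather than documented processor behavior.

```python
# domain_name -> (domain_id, raw_action_dim), copied from the table on this page.
DOMAINS = {
    "bridge_orig_lerobot": (7, 10),
    "droid_lerobot": (8, 10),
    "agibotworld": (15, 29),
    "fractal": (20, 10),
}


def resolve_domain(domain_name=None, domain_id=None, raw_action_dim=None):
    """Return (domain_id, raw_action_dim), requiring an explicit width for a bare id."""
    if domain_name is not None:
        if domain_name not in DOMAINS:
            raise KeyError(f"unknown domain_name: {domain_name!r}")
        known_id, known_dim = DOMAINS[domain_name]
        # Assumption: an explicit width wins over the table value.
        return known_id, known_dim if raw_action_dim is None else raw_action_dim
    if domain_id is None:
        raise ValueError("pass domain_name or domain_id")
    if raw_action_dim is None:
        # A bare id carries no name, so the width cannot be inferred.
        raise ValueError("raw_action_dim is required when only domain_id is given")
    return domain_id, raw_action_dim


if __name__ == "__main__":
    print(resolve_domain(domain_name="droid_lerobot"))
    print(resolve_domain(domain_id=15, raw_action_dim=29))
```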
Run path
Prepare weights
Prepare a Cosmos3-Nano-Policy-DROID checkpoint. The policy path needs at least:
Construct the engine
The plugin name is "cosmos3_policy". The example script uses decode_video=True, so VAE is loaded and decoded rollout pixels are returned.
use_karras_sigmas=None reads the scheduler config from the checkpoint. The example also lets you pass false to use linear-flow sampling with flow_shift.
Construct the processor
Cosmos3PolicyProcessor handles observation resize/pad, prompt tokenization, action padding, domain id resolution, and output action slicing / optional denormalization.
Preprocess input
processed.video_shape is a pixel shape (T, H, W). Convert it to a latent grid with pixel_to_latent_shape before building the request.
Script examples
Policy
Single observation image, predict action:

| File | Contents |
|---|---|
| .cache/cosmos3_policy_out_action.json | Action chunk |
| .cache/cosmos3_policy_out.mp4 | Rollout video, if decoded pixels are returned |

Forward dynamics
Provide an action and generate rollout video:
action.json supports two formats:
For droid_lerobot, raw_action_dim is 10. If the file has fewer steps than action_chunk_size, the processor repeats the last step to fill the chunk.
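The step-count handling for a conditioning action can be sketched on in-memory data. This illustration does not parse either action.json format; fit_action_chunk is a hypothetical name, and the width check is an added safeguard, not documented processor behavior.

```python
def fit_action_chunk(steps, action_chunk_size, raw_action_dim):
    """Trim or extend a list of action steps to exactly action_chunk_size."""
    if not steps:
        raise ValueError("action needs at least one step")
    for i, step in enumerate(steps):
        # Safeguard (assumption): every step must match the embodiment width.
        if len(step) != raw_action_dim:
            raise ValueError(f"step {i} has width {len(step)}, expected {raw_action_dim}")
    fitted = [list(step) for step in steps[:action_chunk_size]]  # trim extra steps
    while len(fitted) < action_chunk_size:                       # repeat the last step
        fitted.append(list(fitted[-1]))
    return fitted


if __name__ == "__main__":
    short = [[0.0] * 10, [1.0] * 10]
    chunk = fit_action_chunk(short, action_chunk_size=4, raw_action_dim=10)
    print(len(chunk), chunk[-1] == short[-1])
```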
Inverse dynamics
Provide an observation video and infer action:
If you do not pass --condition-frames, the script defaults to 0 for image input and 0,1 for video input.
Output postprocessing
Cosmos3PolicyProcessor.postprocess() does three things:
- Reads action from either a tensor result or a result dict.
- Slices action to raw_action_dim.
- Denormalizes action back to physical units when action_stats_path is provided.

action_normalization | Required stats fields |
|---|---|
| meanstd | mean, std |
| minmax | min, max |
| quantile | q01, q99 |
| quantile_rot | global_raw.q01, global_raw.q99 |

Without action_stats_path, action remains in the model’s normalized output scale.
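To make the stats fields concrete, here is a sketch of how the first three normalization types are commonly inverted. It assumes the normalized range for minmax and quantile is [-1, 1], which this page does not state, and it omits quantile_rot because its semantics are not described here. denormalize_step is a hypothetical name, not the processor's method.

```python
def denormalize_step(values, stats, normalization):
    """Map one normalized action step back to physical units (assumed formulas)."""
    if normalization == "meanstd":
        # x = x_norm * std + mean
        return [v * s + m for v, s, m in zip(values, stats["std"], stats["mean"])]
    if normalization == "minmax":
        low, high = stats["min"], stats["max"]
    elif normalization == "quantile":
        low, high = stats["q01"], stats["q99"]
    else:
        raise ValueError(f"unsupported normalization in this sketch: {normalization!r}")
    # Assumption: normalized values live in [-1, 1].
    return [(v + 1.0) / 2.0 * (h - l) + l for v, l, h in zip(values, low, high)]


if __name__ == "__main__":
    print(denormalize_step([0.0, 1.0], {"mean": [1.0, 2.0], "std": [2.0, 3.0]}, "meanstd"))
    print(denormalize_step([-1.0, 1.0, 0.0],
                           {"min": [0.0, 0.0, 0.0], "max": [10.0, 10.0, 10.0]},
                           "minmax"))
```

If the checkpoint was trained with a different normalized range, the minmax and quantile branches would need the matching inverse.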
Current limitations
- The current script processes one request at a time. It is for path validation and examples, not a server scheduler.
- Action / policy examples use the DROID policy checkpoint and droid_lerobot. If you switch embodiment, use matching policy weights, domain, and action stats together.
- decode_video=True loads VAE and saves rollout video. If you only care about action latency, turn it off in code.
- forward_dynamics requires an action file. The processor trims it or repeats the last step to reach action_chunk_size.
- When domain_name cannot resolve raw_action_dim, pass --raw-action-dim explicitly.
- CUDA graph is not the main optimization target for this path yet. The current code leaves room for future work; the first goal is getting the semantics correct.

