Overview
Cosmos3-Nano-Policy-DROID is the policy model in the Cosmos3 family. Cosmos3 itself is an omnimodal world model for Physical AI; the policy variant takes a language instruction plus an observation from the DROID robot platform and produces robot action trajectories for manipulation and control. This page covers the multi-GPU Cosmos3 policy path, exposed through the cosmos3_policy_wn plugin. It supports three modes: policy, forward_dynamics, and inverse_dynamics. The video latent and the action latent advance through the same denoising loop, and the final output is the action. If decode_video=True, the plugin also returns rollout video.
PhyAI currently supports two kinds of parallelism in this path. The policy transformer runs tensor parallelism on the tp axis. When cfg=2 and guidance_scale > 1, the cond and uncond CFG branches run in parallel on two TP groups. Rollout video VAE decode is also split into spatial tiles across ranks, with halo overlap used to stitch tile boundaries.
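The tiled VAE decode can be pictured in one dimension. The sketch below is not the plugin's implementation: the helper name tile_ranges, the 1-D split, the halo width, and the "decode the widened span, keep only the core" stitching rule are all assumptions made for illustration.

```python
def tile_ranges(length, num_tiles, halo):
    """Split [0, length) into contiguous core spans, one per rank.

    Each rank reads its core span widened by `halo` on both sides
    (clamped to the valid range) and keeps only the core after decoding.
    Returns a list of (core_start, core_end, read_start, read_end).
    """
    base, extra = divmod(length, num_tiles)
    ranges = []
    start = 0
    for i in range(num_tiles):
        # Spread any remainder over the first `extra` tiles.
        end = start + base + (1 if i < extra else 0)
        read_start = max(0, start - halo)
        read_end = min(length, end + halo)
        ranges.append((start, end, read_start, read_end))
        start = end
    return ranges


# Hypothetical numbers: 64 latent columns split over 8 ranks, halo of 2.
tiles = tile_ranges(length=64, num_tiles=8, halo=2)
```

Because neighbouring read spans overlap by the halo, each rank sees some context across its tile boundary, which is what allows the cores to be concatenated afterwards.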
Modes and output
The clean / noisy rules for the three modes are:

| Mode | Clean video | Clean action | Generation target |
|---|---|---|---|
| policy | Default latent frame 0, or the frames listed in cond_frame_indexes | None | Action chunk, optional rollout video |
| forward_dynamics | Default latent frame 0, or the frames listed in cond_frame_indexes | All action steps in cond_action | Rollout video |
| inverse_dynamics | All video latent frames by default, or the frames listed in cond_frame_indexes | None | Action chunk that explains the observation transition |
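The rules in the table can be written as a pair of boolean masks, where True means "clean (conditioning)" and False means "noisy (to be generated)". This is a minimal sketch of the table only; the function name clean_masks and the list-of-bools representation are assumptions, not the plugin's internal data structures.

```python
MODES = ("policy", "forward_dynamics", "inverse_dynamics")


def clean_masks(mode, num_latent_frames, num_action_steps, cond_frame_indexes=None):
    """Return (video_clean, action_clean) masks for one request."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    if cond_frame_indexes is not None:
        # Explicit conditioning frames override the per-mode default.
        clean_frames = set(cond_frame_indexes)
    elif mode == "inverse_dynamics":
        # Every observed frame is clean; only the action is generated.
        clean_frames = set(range(num_latent_frames))
    else:
        # policy and forward_dynamics condition on latent frame 0 by default.
        clean_frames = {0}

    video_clean = [i in clean_frames for i in range(num_latent_frames)]
    # Only forward_dynamics supplies clean actions (from cond_action).
    action_clean = [mode == "forward_dynamics"] * num_action_steps
    return video_clean, action_clean
```

For example, clean_masks("policy", 4, 3) marks only latent frame 0 as clean and leaves all three action steps noisy.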
With decode_video=True, the plugin also returns video latent and decoded pixels.

| Key | Shape / Type | Meaning |
|---|---|---|
| action | [B, action_chunk, raw_action_dim] | Padding tail already removed |
| video | [B, C, t_lat, h_lat, w_lat] | Rollout / denoised video latent |
| pixels | [B, 3, T, H, W], optional | Returned only when decode_video=True and the checkpoint has VAE |

Actions are padded to action_dim=64. The real robot action width is raw_action_dim, and the scheduler removes padding before returning output.
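The padding round trip can be illustrated with plain Python lists. This is a sketch under assumptions: the helper names pad_action and strip_padding are hypothetical, zero is assumed as the fill value, and the raw width of 8 in the example is a placeholder rather than the actual DROID action width.

```python
ACTION_DIM = 64  # padded action width stated in this section


def pad_action(chunk, action_dim=ACTION_DIM):
    """Right-pad every action step with zeros up to action_dim."""
    padded = []
    for step in chunk:
        if len(step) > action_dim:
            raise ValueError("raw action is wider than action_dim")
        padded.append(list(step) + [0.0] * (action_dim - len(step)))
    return padded


def strip_padding(chunk, raw_action_dim):
    """Drop the padding tail so each step has raw_action_dim entries again."""
    return [step[:raw_action_dim] for step in chunk]


# Hypothetical chunk: 2 steps, raw width 8.
raw = [[0.1 * i for i in range(8)], [0.2 * i for i in range(8)]]
padded = pad_action(raw)
restored = strip_padding(padded, raw_action_dim=8)
```

After the round trip, restored has the same shape and values as raw, which corresponds to the [B, action_chunk, raw_action_dim] output described in the table.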
Parallel topology
The example below uses TP=4, CFG=2, and world_size=8. Ranks 0-3 form the TP group for the cond branch, and ranks 4-7 form the TP group for the uncond branch. Within each denoising step, the four TP ranks in the same branch run the transformer forward pass together rather than as a serial pipeline.
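The rank layout for this topology can be reproduced with a few lines of arithmetic. The sketch assumes cfg is the outer axis and tp the inner axis, which matches the rank ranges given above; the function names are hypothetical and do not belong to the plugin.

```python
def rank_to_coords(rank, cfg_size, tp_size):
    """Map a global rank to (cfg_rank, tp_rank) with cfg outer, tp inner."""
    world_size = cfg_size * tp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world_size {world_size}")
    return divmod(rank, tp_size)


def tp_groups(cfg_size, tp_size):
    """Ranks that share a cfg_rank: one tensor-parallel group per CFG branch."""
    return [[c * tp_size + t for t in range(tp_size)] for c in range(cfg_size)]


def cfg_groups(cfg_size, tp_size):
    """Ranks that share a tp_rank: the peers gathered along the cfg axis."""
    return [[c * tp_size + t for c in range(cfg_size)] for t in range(tp_size)]


print(tp_groups(2, 4))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(cfg_groups(2, 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With TP=4 and CFG=2, rank 5 maps to (cfg_rank=1, tp_rank=1), so it sits in the uncond TP group and exchanges data along the cfg axis with rank 1 only.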
P.all_gather(axis="cfg") uses the parallel mesh created during engine initialization. ParallelConfig(world_size=cfg_size * tp_size, cfg_size=cfg_size, tp_size=tp_size) maps each rank to (cfg_rank, tp_rank). Gathering along the cfg axis only collects ranks that share the same tp_rank but differ in cfg_rank, so each TP shard receives cond and uncond velocity and can complete CFG combine locally.
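Once a shard holds both velocities, the combine is a local elementwise operation. The page does not state the exact formula the plugin uses, so the sketch below assumes the standard classifier-free guidance rule, uncond + guidance_scale * (cond - uncond), and simulates the gathered result with plain lists instead of a real collective.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Standard classifier-free guidance combine, applied elementwise."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Simulated result of gathering along the cfg axis on one TP shard:
# index 0 is the cond branch (cfg_rank 0), index 1 the uncond branch.
gathered = [[1.0, 2.0], [0.5, 1.0]]

combined = cfg_combine(gathered[0], gathered[1], guidance_scale=2.0)
print(combined)  # [1.5, 3.0]

# With guidance_scale = 1 the result is just the cond velocity,
# so the uncond branch contributes nothing.
print(cfg_combine(gathered[0], gathered[1], guidance_scale=1.0))  # [1.0, 2.0]
```

Under this formula, a guidance scale of 1 makes the uncond branch redundant, which is consistent with CFG parallelism only paying off when guidance_scale > 1.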
The VAE eight-GPU split is shown below, with cfg as the outer axis and tp as the inner axis:
Run path
Prepare weights and inputs
Prepare a Cosmos3-Nano-Policy-DROID checkpoint. If you want rollout video output, the checkpoint also needs vae/. The policy and inverse_dynamics modes can take observation image or video input. forward_dynamics also needs action JSON.
Construct the multi-GPU policy engine
The plugin name is "cosmos3_policy_wn". torchrun --nproc_per_node must equal cfg_size * tp_size.
Construct the input processor
Cosmos3PolicyProcessor handles observation resize / padding, prompt tokenization, action padding, domain id, and output postprocessing.
Build the request
Cosmos3ActionRequest does not carry parallel topology. Parallelism comes from the engine config; the request only describes this policy inference.
Run all ranks together
Every rank must call engine.step(request). The scheduler triggers collectives on the tp and cfg axes, so rank 0 cannot run alone.
Run examples
TP-only four-GPU policy inference:
--nproc_per_node must equal --cfg * --tp. The policy example defaults to guidance_scale=1.0, where cfg=2 has no benefit; CFG parallel only matters once --guidance-scale is greater than 1.
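The two launch constraints above can be checked before starting a job. The helper below is a hypothetical pre-flight check, not part of the example script or the plugin; it only encodes the two rules stated in this section.

```python
def check_launch(nproc_per_node, cfg, tp, guidance_scale):
    """Validate launch arguments; return a list of advisory notes."""
    if nproc_per_node != cfg * tp:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} must equal cfg * tp = {cfg * tp}"
        )
    notes = []
    if cfg == 2 and guidance_scale <= 1.0:
        # Both branches would run, but the combine reduces to the cond branch.
        notes.append("cfg=2 has no benefit unless guidance_scale > 1")
    return notes


print(check_launch(nproc_per_node=4, cfg=1, tp=4, guidance_scale=1.0))  # []
print(check_launch(nproc_per_node=8, cfg=2, tp=4, guidance_scale=1.0))
```

The first call corresponds to the TP-only four-GPU case and passes silently; the second is a valid eight-GPU launch that returns a note because half the ranks would do work that does not affect the result.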
Implementation notes
- decode_video=True requires vae/ in the checkpoint. Without it, the path can only return action and video latent.
- forward_dynamics must provide cond_action; the processor pads the raw action to action_dim.
- This path is still a single-request example / baseline path. It is not a continuous batching scheduler.

