Overview
Cosmos3’s generation path turns text into video. When the sound stream is enabled, the same denoising run also produces audio aligned with the frames. T2V produces video only; T2AV advances video latent and sound latent on the same timeline, so the image comes into view while the waveform takes shape beside it. This page is about ws1, meaning world_size=1. There is no tensor parallelism, no continuous batching, and no server-side scheduling in this path. It is the plain single-GPU route: build an engine, tokenize the prompt, assemble a Cosmos3T2VRequest, run the denoising loop, and let VAE / AVAE decode the result into media you can save.
Architecture
The Cosmos3 generation path uses PhyAI’s . The cosmos3 plugin splits the work into a few layers:
phyai/src/phyai/models/cosmos3
main_cosmos3.py
scheduler_ws1_cosmos3.py
model_runner_cosmos3.py
model_runner_vae_cosmos3.py
modeling_cosmos3.py
vae_wan.py
avae_sound.py
sampler_unipc.py
configuration_cosmos3.py
| Component | Responsibility |
|---|---|
| Cosmos3Entry | Parses Cosmos3Args; loads the transformer, VAE, and optional AVAE |
| Cosmos3T2VScheduler | Runs the T2V/T2AV denoising loop, owns the UniPC sampler, and decodes video / sound when needed |
| Cosmos3T2VRunner | Calls the transformer and caches timestep-independent UND condition |
| Cosmos3VAERunner | Decodes video latent into pixels in [0, 1] |
| Cosmos3SoundVAERunner | Decodes sound latent into waveform in [-1, 1] |
| Cosmos3Processor | Handles prompt tokenization and prompt metadata outside the engine |
| Cosmos3GenerationPostProcessor | Moves pixels / waveform to CPU and saves mp4 output outside the engine |
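To picture how a scheduler loop can use a runner's outputs, the toy loop below may help. It is a minimal sketch, not PhyAI code: cfg_combine, shifted_sigmas, and toy_denoise are hypothetical names, the CFG rule and the sigma shift are the formulations commonly used with flow-matching models (assumed here, not confirmed for Cosmos3), and a plain Euler step stands in for the UniPC sampler.

```python
from typing import Callable, List, Optional

Vector = List[float]
VelocityFn = Callable[[Vector, float, Optional[object]], Vector]


def cfg_combine(cond: Vector, uncond: Vector, guidance_scale: float) -> Vector:
    """Common classifier-free guidance rule (assumed): uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def shifted_sigmas(num_steps: int, flow_shift: float) -> List[float]:
    """Linear sigmas from 1 down to 0, warped by the usual flow shift
    sigma' = s * sigma / (1 + (s - 1) * sigma). Assumed, not taken from Cosmos3."""
    linear = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [flow_shift * s / (1.0 + (flow_shift - 1.0) * s) for s in linear]


def toy_denoise(
    latent: Vector,
    velocity_fn: VelocityFn,
    pos_cond: Optional[object],
    neg_cond: Optional[object],
    num_steps: int,
    guidance_scale: float,
    flow_shift: float,
) -> Vector:
    """Toy loop: two model calls per step, CFG combine, then one Euler step.

    The Euler update is a stand-in for UniPC, chosen only for brevity.
    """
    sigmas = shifted_sigmas(num_steps, flow_shift)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v_pos = velocity_fn(latent, sigma, pos_cond)   # positive prompt branch
        v_neg = velocity_fn(latent, sigma, neg_cond)   # negative prompt branch
        v = cfg_combine(v_pos, v_neg, guidance_scale)
        latent = [x + (sigma_next - sigma) * vi for x, vi in zip(latent, v)]
    return latent
```

With flow_shift above 1, the warped schedule spends more of its steps at high noise levels. The conditions are passed in unchanged on every step, which is where caching a timestep-independent condition pays off.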
Run path
Prepare weights
Prepare a Cosmos3-Nano checkpoint. The examples assume this layout:
Construct the engine
The plugin name is "cosmos3". T2V needs the transformer and VAE. T2AV also needs AVAE, exposed through sound_tokenizer.
flow_shift=10.0 and use_karras_sigmas=False match the native linear-flow UniPC setup used by the current example script.
Tokenize the prompt
Cosmos3T2VScheduler does not run the tokenizer. Chat template handling, eos / <|vision_start|> suffixes, and positive / negative prompt token ids are produced by Cosmos3Processor.
negative_prompt=None uses the built-in Cosmos3 structured negative prompt. Pass negative_prompt="" if you want an empty negative prompt.
Build the request
Cosmos3T2VRequest carries tokenized text conditions, the latent grid, sampler settings, CFG scale, and seed.

| Field | Shape / Type | Notes |
|---|---|---|
| text_ids / text_mask | (1, S) int64 | Positive prompt condition |
| neg_text_ids / neg_text_mask | (1, S_neg) int64 | Negative / unconditional prompt condition |
| video_shape | (t_lat, h_lat, w_lat) | Latent grid, not pixel dimensions |
| fps | float | Video FPS; also used in prompt metadata |
| num_inference_steps | int | UniPC steps; the example default is 35 |
| guidance_scale | float | CFG scale; the example default is 6.0 |
| seed | int | Initial video / sound noise seed |
| sound_frames | int or None | Non-None enables T2AV |
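Because video_shape is a latent grid, it helps to see the arithmetic once. The sketch below is not the library's pixel_to_latent_shape; sketch_pixel_to_latent_shape is a hypothetical stand-in. It takes the compression factors of 4 along time and 16 along each spatial axis, and additionally assumes the causal-VAE convention in which the first frame gets its own latent frame, so t_lat = (T - 1) / 4 + 1. The real helper may round or validate differently.

```python
from typing import Tuple


def sketch_pixel_to_latent_shape(
    num_frames: int,
    height: int,
    width: int,
    temporal: int = 4,   # time compression
    spatial: int = 16,   # per-axis spatial compression
) -> Tuple[int, int, int]:
    """Hypothetical illustration of mapping pixel dimensions to (t_lat, h_lat, w_lat).

    Assumes a causal video VAE: frame 0 maps to one latent frame and every
    further group of `temporal` frames maps to one more.
    """
    if num_frames < 1 or (num_frames - 1) % temporal != 0:
        raise ValueError("num_frames must be of the form temporal * k + 1")
    if height % spatial != 0 or width % spatial != 0:
        raise ValueError("height and width must be multiples of the spatial factor")
    t_lat = (num_frames - 1) // temporal + 1
    return (t_lat, height // spatial, width // spatial)


# 189 frames at 720x1280 under the assumptions above:
video_shape = sketch_pixel_to_latent_shape(189, 720, 1280)  # (48, 45, 80)
```

Under these assumptions a 720x1280, 189-frame clip becomes a 48 x 45 x 80 latent grid, which is why shrinking frames or resolution shrinks the denoising cost so quickly.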
pixel_to_latent_shape converts pixel dimensions into the VAE latent grid. The default compression is 4 along time and 16 along each spatial axis.
Run generation
T2V returns video pixels shaped (B, 3, T, H, W) with values in [0, 1]. T2AV returns a dict:
End-to-end examples
examples/cosmos3/run_cosmos3.py wires the full path together. T2V:
The example defaults are 720x1280, 189 frames, and 35 steps. That is heavy. For a smoke test, shrink the run first:
The timing stages are model_load, preprocess, inference, to_cpu, and encode. inference includes the denoising loop plus VAE / AVAE decode. encode is PyAV mp4 writing time.
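If you want the same breakdown in your own script, a small stage timer is enough. The sketch below is an illustration only: StageTimer and its methods are hypothetical names, not part of PhyAI or the example script, and the workloads are stand-ins. It shows one way to keep inference apart from the preprocessing and saving stages.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

STAGES = ("model_load", "preprocess", "inference", "to_cpu", "encode")


class StageTimer:
    """Accumulates wall-clock seconds per named stage (hypothetical helper)."""

    def __init__(self) -> None:
        self.seconds: Dict[str, float] = {name: 0.0 for name in STAGES}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        if name not in self.seconds:
            raise KeyError(f"unknown stage: {name}")
        # Note: for GPU work, synchronize the device (for example with
        # torch.cuda.synchronize()) before entering and before leaving the
        # block, otherwise asynchronous kernels are billed to a later stage.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start

    def overhead(self) -> float:
        """Time spent outside the engine: tokenization, host copy, mp4 writing."""
        return sum(self.seconds[n] for n in ("preprocess", "to_cpu", "encode"))


timer = StageTimer()
with timer.stage("preprocess"):
    tokens = [ord(ch) for ch in "a robot arm stacks blocks"]  # stand-in for tokenization
with timer.stage("inference"):
    _ = sum(i * i for i in range(10_000))  # stand-in for denoising + decode
```

Comparing timer.seconds["inference"] with timer.overhead() shows how much of a run is the model itself and how much is tokenization, host copies, and file writing.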
Current limitations
- This is a single-GPU ws1 path. Tensor parallelism, sequence parallelism, continuous batching, and request scheduling are outside its scope.
- The examples disable CUDA graph. The denoising loop is a Python-level UniPC loop, built for clarity and reference alignment first.
- T2AV loads sound_tokenizer / AVAE and advances sound latent at every step, so memory use and runtime go up.
- Prompt tokenization and media saving happen outside the engine. If you are measuring the model itself, separate preprocess, to_cpu, and encode from inference.
- PhyAI has not yet built dedicated kernels, graph capture, batching, or end-to-end throughput optimization for Cosmos3 T2V/T2AV. This page shows the baseline road, not the performance endpoint.

