Overview

Cosmos3’s generation path turns text into video. When the sound stream is enabled, the same denoising run also produces audio aligned with the frames. T2V produces video only; T2AV advances the video latent and the sound latent on the same timeline, so both modalities are denoised together, step by step.
This page covers ws1, meaning world_size=1. This path has no tensor parallelism, no continuous batching, and no server-side scheduling. It is the plain single-GPU route: build an engine, tokenize the prompt, assemble a Cosmos3T2VRequest, run the denoising loop, and let the VAE / AVAE decode the result into media you can save.
PhyAI has not added special optimization for the Cosmos3 T2V/T2AV path yet. The current implementation favors correctness, reference alignment, and readable control flow: the denoising loop is a Python-driven UniPC loop, and the examples use RuntimeConfig(use_cuda_graph=False). Treat any timing numbers as baseline measurements, not final optimized throughput.

Architecture

The Cosmos3 generation path is implemented as the cosmos3 plugin, which splits the work into a few layers:
phyai/src/phyai/models/cosmos3/
  main_cosmos3.py
  scheduler_ws1_cosmos3.py
  model_runner_cosmos3.py
  model_runner_vae_cosmos3.py
  modeling_cosmos3.py
  vae_wan.py
  avae_sound.py
  sampler_unipc.py
  configuration_cosmos3.py
Main components:
| Component | Responsibility |
| --- | --- |
| Cosmos3Entry | Parses Cosmos3Args; loads the transformer, VAE, and optional AVAE |
| Cosmos3T2VScheduler | Runs the T2V/T2AV denoising loop, owns the UniPC sampler, and decodes video / sound when needed |
| Cosmos3T2VRunner | Calls the transformer and caches timestep-independent UND condition |
| Cosmos3VAERunner | Decodes video latent into pixels in [0, 1] |
| Cosmos3SoundVAERunner | Decodes sound latent into waveform in [-1, 1] |
| Cosmos3Processor | Handles prompt tokenization and prompt metadata outside the engine |
| Cosmos3GenerationPostProcessor | Moves pixels / waveform to CPU and saves mp4 output outside the engine |
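The scheduler's role described above can be pictured with a small sketch of a classifier-free-guidance denoising loop. This is not PhyAI's implementation: a plain Euler update stands in for the UniPC sampler, latents are Python lists rather than tensors, the guidance formula is the common textbook form, and `cfg_combine`, `denoise`, and `toy_predict` are hypothetical names invented for illustration.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Common CFG form: move the prediction away from the unconditional branch."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def denoise(latent, sigmas, predict, guidance_scale):
    """Walk the sigma schedule from noise (high sigma) down to sigma 0."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v_cond = predict(latent, sigma, True)     # positive-prompt branch
        v_uncond = predict(latent, sigma, False)  # negative-prompt branch
        velocity = cfg_combine(v_cond, v_uncond, guidance_scale)
        # Euler update as a stand-in; the real scheduler uses UniPC here
        latent = [x + (sigma_next - sigma) * v for x, v in zip(latent, velocity)]
    return latent


def toy_predict(latent, sigma, conditional):
    # Hypothetical stand-in for the transformer call
    return [1.0 if conditional else 0.0 for _ in latent]


result = denoise([0.0, 0.0], [1.0, 0.5, 0.0], toy_predict, guidance_scale=6.0)
```

Each step calls the model twice, once per prompt branch, which is why caching anything that does not depend on the timestep is attractive in a loop of this shape.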

Run path

1. Prepare weights

Prepare a Cosmos3-Nano checkpoint. The examples assume this layout:
/path/to/Cosmos3-Nano/
  transformer/
  vae/
  text_tokenizer/
  sound_tokenizer/   # required for T2AV
  scheduler/
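Before constructing the engine it can help to confirm the directory layout. The helper below is a minimal sketch: `missing_subdirs` is a hypothetical name, and treating every listed subdirectory as mandatory is an assumption based only on the layout shown here.

```python
from pathlib import Path

# Subdirectory names copied from the layout above.
# Assumption: all of them are required; sound_tokenizer only for T2AV.
BASE_SUBDIRS = ("transformer", "vae", "text_tokenizer", "scheduler")
SOUND_SUBDIRS = ("sound_tokenizer",)


def missing_subdirs(checkpoint_dir, with_sound=False):
    """Return the expected subdirectories that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    wanted = BASE_SUBDIRS + (SOUND_SUBDIRS if with_sound else ())
    return [name for name in wanted if not (root / name).is_dir()]
```

An empty list means the layout matches; anything else names what to fetch before a long model load fails partway through.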
2. Construct the engine

The plugin name is "cosmos3". T2V needs the transformer and VAE. T2AV also needs AVAE, exposed through sound_tokenizer.
import torch

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.cosmos3.main_cosmos3 import Cosmos3Args

checkpoint_dir = "/path/to/Cosmos3-Nano"
with_sound = False  # set to True for T2AV; requires sound_tokenizer/ in the checkpoint

engine = Engine(
    EngineArgs(
        plugin="cosmos3",
        plugin_args=Cosmos3Args(
            checkpoint_dir=checkpoint_dir,
            flow_shift=10.0,
            use_karras_sigmas=False,
            load_sound=(True if with_sound else None),
        ),
        config=EngineConfig(
            device=DeviceConfig(target="cuda", params_dtype=torch.bfloat16),
            runtime=RuntimeConfig(use_cuda_graph=False),
        ),
    )
)
flow_shift=10.0 and use_karras_sigmas=False match the native linear-flow UniPC setup used by the current example script.
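To give a feel for what a flow shift of 10.0 does, here is a sketch of the time-shift formula commonly used with flow-matching schedulers, applied to a linear sigma grid. Whether sampler_unipc.py discretizes in exactly this way is not stated on this page, so treat the grid construction and the names `shift_sigma` and `shifted_linear_sigmas` as assumptions.

```python
def shift_sigma(sigma, flow_shift):
    """Common flow-matching time shift; flow_shift=1.0 leaves sigma unchanged."""
    return flow_shift * sigma / (1.0 + (flow_shift - 1.0) * sigma)


def shifted_linear_sigmas(num_steps, flow_shift):
    """Linear grid from 1.0 toward 0.0, shifted, with a terminal 0.0 appended."""
    base = [1.0 - i / num_steps for i in range(num_steps)]  # assumed linear grid
    return [shift_sigma(s, flow_shift) for s in base] + [0.0]


sigmas = shifted_linear_sigmas(35, 10.0)
```

With a shift above 1.0 the midpoint of the grid maps to a much higher noise level (0.5 becomes 10/11), so more of the step budget is spent at high noise.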
3. Tokenize the prompt

Cosmos3T2VScheduler does not run the tokenizer. Cosmos3Processor applies the chat template, appends the eos / <|vision_start|> suffixes, and produces the positive / negative prompt token ids.
from phyai_utils_tools.models.cosmos3 import Cosmos3Processor

processor = Cosmos3Processor(
    f"{checkpoint_dir}/text_tokenizer",
    fps=24.0,
    num_frames=189,
    height=720,
    width=1280,
    append_metadata=True,
)
cond, uncond = processor.tokenize_pair(
    "A red sports car driving along a coastal road at sunset.",
    negative_prompt=None,
    device="cuda",
)
negative_prompt=None uses the built-in Cosmos3 structured negative prompt. Pass negative_prompt="" if you want an empty negative prompt.
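The distinction between None and an empty string is easy to lose in wrapper code. The sketch below shows the intended branching; `resolve_negative_prompt` is a hypothetical helper and the constant is a placeholder, not the real built-in prompt text.

```python
# Placeholder only: the actual built-in negative prompt lives in Cosmos3Processor
BUILTIN_NEGATIVE_PROMPT = "<built-in structured negative prompt>"


def resolve_negative_prompt(negative_prompt):
    """None selects the built-in prompt; any string, including "", is kept as given."""
    if negative_prompt is None:
        return BUILTIN_NEGATIVE_PROMPT
    return negative_prompt
```

Testing with `is None` rather than truthiness matters here: `if not negative_prompt` would silently replace an intentionally empty prompt with the built-in one.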
4. Build the request

Cosmos3T2VRequest carries tokenized text conditions, the latent grid, sampler settings, CFG scale, and seed.
| Field | Shape / Type | Notes |
| --- | --- | --- |
| text_ids / text_mask | (1, S) int64 | Positive prompt condition |
| neg_text_ids / neg_text_mask | (1, S_neg) int64 | Negative / unconditional prompt condition |
| video_shape | (t_lat, h_lat, w_lat) | Latent grid, not pixel dimensions |
| fps | float | Video FPS; also used in prompt metadata |
| num_inference_steps | int | UniPC steps; the example default is 35 |
| guidance_scale | float | CFG scale; the example default is 6.0 |
| seed | int | Initial video / sound noise seed |
| sound_frames | int or None | Non-None enables T2AV |
import math

from phyai.models.cosmos3 import Cosmos3T2VRequest, pixel_to_latent_shape

num_frames = 189
height = 720
width = 1280
fps = 24.0
with_sound = False

request = Cosmos3T2VRequest(
    text_ids=cond.text_ids,
    text_mask=cond.text_mask,
    neg_text_ids=uncond.text_ids,
    neg_text_mask=uncond.text_mask,
    video_shape=pixel_to_latent_shape(num_frames, height, width),
    fps=fps,
    num_inference_steps=35,
    guidance_scale=6.0,
    seed=42,
    sound_frames=(math.ceil(num_frames / fps * 25.0) if with_sound else None),
)
pixel_to_latent_shape converts pixel dimensions into the VAE latent grid. The default compression is 4 along time and 16 along each spatial axis.
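A worked version of the arithmetic may help when choosing sizes. This sketch is not the library function: it assumes the causal-VAE convention in which the first frame maps to its own latent frame and the rest are grouped in fours, and it assumes spatial sizes divide evenly by 16. The `sound_rate=25.0` default simply mirrors the constant in the request example; its meaning is not documented here. Both function names are hypothetical.

```python
import math


def sketch_latent_shape(num_frames, height, width, t_comp=4, s_comp=16):
    """Pixel dimensions to latent grid under the assumed causal-VAE convention."""
    t_lat = 1 + (num_frames - 1) // t_comp  # assumption: first frame kept separately
    return (t_lat, height // s_comp, width // s_comp)


def sketch_sound_frames(num_frames, fps, sound_rate=25.0):
    """Same expression as the request example: clip duration times 25.0, rounded up."""
    return math.ceil(num_frames / fps * sound_rate)


default_grid = sketch_latent_shape(189, 720, 1280)  # (48, 45, 80) under this assumption
smoke_grid = sketch_latent_shape(49, 480, 832)      # (13, 30, 52) under this assumption
default_sound = sketch_sound_frames(189, 24.0)      # 197
```

Under these assumptions the default run has 48 * 45 * 80 = 172,800 latent positions, while a 49-frame 480x832 run has 20,280, which is why shrinking the request is an effective smoke test.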
5. Run generation

output = engine.step(request)
T2V returns a pixels tensor shaped (B, 3, T, H, W) with values in [0, 1]. T2AV returns a dict:
{
    "video": pixels,
    "sound": waveform,
    "sample_rate": sample_rate,
}
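Because the return type differs between T2V and T2AV, custom code that consumes the output directly may want to normalize it first. A minimal sketch, with `unpack_output` as a hypothetical helper that relies only on the two shapes described above:

```python
def unpack_output(output):
    """Return (video, sound, sample_rate); sound fields are None for T2V."""
    if isinstance(output, dict):
        # T2AV: dict with "video", "sound", and "sample_rate" keys
        return output["video"], output["sound"], output["sample_rate"]
    # T2V: the output is the pixels tensor itself
    return output, None, None
```

Callers can then branch on `sound is None` instead of re-checking the container type at every use.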
6. Save media

Cosmos3GenerationPostProcessor moves GPU tensors to CPU, converts video to uint8 RGB frames, and muxes waveform into the same mp4 when audio is present.
from phyai_utils_tools.models.cosmos3 import Cosmos3GenerationPostProcessor

postprocessor = Cosmos3GenerationPostProcessor(fps=fps)
media = postprocessor.postprocess(output)
postprocessor.save_mp4(media, ".cache/cosmos3_t2v.mp4")

End-to-end examples

examples/cosmos3/run_cosmos3.py wires the full path together. T2V:
uv run python examples/cosmos3/run_cosmos3.py \
    --checkpoint /path/to/Cosmos3-Nano \
    --prompt "A red sports car driving along a coastal road at sunset." \
    --out .cache/cosmos3_t2v
T2AV:
uv run python examples/cosmos3/run_cosmos3.py \
    --checkpoint /path/to/Cosmos3-Nano \
    --prompt "ocean waves crashing on rocks" \
    --sound \
    --out .cache/cosmos3_t2av
The defaults are 720x1280, 189 frames, and 35 steps. That is heavy. For a smoke test, shrink the run first:
uv run python examples/cosmos3/run_cosmos3.py \
    --checkpoint /path/to/Cosmos3-Nano \
    --num-frames 49 \
    --height 480 \
    --width 832 \
    --steps 10 \
    --out .cache/cosmos3_smoke
The script prints phase timings: model_load, preprocess, inference, to_cpu, and encode. inference includes the denoising loop plus VAE / AVAE decode. encode is PyAV mp4 writing time.
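If you drive the engine from your own code rather than the example script, a small timer can reproduce the same kind of phase breakdown. This is a sketch, not the script's own timing code; `phase` is a hypothetical helper. Note the assumption in the comment: GPU work is asynchronous, so a device synchronization (for example `torch.cuda.synchronize()`) before each clock read is needed for the numbers to be meaningful.

```python
import time
from contextlib import contextmanager


@contextmanager
def phase(timings, name):
    """Accumulate wall-clock seconds for a named phase into the timings dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Assumption: the caller has synchronized the GPU before leaving the block
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


timings = {}
with phase(timings, "preprocess"):
    sum(range(1000))  # stand-in for tokenization
with phase(timings, "inference"):
    sum(range(1000))  # stand-in for engine.step(request)
```

Keeping the phases in separate entries makes it straightforward to report inference on its own, apart from tokenization, CPU transfer, and mp4 encoding.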

Current limitations

  • This is a single-GPU ws1 path. Tensor parallelism, sequence parallelism, continuous batching, and request scheduling are outside its scope.
  • The examples disable CUDA graph. The denoising loop is a Python-level UniPC loop, built for clarity and reference alignment first.
  • T2AV loads sound_tokenizer / AVAE and advances sound latent at every step, so memory use and runtime go up.
  • Prompt tokenization and media saving happen outside the engine. If you are measuring the model itself, separate preprocess, to_cpu, and encode from inference.
  • PhyAI has not yet built dedicated kernels, graph capture, batching, or end-to-end throughput optimization for Cosmos3 T2V/T2AV. This page documents the baseline path, not the final performance target.

Full example

import math

import torch

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.cosmos3 import Cosmos3T2VRequest, pixel_to_latent_shape
from phyai.models.cosmos3.main_cosmos3 import Cosmos3Args
from phyai_utils_tools.models.cosmos3 import (
    Cosmos3GenerationPostProcessor,
    Cosmos3Processor,
)

checkpoint_dir = "/path/to/Cosmos3-Nano"
device = "cuda"
dtype = torch.bfloat16
num_frames = 189
height = 720
width = 1280
fps = 24.0
with_sound = False  # set to True for T2AV; requires sound_tokenizer/ in the checkpoint

engine = Engine(
    EngineArgs(
        plugin="cosmos3",
        plugin_args=Cosmos3Args(
            checkpoint_dir=checkpoint_dir,
            flow_shift=10.0,
            use_karras_sigmas=False,
            load_sound=(True if with_sound else None),
        ),
        config=EngineConfig(
            device=DeviceConfig(target=device, params_dtype=dtype),
            runtime=RuntimeConfig(use_cuda_graph=False),
        ),
    )
)

try:
    processor = Cosmos3Processor(
        f"{checkpoint_dir}/text_tokenizer",
        fps=fps,
        num_frames=num_frames,
        height=height,
        width=width,
        append_metadata=True,
    )
    cond, uncond = processor.tokenize_pair(
        "A red sports car driving along a coastal road at sunset.",
        negative_prompt=None,
        device=device,
    )

    request = Cosmos3T2VRequest(
        text_ids=cond.text_ids,
        text_mask=cond.text_mask,
        neg_text_ids=uncond.text_ids,
        neg_text_mask=uncond.text_mask,
        video_shape=pixel_to_latent_shape(num_frames, height, width),
        fps=fps,
        num_inference_steps=35,
        guidance_scale=6.0,
        seed=42,
        sound_frames=(math.ceil(num_frames / fps * 25.0) if with_sound else None),
    )

    output = engine.step(request)
    postprocessor = Cosmos3GenerationPostProcessor(fps=fps)
    media = postprocessor.postprocess(output)
    postprocessor.save_mp4(media, ".cache/cosmos3_t2v.mp4")
finally:
    engine.close()