Overview

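Cosmos3's generation path turns a piece of text into video; when the sound stream is enabled, the same denoising pass also generates audio synchronized with the picture. T2V generates video only, whereas T2AV has UniPC advance the video latent and the sound latent together along a shared timeline: the picture gradually emerges from the noise while the sound takes shape at the same pace.
This page covers only ws1, that is, the single-GPU path with world_size=1. There is no tensor parallelism here, no continuous batching, and no scheduling tricks aimed at server-side throughput. It is a clear, direct generation path that is easy to align with the reference implementation: build the engine, tokenize the prompt, assemble a Cosmos3T2VRequest, run the denoise loop, and finally let the VAE/AVAE decode the result into media that can be saved.
PhyAI has not yet applied any special optimization to the Cosmos3 T2V/T2AV path. The current implementation prioritizes correctness, alignment with the reference, and readability: the denoise loop is a UniPC loop driven from Python, and RuntimeConfig(use_cuda_graph=False) is the default in the examples. Performance numbers should not be read as the final, optimized result.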

Architecture

PhyAI's Cosmos3 generation path still goes through the unified cosmos3 plugin, which splits the generation task into several layers:
phyai/src/phyai/models/cosmos3
main_cosmos3.py
scheduler_ws1_cosmos3.py
model_runner_cosmos3.py
model_runner_vae_cosmos3.py
modeling_cosmos3.py
vae_wan.py
avae_sound.py
sampler_unipc.py
configuration_cosmos3.py
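The responsibilities of each layer are:
| Component | Responsibility |
| --- | --- |
| Cosmos3Entry | Parses Cosmos3Args; loads the transformer, the VAE, and optionally the AVAE |
| Cosmos3T2VScheduler | Manages the T2V/T2AV denoise loop, maintains the UniPC sampler, and decodes video / sound when needed |
| Cosmos3T2VRunner | Calls the transformer and caches the timestep-independent UND condition |
| Cosmos3VAERunner | Decodes video latents into pixels in the range [0, 1] |
| Cosmos3SoundVAERunner | Decodes sound latents into a waveform in the range [-1, 1] |
| Cosmos3Processor | Handles prompt tokenization and prompt metadata outside the engine |
| Cosmos3GenerationPostProcessor | Outside the engine, moves pixels / waveform to the CPU and saves the mp4 |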
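To make the division of work between the scheduler and the runner more concrete, here is a toy sketch of a joint denoise loop. It is not PhyAI code: predict is a hypothetical stand-in for the transformer call, a first-order Euler update stands in for the UniPC sampler, and applying the standard classifier-free guidance formula to both the video and the sound stream is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Latent = List[float]
# Hypothetical signature: (video, sound, sigma, conditional) -> (video_velocity, sound_velocity)
Predict = Callable[[Latent, Latent, float, bool], Tuple[Latent, Latent]]


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> Latent:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def euler_step(latent: Latent, velocity: Latent, sigma: float, sigma_next: float) -> Latent:
    """First-order update along the sigma axis (a stand-in for UniPC)."""
    return [x + (sigma_next - sigma) * v for x, v in zip(latent, velocity)]


def joint_denoise(
    video: Latent,
    sound: Latent,
    sigmas: Sequence[float],
    predict: Predict,
    guidance_scale: float,
) -> Tuple[Latent, Latent]:
    """Advance both latents through the same sigma schedule, step by step."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v_cond, s_cond = predict(video, sound, sigma, True)
        v_unc, s_unc = predict(video, sound, sigma, False)
        video = euler_step(video, cfg_combine(v_cond, v_unc, guidance_scale), sigma, sigma_next)
        sound = euler_step(sound, cfg_combine(s_cond, s_unc, guidance_scale), sigma, sigma_next)
    return video, sound
```

The point of the sketch is that both latents see the same (sigma, sigma_next) pair on every step, which is what keeps picture and sound on one timeline.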

Run path

1. Prepare the weights

Prepare a Cosmos3-Nano checkpoint. The example assumes the directory layout includes:
/path/to/Cosmos3-Nano/
  transformer/
  vae/
  text_tokenizer/
  sound_tokenizer/   # required for T2AV
  scheduler/
2. Build the Engine

The plugin name is "cosmos3". T2V must load the transformer and the VAE; T2AV additionally loads the AVAE, that is, the sound_tokenizer.
import torch

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.cosmos3.main_cosmos3 import Cosmos3Args

checkpoint_dir = "/path/to/Cosmos3-Nano"
with_sound = False

engine = Engine(
    EngineArgs(
        plugin="cosmos3",
        plugin_args=Cosmos3Args(
            checkpoint_dir=checkpoint_dir,
            flow_shift=10.0,
            use_karras_sigmas=False,
            load_sound=(True if with_sound else None),
        ),
        config=EngineConfig(
            device=DeviceConfig(target="cuda", params_dtype=torch.bfloat16),
            runtime=RuntimeConfig(use_cuda_graph=False),
        ),
    )
)
flow_shift=10.0 and use_karras_sigmas=False match the native linear-flow UniPC configuration used in the current example.
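For intuition about what flow_shift does to the noise schedule, here is a minimal sketch. It assumes the time-shift formula commonly used by flow-matching schedulers, sigma' = shift * sigma / (1 + (shift - 1) * sigma), applied to a simplified linear base schedule. Whether sampler_unipc.py uses exactly this form is an assumption, and shift_sigma and shifted_sigmas are illustrative names, not PhyAI APIs.

```python
from typing import List


def shift_sigma(sigma: float, shift: float) -> float:
    """Common flow-matching time shift; shift=1.0 leaves sigma unchanged."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_sigmas(num_steps: int, shift: float) -> List[float]:
    """Simplified linear schedule from 1.0 down to 0.0, then shifted.

    ASSUMPTION: real schedulers may pick slightly different endpoints.
    """
    base = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift_sigma(s, shift) for s in base]


sigmas = shifted_sigmas(num_steps=35, shift=10.0)
```

Under this formula, shift=10.0 maps the midpoint 0.5 to roughly 0.909, so a larger share of the 35 steps is spent at high noise levels.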
3. Tokenize the prompt

Cosmos3T2VScheduler does no tokenization. On the prompt side, the chat template, the appended eos / <|vision_start|> tokens, and the token ids of the positive and negative prompts are all handled by Cosmos3Processor.
from phyai_utils_tools.models.cosmos3 import Cosmos3Processor

processor = Cosmos3Processor(
    f"{checkpoint_dir}/text_tokenizer",
    fps=24.0,
    num_frames=189,
    height=720,
    width=1280,
    append_metadata=True,
)
cond, uncond = processor.tokenize_pair(
    "A red sports car driving along a coastal road at sunset.",
    negative_prompt=None,
    device="cuda",
)
negative_prompt=None uses the built-in Cosmos3 structured negative prompt. For an empty negative prompt, pass negative_prompt="" explicitly.
4. Build the request

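Cosmos3T2VRequest carries the already-processed text conditions, the latent grid, the number of sampling steps, the CFG scale, and the random seed.
| Field | Shape / type | Notes |
| --- | --- | --- |
| text_ids / text_mask | (1, S) int64 | Positive prompt condition |
| neg_text_ids / neg_text_mask | (1, S_neg) int64 | Negative / unconditional prompt condition |
| video_shape | (t_lat, h_lat, w_lat) | Latent grid, not pixel dimensions |
| fps | float | Video FPS; also used in the prompt metadata |
| num_inference_steps | int | UniPC steps; the example uses 35 |
| guidance_scale | float | CFG scale; the example uses 6.0 |
| seed | int | Seeds the initial video/sound noise |
| sound_frames | int or None | A non-None value enables T2AV |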
import math

from phyai.models.cosmos3 import Cosmos3T2VRequest, pixel_to_latent_shape

num_frames = 189
height = 720
width = 1280
fps = 24.0
with_sound = False

request = Cosmos3T2VRequest(
    text_ids=cond.text_ids,
    text_mask=cond.text_mask,
    neg_text_ids=uncond.text_ids,
    neg_text_mask=uncond.text_mask,
    video_shape=pixel_to_latent_shape(num_frames, height, width),
    fps=fps,
    num_inference_steps=35,
    guidance_scale=6.0,
    seed=42,
    sound_frames=(math.ceil(num_frames / fps * 25.0) if with_sound else None),
)
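pixel_to_latent_shape uses the VAE compression ratios to convert pixel dimensions into a latent grid: by default the temporal dimension is compressed by a factor of 4 and the spatial dimensions by a factor of 16.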
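As a worked example of these ratios, the sketch below computes a latent grid and the sound frame count for the default and smoke-test sizes. It is an approximation, not the library function: the temporal rounding follows the common causal video VAE convention (the first frame gets its own latent frame), which is an assumption, and reading the 25.0 in the request code as the number of sound latent frames per second is also an assumption. The names latent_shape_sketch and sound_frames_sketch are hypothetical.

```python
import math
from typing import Tuple


def latent_shape_sketch(
    num_frames: int, height: int, width: int, temporal: int = 4, spatial: int = 16
) -> Tuple[int, int, int]:
    """Approximate pixel -> latent grid conversion.

    ASSUMPTION: causal-VAE rounding, t_lat = (T - 1) // temporal + 1.
    """
    t_lat = (num_frames - 1) // temporal + 1
    return t_lat, height // spatial, width // spatial


def sound_frames_sketch(num_frames: int, fps: float, sound_rate: float = 25.0) -> int:
    """Same expression as in the request code: ceil(duration_seconds * 25.0)."""
    return math.ceil(num_frames / fps * sound_rate)


default_grid = latent_shape_sketch(189, 720, 1280)
smoke_grid = latent_shape_sketch(49, 480, 832)
default_sound = sound_frames_sketch(189, 24.0)
```

Under these assumptions the default 189-frame 720×1280 request maps to a (48, 45, 80) grid with 197 sound frames, while a 49-frame 480×832 request maps to (13, 30, 52).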
5. Run generation

output = engine.step(request)
T2V returns a pixels tensor of shape (B, 3, T, H, W) with values in [0, 1]. T2AV returns a dict:
{
    "video": pixels,
    "sound": waveform,
    "sample_rate": sample_rate,
}
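Because the return type differs between the two modes, calling code may want to normalize it before further processing. The helper below is a small sketch of one way to do that; split_output is a hypothetical name and is not part of PhyAI.

```python
from typing import Any, Optional, Tuple


def split_output(output: Any) -> Tuple[Any, Optional[Any], Optional[int]]:
    """Return (video, sound, sample_rate) for both T2V and T2AV outputs.

    T2V: output is the pixels tensor itself, so sound and sample_rate are None.
    T2AV: output is a dict with "video", "sound" and "sample_rate" keys.
    """
    if isinstance(output, dict):
        return output["video"], output["sound"], output["sample_rate"]
    return output, None, None
```

With this in place, downstream code can check `sound is not None` instead of branching on the output type.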
6. Save the media

Cosmos3GenerationPostProcessor moves the GPU tensors to the CPU, converts the video to uint8 RGB frames, and, when audio is present, muxes the waveform into the same mp4.
from phyai_utils_tools.models.cosmos3 import Cosmos3GenerationPostProcessor

postprocessor = Cosmos3GenerationPostProcessor(fps=fps)
media = postprocessor.postprocess(output)
postprocessor.save_mp4(media, ".cache/cosmos3_t2v.mp4")
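For readers who want to inspect frames without the post-processor, the conversion from [0, 1] pixels to uint8 RGB frames can be sketched with NumPy. This is an illustration only: it handles a single sample already on the CPU, and the rounding mode is an assumption that may differ from what Cosmos3GenerationPostProcessor does. pixels_to_uint8_frames is a hypothetical name.

```python
import numpy as np


def pixels_to_uint8_frames(pixels: np.ndarray) -> np.ndarray:
    """Convert one sample of shape (3, T, H, W) in [0, 1] to (T, H, W, 3) uint8."""
    if pixels.ndim != 4 or pixels.shape[0] != 3:
        raise ValueError(f"expected (3, T, H, W), got {pixels.shape}")
    clipped = np.clip(pixels, 0.0, 1.0)
    # ASSUMPTION: round to nearest; the real post-processor may truncate instead.
    frames = np.rint(clipped * 255.0).astype(np.uint8)
    # Move the channel axis last so each frame is an (H, W, 3) RGB image.
    return np.transpose(frames, (1, 2, 3, 0))
```

For a batched output of shape (B, 3, T, H, W), the function would be applied to each `output[i]` after moving it to the CPU and converting it to a NumPy array.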

End-to-end example

examples/cosmos3/run_cosmos3.py already chains the steps above together. T2V:
uv run python examples/cosmos3/run_cosmos3.py \
    --checkpoint /path/to/Cosmos3-Nano \
    --prompt "A red sports car driving along a coastal road at sunset." \
    --out .cache/cosmos3_t2v
T2AV:
uv run python examples/cosmos3/run_cosmos3.py \
    --checkpoint /path/to/Cosmos3-Nano \
    --prompt "ocean waves crashing on rocks" \
    --sound \
    --out .cache/cosmos3_t2av
The defaults are 720×1280, 189 frames, and 35 steps. This size is slow; for a quick smoke test, scale it down first:
uv run python examples/cosmos3/run_cosmos3.py \
    --checkpoint /path/to/Cosmos3-Nano \
    --num-frames 49 \
    --height 480 \
    --width 832 \
    --steps 10 \
    --out .cache/cosmos3_smoke
The script prints per-stage timings: model_load, preprocess, inference, to_cpu, and encode. Here inference covers the denoise loop and the VAE/AVAE decode; encode is the time PyAV takes to write the mp4.
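If you assemble the pipeline yourself rather than running the script, similar per-stage timings can be collected with a small context manager. This is a sketch, not the script's actual implementation; StageTimer and its sync hook are hypothetical names.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Optional


class StageTimer:
    """Accumulates wall-clock seconds per named stage."""

    def __init__(self, sync: Optional[Callable[[], None]] = None) -> None:
        # Optional hook called before and after each stage, e.g. a device sync.
        self.sync = sync
        self.seconds: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        if self.sync is not None:
            self.sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            if self.sync is not None:
                self.sync()
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed
```

Usage would look like `with timer.stage("inference"): output = engine.step(request)`. On a GPU, passing `sync=torch.cuda.synchronize` is advisable, since CUDA kernels launch asynchronously and an unsynchronized timer can attribute inference time to a later stage such as to_cpu.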

Current limitations

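  • The current path is single-GPU ws1. There is no tensor parallelism, sequence parallelism, continuous batching, or request-level scheduling.
  • The examples disable CUDA graphs by default. The denoise loop is a Python-level UniPC loop that favors clarity and ease of alignment with the reference.
  • T2AV additionally loads the sound_tokenizer / AVAE and advances the sound latent in lockstep on every step, so both memory use and run time go up.
  • Prompt tokenization and media saving both happen outside the engine; to measure the model itself, look at the preprocess, to_cpu, and encode timings separately.
  • PhyAI has not yet done dedicated kernel work, graph capture, batching, or end-to-end throughput optimization for Cosmos3 T2V/T2AV. What is shown here is the baseline path, not the performance ceiling.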

Complete code

import math

import torch

from phyai.engine import Engine, EngineArgs
from phyai.engine_config import DeviceConfig, EngineConfig, RuntimeConfig
from phyai.models.cosmos3 import Cosmos3T2VRequest, pixel_to_latent_shape
from phyai.models.cosmos3.main_cosmos3 import Cosmos3Args
from phyai_utils_tools.models.cosmos3 import (
    Cosmos3GenerationPostProcessor,
    Cosmos3Processor,
)

checkpoint_dir = "/path/to/Cosmos3-Nano"
device = "cuda"
dtype = torch.bfloat16
num_frames = 189
height = 720
width = 1280
fps = 24.0
with_sound = False

engine = Engine(
    EngineArgs(
        plugin="cosmos3",
        plugin_args=Cosmos3Args(
            checkpoint_dir=checkpoint_dir,
            flow_shift=10.0,
            use_karras_sigmas=False,
            load_sound=(True if with_sound else None),
        ),
        config=EngineConfig(
            device=DeviceConfig(target=device, params_dtype=dtype),
            runtime=RuntimeConfig(use_cuda_graph=False),
        ),
    )
)

try:
    processor = Cosmos3Processor(
        f"{checkpoint_dir}/text_tokenizer",
        fps=fps,
        num_frames=num_frames,
        height=height,
        width=width,
        append_metadata=True,
    )
    cond, uncond = processor.tokenize_pair(
        "A red sports car driving along a coastal road at sunset.",
        negative_prompt=None,
        device=device,
    )

    request = Cosmos3T2VRequest(
        text_ids=cond.text_ids,
        text_mask=cond.text_mask,
        neg_text_ids=uncond.text_ids,
        neg_text_mask=uncond.text_mask,
        video_shape=pixel_to_latent_shape(num_frames, height, width),
        fps=fps,
        num_inference_steps=35,
        guidance_scale=6.0,
        seed=42,
        sound_frames=(math.ceil(num_frames / fps * 25.0) if with_sound else None),
    )

    output = engine.step(request)
    postprocessor = Cosmos3GenerationPostProcessor(fps=fps)
    media = postprocessor.postprocess(output)
    postprocessor.save_mp4(media, ".cache/cosmos3_t2v.mp4")
finally:
    engine.close()