250217 Step-Video-T2V Reading & Porting
Introduction
Read through the Step-Video-T2V code (git id d3ca3d6) and port it to Ascend NPUs.
Framework highlights
api/call_remote_server.py demonstrates a split-compute setup in which the VAE and the text encoders run on separate GPUs as remote services.
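A minimal sketch of the client side of that split: the pipeline process posts a prompt to a text-encoder service on its own GPU and gets tensors back. The endpoint path and payload format here are illustrative assumptions, not the exact protocol of api/call_remote_server.py.

```python
import pickle
import requests

# Hypothetical endpoint; the real service URL/port comes from the repo's config.
CAPTION_URL = "http://127.0.0.1:8080/caption"

def remote_embedding(prompt: str):
    # Ask the remote text-encoder service to embed the prompt.
    resp = requests.post(CAPTION_URL, json={"prompt": prompt}, timeout=600)
    resp.raise_for_status()
    # Assume the service replies with pickled CPU tensors
    # ({'y', 'y_mask', 'clip_embedding'}, matching the dict built later in these notes).
    return pickle.loads(resp.content)
```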
JSON configs
step_llm (text encoder)
```json
{
"_name_or_path": "/mnt/shared-storage/tenant/opensource/step_llm",
"allow_transformer_engine": false,
"architectures": [
"Step1Model"
],
"attention_dropout": 0.0,
"attention_impl": "GQA",
"base_batch_size": 128,
"embedding_weights_in_fp32": false,
"ffn_hidden_size": 16896,
"fp32_residual_connection": false,
"hidden_dropout": 0.0,
"hidden_size": 6144,
"kv_channels": 128,
"layernorm_epsilon": 1e-05,
"max_position_embeddings": 16384,
"num_attention_groups": 8,
"num_attention_heads": 48,
"num_layers": 48,
"orig_vocab_size": 65536,
"overlap_p2p_comm": true,
"padded_vocab_size": 65536,
"params_dtype": "torch.bfloat16",
"seq_length": 16384,
"swiglu_recompute_silu_dot": true,
"tokens_to_generate": 512,
"torch_dtype": "bfloat16",
"transformers_version": "4.48.3",
"use_flash_attn": true,
"virtual_pipeline_model_parallel_size": 3
}
```
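Several of these values constrain each other; a quick arithmetic check of the GQA geometry, with the numbers copied from the config above:

```python
# step_llm attention geometry from the config above.
hidden_size = 6144
num_attention_heads = 48
num_attention_groups = 8   # GQA: number of KV head groups
kv_channels = 128          # per-head dim

assert hidden_size == num_attention_heads * kv_channels  # 48 * 128 = 6144
print(num_attention_heads // num_attention_groups)  # 6 query heads share each KV group
```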
transformer
```json
{
"_class_name": "StepVideoModel",
"_diffusers_version": "0.31.0",
"attention_head_dim": 128,
"attention_type": "parallel",
"caption_channels": [
6144,
1024
],
"dropout": 0.0,
"in_channels": 64,
"norm_elementwise_affine": false,
"norm_eps": 1e-06,
"norm_type": "ada_norm_single",
"num_attention_heads": 48,
"num_layers": 48,
"out_channels": 64,
"patch_size": 1,
"use_additional_conditions": false
}
```
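caption_channels = [6144, 1024] lines up with the two text encoders (stepllm hidden_size 6144, HunyuanClip 1024), and the DiT inner dim is 48 heads × 128 = 6144. A sketch of what the dual caption projection implies; the layer names are illustrative, not the actual StepVideoModel attributes:

```python
import torch
import torch.nn as nn

inner_dim = 48 * 128  # num_attention_heads * attention_head_dim = 6144

# One projection per caption stream, each mapping into the DiT width.
proj_llm = nn.Linear(6144, inner_dim)    # stepllm hidden states
proj_clip = nn.Linear(1024, inner_dim)   # HunyuanClip hidden states

y = torch.randn(1, 320, 6144)    # stepllm output (max_length=320)
clip = torch.randn(1, 77, 1024)  # clip output (max_length=77)

# Clip tokens come first, matching the left-padding of y_mask in embedding() below.
encoder_hidden_states = torch.cat([proj_clip(clip), proj_llm(y)], dim=1)
print(encoder_hidden_states.shape)  # torch.Size([1, 397, 6144])
```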
flow matching scheduler
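The core of a discrete flow-matching sampler is an Euler step along the predicted velocity between sigma levels. A minimal sketch of the idea, not the exact FlowMatchDiscreteScheduler interface:

```python
import torch

def fm_euler_step(x_t, v_pred, sigma, sigma_next):
    # Move the sample linearly along the predicted velocity.
    return x_t + (sigma_next - sigma) * v_pred

sigmas = torch.linspace(1.0, 0.0, 51)   # e.g. 50 denoising steps
x = torch.randn(1, 64, 8, 8)            # stand-in latent
for i in range(len(sigmas) - 1):
    v = -x                              # stand-in for the DiT's velocity output
    x = fm_euler_step(x, v, sigmas[i], sigmas[i + 1])
```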
VAE
```python
from stepvideo.vae.vae import AutoencoderKL
```
- The VAE weights are roughly 1–2 GB.
- How does it hook into the DiT pipeline? How can it incrementally replace OSP1.5? (See the decode sketch after this list.)
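As a first answer to the wiring question: the decode side sits after the DiT denoising loop, mapping the sampler's final latents back to pixel frames. The constructor and decode arguments below are assumptions for illustration; the actual stepvideo AutoencoderKL signature may differ.

```python
import torch
from stepvideo.vae.vae import AutoencoderKL

# Hypothetical wiring: decode the sampler's final latents into frames.
vae = AutoencoderKL("<vae_model_dir>")  # constructor args assumed
vae = vae.to("cuda").eval()

latents = torch.randn(1, 16, 64, 8, 8, device="cuda")  # shape illustrative only
with torch.no_grad():
    frames = vae.decode(latents)  # latent video -> pixel-space frames
```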
text encoder
- stepllm (~40 GB) and HunyuanClip (~4 GB) are fused into self.caption.
- A prompt goes through encode_prompt to produce encoder_hidden_states, which then passes through caption_projection.

The wrapper methods:
```python
# Excerpt from the wrapper behind self.caption; `dtype` and `device`
# are defined elsewhere in the surrounding code.
def build_llm(self, model_dir):
    from stepvideo.text_encoder.stepllm import STEP1TextEncoder
    text_encoder = STEP1TextEncoder(model_dir, max_length=320).to(dtype).to(device).eval()
    print("Initialized text encoder...")
    return text_encoder

def build_clip(self, model_dir):
    from stepvideo.text_encoder.clip import HunyuanClip
    clip = HunyuanClip(model_dir, max_length=77).to(device).eval()
    print("Initialized clip encoder...")
    return clip

def embedding(self, prompts, *args, **kwargs):
    with torch.no_grad():
        try:
            y, y_mask = self.text_encoder(prompts)   # stepllm hidden states + mask
            clip_embedding, _ = self.clip(prompts)   # HunyuanClip hidden states
            len_clip = clip_embedding.shape[1]
            y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)  # pad attention_mask with clip's length
            data = {
                'y': y.detach().cpu(),
                'y_mask': y_mask.detach().cpu(),
                'clip_embedding': clip_embedding.to(torch.bfloat16).detach().cpu()
            }
            return data
        except Exception as err:
            print(f"{err}")
            return None
```
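The pad call above is worth spelling out: it prepends clip's sequence length to the llm attention mask, so the fused text sequence is ordered [clip tokens | stepllm tokens]:

```python
import torch

y_mask = torch.ones(1, 320, dtype=torch.long)  # stepllm mask, max_length=320
len_clip = 77                                  # HunyuanClip max_length
y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)
print(y_mask.shape)  # torch.Size([1, 397]) == 77 + 320
```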
DiT
StepVideoModel
- How is the text encoder hooked in? It goes through a URL-based remote API; how do we change that back?
  - Just revert the asyncio.run call inside encode_prompt (see the sketch below).
- The xfuser parallelism library: does it need to be removed?
- flow matching: the scheduler is FlowMatchDiscreteScheduler.

The model weights (diffusion_pytorch_model-00001-of-00006.safetensors plus the remaining shards) total about 58 GB.
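A sketch of the revert suggested above: instead of encode_prompt using asyncio.run to post to the URL service, call the local caption wrapper directly. `caption` is assumed to be the object holding the build_llm/build_clip/embedding methods quoted earlier; the function name is illustrative.

```python
def encode_prompt_local(caption, prompts, device="cuda"):
    # Bypass the remote caption service: run both encoders in-process.
    data = caption.embedding(prompts)  # {'y', 'y_mask', 'clip_embedding'}
    return (
        data["y"].to(device),
        data["y_mask"].to(device),
        data["clip_embedding"].to(device),
    )
```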
Porting workflow
- Training first, then inference.
- Turn the hyperparameters hard-coded for inference into optional hyperparameters in the JSON, and keep tensor shapes aligned wherever components hand tensors to each other.
- Question: how are the VAE and DiT channels aligned? The hyperparameters happen to both be 8: predictor:in_channels == ae:latent_dim == 8. (A consistency-check sketch follows this list.)
- If a hyperparameter changes, can the weights still be loaded?
- Inference: since the weights' hyperparameters are fixed, the model may not even fit in memory as-is.
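A small guard for the channel-alignment question above; the key names (predictor:in_channels, ae:latent_dim) follow these notes and are not a confirmed config schema:

```python
import json

def check_channel_alignment(predictor_cfg_path, ae_cfg_path):
    # Fail fast if the DiT's input channels don't match the VAE latent dim.
    with open(predictor_cfg_path) as f:
        predictor = json.load(f)
    with open(ae_cfg_path) as f:
        ae = json.load(f)
    assert predictor["in_channels"] == ae["latent_dim"], (
        f"channel mismatch: {predictor['in_channels']} != {ae['latent_dim']}"
    )
```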