
250217 Step-Video-T2V Reading & Porting

Introduction

Read the Step-Video-T2V code (git id d3ca3d6) and port it to Ascend.

Framework Characteristics

api/call_remote_server.py demonstrates a compute setup in which the VAE and the text encoder run split across separate GPUs.
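A minimal sketch of that split (hypothetical endpoint names and wire format, not the repo's actual API): each heavy component runs on its own GPU behind an HTTP endpoint, and the DiT process serializes tensors over the wire instead of co-locating everything on one device.

import io
import requests
import torch

def remote_call(url, payload):
    buf = io.BytesIO()
    torch.save(payload, buf)                      # serialize tensors for the wire
    resp = requests.post(url, data=buf.getvalue())
    resp.raise_for_status()
    return torch.load(io.BytesIO(resp.content))   # deserialize the reply

# The DiT process asks the caption server (stepllm + clip) for embeddings and
# the VAE server for decoded frames (URLs are placeholders):
# emb = remote_call("http://caption-host:8080/caption", {"prompts": ["a cat"]})
# video = remote_call("http://vae-host:8080/vae", {"samples": latents.cpu()})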

JSON Configs

step_llm (text encoder)
{
    "_name_or_path": "/mnt/shared-storage/tenant/opensource/step_llm",
    "allow_transformer_engine": false,
    "architectures": [
        "Step1Model"
    ],
    "attention_dropout": 0.0,
    "attention_impl": "GQA",
    "base_batch_size": 128,
    "embedding_weights_in_fp32": false,
    "ffn_hidden_size": 16896,
    "fp32_residual_connection": false,
    "hidden_dropout": 0.0,
    "hidden_size": 6144,
    "kv_channels": 128,
    "layernorm_epsilon": 1e-05,
    "max_position_embeddings": 16384,
    "num_attention_groups": 8,
    "num_attention_heads": 48,
    "num_layers": 48,
    "orig_vocab_size": 65536,
    "overlap_p2p_comm": true,
    "padded_vocab_size": 65536,
    "params_dtype": "torch.bfloat16",
    "seq_length": 16384,
    "swiglu_recompute_silu_dot": true,
    "tokens_to_generate": 512,
    "torch_dtype": "bfloat16",
    "transformers_version": "4.48.3",
    "use_flash_attn": true,
    "virtual_pipeline_model_parallel_size": 3
}
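A quick shape check on the attention config above: with GQA, the 48 query heads share 8 KV groups, and hidden_size is exactly heads times kv_channels.

num_attention_heads, num_attention_groups, kv_channels = 48, 8, 128
assert num_attention_heads % num_attention_groups == 0
heads_per_group = num_attention_heads // num_attention_groups   # 6 query heads per KV group
hidden_size = num_attention_heads * kv_channels                 # 48 * 128 = 6144, matches the config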
transformer
{
    "_class_name": "StepVideoModel",
    "_diffusers_version": "0.31.0",
    "attention_head_dim": 128,
    "attention_type": "parallel",
    "caption_channels": [
        6144,
        1024
    ],
    "dropout": 0.0,
    "in_channels": 64,
    "norm_elementwise_affine": false,
    "norm_eps": 1e-06,
    "norm_type": "ada_norm_single",
    "num_attention_heads": 48,
    "num_layers": 48,
    "out_channels": 64,
    "patch_size": 1,
    "use_additional_conditions": false
}
flow matching scheduler
{
    "_class_name": "FlowMatchDiscreteScheduler",
    "_diffusers_version": "0.31.0",
    "device": null,
    "num_train_timesteps": 1000,
    "reverse": false,
    "solver": "euler"
}
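With solver "euler" and reverse false, a discrete flow-matching step is a single linear move along the predicted velocity. A minimal generic sketch of that update (not the repo's exact implementation):

def euler_flow_match_step(x_t, v_pred, t, t_next, num_train_timesteps=1000):
    # Flow matching trains the model to predict a velocity field v(x, t);
    # one Euler step advances the sample by dt along that prediction.
    dt = (t_next - t) / num_train_timesteps
    return x_t + dt * v_pred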

VAE

  • from stepvideo.vae.vae import AutoencoderKL
  • The VAE weights are roughly 1-2 GB [3].
  • How is it wired into the DiT pipeline? How do we gradually swap it in to replace OSP1.5?
def decode_vae(self, samples):
    # self.vae(...) is an async call (the remote VAE decode in this split
    # setup); asyncio.run blocks until the frames come back. Latents are
    # moved to CPU before being shipped off-device.
    samples = asyncio.run(self.vae(samples.cpu()))
    return samples
On enabling the dual path:
use_conv_shortcut
version == 2  # does this enable it?
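For context, in the common KL-VAE resnet-block pattern (the CompVis/diffusers lineage; not verified line by line against stepvideo.vae.vae), the second path only becomes a real conv when channel counts differ, and use_conv_shortcut picks a 3x3 conv over a 1x1 projection:

import torch.nn as nn

def build_shortcut(in_ch, out_ch, use_conv_shortcut):
    # Identity when shapes already match; otherwise project the skip path.
    if in_ch == out_ch:
        return nn.Identity()
    kernel = 3 if use_conv_shortcut else 1
    return nn.Conv2d(in_ch, out_ch, kernel_size=kernel, padding=kernel // 2)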

text encoder

  • stepllm (~40 GB) and HunyuanClip (~4 GB) [3] are fused into self.caption.
  • A prompt goes through encode_prompt to produce encoder_hidden_states, which then pass through caption_projection (a projection sketch follows the code below).
    def build_llm(self, model_dir):
        from stepvideo.text_encoder.stepllm import STEP1TextEncoder
        text_encoder = STEP1TextEncoder(model_dir, max_length=320).to(dtype).to(device).eval()
        print("Inintialized text encoder...")
        return text_encoder

    def build_clip(self, model_dir):
        from stepvideo.text_encoder.clip import HunyuanClip
        clip = HunyuanClip(model_dir, max_length=77).to(device).eval()
        print("Inintialized clip encoder...")
        return clip

    def embedding(self, prompts, *args, **kwargs):
        with torch.no_grad():
            try:
                y, y_mask = self.text_encoder(prompts)

                clip_embedding, _ = self.clip(prompts)

                len_clip = clip_embedding.shape[1]
                y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)   ## pad attention_mask with clip's length

                data = {
                    'y': y.detach().cpu(),
                    'y_mask': y_mask.detach().cpu(),
                    'clip_embedding': clip_embedding.to(torch.bfloat16).detach().cpu()
                }

                return data
            except Exception as err:
                print(f"{err}")
                return None
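The caption_channels = [6144, 1024] entry in the transformer config, together with the left-padding of y_mask above, suggests each text stream is projected to the DiT width and the clip tokens are prepended. A hedged sketch of that wiring (hypothetical module, not the repo's exact caption_projection):

import torch
import torch.nn as nn

class CaptionProjectionSketch(nn.Module):
    def __init__(self, caption_channels=(6144, 1024), inner_dim=48 * 128):
        super().__init__()
        self.proj_llm = nn.Linear(caption_channels[0], inner_dim)    # stepllm stream
        self.proj_clip = nn.Linear(caption_channels[1], inner_dim)   # HunyuanClip stream

    def forward(self, y, clip_embedding):
        # y: (B, 320, 6144); clip_embedding: (B, 77, 1024). Clip tokens go
        # first, matching the left-pad of y_mask by len_clip above.
        return torch.cat([self.proj_clip(clip_embedding), self.proj_llm(y)], dim=1)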

DiT

  • StepVideoModel
  • How is the text-encoder wired in? It is a URL-based interface; how do we change it back?
    • Just revert the asyncio.run call inside encode_prompt (a before/after sketch follows this list).
  • The xfuser parallelism library: does it have to be removed?
  • Flow-matching scheduler: FlowMatchDiscreteScheduler.
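The before/after sketch mentioned above, assuming the caption pipeline object from the embedding() snippet (names and the device string are hypothetical):

def encode_prompt_local(caption_pipeline, prompts, device="npu"):
    # Shipped path: data = asyncio.run(self.caption(prompts)) posts the prompt
    # to the remote caption server. For a single-host Ascend port, call the
    # local embedding() method shown earlier instead.
    data = caption_pipeline.embedding(prompts)
    return (data['y'].to(device),
            data['y_mask'].to(device),
            data['clip_embedding'].to(device))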

The transformer weights (diffusion_pytorch_model-00001-of-00006.safetensors and the remaining shards) total about 58 GB [3].

Porting Workflow

  1. Train first, then run inference.
  2. Turn the hyperparameters hard-coded for inference into optional entries in the JSON configs, and keep tensors aligned wherever components hand them to each other.
    1. Question: how are the VAE and DiT channels aligned? The hyperparameters happen to match at 8: predictor:in_channels == ae:latent_dim == 8 (a sanity-check sketch follows this list).
    2. If a hyperparameter changes, can the weights still be loaded?
  3. Inference: since the hyperparameters are fixed by the weights, the model may not even fit on the device.
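The sanity check from step 2.1, sketched against the JSON configs (the config path and the latent_dim argument are assumptions, not the repo's API):

import json

def check_channel_alignment(dit_config_path, vae_latent_dim):
    # The DiT's in_channels/out_channels must equal the VAE's latent channel
    # count, or shapes break when latents cross the component boundary.
    with open(dit_config_path) as f:
        cfg = json.load(f)
    assert cfg["in_channels"] == cfg["out_channels"] == vae_latent_dim, (
        f"DiT channels {cfg['in_channels']}/{cfg['out_channels']} "
        f"!= VAE latent_dim {vae_latent_dim}")

On step 2.2: checkpoint tensors carry fixed shapes, so a hyperparameter change that alters layer shapes will fail load_state_dict even with strict=False (that flag only tolerates missing/unexpected keys, not shape mismatches).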

References


  1. 接力DeepSeek,阶跃星辰直接开源两款国产多模态大模型 (news coverage: following DeepSeek, StepFun open-sources two Chinese multimodal models)

  2. https://blog.csdn.net/weixin_41446370/article/details/145768114

  3. Model weights
