250217 Step-Video-T2V Reading & Porting
Introduction
Read through the Step-Video-T2V code (git id d3ca3d6) and port it to Ascend NPUs.
Framework highlights
api/call_remote_server.py demonstrates a split-compute setup in which the VAE and the text encoders run on separate GPUs as remote services.
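A minimal sketch of the client side of that split: the pipeline process posts a prompt to a text-encoder service on its own GPU and gets tensors back. The endpoint path and payload format here are illustrative assumptions, not the exact protocol of api/call_remote_server.py.

```python
import pickle
import requests

# Hypothetical endpoint; the real service URL/port comes from the repo's config.
CAPTION_URL = "http://127.0.0.1:8080/caption"

def remote_embedding(prompt: str):
    # Ask the remote text-encoder service to embed the prompt.
    resp = requests.post(CAPTION_URL, json={"prompt": prompt}, timeout=600)
    resp.raise_for_status()
    # Assume the service replies with pickled CPU tensors
    # ({'y', 'y_mask', 'clip_embedding'}, matching the dict built later in these notes).
    return pickle.loads(resp.content)
```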
JSON configs
step_llm (text encoder)
```json
{
"_name_or_path": "/mnt/shared-storage/tenant/opensource/step_llm",
"allow_transformer_engine": false,
"architectures": [
"Step1Model"
],
"attention_dropout": 0.0,
"attention_impl": "GQA",
"base_batch_size": 128,
"embedding_weights_in_fp32": false,
"ffn_hidden_size": 16896,
"fp32_residual_connection": false,
"hidden_dropout": 0.0,
"hidden_size": 6144,
"kv_channels": 128,
"layernorm_epsilon": 1e-05,
"max_position_embeddings": 16384,
"num_attention_groups": 8,
"num_attention_heads": 48,
"num_layers": 48,
"orig_vocab_size": 65536,
"overlap_p2p_comm": true,
"padded_vocab_size": 65536,
"params_dtype": "torch.bfloat16",
"seq_length": 16384,
"swiglu_recompute_silu_dot": true,
"tokens_to_generate": 512,
"torch_dtype": "bfloat16",
"transformers_version": "4.48.3",
"use_flash_attn": true,
"virtual_pipeline_model_parallel_size": 3
}
```
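Several of these values constrain each other; a quick arithmetic check of the GQA geometry, with the numbers copied from the config above:

```python
# step_llm attention geometry from the config above.
hidden_size = 6144
num_attention_heads = 48
num_attention_groups = 8   # GQA: number of KV head groups
kv_channels = 128          # per-head dim

assert hidden_size == num_attention_heads * kv_channels  # 48 * 128 = 6144
print(num_attention_heads // num_attention_groups)  # 6 query heads share each KV group
```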
transformer
```json
{
"_class_name": "StepVideoModel",
"_diffusers_version": "0.31.0",
"attention_head_dim": 128,
"attention_type": "parallel",
"caption_channels": [
6144,
1024
],
"dropout": 0.0,
"in_channels": 64,
"norm_elementwise_affine": false,
"norm_eps": 1e-06,
"norm_type": "ada_norm_single",
"num_attention_heads": 48,
"num_layers": 48,
"out_channels": 64,
"patch_size": 1,
"use_additional_conditions": false
}
```
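caption_channels = [6144, 1024] lines up with the two text encoders (stepllm hidden_size 6144, HunyuanClip 1024), and the DiT inner dim is 48 heads × 128 = 6144. A sketch of what the dual caption projection implies; the layer names are illustrative, not the actual StepVideoModel attributes:

```python
import torch
import torch.nn as nn

inner_dim = 48 * 128  # num_attention_heads * attention_head_dim = 6144

# One projection per caption stream, each mapping into the DiT width.
proj_llm = nn.Linear(6144, inner_dim)    # stepllm hidden states
proj_clip = nn.Linear(1024, inner_dim)   # HunyuanClip hidden states

y = torch.randn(1, 320, 6144)    # stepllm output (max_length=320)
clip = torch.randn(1, 77, 1024)  # clip output (max_length=77)

# Clip tokens come first, matching the left-padding of y_mask in embedding() below.
encoder_hidden_states = torch.cat([proj_clip(clip), proj_llm(y)], dim=1)
print(encoder_hidden_states.shape)  # torch.Size([1, 397, 6144])
```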
flow matching scheduler
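The core of a discrete flow-matching sampler is an Euler step along the predicted velocity between sigma levels. A minimal sketch of the idea, not the exact FlowMatchDiscreteScheduler interface:

```python
import torch

def fm_euler_step(x_t, v_pred, sigma, sigma_next):
    # Move the sample linearly along the predicted velocity.
    return x_t + (sigma_next - sigma) * v_pred

sigmas = torch.linspace(1.0, 0.0, 51)   # e.g. 50 denoising steps
x = torch.randn(1, 64, 8, 8)            # stand-in latent
for i in range(len(sigmas) - 1):
    v = -x                              # stand-in for the DiT's velocity output
    x = fm_euler_step(x, v, sigmas[i], sigmas[i + 1])
```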
VAE
```python
from stepvideo.vae.vae import AutoencoderKL
```
- The VAE weights are roughly 1–2 GB.
- How does it hook into the DiT pipeline? How can it incrementally replace OSP1.5? (See the decode sketch after this list.)
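As a first answer to the wiring question: the decode side sits after the DiT denoising loop, mapping the sampler's final latents back to pixel frames. The constructor and decode arguments below are assumptions for illustration; the actual stepvideo AutoencoderKL signature may differ.

```python
import torch
from stepvideo.vae.vae import AutoencoderKL

# Hypothetical wiring: decode the sampler's final latents into frames.
vae = AutoencoderKL("<vae_model_dir>")  # constructor args assumed
vae = vae.to("cuda").eval()

latents = torch.randn(1, 16, 64, 8, 8, device="cuda")  # shape illustrative only
with torch.no_grad():
    frames = vae.decode(latents)  # latent video -> pixel-space frames
```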
text encoder
- stepllm (~40 GB) and HunyuanClip (~4 GB) are fused into self.caption.
- A prompt goes through encode_prompt to produce encoder_hidden_states, which then passes through caption_projection.

The wrapper methods:
```python
# Excerpt from the wrapper behind self.caption; `dtype` and `device`
# are defined elsewhere in the surrounding code.
def build_llm(self, model_dir):
    from stepvideo.text_encoder.stepllm import STEP1TextEncoder
    text_encoder = STEP1TextEncoder(model_dir, max_length=320).to(dtype).to(device).eval()
    print("Initialized text encoder...")
    return text_encoder

def build_clip(self, model_dir):
    from stepvideo.text_encoder.clip import HunyuanClip
    clip = HunyuanClip(model_dir, max_length=77).to(device).eval()
    print("Initialized clip encoder...")
    return clip

def embedding(self, prompts, *args, **kwargs):
    with torch.no_grad():
        try:
            y, y_mask = self.text_encoder(prompts)   # stepllm hidden states + mask
            clip_embedding, _ = self.clip(prompts)   # HunyuanClip hidden states
            len_clip = clip_embedding.shape[1]
            y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)  # pad attention_mask with clip's length
            data = {
                'y': y.detach().cpu(),
                'y_mask': y_mask.detach().cpu(),
                'clip_embedding': clip_embedding.to(torch.bfloat16).detach().cpu()
            }
            return data
        except Exception as err:
            print(f"{err}")
            return None
```
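The pad call above is worth spelling out: it prepends clip's sequence length to the llm attention mask, so the fused text sequence is ordered [clip tokens | stepllm tokens]:

```python
import torch

y_mask = torch.ones(1, 320, dtype=torch.long)  # stepllm mask, max_length=320
len_clip = 77                                  # HunyuanClip max_length
y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)
print(y_mask.shape)  # torch.Size([1, 397]) == 77 + 320
```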
DiT
StepVideoModel
- How is the text encoder hooked in? It goes through a URL-based remote API; how do we change that back?
  - Just revert the asyncio.run call inside encode_prompt (see the sketch below).
- The xfuser parallelism library: does it need to be removed?
- flow matching: the scheduler is FlowMatchDiscreteScheduler.

The model weights (diffusion_pytorch_model-00001-of-00006.safetensors plus the remaining shards) total about 58 GB.
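A sketch of the revert suggested above: instead of encode_prompt using asyncio.run to post to the URL service, call the local caption wrapper directly. `caption` is assumed to be the object holding the build_llm/build_clip/embedding methods quoted earlier; the function name is illustrative.

```python
def encode_prompt_local(caption, prompts, device="cuda"):
    # Bypass the remote caption service: run both encoders in-process.
    data = caption.embedding(prompts)  # {'y', 'y_mask', 'clip_embedding'}
    return (
        data["y"].to(device),
        data["y_mask"].to(device),
        data["clip_embedding"].to(device),
    )
```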
Porting workflow
- Training first, then inference.
- Turn the hyperparameters hard-coded for inference into optional hyperparameters in the JSON, and keep tensor shapes aligned wherever components hand tensors to each other.
- Question: how are the VAE and DiT channels aligned? The hyperparameters happen to both be 8: predictor:in_channels == ae:latent_dim == 8. (A consistency-check sketch follows this list.)
- If a hyperparameter changes, can the weights still be loaded?
- Inference: since the weights' hyperparameters are fixed, the model may not even fit in memory as-is.
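A small guard for the channel-alignment question above; the key names (predictor:in_channels, ae:latent_dim) follow these notes and are not a confirmed config schema:

```python
import json

def check_channel_alignment(predictor_cfg_path, ae_cfg_path):
    # Fail fast if the DiT's input channels don't match the VAE latent dim.
    with open(predictor_cfg_path) as f:
        predictor = json.load(f)
    with open(ae_cfg_path) as f:
        ae = json.load(f)
    assert predictor["in_channels"] == ae["latent_dim"], (
        f"channel mismatch: {predictor['in_channels']} != {ae['latent_dim']}"
    )
```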