
[Performance analysis] I spent a few days with nsys profile trying to analyze the current generate-time bottleneck; sharing my conclusions (correctness not guaranteed), hoping for discussion #821

Open
@tarjintor

Description


Since kt (KTransformers) uses CUDA Graph for acceleration, a simple profiler doesn't seem to show the timing of the individual steps inside the graph, so I used nsys profile.
Profiling every layer would produce too much data, so I just sampled a few layers from the beginning, middle, and end.
The steps required:

1. Modify the source. I'm using version 0.2.0. Add `import torch.cuda.nvtx as nvtx`, then insert instrumentation into `KDeepseekV3MoE.forward` in experts.py. First, wrap the gate:

```python
target_layers = [3, 22, 50]
if self.layer_idx_nvtx in target_layers:
    nvtx.range_push(f"moe_gate_{self.layer_idx_nvtx}")
topk_idx, topk_weight = self.gate(hidden_states)
if self.layer_idx_nvtx in target_layers:
    nvtx.range_pop()
```
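
As a side note, the repeated if/range_push/range_pop pattern can be folded into a small helper. This is my own sketch, not KTransformers code; it uses the standard `torch.cuda.nvtx` push/pop API:

```python
# A small helper (my own sketch, not part of the KTransformers source) to avoid
# repeating the if/range_push/range_pop boilerplate at every call site.
from contextlib import contextmanager

import torch.cuda.nvtx as nvtx

TARGET_LAYERS = {3, 22, 50}

@contextmanager
def nvtx_range(name: str, layer_idx: int):
    """Open an NVTX range for the sampled layers only; a no-op elsewhere."""
    if layer_idx in TARGET_LAYERS:
        nvtx.range_push(f"{name}_{layer_idx}")
        try:
            yield
        finally:
            nvtx.range_pop()
    else:
        yield

# Usage inside forward():
#     with nvtx_range("moe_gate", self.layer_idx_nvtx):
#         topk_idx, topk_weight = self.gate(hidden_states)
```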

Instrument the routed experts. Note that this only submits a CPU task asynchronously; it does not mean the task has completed:

```python
if self.layer_idx_nvtx in target_layers:
    nvtx.range_push(f"cpuinfer_{self.layer_idx_nvtx}")
self.experts.generate_experts.submit_for_one_decode(hidden_states[0], topk_idx[0], topk_weight[0])
if self.layer_idx_nvtx in target_layers:
    nvtx.range_pop()
```

Instrument the shared experts:

```python
if self.config.n_shared_experts is not None:
    if self.layer_idx_nvtx in target_layers:
        nvtx.range_push(f"shared_experts_{self.layer_idx_nvtx}")
    y_ = self.shared_experts(identity).squeeze(0)
    if self.layer_idx_nvtx in target_layers:
        nvtx.range_pop()
```

Instrument fetching the routed experts' result back from the CPU:

```python
if self.layer_idx_nvtx in target_layers:
    nvtx.range_push(f"self.experts.generate_experts.sync_for_one_decode {self.layer_idx_nvtx}")
y = self.experts.generate_experts.sync_for_one_decode().unsqueeze(0)
if self.layer_idx_nvtx in target_layers:
    nvtx.range_pop()
```
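
To make the submit/sync split concrete: the range around `submit_for_one_decode` only measures enqueue overhead, while the actual CPU expert latency appears as wait time inside `sync_for_one_decode`, minus whatever overlaps with the shared-experts GPU work. Here is a toy sketch of that pattern, with plain Python threads standing in for cpuinfer (names hypothetical):

```python
# Toy model (not KTransformers code) of the decode-time overlap the ranges above
# measure: submit CPU expert work asynchronously, run GPU work in parallel,
# then block on the CPU result.
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)

def decode_step(run_cpu_experts, run_gpu_shared_experts):
    t0 = time.perf_counter()
    fut = pool.submit(run_cpu_experts)            # like submit_for_one_decode: returns at once
    submit_ms = (time.perf_counter() - t0) * 1e3  # should be tiny; it only enqueues work

    y_shared = run_gpu_shared_experts()           # runs while the CPU experts compute

    t1 = time.perf_counter()
    y_routed = fut.result()                       # like sync_for_one_decode: blocks until done
    wait_ms = (time.perf_counter() - t1) * 1e3    # a long wait here => CPU side is the bottleneck
    return y_shared, y_routed, submit_ms, wait_ms
```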

Then, for comparison, I also instrumented the attention path. In `DeepseekV3DecoderLayer.forward` in modeling_deepseek_v3.py, add:


```python
target_layer = [2, 3, 4, 22, 23, 50, 51]
# Self Attention
if self.layer_idx_nvtx in target_layer:
    nvtx.range_push(f"self_attention_{self.layer_idx_nvtx}")
hidden_states, self_attn_weights, present_key_value = self.self_attn(
    hidden_states=hidden_states,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_value=past_key_value,
    output_attentions=output_attentions,
    use_cache=use_cache,
    cache_position=cache_position,
    **kwargs,
)
if self.layer_idx_nvtx in target_layer:
    nvtx.range_pop()
```
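
One caveat for anyone reproducing this: `layer_idx_nvtx` is not an upstream attribute, so it has to be added to each instrumented module yourself. A minimal sketch, assuming the layer index that the HF constructor already receives:

```python
# Sketch only: expose the layer index under the name the instrumentation expects.
# DeepseekV3DecoderLayer.__init__ already gets layer_idx in the HF code; the MoE
# module needs the same attribute set however its constructor allows.
def patch_layer_idx(module, layer_idx: int) -> None:
    module.layer_idx_nvtx = layer_idx  # used only for the NVTX range labels
```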

Then run:

```bash
nsys profile --trace=cuda,nvtx,osrt,cudnn --cuda-graph-trace=node -o report python model_exam.py
```
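
If opening the GUI is inconvenient, the NVTX ranges can also be summarized on the command line. The report name varies across nsys versions, so treat this as a sketch and check `nsys stats --help-reports` on your install:

```bash
# Summarize NVTX range timings from the capture (-o report writes report.nsys-rep
# on recent nsys versions, report.qdrep on older ones).
nsys stats --report nvtx_sum report.nsys-rep   # older versions call the report nvtxsum
```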

Here model_exam.py is just a lightly modified local_chat.py: instead of polling for input, hard-code the prompt to '你好', run one round of generation, and break (a sketch of the tweak follows below).
That yields a profiling report. The file is fairly large, so I'm only posting a screenshot.
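
Roughly like this (hypothetical names; `chat_once` stands in for whatever generate call local_chat.py actually makes):

```python
# Hypothetical sketch of the model_exam.py change: local_chat.py's interactive
# loop replaced by one fixed prompt so nsys captures exactly one decode pass.
def main(chat_once) -> None:
    while True:
        content = "你好"    # fixed prompt instead of input()
        chat_once(content)  # one round of generation
        break               # stop so the profile stays small
```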
My setup: Ubuntu 22.04, CUDA 12.2, Python 3.11, a Threadripper 3960X with 20 cores in use, 256 GB RAM, the 2.51-bit quantized model, and an RTX 4090.
In the timeline, the per-layer time ratio of attention to MoE looks to be roughly 2:3. So, simply put, speeding up only the MoE (CPU) side or only the GPU side will each run into diminishing returns. For me, speeding up the MoE side has the higher marginal benefit right now; but once cpuinfer's share of the total time shrinks, optimizing the GPU-side kernels may become the better investment. Of course, for most people a 4090 is already good enough.
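
A back-of-the-envelope check of that intuition, using the rough 2:3 split (my own arithmetic, assuming everything outside attention and MoE is negligible):

```python
# Amdahl-style estimate from the observed per-layer split: attention=2, moe=3.
attn, moe = 2.0, 3.0
total = attn + moe
print(total / (attn + moe / 2))  # halve MoE time  -> ~1.43x overall
print(total / (attn / 2 + moe))  # halve attention -> ~1.25x overall
print(total / attn)              # MoE cost -> 0   -> 2.5x ceiling from CPU-side work alone
```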
Just throwing this out there to get the discussion going.

[Screenshot: nsys timeline showing per-layer attention vs. MoE timings]
