The model has 57B parameters in total, of which 14B are activated per token, so inference is faster than a dense 32B model while performance is better.
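The structure dump below can be reproduced by loading the checkpoint with transformers and printing the module tree. A minimal sketch, assuming the Hugging Face checkpoint name Qwen/Qwen2-57B-A14B-Instruct (substitute a local path if needed) and enough memory to hold the weights:

from transformers import AutoModelForCausalLM

# Load the MoE checkpoint; the checkpoint name is an assumption, not taken from the original post.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-57B-A14B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

print(type(model))  # the class shown on the first line below
print(model)        # the module tree shown below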
<class 'transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM'>
Qwen2MoeForCausalLM(
  (model): Qwen2MoeModel(
    (embed_tokens): Embedding(151936, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2MoeDecoderLayer(
        (self_attn): Qwen2MoeSdpaAttention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
          (rotary_emb): Qwen2MoeRotaryEmbedding()
        )
        (mlp): Qwen2MoeSparseMoeBlock(
          (gate): Linear(in_features=3584, out_features=64, bias=False)
          (experts): ModuleList(
            (0-63): 64 x Qwen2MoeMLP(
              (gate_proj): Linear(in_features=3584, out_features=2560, bias=False)
              (up_proj): Linear(in_features=3584, out_features=2560, bias=False)
              (down_proj): Linear(in_features=2560, out_features=3584, bias=False)
              (act_fn): SiLU()
            )
          )
          (shared_expert): Qwen2MoeMLP(
            (gate_proj): Linear(in_features=3584, out_features=20480, bias=False)
            (up_proj): Linear(in_features=3584, out_features=20480, bias=False)
            (down_proj): Linear(in_features=20480, out_features=3584, bias=False)
            (act_fn): SiLU()
          )
          (shared_expert_gate): Linear(in_features=3584, out_features=1, bias=False)
        )
        (input_layernorm): Qwen2MoeRMSNorm()
        (post_attention_layernorm): Qwen2MoeRMSNorm()
      )
    )
    (norm): Qwen2MoeRMSNorm()
  )
  (lm_head): Linear(in_features=3584, out_features=151936, bias=False)
)
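The per-parameter shapes listed next come from iterating over the model's parameters; a short sketch:

# Print the name and shape of every weight tensor in the model.
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")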
# Input embedding layer
model.embed_tokens.weight: torch.Size([151936, 3584])

# Main decoder layers: model.layers.0 is the first of 28 layers
# Attention sub-layer of model.layers.0
model.layers.0.self_attn.q_proj.weight: torch.Size([3584, 3584])
model.layers.0.self_attn.q_proj.bias: torch.Size([3584])
model.layers.0.self_attn.k_proj.weight: torch.Size([512, 3584])
model.layers.0.self_attn.k_proj.bias: torch.Size([512])
model.layers.0.self_attn.v_proj.weight: torch.Size([512, 3584])
model.layers.0.self_attn.v_proj.bias: torch.Size([512])
model.layers.0.self_attn.o_proj.weight: torch.Size([3584, 3584])
model.layers.0.mlp.gate.weight: torch.Size([64, 3584])

# MoE MLP sub-layer of model.layers.0 (routed experts)
model.layers.0.mlp.experts.0.gate_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.0.up_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.0.down_proj.weight: torch.Size([3584, 2560])
model.layers.0.mlp.experts.1.gate_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.1.up_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.1.down_proj.weight: torch.Size([3584, 2560])
model.layers.0.mlp.experts.2.gate_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.2.up_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.2.down_proj.weight: torch.Size([3584, 2560])
... there are 64 expert sub-modules in total; model.layers.0.mlp.experts.3 through model.layers.0.mlp.experts.62 are omitted here
model.layers.0.mlp.experts.63.gate_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.63.up_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.63.down_proj.weight: torch.Size([3584, 2560])

# Shared expert of the MoE MLP sub-layer of model.layers.0
model.layers.0.mlp.shared_expert.gate_proj.weight: torch.Size([20480, 3584])
model.layers.0.mlp.shared_expert.up_proj.weight: torch.Size([20480, 3584])
model.layers.0.mlp.shared_expert.down_proj.weight: torch.Size([3584, 20480])
model.layers.0.mlp.shared_expert_gate.weight: torch.Size([1, 3584])

# Qwen2MoeRMSNorm layers of model.layers.0
model.layers.0.input_layernorm.weight: torch.Size([3584])
model.layers.0.post_attention_layernorm.weight: torch.Size([3584])
... model.layers.1 through model.layers.27 are omitted; their structure is identical to model.layers.0

# Final normalization layer applied just before the output head
model.norm.weight: torch.Size([3584])

# Output projection that maps hidden states to the distribution over the 151936-token vocabulary
lm_head.weight: torch.Size([151936, 3584])
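Putting the printout together: in every decoder layer, the Qwen2MoeSparseMoeBlock routes each token to a small subset of the 64 experts chosen by the gate, and always adds the shared expert, scaled by a sigmoid of shared_expert_gate. Below is a simplified, self-contained sketch of that forward pass for illustration only; the dimensions follow the shapes above, while top_k=8 and the exact weighting scheme are assumptions about the routing config rather than facts read from the listing.

import torch
import torch.nn.functional as F
from torch import nn

class SwiGLUMLP(nn.Module):
    """Same shape as Qwen2MoeMLP: gate_proj/up_proj/down_proj with SiLU."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SparseMoeBlockSketch(nn.Module):
    def __init__(self, hidden_size=3584, num_experts=64, top_k=8,
                 moe_intermediate_size=2560, shared_intermediate_size=20480):
        super().__init__()
        self.top_k = top_k                                               # assumed routing setting
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)      # router
        self.experts = nn.ModuleList(
            [SwiGLUMLP(hidden_size, moe_intermediate_size) for _ in range(num_experts)])
        self.shared_expert = SwiGLUMLP(hidden_size, shared_intermediate_size)
        self.shared_expert_gate = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, x):                                  # x: (num_tokens, hidden_size)
        probs = F.softmax(self.gate(x), dim=-1)            # (num_tokens, num_experts)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Loop over experts for clarity; the real implementation scatters tokens instead.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        # The shared expert sees every token, scaled by a sigmoid gate.
        out += torch.sigmoid(self.shared_expert_gate(x)) * self.shared_expert(x)
        return out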