"""AWQ (Activation aware Weight Quantization) integration file"""

import importlib

from packaging import version

from ..activations import ACT2FN
from ..modeling_utils import PreTrainedModel
from ..utils import is_auto_awq_available, is_ipex_available, is_torch_available, logging
from ..utils.quantization_config import (
    AwqBackendPackingMethod,
    AwqConfig,
    AWQLinearVersion,
    ExllamaVersion,
)


if is_torch_available():
    import torch
    import torch.nn as nn


logger = logging.get_logger(__name__)

AWQ_FUSED_MAPPINGS = {
    "mistral": {
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "mlp": ["gate_proj", "up_proj", "down_proj"],
        "layernorm": ["input_layernorm", "post_attention_layernorm", "norm"],
        "use_alibi": False,
    },
    "mixtral": {
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "mlp": ["w1", "w3", "w2"],
        "layernorm": ["input_layernorm", "post_attention_layernorm", "norm"],
        "use_alibi": False,
        "rope_theta": 1000000.0,
    },
    "llama": {
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "mlp": ["gate_proj", "up_proj", "down_proj"],
        "layernorm": ["input_layernorm", "post_attention_layernorm", "norm"],
        "use_alibi": False,
    },
    "llava": {
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "mlp": ["gate_proj", "up_proj", "down_proj"],
        "layernorm": ["input_layernorm", "post_attention_layernorm", "norm"],
        "use_alibi": False,
    },
}

AWQ_SCALES_MAPPINGS = {
    "starcoder2": {"act": "act", "layer_before_act": "c_fc"},
    "RefinedWebModel": {"act": "act", "layer_before_act": "dense_h_to_4h"},
    "falcon": {"act": "act", "layer_before_act": "dense_h_to_4h"},
    "mpt": {"act": "act", "layer_before_act": "up_proj"},
    "gptj": {"act": "act", "layer_before_act": "fc_in"},
    "gpt_neox": {"act": "act", "layer_before_act": "dense_h_to_4h"},
    "gpt_bigcode": {"act": "act", "layer_before_act": "c_fc"},
    "bloom": {"act": "gelu_impl", "layer_before_act": "dense_h_to_4h"},
}


def replace_quantization_scales(model, model_type):
    from awq.modules.act import ScaledActivation

    if model_type not in AWQ_SCALES_MAPPINGS:
        return model
    for name, module in model.named_children():
        act_name = AWQ_SCALES_MAPPINGS[model_type]["act"]
        layer_before_act_name = AWQ_SCALES_MAPPINGS[model_type]["layer_before_act"]
        if name == act_name and hasattr(model, layer_before_act_name):
            layer_before_act = getattr(model, AWQ_SCALES_MAPPINGS[model_type]["layer_before_act"])
            size = layer_before_act.out_features
            scale_like = torch.ones(size)
            model._modules[name] = ScaledActivation(module, scale_like)
        _ = replace_quantization_scales(module, model_type)
    return model


def replace_with_awq_linear(
    model,
    modules_to_not_convert=None,
    quantization_config=None,
    current_key_name=None,
    has_been_replaced=False,
) -> bool:
    """
    Public method that recursively replaces the Linear layers of the given model with AWQ quantized layers.
    `accelerate` is needed to use this method. Returns the converted model and a boolean that indicates if the
    conversion has been successful or not.

    During the module replacement, we also infer the backend to use through the `quantization_config` object.

    Args:
        model (`torch.nn.Module`):
            The model to convert, can be any `torch.nn.Module` instance.
        quantization_config (`AwqConfig`):
            The quantization config object that contains the quantization parameters.
        modules_to_not_convert (`list`, *optional*):
            A list of modules to not convert. If a module name is in the list (e.g. `lm_head`), it will not be
            converted.
        current_key_name (`list`, *optional*):
            A list that contains the current key name. This is used for recursion and should not be passed by the
            user.
        has_been_replaced (`bool`, *optional*):
            A boolean that indicates if the conversion has been successful or not. This is used for recursion and
            should not be passed by the user.
    """
    if modules_to_not_convert is None:
        modules_to_not_convert = []

    backend = quantization_config.backend

    if not is_auto_awq_available():
        raise ValueError(
            "AWQ (either `autoawq` or `llmawq`) is not available. Please install it with `pip install autoawq` or "
            "check out the installation guide in https://github.com/mit-han-lab/llm-awq"
        )

    if backend == AwqBackendPackingMethod.AUTOAWQ:
        if quantization_config.version == AWQLinearVersion.GEMM:
            from awq.modules.linear.gemm import WQLinear_GEMM

            target_cls = WQLinear_GEMM
        elif quantization_config.version == AWQLinearVersion.GEMV:
            from awq.modules.linear.gemv import WQLinear_GEMV

            target_cls = WQLinear_GEMV
        elif quantization_config.version == AWQLinearVersion.EXLLAMA:
            if quantization_config.exllama_config["version"] == ExllamaVersion.ONE:
                from awq.modules.linear.exllama import WQLinear_Exllama

                target_cls = WQLinear_Exllama
            elif quantization_config.exllama_config["version"] == ExllamaVersion.TWO:
                from awq.modules.linear.exllamav2 import WQLinear_ExllamaV2

                target_cls = WQLinear_ExllamaV2
            else:
                raise ValueError(f"Unrecognized Exllama version: {quantization_config.exllama_config['version']}")
        elif quantization_config.version == AWQLinearVersion.IPEX:
            from awq.modules.linear.gemm_ipex import WQLinear_IPEX

            target_cls = WQLinear_IPEX
        else:
            raise ValueError(f"Unrecognized AWQ version: {quantization_config.version}")
    else:
        from awq.quantize.qmodule import WQLinear

        target_cls = WQLinear

    for name, module in model.named_children():
        if current_key_name is None:
            current_key_name = []
        current_key_name.append(name)

        if isinstance(module, nn.Linear) and name not in modules_to_not_convert:
            # Check that the full key is not in `modules_to_not_convert` either
            if not any(key in ".".join(current_key_name) for key in modules_to_not_convert):
                in_features = module.in_features
                out_features = module.out_features

                model._modules[name] = target_cls(
                    w_bit=quantization_config.bits,
                    group_size=quantization_config.group_size,
                    in_features=in_features,
                    out_features=out_features,
                    bias=module.bias is not None,
                    dev=module.weight.device,
                )
                has_been_replaced = True

                # Force requires_grad to False to avoid unexpected errors
                model._modules[name].requires_grad_(False)
        if len(list(module.children())) > 0:
            _, has_been_replaced = replace_with_awq_linear(
                module,
                modules_to_not_convert=modules_to_not_convert,
                current_key_name=current_key_name,
                quantization_config=quantization_config,
                has_been_replaced=has_been_replaced,
            )
        # Remove the last key for recursion
        current_key_name.pop(-1)
    return model, has_been_replaced
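
# NOTE: a minimal usage sketch of `replace_with_awq_linear`, assuming `accelerate` is installed
# and using a placeholder model id. In practice the AWQ quantizer drives this call itself during
# `from_pretrained`: the model skeleton is built with empty weights, every `nn.Linear` (except
# excluded modules such as `lm_head`) is swapped for an AWQ linear shell, and the quantized state
# dict is loaded afterwards by the caller.
def _example_replace_with_awq_linear():
    from accelerate import init_empty_weights

    from transformers import AutoConfig, AutoModelForCausalLM, AwqConfig

    quantization_config = AwqConfig(bits=4, group_size=128, version="gemm")
    config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder model id

    # Build the architecture without allocating real weights.
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(config)

    # Swap the `nn.Linear` layers for `WQLinear_GEMM` shells; the boolean tells us whether
    # at least one replacement happened.
    empty_model, has_been_replaced = replace_with_awq_linear(
        empty_model,
        modules_to_not_convert=["lm_head"],
        quantization_config=quantization_config,
    )
    return empty_model, has_been_replaced
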
def get_modules_to_fuse(model, quantization_config):
    """
    Returns the fusing mapping given the quantization config and the model

    Args:
        model (`~PreTrainedModel`):
            The model to fuse - note this model should have been converted into AWQ format beforehand.
        quantization_config (`~transformers.quantization_config.AWQConfig`):
            The quantization configuration to use.
    """
    if not isinstance(model, PreTrainedModel):
        raise TypeError(f"The model should be an instance of `PreTrainedModel`, got {model.__class__.__name__}")

    # Always default to `quantization_config.modules_to_fuse`
    if quantization_config.modules_to_fuse is not None:
        current_fused_mapping = quantization_config.modules_to_fuse
        current_fused_mapping["max_seq_len"] = quantization_config.fuse_max_seq_len
    elif model.config.model_type in AWQ_FUSED_MAPPINGS:
        current_fused_mapping = AWQ_FUSED_MAPPINGS[model.config.model_type]

        # Properly deal with the case where the model is a multi-modal model
        config = model.config.get_text_config(decoder=True)

        # Handle hidden_size, num_attention_heads and num_key_value_heads on our own
        hidden_size = config.hidden_size
        num_attention_heads = config.num_attention_heads
        num_key_value_heads = getattr(config, "num_key_value_heads", num_attention_heads)

        # Fill `current_fused_mapping` with the expected values
        current_fused_mapping["hidden_size"] = hidden_size
        current_fused_mapping["num_attention_heads"] = num_attention_heads
        current_fused_mapping["num_key_value_heads"] = num_key_value_heads
        current_fused_mapping["max_seq_len"] = quantization_config.fuse_max_seq_len
    else:
        raise ValueError(
            "Fusing mapping not found either on the quantization config or the supported `AWQ_FUSED_MAPPINGS`. Please pass a `fused_mapping` argument"
            " in the `quantization_config` or raise an issue on transformers https://github.com/huggingface/transformers to add its support."
        )
    return current_fused_mapping


def fuse_awq_modules(model, quantization_config):
    """
    Optionally fuse some modules in the model to speedup inference.

    Args:
        model (`~PreTrainedModel`):
            The model to fuse - note this model should have been converted into AWQ format beforehand.
        quantization_config (`Union[AwqConfig, dict]`):
            The quantization configuration to use.
    """
    # Convert a plain dict into an `AwqConfig` object so that fields such as `backend` are available
    if isinstance(quantization_config, dict):
        quantization_config = AwqConfig.from_dict(quantization_config)
    backend = quantization_config.backend

    modules_to_fuse = get_modules_to_fuse(model, quantization_config)
    modules_to_not_convert = getattr(quantization_config, "modules_to_not_convert", None)

    if backend == AwqBackendPackingMethod.AUTOAWQ:
        from awq.modules.fused.attn import QuantAttentionFused
        from awq.modules.fused.mlp import QuantFusedMLP
        from awq.modules.fused.norm import FasterTransformerRMSNorm
    else:
        raise ValueError("Fusing is only supported for the AutoAWQ backend")

    fused_attention_modules = []

    for name, module in model.named_modules():
        if modules_to_not_convert is not None:
            if any(module_name_to_not_convert in name for module_name_to_not_convert in modules_to_not_convert):
                continue

        # Replace the layer norms
        _fuse_awq_layernorm(modules_to_fuse["layernorm"], module, FasterTransformerRMSNorm)

        # Replace the MLP layers if the AWQ version is not IPEX
        if quantization_config.version != "ipex":
            _fuse_awq_mlp(model, name, modules_to_fuse["mlp"], module, QuantFusedMLP)
        else:
            logger.info("The IPEX version AWQ does not support fuse mlp for now.")

        # Replace the attention layers
        attention_has_been_fused = _fuse_awq_attention_layers(
            model, module, modules_to_fuse, name, QuantAttentionFused
        )

        if attention_has_been_fused:
            fused_attention_modules.append(name.split(".")[0])

    # If attention layers were fused, set `config._attn_implementation` to "custom" on the parent
    # modules so that the fused attention modules receive a `None` attention mask.
    if len(fused_attention_modules) > 0:
        for module_name, module in model.named_modules():
            if any(
                module_name in fused_attention_parent_module
                for fused_attention_parent_module in fused_attention_modules
            ):
                if hasattr(module, "config") and hasattr(module.config, "_attn_implementation"):
                    module.config._attn_implementation = "custom"
    return model


def _fuse_awq_layernorm(fuse_module_names, module, target_cls):
    """
    Fuse the LayerNorm layers into a target class using autoawq

    Args:
        fuse_module_names (`list[str]`):
            The list of module names to fuse
        module (`nn.Module`):
            The pytorch parent module that has layernorm modules to fuse
        target_cls (`~autoawq.FasterTransformerRMSNorm`):
            The `FasterTransformerRMSNorm` class as it only supports that class
            for now.
    """
    for module_name in fuse_module_names:
        if hasattr(module, module_name):
            old_module = getattr(module, module_name)
            module._modules[module_name] = target_cls(
                old_module.weight,
                old_module.variance_epsilon,
            ).to(old_module.weight.device)
            del old_module
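
# NOTE: a minimal sketch of the fusing mapping consumed by `get_modules_to_fuse` and
# `fuse_awq_modules`, assuming a Llama-style model with a 4096 hidden size, 32 attention heads and
# 8 key/value heads; the numbers are illustrative, not defaults. Passing `modules_to_fuse` on the
# `AwqConfig` makes `get_modules_to_fuse` return this mapping as-is (plus `max_seq_len` taken from
# `fuse_max_seq_len`) instead of looking the architecture up in `AWQ_FUSED_MAPPINGS`.
def _example_custom_fusing_mapping():
    from transformers import AwqConfig

    custom_mapping = {
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "mlp": ["gate_proj", "up_proj", "down_proj"],
        "layernorm": ["input_layernorm", "post_attention_layernorm", "norm"],
        "use_alibi": False,
        "hidden_size": 4096,
        "num_attention_heads": 32,
        "num_key_value_heads": 8,
    }
    return AwqConfig(
        bits=4,
        do_fuse=True,
        fuse_max_seq_len=2048,
        modules_to_fuse=custom_mapping,
    )
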
def _fuse_awq_mlp(model, current_module_name, fuse_module_names, module, target_cls):
    """
    Fuse the MLP layers into a target class using autoawq

    Args:
        model (`~PreTrainedModel`):
            The input pretrained model
        current_module_name (`str`):
            The current submodule name
        fuse_module_names (`list[str]`):
            The list of module names to fuse. For the MLP layers it has to be an array of length 3 that consists of
            the 3 MLP layers in the order (gate (dense layer post-attention) / up / down layers)
        module (`nn.Module`):
            The pytorch parent module that has layernorm modules to fuse
        target_cls (`~autoawq.QuantFusedMLP`):
            The `QuantFusedMLP` class as it only supports that class
            for now.
    """
    if len(fuse_module_names) == 0:
        return

    if hasattr(module, fuse_module_names[0]):
        gate_proj = getattr(module, fuse_module_names[0])
        up_proj = getattr(module, fuse_module_names[1])
        down_proj = getattr(module, fuse_module_names[2])

        previous_device = gate_proj.qweight.device

        # Deal also with the case where the model has a `text_config` attribute
        config = model.config.get_text_config(decoder=True)
        hidden_act = config.hidden_act
        activation_fn = ACT2FN[hidden_act]
        new_module = target_cls(gate_proj, down_proj, up_proj, activation_fn)

        parent_name, child_name = current_module_name.rsplit(".", 1)
        parent = model.get_submodule(parent_name)
        setattr(parent, child_name, new_module.to(previous_device))

        del gate_proj, up_proj, down_proj
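
# NOTE: a minimal end-to-end sketch, assuming an AWQ checkpoint available on the Hub (the model id
# below is a placeholder). With `do_fuse=True`, the AWQ quantizer calls `fuse_awq_modules` on the
# loaded model, which in turn applies the `_fuse_awq_layernorm` and `_fuse_awq_mlp` helpers above
# to every submodule.
def _example_load_fused_awq_model():
    from transformers import AutoModelForCausalLM, AwqConfig

    quantization_config = AwqConfig(bits=4, do_fuse=True, fuse_max_seq_len=512)
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Mistral-7B-OpenOrca-AWQ",  # placeholder AWQ checkpoint
        quantization_config=quantization_config,
        device_map="auto",
    )
    return model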