import torch
from torch.utils._pytree import tree_map, tree_flatten, tree_unflatten
from .module_tracker import ModuleTracker
from typing import Any, Optional, Union, TypeVar, Callable
from collections.abc import Iterator
from typing_extensions import ParamSpec
from collections import defaultdict
from torch.utils._python_dispatch import TorchDispatchMode
from math import prod
from functools import wraps
import warnings

__all__ = ["FlopCounterMode", "register_flop_formula"]

_T = TypeVar("_T")
_P = ParamSpec("_P")

aten = torch.ops.aten


def get_shape(i):
    # Tensors are reduced to their shapes; everything else passes through unchanged.
    if isinstance(i, torch.Tensor):
        return i.shape
    return i


flop_registry: dict[Any, Any] = {}


def shape_wrapper(f):
    # Wraps a flop formula so it receives shapes instead of tensors.
    @wraps(f)
    def nf(*args, out_val=None, **kwargs):
        args, kwargs, out_shape = tree_map(get_shape, (args, kwargs, out_val))
        return f(*args, out_shape=out_shape, **kwargs)
    return nf


def register_flop_formula(targets, get_raw=False) -> Callable[[Callable[_P, _T]], Callable[_P, _T]]:
    def register_fun(flop_formula: Callable[_P, _T]) -> Callable[_P, _T]:
        if not get_raw:
            flop_formula = shape_wrapper(flop_formula)

        def register(target):
            if not isinstance(target, torch._ops.OpOverloadPacket):
                raise ValueError(
                    f"register_flop_formula(targets): expected each target to be "
                    f"OpOverloadPacket (i.e. torch.ops.mylib.foo), got "
                    f"{target} which is of type {type(target)}")
            if target in flop_registry:
                raise RuntimeError(f"duplicate registrations for {target}")
            flop_registry[target] = flop_formula

        # To handle allowing multiple aten_ops at once.
        torch.utils._pytree.tree_map_(register, targets)

        return flop_formula

    return register_fun


@register_flop_formula(aten.mm)
def mm_flop(a_shape, b_shape, *args, out_shape=None, **kwargs) -> int:
    """Count flops for matmul."""
    # Inputs contain the shapes of two matrices.
    m, k = a_shape
    k2, n = b_shape
    assert k == k2
    # One multiply and one add per element of the reduction dimension.
    return m * n * 2 * k


@register_flop_formula(aten.addmm)
def addmm_flop(self_shape, a_shape, b_shape, out_shape=None, **kwargs) -> int:
    """Count flops for addmm."""
    return mm_flop(a_shape, b_shape)


@register_flop_formula(aten.bmm)
def bmm_flop(a_shape, b_shape, out_shape=None, **kwargs) -> int:
    """Count flops for the bmm operation."""
    # Inputs contain the shapes of two batched matrices.
    b, m, k = a_shape
    b2, k2, n = b_shape
    assert b == b2
    assert k == k2
    flop = b * m * n * 2 * k
    return flop


@register_flop_formula(aten.baddbmm)
def baddbmm_flop(self_shape, a_shape, b_shape, out_shape=None, **kwargs) -> int:
    """Count flops for the baddbmm operation."""
    return bmm_flop(a_shape, b_shape)


@register_flop_formula(aten._scaled_mm)
def _scaled_mm_flop(
    a_shape,
    b_shape,
    scale_a_shape,
    scale_b_shape,
    bias_shape=None,
    scale_result_shape=None,
    out_dtype=None,
    use_fast_accum=False,
    out_shape=None,
    **kwargs,
) -> int:
    """Count flops for _scaled_mm."""
    return mm_flop(a_shape, b_shape)


def conv_flop_count(
    x_shape: list[int],
    w_shape: list[int],
    out_shape: list[int],
    transposed: bool = False,
) -> int:
    """Count flops for convolution.

    Note only multiplication is
    counted. Computation for bias are ignored.
    Flops for a transposed convolution are calculated as
    flops = (x_shape[2:] * prod(w_shape) * batch_size).

    Args:
        x_shape (list(int)): The input shape before convolution.
        w_shape (list(int)): The filter shape.
        out_shape (list(int)): The output shape after convolution.
        transposed (bool): is the convolution transposed
    Returns:
        int: the number of flops
    """
    batch_size = x_shape[0]
    # For a regular conv, every output spatial location does one filter-sized
    # multiply-accumulate per (c_in, c_out) pair; for a transposed conv it is
    # every *input* spatial location instead.
    conv_shape = (x_shape if transposed else out_shape)[2:]
    c_out, c_in, *filter_size = w_shape
    flop = prod(conv_shape) * prod(filter_size) * batch_size * c_out * c_in * 2
    return flop


@register_flop_formula([aten.convolution, aten._convolution, aten.cudnn_convolution, aten._slow_conv2d_forward])
def conv_flop(x_shape, w_shape, _bias, _stride, _padding, _dilation, transposed, *args, out_shape=None, **kwargs) -> int:
    """Count flops for convolution."""
    return conv_flop_count(x_shape, w_shape, out_shape, transposed=transposed)


def transpose_shape(shape):
    # Swaps the first two (channel) dimensions of a shape.
    return [shape[1], shape[0]] + list(shape[2:])


@register_flop_formula(aten.convolution_backward)
def conv_backward_flop(
        grad_out_shape,
        x_shape,
        w_shape,
        _bias,
        _stride,
        _padding,
        _dilation,
        transposed,
        _output_padding,
        _groups,
        output_mask,
        out_shape) -> int:
    flop_count = 0

    # grad_input is a convolution between grad_out and the weight, with the
    # `transposed` flag flipped.
    if output_mask[0]:
        grad_input_shape = get_shape(out_shape[0])
        flop_count += conv_flop_count(grad_out_shape, w_shape, grad_input_shape, not transposed)

    # grad_weight is a convolution between the (channel-transposed) input and
    # grad_out.
    if output_mask[1]:
        grad_weight_shape = get_shape(out_shape[1])
        if transposed:
            flop_count += conv_flop_count(
                transpose_shape(grad_out_shape),
                transpose_shape(x_shape),
                transpose_shape(grad_weight_shape),
                transposed=False)
        else:
            flop_count += conv_flop_count(
                transpose_shape(x_shape),
                transpose_shape(grad_out_shape),
                transpose_shape(grad_weight_shape),
                transposed=False)

    return flop_count
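# Worked example (illustrative note, not part of the upstream module): the
# formulas above are pure shape arithmetic, so they can be checked by hand.
# For a (64, 128) @ (128, 256) matmul:
#     mm_flop((64, 128), (128, 256)) == 64 * 256 * 2 * 128 == 4_194_304
# and for a 3x3 conv over a (1, 3, 32, 32) input producing (1, 16, 30, 30):
#     conv_flop_count((1, 3, 32, 32), (16, 3, 3, 3), (1, 16, 30, 30))
#     == 30 * 30 * 3 * 3 * 1 * 16 * 3 * 2 == 777_600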
def sdpa_flop_count(query_shape, key_shape, value_shape):
    """Count flops for self-attention.

    NB: We can assume that value_shape == key_shape
    """
    b, h, s_q, d_q = query_shape
    _b2, _h2, s_k, _d2 = key_shape
    _b3, _h3, _s3, d_v = value_shape
    assert b == _b2 == _b3 and h == _h2 == _h3 and d_q == _d2 and s_k == _s3
    total_flops = 0
    # q: [b, h, s_q, d_q] @ k: [b, h, d_q, s_k] -> scores: [b, h, s_q, s_k]
    total_flops += bmm_flop((b * h, s_q, d_q), (b * h, d_q, s_k))
    # scores: [b, h, s_q, s_k] @ v: [b, h, s_k, d_v] -> out: [b, h, s_q, d_v]
    total_flops += bmm_flop((b * h, s_q, s_k), (b * h, s_k, d_v))
    return total_flops


@register_flop_formula([aten._scaled_dot_product_efficient_attention,
                        aten._scaled_dot_product_flash_attention,
                        aten._scaled_dot_product_cudnn_attention])
def sdpa_flop(query_shape, key_shape, value_shape, *args, out_shape=None, **kwargs) -> int:
    """Count flops for self-attention."""
    # NB: We aren't accounting for causal attention here.
    return sdpa_flop_count(query_shape, key_shape, value_shape)


def _offsets_to_lengths(offsets, max_len):
    """
    If the offsets tensor is fake, then we don't know the actual lengths.
    In that case, we can just assume the worst case; each batch has max length.
    """
    from torch._subclasses.fake_tensor import FakeTensor
    from torch._subclasses.functional_tensor import FunctionalTensor
    if not isinstance(offsets, (FakeTensor, FunctionalTensor)) and offsets.device.type != "meta":
        return offsets.diff().tolist()
    return [max_len] * (offsets.size(0) - 1)


def _unpack_flash_attention_nested_shapes(
    *,
    query,
    key,
    value,
    grad_out=None,
    cum_seq_q,
    cum_seq_k,
    max_q,
    max_k,
) -> Iterator[tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], Optional[tuple[int, ...]]]]:
    """
    Given inputs to a flash_attention_(forward|backward) kernel, this will handle behavior for
    NestedTensor inputs by effectively unbinding the NestedTensor and yielding the shapes for
    each batch element.

    In the case that this isn't a NestedTensor kernel, then it just yields the original shapes.
    """
    if cum_seq_q is not None:
        # This means we should be dealing with a Nested Jagged Tensor query.
        # The inputs will have shape                  (sum(sequence len), heads, dimension)
        # In comparison, non-Nested inputs have shape (batch, heads, sequence len, dimension)
        # To deal with this, we convert to a shape of (batch, heads, max_seq_len, dimension),
        # so the flop count in this case is an overestimate of the actual flops.
        assert len(key.shape) == 3
        assert len(value.shape) == 3
        assert grad_out is None or grad_out.shape == query.shape
        _, h_q, d_q = query.shape
        _, h_k, d_k = key.shape
        _, h_v, d_v = value.shape
        assert cum_seq_q is not None
        assert cum_seq_k is not None
        assert cum_seq_q.shape == cum_seq_k.shape
        seq_q_lengths = _offsets_to_lengths(cum_seq_q, max_q)
        seq_k_lengths = _offsets_to_lengths(cum_seq_k, max_k)
        for seq_q_len, seq_k_len in zip(seq_q_lengths, seq_k_lengths):
            new_query_shape = (1, h_q, seq_q_len, d_q)
            new_key_shape = (1, h_k, seq_k_len, d_k)
            new_value_shape = (1, h_v, seq_k_len, d_v)
            new_grad_out_shape = new_query_shape if grad_out is not None else None
            yield new_query_shape, new_key_shape, new_value_shape, new_grad_out_shape
        return

    yield query.shape, key.shape, value.shape, grad_out.shape if grad_out is not None else None


def _unpack_efficient_attention_nested_shapes(
    *,
    query,
    key,
    value,
    grad_out=None,
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q,
    max_seqlen_k,
) -> Iterator[tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], Optional[tuple[int, ...]]]]:
    """
    Given inputs to a efficient_attention_(forward|backward) kernel, this will handle behavior for
    NestedTensor inputs by effectively unbinding the NestedTensor and yielding the shapes for
    each batch element.

    In the case that this isn't a NestedTensor kernel, then it just yields the original shapes.
    """
    if cu_seqlens_q is not None:
        # Unlike the flash kernel, the efficient kernel takes 4-dim inputs, so
        # the jagged case is detected via the cu_seqlens tensors instead.
        assert len(key.shape) == 4
        assert len(value.shape) == 4
        assert grad_out is None or grad_out.shape == query.shape
        _, _, h_q, d_q = query.shape
        _, _, h_k, d_k = key.shape
        _, _, h_v, d_v = value.shape
        assert cu_seqlens_q is not None
        assert cu_seqlens_k is not None
        assert cu_seqlens_q.shape == cu_seqlens_k.shape
        seqlens_q = _offsets_to_lengths(cu_seqlens_q, max_seqlen_q)
        seqlens_k = _offsets_to_lengths(cu_seqlens_k, max_seqlen_k)
        for len_q, len_k in zip(seqlens_q, seqlens_k):
            new_query_shape = (1, h_q, len_q, d_q)
            new_key_shape = (1, h_k, len_k, d_k)
            new_value_shape = (1, h_v, len_k, d_v)
            new_grad_out_shape = new_query_shape if grad_out is not None else None
            yield new_query_shape, new_key_shape, new_value_shape, new_grad_out_shape
        return

    yield query.shape, key.shape, value.shape, grad_out.shape if grad_out is not None else None


@register_flop_formula(aten._flash_attention_forward, get_raw=True)
def _flash_attention_forward_flop(
    query,
    key,
    value,
    cum_seq_q,
    cum_seq_k,
    max_q,
    max_k,
    *args,
    out_shape=None,
    **kwargs
) -> int:
    """Count flops for self-attention."""
    # NB: We aren't accounting for causal attention here.
    # In case this is a nested tensor, we unpack the individual batch elements
    # and then sum the flops per batch element.
    sizes = _unpack_flash_attention_nested_shapes(
        query=query,
        key=key,
        value=value,
        cum_seq_q=cum_seq_q,
        cum_seq_k=cum_seq_k,
        max_q=max_q,
        max_k=max_k,
    )
    return sum(
        sdpa_flop_count(query_shape, key_shape, value_shape)
        for query_shape, key_shape, value_shape, _ in sizes
    )


@register_flop_formula(aten._efficient_attention_forward, get_raw=True)
def _efficient_attention_forward_flop(
    query,
    key,
    value,
    bias,
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q,
    max_seqlen_k,
    *args,
    **kwargs
) -> int:
    """Count flops for self-attention."""
    # NB: We aren't accounting for causal attention here.
    # In case this is a nested tensor, we unpack the individual batch elements
    # and then sum the flops per batch element.
    sizes = _unpack_efficient_attention_nested_shapes(
        query=query,
        key=key,
        value=value,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=max_seqlen_q,
        max_seqlen_k=max_seqlen_k,
    )
    return sum(
        sdpa_flop_count(query_shape, key_shape, value_shape)
        for query_shape, key_shape, value_shape, _ in sizes
    )


def sdpa_backward_flop_count(grad_out_shape, query_shape, key_shape, value_shape):
    total_flops = 0
    b, h, s_q, d_q = query_shape
    _b2, _h2, s_k, _d2 = key_shape
    _b3, _h3, _s3, d_v = value_shape
    _b4, _h4, _s4, _d4 = grad_out_shape
    assert b == _b2 == _b3 == _b4 and h == _h2 == _h3 == _h4 and d_q == _d2
    assert d_v == _d4 and s_k == _s3 and s_q == _s4

    # Step 1: We recompute the scores matrix.
    # q: [b, h, s_q, d_q] @ k: [b, h, d_q, s_k] -> scores: [b, h, s_q, s_k]
    total_flops += bmm_flop((b * h, s_q, d_q), (b * h, d_q, s_k))

    # Step 2: We propagate the gradients through the scores @ v operation.
    # gradOut: [b, h, s_q, d_v] @ v: [b, h, d_v, s_k] -> gradScores: [b, h, s_q, s_k]
    total_flops += bmm_flop((b * h, s_q, d_v), (b * h, d_v, s_k))
    # scores: [b, h, s_k, s_q] @ gradOut: [b, h, s_q, d_v] -> gradV: [b, h, s_k, d_v]
    total_flops += bmm_flop((b * h, s_k, s_q), (b * h, s_q, d_v))

    # Step 3: We propagate the gradients through the q @ k operation.
    # gradScores: [b, h, s_q, s_k] @ k: [b, h, s_k, d_q] -> gradQ: [b, h, s_q, d_q]
    total_flops += bmm_flop((b * h, s_q, s_k), (b * h, s_k, d_q))
    # q: [b, h, d_q, s_q] @ gradScores: [b, h, s_q, s_k] -> gradK: [b, h, d_q, s_k]
    total_flops += bmm_flop((b * h, d_q, s_q), (b * h, s_q, s_k))
    return total_flops


@register_flop_formula([aten._scaled_dot_product_efficient_attention_backward,
                        aten._scaled_dot_product_flash_attention_backward,
                        aten._scaled_dot_product_cudnn_attention_backward])
def sdpa_backward_flop(grad_out_shape, query_shape, key_shape, value_shape, *args, out_shape=None, **kwargs) -> int:
    """Count flops for self-attention backward."""
    return sdpa_backward_flop_count(grad_out_shape, query_shape, key_shape, value_shape)


@register_flop_formula(aten._flash_attention_backward, get_raw=True)
def _flash_attention_backward_flop(
    grad_out,
    query,
    key,
    value,
    out,
    logsumexp,
    cum_seq_q,
    cum_seq_k,
    max_q,
    max_k,
    *args,
    **kwargs,
) -> int:
    shapes = _unpack_flash_attention_nested_shapes(
        query=query,
        key=key,
        value=value,
        grad_out=grad_out,
        cum_seq_q=cum_seq_q,
        cum_seq_k=cum_seq_k,
        max_q=max_q,
        max_k=max_k,
    )
    return sum(
        sdpa_backward_flop_count(grad_out_shape, query_shape, key_shape, value_shape)
        for query_shape, key_shape, value_shape, grad_out_shape in shapes
    )


@register_flop_formula(aten._efficient_attention_backward, get_raw=True)
def _efficient_attention_backward_flop(
    grad_out,
    query,
    key,
    value,
    bias,
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q,
    max_seqlen_k,
    *args,
    **kwargs,
) -> int:
    shapes = _unpack_efficient_attention_nested_shapes(
        query=query,
        key=key,
        value=value,
        grad_out=grad_out,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=max_seqlen_q,
        max_seqlen_k=max_seqlen_k,
    )
    return sum(
        sdpa_backward_flop_count(grad_out_shape, query_shape, key_shape, value_shape)
        for query_shape, key_shape, value_shape, grad_out_shape in shapes
    )


flop_registry = {
    aten.mm: mm_flop,
    aten.addmm: addmm_flop,
    aten.bmm: bmm_flop,
    aten.baddbmm: baddbmm_flop,
    aten._scaled_mm: _scaled_mm_flop,
    aten.convolution: conv_flop,
    aten._convolution: conv_flop,
    aten.cudnn_convolution: conv_flop,
    aten._slow_conv2d_forward: conv_flop,
    aten.convolution_backward: conv_backward_flop,
    aten._scaled_dot_product_efficient_attention: sdpa_flop,
    aten._scaled_dot_product_flash_attention: sdpa_flop,
    aten._scaled_dot_product_cudnn_attention: sdpa_flop,
    aten._scaled_dot_product_efficient_attention_backward: sdpa_backward_flop,
    aten._scaled_dot_product_flash_attention_backward: sdpa_backward_flop,
    aten._scaled_dot_product_cudnn_attention_backward: sdpa_backward_flop,
    aten._flash_attention_forward: _flash_attention_forward_flop,
    aten._efficient_attention_forward: _efficient_attention_forward_flop,
    aten._flash_attention_backward: _flash_attention_backward_flop,
    aten._efficient_attention_backward: _efficient_attention_backward_flop,
}
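# Worked example (illustrative note, not part of the upstream module): for a
# non-nested attention call with batch 1, 8 heads, sequence length 1024 and
# head dim 64, every shape is (1, 8, 1024, 64) and
#     sdpa_flop_count((1, 8, 1024, 64), (1, 8, 1024, 64), (1, 8, 1024, 64))
# is the cost of two batched matmuls, 2 * (8 * 1024 * 1024 * 2 * 64), i.e.
# 2_147_483_648 flops; the backward variant counts five such matmuls.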
def normalize_tuple(x):
    if not isinstance(x, tuple):
        return (x,)
    return x


# Suffixes used when pretty-printing flop counts.
suffixes = ["", "K", "M", "B", "T"]


def get_suffix_str(number):
    # Find the index of the appropriate suffix based on the number of digits,
    # with some overflow so e.g. 1.01B is displayed as 1010M rather than 1.01B.
    index = max(0, min(len(suffixes) - 1, (len(str(number)) - 2) // 3))
    return suffixes[index]


def convert_num_with_suffix(number, suffix):
    index = suffixes.index(suffix)
    # Divide the number by 1000^index and format it to three decimal places.
    value = f"{number / 1000 ** index:.3f}"
    # Return the value and the suffix as a string.
    return value + suffixes[index]


def convert_to_percent_str(num, denom):
    if denom == 0:
        return "0%"
    return f"{num / denom:.2%}"


def _pytreeify_preserve_structure(f):
    @wraps(f)
    def nf(args):
        flat_args, spec = tree_flatten(args)
        out = f(*flat_args)
        return tree_unflatten(out, spec)

    return nf


class FlopCounterMode:
    """
    ``FlopCounterMode`` is a context manager that counts the number of flops within its context.

    It does this using a ``TorchDispatchMode``.

    It also supports hierarchical output by passing a module (or list of modules) to FlopCounterMode on construction.

    If you do not need hierarchical output, you do not need to use it with a module.

    Example usage

    .. code-block:: python

        mod = ...
        with FlopCounterMode(mod) as flop_counter:
            mod.sum().backward()
    """

    def __init__(
            self,
            mods: Optional[Union[torch.nn.Module, list[torch.nn.Module]]] = None,
            depth: int = 2,
            display: bool = True,
            custom_mapping: Optional[dict[Any, Any]] = None):
        self.flop_counts: dict[str, dict[Any, int]] = defaultdict(lambda: defaultdict(int))
        self.depth = depth
        self.display = display
        self.mode: Optional[_FlopCounterMode] = None
        if custom_mapping is None:
            custom_mapping = {}
        if mods is not None:
            warnings.warn("mods is not needed anymore, you can stop passing it", stacklevel=2)
        # Custom formulas are shape-wrapped unless they were registered raw.
        self.flop_registry = {
            **flop_registry,
            **{k: v if getattr(v, "_get_raw", False) else shape_wrapper(v) for k, v in custom_mapping.items()}
        }
        self.mod_tracker = ModuleTracker()

    def get_total_flops(self) -> int:
        return sum(self.flop_counts['Global'].values())

    def get_flop_counts(self) -> dict[str, dict[Any, int]]:
        """Return the flop counts as a dictionary of dictionaries.

        The outer dictionary is keyed by module name, and the inner dictionary
        is keyed by operation.
        """
        return {k: dict(v) for k, v in self.flop_counts.items()}

    def get_table(self, depth=None):
        if depth is None:
            depth = self.depth
        if depth is None:
            depth = 999999

        import tabulate
        tabulate.PRESERVE_WHITESPACE = True
        header = ["Module", "FLOP", "% Total"]
        values = []
        global_flops = self.get_total_flops()
        global_suffix = get_suffix_str(global_flops)
        is_global_subsumed = False

        def process_mod(mod_name, depth):
            nonlocal is_global_subsumed

            total_flops = sum(self.flop_counts[mod_name].values())

            is_global_subsumed |= total_flops >= global_flops

            padding = " " * depth
            values = []
            values.append([
                padding + mod_name,
                convert_num_with_suffix(total_flops, global_suffix),
                convert_to_percent_str(total_flops, global_flops)
            ])
            for k, v in self.flop_counts[mod_name].items():
                values.append([
                    padding + " - " + str(k),
                    convert_num_with_suffix(v, global_suffix),
                    convert_to_percent_str(v, global_flops)
                ])
            return values

        for mod in sorted(self.flop_counts.keys()):
            if mod == 'Global':
                continue
            mod_depth = mod.count(".") + 1
            if mod_depth > depth:
                continue

            cur_values = process_mod(mod, mod_depth - 1)
            values.extend(cur_values)

        # Only emit the "Global" row if it contains flops that are not already
        # fully accounted for by some module.
        if 'Global' in self.flop_counts and not is_global_subsumed:
            for value in values:
                value[0] = " " + value[0]

            values = process_mod('Global', 0) + values

        if len(values) == 0:
            values = [["Global", "0", "0%"]]

        return tabulate.tabulate(values, headers=header, colalign=("left", "right", "right"))

    def __enter__(self):
        self.flop_counts.clear()
        self.mod_tracker.__enter__()
        self.mode = _FlopCounterMode(self)
        self.mode.__enter__()
        return self

    def __exit__(self, *args):
        assert self.mode is not None
        b = self.mode.__exit__(*args)
        self.mode = None  # break reference cycles
        self.mod_tracker.__exit__()
        if self.display:
            print(self.get_table(self.depth))
        return b

    def _count_flops(self, func_packet, out, args, kwargs):
        if func_packet in self.flop_registry:
            flop_count_func = self.flop_registry[func_packet]
            flop_count = flop_count_func(*args, **kwargs, out_val=out)
            # Attribute the count to every module currently on the call stack.
            for par in set(self.mod_tracker.parents):
                self.flop_counts[par][func_packet] += flop_count

        return out
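# Illustrative usage sketch (not part of the upstream module). Assuming an
# nn.Linear(128, 256) called `lin` and an input `x` of shape (32, 128):
#
#     with FlopCounterMode(display=False) as fc:
#         lin(x).sum().backward()
#     total = fc.get_total_flops()   # forward + backward matmul flops
#     print(fc.get_table(depth=2))   # hierarchical per-module breakdown
#
# The forward pass alone contributes 32 * 256 * 2 * 128 == 2_097_152 flops.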
class _FlopCounterMode(TorchDispatchMode):
    supports_higher_order_operators = True

    def __init__(self, counter: FlopCounterMode):
        self.counter = counter

    def _execute_with_isolated_flop_counting(self, branch_fn, operands):
        """Execute a branch function and capture its FLOP counts without
        affecting self.counter.flop_counts

        Args:
            branch_fn: The branch function to execute
            operands: Arguments to pass to the branch function

        Returns:
            Tuple of (result, flop_counts) where result is the branch output
            and flop_counts is a copy of the FLOP counts after execution
        """
        import copy

        checkpointed_flop_counts = copy.deepcopy(self.counter.flop_counts)
        with self:
            result = branch_fn(*operands)
        flop_counts = copy.deepcopy(self.counter.flop_counts)
        self.counter.flop_counts = checkpointed_flop_counts
        return result, flop_counts

    def _handle_higher_order_ops(self, func, types, args, kwargs):
        if func not in {torch.ops.higher_order.cond}:
            return NotImplemented

        if func is torch.ops.higher_order.cond:
            # For cond we count an upper bound: both branches are executed in
            # isolation and, per module and per op, the larger count is kept.
            pred, true_branch, false_branch, operands = args
            true_out, true_flop_counts = self._execute_with_isolated_flop_counting(
                true_branch, operands
            )
            if true_out is NotImplemented:
                return NotImplemented
            false_out, false_flop_counts = self._execute_with_isolated_flop_counting(
                false_branch, operands
            )
            if false_out is NotImplemented:
                return NotImplemented

            all_mod_keys = set(true_flop_counts.keys()) | set(false_flop_counts.keys())
            merged_flop_counts = {}
            for outer_key in all_mod_keys:
                true_func_counts = true_flop_counts[outer_key]
                false_func_counts = false_flop_counts[outer_key]

                merged_func_counts = {}
                all_func_keys = set(true_func_counts.keys()) | set(false_func_counts.keys())
                for func_key in all_func_keys:
                    true_val = true_func_counts.get(func_key, 0)
                    false_val = false_func_counts.get(func_key, 0)
                    merged_func_counts[func_key] = max(true_val, false_val)

                merged_flop_counts[outer_key] = merged_func_counts

            for mod_name, func_counts in merged_flop_counts.items():
                self.counter.flop_counts[mod_name].update(func_counts)

            # Both branches must produce outputs with matching metadata, so
            # returning the true branch's output is sufficient.
            return true_out
        return NotImplemented

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs if kwargs else {}

        # Skip ops from non-standard dispatch_sizes_strides_policy such as NJT;
        # these are metadata queries and contribute no flops.
        if func in {torch.ops.aten.is_contiguous.default,
                    torch.ops.aten.is_contiguous.memory_format,
                    torch.ops.aten.is_strides_like_format.default,
                    torch.ops.aten.is_non_overlapping_and_dense.default,
                    torch.ops.aten.size.default,
                    torch.ops.aten.sym_size.default,
                    torch.ops.aten.stride.default,
                    torch.ops.aten.sym_stride.default,
                    torch.ops.aten.storage_offset.default,
                    torch.ops.aten.sym_storage_offset.default,
                    torch.ops.aten.numel.default,
                    torch.ops.aten.sym_numel.default,
                    torch.ops.aten.dim.default,
                    torch.ops.prim.layout.default}:
            return NotImplemented

        # Higher-order operators (e.g. cond) are handled separately.
        if isinstance(func, torch._ops.HigherOrderOperator):
            return self._handle_higher_order_ops(func, types, args, kwargs)

        # If we don't have func in flop_registry, see if it can decompose.
        if func not in self.counter.flop_registry and func is not torch.ops.prim.device.default:
            with self:
                r = func.decompose(*args, **kwargs)
                if r is not NotImplemented:
                    return r

        # No further decomposition; execute and count flops.
        out = func(*args, **kwargs)
        return self.counter._count_flops(func._overloadpacket, out, args, kwargs)
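# Illustrative, hedged demo (not part of the upstream module): a minimal
# end-to-end sketch of FlopCounterMode on a tiny model. The model, input, and
# variable names below are made up for the example; because this file uses a
# relative import, run it as a module (python -m ...) rather than as a script.
if __name__ == "__main__":
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
    inp = torch.randn(16, 64)
    with FlopCounterMode(display=False) as flop_counter:
        model(inp).sum().backward()
    # Total flops across the run, plus a per-module table down to depth 2.
    print(flop_counter.get_total_flops())
    print(flop_counter.get_table(2))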