"""
Partially inspired by torchtune's flex attention implementation

Citation:
@software{torchtune,
  title = {torchtune: PyTorch's finetuning library},
  author = {torchtune maintainers and contributors},
  url = {https://github.com/pytorch/torchtune},
  license = {BSD-3-Clause},
  month = apr,
  year = {2024}
}
"""

from typing import Optional, Union

import torch
from packaging import version

from ..utils import is_torch_flex_attn_available, logging
from ..utils.import_utils import _torch_version, is_torch_less_or_equal, is_torchdynamo_compiling


if is_torch_flex_attn_available():
    from torch.nn.attention.flex_attention import (
        _DEFAULT_SPARSE_BLOCK_SIZE,
        BlockMask,
        create_block_mask,
        flex_attention,
    )


logger = logging.get_logger(__name__)


class WrappedFlexAttention:
    """
    We are doing a singleton class so that flex attention is compiled once when it's first called.
    """

    _instance = None
    _is_flex_compiled = False
    _compiled_flex_attention = None

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            # Create a new instance if one doesn't already exist
            cls._instance = super().__new__(cls)
        return cls._instance

    @torch.compiler.disable(recursive=False)
    def __init__(self, training):
        """
        Initialize or update the singleton instance.
        """
        if not self._is_flex_compiled or training != self.training:
            self.training = training
            if is_torch_less_or_equal("2.5.1"):
                self._compiled_flex_attention = torch.compile(flex_attention, dynamic=False)
            # PyTorch 2.6.0 has a known flex attention compilation issue during training;
            # compiling with the "max-autotune-no-cudagraphs" mode works around it.
            elif version.parse(_torch_version).base_version == "2.6.0" and training:
                self._compiled_flex_attention = torch.compile(
                    flex_attention, dynamic=False, mode="max-autotune-no-cudagraphs"
                )
            else:
                self._compiled_flex_attention = torch.compile(flex_attention)
            self._is_flex_compiled = True

    def __call__(self):
        return self._compiled_flex_attention


def compile_friendly_flex_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    training=False,
    **kwargs,
) -> Union[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
    # Do not use the compiled version when we are already inside a torch.compile trace
    # (nesting compiled callables raises issues); fall back to the eager flex attention instead.
    flex_attention_compiled = WrappedFlexAttention(training)() if not is_torchdynamo_compiling() else flex_attention
    return flex_attention_compiled(
        query,
        key,
        value,
        **kwargs,
    )


# Offsets may be given either as plain integers or as tensors holding one.
Offset = Union[torch.Tensor, int]


def make_flex_block_causal_mask(
    attention_mask_2d: torch.Tensor,
    attention_chunk_size: Optional[int] = None,
    query_length=None,
    key_length=None,
    offsets: Optional[tuple[Offset, Offset]] = None,
    is_causal: Optional[bool] = True,
) -> "BlockMask":
    """
    IMPORTANT NOTICE: This function is deprecated in favor of using the mask primitives in `masking_utils.py`,
    and will be removed in a future version without warnings. New code should not use it. It is only kept here
    for BC for now, while models using it are being patched accordingly.

    Create a block (causal) document mask for a batch of sequences, both packed and unpacked.
    The block (causal) mask logic is built here and passed into
    :func:`torch.nn.attention.flex_attention.create_block_mask`. The resultant BlockMask is a compressed
    representation of the full (causal) block mask. BlockMask is essential for performant computation of
    flex attention. See: https://pytorch.org/blog/flexattention/

    Args:
        attention_mask_2d (torch.Tensor): Attention mask for packed and padded sequences
            of shape (batch_size, total_seq_len), e.g.

            For an unpacked sequence:
            [[1, 1, 1, 1, 0, 0, 0],
             [1, 1, 1, 1, 1, 0, 0]]

            For a packed sequence:
            [[1, 1, 1, 2, 2, 2, 0],
             [1, 1, 2, 2, 2, 3, 3]]

    Returns:
        BlockMask
    """
    batch_size, total_seq_len = attention_mask_2d.shape
    if not key_length:
        key_length = total_seq_len
    if not query_length:
        query_length = total_seq_len
    # Pad the 2D mask up to a multiple of the default sparse block size expected by flex attention.
    pad_len = (total_seq_len // _DEFAULT_SPARSE_BLOCK_SIZE + 1) * _DEFAULT_SPARSE_BLOCK_SIZE
    attention_mask_2d = torch.nn.functional.pad(attention_mask_2d, value=0, pad=(0, pad_len - total_seq_len))
    device = attention_mask_2d.device
    document_ids = attention_mask_2d.clone()

    if attention_chunk_size is not None:
        # Chunk index of every position, e.g. with chunk size 3: [0, 0, 0, 1, 1, 1, 2, ...]
        chunk_idxs = (document_ids.clone().fill_(1).cumsum(-1) - 1) // attention_chunk_size

    def causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx):
        """
        Defines the logic of a block causal mask by combining both a standard causal mask
        and a block diagonal document mask.

        See :func:`~torchtune.modules.attention_utils.create_block_causal_mask`
        for an illustration.
        """
        causal_mask = q_idx >= kv_idx
        document_mask = document_ids[batch_idx, q_idx] == document_ids[batch_idx, kv_idx]
        padding_mask = attention_mask_2d[batch_idx, q_idx] > 0
        final_mask = causal_mask & padding_mask & document_mask
        return final_mask

    def chunk_causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx):
        """
        Combines the chunk mask with the causal mask for chunked attention.
        """
        chunk_mask = chunk_idxs[batch_idx, q_idx] == chunk_idxs[batch_idx, kv_idx]
        causal_doc_mask = causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx)
        return chunk_mask & causal_doc_mask

    def default_mask_mod(batch_idx, head_idx, q_idx, kv_idx):
        """
        Utilizes default attention mask to enable encoder and encoder-decoder attention masks.
        """
        document_mask = document_ids[batch_idx, q_idx] == document_ids[batch_idx, kv_idx]
        # Padding is checked on the key/value index here, so non-causal (e.g. encoder) targets are handled.
        padding_mask = attention_mask_2d[batch_idx, kv_idx] > 0
        final_mask = padding_mask & document_mask
        return final_mask

    if not is_causal:
        mask_mod_maybe_combined = default_mask_mod
    else:
        mask_mod_maybe_combined = causal_mask_mod if attention_chunk_size is None else chunk_causal_mask_mod

    if offsets is not None:
        q_offset = offsets[0].to(device)
        kv_offset = offsets[1].to(device)

        def mask_mod(batch_idx, head_idx, q_idx, kv_idx):
            offset_q = q_idx + q_offset
            offset_kv = kv_idx + kv_offset
            return mask_mod_maybe_combined(batch_idx, head_idx, offset_q, offset_kv)
    else:
        mask_mod = mask_mod_maybe_combined

    return create_block_mask(
        mask_mod=mask_mod,
        B=batch_size,
        H=None,  # broadcast over attention heads
        Q_LEN=query_length,
        KV_LEN=key_length,
        device=device,
        _compile=True,
    )
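
# Example (illustrative) of what the deprecated helper above produces: `q`, `k` and `v` stand in for
# (batch, num_heads, seq_len, head_dim) tensors and are only placeholders, and the packed mask values
# are made up. The resulting BlockMask is what flex attention consumes instead of a dense mask.
#
#   packed_mask = torch.tensor([[1, 1, 1, 2, 2, 2, 0],
#                               [1, 1, 2, 2, 2, 3, 3]])
#   block_mask = make_flex_block_causal_mask(packed_mask)
#   attn_out = flex_attention(q, k, v, block_mask=block_mask)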
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


def flex_attention_forward(
    module: torch.nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Union[torch.Tensor, "BlockMask"],
    scaling: Optional[float] = None,
    softcap: Optional[float] = None,
    head_mask: Optional[torch.Tensor] = None,
    s_aux: Optional[torch.Tensor] = None,
    **kwargs,
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    if head_mask is not None:
        logger.warning_once(
            "`flex_attention` does not support `head_mask`. Please set your attention to `eager` if you want this"
            " feature."
        )

    if kwargs.get("dropout", 0.0) > 0:
        raise ValueError(
            "`flex_attention` does not support `dropout`. Please use it with inference only (`model.eval()`) or"
            " turn off the attention dropout in the respective config."
        )

    block_mask = None
    score_mask = None
    if isinstance(attention_mask, BlockMask):
        block_mask = attention_mask
    else:
        score_mask = attention_mask

    if score_mask is not None:
        score_mask = score_mask[:, :, :, : key.shape[-2]]

    def score_mod(score, batch_idx, head_idx, q_idx, kv_idx):
        if softcap is not None:
            score = softcap * torch.tanh(score / softcap)
        if score_mask is not None:
            score = score + score_mask[batch_idx][0][q_idx][kv_idx]
        if head_mask is not None:
            score = score + head_mask[batch_idx][head_idx][0][0]
        return score

    enable_gqa = True
    num_local_query_heads = query.shape[1]

    # GQA inside the flex attention kernel expects a power-of-two number of query heads (which can be
    # violated, e.g. under tensor parallelism); otherwise expand the KV heads manually and disable GQA.
    if (num_local_query_heads & (num_local_query_heads - 1)) != 0:
        key = repeat_kv(key, query.shape[1] // key.shape[1])
        value = repeat_kv(value, query.shape[1] // value.shape[1])
        enable_gqa = False

    kernel_options = kwargs.get("kernel_options")
    # Returning the log-sum-exp is not supported on CPU, but it is required to apply the attention sinks below.
    return_lse = query.device.type != "cpu"
    if not return_lse and s_aux is not None:
        raise ValueError(
            "Attention sinks cannot be run on CPU with flex attention. Please switch to a different device, e.g. CUDA"
        )

    flex_attention_output = compile_friendly_flex_attention(
        query,
        key,
        value,
        score_mod=score_mod,
        block_mask=block_mask,
        enable_gqa=enable_gqa,
        scale=scaling,
        kernel_options=kernel_options,
        return_lse=return_lse,
        training=module.training,
    )
    if return_lse:
        attention_output, lse = flex_attention_output
        lse = lse.to(value.dtype)

        if s_aux is not None:
            # Renormalize with the extra "sink" logits: they absorb part of the probability mass
            # without contributing an output, so the attention output is scaled down accordingly.
            batch_size, num_heads, seq_len_q, _ = attention_output.shape
            sinks = s_aux.view(1, -1, 1, 1).expand(batch_size, num_heads, seq_len_q, -1)
            lse_expanded = lse.unsqueeze(-1)
            combined_lse = torch.logsumexp(torch.cat([lse_expanded, sinks], dim=-1), dim=-1, keepdim=True)
            renorm_factor = torch.exp(lse_expanded - combined_lse)
            attention_output = attention_output * renorm_factor
    else:
        attention_output = flex_attention_output
        lse = None

    attention_output = attention_output.transpose(1, 2).contiguous()
    return attention_output, lse
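
# Example (illustrative) of how an attention layer dispatches to this integration; the module, tensor
# names and shapes below are placeholders. In practice `flex_attention_forward` is typically looked up
# through the library's attention dispatch (e.g. `ALL_ATTENTION_FUNCTIONS["flex_attention"]`) when a
# model is loaded with `attn_implementation="flex_attention"`.
#
#   attn_output, lse = flex_attention_forward(
#       module=self,                   # the attention nn.Module (only `module.training` is used here)
#       query=q,                       # (batch, num_heads, q_len, head_dim)
#       key=k,                         # (batch, num_kv_heads, kv_len, head_dim)
#       value=v,
#       attention_mask=block_mask,     # a BlockMask, or a 4D additive score mask
#       scaling=self.scaling,
#   )
#   # `attn_output` comes back as (batch, q_len, num_heads, head_dim) after the final transpose.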