from typing import Optional

import torch

from ..generation.continuous_batching import PagedAttentionCache
from ..utils import is_flash_attn_2_available


try:
    if is_flash_attn_2_available():
        from flash_attn import flash_attn_varlen_func

        FLASH_ATTN_VARLEN_FUNC = flash_attn_varlen_func
    else:
        raise RuntimeError(
            "Flash Attention 2 is not installed. Please refer to "
            "https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install it"
        )
except Exception as e:
    # Keep the import error around and only surface it if the kernel is actually called.
    msg = repr(e)

    def FLASH_ATTN_VARLEN_FUNC(*args, **kwargs):
        raise Exception(f"flash_attn_varlen_func is not available: {msg}")


def paged_attention_forward(
    module: torch.nn.Module,
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    cache: PagedAttentionCache = None,
    cu_seq_lens_q=None,
    cu_seq_lens_k=None,
    max_seqlen_q=None,
    max_seqlen_k=None,
    implementation=None,
    **kwargs,
) -> torch.Tensor:
    r"""Perform the forward pass of attention with a paged key-value cache.

    This function handles the cache updates and performs the attention computation using
    `flash_attn_varlen_func` for efficient processing.

    Args:
        q: (total_q, nheads, headdim), where total_q = total number of query tokens in the batch.
        k: (total_k, nheads_k, headdim), where total_k = total number of key tokens in the batch,
            but if there is a block table it can be the full k.
        v: (total_k, nheads_k, headdim), where total_k = total number of key tokens in the batch,
            but if there is a block table it can be the full v.
        cu_seq_lens_q: (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths of the
            sequences in the batch, used to index into q.
        cu_seq_lens_k: (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths of the
            sequences in the batch, used to index into kv.
        max_seqlen_q: int. Maximum query sequence length in the batch.
        max_seqlen_k: int. Maximum key sequence length in the batch.
        dropout_p: float. Dropout probability.
        softmax_scale: float. The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(headdim).
        causal: bool. Whether to apply a causal attention mask (e.g., for auto-regressive modeling).
        window_size: (left, right). If not (-1, -1), implements sliding-window local attention.
        softcap: float. Anything > 0 activates softcapping attention.
    """
    # (-1, -1) means an unrestricted attention window for flash-attn.
    sliding_window = (-1, -1) if not getattr(module, "sliding_window", False) else (module.sliding_window - 1, 0)
    layer_type = "full_attention" if sliding_window == (-1, -1) else "sliding_attention"

    # Write the new keys/values into the paged cache and read back the full k/v for this layer.
    k, v = cache.update(k, v, module.layer_idx, **kwargs)
    if isinstance(cu_seq_lens_k, dict):
        # Hybrid models keep separate key metadata per layer type.
        cu_seq_lens_k = cu_seq_lens_k[layer_type]
        max_seqlen_k = max_seqlen_k[layer_type]

    if implementation is not None and hasattr(implementation, "flash_attn_varlen_func"):
        flash_attn_varlen_func = implementation.flash_attn_varlen_func
    else:
        flash_attn_varlen_func = FLASH_ATTN_VARLEN_FUNC
    custom_kwargs = {"s_aux": kwargs.get("s_aux")} if "s_aux" in kwargs else {}

    attn_output = flash_attn_varlen_func(
        # (1, num_heads, seq, head_dim) -> (seq, num_heads, head_dim): varlen kernels expect packed tokens.
        q.transpose(1, 2).squeeze(0).contiguous(),
        k.transpose(1, 2).squeeze(0).contiguous(),
        v.transpose(1, 2).squeeze(0).contiguous(),
        cu_seq_lens_q.to(torch.int32),
        cu_seq_lens_k.to(torch.int32).clone(),
        max_seqlen_q,
        max_seqlen_k,
        softmax_scale=module.scaling,
        causal=True,  # the causal mask is automatically aligned when q is shorter than k
        window_size=sliding_window,  # -1 means an infinite context window
        **custom_kwargs,
    )
    if isinstance(attn_output, tuple):
        attn_output = attn_output[0]
    # Attention weights are not returned.
    return attn_output, None
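
# Usage sketch: a minimal, hedged illustration of the varlen ("packed") layout described in the
# docstring above, where all sequences of a batch are concatenated along the token dimension and
# indexed through cumulative sequence lengths (cu_seq_lens_*). The sequence lengths, head count,
# and head dimension below are made-up example values, not taken from the library; only torch is
# needed and no flash-attn kernel is invoked.
if __name__ == "__main__":
    # Hypothetical per-sequence token counts for a batch of three requests.
    seq_lens = torch.tensor([5, 3, 7], dtype=torch.int32)

    # Cumulative sequence lengths, shape (batch_size + 1,): [0, 5, 8, 15].
    # Tokens of sequence i live at cu_seq_lens[i]:cu_seq_lens[i + 1] in the packed tensor.
    cu_seq_lens = torch.nn.functional.pad(seq_lens.cumsum(0, dtype=torch.int32), (1, 0))
    max_seqlen = int(seq_lens.max())

    # Packed query tensor: (total_tokens, num_heads, head_dim), with no batch dimension.
    total_tokens, num_heads, head_dim = int(seq_lens.sum()), 4, 64
    q_packed = torch.randn(total_tokens, num_heads, head_dim)

    print(cu_seq_lens.tolist(), max_seqlen, tuple(q_packed.shape))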