L ifddlZddlmZmZddlZddlmZmZmZm Z m Z ddl m Z ddl mZmZmZmZddlmZmZddlmZmZmZGd d ZGd d ej4j6ZGd dej4j6ZGddej4j6Z d/dedeej>deej>dee dee!f dZ"Gddej4j6Z#Gddej4j6Z$Gddej4j6Z% d0dedeej>deej>fdZ&dZ'd efd!Z(d"ejRjTjVfd#Z, d1d$e-d%ej>d&e-d'e-d(eed)eej>d*ee-d+e!d,e!d-eej>fd.Z.y)2N)CallableOptional) DynamicCache DynamicLayerDynamicSlidingWindowLayerEncoderDecoderCache StaticCache)GenerationConfig)ALL_MASK_ATTENTION_FUNCTIONS_ignore_causal_mask_sdpa#_is_torch_greater_or_equal_than_2_5prepare_padding_mask)ALL_ATTENTION_FUNCTIONSPreTrainedModel)is_torch_greater_or_equal"is_torch_greater_or_equal_than_2_3"is_torch_greater_or_equal_than_2_6cJeZdZdZd dedefdZdZdZdZdZ d Z d d Z y )TorchExportableModuleForVLMa| A wrapper class for exporting Vision-Language Models (VLMs) like SmolVLM2 for ExecuTorch. This class handles the export of three main components: 1. Vision encoder (processes images to visual features) 2. Connector/projector (maps visual features to text embedding space) 3. Text decoder (generates text from combined visual and text tokens) max_batch_size max_cache_lenc||_||_||_|j|_|jj|_|jj |_|jj|_d|_ d|_ d|_ y)a  Initialize the exportable VLM module. Args: model: The VLM (e.g. SmolVLM) model instance max_batch_size: Maximum batch size. 
Always 1 for ExecuTorch max_cache_len: Maximum cache length for text generation N) modelrrconfig vision_modelvision_encoder connector text_model text_decoderexported_vision_encoderexported_connectorexported_text_decoder)selfrrrs j/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/transformers/integrations/executorch.py__init__z$TorchExportableModuleForVLM.__init__2sw ,*ll $kk66..!KK22(,$"&%)"c|jjtjddddtj}dtj j jtj j jdi}tj j |j|f|d|_|jS) z$Export the vision encoder component.idtype pixel_values)rr*Fargsdynamic_shapesstrict) revaltorchrandnfloat32exportDimAUTOr!)r$r-r0s r%export_vision_encoderz1TorchExportableModuleForVLM.export_vision_encoderJs   "{{1acG  <<##((<<##(( (-||':':   ) (;( $+++r'c&|jj|jjj}|jjj }|jjj }||z}||z}tjd||tj}ddtjjjii}tjj|j|f|d|_ |jS)zExport the connector component.r)r+image_hidden_statesFr.)rr2r vision_config hidden_size image_size patch_sizer3r4r5r6r7r8r")r$vision_hidden_sizer>r?patches_per_dim num_patchesr;r0s r%export_connectorz,TorchExportableModuleForVLM.export_connectorbs "[[66BB[[..99 [[..99 $ 2%7 #kk![:LTYTaTab0!U\\5E5E5J5J1KL#(,,"5"5 NN%') #6# &&&r'ct|j|_d}tjd|ftj }tj |tj }t|j|jjj}tjjd|dz }d|id|id}|jj|||d |_|jS) z"Export the text decoder component.)rr*r)r+seq_length_dimmaxr input_idscache_positionF)rIrJr0r1)%TorchExportableModuleForDecoderOnlyLMr exportable_text_decoderr3zeroslongarangeminrr text_configmax_position_embeddingsr6r7r#)r$ seq_lengthrIrJmax_seq_length seq_len_dimr0s r%export_text_decoderz/TorchExportableModuleForVLM.export_text_decoder{s(MSWSdSd'e$ KKJuzzB j CT//1H1H1`1`all&&'7^a=O&P [) +.  &*%A%A%H%H)) &I& ")))r'c |jdi||jdi||jdi||j|j|j dS)z'Export all components of the VLM model.)rrr )r9rCrVr!r"r#)r$kwargss r%r6z"TorchExportableModuleForVLM.exports`""",V,''   *6*"::00 66  r'cy)a Simplified forward pass for inference with guaranteed non-null input_ids and cache_position. 
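# The `export_connector` logic above sizes its example input from the vision
# config: one token per image patch, each of width `vision_hidden_size`. A
# minimal stdlib sketch of that shape bookkeeping (the sizes below are
# illustrative, not SmolVLM2's actual config values):

```python
# Hypothetical helper mirroring the shape computation used to build the
# connector's example input of shape (1, num_patches, vision_hidden_size).
def connector_example_shape(image_size, patch_size, vision_hidden_size):
    patches_per_dim = image_size // patch_size       # patches along one edge
    num_patches = patches_per_dim * patches_per_dim  # total patch tokens
    return (1, num_patches, vision_hidden_size)

# e.g. a 384px image with 16px patches -> 24 * 24 = 576 patch tokens
print(connector_example_shape(384, 16, 768))  # → (1, 576, 768)
```

# The real export then marks the batch dimension dynamic with
# `torch.export.Dim.AUTO` while the patch and hidden dimensions stay static.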
Args: pixel_values: Input images [1, channels, height, width] (optional) input_ids: Text token IDs [1, seq_len] (required - won't be None) cache_position: Cache positions [seq_len] (required - won't be None) Returns: Output with logits for text generation NrX)r$r-rIrJs r%forwardz#TorchExportableModuleForVLM.forwards r'Nc y)a Simplified generate method with guaranteed non-null input_ids. Args: pixel_values: Input images [1, channels, height, width] (optional) input_ids: Initial text tokens [1, seq_len] (required - won't be None) max_new_tokens: Maximum number of tokens to generate do_sample: Whether to use sampling or greedy decoding temperature: Temperature for sampling Returns: Generated sequences NrX)r$r-rImax_new_tokens do_sample temperaturerYs r%generatez$TorchExportableModuleForVLM.generates r')r))NN2F?) __name__ __module__ __qualname____doc__intr&r9rCrVr6r[r`rXr'r%rr(s?*c*c*0,0'2*6    be r'rceZdZdZ ddedeedeedeejddf fd Z dd eejd eejd eejdejfd Z dd eejd eejd eejd ee dee dejjf dZe ddejjdedede dededededefdZxZS)rKa A recipe module designed to make a `PreTrainedModel` exportable with `torch.export`, specifically for decoder-only LM with cache. This module ensures that the exported model is compatible with further lowering and execution in `ExecuTorch`. Nr batch_sizerdevicereturnct||jj}t |dr|j dur t dt |dr!t|ddt|||||_ n(tjdt|||||_ tjdtt!jdt d d|jjj_y) z Initializes the exportable module. Args: model (`PreTrainedModel`): The pretrained model to wrap. Raises: ValueError: If the model is configured with a unsupported cache implementation. use_cacheFz5The model must have caching enabled to be performant. 
layer_typessliding_windowNzmUsing `StaticCache` for export as `layer_types` is not specified or `sliding_window` is `null` in the config.sdpa_without_vmapsdpa)superr&rget_text_confighasattrrn ValueErrorgetattr$TorchExportableModuleWithHybridCacherlogginginfo$TorchExportableModuleWithStaticCacher registersdpa_mask_without_vmapr_attn_implementation)r$rrjrrkr __class__s r%r&z.TorchExportableModuleForDecoderOnlyLM.__init__s --/v{+v/?/?5/HTU U 6= )gf>NPT.U.a=eZQ^`fgDJ LL >eZQ^`fgDJ$--.ACYZ(()<>UV\>]^7J 4r'rI inputs_embedsrJc>|jj|||S)a Forward pass of the module, which is compatible with the ExecuTorch llm runner. Args: input_ids (`torch.Tensor`): Tensor representing current input token id to the module. inputs_embeds (`torch.Tensor`): Tensor representing current input embeddings to the module. cache_position (`torch.Tensor`): Tensor representing current input position in the cache. Returns: torch.Tensor: Logits output from the model. )rIrrJ)rr[)r$rIrrJs r%r[z-TorchExportableModuleForDecoderOnlyLM.forwards)"zz!!')"  r'r0r1c|du|duz s tdt|jdrBt|j|jj|j}|j }nNt|jdr!|jjj }nd}t jd|;|||n2tj|jdtj|d }n:|||n2tj|jd tj|d }tjj|jd ||||nd } | S)aw Export the wrapped module using `torch.export`. Args: input_ids (`Optional[torch.Tensor]`): Tensor representing current input token id to the module. Must specify either this or inputs_embeds. inputs_embeds (`Optional[torch.Tensor]`): Tensor representing current input embeddings to the module. Must specify either this or input_ids. cache_position (`Optional[torch.Tensor]`): Tensor representing current input position in the cache. If not provided, a default tensor will be used. dynamic_shapes (`Optional[dict]`): Dynamic shapes to use for export if specified. strict(`Optional[bool]`): Flag to instruct `torch.export` to use `torchdynamo`. Returns: torch.export.ExportedProgram: The exported program that can be used for inference. 
Examples: Export with input_ids: ```python # Prepare inputs input_ids = torch.tensor([[1, 2, 3]], dtype=torch.long, device=model.device) cache_position = torch.arange(input_ids.shape[-1], dtype=torch.long, device=model.device) # Export exported = exportable_module.export( input_ids=input_ids, cache_position=cache_position ) ``` Export with inputs_embeds: ```python # Prepare embeddings inputs_embeds = torch.randn(1, 3, 768, device=model.device) # batch_size=1, seq_len=3, hidden_size=768 cache_position = torch.arange(inputs_embeds.shape[1], dtype=torch.long, device=model.device) # Export exported = exportable_module.export( inputs_embeds=inputs_embeds, cache_position=cache_position ) ``` Nz2Need to specify either input_ids or inputs_embeds.base_model_prefixrcpuzfTorchExportableModuleForDecoderOnlyLM.export Can't infer device from the model. Set to CPU by default.r,rkrHr))rrJrXTr/rYr0r1) rvrurrwrrkrywarningr3rOshaperNr6) r$rIrrJr0r1base model_device input_kwargsexported_programs r%r6z,TorchExportableModuleForDecoderOnlyLM.exportsBjT!mt&;<QR R 4::2 34::tzz'C'CTZZPD;;L TZZ )::++22L L OOx   &!-#1\\)//""5UZZP\] L"/!-#1\\-"5"5a"8 S_` L!<<.. JJ)#/6T /  r'rpromptr]r^r_top_ktop_pc |j} ||djj|} | j} d} t | j dD]F} | dd| | dzf}t j| gt j|}| ||}| dz } Ht |D]}| ddddf}t j| gt j|}| ||}|r|dkDr||z }n|}|dkDr-|t j||dd k}td ||<|d krt j|d \}}t jt j|dd}||kD}|dddfj|dddf<d|d<|jd||}td ||<t j|d}t j|d}n|j!dd }|j#dkDr|j%d}t j&| |gd} | dz } |j)|j*k(sn|j-| dd S)a Generate a sequence of tokens using an exported program. Args: exported_program (`torch.export.ExportedProgram`): The exported model being used for generate. tokenizer: The tokenizer to use. prompt (str): The input prompt. max_new_tokens (int): Maximum number of new tokens to generate. do_sample (bool): Whether to use sampling or greedy decoding. temperature (float): The temperature for sampling. top_k (int): The number of highest probability tokens to keep for top-k sampling. top_p (float): The cumulative probability for nucleus sampling. 
device (str): The device to use. Returns: str: The generated text. pt)return_tensorsrr)NrrHr).rNz-infrcT) descendingdim.).r) num_samplesrkeepdimr)skip_special_tokens)modulerItoclonerangerr3tensorrNtopkfloatsortcumsumsoftmaxscatter multinomialargmaxrsqueezecatitem eos_token_iddecode)r tokenizerrr]r^r_rrrkexported_modulerI generated_ids curr_positionicurr_input_idscurr_cache_position_outputslogitsindices_to_remove sorted_logitssorted_indicescumulative_probssorted_indices_to_removeprobs next_token_ids r%r`z.TorchExportableModuleForDecoderOnlyLM.generateas<+113fT:DDGGO ")  yq)* A&q!a!e)|4N"',, ejjY_"`  .I\]A Q M ~&5 A*1bc62N"',, ejjY_"` &ObcG?${2F$F19(.FE1J11Mm1\(\%05f F,-3;49JJvRV4W1M>',||EMM-UW4X^`'a$0@%/G,8PQTVYWYVYQY8Z8`8`8b,S!"W578,V4)A(H(H^]u(v%05f F,- f"5 % 1 1%Q G !(2t D   "Q& - 5 5b 9 "II}m&D"MM Q M!!#y'='==k5 p a 0dKKr'NNN)NNNNN)Frcrbrcr)rdrerfrgrrrhr3rkr&Tensorr[dictboolr6ExportedProgram staticmethodstrrr` __classcell__rs@r%rKrKs%)'+)- #K#KSM#K } #K & #K  #KN-10415  ELL)   - !.    2-10415)-!% Z ELL)Z   -Z !. Z ! Z  Z   % %Z x ! iL,,66iLiL iL  iL  iLiLiLiL iLiLr'rKc BeZdZdZ ddedeedeedeejddf fd Z dd eejd eejd eejfd Z e d ejjdejdedejfdZxZS)r{a A recipe module designed to make a `PreTrainedModel` exportable with `torch.export`, specifically for decoder-only LM to `StaticCache`. This module ensures that the exported model is compatible with further lowering and execution in `ExecuTorch`. Note: This class is specifically designed to support export process using `torch.export` in a way that ensures the model can be further lowered and run efficiently in `ExecuTorch`. Nrrjrrkrlct ||jj}|j}| t d|j s t d|jdk7r t d|jin |j}||jdd}| td||jdd}| td ||jd |j}||_ t|| |_t|d |j |j"z}t|d |j"} |jj$} |jj'|| || |t)t+|jD]r} |j-d| |jj.| j0d|j-d| |jj.| j2dty)a Initializes the wrapper module with the pretrained model. Args: model (`PreTrainedModel`): The pretrained model to wrap. The model must have caching enabled and use a 'static' caching implementation. 
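# The `generate` helper above masks logits with a top-k cutoff and a top-p
# (nucleus) cutoff before sampling. A stdlib sketch of that filtering with
# plain lists standing in for tensors (the tensor version uses `torch.topk`,
# `torch.sort`, and `torch.cumsum`):

```python
import math

# Reproduce the pre-sampling logit filtering: tokens outside the top-k set
# or outside the top-p nucleus are pushed to -inf so they cannot be sampled.
def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    filtered = list(logits)
    if top_k > 0:
        # keep only the k highest logits, mask the rest
        kth = sorted(filtered, reverse=True)[min(top_k, len(filtered)) - 1]
        filtered = [v if v >= kth else float("-inf") for v in filtered]
    if top_p < 1.0:
        order = sorted(range(len(filtered)), key=filtered.__getitem__, reverse=True)
        probs = [math.exp(filtered[i]) for i in order]  # exp(-inf) == 0.0
        total = sum(probs)
        keep, cum = set(), 0.0
        for i, p in zip(order, probs):
            keep.add(i)            # the highest-probability token is always kept
            cum += p / total
            if cum > top_p:        # nucleus reached: drop everything after this
                break
        filtered = [v if i in keep else float("-inf") for i, v in enumerate(filtered)]
    return filtered

print(top_k_top_p_filter([2.0, 1.0, 0.0], top_k=2))  # → [2.0, 1.0, -inf]
```

# After filtering, the real code applies `torch.softmax` and
# `torch.multinomial` to draw the next token.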
batch_size (`Optional[int]`): The batch size of the model. If not provided, we check if a value can be found in `generation_config.cache_config` and otherwise we raise a ValueError. max_cache_len (`Optional[int]`): The maximum cache length for generation. Same mechanism as `batch_size` if not provided. device (`Optional[torch.device]`): The device to use. If not provided, we check if a value can be found in `generation_config.cache_config` and otherwise we use `model.device` (no error is raised). Raises: AssertionError: If the pretrained model does not have caching enabled or if it does not use a 'static' caching implementation in `model.generation_config`. ValueError: If `batch_size` or `max_cache_len` is not provided, either as an argument or in `cache_config`. NvThe model must have a generation config to be exported with static caching. Please set `generation_config` in `model`.zvThe model must have caching enabled to be exported with static caching. Please set `generation_config.use_cache=True`.staticzThe model must use a 'static' caching implementation to be exported with static caching. Please set `generation_config.cache_implementation='static'`.rjFbatch_size must be provided, either as an argument or in cache_config.rImax_cache_len must be provided, either as an argument or in cache_config.rk)rrhead_dimnum_key_value_heads key_cache_F persistent value_cache_)rsr&rrtgeneration_configAssertionErrorrncache_implementation cache_configgetrvrkrr static_cacherwr=num_attention_headsr,early_initializationrlenregister_bufferlayerskeysvalues r$rrjrrkrrrr num_headsr,rrs r%r&z-TorchExportableModuleWithStaticCache.__init__s2 --/!33  $ = !** A   1 1X = P  /;;CrIZIgIg   %)),=J! !ijj  (,,_dCM$ !lmm >!%%h =F 'mFS6:v/A/AVE_E_/_`F$96;U;UV     ..z9hPUW]^s4,,-. kA  :aS!143D3D3K3KA3N3S3S`e f  <s!3T5F5F5M5Ma5P5W5Wdi j kr'rIrrJc|j}|j|||d|d}t|dr |jS|jS)a8 Forward pass of the module, which is compatible with the ExecuTorch runtime. 
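# The `register_buffer` loop above exposes each layer's preallocated key and
# value caches under the names `key_cache_<i>` / `value_cache_<i>`, so the
# exported graph carries fixed-shape state. A toy stand-in using nested
# lists (real buffers are tensors of shape (batch, heads, max_cache_len,
# head_dim); the sizes here are assumptions for illustration):

```python
# Preallocate per-layer key/value slots, indexable by cache position.
def init_static_cache(num_layers, max_cache_len, head_dim):
    cache = {}
    for i in range(num_layers):
        cache[f"key_cache_{i}"] = [[0.0] * head_dim for _ in range(max_cache_len)]
        cache[f"value_cache_{i}"] = [[0.0] * head_dim for _ in range(max_cache_len)]
    return cache

cache = init_static_cache(num_layers=2, max_cache_len=4, head_dim=3)
cache["key_cache_0"][1] = [0.1, 0.2, 0.3]  # in-place write at cache_position == 1
print(sorted(cache))
```

# Because the slots exist up front, generation never reallocates: each step
# writes into the row selected by `cache_position`.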
Args: input_ids (`torch.Tensor`): Tensor representing current input token id to the module. inputs_embeds (`torch.Tensor`): Tensor representing current input embeddings to the module. cache_position (`torch.Tensor`): Tensor representing current input position in the cache. Returns: torch.Tensor: Logits output from the model. This forward adapter serves two primary purposes: 1. **Making the Model `torch.export`-Compatible**: The adapter hides unsupported objects, such as the `Cache`, from the graph inputs and outputs, enabling the model to be exportable using `torch.export` without encountering issues. 2. **Ensuring Compatibility with `ExecuTorch` runtime**: The adapter matches the model's forward signature with that in `executorch/extension/llm/runner`, ensuring that the exported model can be executed in `ExecuTorch` out-of-the-box. NTrIrrJattention_maskpast_key_valuesrnr)rrrurlast_hidden_state)r$rIrrJroutss r%r[z,TorchExportableModuleWithStaticCache.forward#sX6++zz')+   4 ";; )) )r'rprompt_token_idsr]c f|j}|jd}||z}|jD]3\}}|jds|jd}t ||}ng} t t ||D]y} |j j|dd| | dzftj| gtj|} | j|d| j{tj dddddfd j} | j| t| |kr|j jtj| ggtj|tjt| gtj|} tj| dddddfd j} | j| t| |krtj| gtj|S) a Generate a sequence of tokens using an exported program. This util function is designed to test exported models by simulating the generation process. It processes the input prompt tokens sequentially (no parallel prefill). This generate function is not intended to replace the original `generate` method, and the support for leveraging the original `generate` is potentially planned! Args: exported_program (`torch.export.ExportedProgram`): The exported program generated via `torch.export`. prompt_token_ids (`torch.Tensor`): Tensor representing the input prompt token IDs. max_new_tokens (`int`): Maximum number of new tokens to generate. Note that the total generation length is limited by both `max_new_tokens` and the model's cache size. 
Returns: torch.Tensor: A tensor containing the generated sequence of token IDs, including the original prompt tokens. r key_cacherNr)rrHrr)rkr named_buffers startswithrPrrr[r3rrNappendrrr) rrr]rkprompt_token_lenmax_generation_length buffer_namebufferrresponse_tokens input_posresult current_tokens r%r`z-TorchExportableModuleWithStaticCache.generateOs."((+11"5 0> A#3#A#A#C  K%%k2 & Q (+,A=(Q%   s#8:JKL JI%,,.66*1i)a-.G+GH$||YKuzzRXY7F  " "#3A#6y#A#F#F#H I  J VAr1H%52>CCE }-/"%::%,,.66,,'8 SYZ$||S-A,B%**]cd7F"LL2q)9rBGGIM  " "= 1 /"%::||_-UZZOOr'r)rdrerfrgrrrhr3rkr& LongTensorrr[rr6rr`rrs@r%r{r{s%)'+)- HkHkSMHk } Hk & Hk  HkX150415 **E,,-**  -**!. **X2P,,662P,,2P2P  2P2Pr'r{c eZdZdZ d dedeedeedeejddf fd Z d d eejd eejd eejdejfd Z xZ S)rxa A recipe module designed to make a `PreTrainedModel` exportable with `torch.export`, specifically for decoder-only LM to hybrid `StaticCache`. This module ensures that the exported model is compatible with further lowering and execution in `ExecuTorch`. Nrrjrrkrlct |||_|jj }|j }| t d|js t d|jin |j}||jdd}| td||jdd}| td||jd|j}t|| |_ t|d |j|j z}t|d |j } |jj"} |jj%|| || |t't)|jD]r} |j+d | |jj,| j.d |j+d| |jj,| j0d ty)a Initializes the exportable module. Args: model (`PreTrainedModel`): The pretrained model to wrap. batch_size (`Optional[int]`): The batch size of the model. If not provided, we check if a value can be found in `generation_config.cache_config` and otherwise we raise a ValueError. max_cache_len (`Optional[int]`): The maximum cache length for generation. Same mechanism as `batch_size` if not provided. device (`Optional[torch.device]`): The device to use. If not provided, we check if a value can be found in `generation_config.cache_config` and otherwise we use `model.device` (no error is raised). Raises: AssertionError: If the model doesn't have the expected configuration for hybrid StaticCache. ValueError: If `batch_size` or `max_cache_len` is not provided, either as an argument or in `cache_config`. 
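# The static-cache `generate` above feeds prompt tokens one at a time (no
# parallel prefill) and then greedily decodes until EOS or the preallocated
# cache is exhausted. A stdlib sketch of that control flow, with a toy
# callable standing in for the exported program's forward:

```python
# Sequential prefill + greedy argmax decode, bounded by the cache size.
def generate_greedy(model, prompt, max_new_tokens, cache_len, eos_id):
    tokens = list(prompt)
    max_total = min(cache_len, len(prompt) + max_new_tokens)
    logits = None
    for pos, tok in enumerate(tokens):       # prefill: one token per step
        logits = model(tok, pos)
    while len(tokens) < max_total:           # decode until EOS or cache full
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_tok)
        if next_tok == eos_id:
            break
        logits = model(next_tok, len(tokens) - 1)
    return tokens

# Toy next-token rule: always predict tok + 1; EOS is token id 4.
toy = lambda tok, pos: [1.0 if i == tok + 1 else 0.0 for i in range(6)]
print(generate_greedy(toy, prompt=[0, 1], max_new_tokens=10, cache_len=8, eos_id=4))
# → [0, 1, 2, 3, 4]
```

# The `cache_len` bound mirrors the source's check against the registered
# `key_cache` buffer's sequence dimension.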
Nrz Model must have caching enabled.rjrrrrkrrrrrFrr)rsr&rrrtrrrnrrrvrkr cacherwr=rr,rrrrrrrrs r%r&z-TorchExportableModuleWithHybridCache.__init__s,  --/!33  $ =  !CD D.;;CrIZIgIg  %)),=J! !ijj  (,,_dCM$ !lmm >!%%h =F!mL 6:v/A/AVE_E_/_`F$96;U;UV     '' IxPVWs4::' dA  :aS!14::3D3DQ3G3L3LY^ _  <s!3TZZ5F5Fq5I5P5P]b c dr'rIrrJc\|j|||d|jd}|jS)a Forward pass of the module, which is compatible with the ExecuTorch llm runner. Args: input_ids (`torch.Tensor`): Tensor representing current input token id to the module. inputs_embeds (`Optional[torch.Tensor]`): Tensor representing current input embeddings to the module. cache_position (`torch.Tensor`): Tensor representing current input position in the cache. Returns: torch.Tensor: Logits output from the model. NTr)rrr)r$rIrrJrs r%r[z,TorchExportableModuleWithHybridCache.forwards9$**') JJ  ~~r'r)rdrerfrgrrrhr3rkr&rrr[rrs@r%rxrxs%)'+)- =d=dSM=d } =d & =d  =dB150415 E,,-  -!.   r'rxrexample_input_idsexample_cache_positionr0r1cts tdddl}tjdt t jdt dd|j_|j5||n*|jdgg|j|j}||n)|jdg|j|j}tdr1|jjt|d ||d |||nd }nd|t!j"d |t!j"d|jj$j't|d ||d dd }|cdddS#1swYyxYw)a Convert a `PreTrainedModel` into an exportable module and export it using `torch.export`, ensuring the exported model is compatible with `ExecuTorch`. Args: model (`PreTrainedModel`): The pretrained model to be exported. example_input_ids (`Optional[torch.Tensor]`): Example input token id used by `torch.export`. example_cache_position (`Optional[torch.Tensor]`): Example current cache position used by `torch.export`. dynamic_shapes(`Optional[dict]`): Dynamic shapes used by `torch.export`. strict(`Optional[bool]`): Flag to instruct `torch.export` to use `torchdynamo`. Returns: Exported program (`torch.export.ExportedProgram`): The exported program generated via `torch.export`. 
torch >= 2.3 is required.rNrqrrr)rz2.6.0rXrHTrzWDynamic shapes spec will be ignored by convert_and_export_with_cache for torch < 2.6.0.zSThe strict flag will be ignored by convert_and_export_with_cache for torch < 2.6.0.F)r/rY pre_dispatchr1)r ImportErrortorch.export._tracer r|r}rrr~no_gradrrNrkrr6r{ryr_trace_export)rrrr0r1r3rs r%convert_and_export_with_cachers* .566!))*=?UV$$%8:QRX:YZ(;ELL% ' !, se5::ellK &1 #qcELLI  %W -$||224U;%6J`a-!'!3v 3  )m! uv %||22::4U;%6J`a" ;   O' ' ' s /C?E88Fc(eZdZdZfdZdZxZS) Seq2SeqLMEncoderExportableModulez A wrapper module designed to make a Seq2Seq LM encoder exportable with `torch.export`. This module ensures that the exported encoder model is compatible with ExecuTorch. c0t|||_yN)rsr&encoder)r$ encoder_modelrs r%r&z)Seq2SeqLMEncoderExportableModule.__init__9s $ r'c:|j|jS)N)rI)r r)r$rIs r%r[z(Seq2SeqLMEncoderExportableModule.forward=s||i|0BBBr'rdrerfrgr&r[rrs@r%r r 3s %Cr'r c(eZdZdZfdZdZxZS)/Seq2SeqLMDecoderExportableModuleWithStaticCachez A wrapper module designed to make a Seq2Seq LM decoder exportable with `torch.export`, specifically for use with static caching. This module ensures the exported decoder is compatible with ExecuTorch. ct||j|_|j|_|j |_t |jj}t|j ||_ t|j d|j j|j jz}t|j d|j j}|jj|||tj |t#|jt%|j |_t)t+t-|jD]r}|j/d||jj0|j2d|j/d||jj0|j4dty) NrrrrrFrr)rsr& get_decoderdecoderlm_headrnext parametersrkr rrwr=rrr3r5r rr%register_dynamic_cache_export_supportrrrrrr) r$rmax_static_cache_lengthrjrrrrrs r%r&z8Seq2SeqLMDecoderExportableModuleWithStaticCache.__init__Hs ((* }} ll E,,./66 (t{{Jab4;; DKK4K4Kt{{OnOn4noDKK)> @_@_`  ..z9hPUP]P]_kl():):LPTP[P[<\] -/s4,,-. 
kA  :aS!143D3D3K3KA3N3S3S`e f  <s!3T5F5F5M5Ma5P5W5Wdi j kr'cn|j|||jd|}|j|d}|S)NT)rIencoder_hidden_statesrrnrJr)rrr)r$decoder_input_idsrrJr lm_logitss r%r[z7Seq2SeqLMDecoderExportableModuleWithStaticCache.forwardbsB,,'"7 JJ)  LL, r'rrs@r%rrAs k4 r'rc<eZdZ dfd ZdZdZddZdZxZS)Seq2SeqLMExportableModulect|||_|j|_|j |_||_td||||d|_d|_ d|_ y)NT)rjr)rn max_lengthrr) rsr& full_model get_encoderr rmax_hidden_seq_lengthr rexported_encoderexported_decoder)r$rrjr%rmax_cache_lengthrs r%r&z"Seq2SeqLMExportableModule.__init__ssm ((* ll %:"!1'!5(!1 " !% $r'ct|jj|jjj }t jjd|j}t j5t jj||fdd|iid}ddd|S#1swYSxYw)Nencoder_seq_lengthrFrIr)Tr0r1) r r rr#rkr2r3r6r7r%r)r$encoder_input_idswrapped_encoderrUr&s r%_export_encoderz)Seq2SeqLMExportableModule._export_encoders:4<<HKKDOOLbLbchhjll&&';A[A[&\ ]]_ $||22"3!5{UVXcTdFenr 3         s )B99Cc |jj}t|j|jjj d|jjj dj |j}|j |}|j |}|j |}tjjd|j}tj5tjj||||fdd|iddd }ddd|S#1swYSxYw) Nrrj)rrrjencoder_hidden_seq_lengthrFr))rrrJTr+) r#rkrrrrrr2r3r6r7r%r)r$rrrJ target_devicewrapped_decoderencoder_seq_len_dimr's r%_export_decoderz)Seq2SeqLMExportableModule._export_decoders1.. ;oo(,(>(>(K(K(O(OP_(`11>>BB<P  R  TV .00? 5 8 8 G'**=9$ll../JPTPjPj.k]]_ $||22"$9>J)-./1D-E&*   3      s -D;;EcX|jj}||n%tjdtj|}||n'tj dggtj|}||n&tj dgtj|}||n_tj |jjjdd|jjftj|} |j||_|j|| ||_|S)N)r) rrrjr6)r#rkr3onesrNrrMrrrrd_modelr5r.r&r4r') r$r,rrrJrkexample_encoder_input_idsexample_decoder_input_idsrexample_encoder_hidden_statess r%r6z Seq2SeqLMExportableModule.exports''!, G5::fE "!, se5::fE " -8NellA3V[V`V`io>p  %0 "''4488FDKKL_L_`mm &!% 4 45N O $ 4 4 %'DF\!   r'c tj5|jj}|j|k7r|j |}|j j |}tjdggtj|}dg}t|dz D]}|jj ||tj|gtj|}tj|dddddfdj} |j| tj| ggtj|}| |jjk(sn|cdddS#1swYyxYw)Nrrr)rr)r3rr#rkrr&rrrNrr'rrrrr) r$rr]rencoder_outputrrrr next_tokens r%r`z"Seq2SeqLMExportableModule.generatesR ]]_ !??11L &&,6#3#6#6|#D A-. 7..557%~u||QCuzzbn7o #\\&B*:CHHJ $$Z0%*LL:,uzzZf$g!!9!99 "!A ! ! !sEE?1E??F)r)irraNNNN) rdrerfr&r.r4r6r`rrs@r%r r rs!os%*  ! 
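# `Seq2SeqLMExportableModule.generate` above runs the exported encoder once,
# then steps the exported decoder with only the previous token plus the fixed
# encoder output. A stdlib sketch of that loop with stub callables standing
# in for the two exported programs (the start token id 0 is an assumption):

```python
# Encoder-once, decoder-step-by-step generation loop.
def seq2seq_generate(encoder, decoder, prompt, max_new_tokens, eos_id):
    enc_out = encoder(prompt)          # single encoder pass over the prompt
    tokens = [0]                       # decoder_start_token_id, assumed 0
    for pos in range(max_new_tokens):
        logits = decoder(tokens[-1], enc_out, pos)  # last token + cached enc_out
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_tok)
        if next_tok == eos_id:
            break
    return tokens

enc = lambda ids: [float(sum(ids))]    # stand-in "encoder hidden states"
dec = lambda tok, enc_out, pos: [1.0 if i == min(tok + 1, 3) else 0.0 for i in range(4)]
print(seq2seq_generate(enc, dec, prompt=[5, 6], max_new_tokens=8, eos_id=3))
# → [0, 1, 2, 3]
```

# Keeping the encoder output fixed across decoder steps is what lets the
# decoder be exported separately with a static cross-attention input shape.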
F@!!r'r example_attention_maskc ts tdtjdtt jdt dd|j _ttj5tjj|d||t|j ddd }|cd d d S#1swYy xYw) a Export a model with DynamicCache using `torch.export`, ensuring the exported model is compatible with `ExecuTorch`. Args: model (`PreTrainedModel`): The pretrained model to be exported. example_input_ids (`Optional[torch.Tensor]`): Example input token id used by `torch.export`. example_attention_mask (`Optional[torch.Tensor]`): Example attention mask used by `torch.export`. Returns: Exported program (`torch.export.ExportedProgram`): The exported program generated via `torch.export`. rrqrrrXrT)rIrrrnF)r1N) rrr r|r}rrr~rr3rr6r)rrr@rs r%export_with_dynamic_cacherBs .566!))*=?UV$$%8:QRX:YZ(;ELL%)+    <<..  ."8#/u||#D!   /       s 8>CC c^ tjjjtdt tj dtjdtjjjtdy#t$r}dt|vrYd}~yd}~wwxYw)z> Utilities for `DynamicCache` <> torch.export support cftjjjt |Sr )r3utils_pytree _dict_flatten_get_cache_dict dynamic_caches r%z7register_dynamic_cache_export_support...s!%++"5"5"C"COTaDb"cr'.cftjjjt |Sr )r3rErF_dict_flatten_with_keysrHrIs r%rKz7register_dynamic_cache_export_support..1s#u{{7J7J7b7b .8r')serialized_type_nameflatten_with_keys_fnchtjjjt ||Sr )r3fxrF_dict_flatten_specrH)rspecs r%rKz7register_dynamic_cache_export_support..8s$ 0 0 C COTYDZ\` ar'z!already registered as pytree nodeN) r3rErFregister_pytree_noder_unflatten_dynamic_cachererdrRregister_pytree_flatten_specrvr)es r%rr&s  00  c $$0$;$;#S>S=T!U" 1  55  a  .c!f <  =sBB B,B''B,rc`td|jDr tdtst j d|jDcgc]}|j |j c}|jDcgc]}|j|jc}dScc}wcc}w)z9Convert cache to dictionary format for pytree operations.c3JK|]}t|ttf ywr ) isinstancerr).0layers r% z"_get_cache_dict..Bs! 
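# `register_dynamic_cache_export_support` above teaches torch.export how to
# flatten a `DynamicCache` into plain tensor lists (via pytree registration)
# and rebuild it afterwards. A stdlib sketch of that round-trip with a dict
# standing in for the cache object (the real code uses
# `torch.utils._pytree.register_pytree_node` with `_dict_flatten`):

```python
# Flatten to (leaves, spec) and rebuild, as a pytree registration would.
def flatten_cache(cache):
    keys = sorted(cache)                    # the spec records the key order
    return [cache[k] for k in keys], keys

def unflatten_cache(leaves, spec):
    return dict(zip(spec, leaves))

cache = {"key_cache": [[1.0], [2.0]], "value_cache": [[3.0], [4.0]]}
leaves, spec = flatten_cache(cache)
print(unflatten_cache(leaves, spec) == cache)  # → True
```

# Only the leaves (tensors) cross the export boundary; the spec lets the
# runtime reassemble the cache structure on the way out.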
fPUz%,0I!JK K fs!#zFThis pytree flattening function should only be applied to DynamicCachez[DynamicCache + torch.export is tested on torch 2.6.0+ and may not work on earlier versions.)r value_cache)anyr RuntimeErrorrryrrr)rr]s r%rHrH@s fY^YeYe ffcdd -uv/4llUUejj>TejjU27,,[%,,BZ [ U[sB&#B&B+B+contextctjjj||}t }|j dg}|j dg}t tt|t|D]?}|t|kr||nd}|t|kr||nd}|j|||A|S)Nrr_) r3rErF_dict_unflattenrrrrGrupdate) rrb dictionaryrkey_list value_listidxkeyvalues r%rVrVNs$$44VWEJ NE~~k2.H r2JSXJ89&"S]2hsm#&Z#8 3d S%%& Lr'rjrJ kv_length kv_offset mask_functionr local_sizeallow_is_causal_skipallow_torch_fixrlc  &|jd} t|||} |rt| | ||rytj||j } | |z } |j dd} t| ddd}t| ddd}| | td | | k}|| | |z kD}||z}n|| |z| |zk(}||z}|ddddddfj|ddd}| || ddddddfz}ts|r|tj|dd z}|S) a Create a 4D boolean mask of shape `(batch_size, 1, query_length, kv_length)` where a value of True indicates that the element should take part in the attention computation, and False that it should not. This is similar to `masking_utils.sdpa_mask` but does not use `vmap` which is incompatible with export. Args: batch_size (`int`): The batch size of the input sequence. cache_position (`torch.Tensor`): A tensor of shape (query_length,) indicating the current indices of the input sequence elements. kv_length (`int`): The size that the key and value states will have during the attention computation. kv_offset (`int`, optional): An optional offset to indicate at which first position the key and values states will refer to. mask_function (`Callable`): The mask factory function describing the mask pattern. attention_mask (`torch.Tensor`, optional): The 2D attention mask corresponding to padded tokens of shape (batch_size, number_of_seen_tokens+q_length) local_size (`int`, optional): The size of the local attention, if we do not use full attention. This is used only if `allow_is_causal_skip=True` to try to skip mask creation if possible. 
allow_is_causal_skip (`bool`, optional): Whether to allow to return `None` for the mask under conditions where we can use the `is_causal` argument in `torch.sdpa` instead. Default to `True`. allow_torch_fix (`bool`, optional): Whether to update the mask in case a query is not attending to any tokens, to solve a bug in torch's older versions. We need an arg to skip it when using eager. By default `True`. rN)rkrr)rrpattention_chunk_sizez;Cannot use both `sliding_window` and `attention_chunk_size`Tr) rrr r3rOrkviewrwrvexpandrall)rjrJrlrmrnrrorprqrYq_length padding_mask kv_arangereshaped_cache_positionrp chunk_size causal_masksliding_mask_overlaychunked_mask_overlays r%r}r}[spV##A&H' 9ML 8xQZ\f g Y~/D/DEI I,11"a8 VH-/?FN)+A4HJ!j&<VWW66K!(+B^+SS++  (J6:QU_:__++ dD!Q./66z2r2NK!LD$1A$BB  /?uyy+2tDD r'r?)NN)rNNNTT)/rytypingrrr3 cache_utilsrrrr r generation.configuration_utilsr masking_utilsr r rrmodeling_utilsrr pytorch_utilsrrrrnnModulerKr{rxrrrrr rr rBrrHrErFContextrVrhr}rXr'r%rs=% > FW W tILEHHOOILXtP588??tPnb588??bN1559%)! F F  -F %U\\2F TN F TN F R Cuxx C.ehhoo.bH!H!Z1559& &  -& %U\\2& R4 <  ekk.A.A.I.I "(,-1 $!% VVLLVV V H% V U\\* V VVVellVr'
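# `sdpa_mask_without_vmap` above builds the boolean attention mask with plain
# broadcasted comparisons instead of `torch.vmap`, which does not export. A
# stdlib sketch of the core causal and sliding-window conditions on a
# (q_len, kv_len) grid, with lists standing in for tensors:

```python
# True means "query at absolute position q may attend to key/value kv".
def causal_mask(cache_position, kv_length, sliding_window=None):
    rows = []
    for q in cache_position:                 # absolute position of each query
        row = []
        for kv in range(kv_length):
            allowed = kv <= q                # causal: never attend to the future
            if sliding_window is not None:
                # sliding window: only the last `sliding_window` positions
                allowed = allowed and kv > q - sliding_window
            row.append(allowed)
        rows.append(row)
    return rows

print(causal_mask([2, 3], kv_length=4, sliding_window=2))
# → [[False, True, True, False], [False, False, True, True]]
```

# The tensor version expands this to (batch, 1, q_len, kv_len) and also ANDs
# in the 2D padding mask when one is provided.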