from abc import ABC, abstractmethod
from collections.abc import Iterable
from typing import Any, Optional

import torch

from .configuration_utils import PretrainedConfig
from .utils import (
    is_hqq_available,
    is_quanto_greater,
    is_torch_greater_or_equal,
    is_torchdynamo_compiling,
    logging,
)


if is_hqq_available():
    from hqq.core.quantize import Quantizer as HQQQuantizer

_is_torch_greater_or_equal_than_2_7 = is_torch_greater_or_equal("2.7", accept_dev=True)

logger = logging.get_logger(__name__)


class CacheLayerMixin(ABC):
    """Base, abstract class for a single layer's cache."""

    is_compileable = False
    is_sliding = False

    def __init__(self):
        self.keys, self.values = None, None
        self.is_initialized = False

    def __repr__(self):
        return self.__class__.__name__

    @abstractmethod
    def lazy_initialization(self, key_states: torch.Tensor): ...

    @abstractmethod
    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        cache_kwargs: Optional[dict[str, Any]] = None,
    ) -> tuple[torch.Tensor, torch.Tensor]: ...

    @abstractmethod
    def get_mask_sizes(self, cache_position: torch.Tensor) -> tuple[int, int]: ...

    @abstractmethod
    def get_seq_length(self) -> int: ...

    @abstractmethod
    def get_max_cache_shape(self) -> int: ...

    def offload(self):
        """Offload this layer's data to CPU device."""
        if self.is_initialized:
            self.keys = self.keys.to("cpu", non_blocking=True)
            self.values = self.values.to("cpu", non_blocking=True)

    def prefetch(self):
        """In case of layer offloading, this allows to move the data back to the layer's device ahead of time."""
        if self.is_initialized and self.keys.device != self.device:
            self.keys = self.keys.to(self.device, non_blocking=True)
            self.values = self.values.to(self.device, non_blocking=True)

    def reset(self):
        """Resets the cache values while preserving the objects"""
        if self.is_initialized:
            self.keys.zero_()
            self.values.zero_()
        if hasattr(self, "cumulative_length"):
            self.cumulative_length = 0

    def reorder_cache(self, beam_idx: torch.LongTensor):
        """Reorders this layer's cache for beam search."""
        if self.get_seq_length() > 0:
            self.keys = self.keys.index_select(0, beam_idx.to(self.keys.device))
            self.values = self.values.index_select(0, beam_idx.to(self.values.device))


class DynamicLayer(CacheLayerMixin):
    """
    A cache layer that grows dynamically as more tokens are generated. This is the default for generative models.
    It stores the key and value states as tensors of shape `[batch_size, num_heads, seq_len, head_dim]`.
    """

    def lazy_initialization(self, key_states: torch.Tensor):
        self.dtype, self.device = key_states.dtype, key_states.device
        self.keys = torch.tensor([], dtype=self.dtype, device=self.device)
        self.values = torch.tensor([], dtype=self.dtype, device=self.device)
        self.is_initialized = True

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        cache_kwargs: Optional[dict[str, Any]] = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Update the key and value caches in-place, and return the necessary keys and value states.

        Args:
            key_states (`torch.Tensor`): The new key states to cache.
            value_states (`torch.Tensor`): The new value states to cache.
            cache_kwargs (`dict[str, Any]`, *optional*): Additional arguments for the cache.

        Returns:
            tuple[`torch.Tensor`, `torch.Tensor`]: The key and value states.
        """
        if not self.is_initialized:
            self.lazy_initialization(key_states)
        self.keys = torch.cat([self.keys, key_states], dim=-2)
        self.values = torch.cat([self.values, value_states], dim=-2)
        return self.keys, self.values

    def get_mask_sizes(self, cache_position: torch.Tensor) -> tuple[int, int]:
        """Return the length and offset of the cache, used to generate the mask"""
        kv_offset = 0
        query_length = cache_position.shape[0]
        kv_length = self.get_seq_length() + query_length
        return kv_length, kv_offset

    def get_seq_length(self) -> int:
        """Returns the sequence length of the cached states."""
        if not self.is_initialized or self.keys.numel() == 0:
            return 0
        return self.keys.shape[-2]

    def get_max_cache_shape(self) -> int:
        """Returns the maximum sequence length of the cache object. DynamicLayer does not have a maximum length."""
        return -1

    def crop(self, max_length: int):
        """
        Crop the past key values up to a new `max_length` in terms of tokens. `max_length` can also be
        negative to remove `max_length` tokens.
        """
        if max_length < 0:
            max_length = self.get_seq_length() - abs(max_length)
        if self.get_seq_length() <= max_length:
            return
        self.keys = self.keys[..., :max_length, :]
        self.values = self.values[..., :max_length, :]

    def batch_repeat_interleave(self, repeats: int):
        """Repeat the cache `repeats` times in the batch dimension."""
        if self.get_seq_length() > 0:
            self.keys = self.keys.repeat_interleave(repeats, dim=0)
            self.values = self.values.repeat_interleave(repeats, dim=0)

    def batch_select_indices(self, indices: torch.Tensor):
        """Only keep the `indices` in the batch dimension of the cache."""
        if self.get_seq_length() > 0:
            self.keys = self.keys[indices, ...]
            self.values = self.values[indices, ...]


class DynamicSlidingWindowLayer(DynamicLayer):
    """
    A cache layer that grows dynamically as more tokens are generated, but only keeps the last `sliding_window`
    states: once that size is reached it stops growing, discarding the oldest entries instead.
    """

    is_sliding = True

    def __init__(self, sliding_window: int):
        super().__init__()
        self.sliding_window = sliding_window
        self.cumulative_length = 0

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        cache_kwargs: Optional[dict[str, Any]] = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Update the cache, returning the full states seen during this forward, then keep only the window."""
        if not self.is_initialized:
            self.lazy_initialization(key_states)
        self.cumulative_length += key_states.shape[-2]
        full_key_states = torch.cat([self.keys, key_states], dim=-2)
        full_value_states = torch.cat([self.values, value_states], dim=-2)
        # Only keep the last `sliding_window - 1` cached states, as the next token will complete the window
        self.keys = full_key_states[:, :, -self.sliding_window + 1 :, :]
        self.values = full_value_states[:, :, -self.sliding_window + 1 :, :]
        return full_key_states, full_value_states

    def get_mask_sizes(self, cache_position: torch.Tensor) -> tuple[int, int]:
        """Return the length and offset of the cache, used to generate the attention mask"""
        query_length = cache_position.shape[0]
        is_full = self.cumulative_length >= self.sliding_window
        kv_offset = max(self.cumulative_length - self.sliding_window + 1, 0)
        kv_length = self.sliding_window - 1 + query_length if is_full else self.cumulative_length + query_length
        return kv_length, kv_offset

    def get_seq_length(self) -> int:
        """Returns the sequence length of the cached states."""
        return self.cumulative_length

    def get_max_cache_shape(self) -> int:
        """Return the maximum cache shape of the cache"""
        return self.sliding_window

    def crop(self, max_length: int):
        """
        Crop the past key values up to a new `max_length` in terms of tokens. `max_length` can also be
        negative to remove `max_length` tokens.
        """
        if self.get_seq_length() >= self.sliding_window:
            raise ValueError(
                "Cannot `crop` a `DynamicSlidingWindowLayer` after it has seen more tokens than its "
                "sliding window (otherwise some states are lost)"
            )
        super().crop(max_length)
        self.cumulative_length = self.keys.shape[-2]
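# Illustrative sketch (not part of the original module): a `DynamicLayer` concatenates
# new states along the sequence dimension on every `update`, while a
# `DynamicSlidingWindowLayer` returns the full states but keeps at most
# `sliding_window - 1` cached tokens. Shapes are `[batch, num_heads, seq_len, head_dim]`.
#
#     layer = DynamicLayer()
#     k = torch.randn(1, 4, 3, 8)
#     layer.update(k, k.clone())
#     assert layer.get_seq_length() == 3
#
#     sliding = DynamicSlidingWindowLayer(sliding_window=4)
#     for _ in range(3):
#         sliding.update(torch.randn(1, 4, 2, 8), torch.randn(1, 4, 2, 8))
#     assert sliding.get_seq_length() == 6   # total tokens seen so far
#     assert sliding.keys.shape[-2] == 3     # but only `sliding_window - 1` tokens kept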
class StaticLayer(CacheLayerMixin):
    """
    A static cache layer that stores the key and value states as static tensors of shape
    `[batch_size, num_heads, max_cache_len, head_dim]`. It lazily allocates its full backing tensors, and then
    mutates them in-place. Built for `torch.compile` support.

    Args:
        max_cache_len (`int`): Maximum number of tokens that can be stored, used for tensor preallocation.
    """

    is_compileable = True

    def __init__(self, max_cache_len: int):
        super().__init__()
        self.max_cache_len = max_cache_len

    def lazy_initialization(self, key_states: torch.Tensor):
        """
        Lazy initialization of the keys and values tensors. This allows to get all properties (dtype, device,
        num_heads in case of TP etc...) at runtime directly, which is extremely practical as it avoids moving
        devices, dtypes etc later on for each `update` (which could break the static dynamo addresses as well).

        If this is unwanted, one can call `early_initialization(...)` on the Cache directly, which will call this
        function ahead-of-time (this is required for `torch.export` for example).

        Note that for `compile`, as we internally don't compile the prefill, this is guaranteed to have been
        called already when compiling. If compiling the prefill as well, e.g. calling `model.compile(...)` before
        `generate` with a static cache, it is still supported in general, but without guarantees depending on the
        compilation options (e.g. cuda graphs, i.e. `mode="reduce-overhead"` is known to fail). But it will in
        general work correctly, and prefill should not be compiled anyway for performances!
        """
        self.max_batch_size, self.num_heads, _, self.head_dim = key_states.shape
        self.dtype, self.device = key_states.dtype, key_states.device
        self.keys = torch.zeros(
            (self.max_batch_size, self.num_heads, self.max_cache_len, self.head_dim),
            dtype=self.dtype,
            device=self.device,
        )
        self.values = torch.zeros(
            (self.max_batch_size, self.num_heads, self.max_cache_len, self.head_dim),
            dtype=self.dtype,
            device=self.device,
        )
        # Mark the static addresses, so that cuda graphs (mode="reduce-overhead") keep working correctly
        if not is_torchdynamo_compiling():
            torch._dynamo.mark_static_address(self.keys)
            torch._dynamo.mark_static_address(self.values)
        self.is_initialized = True

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        cache_kwargs: Optional[dict[str, Any]] = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Write the new states in-place at `cache_position`, and return the full key and value buffers."""
        if not self.is_initialized:
            self.lazy_initialization(key_states)
        cache_position = cache_kwargs.get("cache_position") if cache_kwargs is not None else None
        cache_position = (
            cache_position if cache_position is not None else torch.arange(key_states.shape[-2], device=self.device)
        )
        try:
            self.keys.index_copy_(2, cache_position, key_states)
            self.values.index_copy_(2, cache_position, value_states)
        except NotImplementedError:
            # Some devices (e.g. MPS) do not support `index_copy_`, fall back to advanced indexing
            self.keys[:, :, cache_position] = key_states
            self.values[:, :, cache_position] = value_states
        return self.keys, self.values

    def get_mask_sizes(self, cache_position: torch.Tensor) -> tuple[int, int]:
        """Return the length and offset of the cache, used to generate the mask"""
        kv_offset = 0
        kv_length = self.max_cache_len
        return kv_length, kv_offset

    def get_seq_length(self) -> int:
        """Returns the sequence length of the cached states."""
        if not self.is_initialized:
            return 0
        return (self.keys[0, 0].any(dim=-1)).sum()

    def get_max_cache_shape(self) -> int:
        """Return the maximum cache shape of the cache"""
        return self.max_cache_len


class StaticSlidingWindowLayer(StaticLayer):
    """
    A static cache layer acting as a rolling buffer over the last `sliding_window` tokens: it stores the key and
    value states as static tensors of shape `[batch_size, num_heads, min(max_cache_len, sliding_window), head_dim]`,
    and mutates them in-place. Built for `torch.compile` support.
    """

    is_sliding = True

    def __init__(self, max_cache_len: int, sliding_window: int):
        # The buffer can never be larger than the sliding window itself
        super().__init__(max_cache_len=min(max_cache_len, sliding_window))
        self.cumulative_length = 0

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        cache_kwargs: Optional[dict[str, Any]] = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Update the rolling buffer, and return the necessary key and value states."""
        if not self.is_initialized:
            self.lazy_initialization(key_states)
        cache_position = cache_kwargs.get("cache_position") if cache_kwargs is not None else None
        cache_position = (
            cache_position if cache_position is not None else torch.arange(key_states.shape[-2], device=self.device)
        )
        cumulative_length = self.cumulative_length
        is_full = cumulative_length >= self.max_cache_len
        self.cumulative_length += key_states.shape[-2]

        if is_full:
            # Decoding with a single new token: roll the buffer left by one and write at the last slot
            if key_states.shape[-2] == 1:
                new_keys = self.keys.roll(-1, dims=-2)
                new_values = self.values.roll(-1, dims=-2)
                new_keys[:, :, -1:] = key_states
                new_values[:, :, -1:] = value_states
                self.keys.copy_(new_keys)
                self.values.copy_(new_values)
                return self.keys, self.values
            # More than one new token on a full buffer: rebuild the full states explicitly
            full_key_states = torch.cat((self.keys[:, :, 1:, :], key_states), dim=-2)
            full_value_states = torch.cat((self.values[:, :, 1:, :], value_states), dim=-2)
        elif cumulative_length + key_states.shape[-2] > self.max_cache_len:
            # Prefill crossing the window boundary
            if cumulative_length == 0:
                full_key_states = key_states
                full_value_states = value_states
            else:
                full_key_states = torch.cat((self.keys[:, :, :cumulative_length, :], key_states), dim=-2)
                full_value_states = torch.cat((self.values[:, :, :cumulative_length, :], value_states), dim=-2)
        else:
            # Enough free room in the buffer: write in-place at `cache_position`
            try:
                self.keys.index_copy_(2, cache_position, key_states)
                self.values.index_copy_(2, cache_position, value_states)
            except NotImplementedError:
                self.keys[:, :, cache_position] = key_states
                self.values[:, :, cache_position] = value_states
            return self.keys, self.values

        # Keep only the last `max_cache_len` states in the buffer, but return the full states
        self.keys.copy_(full_key_states[:, :, -self.max_cache_len :, :])
        self.values.copy_(full_value_states[:, :, -self.max_cache_len :, :])
        return full_key_states, full_value_states

    def get_mask_sizes(self, cache_position: torch.Tensor) -> tuple[int, int]:
        """Return the length and offset of the cache, used to generate the attention mask"""
        query_length = cache_position.shape[0]
        sliding_window = self.max_cache_len
        is_full = self.cumulative_length >= sliding_window
        kv_offset = max(self.cumulative_length - sliding_window + 1, 0)
        if is_full:
            kv_length = sliding_window + query_length - 1
        elif self.cumulative_length + query_length > sliding_window:
            kv_length = self.cumulative_length + query_length
        else:
            kv_length = sliding_window
        return kv_length, kv_offset

    def get_seq_length(self) -> int:
        """Returns the sequence length of the cached states."""
        return self.cumulative_length
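# Illustrative sketch (not part of the original module): a `StaticLayer` pre-allocates
# `[batch, heads, max_cache_len, head_dim]` buffers on first use and writes new tokens
# in-place at `cache_position`, keeping tensor addresses stable for `torch.compile`
# and cuda graphs.
#
#     layer = StaticLayer(max_cache_len=16)
#     k = torch.randn(1, 4, 3, 8)
#     keys, values = layer.update(k, k.clone(), {"cache_position": torch.arange(3)})
#     assert keys.shape == (1, 4, 16, 8)   # the full buffer is returned, not just 3 tokens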
class QuantizedLayer(DynamicLayer):
    """
    A quantized layer similar to what is described in the
    [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache paper](https://huggingface.co/papers/2402.02750).
    It allows the model to generate longer sequence length without allocating too much memory for the key and
    value caches by applying quantization.

    The cache has two types of storage, one for original precision and one for the quantized cache. A `residual
    length` is set as a maximum capacity for the original precision cache. When the length goes beyond maximum
    capacity, the original precision cache is discarded and moved into the quantized cache. The quantization is
    done per-channel with a set `q_group_size` for both Keys and Values, in contrast to what was described in
    the paper.
    """

    def __init__(
        self,
        nbits: int = 4,
        axis_key: int = 0,
        axis_value: int = 0,
        q_group_size: int = 64,
        residual_length: int = 128,
    ):
        super().__init__()
        self.nbits = nbits
        self.axis_key = axis_key
        self.axis_value = axis_value
        self.q_group_size = q_group_size
        self.residual_length = residual_length
        self.cumulative_length = 0

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        cache_kwargs: Optional[dict[str, Any]] = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Update the cache, quantizing the residual full-precision states whenever it grows too large."""
        self.cumulative_length += key_states.shape[-2]
        if not self.is_initialized:
            self.lazy_initialization(key_states)
            self._quantized_keys = self._quantize(key_states.contiguous(), axis=self.axis_key)
            self._quantized_values = self._quantize(value_states.contiguous(), axis=self.axis_value)
            return key_states, value_states

        dequant_keys = self._dequantize(self._quantized_keys)
        dequant_values = self._dequantize(self._quantized_values)
        keys_to_return = torch.cat([dequant_keys, self.keys, key_states], dim=-2)
        values_to_return = torch.cat([dequant_values, self.values, value_states], dim=-2)
        if self.keys.dim() == 4 and self.keys.shape[-2] + 1 >= self.residual_length:
            # The residual cache is full: fold everything back into the quantized storage
            self._quantized_keys = self._quantize(keys_to_return.contiguous(), axis=self.axis_key)
            self._quantized_values = self._quantize(values_to_return.contiguous(), axis=self.axis_value)
            self.keys = torch.tensor([], dtype=self.dtype, device=self.device)
            self.values = torch.tensor([], dtype=self.dtype, device=self.device)
        else:
            self.keys = torch.cat([self.keys, key_states], dim=-2)
            self.values = torch.cat([self.values, value_states], dim=-2)
        return keys_to_return, values_to_return

    @abstractmethod
    def _quantize(self, tensor, axis): ...

    @abstractmethod
    def _dequantize(self, q_tensor): ...

    def get_seq_length(self) -> int:
        """Returns the sequence length of the cached states."""
        return self.cumulative_length


class QuantoQuantizedLayer(QuantizedLayer):
    def __init__(
        self,
        nbits: int = 4,
        axis_key: int = 0,
        axis_value: int = 0,
        q_group_size: int = 64,
        residual_length: int = 128,
    ):
        super().__init__(nbits, axis_key, axis_value, q_group_size, residual_length)

        if is_quanto_greater("0.2.5", accept_dev=True):
            from optimum.quanto import MaxOptimizer, qint2, qint4
        else:
            raise ImportError(
                "You need optimum-quanto package version to be greater or equal than 0.2.5 to use "
                "`QuantoQuantizedCache`."
            )

        if self.nbits not in [2, 4]:
            raise ValueError(f"`nbits` for `quanto` backend has to be one of [`2`, `4`] but got {self.nbits}")
        if self.axis_key not in [0, -1]:
            raise ValueError(f"`axis_key` for `quanto` backend has to be one of [`0`, `-1`] but got {self.axis_key}")
        if self.axis_value not in [0, -1]:
            raise ValueError(
                f"`axis_value` for `quanto` backend has to be one of [`0`, `-1`] but got {self.axis_value}"
            )

        self.qtype = qint4 if self.nbits == 4 else qint2
        self.optimizer = MaxOptimizer()

    def _quantize(self, tensor, axis):
        from optimum.quanto import quantize_weight

        scale, zeropoint = self.optimizer(tensor, self.qtype, axis, self.q_group_size)
        qtensor = quantize_weight(tensor, self.qtype, axis, scale, zeropoint, self.q_group_size)
        return qtensor

    def _dequantize(self, qtensor):
        return qtensor.dequantize()


class HQQQuantizedLayer(QuantizedLayer):
    def __init__(
        self,
        nbits: int = 4,
        axis_key: int = 0,
        axis_value: int = 0,
        q_group_size: int = 64,
        residual_length: int = 128,
    ):
        super().__init__(nbits, axis_key, axis_value, q_group_size, residual_length)

        if not is_hqq_available():
            raise ImportError("You need to install `hqq` to use `HQQQuantizedLayer`")

        if self.nbits not in [1, 2, 3, 4, 8]:
            raise ValueError(
                f"`nbits` for `HQQ` backend has to be one of [`1`, `2`, `3`, `4`, `8`] but got {self.nbits}"
            )
        if self.axis_key not in [0, 1]:
            raise ValueError(f"`axis_key` for `HQQ` backend has to be one of [`0`, `1`] but got {self.axis_key}")
        if self.axis_value not in [0, 1]:
            raise ValueError(f"`axis_value` for `HQQ` backend has to be one of [`0`, `1`] but got {self.axis_value}")

        self.quantizer = HQQQuantizer

    def _quantize(self, tensor, axis):
        qtensor, meta = self.quantizer.quantize(
            tensor,
            axis=axis,
            device=self.device,
            compute_dtype=self.dtype,
            nbits=self.nbits,
            group_size=self.q_group_size,
        )
        meta["compute_dtype"] = self.dtype
        # Move to device and cast to dtype
        self.quantizer.cuda(qtensor, meta=meta, device=self.device)
        meta["scale"] = meta["scale"].to(qtensor.device)
        meta["zero"] = meta["zero"].to(qtensor.device)
        return qtensor, meta

    def _dequantize(self, qtensor):
        quant_tensor, meta = qtensor
        tensor = self.quantizer.dequantize(quant_tensor, meta)
        return tensor
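# Illustrative sketch (not part of the original module; requires the `hqq` package at
# runtime): a quantized layer keeps recent tokens in full precision and folds them into
# the quantized storage once `residual_length` is reached, while `get_seq_length()`
# always reports the full logical length across both storages.
#
#     layer = HQQQuantizedLayer(nbits=4, residual_length=8)
#     for _ in range(5):
#         k = torch.randn(1, 4, 2, 8)
#         keys, values = layer.update(k, k.clone())
#     assert layer.get_seq_length() == 10   # logical length, mixed storage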
class Cache:
    """
    A `Cache` is mostly a list of `CacheLayerMixin` objects, one per model layer. It serves as a container for
    the Cache of each layer.

    Args:
        layers (`Optional[list[CacheLayerMixin]]`, *optional*):
            A list of pre-created `CacheLayerMixin`. If omitted (`None`), then `layer_class_to_replicate` will
            be used.
        layer_class_to_replicate (`type[CacheLayerMixin]`, *optional*):
            Only used if `layers` is omitted (`None`), in which case it will be used as the base class for each
            layer, and the layers will be added lazily as soon as `update` is called with a `layer_idx` greater
            than the current list of layers.
        offloading (`bool`, *optional*, defaults to `False`):
            Whether to perform offloading of the layers to `cpu`, to save GPU memory.
        offload_only_non_sliding (`bool`, *optional*, defaults to `True`):
            If `offloading` is `True`, this further decides if only the non-sliding layers will be offloaded
            (because usually the sliding layers are small in size, so there is no need to offload them, and
            skipping it is faster).
    """

    def __init__(
        self,
        layers: Optional[list[CacheLayerMixin]] = None,
        layer_class_to_replicate: Optional[type[CacheLayerMixin]] = None,
        offloading: bool = False,
        offload_only_non_sliding: bool = True,
    ):
        if layers is not None and layer_class_to_replicate is not None:
            raise ValueError(
                "You can construct a Cache either from a list `layers` of all the predefined `CacheLayer`, or from "
                "a `layer_class_to_replicate`, in which case the Cache will append a new layer corresponding to "
                "`layer_class_to_replicate` for each new call to `update` with an idx not already in the Cache."
            )
        if layers is None and layer_class_to_replicate is None:
            raise ValueError(
                "You should provide exactly one of `layers` or `layer_class_to_replicate` to initialize a Cache."
            )
        self.layers = layers if layers is not None else []
        self.layer_class_to_replicate = layer_class_to_replicate
        self.offloading = offloading
        if self.offloading:
            self.only_non_sliding = offload_only_non_sliding
            self.prefetch_stream = torch.Stream() if _is_torch_greater_or_equal_than_2_7 else torch.cuda.Stream()

    def __repr__(self):
        return f"{self.__class__.__name__}(layers={self.layers})"

    def prefetch(self, layer_idx: int, only_non_sliding: bool = True):
        """
        Prefetch a given layer on its device. If `only_non_sliding` is True, it will try to prefetch only the
        layers which are non-sliding. If the `layer_idx` is outside the range, this will circle back to the
        first layers. Note that we use a non-default stream for this, to avoid blocking.
        """
        if only_non_sliding:
            try:
                layer_idx = layer_idx + self.is_sliding[layer_idx:].index(False)
            except ValueError:
                layer_idx = self.is_sliding.index(False)
        else:
            layer_idx = layer_idx if layer_idx < len(self.layers) else 0
        with self.prefetch_stream if _is_torch_greater_or_equal_than_2_7 else torch.cuda.stream(self.prefetch_stream):
            self.layers[layer_idx].prefetch()

    def offload(self, layer_idx: int, only_non_sliding: bool = True):
        """
        Offload a given `layer_idx`. If `only_non_sliding` is True, it will offload `layer_idx` only if it is a
        non-sliding layer. Note that we do it on the default stream, so that we ensure all earlier computation
        in the layer's `update` methods are finished.
        """
        if not (only_non_sliding and self.is_sliding[layer_idx]):
            self.layers[layer_idx].offload()

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        layer_idx: int,
        cache_kwargs: Optional[dict[str, Any]] = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`.

        Parameters:
            key_states (`torch.Tensor`):
                The new key states to cache.
            value_states (`torch.Tensor`):
                The new value states to cache.
            layer_idx (`int`):
                The index of the layer to cache the states for.
            cache_kwargs (`dict[str, Any]`, *optional*):
                Additional arguments for the cache subclass. These are specific to each subclass and allow new
                types of cache to be created.

        Return:
            A tuple containing the updated key and value states.
        """
        # Lazily grow the list of layers if needed
        if self.layer_class_to_replicate is not None:
            while len(self.layers) <= layer_idx:
                self.layers.append(self.layer_class_to_replicate())
        if self.offloading:
            # Wait for the prefetch stream before touching the layer, then prefetch the next one
            torch.cuda.default_stream(key_states.device).wait_stream(self.prefetch_stream)
            self.prefetch(layer_idx + 1, self.only_non_sliding)
        keys, values = self.layers[layer_idx].update(key_states, value_states, cache_kwargs)
        if self.offloading:
            self.offload(layer_idx, self.only_non_sliding)
        return keys, values

    def early_initialization(
        self, batch_size: int, num_heads: int, head_dim: int, dtype: torch.dtype, device: torch.device
    ):
        """
        Initialize all the layers in advance (it's otherwise lazily initialized on the first `update` call).
        This is useful for our `export` recipes, as `export` needs everything in advance.
        """
        fake_keys_tensor = torch.zeros((batch_size, num_heads, 0, head_dim), dtype=dtype, device=device)
        for layer in self.layers:
            layer.lazy_initialization(fake_keys_tensor)

    def get_seq_length(self, layer_idx: int = 0) -> int:
        """Returns the sequence length of the cache for the given layer."""
        if layer_idx >= len(self.layers):
            return 0
        return self.layers[layer_idx].get_seq_length()

    def get_mask_sizes(self, cache_position: torch.Tensor, layer_idx: int) -> tuple[int, int]:
        """
        Return a tuple (kv_length, kv_offset) corresponding to the length and offset that will be returned for
        the given layer at `layer_idx`.
        The masks are then prepared according to the given lengths (kv_length, kv_offset) and patterns for each
        layer.
        """
        if layer_idx >= len(self.layers):
            return cache_position.shape[0], 0
        return self.layers[layer_idx].get_mask_sizes(cache_position)

    def get_max_cache_shape(self, layer_idx: int = 0) -> int:
        """Returns maximum sequence length of the cache object. Dynamic caches do not have a maximum length."""
        if layer_idx >= len(self.layers):
            return -1
        return self.layers[layer_idx].get_max_cache_shape()

    def reset(self):
        """Recursively reset all layers tensors"""
        for layer_idx in range(len(self.layers)):
            self.layers[layer_idx].reset()

    def reorder_cache(self, beam_idx: torch.LongTensor):
        """Reorder the cache for beam search"""
        for layer_idx in range(len(self.layers)):
            self.layers[layer_idx].reorder_cache(beam_idx)

    def crop(self, max_length: int):
        """Crop the cache to the given length"""
        for layer_idx in range(len(self.layers)):
            self.layers[layer_idx].crop(max_length)

    def batch_repeat_interleave(self, repeats: int):
        """Repeat and interleave the cache"""
        for layer_idx in range(len(self.layers)):
            self.layers[layer_idx].batch_repeat_interleave(repeats)

    def batch_select_indices(self, indices: torch.Tensor):
        """Select indices from the cache"""
        for layer_idx in range(len(self.layers)):
            self.layers[layer_idx].batch_select_indices(indices)

    @property
    def max_batch_size(self) -> int:
        """Return the maximum batch size of the cache"""
        values = [layer.max_batch_size for layer in self.layers]
        if len(set(values)) > 1:
            raise ValueError(f"Max batch size is not consistent across layers: {values}")
        return values[0]

    @property
    def max_cache_len(self) -> int:
        """Return the maximum cache length of the cache"""
        values = [layer.max_cache_len for layer in self.layers]
        return max(values)

    @property
    def is_compileable(self) -> bool:
        """Return whether the cache is compileable"""
        if len(self.layers) == 0:
            return False
        return all(layer.is_compileable for layer in self.layers)

    @property
    def is_initialized(self) -> bool:
        """Return whether the cache data is initialized"""
        return len(self.layers) > 0 and all(layer.is_initialized for layer in self.layers)

    @property
    def is_sliding(self) -> list[bool]:
        """Return whether the layers of the cache are sliding window"""
        return [getattr(layer, "is_sliding", False) for layer in self.layers]

    def __getitem__(self, layer_idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Support for backwards-compatible `past_key_values` indexing, e.g. `past_key_values[0][0].shape[2]` to
        get the sequence length.
        """
        if layer_idx < len(self.layers):
            return self.layers[layer_idx].keys, self.layers[layer_idx].values
        raise KeyError(f"Cache only has {len(self.layers)} layers, attempted to access layer with index {layer_idx}")

    def __iter__(self):
        """
        Support for backwards-compatible `past_key_values` iteration, e.g. `for x in past_key_values:` to
        iterate over keys and values
        """
        for layer_idx in range(len(self)):
            yield (self.layers[layer_idx].keys, self.layers[layer_idx].values)

    def __len__(self) -> int:
        """
        This value corresponds to the number of layers in the model.
        """
        return len(self.layers)
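# Illustrative sketch (not part of the original module): a `Cache` can be built either
# from a fixed list of layers or from a layer class that is replicated lazily on
# `update`, one new layer per previously unseen `layer_idx`.
#
#     cache = Cache(layer_class_to_replicate=DynamicLayer)
#     k = torch.randn(1, 4, 2, 8)
#     cache.update(k, k.clone(), layer_idx=0)
#     cache.update(k, k.clone(), layer_idx=1)   # a second layer is appended on demand
#     assert len(cache) == 2 and cache.get_seq_length(0) == 2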
class DynamicCache(Cache):
    """
    A cache that grows dynamically as more tokens are generated. This is the default for generative models. If a
    `config` is passed, it will check it for potential sliding or hybrid layer structure, and initialize each
    layer accordingly. Each layer stores its key and value states as tensors of shape
    `[batch_size, num_heads, seq_len, head_dim]`.

    Example:

    ```python
    >>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache

    >>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
    >>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

    >>> inputs = tokenizer(text="My name is Qwen2", return_tensors="pt")

    >>> # Prepare a cache class and pass it to model's forward
    >>> past_key_values = DynamicCache(config=model.config)
    >>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
    >>> outputs.past_key_values # access cache filled with key/values from generation
    ```
    """

    def __init__(
        self,
        ddp_cache_data: Optional[Iterable[tuple[torch.Tensor, torch.Tensor]]] = None,
        config: Optional[PretrainedConfig] = None,
        offloading: bool = False,
        offload_only_non_sliding: bool = False,
    ):
        layers = []
        # If a config is passed, use it to infer the layer types and initialize accordingly
        if config is not None:
            decoder_config = config.get_text_config(decoder=True)
            sliding_window = getattr(decoder_config, "sliding_window", None) or getattr(
                decoder_config, "attention_chunk_size", None
            )
            layer_types = getattr(decoder_config, "layer_types", None)
            if layer_types is None:
                layer_types = [
                    "sliding_attention" if sliding_window is not None else "full_attention"
                    for _ in range(decoder_config.num_hidden_layers)
                ]
            # Some models have shared layers thus no cache is needed for them
            if hasattr(decoder_config, "num_kv_shared_layers"):
                layer_types = layer_types[: -decoder_config.num_kv_shared_layers]

            for layer_type in layer_types:
                if layer_type in ("sliding_attention", "chunked_attention"):
                    layers.append(DynamicSlidingWindowLayer(sliding_window))
                else:
                    layers.append(DynamicLayer())
        # Init all the layers with the data if some was passed (e.g. by torch ddp)
        if ddp_cache_data is not None:
            for layer_idx, (key_states, value_states) in enumerate(ddp_cache_data):
                if config is None:
                    layers.append(DynamicLayer())
                layers[layer_idx].update(key_states, value_states)

        if len(layers) == 0:
            super().__init__(
                layer_class_to_replicate=DynamicLayer,
                offloading=offloading,
                offload_only_non_sliding=offload_only_non_sliding,
            )
        else:
            super().__init__(layers=layers, offloading=offloading, offload_only_non_sliding=offload_only_non_sliding)

    def to_legacy_cache(self) -> tuple[tuple[torch.Tensor, torch.Tensor], ...]:
        """
        Converts the `Cache` instance into the its equivalent in the legacy cache format. Used for
        backward compatibility.
        """
        legacy_cache = ()
        for layer in self.layers:
            legacy_cache += ((layer.keys, layer.values),)
        return legacy_cache

    @classmethod
    def from_legacy_cache(cls, past_key_values: tuple[tuple[torch.Tensor, torch.Tensor], ...]) -> "DynamicCache":
        """
        Converts a cache in the legacy cache format into an equivalent `Cache`. Used for
        backward compatibility.
        """
        cache = cls()
        if past_key_values is None:
            logger.warning_once("past_key_values should not be None in from_legacy_cache()")
        if past_key_values is not None:
            for layer_idx in range(len(past_key_values)):
                key_states, value_states = past_key_values[layer_idx]
                cache.update(key_states, value_states, layer_idx)
        return cache


class StaticCache(Cache):
    """
    Static Cache class to be used with `torch.compile(model)` and `torch.export()`. It will check the `config`
    for potential hybrid cache structure, and initialize each layer accordingly.

    See `Cache` for details on common methods that are implemented by all cache classes.

    Args:
        config (`PretrainedConfig`):
            The config of the model for which this Cache will be used. It will be used to check for sliding or
            hybrid layer structure, and initialize each layer accordingly.
        max_cache_len (`int`):
            The maximum number of tokens that this Cache should hold.
        offloading (`bool`, *optional*, defaults to `False`):
            Whether to perform offloading of the layers to `cpu`, to save GPU memory.
        offload_only_non_sliding (`bool`, *optional*, defaults to `True`):
            If `offloading` is `True`, this further decides if only the non-sliding layers will be offloaded
            (because usually the sliding layers are small in size, so there is no need to offload them, and
            skipping it is faster).

    Example:

    ```python
    >>> from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache

    >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

    >>> inputs = tokenizer(text="My name is Llama", return_tensors="pt")

    >>> # Prepare a cache class and pass it to model's forward
    >>> # Leave empty space for 10 new tokens, which can be used when calling forward iteratively 10 times to generate
    >>> max_generated_length = inputs.input_ids.shape[1] + 10
    >>> past_key_values = StaticCache(config=model.config, max_cache_len=max_generated_length)
    >>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
    >>> outputs.past_key_values # access cache filled with key/values from generation
    StaticCache()
    ```
    """

    def __init__(
        self,
        config: PretrainedConfig,
        max_cache_len: int,
        offloading: bool = False,
        offload_only_non_sliding: bool = True,
        **kwargs,
    ):
        config = config.get_text_config(decoder=True)
        layer_types = getattr(config, "layer_types", None)
        if layer_types is None:
            if getattr(config, "sliding_window", None) is not None:
                layer_types = ["sliding_attention" for _ in range(config.num_hidden_layers)]
            elif getattr(config, "attention_chunk_size", None) is not None:
                layer_types = ["chunked_attention" for _ in range(config.num_hidden_layers)]
            else:
                layer_types = ["full_attention" for _ in range(config.num_hidden_layers)]
        # Some models have shared layers thus no cache is needed for them
        if hasattr(config, "num_kv_shared_layers"):
            layer_types = layer_types[: -config.num_kv_shared_layers]

        layers = []
        for layer_type in layer_types:
            if layer_type == "sliding_attention":
                layer = StaticSlidingWindowLayer(max_cache_len=max_cache_len, sliding_window=config.sliding_window)
            elif layer_type == "chunked_attention":
                layer = StaticSlidingWindowLayer(
                    max_cache_len=max_cache_len, sliding_window=config.attention_chunk_size
                )
            else:
                layer = StaticLayer(max_cache_len=max_cache_len)
            layers.append(layer)

        super().__init__(layers=layers, offloading=offloading, offload_only_non_sliding=offload_only_non_sliding)
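# Illustrative sketch (not part of the original module): round-tripping through the
# legacy tuple-of-tuples format that `DynamicCache` keeps supporting for backward
# compatibility.
#
#     legacy = ((torch.randn(1, 4, 2, 8), torch.randn(1, 4, 2, 8)),)
#     cache = DynamicCache.from_legacy_cache(legacy)
#     assert cache.to_legacy_cache()[0][0].shape == legacy[0][0].shape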
class QuantizedCache(Cache):
    """
    A quantizer cache similar to what is described in the
    [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache paper](https://huggingface.co/papers/2402.02750).
    It allows the model to generate longer sequence length without allocating too much memory for keys and
    values by applying quantization.

    The cache has two types of storage, one for original precision and one for the quantized cache. A `residual
    length` is set as a maximum capacity for the original precision cache. When the length goes beyond maximum
    capacity, the original precision cache is discarded and moved into the quantized cache. The quantization is
    done per-channel with a set `q_group_size` for both keys and values, in contrast to what was described in
    the paper.

    See `Cache` for details on common methods that are implemented by all cache classes.

    Args:
        backend (`str`): The quantization backend to use. One of `("quanto", "hqq")`.
        config (`PretrainedConfig`): The config of the model for which this Cache will be used.
        nbits (`int`, *optional*, defaults to 4): The number of bits for quantization.
        axis_key (`int`, *optional*, defaults to 0): The axis on which to quantize the keys.
        axis_value (`int`, *optional*, defaults to 0): The axis on which to quantize the values.
        q_group_size (`int`, *optional*, defaults to 64):
            Quantization is done per-channel according to a set `q_group_size` for both keys and values.
        residual_length (`int`, *optional*, defaults to 128):
            Maximum capacity for the original precision cache.
    """

    def __init__(
        self,
        backend: str,
        config: PretrainedConfig,
        nbits: int = 4,
        axis_key: int = 0,
        axis_value: int = 0,
        q_group_size: int = 64,
        residual_length: int = 128,
    ):
        if backend == "quanto":
            layer_class = QuantoQuantizedLayer
        elif backend == "hqq":
            layer_class = HQQQuantizedLayer
        else:
            raise ValueError(f"Unknown quantization backend `{backend}`")

        config = config.get_text_config(decoder=True)
        layers = [
            layer_class(nbits, axis_key, axis_value, q_group_size, residual_length)
            for _ in range(config.num_hidden_layers)
        ]
        super().__init__(layers=layers)
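# Illustrative sketch (not part of the original module; assumes `model` and `inputs`
# already exist and that `optimum-quanto` is installed):
#
#     past_key_values = QuantizedCache(backend="quanto", config=model.config, nbits=4)
#     outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)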
class EncoderDecoderCache(Cache):
    """
    Base, abstract class for all encoder-decoder caches. Can be used to hold combinations of self-attention and
    cross-attention caches.

    See `Cache` for details on common methods that are implemented by all cache classes.

    Args:
        caches (`Iterable`):
            Usually an iterable of length 2, containing 2 `Cache` objects, the first one for self-attention,
            the second one for cross-attention. Can optionally also be an iterable of length 1, containing a
            `tuple[tuple[torch.Tensor]]` (usually used for compatibility with torch dp and ddp).

    Example:

    ```python
    >>> from transformers import AutoProcessor, AutoModelForCausalLM, DynamicCache, EncoderDecoderCache

    >>> model = AutoModelForCausalLM.from_pretrained("openai/whisper-small")
    >>> processor = AutoProcessor.from_pretrained("openai/whisper-small")

    >>> inputs = processor(audio=YOUR-AUDIO, return_tensors="pt")

    >>> # Prepare cache classes for encoder and decoder and pass it to model's forward
    >>> self_attention_cache = DynamicCache(config=self.config)
    >>> cross_attention_cache = DynamicCache(config=self.config)
    >>> past_key_values = EncoderDecoderCache(self_attention_cache, cross_attention_cache)
    >>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
    >>> outputs.past_key_values # access cache filled with key/values from generation
    EncoderDecoderCache()
    ```
    """

    def __init__(self, *caches):
        # For dp and ddp support, if only 1 argument is passed, it should be an iterable of tuples of tensors
        if len(caches) == 1:
            self.self_attention_cache = DynamicCache()
            self.cross_attention_cache = DynamicCache()
            # Populate the caches with the given data
            for layer_idx, key_value_states in enumerate(caches[0]):
                key_states, value_states = key_value_states[:2]
                self.self_attention_cache.update(key_states, value_states, layer_idx)
                if len(key_value_states) > 2:
                    key_states, value_states = key_value_states[2:]
                    self.cross_attention_cache.update(key_states, value_states, layer_idx)
        # Otherwise, we should get two Cache objects
        elif len(caches) == 2:
            if not isinstance(caches[0], Cache) or not isinstance(caches[1], Cache):
                raise TypeError(
                    f"One of the two arguments is not a Cache: {type(caches[0]) = }, {type(caches[1]) = }"
                )
            self.self_attention_cache = caches[0]
            self.cross_attention_cache = caches[1]
        else:
            raise ValueError(f"Expected 1 or 2 arguments, got {len(caches)}")

        self.is_updated = {}
        for layer_idx in range(len(self.cross_attention_cache)):
            self.is_updated[layer_idx] = bool(self.cross_attention_cache.get_seq_length(layer_idx) > 0)

    def __repr__(self) -> str:
        return (
            f"{self.__class__.__name__}(self_attention_cache={self.self_attention_cache}, "
            f"cross_attention_cache={self.cross_attention_cache})"
        )

    def __iter__(self):
        """
        Support for backwards-compatible `past_key_values` iteration, e.g. `for x in past_key_values:` to
        iterate over keys and values
        """
        for layer_idx in range(len(self)):
            yield (
                self.self_attention_cache.layers[layer_idx].keys,
                self.self_attention_cache.layers[layer_idx].values,
                self.cross_attention_cache.layers[layer_idx].keys,
                self.cross_attention_cache.layers[layer_idx].values,
            )

    def __getitem__(self, layer_idx: int):
        """
        Support for backwards-compatible `past_key_values` indexing, e.g. `past_key_values[0][0].shape[2]` to
        get the sequence length.
        """
        if layer_idx < len(self):
            return (
                self.self_attention_cache.layers[layer_idx].keys,
                self.self_attention_cache.layers[layer_idx].values,
                self.cross_attention_cache.layers[layer_idx].keys,
                self.cross_attention_cache.layers[layer_idx].values,
            )
        raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")

    def __len__(self) -> int:
        """
        Support for backwards-compatible `past_key_values` length, e.g. `len(past_key_values)`. This value
        corresponds to the number of layers in the model.
        """
        return len(self.self_attention_cache)

    def to_legacy_cache(self) -> tuple[tuple[torch.Tensor], ...]:
        """Converts the `EncoderDecoderCache` instance into its equivalent in the legacy cache format."""
        legacy_cache = ()
        if len(self.cross_attention_cache) > 0:
            for self_attn, cross_attn in zip(
                self.self_attention_cache.to_legacy_cache(), self.cross_attention_cache.to_legacy_cache()
            ):
                legacy_cache += (self_attn + cross_attn,)
        else:
            legacy_cache = self.self_attention_cache.to_legacy_cache()
        return legacy_cache

    @classmethod
    def from_legacy_cache(
        cls, past_key_values: Optional[tuple[tuple[torch.FloatTensor, ...], ...]]
    ) -> "EncoderDecoderCache":
        """Converts a cache in the legacy cache format into an equivalent `EncoderDecoderCache`."""
        cache = cls(DynamicCache(), DynamicCache())
        if past_key_values is None:
            logger.warning_once("past_key_values should not be None in from_legacy_cache()")
        if past_key_values is not None:
            for layer_idx, key_value_states in enumerate(past_key_values):
                key_states, value_states = key_value_states[:2]
                cache.self_attention_cache.update(key_states, value_states, layer_idx)
                if len(key_value_states) > 2:
                    key_states, value_states = key_value_states[2:]
                    cache.cross_attention_cache.update(key_states, value_states, layer_idx)
                    cache.is_updated[layer_idx] = True
        return cache

    def get_seq_length(self, layer_idx: int = 0) -> int:
        """Returns the sequence length of the cached states. A layer index can be optionally passed."""
        return self.self_attention_cache.get_seq_length(layer_idx)

    def reset(self):
        """Recursively reset both caches, and mark all cross-attention layers as not updated"""
        self.self_attention_cache.reset()
        self.cross_attention_cache.reset()
        for layer_idx in self.is_updated:
            self.is_updated[layer_idx] = False

    def reorder_cache(self, beam_idx: torch.LongTensor):
        """Reorders both caches for beam search, given the selected beam indices."""
        self.self_attention_cache.reorder_cache(beam_idx)
        self.cross_attention_cache.reorder_cache(beam_idx)

    def check_dynamic_cache(self, method: str):
        if not (
            isinstance(self.self_attention_cache, DynamicCache)
            and isinstance(self.cross_attention_cache, DynamicCache)
        ):
            raise ValueError(
                f"`{method}` is only defined for dynamic cache, got {self.self_attention_cache.__str__()} for the "
                f"self attention cache and {self.cross_attention_cache.__str__()} for the cross attention cache."
            )

    def crop(self, maximum_length: int):
        """
        Crop the past key values up to a new `maximum_length` in terms of tokens. `maximum_length` can also be
        negative to remove `maximum_length` tokens. This is used in assisted decoding and contrastive search
        (on the Hub).
        """
        self.check_dynamic_cache(self.crop.__name__)
        self.self_attention_cache.crop(maximum_length)

    def batch_split(self, full_batch_size: int, split_size: int) -> "list[EncoderDecoderCache]":
        """
        Split the current instance into a list of `DynamicCache` by the batch size. This will be used by
        `_split_model_inputs()` in `generation.utils`
        """
        self.check_dynamic_cache(self.batch_split.__name__)
        self_attention_cache = self.self_attention_cache.batch_split(full_batch_size, split_size)
        cross_attention_cache = self.cross_attention_cache.batch_split(full_batch_size, split_size)

        out = []
        for self_attn, cross_attn in zip(self_attention_cache, cross_attention_cache):
            out.append(EncoderDecoderCache(self_attn, cross_attn))
        return out

    def batch_repeat_interleave(self, repeats: int):
        """Repeat the cache `repeats` times in the batch dimension. Used in contrastive search (on the Hub)."""
        self.check_dynamic_cache(self.batch_repeat_interleave.__name__)
        self.self_attention_cache.batch_repeat_interleave(repeats)
        self.cross_attention_cache.batch_repeat_interleave(repeats)

    def batch_select_indices(self, indices: torch.Tensor):
        """Only keep the `indices` in the batch dimension of the cache. Used in contrastive search (on the Hub)."""
        self.check_dynamic_cache(self.batch_select_indices.__name__)
        self.self_attention_cache.batch_select_indices(indices)
        self.cross_attention_cache.batch_select_indices(indices)

    def get_max_cache_shape(self) -> int:
        """Returns the maximum sequence length (i.e. max capacity) of the cache object"""
        return self.self_attention_cache.get_max_cache_shape()

    def get_mask_sizes(self, cache_position: torch.Tensor, layer_idx: int) -> tuple[int, int]:
        return self.self_attention_cache.get_mask_sizes(cache_position, layer_idx)

    @property
    def is_sliding(self) -> list[bool]:
        return self.self_attention_cache.is_sliding

    @property
    def is_compileable(self) -> bool:
        return self.self_attention_cache.is_compileable


class SlidingWindowLayer(StaticSlidingWindowLayer):
    def __init__(self, max_cache_len: int, sliding_window: int):
        logger.warning_once(
            "`SlidingWindowLayer` is deprecated and will be removed in version v4.59 "
            "Use `StaticSlidingWindowLayer` instead, which is a better name for it."
        )
        super().__init__(max_cache_len, sliding_window)


class ChunkedSlidingLayer(StaticSlidingWindowLayer):
    def __init__(self, max_cache_len: int, sliding_window: int):
        logger.warning_once(
            "`ChunkedSlidingLayer` is deprecated and will be removed in version v4.59 "
            "Use `StaticSlidingWindowLayer` instead, which has the exact same functionalities."
        )
        super().__init__(max_cache_len, sliding_window)


class OffloadedCache(DynamicCache):
    def __init__(self):
        logger.warning_once(
            "`OffloadedCache` is deprecated and will be removed in version v4.59 "
            "Use `DynamicCache(offloading=True)` instead"
        )
        super().__init__(offloading=True)


class OffloadedStaticCache(StaticCache):
    def __init__(self, config: PretrainedConfig, max_cache_len: int, *args, **kwargs):
        logger.warning_once(
            "`OffloadedStaticCache` is deprecated and will be removed in version v4.59 "
            "Use `StaticCache(..., offloading=True)` instead"
        )
        super().__init__(config=config, max_cache_len=max_cache_len, offloading=True)


class SlidingWindowCache(StaticCache):
    def __init__(self, config: PretrainedConfig, max_cache_len: int, *args, **kwargs):
        logger.warning_once(
            "`SlidingWindowCache` is deprecated and will be removed in version v4.59 "
            "Use `StaticCache(...)` instead which will correctly infer the type of each layer."
        )
        super().__init__(config=config, max_cache_len=max_cache_len)


class HybridCache(StaticCache):
    def __init__(self, config: PretrainedConfig, max_cache_len: int, *args, **kwargs):
        logger.warning_once(
            "`HybridCache` is deprecated and will be removed in version v4.59 "
            "Use `StaticCache(...)` instead which will correctly infer the type of each layer."
        )
        super().__init__(config=config, max_cache_len=max_cache_len)


class HybridChunkedCache(StaticCache):
    def __init__(self, config: PretrainedConfig, max_cache_len: int, *args, **kwargs):
        logger.warning_once(
            "`HybridChunkedCache` is deprecated and will be removed in version v4.59 "
            "Use `StaticCache(...)` instead which will correctly infer the type of each layer."
        )
        super().__init__(config=config, max_cache_len=max_cache_len)


class OffloadedHybridCache(StaticCache):
    def __init__(self, config: PretrainedConfig, max_cache_len: int, *args, **kwargs):
        logger.warning_once(
            "`OffloadedHybridCache` is deprecated and will be removed in version v4.59 "
            "Use `StaticCache(..., offload=True)` instead which will correctly infer the type of each layer."
        )
        super().__init__(config=config, max_cache_len=max_cache_len, offloading=True)


class QuantoQuantizedCache(QuantizedCache):
    def __init__(
        self,
        config: PretrainedConfig,
        nbits: int = 4,
        axis_key: int = 0,
        axis_value: int = 0,
        q_group_size: int = 64,
        residual_length: int = 128,
    ):
        logger.warning_once(
            "`QuantoQuantizedCache` is deprecated and will be removed in version v4.59 "
            "Use `QuantizedCache(backend='quanto', ...)` instead."
        )
        super().__init__("quanto", config, nbits, axis_key, axis_value, q_group_size, residual_length)


class HQQQuantizedCache(QuantizedCache):
    def __init__(
        self,
        config: PretrainedConfig,
        nbits: int = 4,
        axis_key: int = 0,
        axis_value: int = 0,
        q_group_size: int = 64,
        residual_length: int = 128,
    ):
        logger.warning_once(
            "`HQQQuantizedCache` is deprecated and will be removed in version v4.59 "
            "Use `QuantizedCache(backend='hqq', ...)` instead."
        )
        super().__init__("hqq", config, nbits, axis_key, axis_value, q_group_size, residual_length)


class SinkCache(Cache):
    """
    It is now a `custom_generate` repository on the Hub: https://huggingface.co/transformers-community/sink_cache.
    See [these docs](https://huggingface.co/docs/transformers/generation_strategies#custom-decoding-methods) for
    general `custom_generate` usage.
    """

    def __init__(self, **kwargs):
        raise NotImplementedError(
            "`SinkCache` has been moved as a `custom_generate` repository on the Hub: "
            "https://huggingface.co/transformers-community/sink_cache. See the repository for usage examples."
        )
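# Illustrative end-to-end sketch (not part of the original module): combining two
# dynamic caches for an encoder-decoder model.
#
#     past_key_values = EncoderDecoderCache(DynamicCache(), DynamicCache())
#     k = torch.randn(1, 4, 2, 8)
#     past_key_values.self_attention_cache.update(k, k.clone(), layer_idx=0)
#     assert past_key_values.get_seq_length(0) == 2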