import math
from functools import wraps
from typing import Optional

from .configuration_utils import PretrainedConfig
from .utils import is_torch_available, logging


logger = logging.get_logger(__name__)


if is_torch_available():
    import torch


def dynamic_rope_update(rope_forward):
    """
    Decorator function to update the RoPE parameters in the forward pass, if the model is using a dynamic RoPE
    (i.e. a RoPE implementation that may recompute its frequencies in the forward pass).

    Args:
        rope_forward (Callable):
            The forward pass of the RoPE implementation.

    Returns:
        The decorated forward pass.
    """

    def longrope_frequency_update(self, position_ids, device):
        """Longrope uses long factor if sequence is larger than original pretraining length, short otherwise."""
        seq_len = torch.max(position_ids) + 1
        if hasattr(self.config, "original_max_position_embeddings"):
            original_max_position_embeddings = self.config.original_max_position_embeddings
        else:
            original_max_position_embeddings = self.config.max_position_embeddings
        if seq_len > original_max_position_embeddings:
            if not hasattr(self, "long_inv_freq"):
                self.long_inv_freq, _ = self.rope_init_fn(
                    self.config, device, seq_len=original_max_position_embeddings + 1
                )
            self.register_buffer("inv_freq", self.long_inv_freq, persistent=False)
        else:
            # This .to() is needed if the model has been moved to a device after initialization (the buffer is
            # automatically moved, but not the original copy)
            self.original_inv_freq = self.original_inv_freq.to(device)
            self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)

    def dynamic_frequency_update(self, position_ids, device):
        """
        dynamic RoPE layers should recompute `inv_freq` in the following situations:
        1 - growing beyond the cached sequence length (allow scaling)
        2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
        """
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_seq_len_cached:  # growth
            inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
            self.register_buffer("inv_freq", inv_freq, persistent=False)
            self.max_seq_len_cached = seq_len

        if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:  # reset
            self.original_inv_freq = self.original_inv_freq.to(device)
            self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
            self.max_seq_len_cached = self.original_max_seq_len

    @wraps(rope_forward)
    def wrapper(self, x, position_ids):
        if "dynamic" in self.rope_type:
            dynamic_frequency_update(self, position_ids, device=x.device)
        elif self.rope_type == "longrope":
            longrope_frequency_update(self, position_ids, device=x.device)
        return rope_forward(self, x, position_ids)

    return wrapper


def _compute_default_rope_parameters(
    config: PretrainedConfig,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
) -> tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies according to the original RoPE implementation

    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration. This function assumes that the config will provide at least the following
            properties:

            * rope_theta (`float`): The base wavelength from which the inverse frequencies will be derived.
            * hidden_size (`int`): The numerator when deriving a head_dim, if not provided directly.
            * num_attention_heads (`int`): The denominator when deriving a head_dim, if not provided directly.

            Additionally, this function will make use of the following properties if they are found in the config:

            * head_dim (`int`, *optional*): The size of the key-value heads in the model. If None, this value will be
                derived as hidden_size // num_attention_heads.
            * partial_rotary_factor (`float`, *optional*): If less than 1.0, inverse frequencies will be returned for
                the first fraction of the head_dim. Defaults to 1.0.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    base = config.rope_theta
    partial_rotary_factor = getattr(config, "partial_rotary_factor", 1.0)
    head_dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
    dim = int(head_dim * partial_rotary_factor)

    attention_factor = 1.0  # Unused in this type of RoPE

    # Compute the inverse frequencies
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float) / dim))
    return inv_freq, attention_factor


def _compute_linear_scaling_rope_parameters(
    config: PretrainedConfig,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
) -> tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with linear scaling. Credits to the Reddit user /u/kaiokendev

    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration. This function assumes that the config will provide at least the following
            properties:

            * rope_theta (`float`): The base wavelength from which the inverse frequencies will be derived.
            * hidden_size (`int`): The numerator when deriving a head_dim, if not provided directly.
            * num_attention_heads (`int`): The denominator when deriving a head_dim, if not provided directly.

            Additionally, this function will make use of the following properties if they are found in the config:

            * head_dim (`int`, *optional*): The size of the key-value heads in the model. If None, this value will be
                derived as hidden_size // num_attention_heads.
            * partial_rotary_factor (`float`, *optional*): If less than 1.0, inverse frequencies will be returned for
                the first fraction of the head_dim. Defaults to 1.0.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    factor = config.rope_scaling["factor"]

    # Gets the default RoPE parameters
    inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len)

    # Then applies linear scaling to the frequencies.
    # NOTE: originally, scaling was applied to the position_ids. However, we get `embs = inv_freq @ position_ids`, so
    # applying scaling to the inverse frequencies is equivalent.
    inv_freq /= factor
    return inv_freq, attention_factor


def _compute_dynamic_ntk_parameters(
    config: PretrainedConfig,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
) -> tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla

    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration. This function assumes that the config will provide at least the following
            properties:

            * rope_theta (`float`): The base wavelength from which the inverse frequencies will be derived.
            * hidden_size (`int`): The numerator when deriving a head_dim, if not provided directly.
            * num_attention_heads (`int`): The denominator when deriving a head_dim, if not provided directly.
            * max_position_embeddings (`int`): The default sequence length used to update the dynamic RoPE at
                inference time.
            * rope_scaling (`dict[str, float]`): The standard RoPE scaling parameters, from which `factor` will be
                accessed. The value of `factor` is used to determine the new base frequency, along with the current
                sequence length (seq_len), the maximum positional embeddings (max_position_embeddings), and the
                computed dimensionality (dim) of the rotary embeddings. If seq_len <= max_position_embeddings, this
                factor has no effect. If seq_len > max_position_embeddings, this factor effectively stretches the
                context window using an exponent derived from `dim`.

            Additionally, this function will make use of the following properties if they are found in the config:

            * head_dim (`int`, *optional*): The size of the key-value heads in the model. If None, this value will be
                derived as hidden_size // num_attention_heads.
            * partial_rotary_factor (`float`, *optional*): If less than 1.0, inverse frequencies will be returned for
                the first fraction of the head_dim. Defaults to 1.0.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length, used to update the dynamic RoPE at inference time. If `None` or shorter than
            max_position_embeddings, this value will be overridden by max_position_embeddings.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    base = config.rope_theta
    partial_rotary_factor = getattr(config, "partial_rotary_factor", 1.0)
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    dim = int(head_dim * partial_rotary_factor)
    max_position_embeddings = config.max_position_embeddings
    factor = config.rope_scaling["factor"]

    attention_factor = 1.0  # Unused in this type of RoPE

    # seq_len: default to max_position_embeddings, e.g. at init time
    if seq_len is None:
        seq_len = max_position_embeddings
    elif isinstance(seq_len, torch.Tensor):
        seq_len = torch.maximum(
            seq_len, torch.tensor(max_position_embeddings, dtype=seq_len.dtype, device=seq_len.device)
        )
    else:
        seq_len = max(seq_len, max_position_embeddings)

    # Compute the inverse frequencies
    base = base * ((factor * seq_len / max_position_embeddings) - (factor - 1)) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float) / dim))
    return inv_freq, attention_factor


def _compute_yarn_parameters(
    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None
) -> tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with NTK scaling. Please refer to the
    [original paper](https://huggingface.co/papers/2309.00071)

    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration. This function assumes that the config will provide at least the following
            properties:

            * rope_theta (`float`): The base wavelength from which the inverse frequencies will be derived.
            * hidden_size (`int`): The numerator when deriving a head_dim, if not provided directly.
            * num_attention_heads (`int`): The denominator when deriving a head_dim, if not provided directly.
            * max_position_embeddings (`int`): The maximum length of the positional embeddings.
            * rope_scaling (`dict[str, float | int]`): The standard RoPE scaling parameters, from which the following
                keys will be accessed:
                * `attention_factor` (`float`, *optional*): The scaling factor to be applied to the computed cos/sin.
                    If None, the value is inferred from `factor`, `mscale`, and `mscale_all_dim` as available.
                * `beta_fast` (`float`, *optional*, defaults to 32): Parameter to set the boundary for extrapolation
                    (only) in the linear ramp function.
                * `beta_slow` (`float`, *optional*, defaults to 1): Parameter to set the boundary for interpolation
                    (only) in the linear ramp function.
                * `factor` (`float`, *optional*): The scaling factor applied when interpolating the position IDs to
                    extend the possible context length. Additionally, if `attention_factor` is None, the log of this
                    value is used to compute a value for `attention_factor`, possibly in conjunction with `mscale`
                    and `mscale_all_dim`, if provided.
                * `mscale` (`float`, *optional*): If `attention_factor` is None and both `mscale` and
                    `mscale_all_dim` are provided, `mscale` acts as a scalar augmenting `log(factor)` when computing
                    the numerator for the inferred value of `attention_factor`. If not provided, `attention_factor`
                    will be calculated based on `factor` only.
                * `mscale_all_dim` (`float`, *optional*): If `attention_factor` is None and both `mscale` and
                    `mscale_all_dim` are provided, `mscale_all_dim` acts as a scalar augmenting `log(factor)` when
                    computing the denominator for the inferred value of `attention_factor`. If not provided,
                    `attention_factor` will be calculated based on `factor` only.
                * `original_max_position_embeddings` (`int`, *optional*): The original max position embeddings used
                    during pretraining. If not provided, the function falls back to `max_position_embeddings`.
                * `truncate` (`bool`, *optional*): Whether to truncate the correction range.

            Additionally, this function will make use of the following properties if they are found in the config:

            * head_dim (`int`, *optional*): The size of the key-value heads in the model. If None, this value will be
                derived as hidden_size // num_attention_heads.
            * partial_rotary_factor (`float`, *optional*, defaults to 1.0): If less than 1.0, inverse frequencies
                will be returned for the first fraction of the head_dim.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin.
    """
    base = config.rope_theta
    partial_rotary_factor = getattr(config, "partial_rotary_factor", 1.0)
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    dim = int(head_dim * partial_rotary_factor)
    factor = config.rope_scaling["factor"]
    attention_factor = config.rope_scaling.get("attention_factor")
    mscale = config.rope_scaling.get("mscale")
    mscale_all_dim = config.rope_scaling.get("mscale_all_dim")
    original_max_position_embeddings = (
        config.rope_scaling.get("original_max_position_embeddings") or config.max_position_embeddings
    )

    def get_mscale(scale, mscale=1):
        if scale <= 1:
            return 1.0
        return 0.1 * mscale * math.log(scale) + 1.0

    # Sets the attention factor as suggested in the paper
    if attention_factor is None:
        if mscale and mscale_all_dim:
            attention_factor = float(get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dim))
        else:
            attention_factor = get_mscale(factor)

    # Optional config options
    # beta_fast/beta_slow: as suggested in the paper, default to 32/1 (correspondingly)
    beta_fast = config.rope_scaling.get("beta_fast") or 32
    beta_slow = config.rope_scaling.get("beta_slow") or 1

    # Compute the inverse frequencies
    def find_correction_dim(num_rotations, dim, base, max_position_embeddings):
        """Inverse dimension formula to find the dimension based on the number of rotations"""
        return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))

    def find_correction_range(low_rot, high_rot, dim, base, max_position_embeddings, truncate):
        """Find dimension range bounds based on rotations"""
        low = find_correction_dim(low_rot, dim, base, max_position_embeddings)
        high = find_correction_dim(high_rot, dim, base, max_position_embeddings)
        if truncate:
            low = math.floor(low)
            high = math.ceil(high)
        return max(low, 0), min(high, dim - 1)

    def linear_ramp_factor(min, max, dim):
        if min == max:
            max += 0.001  # Prevent singularity

        linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
        ramp_func = torch.clamp(linear_func, 0, 1)
        return ramp_func

    # Note on variable naming: "interpolation" comes from the original technique, where we interpolate the position
    # IDs to expand the possible context length. In other words, interpolation = apply scaling factor.
    pos_freqs = base ** (torch.arange(0, dim, 2).to(device=device, dtype=torch.float) / dim)
    inv_freq_extrapolation = 1.0 / pos_freqs
    inv_freq_interpolation = 1.0 / (factor * pos_freqs)

    truncate = config.rope_scaling.get("truncate", True)
    low, high = find_correction_range(beta_fast, beta_slow, dim, base, original_max_position_embeddings, truncate)

    # Get n-dimensional rotational scaling corrected for extrapolation
    inv_freq_extrapolation_factor = 1 - linear_ramp_factor(low, high, dim // 2).to(device=device, dtype=torch.float)
    inv_freq = (
        inv_freq_interpolation * (1 - inv_freq_extrapolation_factor)
        + inv_freq_extrapolation * inv_freq_extrapolation_factor
    )
    return inv_freq, attention_factor


def _compute_longrope_parameters(
    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None
) -> tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with LongRoPE scaling. Please refer to the
    [original implementation](https://github.com/microsoft/LongRoPE)

    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration. This function assumes that the config will provide at least the following
            properties:

            * rope_theta (`float`): The base wavelength from which the inverse frequencies will be derived.
            * hidden_size (`int`): The numerator when deriving a head_dim, if not provided directly.
            * num_attention_heads (`int`): The denominator when deriving a head_dim, if not provided directly.
            * max_position_embeddings (`int`): The maximum length of the positional embeddings.
            * original_max_position_embeddings (`int`, *optional*): The original max position embeddings used during
                pretraining. If not provided, defaults to `max_position_embeddings`.
            * rope_scaling (`dict[str, float]`): The standard RoPE scaling parameters, from which the following keys
                will be accessed:
                * `attention_factor` (`float`, *optional*): The scaling factor to be applied on the attention
                    computation. If unspecified, it defaults to the value recommended by the implementation, inferred
                    from the value of `factor`.
                * `factor` (`float`, *optional*): The scaling factor to apply to the RoPE embeddings. If both
                    `max_position_embeddings` and `original_max_position_embeddings` are provided, this value will be
                    overridden as the ratio between those values.
                * `long_factor` (`float`, *optional*): The scale factor applied when computing the inverse
                    frequencies if `seq_len` is provided and greater than `original_max_position_embeddings`.
                * `short_factor` (`float`, *optional*): The scale factor applied when computing the inverse
                    frequencies if `seq_len` is None or less than or equal to `original_max_position_embeddings`.

            Additionally, this function will make use of the following properties if they are found in the config:

            * head_dim (`int`, *optional*): The size of the key-value heads in the model. If None, this value will be
                derived as hidden_size // num_attention_heads.
            * partial_rotary_factor (`float`, *optional*, defaults to 1.0): If less than 1.0, inverse frequencies
                will be returned for the first fraction of the head_dim.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin.
    """
    base = config.rope_theta
    partial_rotary_factor = getattr(config, "partial_rotary_factor", 1.0)
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    dim = int(head_dim * partial_rotary_factor)
    long_factor = config.rope_scaling["long_factor"]
    short_factor = config.rope_scaling["short_factor"]
    factor = config.rope_scaling.get("factor")
    attention_factor = config.rope_scaling.get("attention_factor")

    # NOTE: Phi3 (and potentially other models) modify `max_position_embeddings` and have a
    # `original_max_position_embeddings` field containing the pretrained value. They use the ratio between these two
    # values to compute the default attention scaling factor, instead of using `factor`.
    if original_max_position_embeddings := getattr(config, "original_max_position_embeddings", None):
        factor = config.max_position_embeddings / original_max_position_embeddings
    else:
        original_max_position_embeddings = config.max_position_embeddings

    # Sets the attention factor as suggested in the paper
    if attention_factor is None:
        if factor <= 1.0:
            attention_factor = 1.0
        else:
            attention_factor = math.sqrt(1 + math.log(factor) / math.log(original_max_position_embeddings))

    # Compute the inverse frequencies -- scaled based on the target sequence length
    if seq_len and seq_len > original_max_position_embeddings:
        ext_factors = torch.tensor(long_factor, dtype=torch.float32, device=device)
    else:
        ext_factors = torch.tensor(short_factor, dtype=torch.float32, device=device)
    inv_freq_shape = torch.arange(0, dim, 2, dtype=torch.int64, device=device).float() / dim
    inv_freq = 1.0 / (ext_factors * base**inv_freq_shape)

    return inv_freq, attention_factor


def _compute_llama3_parameters(
    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None
) -> tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies for llama 3.1.

    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin.
    """
    # Gets the default RoPE parameters
    inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len)

    factor = config.rope_scaling["factor"]  # `8` in the original implementation
    low_freq_factor = config.rope_scaling["low_freq_factor"]  # `1` in the original implementation
    high_freq_factor = config.rope_scaling["high_freq_factor"]  # `4` in the original implementation
    old_context_len = config.rope_scaling["original_max_position_embeddings"]  # `8192` in the original implementation

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    wavelen = 2 * math.pi / inv_freq
    # wavelen < high_freq_wavelen: do nothing
    # wavelen > low_freq_wavelen: divide by factor
    inv_freq_llama = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)
    # otherwise: interpolate between the two, using a smooth factor
    smooth_factor = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
    smoothed_inv_freq = (1 - smooth_factor) * inv_freq_llama / factor + smooth_factor * inv_freq_llama
    is_medium_freq = ~(wavelen < high_freq_wavelen) * ~(wavelen > low_freq_wavelen)
    inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)

    return inv_freq_llama, attention_factor


# This maps the "rope_type" string field in rope config to the corresponding function to compute the RoPE parameters
# from the model config. You can append new {'rope_type': callable} pairs to this dictionary to enable custom RoPE
# parameterizations, as long as the callable has the same signature.
ROPE_INIT_FUNCTIONS = {
    "default": _compute_default_rope_parameters,
    "linear": _compute_linear_scaling_rope_parameters,
    "dynamic": _compute_dynamic_ntk_parameters,
    "yarn": _compute_yarn_parameters,
    "longrope": _compute_longrope_parameters,
    "llama3": _compute_llama3_parameters,
}


def _check_received_keys(
    rope_type: str,
    received_keys: set,
    required_keys: set,
    optional_keys: Optional[set] = None,
    ignore_keys: Optional[set] = None,
):
    """Compare the received keys in `config.rope_scaling` against the expected and optional RoPE parameters"""
    # BC: "rope_type" was originally "type" -- let's check for "rope_type" when "type" is present
    if "type" in received_keys:
        received_keys -= {"type"}
        required_keys.add("rope_type")

    # Some models need to store model-specific keys, and we don't want to throw warnings at them
    if ignore_keys is not None:
        received_keys -= ignore_keys

    missing_keys = required_keys - received_keys
    if missing_keys:
        raise KeyError(f"Missing required keys in `rope_scaling` for 'rope_type'='{rope_type}': {missing_keys}")

    if optional_keys is not None:
        unused_keys = received_keys - required_keys - optional_keys
    else:
        unused_keys = received_keys - required_keys
    if unused_keys:
        logger.warning(f"Unrecognized keys in `rope_scaling` for 'rope_type'='{rope_type}': {unused_keys}")


def _validate_default_rope_parameters(config: PretrainedConfig, ignore_keys: Optional[set] = None):
    rope_scaling = config.rope_scaling
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    required_keys = {"rope_type"}
    received_keys = set(rope_scaling.keys())
    _check_received_keys(rope_type, received_keys, required_keys, ignore_keys=ignore_keys)


def _validate_linear_scaling_rope_parameters(config: PretrainedConfig, ignore_keys: Optional[set] = None):
    rope_scaling = config.rope_scaling
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    required_keys = {"rope_type", "factor"}
    received_keys = set(rope_scaling.keys())
    _check_received_keys(rope_type, received_keys, required_keys, ignore_keys=ignore_keys)

    factor = rope_scaling["factor"]
    if factor is None or not isinstance(factor, float) or factor < 1.0:
        logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")


def _validate_dynamic_scaling_rope_parameters(config: PretrainedConfig, ignore_keys: Optional[set] = None):
    rope_scaling = config.rope_scaling
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    required_keys = {"rope_type", "factor"}
    optional_keys = {"original_max_position_embeddings"}
    received_keys = set(rope_scaling.keys())
    _check_received_keys(rope_type, received_keys, required_keys, optional_keys, ignore_keys=ignore_keys)

    factor = rope_scaling["factor"]
    if factor is None or not isinstance(factor, float) or factor < 1.0:
        logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")


def _validate_yarn_parameters(config: PretrainedConfig, ignore_keys: Optional[set] = None):
    rope_scaling = config.rope_scaling
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    required_keys = {"rope_type", "factor"}
    optional_keys = {
        "attention_factor",
        "beta_fast",
        "beta_slow",
        "original_max_position_embeddings",
        "mscale",
        "mscale_all_dim",
        "truncate",
    }
    received_keys = set(rope_scaling.keys())
    _check_received_keys(rope_type, received_keys, required_keys, optional_keys, ignore_keys=ignore_keys)

    factor = rope_scaling["factor"]
    if factor is None or not isinstance(factor, float) or factor < 1.0:
        logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")

    attention_factor = rope_scaling.get("attention_factor")
    if attention_factor is not None and (not isinstance(attention_factor, float) or attention_factor < 0):
        logger.warning(
            f"`rope_scaling`'s attention_factor field must be a float greater than 0, got {attention_factor}"
        )
    beta_fast = rope_scaling.get("beta_fast")
    if beta_fast is not None and not isinstance(beta_fast, float):
        logger.warning(f"`rope_scaling`'s beta_fast field must be a float, got {beta_fast}")
    beta_slow = rope_scaling.get("beta_slow")
    if beta_slow is not None and not isinstance(beta_slow, float):
        logger.warning(f"`rope_scaling`'s beta_slow field must be a float, got {beta_slow}")

    if (beta_fast or 32) < (beta_slow or 1):
        logger.warning(
            f"`rope_scaling`'s beta_fast field must be greater than beta_slow, got beta_fast={beta_fast} "
            f"(defaults to 32 if None) and beta_slow={beta_slow} (defaults to 1 if None)"
        )

    original_max_position_embeddings = rope_scaling.get("original_max_position_embeddings")
    if original_max_position_embeddings is not None:
        implicit_factor = config.max_position_embeddings / original_max_position_embeddings
        if implicit_factor != factor:
            logger.warning_once(
                f"The explicitly set RoPE scaling factor (config.rope_scaling['factor'] = {factor}) does not match "
                "the ratio implicitly set by other parameters (implicit factor = post-yarn context length / "
                "pre-yarn context length = config.max_position_embeddings / "
                f"config.rope_scaling['original_max_position_embeddings'] = {implicit_factor}). Using the explicit "
                f"factor ({factor}) in YaRN. This may cause unexpected behaviour in model usage, please correct the "
                "'max_position_embeddings' fields in the model config."
            )
    else:
        logger.warning_once(
            "config.rope_scaling['original_max_position_embeddings'], the pre-yarn context length, is unset. We will "
            "**assume** config.max_position_embeddings holds the pre-yarn context length. Some use cases may expect "
            "config.max_position_embeddings to hold the post-yarn context length (pre-yarn context length * factor) "
            "-- we recommend updating both fields for optimal downstream model usage."
        )


def _validate_longrope_parameters(config: PretrainedConfig, ignore_keys: Optional[set] = None):
    rope_scaling = config.rope_scaling
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    required_keys = {"rope_type", "short_factor", "long_factor"}
    optional_keys = {"attention_factor", "factor", "original_max_position_embeddings"}
    received_keys = set(rope_scaling.keys())
    _check_received_keys(rope_type, received_keys, required_keys, optional_keys, ignore_keys=ignore_keys)

    partial_rotary_factor = getattr(config, "partial_rotary_factor", 1.0)
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    dim = int(head_dim * partial_rotary_factor)

    short_factor = rope_scaling.get("short_factor")
    if not (isinstance(short_factor, list) and all(isinstance(x, (int, float)) for x in short_factor)):
        logger.warning(f"`rope_scaling`'s short_factor field must be a list of numbers, got {short_factor}")
    if not len(short_factor) == dim // 2:
        logger.warning(f"`rope_scaling`'s short_factor field must have length {dim // 2}, got {len(short_factor)}")

    long_factor = rope_scaling.get("long_factor")
    if not (isinstance(long_factor, list) and all(isinstance(x, (int, float)) for x in long_factor)):
        logger.warning(f"`rope_scaling`'s long_factor field must be a list of numbers, got {long_factor}")
    if not len(long_factor) == dim // 2:
        logger.warning(f"`rope_scaling`'s long_factor field must have length {dim // 2}, got {len(long_factor)}")

    # Handle Phi3 divergence: prefer the use of `attention_factor` and/or `factor` over
    # `original_max_position_embeddings` to compute internal variables. The latter lives outside `rope_scaling` and is
    # unique to longrope (= undesirable)
    if hasattr(config, "original_max_position_embeddings"):
        logger.warning_once(
            "This model has set a `original_max_position_embeddings` field, to be used together with "
            "`max_position_embeddings` to determine a scaling factor. Please set the `factor` field of `rope_scaling`"
            "with this ratio instead -- we recommend the use of this field over `original_max_position_embeddings`, "
            "as it is compatible with most model architectures."
        )
    else:
        factor = rope_scaling.get("factor")
        if factor is None:
            logger.warning("Missing required keys in `rope_scaling`: 'factor'")
        elif not isinstance(factor, float) or factor < 1.0:
            logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")

        attention_factor = rope_scaling.get("attention_factor")
        if attention_factor is not None and (not isinstance(attention_factor, float) or attention_factor < 0.0):
            logger.warning(
                f"`rope_scaling`'s attention_factor field must be a float greater than 0, got {attention_factor}"
            )


def _validate_llama3_parameters(config: PretrainedConfig, ignore_keys: Optional[set] = None):
    rope_scaling = config.rope_scaling
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    required_keys = {"rope_type", "factor", "original_max_position_embeddings", "low_freq_factor", "high_freq_factor"}
    received_keys = set(rope_scaling.keys())
    _check_received_keys(rope_type, received_keys, required_keys, ignore_keys=ignore_keys)

    factor = rope_scaling["factor"]
    if factor is None or not isinstance(factor, float) or factor < 1.0:
        logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")

    low_freq_factor = rope_scaling["low_freq_factor"]
    high_freq_factor = rope_scaling["high_freq_factor"]
    if low_freq_factor is None or not isinstance(low_freq_factor, float):
        logger.warning(f"`rope_scaling`'s low_freq_factor field must be a float, got {low_freq_factor}")
    if high_freq_factor is None or not isinstance(high_freq_factor, float):
        logger.warning(f"`rope_scaling`'s high_freq_factor field must be a float, got {high_freq_factor}")
    if high_freq_factor <= low_freq_factor:
        logger.warning(
            "`rope_scaling`'s high_freq_factor field must be greater than low_freq_factor, got high_freq_factor="
            f"{high_freq_factor} and low_freq_factor={low_freq_factor}"
        )

    original_max_position_embeddings = rope_scaling["original_max_position_embeddings"]
    if original_max_position_embeddings is None or not isinstance(original_max_position_embeddings, int):
        logger.warning(
            "`rope_scaling`'s original_max_position_embeddings field must be an integer, got "
            f"{original_max_position_embeddings}"
        )
    if original_max_position_embeddings >= config.max_position_embeddings:
        logger.warning(
            "`rope_scaling`'s original_max_position_embeddings field must be less than max_position_embeddings, got "
            f"{original_max_position_embeddings} and max_position_embeddings={config.max_position_embeddings}"
        )


# Like `ROPE_INIT_FUNCTIONS`, this validation function mapping can be dynamically updated for custom RoPE types.
ROPE_VALIDATION_FUNCTIONS = {
    "default": _validate_default_rope_parameters,
    "linear": _validate_linear_scaling_rope_parameters,
    "dynamic": _validate_dynamic_scaling_rope_parameters,
    "yarn": _validate_yarn_parameters,
    "longrope": _validate_longrope_parameters,
    "llama3": _validate_llama3_parameters,
}


def rope_config_validation(config: PretrainedConfig, ignore_keys: Optional[set] = None):
    """
    Validate the RoPE config arguments, given a `PretrainedConfig` object
    """
    rope_scaling = getattr(config, "rope_scaling", None)  # not a default parameter in `PretrainedConfig`
    if rope_scaling is None:
        return

    # BC: "rope_type" was originally "type"
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    validation_fn = ROPE_VALIDATION_FUNCTIONS.get(rope_type)
    if validation_fn is not None:
        validation_fn(config, ignore_keys=ignore_keys)
    else:
        logger.warning(
            f"Missing validation function mapping in `ROPE_VALIDATION_FUNCTIONS` for 'rope_type'='{rope_type}'"
        )
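# As a standalone numeric illustration of the default and dynamic-NTK formulas used in this module, here is a
# hedged sketch in plain Python with only `math` (independent of torch and of this module's helpers; the
# function names `default_inv_freq` and `dynamic_ntk_base` are made up for the example):

```python
import math


def default_inv_freq(base: float, dim: int) -> list:
    # inv_freq_i = 1 / base^(2i / dim), for i = 0, 1, ..., dim/2 - 1
    return [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]


def dynamic_ntk_base(base: float, dim: int, factor: float, seq_len: int, max_pos: int) -> float:
    # Dynamic NTK rescales the wavelength base once seq_len exceeds the pretraining
    # length, using the exponent dim / (dim - 2); below max_pos it is a no-op.
    seq_len = max(seq_len, max_pos)
    return base * ((factor * seq_len / max_pos) - (factor - 1)) ** (dim / (dim - 2))


inv_freq = default_inv_freq(10000.0, 8)
print(inv_freq[0])  # the first channel always rotates at frequency 1.0
print(dynamic_ntk_base(10000.0, 8, 2.0, 4096, 4096))  # unchanged base at seq_len == max_pos
```

# The sketch mirrors the structure of `_compute_default_rope_parameters` and `_compute_dynamic_ntk_parameters`
# above, with lists standing in for tensors.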
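# The YaRN helpers `find_correction_dim` and `linear_ramp_factor` above can likewise be re-derived in plain
# Python. This is a hedged sketch with hypothetical names (`yarn_correction_dim`, `yarn_ramp`), using lists
# instead of tensors:

```python
import math


def yarn_correction_dim(num_rotations: float, dim: int, base: float, max_pos: int) -> float:
    # Channel index at which a RoPE channel completes `num_rotations` full turns over
    # `max_pos` positions (inverse of the per-channel wavelength formula).
    return (dim * math.log(max_pos / (num_rotations * 2 * math.pi))) / (2 * math.log(base))


def yarn_ramp(low: float, high: float, n: int) -> list:
    # Per-channel blend weights: 0 below `low`, 1 above `high`, linear in between.
    if low == high:
        high += 0.001  # avoid division by zero
    return [min(max((i - low) / (high - low), 0.0), 1.0) for i in range(n)]
```

# Channels with weight 0 are fully interpolated (scaled by `factor`) and channels with weight 1 are left
# extrapolated, matching how `_compute_yarn_parameters` blends `inv_freq_interpolation` and
# `inv_freq_extrapolation`.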