"""Functions and classes related to optimization (weight updates)."""

from typing import Callable, Optional, Union

import tensorflow as tf


try:
    from tf_keras.optimizers.legacy import Adam
except (ImportError, ModuleNotFoundError):
    from tensorflow.keras.optimizers.legacy import Adam

from .modeling_tf_utils import keras


# Keras has moved the schedules module between releases; resolve whichever location exists.
if hasattr(keras.optimizers.schedules, "learning_rate_schedule"):
    schedules = keras.optimizers.schedules.learning_rate_schedule
else:
    schedules = keras.optimizers.schedules


class WarmUp(schedules.LearningRateSchedule):
    """
    Applies a warmup schedule on a given learning rate decay schedule.

    Args:
        initial_learning_rate (`float`):
            The initial learning rate for the schedule after the warmup (so this will be the learning rate at the end
            of the warmup).
        decay_schedule_fn (`Callable`):
            The schedule function to apply after the warmup for the rest of training.
        warmup_steps (`int`):
            The number of steps for the warmup part of training.
        power (`float`, *optional*, defaults to 1.0):
            The power to use for the polynomial warmup (defaults is a linear warmup).
        name (`str`, *optional*):
            Optional name prefix for the returned tensors during the schedule.
    """

    def __init__(
        self,
        initial_learning_rate: float,
        decay_schedule_fn: Callable,
        warmup_steps: int,
        power: float = 1.0,
        name: Optional[str] = None,
    ):
        super().__init__()
        self.initial_learning_rate = initial_learning_rate
        self.warmup_steps = warmup_steps
        self.power = power
        self.decay_schedule_fn = decay_schedule_fn
        self.name = name

    def __call__(self, step):
        with tf.name_scope(self.name or "WarmUp") as name:
            # Implements polynomial warmup: while `step < warmup_steps`, the learning rate is
            # `initial_learning_rate * (step / warmup_steps) ** power`.
            global_step_float = tf.cast(step, tf.float32)
            warmup_steps_float = tf.cast(self.warmup_steps, tf.float32)
            warmup_percent_done = global_step_float / warmup_steps_float
            warmup_learning_rate = self.initial_learning_rate * tf.math.pow(warmup_percent_done, self.power)
            return tf.cond(
                global_step_float < warmup_steps_float,
                lambda: warmup_learning_rate,
                lambda: self.decay_schedule_fn(step - self.warmup_steps),
                name=name,
            )

    def get_config(self):
        return {
            "initial_learning_rate": self.initial_learning_rate,
            "decay_schedule_fn": self.decay_schedule_fn,
            "warmup_steps": self.warmup_steps,
            "power": self.power,
            "name": self.name,
        }


def create_optimizer(
    init_lr: float,
    num_train_steps: int,
    num_warmup_steps: int,
    min_lr_ratio: float = 0.0,
    adam_beta1: float = 0.9,
    adam_beta2: float = 0.999,
    adam_epsilon: float = 1e-8,
    adam_clipnorm: Optional[float] = None,
    adam_global_clipnorm: Optional[float] = None,
    weight_decay_rate: float = 0.0,
    power: float = 1.0,
    include_in_weight_decay: Optional[list[str]] = None,
):
    """
    Creates an optimizer with a learning rate schedule using a warmup phase followed by a linear decay.

    Args:
        init_lr (`float`):
            The desired learning rate at the end of the warmup phase.
        num_train_steps (`int`):
            The total number of training steps.
        num_warmup_steps (`int`):
            The number of warmup steps.
        min_lr_ratio (`float`, *optional*, defaults to 0):
            The final learning rate at the end of the linear decay will be `init_lr * min_lr_ratio`.
        adam_beta1 (`float`, *optional*, defaults to 0.9):
            The beta1 to use in Adam.
        adam_beta2 (`float`, *optional*, defaults to 0.999):
            The beta2 to use in Adam.
        adam_epsilon (`float`, *optional*, defaults to 1e-8):
            The epsilon to use in Adam.
        adam_clipnorm (`float`, *optional*, defaults to `None`):
            If not `None`, clip the gradient norm for each weight tensor to this value.
        adam_global_clipnorm (`float`, *optional*, defaults to `None`):
            If not `None`, clip gradient norm to this value. When using this argument, the norm is computed over all
            weight tensors, as if they were concatenated into a single vector.
        weight_decay_rate (`float`, *optional*, defaults to 0):
            The weight decay to use.
        power (`float`, *optional*, defaults to 1.0):
            The power to use for PolynomialDecay.
        include_in_weight_decay (`list[str]`, *optional*):
            List of the parameter names (or re patterns) to apply weight decay to. If none is passed, weight decay is
            applied to all parameters except bias and layer norm parameters.
    """
    # Polynomial (linear when `power == 1.0`) decay of the learning rate after the warmup phase.
    lr_schedule = schedules.PolynomialDecay(
        initial_learning_rate=init_lr,
        decay_steps=num_train_steps - num_warmup_steps,
        end_learning_rate=init_lr * min_lr_ratio,
        power=power,
    )
    if num_warmup_steps:
        lr_schedule = WarmUp(
            initial_learning_rate=init_lr,
            decay_schedule_fn=lr_schedule,
            warmup_steps=num_warmup_steps,
        )
    if weight_decay_rate > 0.0:
        optimizer = AdamWeightDecay(
            learning_rate=lr_schedule,
            weight_decay_rate=weight_decay_rate,
            beta_1=adam_beta1,
            beta_2=adam_beta2,
            epsilon=adam_epsilon,
            clipnorm=adam_clipnorm,
            global_clipnorm=adam_global_clipnorm,
            exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"],
            include_in_weight_decay=include_in_weight_decay,
        )
    else:
        optimizer = keras.optimizers.Adam(
            learning_rate=lr_schedule,
            beta_1=adam_beta1,
            beta_2=adam_beta2,
            epsilon=adam_epsilon,
            clipnorm=adam_clipnorm,
            global_clipnorm=adam_global_clipnorm,
        )
    # The schedule is returned alongside the optimizer so the learning rate can be tracked independently.
    return optimizer, lr_schedule


class AdamWeightDecay(Adam):
    """
    Adam enables L2 weight decay and clip_by_global_norm on gradients. Just adding the square of the weights to the
    loss function is *not* the correct way of using L2 regularization/weight decay with Adam, since that will interact
    with the m and v parameters in strange ways as shown in [Decoupled Weight Decay
    Regularization](https://huggingface.co/papers/1711.05101).

    Instead we want to decay the weights in a manner that doesn't interact with the m/v parameters. This is equivalent
    to adding the square of the weights to the loss with plain (non-momentum) SGD.

    Args:
        learning_rate (`Union[float, LearningRateSchedule]`, *optional*, defaults to 0.001):
            The learning rate to use or a schedule.
        beta_1 (`float`, *optional*, defaults to 0.9):
            The beta1 parameter in Adam, which is the exponential decay rate for the 1st momentum estimates.
        beta_2 (`float`, *optional*, defaults to 0.999):
            The beta2 parameter in Adam, which is the exponential decay rate for the 2nd momentum estimates.
        epsilon (`float`, *optional*, defaults to 1e-07):
            The epsilon parameter in Adam, which is a small constant for numerical stability.
        amsgrad (`bool`, *optional*, defaults to `False`):
            Whether to apply AMSGrad variant of this algorithm or not, see [On the Convergence of Adam and
            Beyond](https://huggingface.co/papers/1904.09237).
        weight_decay_rate (`float`, *optional*, defaults to 0.0):
            The weight decay to apply.
        include_in_weight_decay (`list[str]`, *optional*):
            List of the parameter names (or re patterns) to apply weight decay to. If none is passed, weight decay is
            applied to all parameters by default (unless they are in `exclude_from_weight_decay`).
        exclude_from_weight_decay (`list[str]`, *optional*):
            List of the parameter names (or re patterns) to exclude from applying weight decay to. If a
            `include_in_weight_decay` is passed, the names in it will supersede this list.
        name (`str`, *optional*, defaults to `"AdamWeightDecay"`):
            Optional name for the operations created when applying gradients.
        kwargs (`dict[str, Any]`, *optional*):
            Keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by
            norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time
            inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use
            `learning_rate` instead.
    """

    def __init__(
        self,
        learning_rate: Union[float, schedules.LearningRateSchedule] = 0.001,
        beta_1: float = 0.9,
        beta_2: float = 0.999,
        epsilon: float = 1e-7,
        amsgrad: bool = False,
        weight_decay_rate: float = 0.0,
        include_in_weight_decay: Optional[list[str]] = None,
        exclude_from_weight_decay: Optional[list[str]] = None,
        name: str = "AdamWeightDecay",
        **kwargs,
    ):
        super().__init__(learning_rate, beta_1, beta_2, epsilon, amsgrad, name, **kwargs)
        self.weight_decay_rate = weight_decay_rate
        self._include_in_weight_decay = include_in_weight_decay
        self._exclude_from_weight_decay = exclude_from_weight_decay

    @classmethod
    def from_config(cls, config):
        """Creates an optimizer from its config with WarmUp custom object."""
        custom_objects = {"WarmUp": WarmUp}
        return super(AdamWeightDecay, cls).from_config(config, custom_objects=custom_objects)

    def _prepare_local(self, var_device, var_dtype, apply_state):
        super(AdamWeightDecay, self)._prepare_local(var_device, var_dtype, apply_state)
        apply_state[(var_device, var_dtype)]["weight_decay_rate"] = tf.constant(
            self.weight_decay_rate, name="adam_weight_decay_rate"
        )

    def _decay_weights_op(self, var, learning_rate, apply_state):
        do_decay = self._do_use_weight_decay(var.name)
        if do_decay:
            return var.assign_sub(
                learning_rate * var * apply_state[(var.device, var.dtype.base_dtype)]["weight_decay_rate"],
                use_locking=self._use_locking,
            )
        return tf.no_op()

    def apply_gradients(self, grads_and_vars, name=None, **kwargs):
        grads, tvars = list(zip(*grads_and_vars))
        return super(AdamWeightDecay, self).apply_gradients(zip(grads, tvars), name=name, **kwargs)

    def _get_lr(self, var_device, var_dtype, apply_state):
        """Retrieves the learning rate with the given state."""
        if apply_state is None:
            return self._decayed_lr_t[var_dtype], {}

        apply_state = apply_state or {}
        coefficients = apply_state.get((var_device, var_dtype))
        if coefficients is None:
            coefficients = self._fallback_apply_state(var_device, var_dtype)
            apply_state[(var_device, var_dtype)] = coefficients

        return coefficients["lr_t"], {"apply_state": apply_state}

    def _resource_apply_dense(self, grad, var, apply_state=None):
        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
        decay = self._decay_weights_op(var, lr_t, apply_state)
        with tf.control_dependencies([decay]):
            return super(AdamWeightDecay, self)._resource_apply_dense(grad, var, **kwargs)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
        decay = self._decay_weights_op(var, lr_t, apply_state)
        with tf.control_dependencies([decay]):
            return super(AdamWeightDecay, self)._resource_apply_sparse(grad, var, indices, **kwargs)

    def get_config(self):
        config = super().get_config()
        config.update({"weight_decay_rate": self.weight_decay_rate})
        return config

    def _do_use_weight_decay(self, param_name):
        """Whether to use L2 weight decay for `param_name`."""
        if self.weight_decay_rate == 0:
            return False

        if self._include_in_weight_decay:
            for r in self._include_in_weight_decay:
                if r in param_name:
                    return True

        if self._exclude_from_weight_decay:
            for r in self._exclude_from_weight_decay:
                if r in param_name:
                    return False
        return True


class GradientAccumulator:
    """
    Gradient accumulation utility. When used with a distribution strategy, the accumulator should be called in a
    replica context. Gradients will be accumulated locally on each replica and without synchronization. Users should
    then call `.gradients`, scale the gradients if required, and pass the result to `apply_gradients`.
    """

    def __init__(self):
        """Initializes the accumulator."""
        self._gradients = []
        self._accum_steps = None

    @property
    def step(self):
        """Number of accumulated steps."""
        if self._accum_steps is None:
            # ON_READ synchronization keeps accumulation local to each replica.
            self._accum_steps = tf.Variable(
                tf.constant(0, dtype=tf.int64),
                trainable=False,
                synchronization=tf.VariableSynchronization.ON_READ,
                aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,
            )

        return self._accum_steps.value()

    @property
    def gradients(self):
        """The accumulated gradients on the current replica."""
        if not self._gradients:
            raise ValueError("The accumulator should be called first to initialize the gradients")
        return [gradient.value() if gradient is not None else gradient for gradient in self._gradients]

    def __call__(self, gradients):
        """Accumulates `gradients` on the current replica."""
        if not self._gradients:
            _ = self.step  # Create the step variable.
            self._gradients.extend(
                [
                    tf.Variable(
                        tf.zeros_like(gradient),
                        trainable=False,
                        synchronization=tf.VariableSynchronization.ON_READ,
                        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,
                    )
                    if gradient is not None
                    else gradient
                    for gradient in gradients
                ]
            )
        if len(gradients) != len(self._gradients):
            raise ValueError(f"Expected {len(self._gradients)} gradients, but got {len(gradients)}")

        for accum_gradient, gradient in zip(self._gradients, gradients):
            if accum_gradient is not None and gradient is not None:
                accum_gradient.assign_add(gradient)

        self._accum_steps.assign_add(1)

    def reset(self):
        """Resets the accumulated gradients on the current replica."""
        if not self._gradients:
            return
        self._accum_steps.assign(0)
        for gradient in self._gradients:
            if gradient is not None:
                gradient.assign(tf.zeros_like(gradient))
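

# ---------------------------------------------------------------------------
# Illustrative usage sketch (not part of the library API): it shows how
# `create_optimizer` and `GradientAccumulator` are typically combined in a
# custom TF training loop. The toy model, random data, and the hyper-parameter
# values below are hypothetical placeholders chosen only for illustration;
# real training code would supply its own model, dataset, and schedule sizes.
def _example_usage():
    import numpy as np

    # Tiny stand-in model and dataset, purely for illustration.
    model = keras.Sequential([keras.layers.Dense(1)])
    features = np.random.rand(64, 4).astype("float32")
    labels = np.random.rand(64, 1).astype("float32")
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)

    accumulation_steps = 4  # micro-batches folded into one optimizer update
    num_updates = 16  # total optimizer updates for this toy run

    # weight_decay_rate > 0 selects AdamWeightDecay; otherwise plain Adam is used.
    optimizer, lr_schedule = create_optimizer(
        init_lr=5e-5,
        num_train_steps=num_updates,
        num_warmup_steps=4,
        weight_decay_rate=0.01,
    )
    accumulator = GradientAccumulator()
    loss_fn = keras.losses.MeanSquaredError()

    for step, (x, y) in enumerate(dataset.repeat()):
        if step >= num_updates * accumulation_steps:
            break
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        # Accumulate the micro-batch gradients locally.
        accumulator(tape.gradient(loss, model.trainable_variables))
        if (step + 1) % accumulation_steps == 0:
            # Average the accumulated gradients before applying them, then reset.
            grads = [g / accumulation_steps for g in accumulator.gradients]
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            accumulator.reset()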