"""PyTorch optimization for BERT model."""

import math
import warnings
from functools import partial
from typing import Optional, Union

import torch
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LambdaLR, ReduceLROnPlateau

from .trainer_pt_utils import LayerWiseDummyOptimizer, LayerWiseDummyScheduler
from .trainer_utils import SchedulerType
from .utils import logging


logger = logging.get_logger(__name__)


def _get_constant_lambda(_=None):
    return 1


def get_constant_schedule(optimizer: Optimizer, last_epoch: int = -1):
    """
    Create a schedule with a constant learning rate, using the learning rate set in optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    return LambdaLR(optimizer, _get_constant_lambda, last_epoch=last_epoch)


def get_reduce_on_plateau_schedule(optimizer: Optimizer, **kwargs):
    """
    Create a schedule with a constant learning rate that decreases when a metric has stopped improving.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        kwargs (`dict`, *optional*):
            Extra parameters to be passed to the scheduler. See `torch.optim.lr_scheduler.ReduceLROnPlateau` for
            possible parameters.

    Return:
        `torch.optim.lr_scheduler.ReduceLROnPlateau` with the appropriate schedule.
    """
    return ReduceLROnPlateau(optimizer, **kwargs)


def _get_constant_schedule_with_warmup_lr_lambda(current_step: int, *, num_warmup_steps: int):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1.0, num_warmup_steps))
    return 1.0


def get_constant_schedule_with_warmup(optimizer: Optimizer, num_warmup_steps: int, last_epoch: int = -1):
    """
    Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate
    increases linearly between 0 and the initial lr set in the optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    lr_lambda = partial(_get_constant_schedule_with_warmup_lr_lambda, num_warmup_steps=num_warmup_steps)
    return LambdaLR(optimizer, lr_lambda, last_epoch=last_epoch)


def _get_linear_schedule_with_warmup_lr_lambda(current_step: int, *, num_warmup_steps: int, num_training_steps: int):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    return max(0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps)))


def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):
    """
    Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0,
    after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        num_training_steps (`int`):
            The total number of training steps.
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    lr_lambda = partial(
        _get_linear_schedule_with_warmup_lr_lambda,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
    )
    return LambdaLR(optimizer, lr_lambda, last_epoch)
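

# Illustrative sketch (not part of the upstream module): drives the linear warmup/decay schedule above
# with a throwaway model and SGD optimizer so the resulting LR curve can be inspected. The layer sizes,
# learning rate, and step counts below are arbitrary placeholder values.
def _demo_linear_schedule_with_warmup():
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=5, num_training_steps=20)
    lrs = []
    for _ in range(20):
        optimizer.step()  # in real training this follows loss.backward()
        lrs.append(scheduler.get_last_lr()[0])
        scheduler.step()  # advance the LR schedule once per optimizer step
    return lrs  # ramps from 0 to 0.1 over the first 5 steps, then decays linearly back toward 0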


def _get_cosine_schedule_with_warmup_lr_lambda(
    current_step: int, *, num_warmup_steps: int, num_training_steps: int, num_cycles: float
):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress)))


def get_cosine_schedule_with_warmup(
    optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1
):
    """
    Create a schedule with a learning rate that decreases following the values of the cosine function between the
    initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the
    initial lr set in the optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        num_training_steps (`int`):
            The total number of training steps.
        num_cycles (`float`, *optional*, defaults to 0.5):
            The number of waves in the cosine schedule (the default is to just decrease from the max value to 0
            following a half-cosine).
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    lr_lambda = partial(
        _get_cosine_schedule_with_warmup_lr_lambda,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        num_cycles=num_cycles,
    )
    return LambdaLR(optimizer, lr_lambda, last_epoch)


def _get_cosine_with_hard_restarts_schedule_with_warmup_lr_lambda(
    current_step: int, *, num_warmup_steps: int, num_training_steps: int, num_cycles: int
):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))
    if progress >= 1.0:
        return 0.0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((float(num_cycles) * progress) % 1.0))))


def get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: int = 1, last_epoch: int = -1
):
    """
    Create a schedule with a learning rate that decreases following the values of the cosine function between the
    initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases
    linearly between 0 and the initial lr set in the optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        num_training_steps (`int`):
            The total number of training steps.
        num_cycles (`int`, *optional*, defaults to 1):
            The number of hard restarts to use.
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    lr_lambda = partial(
        _get_cosine_with_hard_restarts_schedule_with_warmup_lr_lambda,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        num_cycles=num_cycles,
    )
    return LambdaLR(optimizer, lr_lambda, last_epoch)


def _get_polynomial_decay_schedule_with_warmup_lr_lambda(
    current_step: int,
    *,
    num_warmup_steps: int,
    num_training_steps: int,
    lr_end: float,
    power: float,
    lr_init: float,
):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    elif current_step > num_training_steps:
        return lr_end / lr_init  # as LambdaLR multiplies by lr_init
    else:
        lr_range = lr_init - lr_end
        decay_steps = num_training_steps - num_warmup_steps
        pct_remaining = 1 - (current_step - num_warmup_steps) / decay_steps
        decay = lr_range * pct_remaining**power + lr_end
        return decay / lr_init  # as LambdaLR multiplies by lr_init


def get_polynomial_decay_schedule_with_warmup(
    optimizer, num_warmup_steps, num_training_steps, lr_end=1e-7, power=1.0, last_epoch=-1
):
    """
    Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr set in the
    optimizer to end lr defined by *lr_end*, after a warmup period during which it increases linearly from 0 to the
    initial lr set in the optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        num_training_steps (`int`):
            The total number of training steps.
        lr_end (`float`, *optional*, defaults to 1e-7):
            The end LR.
        power (`float`, *optional*, defaults to 1.0):
            Power factor.
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.

    Note: *power* defaults to 1.0 as in the fairseq implementation, which in turn is based on the original BERT
    implementation at
    https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/optimization.py#L37

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    lr_init = optimizer.defaults["lr"]
    if not (lr_init > lr_end):
        raise ValueError(f"lr_end ({lr_end}) must be smaller than initial lr ({lr_init})")

    lr_lambda = partial(
        _get_polynomial_decay_schedule_with_warmup_lr_lambda,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        lr_end=lr_end,
        power=power,
        lr_init=lr_init,
    )
    return LambdaLR(optimizer, lr_lambda, last_epoch)


def _get_inverse_sqrt_schedule_lr_lambda(current_step: int, *, num_warmup_steps: int, timescale: int = None):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    shift = timescale - num_warmup_steps
    decay = 1.0 / math.sqrt((current_step + shift) / timescale)
    return decay


def get_inverse_sqrt_schedule(
    optimizer: Optimizer, num_warmup_steps: int, timescale: int = None, last_epoch: int = -1
):
    """
    Create a schedule with an inverse square-root learning rate, from the initial lr set in the optimizer, after a
    warmup period which increases lr linearly from 0 to the initial lr set in the optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        timescale (`int`, *optional*, defaults to `num_warmup_steps`):
            Time scale.
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    if timescale is None:
        timescale = num_warmup_steps or 10_000

    lr_lambda = partial(_get_inverse_sqrt_schedule_lr_lambda, num_warmup_steps=num_warmup_steps, timescale=timescale)
    return LambdaLR(optimizer, lr_lambda, last_epoch=last_epoch)
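

# Illustrative sketch (not part of the upstream module): evaluates the cosine lambda above directly to
# show the multiplier it produces — a linear ramp from 0 to 1 over the warmup steps, then a half-cosine
# decay from 1 back to 0. The step counts are arbitrary placeholder values.
def _demo_cosine_multipliers():
    return [
        _get_cosine_schedule_with_warmup_lr_lambda(
            step, num_warmup_steps=10, num_training_steps=100, num_cycles=0.5
        )
        for step in range(0, 101, 10)
    ]  # starts at 0.0, reaches 1.0 right after warmup, ends at 0.0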


def _get_cosine_with_min_lr_schedule_with_warmup_lr_lambda(
    current_step: int,
    *,
    num_warmup_steps: int,
    num_training_steps: int,
    num_cycles: float,
    min_lr_rate: float = 0.0,
):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))
    factor = 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress))
    factor = factor * (1 - min_lr_rate) + min_lr_rate
    return max(0, factor)


def get_cosine_with_min_lr_schedule_with_warmup(
    optimizer: Optimizer,
    num_warmup_steps: int,
    num_training_steps: int,
    num_cycles: float = 0.5,
    last_epoch: int = -1,
    min_lr: Optional[float] = None,
    min_lr_rate: Optional[float] = None,
):
    """
    Create a schedule with a learning rate that decreases following the values of the cosine function between the
    initial lr set in the optimizer to min_lr, after a warmup period during which it increases linearly between 0 and
    the initial lr set in the optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        num_training_steps (`int`):
            The total number of training steps.
        num_cycles (`float`, *optional*, defaults to 0.5):
            The number of waves in the cosine schedule (the default is to just decrease from the max value to 0
            following a half-cosine).
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.
        min_lr (`float`, *optional*):
            The minimum learning rate to reach after the cosine schedule.
        min_lr_rate (`float`, *optional*):
            The minimum learning rate as a ratio of the initial learning rate. If set, `min_lr` should not be set.

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    if min_lr is not None and min_lr_rate is not None:
        raise ValueError("Only one of min_lr or min_lr_rate should be set")
    elif min_lr is not None:
        min_lr_rate = min_lr / optimizer.defaults["lr"]
    elif min_lr_rate is None:
        raise ValueError("One of min_lr or min_lr_rate should be set through the `lr_scheduler_kwargs`")

    lr_lambda = partial(
        _get_cosine_with_min_lr_schedule_with_warmup_lr_lambda,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        num_cycles=num_cycles,
        min_lr_rate=min_lr_rate,
    )
    return LambdaLR(optimizer, lr_lambda, last_epoch)


def _get_cosine_with_min_lr_schedule_with_warmup_lr_rate_lambda(
    current_step: int,
    *,
    num_warmup_steps: int,
    num_training_steps: int,
    num_cycles: float,
    min_lr_rate: float = 0.0,
    warmup_lr_rate: Optional[float] = None,
):
    num_warmup_steps = float(num_warmup_steps)
    num_training_steps = float(num_training_steps)
    current_step = float(current_step)
    if current_step < num_warmup_steps:
        if warmup_lr_rate is None:
            return (current_step + 1.0) / max(1.0, num_warmup_steps)
        warmup_lr_rate = float(warmup_lr_rate)
        return warmup_lr_rate + (1.0 - warmup_lr_rate) * current_step / max(1.0, num_warmup_steps - 1.0)
    progress = (current_step - num_warmup_steps + 1.0) / max(1.0, num_training_steps - num_warmup_steps)
    factor = 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress))
    factor = factor * (1 - min_lr_rate) + min_lr_rate
    return max(0, factor)


def get_cosine_with_min_lr_schedule_with_warmup_lr_rate(
    optimizer: Optimizer,
    num_warmup_steps: int,
    num_training_steps: int,
    num_cycles: float = 0.5,
    last_epoch: int = -1,
    min_lr: Optional[float] = None,
    min_lr_rate: Optional[float] = None,
    warmup_lr_rate: Optional[float] = None,
):
    """
    Create a schedule with a learning rate that decreases following the values of the cosine function between the
    initial lr set in the optimizer to min_lr, after a warmup period during which it increases linearly between 0 and
    the initial lr set in the optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        num_training_steps (`int`):
            The total number of training steps.
        num_cycles (`float`, *optional*, defaults to 0.5):
            The number of waves in the cosine schedule (the default is to just decrease from the max value to 0
            following a half-cosine).
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.
        min_lr (`float`, *optional*):
            The minimum learning rate to reach after the cosine schedule.
        min_lr_rate (`float`, *optional*):
            The minimum learning rate as a ratio of the initial learning rate. If set, `min_lr` should not be set.
        warmup_lr_rate (`float`, *optional*):
            The minimum learning rate as a ratio of the start learning rate. If not set, `warmup_lr_rate` will be
            treated as float(1/num_warmup_steps).

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    if min_lr is not None and min_lr_rate is not None:
        raise ValueError("Only one of min_lr or min_lr_rate should be set")
    elif min_lr is not None:
        min_lr_rate = min_lr / optimizer.defaults["lr"]
    elif min_lr_rate is None:
        raise ValueError("One of min_lr or min_lr_rate should be set through the `lr_scheduler_kwargs`")

    lr_lambda = partial(
        _get_cosine_with_min_lr_schedule_with_warmup_lr_rate_lambda,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        num_cycles=num_cycles,
        min_lr_rate=min_lr_rate,
        warmup_lr_rate=warmup_lr_rate,
    )
    return LambdaLR(optimizer, lr_lambda, last_epoch)


def _get_wsd_scheduler_lambda(
    current_step: int,
    *,
    num_warmup_steps: int,
    num_stable_steps: int,
    num_decay_steps: int,
    warmup_type: str,
    decay_type: str,
    min_lr_ratio: float,
    num_cycles: float,
):
    # num_cycles is accepted for signature parity with the other schedules but is not used here.
    if current_step < num_warmup_steps:
        progress = float(current_step) / float(max(1, num_warmup_steps))
        if warmup_type == "linear":
            factor = progress
        elif warmup_type == "cosine":
            factor = 0.5 * (1.0 - math.cos(math.pi * progress))
        elif warmup_type == "1-sqrt":
            factor = 1.0 - math.sqrt(1.0 - progress)
        factor = factor * (1.0 - min_lr_ratio) + min_lr_ratio
        return max(0.0, factor)

    if current_step < num_warmup_steps + num_stable_steps:
        return 1.0

    if current_step < num_warmup_steps + num_stable_steps + num_decay_steps:
        progress = float(current_step - num_warmup_steps - num_stable_steps) / float(max(1, num_decay_steps))
        if decay_type == "linear":
            factor = 1.0 - progress
        elif decay_type == "cosine":
            factor = 0.5 * (1.0 + math.cos(math.pi * progress))
        elif decay_type == "1-sqrt":
            factor = 1.0 - math.sqrt(progress)
        factor = factor * (1.0 - min_lr_ratio) + min_lr_ratio
        return max(0.0, factor)
    return min_lr_ratio
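

# Illustrative sketch (not part of the upstream module): evaluates the WSD lambda above to show its
# three phases — warmup toward 1.0, a stable plateau at 1.0, then a decay toward `min_lr_ratio`.
# All step counts and the 0.1 floor are arbitrary placeholder values.
def _demo_wsd_phases():
    return [
        _get_wsd_scheduler_lambda(
            step,
            num_warmup_steps=10,
            num_stable_steps=30,
            num_decay_steps=10,
            warmup_type="linear",
            decay_type="cosine",
            min_lr_ratio=0.1,
            num_cycles=0.5,
        )
        for step in (0, 5, 10, 25, 40, 45, 50)
    ]  # approximately [0.1, 0.55, 1.0, 1.0, 1.0, 0.55, 0.1]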


def get_wsd_schedule(
    optimizer: Optimizer,
    num_warmup_steps: int,
    num_decay_steps: int,
    num_training_steps: Optional[int] = None,
    num_stable_steps: Optional[int] = None,
    warmup_type: str = "linear",
    decay_type: str = "cosine",
    min_lr_ratio: float = 0,
    num_cycles: float = 0.5,
    last_epoch: int = -1,
):
    """
    Create a Warmup-Stable-Decay (WSD) schedule with three phases: the learning rate increases from `min_lr_ratio`
    times the initial lr to the initial lr during the warmup phase, stays constant during the stable phase, and
    decreases back to `min_lr_ratio` times the initial lr during the decay phase.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        num_decay_steps (`int`):
            The number of steps for the decay phase.
        num_training_steps (`int`, *optional*):
            The total number of training steps. Used to infer the number of stable steps when `num_stable_steps` is
            not provided.
        num_stable_steps (`int`, *optional*):
            The number of steps for the stable phase. If not set, it is computed as
            `num_training_steps - num_warmup_steps - num_decay_steps`.
        warmup_type (`str`, *optional*, defaults to `"linear"`):
            The warmup curve to use, one of `"linear"`, `"cosine"` or `"1-sqrt"`.
        decay_type (`str`, *optional*, defaults to `"cosine"`):
            The decay curve to use, one of `"linear"`, `"cosine"` or `"1-sqrt"`.
        min_lr_ratio (`float`, *optional*, defaults to 0):
            The minimum learning rate as a ratio of the initial learning rate.
        num_cycles (`float`, *optional*, defaults to 0.5):
            The number of waves in the cosine schedule.
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    if num_training_steps is None and num_stable_steps is None:
        raise ValueError("Either num_training_steps or num_stable_steps must be specified.")

    if num_training_steps is not None and num_stable_steps is not None:
        warnings.warn("Both num_training_steps and num_stable_steps are specified. num_stable_steps will be used.")

    if warmup_type not in ["linear", "cosine", "1-sqrt"]:
        raise ValueError(f"Unknown warmup type: {warmup_type}, expected 'linear', 'cosine' or '1-sqrt'")

    if decay_type not in ["linear", "cosine", "1-sqrt"]:
        raise ValueError(f"Unknown decay type: {decay_type}, expected 'linear', 'cosine' or '1-sqrt'")

    if num_stable_steps is None:
        num_stable_steps = num_training_steps - num_warmup_steps - num_decay_steps

    lr_lambda = partial(
        _get_wsd_scheduler_lambda,
        num_warmup_steps=num_warmup_steps,
        num_stable_steps=num_stable_steps,
        num_decay_steps=num_decay_steps,
        warmup_type=warmup_type,
        decay_type=decay_type,
        min_lr_ratio=min_lr_ratio,
        num_cycles=num_cycles,
    )
    return LambdaLR(optimizer, lr_lambda, last_epoch)


TYPE_TO_SCHEDULER_FUNCTION = {
    SchedulerType.LINEAR: get_linear_schedule_with_warmup,
    SchedulerType.COSINE: get_cosine_schedule_with_warmup,
    SchedulerType.COSINE_WITH_RESTARTS: get_cosine_with_hard_restarts_schedule_with_warmup,
    SchedulerType.POLYNOMIAL: get_polynomial_decay_schedule_with_warmup,
    SchedulerType.CONSTANT: get_constant_schedule,
    SchedulerType.CONSTANT_WITH_WARMUP: get_constant_schedule_with_warmup,
    SchedulerType.INVERSE_SQRT: get_inverse_sqrt_schedule,
    SchedulerType.REDUCE_ON_PLATEAU: get_reduce_on_plateau_schedule,
    SchedulerType.COSINE_WITH_MIN_LR: get_cosine_with_min_lr_schedule_with_warmup,
    SchedulerType.COSINE_WARMUP_WITH_MIN_LR: get_cosine_with_min_lr_schedule_with_warmup_lr_rate,
    SchedulerType.WARMUP_STABLE_DECAY: get_wsd_schedule,
}


def get_scheduler(
    name: Union[str, SchedulerType],
    optimizer: Optimizer,
    num_warmup_steps: Optional[int] = None,
    num_training_steps: Optional[int] = None,
    scheduler_specific_kwargs: Optional[dict] = None,
):
    """
    Unified API to get any scheduler from its name.

    Args:
        name (`str` or `SchedulerType`):
            The name of the scheduler to use.
        optimizer (`torch.optim.Optimizer`):
            The optimizer that will be used during training.
        num_warmup_steps (`int`, *optional*):
            The number of warmup steps to do. This is not required by all schedulers (hence the argument being
            optional), the function will raise an error if it's unset and the scheduler type requires it.
        num_training_steps (`int`, *optional*):
            The number of training steps to do. This is not required by all schedulers (hence the argument being
            optional), the function will raise an error if it's unset and the scheduler type requires it.
        scheduler_specific_kwargs (`dict`, *optional*):
            Extra parameters for schedulers such as cosine with restarts. Mismatched scheduler types and scheduler
            parameters will cause the scheduler function to raise a TypeError.
    """
    name = SchedulerType(name)
    schedule_func = TYPE_TO_SCHEDULER_FUNCTION[name]

    # If a `LayerWiseDummyOptimizer` is passed, extract the per-parameter optimizer dict and build one
    # scheduler per parameter by calling `get_scheduler` recursively.
    if optimizer is not None and isinstance(optimizer, LayerWiseDummyOptimizer):
        optimizer_dict = optimizer.optimizer_dict
        scheduler_dict = {}

        for param in optimizer_dict.keys():
            scheduler_dict[param] = get_scheduler(
                name,
                optimizer=optimizer_dict[param],
                num_warmup_steps=num_warmup_steps,
                num_training_steps=num_training_steps,
                scheduler_specific_kwargs=scheduler_specific_kwargs,
            )

        def scheduler_hook(param):
            # The optimizer hook is already attached; we only need to step the matching scheduler.
            scheduler_dict[param].step()

        for param in optimizer_dict.keys():
            if param.requires_grad:
                param.register_post_accumulate_grad_hook(scheduler_hook)

        return LayerWiseDummyScheduler(optimizer_dict=optimizer_dict, lr=optimizer.defaults["lr"])

    if name == SchedulerType.CONSTANT:
        return schedule_func(optimizer)

    if scheduler_specific_kwargs is None:
        scheduler_specific_kwargs = {}

    if name == SchedulerType.REDUCE_ON_PLATEAU:
        return schedule_func(optimizer, **scheduler_specific_kwargs)

    # All other schedulers require `num_warmup_steps`
    if num_warmup_steps is None:
        raise ValueError(f"{name} requires `num_warmup_steps`, please provide that argument.")

    if name == SchedulerType.CONSTANT_WITH_WARMUP:
        return schedule_func(optimizer, num_warmup_steps=num_warmup_steps)

    if name == SchedulerType.INVERSE_SQRT:
        return schedule_func(optimizer, num_warmup_steps=num_warmup_steps)

    if name == SchedulerType.WARMUP_STABLE_DECAY:
        return schedule_func(optimizer, num_warmup_steps=num_warmup_steps, **scheduler_specific_kwargs)

    # All other schedulers require `num_training_steps`
    if num_training_steps is None:
        raise ValueError(f"{name} requires `num_training_steps`, please provide that argument.")

    return schedule_func(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
        **scheduler_specific_kwargs,
    )
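

# Illustrative sketch (not part of the upstream module): the unified `get_scheduler` API in use. The
# name "cosine" resolves to `get_cosine_schedule_with_warmup` via `TYPE_TO_SCHEDULER_FUNCTION`; the
# model, learning rate, and step counts are arbitrary placeholder values.
def _demo_get_scheduler():
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    scheduler = get_scheduler(
        "cosine",
        optimizer=optimizer,
        num_warmup_steps=10,
        num_training_steps=100,
    )
    for _ in range(100):
        optimizer.step()  # in real training this follows loss.backward()
        scheduler.step()
    return scheduler.get_last_lr()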


class Adafactor(Optimizer):
    """
    AdaFactor PyTorch implementation that can be used as a drop-in replacement for Adam. Original fairseq code:
    https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py

    Paper: *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost* https://huggingface.co/papers/1804.04235

    Note that this optimizer internally adjusts the learning rate depending on the `scale_parameter`, `relative_step`
    and `warmup_init` options. To use a manual (external) learning rate schedule you should set `scale_parameter=False`
    and `relative_step=False`.

    Arguments:
        params (`Iterable[nn.parameter.Parameter]`):
            Iterable of parameters to optimize or dictionaries defining parameter groups.
        lr (`float`, *optional*):
            The external learning rate.
        eps (`tuple[float, float]`, *optional*, defaults to `(1e-30, 0.001)`):
            Regularization constants for square gradient and parameter scale respectively.
        clip_threshold (`float`, *optional*, defaults to 1.0):
            Threshold of root mean square of final gradient update.
        decay_rate (`float`, *optional*, defaults to -0.8):
            Coefficient used to compute running averages of square.
        beta1 (`float`, *optional*):
            Coefficient used for computing running averages of gradient.
        weight_decay (`float`, *optional*, defaults to 0.0):
            Weight decay (L2 penalty).
        scale_parameter (`bool`, *optional*, defaults to `True`):
            If True, learning rate is scaled by root mean square.
        relative_step (`bool`, *optional*, defaults to `True`):
            If True, time-dependent learning rate is computed instead of external learning rate.
        warmup_init (`bool`, *optional*, defaults to `False`):
            Time-dependent learning rate computation depends on whether warm-up initialization is being used.

    This implementation handles low-precision (FP16, bfloat16) values, but we have not thoroughly tested it.

    Recommended T5 finetuning settings (https://discuss.huggingface.co/t/t5-finetuning-tips/684/3):

       - Training without LR warmup or clip_threshold is not recommended.

           - use scheduled LR warm-up to fixed LR
           - use clip_threshold=1.0 (https://huggingface.co/papers/1804.04235)
       - Disable relative updates
       - Use scale_parameter=False
       - Additional optimizer operations like gradient clipping should not be used alongside Adafactor

    Example:

    ```python
    Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)
    ```

    Others reported the following combination to work well:

    ```python
    Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
    ```

    When using `lr=None` with [`Trainer`] you will most likely need to use [`~optimization.AdafactorSchedule`]
    scheduler as following:

    ```python
    from transformers.optimization import Adafactor, AdafactorSchedule

    optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
    lr_scheduler = AdafactorSchedule(optimizer)
    trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))
    ```

    Usage:

    ```python
    # replace AdamW with Adafactor
    optimizer = Adafactor(
        model.parameters(),
        lr=1e-3,
        eps=(1e-30, 1e-3),
        clip_threshold=1.0,
        decay_rate=-0.8,
        beta1=None,
        weight_decay=0.0,
        relative_step=False,
        scale_parameter=False,
        warmup_init=False,
    )
    ```"""

    def __init__(
        self,
        params,
        lr=None,
        eps=(1e-30, 1e-3),
        clip_threshold=1.0,
        decay_rate=-0.8,
        beta1=None,
        weight_decay=0.0,
        scale_parameter=True,
        relative_step=True,
        warmup_init=False,
    ):
        if lr is not None and relative_step:
            raise ValueError("Cannot combine manual `lr` and `relative_step=True` options")
        if warmup_init and not relative_step:
            raise ValueError("`warmup_init=True` requires `relative_step=True`")

        defaults = {
            "lr": lr,
            "eps": eps,
            "clip_threshold": clip_threshold,
            "decay_rate": decay_rate,
            "beta1": beta1,
            "weight_decay": weight_decay,
            "scale_parameter": scale_parameter,
            "relative_step": relative_step,
            "warmup_init": warmup_init,
        }
        super().__init__(params, defaults)

    @staticmethod
    def _get_lr(param_group, param_state):
        rel_step_sz = param_group["lr"]
        if param_group["relative_step"]:
            min_step = 1e-6 * param_state["step"] if param_group["warmup_init"] else 1e-2
            rel_step_sz = min(min_step, 1.0 / math.sqrt(param_state["step"]))
        param_scale = 1.0
        if param_group["scale_parameter"]:
            param_scale = max(param_group["eps"][1], param_state["RMS"])
        return param_scale * rel_step_sz

    @staticmethod
    def _get_options(param_group, param_shape):
        factored = len(param_shape) >= 2
        use_first_moment = param_group["beta1"] is not None
        return factored, use_first_moment

    @staticmethod
    def _rms(tensor):
        return tensor.norm(2) / (tensor.numel() ** 0.5)

    @staticmethod
    def _approx_sq_grad(exp_avg_sq_row, exp_avg_sq_col):
        # Reconstruct the full second-moment estimate from its factored row/column statistics.
        r_factor = (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1, keepdim=True)).rsqrt_().unsqueeze(-1)
        c_factor = exp_avg_sq_col.unsqueeze(-2).rsqrt()
        return torch.mul(r_factor, c_factor)
    @torch.no_grad()
    def step(self, closure=None):
        """
        Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if grad.dtype in {torch.float16, torch.bfloat16}:
                    grad = grad.float()
                if grad.is_sparse:
                    raise RuntimeError("Adafactor does not support sparse gradients.")

                state = self.state[p]
                grad_shape = grad.shape

                factored, use_first_moment = self._get_options(group, grad_shape)
                # State initialization
                if len(state) == 0:
                    state["step"] = 0

                    if use_first_moment:
                        # Exponential moving average of gradient values
                        state["exp_avg"] = torch.zeros_like(grad)
                    if factored:
                        state["exp_avg_sq_row"] = torch.zeros(grad_shape[:-1]).to(grad)
                        state["exp_avg_sq_col"] = torch.zeros(grad_shape[:-2] + grad_shape[-1:]).to(grad)
                    else:
                        state["exp_avg_sq"] = torch.zeros_like(grad)

                    state["RMS"] = 0
                else:
                    if use_first_moment:
                        state["exp_avg"] = state["exp_avg"].to(grad)
                    if factored:
                        state["exp_avg_sq_row"] = state["exp_avg_sq_row"].to(grad)
                        state["exp_avg_sq_col"] = state["exp_avg_sq_col"].to(grad)
                    else:
                        state["exp_avg_sq"] = state["exp_avg_sq"].to(grad)

                p_data_fp32 = p
                if p.dtype in {torch.float16, torch.bfloat16}:
                    p_data_fp32 = p_data_fp32.float()

                state["step"] += 1
                state["RMS"] = self._rms(p_data_fp32)
                lr = self._get_lr(group, state)

                beta2t = 1.0 - math.pow(state["step"], group["decay_rate"])
                update = (grad**2) + group["eps"][0]
                if factored:
                    exp_avg_sq_row = state["exp_avg_sq_row"]
                    exp_avg_sq_col = state["exp_avg_sq_col"]

                    exp_avg_sq_row.mul_(beta2t).add_(update.mean(dim=-1), alpha=(1.0 - beta2t))
                    exp_avg_sq_col.mul_(beta2t).add_(update.mean(dim=-2), alpha=(1.0 - beta2t))

                    # Approximation of exponential moving average of square of gradient
                    update = self._approx_sq_grad(exp_avg_sq_row, exp_avg_sq_col)
                    update.mul_(grad)
                else:
                    exp_avg_sq = state["exp_avg_sq"]

                    exp_avg_sq.mul_(beta2t).add_(update, alpha=(1.0 - beta2t))
                    update = exp_avg_sq.rsqrt().mul_(grad)

                update.div_((self._rms(update) / group["clip_threshold"]).clamp_(min=1.0))
                update.mul_(lr)

                if use_first_moment:
                    exp_avg = state["exp_avg"]
                    exp_avg.mul_(group["beta1"]).add_(update, alpha=(1 - group["beta1"]))
                    update = exp_avg

                if group["weight_decay"] != 0:
                    p_data_fp32.add_(p_data_fp32, alpha=(-group["weight_decay"] * lr))

                p_data_fp32.add_(-update)

                if p.dtype in {torch.float16, torch.bfloat16}:
                    p.copy_(p_data_fp32)

        return loss


class AdafactorSchedule(LambdaLR):
    """
    Since [`~optimization.Adafactor`] performs its own scheduling, if the training loop relies on a scheduler (e.g.,
    for logging), this class creates a proxy object that retrieves the current lr values from the optimizer.

    It returns `initial_lr` during startup and the actual `lr` during stepping.
    """

    def __init__(self, optimizer, initial_lr=0.0):
        def lr_lambda(_):
            return initial_lr

        for group in optimizer.param_groups:
            group["initial_lr"] = initial_lr
        super().__init__(optimizer, lr_lambda)
        for group in optimizer.param_groups:
            del group["initial_lr"]

    def get_lr(self):
        opt = self.optimizer
        lrs = [
            opt._get_lr(group, opt.state[group["params"][0]])
            for group in opt.param_groups
            if group["params"][0].grad is not None
        ]
        if len(lrs) == 0:
            lrs = self.base_lrs  # if called before stepping
        return lrs


def get_adafactor_schedule(optimizer, initial_lr=0.0):
    """
    Get a proxy schedule for [`~optimization.Adafactor`]

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        initial_lr (`float`, *optional*, defaults to 0.0):
            Initial lr

    Return:
        [`~optimization.Adafactor`] proxy schedule object.
    """
    return AdafactorSchedule(optimizer, initial_lr)
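

# Illustrative sketch (not part of the upstream module): shows the memory-saving factorization behind
# `Adafactor._approx_sq_grad` — the per-element second-moment statistics of a 2D gradient are
# approximated from their row and column means, so only O(rows + cols) state has to be stored instead
# of O(rows * cols). The 4x3 gradient below is an arbitrary example.
def _demo_factored_second_moment():
    grad = torch.randn(4, 3)
    update = grad**2 + 1e-30  # per-element second-moment statistics, as accumulated in `Adafactor.step`
    exp_avg_sq_row = update.mean(dim=-1)  # shape (4,)
    exp_avg_sq_col = update.mean(dim=-2)  # shape (3,)
    approx_rsqrt = Adafactor._approx_sq_grad(exp_avg_sq_row, exp_avg_sq_col)  # shape (4, 3)
    exact_rsqrt = update.rsqrt()
    # approx_rsqrt is the rsqrt of a rank-1 (row x column) approximation of `update`,
    # compared here against the exact per-element values.
    return approx_rsqrt, exact_rsqrt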