from __future__ import annotations

import inspect
import warnings
from collections import abc, defaultdict
from enum import Enum
from typing import Any, cast, Optional, overload, TYPE_CHECKING, Union

import torch

if TYPE_CHECKING:
    from collections.abc import Iterable

__all__ = ["OptState", "GradScaler"]


class _MultiDeviceReplicator:
    """Lazily serves copies of a tensor to requested devices.

    Copies are cached per-device.
    """

    def __init__(self, master_tensor: torch.Tensor) -> None:
        self.master = master_tensor
        self._per_device_tensors: dict[torch.device, torch.Tensor] = {}

    def get(self, device: torch.device) -> torch.Tensor:
        retval = self._per_device_tensors.get(device, None)
        if retval is None:
            retval = self.master.to(device=device, non_blocking=True, copy=True)
            self._per_device_tensors[device] = retval
        return retval


class OptState(Enum):
    READY = 0
    UNSCALED = 1
    STEPPED = 2


def _refresh_per_optimizer_state() -> dict[str, Any]:
    return {"stage": OptState.READY, "found_inf_per_device": {}}
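

# Illustrative sketch (a hypothetical helper, not a torch API): a tiny self-check of the
# caching behaviour above.  Two get() calls for the same device should hit the per-device
# cache and return the same tensor object.  CPU-only so it runs anywhere; call it manually
# if desired.
def _demo_multi_device_replicator() -> None:
    master = torch.full((), 65536.0, dtype=torch.float32)
    replicator = _MultiDeviceReplicator(master)
    first = replicator.get(master.device)
    second = replicator.get(master.device)
    # The first call makes and caches a copy; the second is served from the cache.
    assert first is second, "the second lookup should be served from the cache"
    assert first.item() == master.item()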


class GradScaler:
    """An instance ``scaler`` of :class:`GradScaler`.

    Helps perform the steps of gradient scaling conveniently.

    * ``scaler.scale(loss)`` multiplies a given loss by ``scaler``'s current scale factor.
    * ``scaler.step(optimizer)`` safely unscales gradients and calls ``optimizer.step()``.
    * ``scaler.update()`` updates ``scaler``'s scale factor.

    Example::

        # Creates a GradScaler once at the beginning of training.
        scaler = GradScaler()

        for epoch in epochs:
            for input, target in data:
                optimizer.zero_grad()
                output = model(input)
                loss = loss_fn(output, target)

                # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
                scaler.scale(loss).backward()

                # scaler.step() first unscales gradients of the optimizer's params.
                # If gradients don't contain infs/NaNs, optimizer.step() is then called,
                # otherwise, optimizer.step() is skipped.
                scaler.step(optimizer)

                # Updates the scale for next iteration.
                scaler.update()

    See the :ref:`Automatic Mixed Precision examples` for usage (along with autocasting)
    in more complex cases like gradient clipping, gradient accumulation, gradient penalty,
    and multiple losses/optimizers.

    ``scaler`` dynamically estimates the scale factor each iteration.  To minimize gradient
    underflow, a large scale factor should be used.  However, ``float16`` values can
    "overflow" (become inf or NaN) if the scale factor is too large.  Therefore, the optimal
    scale factor is the largest factor that can be used without incurring inf or NaN gradient
    values.  ``scaler`` approximates the optimal scale factor over time by checking the
    gradients for infs and NaNs during every ``scaler.step(optimizer)`` (or optional separate
    ``scaler.unscale_(optimizer)``, see :meth:`unscale_`).

    * If infs/NaNs are found, ``scaler.step(optimizer)`` skips the underlying
      ``optimizer.step()`` (so the params themselves remain uncorrupted) and ``update()``
      multiplies the scale by ``backoff_factor``.

    * If no infs/NaNs are found, ``scaler.step(optimizer)`` runs the underlying
      ``optimizer.step()`` as usual.  If ``growth_interval`` unskipped iterations occur
      consecutively, ``update()`` multiplies the scale by ``growth_factor``.

    The scale factor often causes infs/NaNs to appear in gradients for the first few
    iterations as its value calibrates.  ``scaler.step`` will skip the underlying
    ``optimizer.step()`` for these iterations.  After that, step skipping should occur
    rarely (once every few hundred or thousand iterations).

    Args:
        device (str, optional, default="cuda"): Device type to use. Possible values are:
            'cuda' and 'cpu'.  The type is the same as the `type` attribute of a
            :class:`torch.device`.  Thus, you may obtain the device type of a tensor using
            `Tensor.device.type`.
        init_scale (float, optional, default=2.**16):  Initial scale factor.
        growth_factor (float, optional, default=2.0):  Factor by which the scale is
            multiplied during :meth:`update` if no inf/NaN gradients occur for
            ``growth_interval`` consecutive iterations.
        backoff_factor (float, optional, default=0.5):  Factor by which the scale is
            multiplied during :meth:`update` if inf/NaN gradients occur in an iteration.
        growth_interval (int, optional, default=2000):  Number of consecutive iterations
            without inf/NaN gradients that must occur for the scale to be multiplied by
            ``growth_factor``.
        enabled (bool, optional):  If ``False``, disables gradient scaling. :meth:`step`
            simply invokes the underlying ``optimizer.step()``, and other methods become
            no-ops.  Default: ``True``
    """

    def __init__(
        self,
        device: str = "cuda",
        init_scale: float = 2.0**16,
        growth_factor: float = 2.0,
        backoff_factor: float = 0.5,
        growth_interval: int = 2000,
        enabled: bool = True,
    ) -> None:
        self._device = device
        self._enabled = enabled
        if self._device == "cuda":
            if enabled and torch.cuda.amp.common.amp_definitely_not_available():
                warnings.warn(
                    "torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling."
                )
                self._enabled = False

        if self._enabled:
            assert growth_factor > 1.0, "The growth factor must be > 1.0."
            assert backoff_factor < 1.0, "The backoff factor must be < 1.0."

            self._init_scale = init_scale
            # self._scale will be lazily initialized during the first call to scale()
            self._scale: Optional[torch.Tensor] = None
            self._growth_factor = growth_factor
            self._backoff_factor = backoff_factor
            self._growth_interval = growth_interval
            self._init_growth_tracker = 0
            # self._growth_tracker will be lazily initialized during the first call to scale()
            self._growth_tracker: Optional[torch.Tensor] = None
            self._per_optimizer_states: dict[int, dict[str, Any]] = defaultdict(
                _refresh_per_optimizer_state
            )

    def _check_scale_growth_tracker(
        self, funcname: str
    ) -> tuple[torch.Tensor, torch.Tensor]:
        fix = "This may indicate your script did not use scaler.scale(loss or outputs) earlier in the iteration."
        assert self._scale is not None, f"Attempted {funcname} but _scale is None.  " + fix
        assert (
            self._growth_tracker is not None
        ), f"Attempted {funcname} but _growth_tracker is None.  " + fix
        return (self._scale, self._growth_tracker)

    def _lazy_init_scale_growth_tracker(self, dev: torch.device) -> None:
        assert self._growth_tracker is None, "_growth_tracker initialized before _scale"
        self._scale = torch.full((), self._init_scale, dtype=torch.float32, device=dev)
        self._growth_tracker = torch.full(
            (), self._init_growth_tracker, dtype=torch.int32, device=dev
        )

    @overload
    def scale(self, outputs: torch.Tensor) -> torch.Tensor: ...

    @overload
    def scale(self, outputs: list[torch.Tensor]) -> list[torch.Tensor]: ...

    @overload
    def scale(self, outputs: tuple[torch.Tensor, ...]) -> tuple[torch.Tensor, ...]: ...

    @overload
    def scale(self, outputs: Iterable[torch.Tensor]) -> Iterable[torch.Tensor]: ...

    def scale(
        self,
        outputs: Union[torch.Tensor, Iterable[torch.Tensor]],
    ) -> Union[torch.Tensor, Iterable[torch.Tensor]]:
        """Multiply ('scale') a tensor or list of tensors by the scale factor.

        Returns scaled outputs.  If this instance of :class:`GradScaler` is not enabled,
        outputs are returned unmodified.

        Args:
            outputs (Tensor or iterable of Tensors):  Outputs to scale.
        """
        if not self._enabled:
            return outputs

        # Short-circuit for the common case: a single Tensor.
        if isinstance(outputs, torch.Tensor):
            if self._scale is None:
                self._lazy_init_scale_growth_tracker(outputs.device)
            assert self._scale is not None
            return outputs * self._scale.to(device=outputs.device, non_blocking=True)

        # Invoke the more complex machinery only if we're treating multiple outputs.
        # stash lazily holds a _MultiDeviceReplicator for the scale, created on first use.
        stash: list[_MultiDeviceReplicator] = []

        def apply_scale(val: Union[torch.Tensor, Iterable[torch.Tensor]]):
            if isinstance(val, torch.Tensor):
                if len(stash) == 0:
                    if self._scale is None:
                        self._lazy_init_scale_growth_tracker(val.device)
                    assert self._scale is not None
                    stash.append(_MultiDeviceReplicator(self._scale))
                return val * stash[0].get(val.device)
            if isinstance(val, abc.Iterable):
                iterable = map(apply_scale, val)
                if isinstance(val, (list, tuple)):
                    return type(val)(iterable)
                return iterable
            raise ValueError("outputs must be a Tensor or an iterable of Tensors")

        return apply_scale(outputs)
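
    # Note: scale() also accepts an iterable of tensors, which is handy when a model produces
    # several losses that are backpropagated separately.  A hypothetical sketch (not from the
    # original source)::
    #
    #     scaled0, scaled1 = scaler.scale([loss0, loss1])
    #     scaled0.backward(retain_graph=True)
    #     scaled1.backward()
    #
    # Lists and tuples come back as the same container type; other iterables are returned
    # lazily as a ``map`` object.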

    def _unscale_grads_(
        self,
        optimizer: torch.optim.Optimizer,
        inv_scale: torch.Tensor,
        found_inf: torch.Tensor,
        allow_fp16: bool,
    ) -> dict[torch.device, torch.Tensor]:
        per_device_inv_scale = _MultiDeviceReplicator(inv_scale)
        per_device_found_inf = _MultiDeviceReplicator(found_inf)

        # To set up _amp_foreach_non_finite_check_and_unscale_, split grads by device and dtype.
        per_device_and_dtype_grads: dict[
            torch.device, dict[torch.dtype, list[torch.Tensor]]
        ] = defaultdict(lambda: defaultdict(list))
        with torch.no_grad():
            for group in optimizer.param_groups:
                for param in group["params"]:
                    assert isinstance(param, torch.Tensor)
                    if param.grad is None:
                        continue
                    if (not allow_fp16) and param.grad.dtype == torch.float16:
                        raise ValueError("Attempting to unscale FP16 gradients.")
                    if param.grad.is_sparse:
                        # Coalescing sums values with duplicate indices, which could overflow
                        # for scaled fp16 grads, so coalesce first and unscale the _values().
                        if param.grad.dtype is torch.float16:
                            param.grad = param.grad.coalesce()
                        to_unscale = param.grad._values()
                    else:
                        to_unscale = param.grad

                    per_device_and_dtype_grads[to_unscale.device][
                        to_unscale.dtype
                    ].append(to_unscale)

            for device, per_dtype_grads in per_device_and_dtype_grads.items():
                for grads in per_dtype_grads.values():
                    torch._amp_foreach_non_finite_check_and_unscale_(
                        grads,
                        per_device_found_inf.get(device),
                        per_device_inv_scale.get(device),
                    )

        return per_device_found_inf._per_device_tensors

    def unscale_(self, optimizer: torch.optim.Optimizer) -> None:
        """Divides ("unscales") the optimizer's gradient tensors by the scale factor.

        :meth:`unscale_` is optional, serving cases where you need to
        :ref:`modify or inspect gradients` between the backward pass(es) and :meth:`step`.
        If :meth:`unscale_` is not called explicitly, gradients will be unscaled
        automatically during :meth:`step`.

        Simple example, using :meth:`unscale_` to enable clipping of unscaled gradients::

            ...
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            scaler.step(optimizer)
            scaler.update()

        Args:
            optimizer (torch.optim.Optimizer):  Optimizer that owns the gradients to be unscaled.

        .. note::
            :meth:`unscale_` does not incur a CPU-GPU sync.

        .. warning::
            :meth:`unscale_` should only be called once per optimizer per :meth:`step` call,
            and only after all gradients for that optimizer's assigned parameters have been
            accumulated.  Calling :meth:`unscale_` twice for a given optimizer between each
            :meth:`step` triggers a RuntimeError.

        .. warning::
            :meth:`unscale_` may unscale sparse gradients out of place, replacing the ``.grad``
            attribute.
        """
        if not self._enabled:
            return

        self._check_scale_growth_tracker("unscale_")

        optimizer_state = self._per_optimizer_states[id(optimizer)]

        if optimizer_state["stage"] is OptState.UNSCALED:
            raise RuntimeError(
                "unscale_() has already been called on this optimizer since the last update()."
            )
        elif optimizer_state["stage"] is OptState.STEPPED:
            raise RuntimeError("unscale_() is being called after step().")

        # FP32 division can be imprecise for certain compile options, so we carry out the
        # reciprocal in FP64.
        assert self._scale is not None
        inv_scale = self._scale.double().reciprocal().float()
        found_inf = torch.full((), 0.0, dtype=torch.float32, device=self._scale.device)

        optimizer_state["found_inf_per_device"] = self._unscale_grads_(
            optimizer, inv_scale, found_inf, False
        )
        optimizer_state["stage"] = OptState.UNSCALED

    def _maybe_opt_step(
        self,
        optimizer: torch.optim.Optimizer,
        optimizer_state: dict[str, Any],
        *args: Any,
        **kwargs: Any,
    ) -> Optional[float]:
        retval: Optional[float] = None
        if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
            retval = optimizer.step(*args, **kwargs)
        return retval

    def step(
        self, optimizer: torch.optim.Optimizer, *args: Any, **kwargs: Any
    ) -> Optional[float]:
        """Invoke ``unscale_(optimizer)`` followed by parameter update, if gradients are not infs/NaN.

        :meth:`step` carries out the following two operations:

        1.  Internally invokes ``unscale_(optimizer)`` (unless :meth:`unscale_` was explicitly
            called for ``optimizer`` earlier in the iteration).  As part of the :meth:`unscale_`,
            gradients are checked for infs/NaNs.
        2.  If no inf/NaN gradients are found, invokes ``optimizer.step()`` using the unscaled
            gradients.  Otherwise, ``optimizer.step()`` is skipped to avoid corrupting the params.

        ``*args`` and ``**kwargs`` are forwarded to ``optimizer.step()``.

        Returns the return value of ``optimizer.step(*args, **kwargs)``.

        Args:
            optimizer (torch.optim.Optimizer):  Optimizer that applies the gradients.
            args:  Any arguments.
            kwargs:  Any keyword arguments.

        .. warning::
            Closure use is not currently supported.
        """
        if not self._enabled:
            return optimizer.step(*args, **kwargs)

        if "closure" in kwargs:
            raise RuntimeError(
                "Closure use is not currently supported if GradScaler is enabled."
            )

        self._check_scale_growth_tracker("step")

        optimizer_state = self._per_optimizer_states[id(optimizer)]

        if optimizer_state["stage"] is OptState.STEPPED:
            raise RuntimeError(
                "step() has already been called since the last update()."
            )

        retval: Optional[float] = None

        if getattr(optimizer, "_step_supports_amp_scaling", False):
            # The optimizer owns its own scale-handling logic, so we call optimizer.step()
            # directly and hand it the scale and inf/NaN information it needs.
            kwargs_ = kwargs
            has_grad_scaler_kwarg = (
                "grad_scaler" in inspect.signature(optimizer.step).parameters
            )
            if has_grad_scaler_kwarg:
                warnings.warn(
                    "GradScaler is going to stop passing itself as a keyword argument to the passed "
                    "optimizer. In the near future GradScaler registers `grad_scale: Tensor` and "
                    "`found_inf: Tensor` to the passed optimizer and let the optimizer use them directly.",
                    FutureWarning,
                )
                kwargs_.update({"grad_scaler": self})
            else:
                if optimizer_state["stage"] is OptState.READY:
                    self._check_inf_per_device(optimizer)
                scaler = self._get_scale_async()
                assert scaler is not None
                found_inf = cast(
                    torch.Tensor,
                    sum(
                        [
                            t.to(scaler.device, non_blocking=True)
                            for t in optimizer_state["found_inf_per_device"].values()
                        ]
                    ),
                )
                optimizer.grad_scale = (
                    getattr(optimizer, "grad_scale", None)
                    if optimizer_state["stage"] == OptState.UNSCALED
                    else scaler * getattr(optimizer, "grad_scale", 1)
                )
                optimizer.found_inf = found_inf
            retval = optimizer.step(*args, **kwargs_)
            optimizer_state["stage"] = OptState.STEPPED
            if not has_grad_scaler_kwarg:
                del optimizer.grad_scale
                del optimizer.found_inf
            return retval

        if optimizer_state["stage"] is OptState.READY:
            self.unscale_(optimizer)

        assert (
            len(optimizer_state["found_inf_per_device"]) > 0
        ), "No inf checks were recorded for this optimizer."

        retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)

        optimizer_state["stage"] = OptState.STEPPED

        return retval
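
    # Bookkeeping summary: within one iteration each optimizer moves through
    # OptState.READY -> OptState.UNSCALED (after unscale_) -> OptState.STEPPED (after step),
    # and update() resets every optimizer back to READY.  That is why calling unscale_ or
    # step twice for the same optimizer between updates raises a RuntimeError, e.g.
    # (illustrative)::
    #
    #     scaler.unscale_(optimizer)
    #     scaler.unscale_(optimizer)  # RuntimeError: unscale_() has already been called ...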

    def update(self, new_scale: Optional[Union[float, torch.Tensor]] = None) -> None:
        """Update the scale factor.

        If any optimizer steps were skipped, the scale is multiplied by ``backoff_factor``
        to reduce it.  If ``growth_interval`` unskipped iterations occurred consecutively,
        the scale is multiplied by ``growth_factor`` to increase it.

        Passing ``new_scale`` sets the new scale value manually.  (``new_scale`` is not used
        directly; it's used to fill GradScaler's internal scale tensor.  So if ``new_scale``
        was a tensor, later in-place changes to that tensor will not further affect the scale
        GradScaler uses internally.)

        Args:
            new_scale (float or :class:`torch.Tensor`, optional, default=None):  New scale factor.

        .. warning::
            :meth:`update` should only be called at the end of the iteration, after
            ``scaler.step(optimizer)`` has been invoked for all optimizers used this iteration.

        .. warning::
            For performance reasons, we do not check the scale factor value to avoid
            synchronizations, so the scale factor is not guaranteed to be above 1.  If the
            scale falls below 1 and/or you are seeing NaNs in your gradients or loss,
            something is likely wrong.  For example, bf16-pretrained models are often
            incompatible with AMP/fp16 due to differing dynamic ranges.
        """
        if not self._enabled:
            return

        _scale, _growth_tracker = self._check_scale_growth_tracker("update")

        if new_scale is not None:
            assert self._scale is not None
            # Accept a new user-defined scale.
            if isinstance(new_scale, float):
                self._scale.fill_(new_scale)
            else:
                reason = (
                    "new_scale should be a float or a 1-element torch.cuda.FloatTensor or "
                    "torch.FloatTensor with requires_grad=False."
                )
                assert new_scale.device.type == self._device, reason
                assert new_scale.numel() == 1, reason
                assert new_scale.requires_grad is False, reason
                self._scale.copy_(new_scale)
        else:
            # Consolidate the found_inf tensors recorded by all the optimizers used this
            # iteration, then let the fused kernel grow or back off the scale in place.
            found_infs = [
                found_inf.to(device=_scale.device, non_blocking=True)
                for state in self._per_optimizer_states.values()
                for found_inf in state["found_inf_per_device"].values()
            ]

            assert len(found_infs) > 0, "No inf checks were recorded prior to update."

            found_inf_combined = found_infs[0]
            if len(found_infs) > 1:
                for i in range(1, len(found_infs)):
                    found_inf_combined += found_infs[i]

            torch._amp_update_scale_(
                _scale,
                _growth_tracker,
                found_inf_combined,
                self._growth_factor,
                self._backoff_factor,
                self._growth_interval,
            )

        # To prepare for next iteration, clear the data collected from optimizers this iteration.
        self._per_optimizer_states = defaultdict(_refresh_per_optimizer_state)

    def _get_scale_async(self) -> Optional[torch.Tensor]:
        return self._scale

    def __setstate__(self, state: dict[str, Any]) -> None:
        self.__dict__.update(state)

    def _check_inf_per_device(
        self, optimizer: torch.optim.Optimizer
    ) -> dict[torch.device, torch.Tensor]:
        _scale, _ = self._check_scale_growth_tracker("_check_inf_per_device")

        dummy_inv_scale = torch.full((), 1.0, dtype=torch.float32, device=_scale.device)
        found_inf = torch.full((), 0.0, dtype=torch.float32, device=_scale.device)

        self._per_optimizer_states[id(optimizer)][
            "found_inf_per_device"
        ] = self._unscale_grads_(optimizer, dummy_inv_scale, found_inf, True)

        return self._per_optimizer_states[id(optimizer)]["found_inf_per_device"]

    def _found_inf_per_device(
        self, optimizer: torch.optim.Optimizer
    ) -> dict[torch.device, torch.Tensor]:
        return self._per_optimizer_states[id(optimizer)]["found_inf_per_device"]
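

if __name__ == "__main__":
    # Minimal runnable sketch of the documented workflow (scale -> backward -> unscale_ ->
    # clip -> step -> update).  The tiny linear model, SGD optimizer, and random data are
    # illustrative stand-ins rather than anything prescribed by this module, and
    # GradScaler("cpu") is chosen only so the sketch runs without a GPU.
    from torch import nn, optim

    torch.manual_seed(0)
    model = nn.Linear(4, 2)
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scaler = GradScaler("cpu")

    for _ in range(3):
        optimizer.zero_grad()
        inputs = torch.randn(8, 4)
        targets = torch.randn(8, 2)
        loss = nn.functional.mse_loss(model(inputs), targets)

        # Scale the loss, then create scaled gradients.
        scaler.scale(loss).backward()

        # Optional: unscale first so gradient clipping sees true gradient magnitudes.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # step() skips optimizer.step() if infs/NaNs were found; update() adjusts the scale.
        scaler.step(optimizer)
        scaler.update()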