from __future__ import annotations

import gc
import typing
from typing import Callable, Optional, overload, TYPE_CHECKING, Union

from typing_extensions import ParamSpec, Self, TypeAlias, TypeVar

import torch
from torch import Tensor

if TYPE_CHECKING:
    from torch.cuda import _POOL_HANDLE

from torch._utils import _dummy_type

__all__ = [
    "is_current_stream_capturing",
    "graph_pool_handle",
    "CUDAGraph",
    "graph",
    "make_graphed_callables",
]

_R = TypeVar("_R")
_P = ParamSpec("_P")

if not hasattr(torch._C, "_CudaStreamBase"):
    # Define dummy base classes so this module can be imported on builds without CUDA.
    torch._C.__dict__["_CUDAGraph"] = _dummy_type("_CUDAGraph")
    torch._C.__dict__["_graph_pool_handle"] = _dummy_type("_graph_pool_handle")
    torch._C.__dict__["_cuda_isCurrentStreamCapturing"] = _dummy_type(
        "_cuda_isCurrentStreamCapturing"
    )

from torch._C import (  # noqa: F401
    _cuda_isCurrentStreamCapturing,
    _CUDAGraph,
    _graph_pool_handle,
)


def is_current_stream_capturing() -> bool:
    r"""Return True if CUDA graph capture is underway on the current CUDA stream, False otherwise.

    If a CUDA context does not exist on the current device,
    returns False without initializing the context.
    """
    return _cuda_isCurrentStreamCapturing()


def graph_pool_handle() -> _POOL_HANDLE:
    r"""Return an opaque token representing the id of a graph memory pool.

    See :ref:`Graph memory management<graph-memory-management>`.

    .. warning::
        This API is in beta and may change in future releases.
    """
    return _graph_pool_handle()
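
# Usage sketch (illustrative only; the ``_example_*`` helper, its tensor argument, and the
# print are assumptions made for illustration, not part of the public API): guarding a
# host synchronization with ``is_current_stream_capturing``. Synchronizing calls such as
# ``Tensor.item()`` are illegal while a capture is underway, so code that may run under
# capture can branch on the capture state.
def _example_capture_guard(t: Tensor) -> None:
    if not is_current_stream_capturing():
        # Safe: no capture is in progress, so a device-to-host sync is allowed.
        print(t.sum().item())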

class CUDAGraph(torch._C._CUDAGraph):
    r"""Wrapper around a CUDA graph.

    Arguments:
        keep_graph (bool, optional): If ``keep_graph=False``, the cudaGraphExec_t will be
            instantiated on GPU at the end of ``capture_end`` and the underlying cudaGraph_t
            will be destroyed. Users who want to query or otherwise modify the underlying
            cudaGraph_t before instantiation can set ``keep_graph=True`` and access it via
            ``raw_cuda_graph`` after ``capture_end``. Note that the cudaGraphExec_t will not
            be instantiated at the end of ``capture_end`` in this case. Instead, it will be
            instantiated via an explicit call to ``instantiate`` or automatically on the
            first call to ``replay`` if ``instantiate`` was not already called. Calling
            ``instantiate`` manually before ``replay`` is recommended to prevent increased
            latency on the first call to ``replay``. It is allowed to modify the raw
            cudaGraph_t after first calling ``instantiate``, but the user must call
            ``instantiate`` again manually to make sure the instantiated graph has these
            changes. PyTorch has no means of tracking these changes.

    .. warning::
        This API is in beta and may change in future releases.
    """

    def __new__(cls, keep_graph: bool = False) -> Self:
        return super().__new__(cls, keep_graph)

    def capture_begin(
        self, pool: Optional[_POOL_HANDLE] = None, capture_error_mode: str = "global"
    ) -> None:
        r"""Begin capturing CUDA work on the current stream.

        Typically, you shouldn't call ``capture_begin`` yourself.
        Use :class:`~torch.cuda.graph` or :func:`~torch.cuda.make_graphed_callables`,
        which call ``capture_begin`` internally.

        Arguments:
            pool (optional): Token (returned by :func:`~torch.cuda.graph_pool_handle` or
                :meth:`other_Graph_instance.pool()`) that hints this graph may share memory
                with the indicated pool. See :ref:`Graph memory management<graph-memory-management>`.
            capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
                Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as
                cudaMalloc, may be unsafe. "global" will error on actions in other threads, "thread_local" will
                only error for actions in the current thread, and "relaxed" will not error on these actions.
                Do NOT change this setting unless you're familiar with
                `cudaStreamCaptureMode <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g9d0535d93a214cbf126835257b16ba85>`_
        """
        super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)

    def capture_end(self) -> None:
        r"""End CUDA graph capture on the current stream.

        After ``capture_end``, ``replay`` may be called on this instance.

        Typically, you shouldn't call ``capture_end`` yourself.
        Use :class:`~torch.cuda.graph` or :func:`~torch.cuda.make_graphed_callables`,
        which call ``capture_end`` internally.
        """
        super().capture_end()

    def instantiate(self) -> None:
        r"""Instantiate the CUDA graph.

        Called by ``capture_end`` if ``keep_graph=False``, or by ``replay`` if
        ``keep_graph=True`` and ``instantiate`` has not already been explicitly called.
        Does not destroy the cudaGraph_t returned by ``raw_cuda_graph``.
        """
        super().instantiate()

    def replay(self) -> None:
        r"""Replay the CUDA work captured by this graph."""
        super().replay()

    def reset(self) -> None:
        r"""Delete the graph currently held by this instance."""
        super().reset()

    def pool(self) -> _POOL_HANDLE:
        r"""Return an opaque token representing the id of this graph's memory pool.

        This id can optionally be passed to another graph's ``capture_begin``,
        which hints the other graph may share the same memory pool.
        """
        return super().pool()

    def enable_debug_mode(self) -> None:
        r"""Enable debugging mode for CUDAGraph.debug_dump."""
        return super().enable_debug_mode()

    def debug_dump(self, debug_path: str) -> None:
        r"""Dump the captured graph for debugging.

        Arguments:
            debug_path (required): Path to dump the graph to.

        Calls a debugging function to dump the graph if debugging is
        enabled via CUDAGraph.enable_debug_mode().
        """
        return super().debug_dump(debug_path)

    def raw_cuda_graph(self) -> int:
        r"""Returns the underlying cudaGraph_t. ``keep_graph`` must be True.

        See the following for APIs for how to manipulate this object:
        `Graph Management`_ and `cuda-python Graph Management bindings`_
        """
        return super().raw_cuda_graph()

    def raw_cuda_graph_exec(self) -> int:
        r"""Returns the underlying cudaGraphExec_t.

        ``instantiate`` must have been called if ``keep_graph`` is True, or ``capture_end``
        must have been called if ``keep_graph`` is False. If you call ``instantiate()`` after
        ``raw_cuda_graph_exec()``, the previously returned cudaGraphExec_t will be destroyed.
        It is your responsibility not to use this object after destruction.

        See the following for APIs for how to manipulate this object:
        `Graph Execution`_ and `cuda-python Graph Execution bindings`_
        """
        return super().raw_cuda_graph_exec()
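
# Usage sketch (illustrative only; the ``_example_*`` helper, shapes, and the element-wise
# workload are assumptions; warmup on a side stream is omitted for brevity): the
# ``keep_graph=True`` workflow described in the class docstring above — keep the raw
# cudaGraph_t alive after capture, instantiate explicitly, then replay.
def _example_keep_graph_workflow() -> None:
    static_x = torch.zeros(8, device="cuda")
    g = CUDAGraph(keep_graph=True)
    with graph(g):
        static_y = static_x * 2 + 1  # noqa: F841
    _raw = g.raw_cuda_graph()  # cudaGraph_t is still alive because keep_graph=True
    g.instantiate()  # instantiate explicitly to avoid extra latency on the first replay
    static_x.fill_(3.0)
    g.replay()  # static_y now holds 3 * 2 + 1 = 7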

class graph:
    r"""Context-manager that captures CUDA work into a :class:`torch.cuda.CUDAGraph` object for later replay.

    See :ref:`CUDA Graphs <cuda-graph-semantics>` for a general introduction,
    detailed use, and constraints.

    Arguments:
        cuda_graph (torch.cuda.CUDAGraph): Graph object used for capture.
        pool (optional): Opaque token (returned by a call to :func:`~torch.cuda.graph_pool_handle()` or
            :meth:`other_Graph_instance.pool()`) hinting this graph's capture
            may share memory from the specified pool. See :ref:`Graph memory management<graph-memory-management>`.
        stream (torch.cuda.Stream, optional): If supplied, will be set as the current stream in the context.
            If not supplied, ``graph`` sets its own internal side stream as the current stream in the context.
        capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
            Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as
            cudaMalloc, may be unsafe. "global" will error on actions in other threads, "thread_local" will
            only error for actions in the current thread, and "relaxed" will not error on actions.
            Do NOT change this setting unless you're familiar with `cudaStreamCaptureMode`_

    .. note::
        For effective memory sharing, if you pass a ``pool`` used by a previous capture and the previous capture
        used an explicit ``stream`` argument, you should pass the same ``stream`` argument to this capture.

    .. warning::
        This API is in beta and may change in future releases.

    .. _cudaStreamCaptureMode:
        https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g9d0535d93a214cbf126835257b16ba85
    """

    default_capture_stream: typing.Optional["torch.cuda.Stream"] = None

    def __init__(
        self,
        cuda_graph: CUDAGraph,
        pool: Optional[_POOL_HANDLE] = None,
        stream: Optional[torch.cuda.Stream] = None,
        capture_error_mode: str = "global",
    ) -> None:
        # Lazy-init of default_capture_stream helps avoid circular-import errors.
        if self.__class__.default_capture_stream is None:
            self.__class__.default_capture_stream = torch.cuda.Stream()

        self.pool = () if pool is None else (pool,)
        self.capture_stream = (
            stream if stream is not None else self.__class__.default_capture_stream
        )
        assert self.capture_stream is not None
        self.stream_ctx = torch.cuda.stream(self.capture_stream)
        self.cuda_graph = cuda_graph
        self.capture_error_mode = capture_error_mode

    def __enter__(self) -> None:
        # Free as much memory as we can for the graph.
        torch.cuda.synchronize()
        if torch.compiler.config.force_cudagraph_gc:
            gc.collect()
        torch.cuda.empty_cache()

        # Enter the stream context manually so it can be exited in __exit__.
        self.stream_ctx.__enter__()

        self.cuda_graph.capture_begin(
            *self.pool, capture_error_mode=self.capture_error_mode
        )

    def __exit__(self, *args: object) -> None:
        self.cuda_graph.capture_end()
        self.stream_ctx.__exit__(*args)
        # Returning None propagates exceptions from capture_end or stream_ctx.__exit__().
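
# Usage sketch (illustrative only; the ``_example_*`` helper, the model, optimizer, loss,
# and shapes are assumptions): the typical warmup-then-capture-then-replay pattern for a
# whole training step using the ``graph`` context manager. Inputs and outputs of the
# captured region must be reused ("static") tensors; new data is copied into them before
# each replay.
def _example_whole_step_capture() -> None:
    model = torch.nn.Linear(16, 4).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    static_input = torch.randn(8, 16, device="cuda")
    static_target = torch.randn(8, 4, device="cuda")

    # Warmup on a side stream so lazy initialization does not end up in the capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            opt.zero_grad(set_to_none=True)
            loss = torch.nn.functional.mse_loss(model(static_input), static_target)
            loss.backward()
            opt.step()
    torch.cuda.current_stream().wait_stream(s)

    # Capture one full step: forward, backward, and optimizer update.
    g = CUDAGraph()
    opt.zero_grad(set_to_none=True)
    with graph(g):
        static_loss = torch.nn.functional.mse_loss(model(static_input), static_target)
        static_loss.backward()
        opt.step()

    # Replay with new data: fill the static inputs, then replay the whole step.
    static_input.copy_(torch.randn(8, 16, device="cuda"))
    static_target.copy_(torch.randn(8, 4, device="cuda"))
    g.replay()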

_ModuleOrCallable: TypeAlias = Union["torch.nn.Module", Callable[..., object]]


@overload
def make_graphed_callables(
    callables: _ModuleOrCallable,
    sample_args: tuple[Tensor, ...],
    num_warmup_iters: int = ...,
    allow_unused_input: bool = ...,
    pool: Optional[_POOL_HANDLE] = ...,
) -> _ModuleOrCallable: ...


@overload
def make_graphed_callables(
    callables: tuple[_ModuleOrCallable, ...],
    sample_args: tuple[tuple[Tensor, ...], ...],
    num_warmup_iters: int = ...,
    allow_unused_input: bool = ...,
    pool: Optional[_POOL_HANDLE] = ...,
) -> tuple[_ModuleOrCallable, ...]: ...


def make_graphed_callables(
    callables: Union[_ModuleOrCallable, tuple[_ModuleOrCallable, ...]],
    sample_args: Union[tuple[Tensor, ...], tuple[tuple[Tensor, ...], ...]],
    num_warmup_iters: int = 3,
    allow_unused_input: bool = False,
    pool: Optional[_POOL_HANDLE] = None,
) -> Union[_ModuleOrCallable, tuple[_ModuleOrCallable, ...]]:
    r"""Accept callables (functions or :class:`nn.Module`\ s) and return graphed versions.

    Each graphed callable's forward pass runs its source callable's
    forward CUDA work as a CUDA graph inside a single autograd node.

    The graphed callable's forward pass also appends a backward node to the autograd
    graph. During backward, this node runs the callable's backward work as a CUDA graph.

    Therefore, each graphed callable should be a drop-in replacement for its source
    callable in an autograd-enabled training loop.

    See :ref:`Partial-network capture<partial-network-capture>` for detailed use and constraints.

    If you pass a tuple of several callables, their captures will use the same memory pool.
    See :ref:`Graph memory management<graph-memory-management>` for when this is appropriate.

    Arguments:
        callables (torch.nn.Module or Python function, or tuple of these): Callable or callables to graph.
            See :ref:`Graph memory management<graph-memory-management>` for when passing a tuple of callables
            is appropriate. If you pass a tuple of callables, their order in the tuple must be the same order
            they'll run in the live workload.
        sample_args (tuple of Tensors, or tuple of tuples of Tensors): Sample args for each callable.
            If a single callable was passed, ``sample_args`` must be a single tuple of argument Tensors.
            If a tuple of callables was passed, ``sample_args`` must be a tuple of tuples of argument Tensors.
        num_warmup_iters (int): The number of warmup iterations. Currently, ``DistributedDataParallel`` needs
            11 iterations for warm up. Default: ``3``.
        allow_unused_input (bool): If False, specifying inputs that were not used when computing outputs
            (and therefore their grad is always zero) is an error. Defaults to False.
        pool (optional): Token (returned by :func:`~torch.cuda.graph_pool_handle` or
            :meth:`other_Graph_instance.pool()`) that hints this graph may share memory
            with the indicated pool. See :ref:`Graph memory management<graph-memory-management>`.

    .. note::
        The ``requires_grad`` state of each Tensor in ``sample_args`` must match the state
        that's expected for the corresponding real input in the training loop.

    .. warning::
        This API is in beta and may change in future releases.

    .. warning::
        ``sample_args`` for each callable must contain only Tensors. Other types are not allowed.

    .. warning::
        Returned callables do not support higher order differentiation (e.g., double backward).

    .. warning::
        In any :class:`~torch.nn.Module` passed to :func:`~make_graphed_callables`, only parameters
        may be trainable. Buffers must have ``requires_grad=False``.

    .. warning::
        After you pass a :class:`torch.nn.Module` through :func:`~make_graphed_callables`,
        you may not add or remove any of that Module's parameters or buffers.

    .. warning::
        :class:`torch.nn.Module`\s passed to :func:`~torch.cuda.make_graphed_callables` must not have
        module hooks registered on them at the time they are passed. However, registering hooks on modules
        *after* passing them through :func:`~torch.cuda.make_graphed_callables` is allowed.

    .. warning::
        When running a graphed callable, you must pass its arguments in the same order and format
        they appeared in that callable's ``sample_args``.

    .. warning::
        Automatic mixed precision is supported in :func:`~torch.cuda.make_graphed_callables` only with
        caching disabled: the context manager `torch.cuda.amp.autocast()` must have `cache_enabled=False`.
    """
    if torch.is_autocast_enabled() and torch.is_autocast_cache_enabled():
        raise RuntimeError(
            "make_graphed_callables does not support the autocast caching. "
            "Please set `cache_enabled=False`."
        )

    just_one_callable = False

    if not isinstance(callables, tuple):
        just_one_callable = True
        callables = (callables,)
        sample_args = typing.cast(tuple[tuple[Tensor, ...]], (sample_args,))
    else:
        sample_args = typing.cast(tuple[tuple[Tensor, ...], ...], sample_args)

    flatten_sample_args = []

    for c, args in zip(callables, sample_args):
        if isinstance(c, torch.nn.Module):
            assert (
                len(c._backward_hooks) == 0
                and len(c._forward_hooks) == 0
                and len(c._forward_pre_hooks) == 0
            ), (
                "Modules must not have hooks registered at the time they are passed. "
                "However, registering hooks on modules after passing them through "
                "make_graphed_callables is allowed."
            )
            assert all(b.requires_grad is False for b in c.buffers()), (
                "In any :class:`~torch.nn.Module` passed to "
                ":func:`~make_graphed_callables`, only parameters may be trainable. "
                "All buffers must have ``requires_grad=False``."
            )
        flatten_arg = torch.utils._pytree.arg_tree_leaves(*args)
        flatten_sample_args.append(tuple(flatten_arg))
        assert all(isinstance(arg, torch.Tensor) for arg in flatten_arg), (
            "In the beta API, sample_args for each callable must contain only Tensors. "
            "Other types are not allowed."
        )

    # If a callable is an nn.Module, its graph's full input surface is the args the user
    # explicitly passes to forward (i.e., its sample_args) AND the module's parameters.
    per_callable_len_user_args = [len(args) for args in flatten_sample_args]
    per_callable_module_params = [
        tuple(c.parameters()) if isinstance(c, torch.nn.Module) else ()
        for c in callables
    ]
    per_callable_static_input_surfaces = [
        flatten_sample_args[i] + per_callable_module_params[i]
        for i in range(len(callables))
    ]

    fwd_graphs = [torch.cuda.CUDAGraph() for _ in range(len(callables))]
    bwd_graphs = [torch.cuda.CUDAGraph() for _ in range(len(callables))]

    mempool = graph_pool_handle() if pool is None else pool

    # Warmup: run each callable (and its backward) a few times on a side stream, so
    # lazy-initialization CUDA work does not end up in any capture.
    torch.cuda.synchronize()
    with torch.cuda.stream(torch.cuda.Stream()):
        for func, args, static_input_surface in zip(
            callables, sample_args, per_callable_static_input_surfaces
        ):
            grad_inputs, outputs, outputs_grad = None, None, None
            for _ in range(num_warmup_iters):
                outputs = torch.utils._pytree.tree_leaves(func(*args))
                outputs_grad = tuple(o for o in outputs if o.requires_grad)
                if len(outputs_grad) > 0:
                    grad_inputs = torch.autograd.grad(
                        outputs=outputs_grad,
                        inputs=tuple(
                            i for i in static_input_surface if i.requires_grad
                        ),
                        grad_outputs=tuple(
                            torch.empty_like(o) for o in outputs if o.requires_grad
                        ),
                        only_inputs=True,
                        allow_unused=allow_unused_input,
                    )
            for v in [outputs, outputs_grad, grad_inputs]:
                del v

    torch.cuda.synchronize()
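    # Everything below is captured into one shared memory pool (``mempool``). To keep
    # replays from corrupting each other's memory, the passes are captured in the same
    # order they will run later: fwd 1, fwd 2, ... fwd N, then bwd N, bwd N-1, ... bwd 1.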
    # Capture forward graphs.
    per_callable_static_outputs = []
    per_callable_output_unflatten_spec = []
    for func, args, fwd_graph in zip(callables, sample_args, fwd_graphs):
        with torch.cuda.graph(fwd_graph, pool=mempool):
            func_outputs = func(*args)

        flatten_outputs, spec = torch.utils._pytree.tree_flatten(func_outputs)
        per_callable_static_outputs.append(tuple(flatten_outputs))
        per_callable_output_unflatten_spec.append(spec)

    # Capture backward graphs in reverse order.
    per_callable_static_grad_outputs = []
    per_callable_static_grad_inputs = []
    for static_input_surface, static_outputs, bwd_graph in zip(
        reversed(per_callable_static_input_surfaces),
        reversed(per_callable_static_outputs),
        reversed(bwd_graphs),
    ):
        # Static incoming-gradient buffers, one per output that requires grad.
        static_grad_outputs = tuple(
            torch.empty_like(o) if o.requires_grad else None for o in static_outputs
        )

        outputs_grad = tuple(o for o in static_outputs if o.requires_grad)
        grad_inputs = None
        if len(outputs_grad) > 0:
            with torch.cuda.graph(bwd_graph, pool=mempool):
                grad_inputs = torch.autograd.grad(
                    outputs=outputs_grad,
                    inputs=tuple(i for i in static_input_surface if i.requires_grad),
                    grad_outputs=tuple(o for o in static_grad_outputs if o is not None),
                    only_inputs=True,
                    allow_unused=allow_unused_input,
                )

        # Construct a tuple suitable for returning from Graphed's backward:
        # pad the actually-needed grads with Nones in gradient slots for inputs
        # that don't require grad.
        static_grad_inputs = []
        grad_idx = 0
        for arg in static_input_surface:
            if arg.requires_grad and grad_inputs is not None:
                static_grad_inputs.append(grad_inputs[grad_idx])
                grad_idx += 1
            else:
                static_grad_inputs.append(None)  # type: ignore[arg-type]
        static_grad_inputs = tuple(static_grad_inputs)  # type: ignore[assignment]

        per_callable_static_grad_outputs.append(static_grad_outputs)
        per_callable_static_grad_inputs.append(static_grad_inputs)

    # Reverse the two lists built above so per_callable_*[i] holds the data for the
    # ith callable again.
    per_callable_static_grad_outputs.reverse()
    per_callable_static_grad_inputs.reverse()
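    # Each callable is wrapped in an autograd.Function: its forward copies fresh user
    # args into the captured static input surface and replays the forward graph; its
    # backward copies incoming grads into the static grad_output buffers and replays
    # the backward graph, returning the captured static grad_inputs.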
    def make_graphed_autograd_function(
        fwd_graph: torch.cuda.CUDAGraph,
        bwd_graph: torch.cuda.CUDAGraph,
        module_params: tuple[torch.nn.Parameter, ...],
        len_user_args: int,
        output_unflatten_spec: torch.utils._pytree.TreeSpec,
        static_input_surface: tuple[Tensor, ...],
        static_outputs: tuple[Tensor, ...],
        static_grad_outputs: tuple[Optional[Tensor], ...],
        static_grad_inputs: tuple[Tensor, ...],
    ) -> Callable[..., object]:
        class Graphed(torch.autograd.Function):
            @staticmethod
            def forward(ctx: object, *inputs: Tensor) -> tuple[Tensor, ...]:
                # At this stage, only the user args may (potentially) be new tensors.
                for i in range(len_user_args):
                    if static_input_surface[i].data_ptr() != inputs[i].data_ptr():
                        static_input_surface[i].copy_(inputs[i])
                fwd_graph.replay()
                assert isinstance(static_outputs, tuple)
                return tuple(o.detach() for o in static_outputs)

            @staticmethod
            @torch.autograd.function.once_differentiable
            def backward(ctx: object, *grads: Tensor) -> tuple[Tensor, ...]:
                assert len(grads) == len(static_grad_outputs)
                for g, grad in zip(static_grad_outputs, grads):
                    if g is not None:
                        # Don't copy if the incoming grad already lives in the
                        # captured buffer.
                        if g.data_ptr() != grad.data_ptr():
                            g.copy_(grad)
                bwd_graph.replay()

                # Input args that didn't require grad expect a None gradient.
                assert isinstance(static_grad_inputs, tuple)
                return tuple(
                    b.detach() if b is not None else b for b in static_grad_inputs
                )

        def functionalized(*user_args: object) -> object:
            # Runs the autograd function with inputs == all inputs to the graph that
            # might require grad (explicit user args + module parameters).
            # Assumes module params didn't change since capture.
            flatten_user_args = torch.utils._pytree.arg_tree_leaves(*user_args)
            out = Graphed.apply(*(tuple(flatten_user_args) + module_params))
            return torch.utils._pytree.tree_unflatten(out, output_unflatten_spec)

        return functionalized

    # Put together the final graphed callables.
    ret = []
    for i, func in enumerate(callables):
        graphed = make_graphed_autograd_function(
            fwd_graphs[i],
            bwd_graphs[i],
            per_callable_module_params[i],
            per_callable_len_user_args[i],
            per_callable_output_unflatten_spec[i],
            per_callable_static_input_surfaces[i],
            per_callable_static_outputs[i],
            per_callable_static_grad_outputs[i],
            per_callable_static_grad_inputs[i],
        )

        if isinstance(func, torch.nn.Module):

            def make_graphed_forward(
                func: torch.nn.Module,
                graph_training_state: bool,
                graphed: Callable[_P, _R],
                orig_fwd: Callable[_P, _R],
            ) -> Callable[_P, _R]:
                def new_fwd(*user_args: _P.args, **user_kwargs: _P.kwargs) -> _R:
                    # If the module's training-or-eval state matches what we graphed,
                    # run the graph, otherwise run the original forward method.
                    if func.training == graph_training_state:
                        return graphed(*user_args, **user_kwargs)
                    else:
                        return orig_fwd(*user_args, **user_kwargs)

                return new_fwd

            func.forward = make_graphed_forward(
                func, func.training, graphed, func.forward
            )  # type: ignore[assignment]
            ret.append(func)
        else:
            ret.append(graphed)

    if just_one_callable:
        return ret[0]

    return tuple(ret)
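
# Usage sketch (illustrative only; the ``_example_*`` helper, module sizes, loss, and the
# surrounding loop are assumptions): graphing one section of a network with
# ``make_graphed_callables`` and dropping it back into an ordinary eager training loop.
def _example_make_graphed_callables() -> None:
    section = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU()).cuda()
    head = torch.nn.Linear(32, 1).cuda()

    # The requires_grad state of sample_args must match the real inputs seen in training.
    sample_input = torch.randn(4, 32, device="cuda", requires_grad=True)
    section = make_graphed_callables(section, (sample_input,))

    opt = torch.optim.SGD(list(section.parameters()) + list(head.parameters()), lr=0.1)
    for _ in range(3):
        real_input = torch.randn(4, 32, device="cuda", requires_grad=True)
        opt.zero_grad(set_to_none=True)
        out = head(section(real_input))  # graphed forward runs as one autograd node
        out.sum().backward()             # graphed backward replays inside autograd
        opt.step()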