K i4OddlZddlZddlZddlZddlZddlZddlmZddlm Z m Z m Z ddl m Zddl mZdZdZd Zdd Zdd Zdd ZGd dZGddZdZddZddZdZeddZddZy)N)contextmanager)AnyDictList)language)runtimecdj|}dddd|zdg}tj|}|jtj j jd}|Dcgc] }t|}}|Scc}w)N, nvidia-smi-i0z --query-gpu=z--format=csv,noheader,nounits) join subprocess check_outputdecodesysstdoutencodingsplitint)attrscmdoutretxs T/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/triton/testing.pynvsmir sy HHUOE sNU$:<[ \C  ! !# &C **SZZ(( ) / / 4C a3q6 C  J s-Bctttfd}|Dcgc] }|| c}Scc}w)Ncd|cxkrdkstdtd|dz z}tj|}tj|}||z }d|z |z||zzS)Nrrz%Quantiles must be in the range [0, 1]) ValueErrormathfloorceil)qpointloweruppertans r get_quantilez_quantile..get_quantiles}Q ! DE EDE EQU  5! %  EMA5!A%L00)lensorted)r*r%r,r+s` @r _quantiler0s4 AAq A1&' 'LO '' 's5c|!t||}t|dk(r|d}|S|dk(r|S|dk(r t|S|dk(r t|S|dk(rt j |S|dk(rt j |Sy)Nrrallminmaxmeanmedian)r0r.r3r4 statisticsr5r6)times quantiles return_moders r_summarize_statisticsr;*sy) s8q=a&C e  5z  5z  u%%    '' !r-c ddl}|dvsJ|jj|jj5||/|D]*}|j |j dd|_,|jjd}|jjd}|jtdD] } | |j|jj|j|dz } | dk(rd} ntdt|| z } |jj} |jj| 5t| D]} ||D] }d|_ | ddd|jjg} d }t|D]} |jjd}|jjd}|j| j!|j|jj| |j|| z gz } t#| ||cdddS#1swYxYw#1swYyxYw) a Benchmark the runtime of the provided function. :param fn: Function to benchmark :type fn: Callable :param rep: Repetition time (in ms) :type rep: int :param grad_to_none: Reset the gradient of the provided tensor to None :type grad_to_none: torch.tensor, optional :param return_mode: The statistical measure to return. Options are "min", "max", "mean", "median", or "all". Default is "mean". :type return_mode: str rNr3r4r5r6r2T enable_timingir )torchcudastreamStreamdetach_requires_grad_gradEventrecordrange synchronize elapsed_timer4r CUDAGraphgraphreplayr;)fnrep grad_to_noner9r:rBr start_event end_event_ estimate_msn_repeatgr n_retriess rdo_bench_cudagraphr[<s` A AA A   5::,,. /0B   #!    & jj&&T&: JJ$$4$8 q A D   !..y9A= ! H1c# "345H JJ " ZZ  a  8_ +)&!%&       y! DA*****>K ((t(r@rN)r driveractiveget_device_interfacerLget_empty_cache_for_benchmarkrIrJrK clear_cacherMr4rrHzipr;)rQwarmuprRrSr9r:dicacherTrUrVrWn_warmuprXirser8s rdo_benchrjs*$ A AA A    3 3 5BDNN NN ! ! ? ? AE(((.Kt,I 1X ))%0  NN**959K1c&;./0H1c# +,-H9>xIA288$8/IKI7>1*=VVvV;+.(FE5VH$E7"E7" ##1g07:WDBFF3r7O' E* ?? JJLBajG!%"2"23 V1!!f*~r!f*~u,1LLell1oa(d,1LLell1oa(d7 RU!33G||~))+ELLN4F4F4H!LL/E!LL/EOOBwKTQTOU V IIK MM%,,1' 2 MM%,, ' MM5;;%H = MM5;;%H =  BGGLLu6Gt4LMN %*** +  q(**,JD$DBtH,BvJ  %//C' ( ",,. !  IIbggll90A.FGXZ[iZjjkVl!  # y76 )!;+.d5EF;s# Q Q#, Q(Q--R?Rc t|jt}|r |jgn |j}g} |D]'} |j|j| |||fi|) |rt j |dtt jj|dd5} | jd|dt|D]!} | jd| jd#| jdddd|r |r|d S|Sy#1swYxYw#|rt j |dtt jj|dd5} | jd|dt|D]!} | jd| jd#| jddddw#1swYwxYwwxYw) NT)exist_okz results.htmlwz z z r) rrrrappendrrmakedirsopenrrwriter.r) rrrr return_dfkwargshas_single_benchr result_dfsrhtmls rrunzMark.runs%dooyA*:doo&   3# a!!)$))E9j*"_X^"_` a I5"'',,y.A3G34JJ/0!+,%chC)CCCSWCJr-rcfd}|S)z Mark a function for benchmarking. The benchmark can then be executed by using the :code:`.run` method on the return value. :param benchmarks: Benchmarking configurations. :type benchmarks: List of :class:`Benchmark` ct|Sr)rrs rzperf_report..sb*-r-r)rwrappers` r perf_reportrs.G Nr-cddl}ddlm}|s|jj }|j j j|d}|j j j|d}||zdzdz d z }|S) z return DRAM bandwidth in GB/s rNrr]mem_clock_rate mem_bus_widthrg.A)rBr r]rCcurrent_devicer^utilsget_device_properties)devicerBr] mem_clock_khz bus_widthbw_gbpss r get_dram_gbpsrsz **,MM''==fEFVWM ##99&A/RIi'!+c1A5G Nr-cJddl}ddlm}|s|jj }|j j j|ddz}|jj|}|ddkr||jk(sJd}n||j|jfvrd}nr||j|j|jfvrd}nJ||jtj tj"tj$fvrd }n t'd ||z|zd z}|S) Nrrrmultiprocessor_countriidtype not supported& .>)rBr r]rCrr^rrget_device_capabilityfloat16float32int32rwint16int8tl float8e4nv float8e4b15float8e5 RuntimeError rv clock_raterrBr] num_subcores capabilityops_per_sub_coretflopss rget_max_tensorcore_tflopsrs **,==&&<.decorator..wrappers! rzz|499;I - 3 3 5 G  Y/%Aww''(;(;J(GH!zz&1UXY F*n,nn* +0099<<b!1!1 2!G9A> nnox%L]agjk~~*e,ee*0C OCCC((r-) functoolswraps)r<rr;s` r decoratorz cuda_memcheck..decorators%  ! ) " )"r-r)r;r?s` r cuda_memcheckr@s, r-c #K tjgdtjdddd|d|gtjdddd|d|gtdgd }td gd }t||z d ks Jd |d t||z d ks Jd |d d|z}d|zdz}||ftjgdtjgdtjgdy#tjgdtjgdtjgdwxYww)N)r r r-pmr$r r rz--lock-gpu-clocks=r z--lock-memory-clocks=zclocks.current.smrzclocks.current.memoryrAzGPU SMs must run at z MHzg3O?igMbP?)r r rrBr)r r rz-rgc)r r rz-rmc)rrrabs) ref_sm_clock ref_mem_clock cur_sm_clock cur_mem_clockrgbpss r set_gpu_clockrIsmC EF    a ~ > !     #M?!M? C !  123A6 678; <,./"4_8L\NZ^6__4==01B6b:N}o]a8bb6)L8&-dl EF AB AB  EF AB ABsEB>DAEAEEcddl}ddlm}|s|jj }|j j j|ddz}|jj}|ddkr/||jk(rd}nW||jk(rd}nEtd ||jk(rd}n(||j|jfvrd}n td ||z|zd z}|S) Nrrrrr r @r r ) rBr r]rCrr^rrr rrrrwrs rget_max_simd_tflopsrMs **,==&&<rSs  %"" ( ($@BF?@D0^f@@F``F :6CC8r-