"""Implementation of the Muon optimizer."""

import math
from collections.abc import MutableMapping
from typing import Optional, Union

import torch
from torch import Tensor

from .optimizer import (
    _disable_dynamo_if_unsupported,
    _params_doc,
    _to_scalar,
    Optimizer,
    ParamsT,
)


__all__ = ["Muon"]


EPS = 1e-7
# Quintic Newton-Schulz coefficients from Keller Jordan's reference implementation.
DEFAULT_A = 3.4445
DEFAULT_B = -4.7750
DEFAULT_C = 2.0315
DEFAULT_NS_STEPS = 5


def _zeropower_via_newtonschulz(
    grad: Tensor,
    ns_coefficients: tuple[float, float, float],
    ns_steps: int,
    eps: float,
) -> Tensor:
    """
    Newton-Schulz iteration to compute the zeroth power / orthogonalization of G.

    We opt to use a quintic iteration whose coefficients are selected to maximize
    the slope at zero. For the purpose of minimizing steps, it turns out to be
    empirically effective to keep increasing the slope at zero even beyond the
    point where the iteration no longer converges all the way to one everywhere
    on the interval. This iteration therefore does not produce UV^T but rather
    something like US'V^T where S' is diagonal with S_{ii}' ~ Uniform(0.5, 1.5),
    which turns out not to hurt model performance at all relative to UV^T, where
    USV^T = G is the SVD.

    Implementation reference: https://github.com/KellerJordan/Muon/blob/master/muon.py
    with suggestions by @jxbz, @leloykun, and @YouJiacheng.
    """
    if ns_steps >= 100:
        raise ValueError(
            "Number of steps must be less than 100 for computational efficiency"
        )
    if len(grad.shape) != 2:
        raise ValueError("Input tensor gradient must be a 2D matrix")
    if len(ns_coefficients) != 3:
        raise ValueError("Coefficients must be a tuple of exactly 3 values")
    a, b, c = ns_coefficients
    ortho_grad = grad.bfloat16()
    # Iterate on the wide orientation so the Gram matrix is as small as possible.
    if grad.size(0) > grad.size(1):
        ortho_grad = ortho_grad.T
    ortho_grad.div_(ortho_grad.norm().clamp(min=eps))
    for _ in range(ns_steps):
        gram_matrix = ortho_grad @ ortho_grad.T
        # b * A + c * A @ A, where A is the Gram matrix.
        gram_update = torch.addmm(
            gram_matrix, gram_matrix, gram_matrix, beta=b, alpha=c
        )
        # a * X + (b * A + c * A @ A) @ X
        ortho_grad = torch.addmm(ortho_grad, gram_update, ortho_grad, beta=a)
    if grad.size(0) > grad.size(1):
        ortho_grad = ortho_grad.T
    return ortho_grad


def _adjust_lr(
    lr: float, adjust_lr_fn: Optional[str], param_shape: torch.Size
) -> float:
    """Default learning rate adjustment used by Muon."""
    A, B = param_shape[:2]
    if adjust_lr_fn is None or adjust_lr_fn == "original":
        adjusted_ratio = math.sqrt(max(1, A / B))
    elif adjust_lr_fn == "match_rms_adamw":
        adjusted_ratio = 0.2 * math.sqrt(max(A, B))
    else:
        adjusted_ratio = 1.0
    return lr * adjusted_ratio


class Muon(Optimizer):
    def __init__(
        self,
        params: ParamsT,
        lr: Union[float, Tensor] = 1e-3,
        weight_decay: float = 0.1,
        momentum: float = 0.95,
        nesterov: bool = True,
        ns_coefficients: tuple[float, float, float] = (
            DEFAULT_A,
            DEFAULT_B,
            DEFAULT_C,
        ),
        eps: float = EPS,
        ns_steps: int = DEFAULT_NS_STEPS,
        adjust_lr_fn: Optional[str] = None,
    ) -> None:
        if isinstance(lr, Tensor) and lr.numel() != 1:
            raise ValueError("Tensor lr must be 1-element")
        if not 0.0 <= lr:
            raise ValueError(f"Learning rate should be >= 0 but is: {lr}")
        if not 0.0 <= momentum:
            raise ValueError(f"momentum should be >= 0 but is: {momentum}")
        if not 0.0 <= weight_decay:
            raise ValueError(f"weight decay should be >= 0 but is: {weight_decay}")
        if adjust_lr_fn is not None and adjust_lr_fn not in (
            "original",
            "match_rms_adamw",
        ):
            raise ValueError(
                f"Adjust learning rate function {adjust_lr_fn} is not supported"
            )
        defaults = dict(
            lr=lr,
            weight_decay=weight_decay,
            momentum=momentum,
            nesterov=nesterov,
            ns_coefficients=ns_coefficients,
            eps=eps,
            ns_steps=ns_steps,
            adjust_lr_fn=adjust_lr_fn,
        )
        super().__init__(params, defaults)

        for group in self.param_groups:
            for p in group["params"]:
                if p.ndim != 2:
                    raise ValueError(
                        "Muon only supports 2D parameters whereas we found a "
                        f"parameter with size: {p.size()}"
                    )

    def _init_group(
        self,
        group: MutableMapping,
        params_with_grad: list[Tensor],
        grads: list[Tensor],
        muon_momentum_bufs: list[Tensor],
    ) -> bool:
        for p in group["params"]:
            if p.grad is None:
                continue
            if torch.is_complex(p):
                raise RuntimeError("Muon does not support complex parameters")
            if p.grad.is_sparse:
                raise RuntimeError("Muon does not support sparse gradients")
            params_with_grad.append(p)
            grads.append(p.grad)
            state = self.state[p]
            if "momentum_buffer" not in state:
                state["momentum_buffer"] = torch.zeros_like(
                    p.grad, memory_format=torch.preserve_format
                )
            muon_momentum_bufs.append(state["momentum_buffer"])
        # Complex parameters are rejected above, so has_complex is always False.
        return False

    @torch.no_grad()
    def step(self, closure=None):
        """Performs a single optimization step."""
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group["lr"]
            weight_decay = group["weight_decay"]
            momentum = group["momentum"]
            params_with_grad: list[Tensor] = []
            grads: list[Tensor] = []
            muon_momentum_bufs: list[Tensor] = []
            has_complex = self._init_group(
                group, params_with_grad, grads, muon_momentum_bufs
            )
            muon(
                params_with_grad,
                grads,
                muon_momentum_bufs,
                lr=lr,
                weight_decay=weight_decay,
                momentum=momentum,
                nesterov=group["nesterov"],
                ns_coefficients=group["ns_coefficients"],
                eps=group["eps"],
                ns_steps=group["ns_steps"],
                adjust_lr_fn=group["adjust_lr_fn"],
                has_complex=has_complex,
            )
        return loss


Muon.__doc__ = (
    r"""Implements Muon algorithm.

    .. math::
        \begin{aligned}
            &\rule{110mm}{0.4pt} \\
            &\textbf{input} : \gamma \text{ (lr)},\ \lambda \text{ (weight decay)},\
                \mu \text{ (momentum)},\ \textit{nesterov}\in\{True,False\},\\
            &\hspace{13mm}(a,b,c)\ \text{ (NS coefficients)},\ \varepsilon \text{ (epsilon)},\
                k \text{ (NS steps)},\ \theta_0 \text{ (params)},\ f(\theta) \text{ (objective)} \\
            &\textbf{initialize} : B_0 \leftarrow 0 \text{ (momentum buffer)} \\[-1.ex]
            &\rule{110mm}{0.4pt} \\
            &\textbf{for}\ t=1\ \textbf{to}\ \ldots\ \textbf{do} \\[0.25ex]
            &\hspace{5mm} g_t \leftarrow \nabla_{\theta} f_t(\theta_{t-1}) \\[0.25ex]
            &\hspace{5mm} B_t \leftarrow \mu B_{t-1} + g_t \\[0.25ex]
            &\hspace{5mm} \widetilde{B}_t \leftarrow
                \begin{cases}
                    g_t + \mu B_t, & \text{if nesterov}=True \\
                    B_t, & \text{if nesterov}=False
                \end{cases} \\[1.0ex]
            &\hspace{5mm} O_t \leftarrow \mathrm{NS}^{(a,b,c)}_{k}\!\big(\widetilde{B}_t;\ \varepsilon\big) \\[0.5ex]
            &\hspace{5mm} \theta_t \leftarrow \theta_{t-1} - \gamma\,\lambda\,\theta_{t-1}
                \quad\text{(decoupled weight decay)} \\[0.25ex]
            &\hspace{5mm} \gamma \leftarrow \mathrm{AdjustLR}\!\big(\gamma;\ \mathrm{shape}\!\big(\theta_t \big) \big) \\[0.25ex]
            &\hspace{5mm} \theta_t \leftarrow \theta_t - \gamma\, O_t \\
            &\rule{110mm}{0.4pt} \\[-1.ex]
            &\mathbf{return}\ \theta_t \\[-1.ex]
            &\rule{110mm}{0.4pt}
        \end{aligned}

    Here, :math:`\mathrm{NS}^{(a,b,c)}_{k}(\cdot;\varepsilon)` denotes :math:`k` iterations
    of the Newton-Schulz orthogonalization operator parameterized by coefficients
    :math:`(a,b,c)` with numerical stabilization :math:`\varepsilon`.

    The purpose of :math:`\mathrm{AdjustLR}\!\big(\gamma;\ \mathrm{shape}\!\big(\theta_t \big) \big)`
    is to make the orthogonalized update have a consistent :math:`RMS` across rectangular matrices.
    Keller's original implementation scales the update by :math:`\sqrt{\max\!\left(1, \frac{A}{B}\right)}`,
    where :math:`A` and :math:`B` are the dimensions of the matrix being optimized.
    Moonshot's implementation instead focuses on matching the :math:`RMS` of AdamW; its
    adjustment is computed as :math:`\gamma \leftarrow {0.2}\gamma\,\sqrt{\max\!\left({A}, {B}\right)}`.
    The method is adopted from `Muon is Scalable for LLM Training`_. Research results show
    that with this adjustment Muon can directly reuse the learning rate and weight decay
    tuned for AdamW.

    We provide two options for the learning rate adjustment: "original", which follows
    Keller's implementation, and "match_rms_adamw", which refers to Moonshot's
    implementation. This gives users the flexibility to choose between the two.
    If `adjust_lr_fn` is not specified, the default is "original".

    For further details regarding the algorithm we refer to
    `Muon: An optimizer for hidden layers in neural networks`_ and
    `Muon is Scalable for LLM Training`_.
    """
    + rf"""
    Args:
        {_params_doc}. Note that Muon is an optimizer for 2D parameters of
            neural network hidden layers. Other parameters, such as bias and
            embedding, should be optimized by a standard method such as AdamW.
        lr (float, Tensor, optional): learning rate (default: 1e-3)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0.1)
        momentum (float, optional): momentum factor (default: 0.95)
        nesterov (bool, optional): enables Nesterov momentum. Only applicable
            when momentum is non-zero.
        ns_coefficients (tuple of three floats, optional): coefficients
            \(a, b, c\) for the Newton-Schulz orthogonalization polynomial
            (default: ({DEFAULT_A}, {DEFAULT_B}, {DEFAULT_C}))
        eps (float, optional): term added to the denominator for numerical
            stability (default: {EPS})
        ns_steps (int, optional): number of Newton-Schulz iteration steps
            (default: {DEFAULT_NS_STEPS})
        adjust_lr_fn (str, optional): function to adjust learning rate. One of
            "original" and "match_rms_adamw". If not specified, we will default
            to use "original". (default: None)

    .. _Muon\: An optimizer for hidden layers in neural networks:
        https://kellerjordan.github.io/posts/muon/
    .. _Muon is Scalable for LLM Training:
        https://arxiv.org/pdf/2502.16982
    """
)


def _single_tensor_muon(
    params: list[Tensor],
    grads: list[Tensor],
    muon_momentum_bufs: list[Tensor],
    *,
    lr: Union[float, Tensor],
    weight_decay: float,
    momentum: float,
    nesterov: bool,
    ns_coefficients: tuple[float, float, float],
    eps: float,
    ns_steps: int,
    adjust_lr_fn: Optional[str],
    has_complex: bool,
) -> None:
    lr = _to_scalar(lr)
    if has_complex:
        raise ValueError("Complex parameters are not supported")
    for i, param in enumerate(params):
        grad = grads[i]
        if grad.ndim != 2:
            raise ValueError("Param gradient must be a 2D matrix")
        buf = muon_momentum_bufs[i]
        buf.lerp_(grad, 1 - momentum)
        update = grad.lerp(buf, momentum) if nesterov else buf
        update = _zeropower_via_newtonschulz(update, ns_coefficients, ns_steps, eps)
        adjusted_lr = _adjust_lr(lr, adjust_lr_fn, param.shape)
        # Decoupled weight decay uses the unadjusted learning rate.
        param.mul_(1 - lr * weight_decay)
        param.add_(update, alpha=-adjusted_lr)


@_disable_dynamo_if_unsupported(single_tensor_fn=_single_tensor_muon)
def muon(
    params: list[Tensor],
    grads: list[Tensor],
    muon_momentum_bufs: list[Tensor],
    foreach: Optional[bool] = None,
    *,
    lr: Union[float, Tensor],
    weight_decay: float,
    momentum: float,
    nesterov: bool,
    ns_coefficients: tuple[float, float, float],
    eps: float,
    ns_steps: int,
    adjust_lr_fn: Optional[str],
    has_complex: bool,
):
    r"""Functional API that performs Muon algorithm computation.

    See :class:`~torch.optim.Muon` for details.
    """
    if foreach:
        raise RuntimeError("Foreach is not supported for Muon yet")

    func = _single_tensor_muon
    func(
        params,
        grads,
        muon_momentum_bufs,
        lr=lr,
        weight_decay=weight_decay,
        momentum=momentum,
        nesterov=nesterov,
        ns_coefficients=ns_coefficients,
        eps=eps,
        ns_steps=ns_steps,
        adjust_lr_fn=adjust_lr_fn,
        has_complex=has_complex,
    )
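The two learning-rate adjustment modes described in the docstring ("original" and "match_rms_adamw") can be sketched with the standard library alone. This is an illustrative standalone sketch, not part of the module's API: the helper name `adjust_lr` and the example shapes below are made up for demonstration.

```python
import math


def adjust_lr(lr, adjust_lr_fn, param_shape):
    # "original" follows Keller Jordan's sqrt(max(1, A/B)) scaling;
    # "match_rms_adamw" follows Moonshot's 0.2 * sqrt(max(A, B)) scaling,
    # which aims to match the update RMS of AdamW.
    A, B = param_shape[:2]
    if adjust_lr_fn is None or adjust_lr_fn == "original":
        ratio = math.sqrt(max(1, A / B))
    elif adjust_lr_fn == "match_rms_adamw":
        ratio = 0.2 * math.sqrt(max(A, B))
    else:
        ratio = 1.0
    return lr * ratio


# For an illustrative tall 1024x256 weight matrix:
# "original" scales lr by sqrt(1024/256) = 2;
# "match_rms_adamw" scales lr by 0.2 * sqrt(1024) = 6.4.
print(adjust_lr(1e-3, "original", (1024, 256)))
print(adjust_lr(1e-3, "match_rms_adamw", (1024, 256)))
```

Note that "original" never shrinks the step for wide matrices (the ratio is clamped at 1), while "match_rms_adamw" grows with the larger dimension regardless of orientation.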