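# torch/distributed/launcher/api.py
#
# Import paths and the integer defaults marked "assumed" below are not legible
# here and follow the upstream PyTorch layout; treat them as assumptions.
import sys
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, Union

import torch
import torch.distributed.elastic.rendezvous.registry as rdzv_registry
from torch._utils_internal import get_default_numa_options
from torch.distributed.elastic import events, metrics
from torch.distributed.elastic.agent.server.api import WorkerSpec
from torch.distributed.elastic.agent.server.local_elastic_agent import LocalElasticAgent
from torch.distributed.elastic.multiprocessing import (
    DefaultLogsSpecs,
    LogsSpecs,
    SignalException,
)
from torch.distributed.elastic.multiprocessing.errors import ChildFailedError
from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.utils import parse_rendezvous_endpoint
from torch.distributed.elastic.utils.logging import get_logger
from torch.numa.binding import NumaOptions


__all__ = ["LaunchConfig", "elastic_launch", "launch_agent"]

logger = get_logger(__name__)


@dataclass
class LaunchConfig:
    """
    Creates a rendezvous config.

    Args:
        min_nodes: Minimum number of nodes that the user function will
                   be launched on. The elastic agent ensures that the user
                   function starts only when at least ``min_nodes`` nodes
                   have entered the rendezvous.
        max_nodes: Maximum number of nodes that the user function
                   will be launched on.
        nproc_per_node: Number of workers the elastic agent launches on each
                        node to execute the user-defined function.
        rdzv_backend: rdzv_backend to use in the rendezvous (zeus-adapter, etcd).
        rdzv_endpoint: The endpoint of the rendezvous synchronization storage.
        rdzv_configs: Key-value pairs that specify rendezvous-specific configuration.
        rdzv_timeout: Legacy argument that specifies the timeout for the rendezvous.
                      It will be removed in a future version; see the note below.
                      The default timeout is 900 seconds.
        run_id: The unique run id of the job (if not passed a unique one will be
                deduced from run environment - flow workflow id in flow - or auto generated).
        role: User-defined role of the worker (defaults to "trainer").
        max_restarts: The maximum number of restarts that the elastic agent will
                      perform on workers before failing.
        monitor_interval: The interval, in seconds, at which the elastic agent
                          monitors the workers.
        start_method: The method used by the elastic agent to start the
                      workers (spawn, fork, forkserver).
        metrics_cfg: Configuration used to initialize metrics.
        local_addr: Address of the local node, if any. If not set, a lookup on the
                    local machine's FQDN will be performed.
        local_ranks_filter: Ranks for which to show logs in the console. If not set,
                            logs from all ranks are shown.
        event_log_handler: Name of the event logging handler as registered in
                           ``elastic/events/handlers.py``.

    .. note::
        ``rdzv_timeout`` is a legacy argument that will be removed in the future.
        Set the timeout via ``rdzv_configs['timeout']``.

    """

    min_nodes: int
    max_nodes: int
    nproc_per_node: int
    logs_specs: Optional[LogsSpecs] = None
    run_id: str = ""
    role: str = "default_role"
    rdzv_endpoint: str = ""
    rdzv_backend: str = "etcd"
    rdzv_configs: dict[str, Any] = field(default_factory=dict)
    rdzv_timeout: int = -1  # assumed sentinel meaning "not set"
    max_restarts: int = 3  # assumed upstream default
    monitor_interval: float = 0.1
    start_method: str = "spawn"
    log_line_prefix_template: Optional[str] = None
    metrics_cfg: dict[str, str] = field(default_factory=dict)
    local_addr: Optional[str] = None
    event_log_handler: str = "null"
    numa_options: Optional[NumaOptions] = None

    def __post_init__(self):
        default_timeout = 900
        # An explicit legacy rdzv_timeout overrides rdzv_configs["timeout"];
        # otherwise fall back to the default only when no timeout was given.
        if self.rdzv_timeout != -1:
            self.rdzv_configs["timeout"] = self.rdzv_timeout
        elif "timeout" not in self.rdzv_configs:
            self.rdzv_configs["timeout"] = default_timeout

        if self.logs_specs is None:
            self.logs_specs = DefaultLogsSpecs()

        # Apply default NUMA options only when there is one worker per CUDA device.
        if (
            self.numa_options is None
            and torch.cuda.is_available()
            and torch.cuda.device_count() == self.nproc_per_node
        ):
            self.numa_options = get_default_numa_options()
            logger.info("Using default numa options = %r", self.numa_options)


class elastic_launch:
    """
    Launches a torchelastic agent on the container that invoked the entrypoint.

        1. Pass the ``entrypoint`` arguments as non-``kwargs`` (i.e. no named parameters).
           ``entrypoint`` can be a function or a command.
        2. The return value is a map of each worker's output, keyed
           by its global rank.

    Usage

    ::

    def worker_fn(foo):
        # ...

    def main():
        # entrypoint is a function.
        outputs = elastic_launch(LaunchConfig, worker_fn)(foo)
        # return rank 0's output
        return outputs[0]

        # entrypoint is a command and ``script.py`` is the python module.
        outputs = elastic_launch(LaunchConfig, "script.py")(args)
        outputs = elastic_launch(LaunchConfig, "python")("script.py")
    """

    def __init__(
        self,
        config: LaunchConfig,
        entrypoint: Union[Callable, str, None],
    ):
        self._config = config
        self._entrypoint = entrypoint

    def __call__(self, *args):
        return launch_agent(self._config, self._entrypoint, list(args))


def _get_entrypoint_name(
    entrypoint: Union[Callable, str, None], args: list[Any]
) -> str:
    """Retrieve the entrypoint name using the following rules:
    1. If entrypoint is a function, use ``entrypoint.__name__``.
    2. If entrypoint is a string, check its value:
        2.1 if entrypoint is equal to ``sys.executable`` (like "python"), use the first
            element of ``args`` that does not start with a hyphen
            (for example, "-u" will be skipped).
        2.2 otherwise, use the ``entrypoint`` value.
    3. Otherwise, return an empty string.
    """
    if isinstance(entrypoint, Callable):  # type: ignore[arg-type]
        return entrypoint.__name__
    elif isinstance(entrypoint, str):
        if entrypoint == sys.executable:
            return next((arg for arg in args if arg[0] != "-"), "")
        else:
            return entrypoint
    else:
        return ""


def _get_addr_and_port(
    rdzv_parameters: RendezvousParameters,
) -> tuple[Optional[str], Optional[int]]:
    # Only the static backend carries the master address in the endpoint.
    if rdzv_parameters.backend != "static":
        return (None, None)
    endpoint = rdzv_parameters.endpoint
    endpoint = endpoint.strip()
    if not endpoint:
        raise ValueError(
            "Endpoint is missing in endpoint. Try to add --master-addr and --master-port"
        )
    master_addr, master_port = parse_rendezvous_endpoint(endpoint, default_port=-1)
    if master_port == -1:
        raise ValueError(
            f"port is missing in endpoint: {endpoint}. Try to specify --master-port"
        )
    return (master_addr, master_port)


def launch_agent(
    config: LaunchConfig,
    entrypoint: Union[Callable, str, None],
    args: list[Any],
) -> dict[int, Any]:
    if not config.run_id:
        run_id = str(uuid.uuid4().int)
        logger.warning("config has no run_id, generated a random run_id: %s", run_id)
        config.run_id = run_id

    entrypoint_name = _get_entrypoint_name(entrypoint, args)

    logger.info(
        "Starting elastic_operator with launch configs:\n"
        "  entrypoint         : %(entrypoint)s\n"
        "  min_nodes          : %(min_nodes)s\n"
        "  max_nodes          : %(max_nodes)s\n"
        "  nproc_per_node     : %(nproc_per_node)s\n"
        "  run_id             : %(run_id)s\n"
        "  rdzv_backend       : %(rdzv_backend)s\n"
        "  rdzv_endpoint      : %(rdzv_endpoint)s\n"
        "  rdzv_configs       : %(rdzv_configs)s\n"
        "  max_restarts       : %(max_restarts)s\n"
        "  monitor_interval   : %(monitor_interval)s\n"
        "  log_dir            : %(log_dir)s\n"
        "  metrics_cfg        : %(metrics_cfg)s\n"
        "  event_log_handler  : %(event_log_handler)s\n"
        "  numa_options       : %(numa_options)s\n",
        {
            "entrypoint": entrypoint_name,
            "min_nodes": config.min_nodes,
            "max_nodes": config.max_nodes,
            "nproc_per_node": config.nproc_per_node,
            "run_id": config.run_id,
            "rdzv_backend": config.rdzv_backend,
            "rdzv_endpoint": config.rdzv_endpoint,
            "rdzv_configs": config.rdzv_configs,
            "max_restarts": config.max_restarts,
            "monitor_interval": config.monitor_interval,
            "log_dir": config.logs_specs.root_log_dir,  # type: ignore[union-attr]
            "metrics_cfg": config.metrics_cfg,
            "event_log_handler": config.event_log_handler,
            "numa_options": config.numa_options,
        },
    )

    rdzv_parameters = RendezvousParameters(
        backend=config.rdzv_backend,
        endpoint=config.rdzv_endpoint,
        run_id=config.run_id,
        min_nodes=config.min_nodes,
        max_nodes=config.max_nodes,
        local_addr=config.local_addr,
        **config.rdzv_configs,
    )

    master_addr, master_port = _get_addr_and_port(rdzv_parameters)

    spec = WorkerSpec(
        role=config.role,
        local_world_size=config.nproc_per_node,
        entrypoint=entrypoint,
        args=tuple(args),
        rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
        max_restarts=config.max_restarts,
        monitor_interval=config.monitor_interval,
        master_addr=master_addr,
        master_port=master_port,
        local_addr=config.local_addr,
        event_log_handler=config.event_log_handler,
        numa_options=config.numa_options,
    )

    agent = LocalElasticAgent(
        spec=spec,
        logs_specs=config.logs_specs,  # type: ignore[arg-type]
        start_method=config.start_method,
        log_line_prefix_template=config.log_line_prefix_template,
    )

    shutdown_rdzv = True
    try:
        metrics.initialize_metrics(metrics.MetricsConfig(config.metrics_cfg))

        result = agent.run()
        # Records that agent.run() returned, not that the workers succeeded.
        events.record(agent.get_event_succeeded(), config.event_log_handler)

        if result.is_failed():
            raise ChildFailedError(
                name=entrypoint_name,
                failures=result.failures,
            )

        return result.return_values
    except ChildFailedError:
        raise
    except SignalException:
        # Do not shut down the rendezvous handler when the agent dies from a signal.
        shutdown_rdzv = False
        events.record(agent.get_event_failed(), config.event_log_handler)
        raise
    except Exception:
        events.record(agent.get_event_failed(), config.event_log_handler)
        raise
    finally:
        if shutdown_rdzv:
            spec.rdzv_handler.shutdown()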