from __future__ import annotations

import inspect
from functools import lru_cache, wraps
from typing import Callable

import torch
from safetensors.torch import storage_ptr, storage_size
from torch import nn

from .utils import (
    is_torch_greater_or_equal,
    is_torch_xla_available,
    is_torch_xpu_available,
    is_torchdynamo_compiling,
    logging,
)


ALL_LAYERNORM_LAYERS = [nn.LayerNorm]

logger = logging.get_logger(__name__)

is_torch_greater_or_equal_than_2_8 = is_torch_greater_or_equal("2.8", accept_dev=True)
is_torch_greater_or_equal_than_2_6 = is_torch_greater_or_equal("2.6", accept_dev=True)
is_torch_greater_or_equal_than_2_4 = is_torch_greater_or_equal("2.4", accept_dev=True)
is_torch_greater_or_equal_than_2_3 = is_torch_greater_or_equal("2.3", accept_dev=True)
is_torch_greater_or_equal_than_2_2 = is_torch_greater_or_equal("2.2", accept_dev=True)
is_torch_greater_or_equal_than_2_1 = is_torch_greater_or_equal("2.1", accept_dev=True)
is_torch_greater_or_equal_than_2_0 = is_torch_greater_or_equal("2.0", accept_dev=True)
is_torch_greater_or_equal_than_1_13 = is_torch_greater_or_equal("1.13", accept_dev=True)
is_torch_greater_or_equal_than_1_12 = is_torch_greater_or_equal("1.12", accept_dev=True)

_torch_distributed_available = torch.distributed.is_available()


def softmax_backward_data(parent, grad_output, output, dim, self):
    """
    A function that calls the internal `_softmax_backward_data` PyTorch method and that adjusts the arguments
    according to the torch version detected.
    """
    from torch import _softmax_backward_data

    return _softmax_backward_data(grad_output, output, parent.dim, self.dtype)


def prune_linear_layer(layer: nn.Linear, index: torch.LongTensor, dim: int = 0) -> nn.Linear:
    """
    Prune a linear layer to keep only entries in index.

    Used to remove heads.

    Args:
        layer (`torch.nn.Linear`): The layer to prune.
        index (`torch.LongTensor`): The indices to keep in the layer.
        dim (`int`, *optional*, defaults to 0): The dimension on which to keep the indices.

    Returns:
        `torch.nn.Linear`: The pruned layer as a new layer with `requires_grad=True`.
    """
    index = index.to(layer.weight.device)
    W = layer.weight.index_select(dim, index).clone().detach()
    if layer.bias is not None:
        if dim == 1:
            b = layer.bias.clone().detach()
        else:
            b = layer.bias[index].clone().detach()
    new_size = list(layer.weight.size())
    new_size[dim] = len(index)
    new_layer = nn.Linear(new_size[1], new_size[0], bias=layer.bias is not None).to(layer.weight.device)
    new_layer.weight.requires_grad = False
    new_layer.weight.copy_(W.contiguous())
    new_layer.weight.requires_grad = True
    if layer.bias is not None:
        new_layer.bias.requires_grad = False
        new_layer.bias.copy_(b.contiguous())
        new_layer.bias.requires_grad = True
    return new_layer


class Conv1D(nn.Module):
    """
    1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).

    Basically works like a linear layer but the weights are transposed.

    Args:
        nf (`int`): The number of output features.
        nx (`int`): The number of input features.
    """

    def __init__(self, nf, nx):
        super().__init__()
        self.nf = nf
        self.nx = nx
        self.weight = nn.Parameter(torch.empty(nx, nf))
        self.bias = nn.Parameter(torch.zeros(nf))
        nn.init.normal_(self.weight, std=0.02)

    def __repr__(self) -> str:
        return "Conv1D(nf={nf}, nx={nx})".format(**self.__dict__)

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
        x = x.view(size_out)
        return x


def prune_conv1d_layer(layer: Conv1D, index: torch.LongTensor, dim: int = 1) -> Conv1D:
    """
    Prune a Conv1D layer to keep only entries in index. A Conv1D works as a Linear layer (see e.g. BERT) but the
    weights are transposed.

    Used to remove heads.

    Args:
        layer ([`~pytorch_utils.Conv1D`]): The layer to prune.
        index (`torch.LongTensor`): The indices to keep in the layer.
        dim (`int`, *optional*, defaults to 1): The dimension on which to keep the indices.

    Returns:
        [`~pytorch_utils.Conv1D`]: The pruned layer as a new layer with `requires_grad=True`.
    """
    index = index.to(layer.weight.device)
    W = layer.weight.index_select(dim, index).clone().detach()
    if dim == 0:
        b = layer.bias.clone().detach()
    else:
        b = layer.bias[index].clone().detach()
    new_size = list(layer.weight.size())
    new_size[dim] = len(index)
    new_layer = Conv1D(new_size[1], new_size[0]).to(layer.weight.device)
    new_layer.weight.requires_grad = False
    new_layer.weight.copy_(W.contiguous())
    new_layer.weight.requires_grad = True
    new_layer.bias.requires_grad = False
    new_layer.bias.copy_(b.contiguous())
    new_layer.bias.requires_grad = True
    return new_layer


def prune_layer(layer: nn.Linear | Conv1D, index: torch.LongTensor, dim: int | None = None) -> nn.Linear | Conv1D:
    """
    Prune a Conv1D or linear layer to keep only entries in index.

    Used to remove heads.

    Args:
        layer (`Union[torch.nn.Linear, Conv1D]`): The layer to prune.
        index (`torch.LongTensor`): The indices to keep in the layer.
        dim (`int`, *optional*): The dimension on which to keep the indices.

    Returns:
        `torch.nn.Linear` or [`~pytorch_utils.Conv1D`]: The pruned layer as a new layer with `requires_grad=True`.
    """
    if isinstance(layer, nn.Linear):
        return prune_linear_layer(layer, index, dim=0 if dim is None else dim)
    elif isinstance(layer, Conv1D):
        return prune_conv1d_layer(layer, index, dim=1 if dim is None else dim)
    else:
        raise ValueError(f"Can't prune layer of class {layer.__class__}")
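
# Illustrative usage (editor's sketch, not part of the original module): prune a toy
# linear layer down to three of its input features. The layer sizes and index values
# below are made up for the example.
def _example_prune_linear_layer() -> nn.Linear:
    layer = nn.Linear(8, 4)
    # `dim=1` selects input features; keep only columns 0, 2 and 5.
    index = torch.tensor([0, 2, 5], dtype=torch.long)
    pruned = prune_linear_layer(layer, index, dim=1)
    assert pruned.weight.shape == (4, 3)  # 4 outputs, 3 remaining inputs
    return pruned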
def apply_chunking_to_forward(
    forward_fn: Callable[..., torch.Tensor], chunk_size: int, chunk_dim: int, *input_tensors
) -> torch.Tensor:
    """
    This function chunks the `input_tensors` into smaller input tensor parts of size `chunk_size` over the dimension
    `chunk_dim`. It then applies a layer `forward_fn` to each chunk independently to save memory.

    If the `forward_fn` is independent across the `chunk_dim` this function will yield the same result as directly
    applying `forward_fn` to `input_tensors`.

    Args:
        forward_fn (`Callable[..., torch.Tensor]`):
            The forward function of the model.
        chunk_size (`int`):
            The chunk size of a chunked tensor: `num_chunks = len(input_tensors[0]) / chunk_size`.
        chunk_dim (`int`):
            The dimension over which the `input_tensors` should be chunked.
        input_tensors (`tuple[torch.Tensor]`):
            The input tensors of `forward_fn` which will be chunked

    Returns:
        `torch.Tensor`: A tensor with the same shape as the `forward_fn` would have given if applied.


    Examples:

    ```python
    # rename the usual forward() fn to forward_chunk()
    def forward_chunk(self, hidden_states):
        hidden_states = self.decoder(hidden_states)
        return hidden_states


    # implement a chunked forward function
    def forward(self, hidden_states):
        return apply_chunking_to_forward(self.forward_chunk, self.chunk_size_lm_head, self.seq_len_dim, hidden_states)
    ```"""

    assert len(input_tensors) > 0, f"{input_tensors} has to be a tuple/list of tensors"

    num_args_in_forward_chunk_fn = len(inspect.signature(forward_fn).parameters)
    if num_args_in_forward_chunk_fn != len(input_tensors):
        raise ValueError(
            f"forward_chunk_fn expects {num_args_in_forward_chunk_fn} arguments, but only {len(input_tensors)} input "
            "tensors are given"
        )

    if chunk_size > 0:
        tensor_shape = input_tensors[0].shape[chunk_dim]
        for input_tensor in input_tensors:
            if input_tensor.shape[chunk_dim] != tensor_shape:
                raise ValueError(
                    f"All input tensors have to be of the same shape: {tensor_shape}, "
                    f"found shape {input_tensor.shape[chunk_dim]}"
                )

        if input_tensors[0].shape[chunk_dim] % chunk_size != 0:
            raise ValueError(
                f"The dimension to be chunked {input_tensors[0].shape[chunk_dim]} has to be a multiple of the chunk "
                f"size {chunk_size}"
            )

        num_chunks = input_tensors[0].shape[chunk_dim] // chunk_size

        # chunk each input tensor into `num_chunks` pieces along `chunk_dim`
        input_tensors_chunks = tuple(input_tensor.chunk(num_chunks, dim=chunk_dim) for input_tensor in input_tensors)
        # apply forward fn to every tuple of chunks
        output_chunks = tuple(forward_fn(*input_tensors_chunk) for input_tensors_chunk in zip(*input_tensors_chunks))
        # concatenate outputs at the same dimension
        return torch.cat(output_chunks, dim=chunk_dim)

    return forward_fn(*input_tensors)


def find_pruneable_heads_and_indices(
    heads: list[int], n_heads: int, head_size: int, already_pruned_heads: set[int]
) -> tuple[set[int], torch.LongTensor]:
    """
    Finds the heads and their indices taking `already_pruned_heads` into account.

    Args:
        heads (`list[int]`): List of the indices of heads to prune.
        n_heads (`int`): The number of heads in the model.
        head_size (`int`): The size of each head.
        already_pruned_heads (`set[int]`): A set of already pruned heads.

    Returns:
        `tuple[set[int], torch.LongTensor]`: A tuple with the indices of heads to prune taking `already_pruned_heads`
        into account and the indices of rows/columns to keep in the layer weight.
    """
    mask = torch.ones(n_heads, head_size)
    heads = set(heads) - already_pruned_heads  # Convert to set and remove already pruned heads
    for head in heads:
        # Compute how many pruned heads are before the head and move the index accordingly
        head = head - sum(1 if h < head else 0 for h in already_pruned_heads)
        mask[head] = 0
    mask = mask.view(-1).contiguous().eq(1)
    index: torch.LongTensor = torch.arange(len(mask))[mask].long()
    return heads, index


def meshgrid(
    *tensors: torch.Tensor | list[torch.Tensor], indexing: str | None = None
) -> tuple[torch.Tensor, ...]:
    """
    Wrapper around torch.meshgrid to avoid warning messages about the introduced `indexing` argument.

    Reference: https://pytorch.org/docs/1.13/generated/torch.meshgrid.html
    """
    return torch.meshgrid(*tensors, indexing=indexing)
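
# Illustrative usage (editor's sketch, not part of the original module): when the wrapped
# function is position-wise, the chunked forward pass of `apply_chunking_to_forward`
# matches the direct pass, as its docstring describes. All sizes here are made up.
def _example_apply_chunking_to_forward() -> torch.Tensor:
    dense = nn.Linear(16, 16)

    def forward_chunk(hidden_states):
        return dense(hidden_states)

    hidden_states = torch.randn(2, 8, 16)  # (batch, seq_len, hidden)
    # Split the sequence dimension (dim=1) into chunks of length 4 -> 2 chunks.
    output = apply_chunking_to_forward(forward_chunk, 4, 1, hidden_states)
    assert torch.allclose(output, forward_chunk(hidden_states), atol=1e-6)
    return output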
def id_tensor_storage(tensor: torch.Tensor) -> tuple[torch.device, int, int]:
    """
    Unique identifier to a tensor storage. Multiple different tensors can share the same underlying storage. For
    example, "meta" tensors all share the same storage, and thus their identifier will all be equal. This identifier
    is guaranteed to be unique and constant for this tensor's storage during its lifetime. Two tensor storages with
    non-overlapping lifetimes may have the same id.
    """
    if _torch_distributed_available and is_torch_greater_or_equal("2.5"):
        from torch.distributed.tensor import DTensor

        if isinstance(tensor, DTensor):
            local_tensor = tensor.to_local()
            return tensor.device, local_tensor.storage().data_ptr(), tensor.nbytes

    if tensor.device.type == "xla" and is_torch_xla_available():
        # NOTE: xla tensors don't have storage, so use another unique id to distinguish them. Since this is an XLA
        # tensor, it must have been created on a torch_xla device, so the following import is safe:
        import torch_xla

        unique_id = torch_xla._XLAC._xla_get_tensor_id(tensor)
    else:
        unique_id = storage_ptr(tensor)

    return tensor.device, unique_id, storage_size(tensor)


def isin_mps_friendly(elements: torch.Tensor, test_elements: torch.Tensor | int) -> torch.Tensor:
    """
    Same as `torch.isin` without flags, but MPS-friendly. We can remove this function when we stop supporting
    torch <= 2.3. See https://github.com/pytorch/pytorch/issues/77764#issuecomment-2067838075

    Args:
        elements (`torch.Tensor`): Input elements
        test_elements (`torch.Tensor` or `int`): The elements to check against.

    Returns:
        `torch.Tensor`: A boolean tensor of the same shape as `elements` that is True for `elements` in
        `test_elements` and False otherwise
    """
    if elements.device.type == "mps" and not is_torch_greater_or_equal_than_2_4:
        test_elements = torch.as_tensor(test_elements, device=elements.device)
        if test_elements.ndim == 0:
            test_elements = test_elements.unsqueeze(0)
        # Manual membership test: compare each element against every test element, then reduce over test elements.
        return elements.tile(test_elements.shape[0], 1).eq(test_elements.unsqueeze(1)).sum(dim=0).bool().squeeze()
    else:
        return torch.isin(elements, test_elements)


def compile_compatible_method_lru_cache(*lru_args, **lru_kwargs):
    """
    LRU cache decorator from standard functools library, but with a workaround to disable caching when torchdynamo
    is compiling. Expected to work with class methods.
    """

    def decorator(func):
        func_with_cache = lru_cache(*lru_args, **lru_kwargs)(func)

        @wraps(func)
        def wrapper(*args, **kwargs):
            if is_torchdynamo_compiling():
                return func(*args, **kwargs)
            return func_with_cache(*args, **kwargs)

        return wrapper

    return decorator
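
# Illustrative usage (editor's sketch, not part of the original module): cache a
# deterministic per-size mask during eager execution, while torch.compile traces fall
# through to the uncached function. The class and method below are hypothetical.
class _ExampleMaskCache:
    @compile_compatible_method_lru_cache(maxsize=8)
    def causal_mask(self, seq_len: int) -> torch.Tensor:
        # Lower-triangular boolean mask, recomputed only on cache misses.
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))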