import warnings
from collections import namedtuple
from typing import Any, Callable, Optional

import torch
from torch.sparse._semi_structured_conversions import (
    sparse_semi_structured_from_dense_cutlass,
    sparse_semi_structured_to_dense_cutlass,
)
from torch.sparse._semi_structured_ops import (
    fallback_dispatcher,
    semi_sparse_addmm,
    semi_sparse_detach,
    semi_sparse_indices,
    semi_sparse_linear,
    semi_sparse_mm,
    semi_sparse_scaled_mm,
    semi_sparse_t,
    semi_sparse_values,
    semi_sparse_view,
)

__all__ = [
    "SparseSemiStructuredTensor",
    "SparseSemiStructuredTensorCUTLASS",
    "SparseSemiStructuredTensorCUSPARSELT",
    "to_sparse_semi_structured",
]

_SEMI_STRUCTURED_SPARSE_CONFIG = namedtuple(
    "_SEMI_STRUCTURED_SPARSE_CONFIG",
    "sparse_min_rows sparse_min_cols dense_min_rows dense_min_cols",
)


class SparseSemiStructuredTensor(torch.Tensor):
    """
    This class implements semi-structured sparsity as a Tensor subclass.

    Semi-structured sparsity describes a sparsity pattern where n in every 2n
    elements are sparse, depending on the datatype. It is also referred to as
    2:4 sparsity or fine-grained structured sparsity.

    There are two backends available for semi-structured sparsity, either
    cuSPARSELt or CUTLASS. This class is meant to serve as a base class for
    both implementations: SparseSemiStructuredTensorCUTLASS and
    SparseSemiStructuredTensorCUSPARSELT both inherit from this class and
    define three backend-specific items. Note that, as such, this class cannot
    be instantiated directly.

    - `_DTYPE_SHAPE_CONSTRAINTS` - A dictionary holding backend-specific dense/sparse min shape constraints
    - `def from_dense()` - backend-specific compression routines
    - `def _mm()` - backend-specific mm op (either torch._cslt_sparse_mm or torch._sparse_semi_structured_(mm|addmm))
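
    A minimal usage sketch (illustrative only: it assumes a CUDA device with
    2:4 sparse kernel support and shapes that satisfy the fp16 constraints of
    the CUTLASS backend; in practice instances are created through a concrete
    subclass's `from_dense()`, never by instantiating this base class):

        >>> # xdoctest: +SKIP
        >>> A = torch.Tensor([0, 0, 1, 1]).tile((64, 16)).half().cuda()  # (64, 64), 2:4 sparse
        >>> A_sparse = SparseSemiStructuredTensorCUTLASS.from_dense(A)
        >>> B = torch.rand((64, 64), dtype=torch.float16, device="cuda")
        >>> out = torch.mm(A_sparse, B)  # routed through SPARSE_DISPATCH to semi_sparse_mm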
    """

    _DEFAULT_ALG_ID: int = 0
    _DTYPE_SHAPE_CONSTRAINTS: dict[torch.dtype, _SEMI_STRUCTURED_SPARSE_CONFIG]
    _FORCE_CUTLASS: bool = False
    _FUSE_TRANSPOSE: bool = False
    _PROTOTYPE_WARNING_SHOWN: bool = False

    BACKEND: str
    SPARSE_DISPATCH: dict[Callable, Callable]

    packed: Optional[torch.Tensor]
    meta: Optional[torch.Tensor]
    packed_t: Optional[torch.Tensor]
    meta_t: Optional[torch.Tensor]
    compressed_swizzled_bitmask: Optional[torch.Tensor]
    fuse_transpose_cusparselt: bool
    alg_id_cusparselt: int

    __slots__ = ["packed", "meta", "packed_t", "meta_t", "compressed_swizzled_bitmask"]

    @staticmethod
    def __new__(  # noqa: PYI034
        cls,
        shape: torch.Size,
        packed: Optional[torch.Tensor],
        meta: Optional[torch.Tensor],
        packed_t: Optional[torch.Tensor],
        meta_t: Optional[torch.Tensor],
        compressed_swizzled_bitmask: Optional[torch.Tensor],
        fuse_transpose_cusparselt: bool = False,
        alg_id_cusparselt: int = 0,
        requires_grad: bool = False,
    ):
        """
        Create a new instance of the tensor subclass from the compressed sparse representation.

        We have the option to create the subclass with the compressed representations of both X
        and X', for training. For inference, we only need a single representation (either X or
        X'), while the corresponding other set will be None.

        Depending on the backend selected, certain fields will be set to None (CUSPARSELT vs CUTLASS).

        Args:
            shape: The shape of the original dense tensor
            packed: The compressed representation of the original dense tensor
            meta: The metadata of the original dense tensor, if it is stored separately
            packed_t: The compressed representation of the transposed original dense tensor
            meta_t: The metadata of the transposed original dense tensor, if it is stored separately
            compressed_swizzled_bitmask: The masks used by the CUTLASS backend to determine which
                threads should participate in the computation. Used for pointwise ops.
            fuse_transpose_cusparselt: When running with cuSPARSELt, we have the option to fuse a
                transposition with a matmul, which is useful in the case of 2:4 sparse training.
            alg_id_cusparselt: The algorithm id to use when using cuSPARSELt; has an effect on performance.

        Returns:
            torch.Tensor: A torch.Tensor wrapper subclass.

        Raises:
            ValueError: If all of the tensor arguments are None.
        """
        if not cls._PROTOTYPE_WARNING_SHOWN:
            warnings.warn(
                (
                    "The PyTorch API of SparseSemiStructuredTensor is in prototype stage "
                    "and will change in the near future. Please open a GitHub issue "
                    "for feature requests and see our documentation on the torch.sparse "
                    "module for further information about the project."
                ),
                UserWarning,
            )
            cls._PROTOTYPE_WARNING_SHOWN = True

            # Because this only runs once, we load the op dispatch table here as well.
            # Loading it lazily avoids import-order problems with torch.ops, and it
            # also lets users overload the dispatch table for debugging / testing.
            cls._load_dispatch_table()

            # We can also register the subclass with dynamo once the warning is shown.
            torch._dynamo.allow_in_graph(cls)

        if packed is not None:
            previous_tensor = packed
        elif packed_t is not None:
            previous_tensor = packed_t
        else:
            raise ValueError("At least one of packed or packed_t must be provided")

        tensor = torch.Tensor._make_wrapper_subclass(
            cls,
            shape,
            device=previous_tensor.device,
            dtype=previous_tensor.dtype,
            layout=previous_tensor.layout,
            requires_grad=requires_grad,
        )
        tensor.packed = packed
        tensor.meta = meta
        tensor.packed_t = packed_t
        tensor.meta_t = meta_t
        tensor.compressed_swizzled_bitmask = compressed_swizzled_bitmask
        tensor.fuse_transpose_cusparselt = fuse_transpose_cusparselt
        tensor.alg_id_cusparselt = alg_id_cusparselt
        return tensor

    def __repr__(self) -> str:  # type: ignore[override]
        assert hasattr(self, "shape")
        return f"{self.__class__.__name__}(shape={self.shape})"

    def __tensor_flatten__(
        self,
    ) -> tuple[list[str], tuple[torch.Size, bool, int, bool]]:
        inner_tensors = list(
            filter(lambda x: getattr(self, x) is not None, self.__slots__)
        )
        tensor_meta = (
            self.shape,
            self.fuse_transpose_cusparselt,
            self.alg_id_cusparselt,
            self.requires_grad,
        )
        return inner_tensors, tensor_meta

    @classmethod
    def __tensor_unflatten__(
        cls,
        inner_tensors,
        tensor_meta: tuple[torch.Size, bool, int, bool],
        outer_size,
        outer_stride,
    ) -> torch.Tensor:
        shape, fuse_transpose_cusparselt, alg_id_cusparselt, requires_grad = tensor_meta
        return cls(
            shape=shape,
            packed=inner_tensors.get("packed", None),
            meta=inner_tensors.get("meta", None),
            packed_t=inner_tensors.get("packed_t", None),
            meta_t=inner_tensors.get("meta_t", None),
            compressed_swizzled_bitmask=inner_tensors.get(
                "compressed_swizzled_bitmask", None
            ),
            fuse_transpose_cusparselt=fuse_transpose_cusparselt,
            alg_id_cusparselt=alg_id_cusparselt,
            requires_grad=requires_grad,
        )

    __torch_function__ = torch._C._disabled_torch_function_impl

    @classmethod
    def __torch_dispatch__(cls, func, types, args, kwargs) -> Any:
        if func._overloadpacket not in cls.SPARSE_DISPATCH:
            raise NotImplementedError(
                f"{cls.__name__} only supports a specific set of operations, "
                f"can't perform requested op ({func.__name__})"
            )
        return cls.SPARSE_DISPATCH[func._overloadpacket](func, types, args, kwargs)

    @classmethod
    def _load_dispatch_table(cls, custom_dispatch_table=None) -> None:
        """
        Loads the op overload sparse dispatch table for the current class.
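
        A custom table can override or extend the defaults for debugging / testing.
        Note that the defaults are installed only once, on the first call, so a
        sketch like the following (the `noisy_mm` handler is hypothetical) must run
        before the first sparse tensor is constructed:

            >>> # xdoctest: +SKIP
            >>> def noisy_mm(func, types, args, kwargs):
            ...     print("semi-structured mm called")
            ...     return semi_sparse_mm(func, types, args, kwargs)
            >>> SparseSemiStructuredTensorCUTLASS._load_dispatch_table(
            ...     {torch.ops.aten.mm: noisy_mm}
            ... )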
        """
        if getattr(cls, "SPARSE_DISPATCH", None) is None:
            cls.SPARSE_DISPATCH = {
                torch.ops.aten.values: semi_sparse_values,
                torch.ops.aten.indices: semi_sparse_indices,
                torch.ops.aten.is_same_size: fallback_dispatcher,
                torch.ops.aten.detach_: fallback_dispatcher,
                torch.ops.aten.detach: semi_sparse_detach,
                torch.ops.aten.t: semi_sparse_t,
                torch.ops.aten.view: semi_sparse_view,
                torch.ops.aten.mm: semi_sparse_mm,
                torch.ops.aten.matmul: semi_sparse_mm,
                torch.ops.aten.addmm: semi_sparse_addmm,
                torch.ops.aten.linear: semi_sparse_linear,
                torch.ops.aten._to_copy: fallback_dispatcher,
                torch.ops.aten._scaled_mm: semi_sparse_scaled_mm,
            }
            if custom_dispatch_table is not None:
                cls.SPARSE_DISPATCH.update(custom_dispatch_table)

    @classmethod
    def _validate_device_dim_dtype_shape(cls, original_tensor: torch.Tensor) -> None:
        """
        Assert that the given tensor is valid for semi-structured sparse compression.
        """
        # check device
        if not original_tensor.is_cuda:
            raise RuntimeError(
                f"Error original_tensor.device={original_tensor.device} is not supported! "
                "Only CUDA tensors are currently supported."
            )

        # check dim
        if original_tensor.dim() != 2:
            raise RuntimeError(
                f"Error original_tensor.dim = {original_tensor.dim()} is not supported! "
                "Only 2d tensors are currently supported."
            )

        # check contiguous
        if not original_tensor.is_contiguous():
            raise RuntimeError(
                "Error original_tensor is not contiguous! "
                "Only contiguous tensors are currently supported."
            )

        # check dtype
        if original_tensor.dtype not in cls._DTYPE_SHAPE_CONSTRAINTS:
            raise RuntimeError(
                f"Error original_tensor.dtype {original_tensor.dtype} "
                f"is not a supported dtype for {cls.BACKEND}!"
            )

        # check shape
        m, n = original_tensor.shape
        min_rows = cls._DTYPE_SHAPE_CONSTRAINTS[original_tensor.dtype].sparse_min_rows
        min_cols = cls._DTYPE_SHAPE_CONSTRAINTS[original_tensor.dtype].sparse_min_cols
        if m < min_rows or m % min_rows or n < min_cols or n % min_cols:
            raise RuntimeError(
                f"Error original_tensor.shape {original_tensor.shape} is not supported! "
                f"Both dimensions must be greater than or equal to, and a multiple of, "
                f"({min_rows}, {min_cols})"
            )

    def to_dense(self):  # type: ignore[override]
        # Materialize the dense tensor via a matmul with the identity, so that the
        # backend-specific `_mm` performs the decompression.
        col = self.shape[-1]
        return torch.mm(self, torch.eye(col, dtype=self.dtype, device=self.device))

    @classmethod
    def from_dense(cls, original_tensor: torch.Tensor) -> "SparseSemiStructuredTensor":
        raise NotImplementedError

    def _mm(
        self,
        B: torch.Tensor,
        *,
        bias: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> torch.Tensor:
        raise NotImplementedError


def to_sparse_semi_structured(
    original_tensor: torch.Tensor,
    transposed: bool = False,
) -> SparseSemiStructuredTensor:
    """
    This function converts a dense tensor into a sparse semi-structured tensor.
    It will return a SparseSemiStructuredTensor, a subclass of torch.Tensor.

    This function will check to ensure the dense tensor has the right dtype, size, dims, and device.
    We currently only support semi-structured sparse tensors for 2d CUDA tensors.
    Additionally, your tensor must be a positive multiple of the minimum sparse block size, given in
    `_DTYPE_SHAPE_CONSTRAINTS` for each dtype (float32, float16, bfloat16, int8).

    Args:
        original_tensor (Tensor): the dense tensor to convert
        transposed (bool, optional): deprecated argument, to be removed in a future release. Do not use.
    Returns:
        SparseSemiStructuredTensor: A sparse semi-structured tensor created from the given original_tensor
    Example:
        >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA)
        >>> A = torch.Tensor([0, 0, 1, 1]).tile((128, 32)).half().cuda()
        tensor([[0., 0., 1.,  ..., 0., 1., 1.],
                [0., 0., 1.,  ..., 0., 1., 1.],
                [0., 0., 1.,  ..., 0., 1., 1.],
                ...,
                [0., 0., 1.,  ..., 0., 1., 1.],
                [0., 0., 1.,  ..., 0., 1., 1.],
                [0., 0., 1.,  ..., 0., 1., 1.]], device='cuda:0', dtype=torch.float16)
        >>> A_sparse = to_sparse_semi_structured(A)
        SparseSemiStructuredTensor(shape=torch.Size([128, 128]))
        >>> A_sparse.values()
        tensor([[1., 1., 1.,  ..., 1., 1., 1.],
                [1., 1., 1.,  ..., 1., 1., 1.],
                [1., 1., 1.,  ..., 1., 1., 1.],
                ...,
                [1., 1., 1.,  ..., 1., 1., 1.],
                [1., 1., 1.,  ..., 1., 1., 1.],
                [1., 1., 1.,  ..., 1., 1., 1.]], device='cuda:0', dtype=torch.float16)
        >>> A_sparse.indices()
        tensor([[-4370, -4370, -4370,  ..., -4370, -4370, -4370],
                [-4370, -4370, -4370,  ..., -4370, -4370, -4370],
                [-4370, -4370, -4370,  ..., -4370, -4370, -4370],
                ...,
                [-4370, -4370, -4370,  ..., -4370, -4370, -4370],
                [-4370, -4370, -4370,  ..., -4370, -4370, -4370],
                [-4370, -4370, -4370,  ..., -4370, -4370, -4370]], device='cuda:0', dtype=torch.int16)
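
        The returned tensor can then stand in for its dense counterpart in a matmul;
        a sketch of the follow-on usage (`B` is an assumed dense operand with matching
        dtype and inner dimension):

        >>> # xdoctest: +SKIP
        >>> B = torch.rand((128, 64), dtype=torch.float16, device="cuda")
        >>> torch.mm(A_sparse, B).shape
        torch.Size([128, 64])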
    """
    if transposed:
        warnings.warn(
            "Setting transposed from `to_sparse_semi_structured` is deprecated "
            "and will be removed in a future release. "
            "`SparseSemiStructuredTensor` only supports contiguous input tensors.",
            FutureWarning,
            stacklevel=2,
        )

    # We can use either the CUTLASS or the cuSPARSELt backend,
    # depending on the value of _FORCE_CUTLASS.
    SPARSE_SUBCLASS = (
        torch.sparse.SparseSemiStructuredTensorCUTLASS
        if SparseSemiStructuredTensor._FORCE_CUTLASS
        else torch.sparse.SparseSemiStructuredTensorCUSPARSELT
    )

    return SPARSE_SUBCLASS.from_dense(original_tensor)


class SparseSemiStructuredTensorCUTLASS(SparseSemiStructuredTensor):
    """
    This class implements semi-structured sparsity for the CUTLASS backend.

    In this implementation, the specified elements and metadata are stored separately,
    in packed and meta respectively.

    When _FORCE_CUTLASS is set, or when cuSPARSELt is not available, this subclass calls into
    _sparse_semi_structured_(mm|addmm) and sparse_semi_structured_from_dense_cutlass for
    conversion to the compressed format.
    """

    BACKEND = "cutlass"
    _DTYPE_SHAPE_CONSTRAINTS = {
        torch.int8: _SEMI_STRUCTURED_SPARSE_CONFIG(16, 128, 16, 16),
        torch.float16: _SEMI_STRUCTURED_SPARSE_CONFIG(32, 64, 8, 8),
        torch.bfloat16: _SEMI_STRUCTURED_SPARSE_CONFIG(32, 64, 8, 8),
        torch.float32: _SEMI_STRUCTURED_SPARSE_CONFIG(32, 32, 4, 4),
    }

    @classmethod
    def from_dense(
        cls, original_tensor: torch.Tensor
    ) -> "SparseSemiStructuredTensorCUTLASS":
        cls._validate_device_dim_dtype_shape(original_tensor)
        (
            sparse_tensor_cutlass,
            meta_tensor_cutlass,
        ) = sparse_semi_structured_from_dense_cutlass(original_tensor)
        return cls(
            original_tensor.shape,
            packed=sparse_tensor_cutlass,
            meta=meta_tensor_cutlass,
            packed_t=None,
            meta_t=None,
            compressed_swizzled_bitmask=None,
            requires_grad=original_tensor.requires_grad,
        )

    def to_dense(self):  # type: ignore[override]
        assert self.meta is not None and self.packed is not None
        return (
            sparse_semi_structured_to_dense_cutlass(
                self.packed,
                self.meta,
            )
            if self.meta.ndim == 2
            else super().to_dense()
        )

    @classmethod
    def prune_dense_static_sort(
        cls, original_tensor: torch.Tensor, algorithm=""
    ) -> "SparseSemiStructuredTensor":
        """
        This function takes in an unpruned dense tensor and runs a (branchless) static sort across a 4x4 tile.

        It greedily picks the largest values in the tile, upholding the 2:4 sparsity constraint across both
        rows and columns. The algorithm used to prune the matrix is implemented in `_sparse_semi_structured_tile`.

        Then it creates the packed and meta tensors for the compressed sparse representation of the pruned
        dense tensor. It also calculates the packed_t and meta_t tensors for the compressed sparse
        representation of the transposed pruned dense tensor. Since we cannot transpose the compressed
        representations, we store both for the fw/bw pass respectively.

        Finally, this function also computes a compressed swizzled bitmask that encodes the sparsity pattern.
        This can be used in the backward pass to mask the gradients.

        [9 1 7 4]                       [9 0 7 0]
        [1 2 3 0]                       [0 2 0 0]
        [8 3 5 4] -> prune 4x4 tile  -> [8 0 0 4] -> pack to CUTLASS semi-structured  -> packed
        [1 2 6 2]                       [0 0 6 2]                                     -> metadata

                                                  -> pack to transposed CUTLASS       -> packed_t
                                                     semi-structured representation   -> metadata_t

                                                  -> compute swizzled bitmask         -> compressed_swizzled_bitmask

        The equivalent PyTorch code to create the same five outputs from the dense tensor can be found below:

        ```
        from torch.sparse import SparseSemiStructuredTensorCUTLASS
        from torch.sparse._semi_structured_conversions import (
            _sparse_semi_structured_tile,
            _compute_compressed_swizzled_bitmask,
        )

        pruned = _sparse_semi_structured_tile(dense)
        packed_cutlass, meta_cutlass = sparse_semi_structured_from_dense_cutlass(pruned)
        packed_t_cutlass, meta_t_cutlass = sparse_semi_structured_from_dense_cutlass(
            pruned.t().contiguous()
        )
        bitmask = _compute_compressed_swizzled_bitmask(pruned)

        SparseSemiStructuredTensorCUTLASS(
            dense.shape,
            packed_cutlass,
            meta_cutlass,
            packed_t_cutlass,
            meta_t_cutlass,
            bitmask,
        )
        ```
        """
        (
            packed,
            meta,
            packed_t,
            meta_t,
            compressed_swizzled_bitmask,
        ) = torch._sparse_semi_structured_tile(
            original_tensor, algorithm=algorithm, use_cutlass=True
        )

        return cls(
            original_tensor.shape,
            packed=packed,
            meta=meta,
            packed_t=packed_t,
            meta_t=meta_t,
            compressed_swizzled_bitmask=compressed_swizzled_bitmask,
            requires_grad=False,
        )

    def _mm(
        self,
        B: torch.Tensor,
        *,
        bias: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> torch.Tensor:
        if isinstance(B, SparseSemiStructuredTensor):
            raise ValueError(
                "`SparseSemiStructuredTensor @ SparseSemiStructuredTensor` is not supported by the hardware"
            )
        cls_name = self.__class__.__name__
        if self.ndim != 2 or B.ndim != 2:
            raise NotImplementedError(
                f"`{cls_name}` matmul: Broadcasting is not implemented"
            )
        if self.packed is None or self.meta is None:
            raise NotImplementedError(
                f"`{cls_name}` matmul: operation is not supported"
            )
        if bias is None:
            res = torch._sparse_semi_structured_mm(self.packed, self.meta, B)
        else:
            res = torch._sparse_semi_structured_addmm(bias, self.packed, self.meta, B)
        # The CUTLASS kernel may pad rows, so slice back to the logical shape.
        return res[: self.shape[0]]
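
# A small backend-selection sketch (illustrative, not part of the module API):
# `to_sparse_semi_structured` dispatches on `_FORCE_CUTLASS`, which defaults to
# False (i.e. cuSPARSELt). The shape below satisfies the fp16 CUTLASS minimum of
# (32, 64) declared in `SparseSemiStructuredTensorCUTLASS._DTYPE_SHAPE_CONSTRAINTS`:
#
#   SparseSemiStructuredTensor._FORCE_CUTLASS = True
#   A = torch.Tensor([0, 0, 1, 1]).tile((32, 16)).half().cuda()  # (32, 64), 2:4 sparse
#   A_sparse = to_sparse_semi_structured(A)
#   assert isinstance(A_sparse, SparseSemiStructuredTensorCUTLASS)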


class SparseSemiStructuredTensorCUSPARSELT(SparseSemiStructuredTensor):
    """
    The cuSPARSELt backend expects the specified elements and the metadata to be stored in a single tensor:

        packed = [ specified elements of original tensor | metadata ]

    For an original tensor of size (m, k) we expect the first m * k // 2 elements to be the kept elements.
    The rest of the tensor is metadata. Since there is only one tensor, we only use the packed and packed_t
    attributes respectively.

    cuSPARSELt also supports transposition fusion, which is necessary for performant 2:4 sparse training,
    as well as specifying alg_id, a config that affects the performance of the matmul depending on matmul
    sizes.
    """

    BACKEND = "cusparselt"
    _DTYPE_SHAPE_CONSTRAINTS = {
        torch.float8_e4m3fn: _SEMI_STRUCTURED_SPARSE_CONFIG(32, 32, 16, 16),
        torch.int8: _SEMI_STRUCTURED_SPARSE_CONFIG(32, 32, 16, 16),
        torch.float16: _SEMI_STRUCTURED_SPARSE_CONFIG(16, 16, 8, 8),
        torch.bfloat16: _SEMI_STRUCTURED_SPARSE_CONFIG(16, 16, 8, 8),
    }

    @classmethod
    def from_dense(
        cls, original_tensor: torch.Tensor
    ) -> "SparseSemiStructuredTensorCUSPARSELT":
        cls._validate_device_dim_dtype_shape(original_tensor)
        return cls(
            shape=original_tensor.shape,
            packed=torch._cslt_compress(original_tensor),
            meta=None,
            packed_t=None,
            meta_t=None,
            compressed_swizzled_bitmask=None,
            fuse_transpose_cusparselt=SparseSemiStructuredTensor._FUSE_TRANSPOSE,
            alg_id_cusparselt=SparseSemiStructuredTensor._DEFAULT_ALG_ID,
            requires_grad=original_tensor.requires_grad,
        )

    @classmethod
    def prune_dense_static_sort(
        cls, original_tensor: torch.Tensor, algorithm=""
    ) -> "SparseSemiStructuredTensor":
        """
        This function does the same thing as described in SparseSemiStructuredTensorCUTLASS, but uses
        the cuSPARSELt metadata layout and sparse matmul.

        The only functional difference is that cuSPARSELt stores `metadata` and `packed` together into
        a single tensor.

        [9 1 7 4]                       [9 0 7 0]
        [1 2 3 0]                       [0 2 0 0]
        [8 3 5 4] -> prune 4x4 tile  -> [8 0 0 4] -> pack to cuSPARSELt semi-structured -> packed
        [1 2 6 2]                       [0 0 6 2]

                                                  -> pack to transposed cuSPARSELt      -> packed_t
                                                     semi-structured representation

                                                  -> compute swizzled bitmask           -> compressed_swizzled_bitmask

        The equivalent PyTorch code to create the same three outputs from the dense tensor can be found below:

        ```
        from torch.sparse import SparseSemiStructuredTensorCUSPARSELT
        from torch.sparse._semi_structured_conversions import (
            _sparse_semi_structured_tile,
            _compute_compressed_swizzled_bitmask,
        )

        pruned = _sparse_semi_structured_tile(dense)
        packed_cusparselt = torch._cslt_compress(pruned)
        packed_t_cusparselt = torch._cslt_compress(pruned.t().contiguous())
        bitmask = _compute_compressed_swizzled_bitmask(pruned)

        SparseSemiStructuredTensorCUSPARSELT(
            dense.shape, packed_cusparselt, None, packed_t_cusparselt, None, bitmask
        )
        ```
        """
        (
            packed,
            meta,
            packed_t,
            meta_t,
            compressed_swizzled_bitmask,
        ) = torch._sparse_semi_structured_tile(
            original_tensor, algorithm=algorithm, use_cutlass=False
        )

        return cls(
            original_tensor.shape,
            packed=packed,
            meta=meta,
            packed_t=packed_t,
            meta_t=meta_t,
            compressed_swizzled_bitmask=compressed_swizzled_bitmask,
            requires_grad=False,
        )

    def _mm(
        self,
        B: torch.Tensor,
        *,
        bias: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> torch.Tensor:
        if isinstance(B, SparseSemiStructuredTensor):
            raise ValueError(
                "`SparseSemiStructuredTensor @ SparseSemiStructuredTensor` is not supported by the hardware"
            )
        cls_name = self.__class__.__name__
        if self.ndim != 2 or B.ndim != 2:
            raise NotImplementedError(
                f"`{cls_name}` matmul: Broadcasting is not implemented"
            )
        if B.dtype != self.dtype:
            raise NotImplementedError(
                f"`{cls_name}` matmul: trying to do `A={tuple(self.shape)} @ B={tuple(B.shape)}`, "
                f"with A.dtype={self.dtype} and B.dtype={B.dtype}. "
                "This operation is only supported when A and B have the same data type."
            )
        if bias is not None and bias.dtype != self.dtype:
            raise NotImplementedError(
                f"`{cls_name}` matmul: trying to do `A={tuple(self.shape)} @ B={tuple(B.shape)} + C`, "
                f"with A.dtype=B.dtype={self.dtype} and C.dtype={bias.dtype}. "
                "This operation is only supported when A, B and C have the same data type."
            )
        if self.dtype == torch.float8_e4m3fn:
            raise NotImplementedError(
                f"`{cls_name}` matmul: trying to do `A={tuple(self.shape)} @ B={tuple(B.shape)}`, "
                f"with A.dtype=B.dtype={self.dtype}. "
                "mm is not supported for float8_e4m3fn, please use `torch._scaled_mm` instead."
            )
        if self.packed is None:
            raise NotImplementedError(
                f"`{cls_name}` matmul: operation is not supported"
            )
        res = torch._cslt_sparse_mm(
            self.packed,
            B,
            bias=bias,
            transpose_result=self.fuse_transpose_cusparselt,
            alg_id=self.alg_id_cusparselt,
        )
        return res.t() if self.fuse_transpose_cusparselt else res
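

# A training-oriented usage sketch (illustrative, not part of the module API):
# it assumes cuSPARSELt is available on a CUDA device with 2:4 sparse support.
# `prune_dense_static_sort` packs both the pruned tensor and its transpose, which
# the backward pass needs, since the compressed representations cannot be transposed:
#
#   W = torch.rand((128, 128), dtype=torch.float16, device="cuda")
#   W_sparse = SparseSemiStructuredTensorCUSPARSELT.prune_dense_static_sort(W)
#   assert W_sparse.packed is not None and W_sparse.packed_t is not None
#
# Separately, setting `SparseSemiStructuredTensor._FUSE_TRANSPOSE = True` (read by
# `from_dense`) makes `_mm` fuse the output transposition into `torch._cslt_sparse_mm`,
# which the class docstring notes is useful for performant 2:4 sparse training.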