"""PyTorch IdeficsVision model: a copy of CLIPVisionModel using a simpler config object"""

import math
from dataclasses import dataclass
from typing import Callable, Optional, Union

import torch
from torch import nn

from ...activations import ACT2FN
from ...modeling_layers import GradientCheckpointingLayer
from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling
from ...modeling_utils import ALL_ATTENTION_FUNCTIONS
from ...utils import ModelOutput, can_return_tuple, logging
from .configuration_idefics import IdeficsVisionConfig


logger = logging.get_logger(__name__)


@dataclass
class IdeficsVisionModelOutput(ModelOutput):
    """
    Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden
    states.

    Args:
        image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`):
            The image embeddings obtained by applying the projection layer to the pooler_output.
        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
            Sequence of hidden-states at the output of the last layer of the model.
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
    """

    image_embeds: Optional[torch.FloatTensor] = None
    last_hidden_state: Optional[torch.FloatTensor] = None
    hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[tuple[torch.FloatTensor, ...]] = None


class IdeficsVisionEmbeddings(nn.Module):
    def __init__(self, config: IdeficsVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size

        self.class_embedding = nn.Parameter(torch.randn(self.embed_dim))

        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,
            bias=False,
        )

        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches + 1
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)

    def interpolate_pos_encoding(self, embeddings: torch.Tensor, height: int, width: int) -> torch.Tensor:
        """
        Interpolate the pre-trained position encodings so the model can be used on higher-resolution images.

        Source:
        https://github.com/facebookresearch/dino/blob/de9ee3df6cf39fac952ab558447af1fa1365362a/vision_transformer.py#L174
        """

        num_patches = embeddings.shape[1] - 1
        pos_embed = self.position_embedding(self.position_ids)
        num_positions = pos_embed.shape[1] - 1
        if num_patches == num_positions and height == width:
            return pos_embed
        class_pos_embed = pos_embed[:, 0]
        patch_pos_embed = pos_embed[:, 1:]

        embed_dim = embeddings.shape[-1]
        num_h_patches = height // self.config.patch_size
        num_w_patches = width // self.config.patch_size
        # Add a small offset to avoid floating point errors when computing the interpolation scale factor.
        num_h_patches, num_w_patches = num_h_patches + 0.1, num_w_patches + 0.1
        sqrt_num_positions = math.sqrt(num_positions)
        patch_pos_embed = patch_pos_embed.reshape(1, int(sqrt_num_positions), int(sqrt_num_positions), embed_dim)
        patch_pos_embed = patch_pos_embed.permute(0, 3, 1, 2)
        fp32_upcasting = patch_pos_embed.dtype == torch.bfloat16
        if fp32_upcasting:
            logger.warning_once(
                "Upcasting patch_pos_embed to fp32 for interpolation since `upsample_bicubic2d_out_frame` in "
                "nn.functional.interpolate is not implemented for 'torch.bfloat16' dtype. "
                "This will result in a slight overhead."
            )
            patch_pos_embed = patch_pos_embed.to(torch.float)
        patch_pos_embed = nn.functional.interpolate(
            patch_pos_embed,
            scale_factor=(num_h_patches / sqrt_num_positions, num_w_patches / sqrt_num_positions),
            mode="bicubic",
            align_corners=False,
        )
        if fp32_upcasting:
            patch_pos_embed = patch_pos_embed.to(torch.bfloat16)
        if int(num_h_patches) != patch_pos_embed.shape[-2] or int(num_w_patches) != patch_pos_embed.shape[-1]:
            raise ValueError(
                f"Number of patches for images ({int(num_h_patches), int(num_w_patches)}) doesn't match the "
                f"shape of position embedding ({patch_pos_embed.shape[-2], patch_pos_embed.shape[-1]})"
            )
        patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, embed_dim)
        return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1)

    def forward(self, pixel_values: torch.FloatTensor, interpolate_pos_encoding: bool = False) -> torch.Tensor:
        batch_size, num_channels, height, width = pixel_values.shape
        if not interpolate_pos_encoding:
            if height != self.image_size or width != self.image_size:
                raise ValueError(
                    f"Input image size ({height}*{width}) doesn't match model"
                    f" ({self.image_size}*{self.image_size}). You should try to set `interpolate_pos_encoding=True`"
                )

        target_dtype = self.patch_embedding.weight.dtype
        patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, embed_dim, grid, grid]
        patch_embeds = patch_embeds.flatten(2).transpose(1, 2)

        class_embeds = self.class_embedding.expand(batch_size, 1, -1)
        embeddings = torch.cat([class_embeds, patch_embeds], dim=1)

        # Add positional encodings to each token, interpolated if the input resolution differs from training.
        if interpolate_pos_encoding:
            embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
        else:
            embeddings = embeddings + self.position_embedding(self.position_ids)

        return embeddings
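
# Worked example for `interpolate_pos_encoding` above (illustrative values, not taken
# from any particular checkpoint): a model trained with image_size=224 and patch_size=14
# stores (224 // 14) ** 2 = 256 patch positions plus one CLS position, laid out as a
# 16x16 grid. Feeding a 336x336 image with `interpolate_pos_encoding=True` yields
# (336 // 14) ** 2 = 576 patches, so the 16x16 position grid is bicubically resized to
# 24x24 (576 entries) and the CLS position is re-attached, giving position encodings
# for all 577 tokens.
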
def eager_attention_forward(
    module: nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: float,
    dropout: float = 0.0,
    **kwargs,
):
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value)
    attn_output = attn_output.transpose(1, 2).contiguous()

    return attn_output, attn_weights


class IdeficsVisionAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: IdeficsVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.embed_dim // self.num_heads
        if self.head_dim * self.num_heads != self.embed_dim:
            raise ValueError(
                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
                f" {self.num_heads})."
            )
        self.scale = self.head_dim**-0.5
        self.dropout = config.attention_dropout
        self.is_causal = False

        self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        causal_attention_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = False,
    ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
        """Input shape: Batch x Time x Channel"""

        batch_size, seq_length, embed_dim = hidden_states.shape

        queries = self.q_proj(hidden_states)
        keys = self.k_proj(hidden_states)
        values = self.v_proj(hidden_states)

        queries = queries.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        keys = keys.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        values = values.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)

        if self.config._attn_implementation != "flash_attention_2":
            # Merge the padding mask and the causal mask when both are provided.
            if attention_mask is not None and causal_attention_mask is not None:
                attention_mask = attention_mask + causal_attention_mask
            elif causal_attention_mask is not None:
                attention_mask = causal_attention_mask
        else:
            self.is_causal = causal_attention_mask is not None

        attention_interface: Callable = eager_attention_forward
        if self.config._attn_implementation != "eager":
            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        attn_output, attn_weights = attention_interface(
            self,
            queries,
            keys,
            values,
            attention_mask,
            is_causal=self.is_causal,
            scaling=self.scale,
            dropout=0.0 if not self.training else self.dropout,
        )

        attn_output = attn_output.reshape(batch_size, seq_length, embed_dim).contiguous()
        attn_output = self.out_proj(attn_output)

        if not output_attentions:
            attn_weights = None
        return attn_output, attn_weights


class IdeficsVisionMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.activation_fn = ACT2FN[config.hidden_act]
        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.fc1(hidden_states)
        hidden_states = self.activation_fn(hidden_states)
        hidden_states = self.fc2(hidden_states)
        return hidden_states


class IdeficsVisionEncoderLayer(GradientCheckpointingLayer):
    def __init__(self, config: IdeficsVisionConfig):
        super().__init__()
        self.embed_dim = config.hidden_size
        self.self_attn = IdeficsVisionAttention(config)
        self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
        self.mlp = IdeficsVisionMLP(config)
        self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        causal_attention_mask: torch.Tensor,
        output_attentions: Optional[bool] = False,
    ) -> tuple[torch.FloatTensor]:
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`): attention mask of size
                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
        """
        residual = hidden_states

        hidden_states = self.layer_norm1(hidden_states)
        hidden_states, attn_weights = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            causal_attention_mask=causal_attention_mask,
            output_attentions=output_attentions,
        )
        hidden_states = residual + hidden_states

        residual = hidden_states
        hidden_states = self.layer_norm2(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (attn_weights,)

        return outputs
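
# Shape sketch for one IdeficsVisionEncoderLayer above (a reading aid, not part of the
# original file): both sub-layers are pre-norm residual updates that preserve the
# (batch, seq_len, embed_dim) shape:
#
#     hidden_states = hidden_states + self_attn(layer_norm1(hidden_states))
#     hidden_states = hidden_states + mlp(layer_norm2(hidden_states))
#
# Inside the attention sub-layer, tensors are reshaped to
# (batch, num_heads, seq_len, head_dim) and dispatched either to
# `eager_attention_forward` or to a kernel from ALL_ATTENTION_FUNCTIONS, selected via
# `config._attn_implementation`.
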
class IdeficsVisionEncoder(nn.Module):
    """
    Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
    [`IdeficsVisionEncoderLayer`].

    Args:
        config: IdeficsVisionConfig
    """

    def __init__(self, config: IdeficsVisionConfig):
        super().__init__()
        self.config = config
        self.layers = nn.ModuleList([IdeficsVisionEncoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.gradient_checkpointing = False

    def forward(
        self,
        inputs_embeds,
        attention_mask: Optional[torch.Tensor] = None,
        causal_attention_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, BaseModelOutput]:
        r"""
        Args:
            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated
                vectors than the model's internal embedding lookup matrix.
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)
            causal_attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Causal mask for the text model. Mask values selected in `[0, 1]`:

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            output_hidden_states (`bool`, *optional*):
                Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
                for more detail.
            return_dict (`bool`, *optional*):
                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        encoder_states = () if output_hidden_states else None
        all_attentions = () if output_attentions else None

        hidden_states = inputs_embeds
        for idx, encoder_layer in enumerate(self.layers):
            if output_hidden_states:
                encoder_states = encoder_states + (hidden_states,)
            layer_outputs = encoder_layer(
                hidden_states,
                attention_mask,
                causal_attention_mask,
                output_attentions=output_attentions,
            )

            hidden_states = layer_outputs[0]

            if output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)

        if output_hidden_states:
            encoder_states = encoder_states + (hidden_states,)

        if not return_dict:
            return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
        return BaseModelOutput(
            last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions
        )


class IdeficsVisionTransformer(nn.Module):
    def __init__(self, config: IdeficsVisionConfig):
        super().__init__()
        self.config = config
        embed_dim = config.hidden_size

        self.embeddings = IdeficsVisionEmbeddings(config)
        # `pre_layrnorm` keeps its original (misspelled) attribute name so pretrained checkpoint weights still load.
        self.pre_layrnorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
        self.encoder = IdeficsVisionEncoder(config)
        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)

    def forward(
        self,
        pixel_values: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        interpolate_pos_encoding: Optional[bool] = False,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, BaseModelOutputWithPooling]:
        r"""
        Returns:

        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if pixel_values is None:
            raise ValueError("You have to specify pixel_values")

        hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
        hidden_states = self.pre_layrnorm(hidden_states)

        encoder_outputs = self.encoder(
            inputs_embeds=hidden_states,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        last_hidden_state = encoder_outputs[0]
        pooled_output = last_hidden_state[:, 0, :]
        pooled_output = self.post_layernorm(pooled_output)

        if not return_dict:
            return (last_hidden_state, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPooling(
            last_hidden_state=last_hidden_state,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )
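
# Minimal smoke-test sketch (not part of the original module). It assumes
# `IdeficsVisionConfig` accepts the keyword arguments below (in this config, `embed_dim`
# is the vision hidden size, exposed to the model as `hidden_size`) and that
# `_attn_implementation` defaults to "eager". Because this file uses relative imports,
# run it as a module, e.g. `python -m transformers.models.idefics.vision`.
if __name__ == "__main__":
    config = IdeficsVisionConfig(
        embed_dim=64,
        intermediate_size=128,
        num_hidden_layers=2,
        num_attention_heads=4,
        image_size=32,
        patch_size=8,
    )
    model = IdeficsVisionTransformer(config).eval()

    with torch.no_grad():
        # Native resolution: (32 // 8) ** 2 = 16 patches + 1 CLS token = 17 positions.
        out = model(pixel_values=torch.randn(1, 3, 32, 32))
        print(out.last_hidden_state.shape)  # expected: torch.Size([1, 17, 64])

        # Doubled resolution with interpolated position encodings:
        # (64 // 8) ** 2 = 64 patches + 1 CLS token = 65 positions.
        out = model(pixel_values=torch.randn(1, 3, 64, 64), interpolate_pos_encoding=True)
        print(out.last_hidden_state.shape)  # expected: torch.Size([1, 65, 64])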