# coding=utf-8
"""
Generic interface to various configurations of the Perceiver Resampler, that simply takes in a series of (potentially
time-indexed) contextual embeddings, and "resamples" (compresses) them down to a pre-specified number of latents! Note
that the Perceiver in general resamples based solely off the *long-range* context; there's a nice opportunity here to
prime the Perceiver Resampler with say a single layer's worth of language embeddings (the target domain), and use that
to softly "retrieve & compress" what we need --> this would be a novel contribution we should explore.

References:
    - DeepMind's Flamingo: https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
    - Code borrowed w/ love from: https://github.com/lucidrains/flamingo-pytorch
"""
from typing import Optional, Tuple

import torch
import torch.nn as nn

from .configuration_idefics import IdeficsConfig


class IdeficsPerceiverResampler(nn.Module):
    def __init__(
        self, config: IdeficsConfig, embed_dim: int, depth: int, n_heads: int, head_dim: int, n_latents: int
    ) -> None:
        """
        Instantiates a Perceiver Resampler that operates over a sequence of embeddings (say from a ResNet or ViT or
        MAE) of a given dimension, performs `depth` blocks of cross-attention with a fixed `n_latents` inputs, then
        returns a Tensor of shape `[bsz, n_latents, embed_dim]`. Note that `embed_dim` is both the dimensionality of
        the embeddings fed to the Perceiver Resampler and of the latent embeddings it *returns*; it could be e.g. the
        ViT embed_dim, the ResNet pool dim, and so on.

        Args:
            config (`IdeficsConfig`): config object
            embed_dim (`int`): The size of each embedding vector
            depth (`int`): Depth of the Perceiver Resampler (Transformer w/ cross attention). Should be shallow (< 3).
            n_heads (`int`): Number of heads in each Transformer block (for multi-headed self-attention).
            head_dim (`int`): Dimensionality of each head projection in the Transformer block.
            n_latents (`int`):
                Number of latent embeddings to resample ("compress") the input sequence to (usually < 128).
        """
        super().__init__()
        self.embed_dim, self.n_heads, self.head_dim, self.n_latents = embed_dim, n_heads, head_dim, n_latents
        self.qk_layer_norms = config.perceiver_config.qk_layer_norms_perceiver

        # Create learnable latent queries for the Perceiver
        self.latents = nn.Parameter(torch.randn(self.n_latents, self.embed_dim), requires_grad=True)

        self.intermediate_dim = (
            self.embed_dim * 4
            if not hasattr(config.vision_config, "embed_dim")
            else config.vision_config.embed_dim * 4
        )

        # Create Transformer blocks: each block pairs cross-attention with a feed-forward MLP
        self.blocks = nn.ModuleList(
            [
                nn.ModuleList(
                    [
                        IdeficsPerceiverAttention(self.embed_dim, self.n_heads, self.head_dim, self.qk_layer_norms),
                        IdeficsMLP(self.intermediate_dim, config),
                    ]
                )
                for _ in range(depth)
            ]
        )
        self.layer_norm = nn.LayerNorm(self.embed_dim)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        """Resample arbitrary length context & *compress* down to self.n_latents latent embeddings"""
        # Broadcast the learned latents across the batch dimension
        latents = self.latents.repeat(context.shape[0], 1, 1)

        # Feed through the Perceiver attention blocks, with residual connections around attention and MLP
        for attn, ff in self.blocks:
            latents = attn(context, latents) + latents
            latents = ff(latents) + latents

        return self.layer_norm(latents)


class IdeficsPerceiverAttention(nn.Module):
    def __init__(self, embed_dim: int, n_heads: int, head_dim: int, qk_layer_norms: bool) -> None:
        """Perceiver Cross-Attention Module --> let long-form inputs be `context`, resampled embeddings be `latents`"""
        super().__init__()
        self.embed_dim, self.n_heads, self.head_dim = embed_dim, n_heads, head_dim
        self.qk_layer_norms = qk_layer_norms

        # Normalization & scaling
        self.context_layer_norm = nn.LayerNorm(self.embed_dim)
        self.latents_layer_norm = nn.LayerNorm(self.embed_dim)
        if self.qk_layer_norms:
            self.q_layer_norm = nn.LayerNorm(self.head_dim)
            self.k_layer_norm = nn.LayerNorm(self.head_dim)

        self.qk_scale = self.head_dim**-0.5

        # Q, K, V projections (no bias -- a detail from the Perceiver/Flamingo papers)
        self.q_proj = nn.Linear(self.embed_dim, self.n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.embed_dim, self.n_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.embed_dim, self.n_heads * self.head_dim, bias=False)

        self.output_proj = nn.Linear(self.n_heads * self.head_dim, embed_dim, bias=False)

    def forward(self, context: torch.Tensor, latents: torch.Tensor) -> torch.Tensor:
        """
        Runs Perceiver Self-Attention, with special (context, latents) appended along the `seq` dimension!
        Args:
            context (`torch.Tensor`):
                Tensor of shape `[bsz, seq, embed_dim]` representing long-form context to resample.
            latents (`torch.Tensor`):
                Tensor of shape `[bsz, n_latents, embed_dim]` representing fixed length latents to compress to.

        Returns:
            `torch.Tensor`: Tensor of shape `[bsz, n_latents, embed_dim]` representing attention over latents w/ cross
            from context.
        """
        context = self.context_layer_norm(context)
        latents = self.latents_layer_norm(latents)
        batch_size, seq_length, embed_dim = context.shape[:3]

        # Queries come from the latents; keys/values attend over the concatenation of context and latents,
        # so queries have `seq = n_latents`, while keys/values have `seq = len(context) + n_latents`
        q = self.q_proj(latents)
        k = self.k_proj(torch.cat([context, latents], dim=-2))
        v = self.v_proj(torch.cat([context, latents], dim=-2))

        # Split heads: [bsz, seq, n_heads * head_dim] -> [bsz, n_heads, seq, head_dim]
        q, k, v = [x.reshape(batch_size, x.shape[1], self.n_heads, self.head_dim).transpose(1, 2) for x in (q, k, v)]

        if self.qk_layer_norms:
            q = self.q_layer_norm(q)
            k = self.k_layer_norm(k)

        # Scaled dot-product attention w/ stable softmax (subtract the per-row max before the softmax call)
        scores = torch.einsum("... i d, ... j d -> ... i j", q * self.qk_scale, k)
        stabilized_scores = scores - (scores.amax(dim=-1, keepdim=True).detach())
        attn = stabilized_scores.softmax(dim=-1)

        # Attend, then merge heads back: [bsz, n_heads, n_latents, head_dim] -> [bsz, n_latents, n_heads * head_dim]
        resampled = torch.einsum("... i j, ... j d -> ... i d", attn, v)
        return self.output_proj(resampled.transpose(1, 2).flatten(-2))


class IdeficsMLP(nn.Module):
    def __init__(self, intermediate_size, config: IdeficsConfig):
        """Simple MLP block with intermediate_size and embedding size"""
        super().__init__()
        self.embed_dim = config.vision_config.embed_dim
        self.ln = nn.LayerNorm(self.embed_dim)
        self.fc = nn.Linear(self.embed_dim, intermediate_size, bias=False)
        self.act = nn.ReLU()
        self.c_proj = nn.Linear(intermediate_size, self.embed_dim, bias=False)

    def forward(self, hidden_states: Optional[Tuple[torch.FloatTensor]]) -> torch.FloatTensor:
        hidden_states = self.ln(hidden_states)
        hidden_states = self.fc(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.c_proj(hidden_states)

        return hidden_states
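

# A minimal usage sketch wiring the resampler up end-to-end. The depth / head /
# latent sizes below are illustrative assumptions, not the values IDEFICS ships
# with; `IdeficsConfig()` is assumed to construct default `vision_config` and
# `perceiver_config` sub-configs exposing the fields used above. Because of the
# relative import at the top of this file, run it as a module, e.g.
# `python -m transformers.models.idefics.perceiver`.
if __name__ == "__main__":
    config = IdeficsConfig()
    embed_dim = config.vision_config.embed_dim

    resampler = IdeficsPerceiverResampler(
        config, embed_dim=embed_dim, depth=2, n_heads=8, head_dim=64, n_latents=64
    )

    # e.g., a batch of 2 sequences of 257 ViT patch embeddings...
    context = torch.randn(2, 257, embed_dim)
    # ...compressed down to 64 latent embeddings of the same dimensionality
    latents = resampler(context)
    print(latents.shape)  # torch.Size([2, 64, embed_dim])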