JL iO|dZddlZddlmZddlmZddlmZmZm Z m Z ddl m Z m Z Gdde ZGd d e Zy) u\ Translation model that reorders output words based on their type and distance from other related words in the output sentence. IBM Model 4 improves the distortion model of Model 3, motivated by the observation that certain words tend to be re-ordered in a predictable way relative to one another. For example, in English usually has its order flipped as in French. Model 4 requires words in the source and target vocabularies to be categorized into classes. This can be linguistically driven, like parts of speech (adjective, nouns, prepositions, etc). Word classes can also be obtained by statistical methods. The original IBM Model 4 uses an information theoretic approach to group words into 50 classes for each vocabulary. Terminology ----------- :Cept: A source word with non-zero fertility i.e. aligned to one or more target words. :Tablet: The set of target word(s) aligned to a cept. :Head of cept: The first word of the tablet of that cept. :Center of cept: The average position of the words in that cept's tablet. If the value is not an integer, the ceiling is taken. For example, for a tablet with words in positions 2, 5, 6 in the target sentence, the center of the corresponding cept is ceil((2 + 5 + 6) / 3) = 5 :Displacement: For a head word, defined as (position of head word - position of previous cept's center). Can be positive or negative. For a non-head word, defined as (position of non-head word - position of previous word in the same tablet). Always positive, because successive words in a tablet are assumed to appear to the right of the previous word. In contrast to Model 3 which reorders words in a tablet independently of other words, Model 4 distinguishes between three cases. 1. Words generated by NULL are distributed uniformly. 2. For a head word t, its position is modeled by the probability d_head(displacement | word_class_s(s),word_class_t(t)), where s is the previous cept, and word_class_s and word_class_t maps s and t to a source and target language word class respectively. 3. For a non-head word t, its position is modeled by the probability d_non_head(displacement | word_class_t(t)) The EM algorithm used in Model 4 is: :E step: In the training data, collect counts, weighted by prior probabilities. - (a) count how many times a source language word is translated into a target language word - (b) for a particular word class, count how many times a head word is located at a particular displacement from the previous cept's center - (c) for a particular word class, count how many times a non-head word is located at a particular displacement from the previous target word - (d) count how many times a source word is aligned to phi number of target words - (e) count how many times NULL is aligned to a target word :M step: Estimate new probabilities based on the counts from the E step Like Model 3, there are too many possible alignments to consider. Thus, a hill climbing approach is used to sample good candidates. Notations --------- :i: Position in the source sentence Valid values are 0 (for NULL), 1, 2, ..., length of source sentence :j: Position in the target sentence Valid values are 1, 2, ..., length of target sentence :l: Number of words in the source sentence, excluding NULL :m: Number of words in the target sentence :s: A word in the source language :t: A word in the target language :phi: Fertility, the number of target words produced by a source word :p1: Probability that a target word produced by a source word is accompanied by another target word that is aligned to NULL :p0: 1 - p1 :dj: Displacement, Δj References ---------- Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York. Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311. N defaultdict) factorial) AlignedSent AlignmentIBMModel IBMModel3)Countslongest_target_sentence_lengthcXeZdZdZ d fd ZfdZdZdZdZdZ e dZ xZ S) IBMModel4u Translation model that reorders output words based on their type and their distance from other related words in the output sentence >>> bitext = [] >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus', 'war', 'ja', 'groß'], ['the', 'house', 'was', 'big'])) >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small'])) >>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house'])) >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book'])) >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book'])) >>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book'])) >>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize'])) >>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'was': 3, 'i': 4, 'summarize': 5 } >>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'war': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 } >>> ibm4 = IBMModel4(bitext, 5, src_classes, trg_classes) >>> print(round(ibm4.translation_table['buch']['book'], 3)) 1.0 >>> print(round(ibm4.translation_table['das']['book'], 3)) 0.0 >>> print(round(ibm4.translation_table['ja'][None], 3)) 1.0 >>> print(round(ibm4.head_distortion_table[1][0][1], 3)) 1.0 >>> print(round(ibm4.head_distortion_table[2][0][1], 3)) 0.0 >>> print(round(ibm4.non_head_distortion_table[3][6], 3)) 0.5 >>> print(round(ibm4.fertility_table[2]['summarize'], 3)) 1.0 >>> print(round(ibm4.fertility_table[1]['book'], 3)) 1.0 >>> print(round(ibm4.p1, 3)) 0.033 >>> test_sentence = bitext[2] >>> test_sentence.words ['das', 'buch', 'ist', 'ja', 'klein'] >>> test_sentence.mots ['the', 'book', 'is', 'small'] >>> test_sentence.alignment Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)]) ct|||j||_||_|bt ||}|j |_|j|_|j|_|j|_ |j|n<|d|_|d|_|d|_|d|_ |d|_ |d|_ td|D]}|j|y) a Train on ``sentence_aligned_corpus`` and create a lexical translation model, distortion models, a fertility model, and a model for generating NULL-aligned words. Translation direction is from ``AlignedSent.mots`` to ``AlignedSent.words``. :param sentence_aligned_corpus: Sentence-aligned parallel corpus :type sentence_aligned_corpus: list(AlignedSent) :param iterations: Number of iterations to run training algorithm :type iterations: int :param source_word_classes: Lookup table that maps a source word to its word class, the latter represented by an integer id :type source_word_classes: dict[str]: int :param target_word_classes: Lookup table that maps a target word to its word class, the latter represented by an integer id :type target_word_classes: dict[str]: int :param probability_tables: Optional. Use this to pass in custom probability values. If not specified, probabilities will be set to a uniform distribution, or some other sensible value. If specified, all the following entries must be present: ``translation_table``, ``alignment_table``, ``fertility_table``, ``p1``, ``head_distortion_table``, ``non_head_distortion_table``. See ``IBMModel`` and ``IBMModel4`` for the type and purpose of these tables. :type probability_tables: dict[str]: object Ntranslation_tablealignment_tablefertility_tablep1head_distortion_tablenon_head_distortion_tabler)super__init__reset_probabilities src_classes trg_classesr rrrrset_uniform_probabilitiesrrrangetrain) selfsentence_aligned_corpus iterationssource_word_classestarget_word_classesprobability_tablesibm3n __class__s Y/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/nltk/translate/ibm4.pyrzIBMModel4.__init__sP 01   "..  %4jAD%)%;%;D "#'#7#7D #'#7#7D ggDG  * *+B C&88K%LD "#56G#HD #56G#HD (.DG);zSIBMModel4.reset_probabilities......s DMMr'rr/sr&r0zAIBMModel4.reset_probabilities....s 4I(Jr'rr/sr&r0z/IBMModel4.reset_probabilities..s K JKr'c tfdS)NcjSr,r-r/sr&r0zAIBMModel4.reset_probabilities....s  r'rr/sr&r0z/IBMModel4.reset_probabilities..s K 56r')rrrrrrr%s`r&rzIBMModel4.reset_probabilitiess: #%%0 K& " *5 6* & r'ct|}|dkrtjn dd|dz zz tjkr$tjdt |zdzt d|D]p}tfd|j|<tfd|j| <tfd|j|<tfd|j| <ry ) zj Set distortion probabilities uniformly to 1 / cardinality of displacement values zA target sentence is too long (z& words). Results may be less accurate.c tfdS)NcSr, initial_probsr&r0zGIBMModel4.set_uniform_probabilities....Lr'rr:sr&r0z5IBMModel4.set_uniform_probabilities.. $89r'c tfdS)NcSr,r9r:sr&r0zGIBMModel4.set_uniform_probabilities....r<r'rr:sr&r0z5IBMModel4.set_uniform_probabilities..r=r'cSr,r9r:sr&r0z5IBMModel4.set_uniform_probabilities.. s\r'cSr,r9r:sr&r0z5IBMModel4.set_uniform_probabilities..!slr'N) r rr.warningswarnstrrrrr)rrmax_mdjr;s @r&rz#IBMModel4.set_uniform_probabilitiess //FG A:#,,LUQY0L (++ + MM1e*:;  5/ TB-89.D & &r */:9/D & &s +2==Q1RD * *2 .2=>R2SD * *B3 / Tr'c t}|D]}t|j}|j|\}}t |j |_|j|}|D]}|j|} | |z } td|dzD]>} |j| || |j| || |j|j@|j| ||j| ||j } |j#| |_|j%||j'||j)||j+|y)Nr5) Model4Countslenwordssamplerzero_indexed_alignment alignmentprob_of_alignmentsprob_t_a_given_srupdate_lexical_translationupdate_distortionrrupdate_null_generationupdate_fertilityrr*maximize_lexical_translation_probabilities!maximize_distortion_probabilities maximize_fertility_probabilities&maximize_null_generation_probabilities) rparallel_corpuscountsaligned_sentencemsampled_alignmentsbest_alignment total_countalignment_infocountnormalized_countjexisting_alignment_tables r&rzIBMModel4.train#sv / J $**+A26=M1N . )2557*  & 112DEK#5 J--n=#(;#6 q!a% A55(.!,,(&((((  --.>O''(8.I# J JF$(#7#7    "7 77? ..v6 --f5 33F;r'c"|j}|jjD]o\}}|jD]W\}}|D]M}|j||||j||z }t |t j ||||<OYq|j} |jjD]N\}}|D]D}|j|||j|z }t |t j | ||<FPyr,) rhead_distortionitemshead_distortion_for_any_djmaxrr.rnon_head_distortionnon_head_distortion_for_any_dj) rrY head_d_tablerFrs_clsrt_clsestimatenon_head_d_tables r&rUz+IBMModel4.maximize_distortion_probabilitiesQsH11 %55;;= VOB &1&7&7&9 V"{(VE..r259%@ ;;EB5IJ698CTCT5UL$U+E2 V V V 99%99??A OOB $ O..r259;;EBC/2(H.null_generation_termqsEBRB+::1=NN//014A S^,s2q1~;M7M/NN NEx1nq01 :!n,q01499 :Lr'cd}j}tdt|D]@}j|}|t |j |||zz}|ks>cS|S)Nrsr5) src_sentencerrIrurr)rxrr{ fertilityr.r_r|s r&fertility_termz9IBMModel4.model4_prob_t_a_given_s..fertility_termsE)66L1c,/0 $*99!< i(// :<?KL8##O $Lr'cj|}j|}j|}j||Sr,)rvrMrr)rbtr{sr_r|s r&lexical_translation_termzCIBMModel4.model4_prob_t_a_given_s..lexical_translation_termsI++A.A((+A++A.A..q1!4 4r'c j|} j|}|dk(ry j|rk j|}d}| j|} j |} j |}| j|z } j|||S j|} j |}||z } j||S)Nrrs) rvrM is_head_word previous_ceptrrrcenter_of_ceptrprevious_in_tabletr) rbrr{r src_class previous_s trg_classrFprevious_positionr_r|s r&distortion_termz:IBMModel4.model4_prob_t_a_given_s..distortion_terms++A.A((+AAv**1- . < ...s E(:r'rr9r'r&r0z'Model4Counts.__init__..s K :;r'c ttSr,rr9r'r&r0z'Model4Counts.__init__..s k%>Pr'c ttSr,rr9r'r&r0z'Model4Counts.__init__..s {57Ir')rrrrergrirrjr3s r&rzModel4Counts.__init__sI * ; +66P*Q'#./I#J .9%.@+r'c|j|}|j|}|dk(ry|j|r{|j|}||j|} || } nd} ||} ||j |z } |j | | | xx|z cc<|j| | xx|z cc<y|j|} ||} || z } |j| | xx|z cc<|j| xx|z cc<y)Nr) rMrvrrrrrergrrirj)rr`r_rbrrr{rrprevious_src_wordrrrF previous_js r&rQzModel4Counts.update_distortions  $ $Q '  ' ' * 6   ( ( +*88;M($2$?$? $N!'(9:  #AI^22=AAB   $Y / :e C :  + +I 6y AU J A(::1=J#AIZB  $ $R ( 3u < 3  / / :e C :r')rrrrrrQrrs@r&rHrHs ADr'rH)rrB collectionsrmathrnltk.translaterrrr nltk.translate.ibm_modelr r r rHr9r'r&rs=dL#FFKJJZ 'D6'Dr'