"""
Lexical translation model that ignores word order.

In IBM Model 1, word order is ignored for simplicity. As long as the
word alignments are equivalent, it doesn't matter where the word occurs
in the source or target sentence. Thus, the following three alignments
are equally likely::

    Source: je mange du jambon
    Target: i eat some ham
    Alignment: (0,0) (1,1) (2,2) (3,3)

    Source: je mange du jambon
    Target: some ham eat i
    Alignment: (0,2) (1,3) (2,1) (3,0)

    Source: du jambon je mange
    Target: eat i some ham
    Alignment: (0,3) (1,2) (2,0) (3,1)

Note that an alignment is represented here as
(word_index_in_target, word_index_in_source).

The EM algorithm used in Model 1 is:

:E step: In the training data, count how many times a source language
         word is translated into a target language word, weighted by
         the prior probability of the translation.

:M step: Estimate the new probability of translation based on the
         counts from the Expectation step.

Notations
---------

:i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
:j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
:s: A word in the source language
:t: A word in the target language

References
----------

Philipp Koehn. 2010. Statistical Machine Translation.
Cambridge University Press, New York.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and
Robert L. Mercer. 1993. The Mathematics of Statistical Machine
Translation: Parameter Estimation. Computational Linguistics, 19 (2),
263-311.
"""

import warnings
from collections import defaultdict

from nltk.translate import AlignedSent, Alignment, IBMModel
from nltk.translate.ibm_model import Counts


class IBMModel1(IBMModel):
    """
    Lexical translation model that ignores word order

    >>> bitext = []
    >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
    >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
    >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
    >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))

    >>> ibm1 = IBMModel1(bitext, 5)

    >>> print(round(ibm1.translation_table['buch']['book'], 3))
    0.889
    >>> print(round(ibm1.translation_table['das']['book'], 3))
    0.062
    >>> print(round(ibm1.translation_table['buch'][None], 3))
    0.113
    >>> print(round(ibm1.translation_table['ja'][None], 3))
    0.073

    >>> test_sentence = bitext[2]
    >>> test_sentence.words
    ['das', 'buch', 'ist', 'ja', 'klein']
    >>> test_sentence.mots
    ['the', 'book', 'is', 'small']
    >>> test_sentence.alignment
    Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)])

    """

    def __init__(self, sentence_aligned_corpus, iterations, probability_tables=None):
        """
        Train on ``sentence_aligned_corpus`` and create a lexical
        translation model.

        Translation direction is from ``AlignedSent.mots`` to
        ``AlignedSent.words``.

        :param sentence_aligned_corpus: Sentence-aligned parallel corpus
        :type sentence_aligned_corpus: list(AlignedSent)

        :param iterations: Number of iterations to run training algorithm
        :type iterations: int

        :param probability_tables: Optional. Use this to pass in custom
            probability values. If not specified, probabilities will be
            set to a uniform distribution, or some other sensible value.
            If specified, the following entry must be present:
            ``translation_table``.
            See ``IBMModel`` for the type and purpose of this table.
        :type probability_tables: dict[str]: object
        """
        super().__init__(sentence_aligned_corpus)

        if probability_tables is None:
            self.set_uniform_probabilities(sentence_aligned_corpus)
        else:
            # Use the caller-supplied translation probabilities
            self.translation_table = probability_tables["translation_table"]

        for n in range(0, iterations):
            self.train(sentence_aligned_corpus)

        self.align_all(sentence_aligned_corpus)

    def set_uniform_probabilities(self, sentence_aligned_corpus):
        initial_prob = 1 / len(self.trg_vocab)
        if initial_prob < IBMModel.MIN_PROB:
            warnings.warn(
                "Target language vocabulary is too large ("
                + str(len(self.trg_vocab))
                + " words). Results may be less accurate."
            )

        for t in self.trg_vocab:
            self.translation_table[t] = defaultdict(lambda: initial_prob)

    def train(self, parallel_corpus):
        counts = Counts()
        for aligned_sentence in parallel_corpus:
            trg_sentence = aligned_sentence.words
            # Prepend the NULL token so target words may align to nothing
            src_sentence = [None] + aligned_sentence.mots

            # E step (a): compute the per-target-word normalization factors
            total_count = self.prob_all_alignments(src_sentence, trg_sentence)

            # E step (b): collect normalized counts
            for t in trg_sentence:
                for s in src_sentence:
                    count = self.prob_alignment_point(s, t)
                    normalized_count = count / total_count[t]
                    counts.t_given_s[t][s] += normalized_count
                    counts.any_t_given_s[s] += normalized_count

        # M step: re-estimate translation probabilities from the counts
        self.maximize_lexical_translation_probabilities(counts)

    def prob_all_alignments(self, src_sentence, trg_sentence):
        """
        Computes the probability of all possible word alignments,
        expressed as a marginal distribution over target words t

        Each entry in the return value represents the contribution to
        the total alignment probability by the target word t.

        To obtain probability(alignment | src_sentence, trg_sentence),
        simply sum the entries in the return value.

        :return: Probability of t for all s in ``src_sentence``
        :rtype: dict(str): float
        """
        alignment_prob_for_t = defaultdict(float)
        for t in trg_sentence:
            for s in src_sentence:
                alignment_prob_for_t[t] += self.prob_alignment_point(s, t)
        return alignment_prob_for_t

    def prob_alignment_point(self, s, t):
        """
        Probability that word ``t`` in the target sentence is aligned to
        word ``s`` in the source sentence
        """
        return self.translation_table[t][s]

    def prob_t_a_given_s(self, alignment_info):
        """
        Probability of target sentence and an alignment given the
        source sentence
        """
        prob = 1.0

        for j, i in enumerate(alignment_info.alignment):
            if j == 0:
                continue  # position 0 of the target side is a placeholder
            trg_word = alignment_info.trg_sentence[j]
            src_word = alignment_info.src_sentence[i]
            prob *= self.translation_table[trg_word][src_word]

        return max(prob, IBMModel.MIN_PROB)

    def align_all(self, parallel_corpus):
        for sentence_pair in parallel_corpus:
            self.align(sentence_pair)

    def align(self, sentence_pair):
        """
        Determines the best word alignment for one sentence pair from
        the corpus that the model was trained on.

        The best alignment will be set in ``sentence_pair`` when the
        method returns. In contrast with the internal implementation of
        IBM models, the word indices in the ``Alignment`` are
        zero-indexed, not one-indexed.

        :param sentence_pair: A sentence in the source language and its
            counterpart sentence in the target language
        :type sentence_pair: AlignedSent
        """
        best_alignment = []

        for j, trg_word in enumerate(sentence_pair.words):
            # Start by aligning trg_word with the NULL token
            best_prob = max(self.translation_table[trg_word][None], IBMModel.MIN_PROB)
            best_alignment_point = None
            for i, src_word in enumerate(sentence_pair.mots):
                align_prob = self.translation_table[trg_word][src_word]
                if align_prob >= best_prob:  # on a tie, the later source word wins
                    best_prob = align_prob
                    best_alignment_point = i

            best_alignment.append((j, best_alignment_point))

        sentence_pair.alignment = Alignment(best_alignment)