"""
Lexical translation model that considers word order.

IBM Model 2 improves on Model 1 by accounting for word order.
An alignment probability is introduced, a(i | j,l,m), which predicts
a source word position, given its aligned target word's position.

The EM algorithm used in Model 2 is:

:E step: In the training data, collect counts, weighted by prior
         probabilities.

         - (a) count how many times a source language word is translated
               into a target language word
         - (b) count how many times a particular position in the source
               sentence is aligned to a particular position in the target
               sentence

:M step: Estimate new probabilities based on the counts from the E step

Notations
---------

:i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
:j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
:l: Number of words in the source sentence, excluding NULL
:m: Number of words in the target sentence
:s: A word in the source language
:t: A word in the target language

References
----------

Philipp Koehn. 2010. Statistical Machine Translation.
Cambridge University Press, New York.

Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and
Robert L. Mercer. 1993. The Mathematics of Statistical Machine
Translation: Parameter Estimation. Computational Linguistics, 19 (2),
263-311.
"""

import warnings
from collections import defaultdict

from nltk.translate import AlignedSent, Alignment, IBMModel, IBMModel1
from nltk.translate.ibm_model import Counts


class IBMModel2(IBMModel):
    """
    Lexical translation model that considers word order

    >>> bitext = []
    >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
    >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
    >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
    >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))

    >>> ibm2 = IBMModel2(bitext, 5)

    >>> print(round(ibm2.translation_table['buch']['book'], 3))
    1.0
    >>> print(round(ibm2.translation_table['das']['book'], 3))
    0.0
    >>> print(round(ibm2.translation_table['buch'][None], 3))
    0.0
    >>> print(round(ibm2.translation_table['ja'][None], 3))
    0.0

    >>> print(round(ibm2.alignment_table[1][1][2][2], 3))
    0.939
    >>> print(round(ibm2.alignment_table[1][2][2][2], 3))
    0.0
    >>> print(round(ibm2.alignment_table[2][2][4][5], 3))
    1.0

    >>> test_sentence = bitext[2]
    >>> test_sentence.words
    ['das', 'buch', 'ist', 'ja', 'klein']
    >>> test_sentence.mots
    ['the', 'book', 'is', 'small']
    >>> test_sentence.alignment
    Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)])
    """

    def __init__(self, sentence_aligned_corpus, iterations, probability_tables=None):
        """
        Train on ``sentence_aligned_corpus`` and create a lexical
        translation model and an alignment model.

        Translation direction is from ``AlignedSent.mots`` to
        ``AlignedSent.words``.

        :param sentence_aligned_corpus: Sentence-aligned parallel corpus
        :type sentence_aligned_corpus: list(AlignedSent)

        :param iterations: Number of iterations to run training algorithm
        :type iterations: int

        :param probability_tables: Optional. Use this to pass in custom
            probability values. If not specified, probabilities will be
            set to a uniform distribution, or some other sensible value.
            If specified, all the following entries must be present:
            ``translation_table``, ``alignment_table``.
            See ``IBMModel`` for the type and purpose of these tables.
        :type probability_tables: dict[str]: object
        """
        super().__init__(sentence_aligned_corpus)

        if probability_tables is None:
            # Get translation probabilities from IBM Model 1.
            # Run more iterations of training for Model 1, since it is
            # faster than Model 2.
            ibm1 = IBMModel1(sentence_aligned_corpus, 2 * iterations)
            self.translation_table = ibm1.translation_table
            self.set_uniform_probabilities(sentence_aligned_corpus)
        else:
            # Set user-defined probabilities
            self.translation_table = probability_tables["translation_table"]
            self.alignment_table = probability_tables["alignment_table"]

        for n in range(0, iterations):
            self.train(sentence_aligned_corpus)

        self.align_all(sentence_aligned_corpus)

    def set_uniform_probabilities(self, sentence_aligned_corpus):
        # a(i | j,l,m) = 1 / (l+1) for all i, j, l, m
        l_m_combinations = set()
        for aligned_sentence in sentence_aligned_corpus:
            l = len(aligned_sentence.mots)
            m = len(aligned_sentence.words)
            if (l, m) not in l_m_combinations:
                l_m_combinations.add((l, m))
                initial_prob = 1 / (l + 1)
                if initial_prob < IBMModel.MIN_PROB:
                    warnings.warn(
                        "A source sentence is too long ("
                        + str(l)
                        + " words). Results may be less accurate."
                    )

                for i in range(0, l + 1):
                    for j in range(1, m + 1):
                        self.alignment_table[i][j][l][m] = initial_prob

    def train(self, parallel_corpus):
        counts = Model2Counts()
        for aligned_sentence in parallel_corpus:
            src_sentence = [None] + aligned_sentence.mots
            trg_sentence = ["UNUSED"] + aligned_sentence.words  # 1-indexed
            l = len(aligned_sentence.mots)
            m = len(aligned_sentence.words)

            # E step (a): Compute normalization factors to weigh counts
            total_count = self.prob_all_alignments(src_sentence, trg_sentence)

            # E step (b): Collect counts
            for j in range(1, m + 1):
                t = trg_sentence[j]
                for i in range(0, l + 1):
                    s = src_sentence[i]
                    count = self.prob_alignment_point(i, j, src_sentence, trg_sentence)
                    normalized_count = count / total_count[t]

                    counts.update_lexical_translation(normalized_count, s, t)
                    counts.update_alignment(normalized_count, i, j, l, m)

        # M step: Update probabilities with maximum likelihood estimates
        self.maximize_lexical_translation_probabilities(counts)
        self.maximize_alignment_probabilities(counts)

    def maximize_alignment_probabilities(self, counts):
        MIN_PROB = IBMModel.MIN_PROB
        for i, j_s in counts.alignment.items():
            for j, src_sentence_lengths in j_s.items():
                for l, trg_sentence_lengths in src_sentence_lengths.items():
                    for m in trg_sentence_lengths:
                        estimate = (
                            counts.alignment[i][j][l][m]
                            / counts.alignment_for_any_i[j][l][m]
                        )
                        self.alignment_table[i][j][l][m] = max(estimate, MIN_PROB)

    def prob_all_alignments(self, src_sentence, trg_sentence):
        """
        Computes the probability of all possible word alignments,
        expressed as a marginal distribution over target words t

        Each entry in the return value represents the contribution to
        the total alignment probability by the target word t.

        To obtain probability(alignment | src_sentence, trg_sentence),
        simply sum the entries in the return value.

        :return: Probability of t for all s in ``src_sentence``
        :rtype: dict(str): float
        """
        alignment_prob_for_t = defaultdict(lambda: 0.0)
        for j in range(1, len(trg_sentence)):
            t = trg_sentence[j]
            for i in range(0, len(src_sentence)):
                alignment_prob_for_t[t] += self.prob_alignment_point(
                    i, j, src_sentence, trg_sentence
                )
        return alignment_prob_for_t

    def prob_alignment_point(self, i, j, src_sentence, trg_sentence):
        """
        Probability that position j in ``trg_sentence`` is aligned to
        position i in the ``src_sentence``
        """
        l = len(src_sentence) - 1
        m = len(trg_sentence) - 1
        s = src_sentence[i]
        t = trg_sentence[j]
        return self.translation_table[t][s] * self.alignment_table[i][j][l][m]

    def prob_t_a_given_s(self, alignment_info):
        """
        Probability of target sentence and an alignment given the
        source sentence
        """
        prob = 1.0
        l = len(alignment_info.src_sentence) - 1
        m = len(alignment_info.trg_sentence) - 1

        for j, i in enumerate(alignment_info.alignment):
            if j == 0:
                continue  # skip the dummy zeroeth alignment
            trg_word = alignment_info.trg_sentence[j]
            src_word = alignment_info.src_sentence[i]
            prob *= (
                self.translation_table[trg_word][src_word]
                * self.alignment_table[i][j][l][m]
            )

        return max(prob, IBMModel.MIN_PROB)

    def align_all(self, parallel_corpus):
        for sentence_pair in parallel_corpus:
            self.align(sentence_pair)

    def align(self, sentence_pair):
        """
        Determines the best word alignment for one sentence pair from
        the corpus that the model was trained on.

        The best alignment will be set in the sentence pair's
        ``alignment`` field.

        :param sentence_pair: A sentence in the source language and its
            counterpart sentence in the target language
        :type sentence_pair: AlignedSent
        """
        best_alignment = []

        l = len(sentence_pair.mots)
        m = len(sentence_pair.words)

        for j, trg_word in enumerate(sentence_pair.words):
            # Initialize trg_word to align with the NULL token
            best_prob = (
                self.translation_table[trg_word][None]
                * self.alignment_table[0][j + 1][l][m]
            )
            best_prob = max(best_prob, IBMModel.MIN_PROB)
            best_alignment_point = None
            for i, src_word in enumerate(sentence_pair.mots):
                align_prob = (
                    self.translation_table[trg_word][src_word]
                    * self.alignment_table[i + 1][j + 1][l][m]
                )
                if align_prob >= best_prob:
                    best_prob = align_prob
                    best_alignment_point = i

            best_alignment.append((j, best_alignment_point))

        sentence_pair.alignment = Alignment(best_alignment)


class Model2Counts(Counts):
    """
    Data object to store counts of various parameters during training.
    Includes counts for alignment.
    """

    def __init__(self):
        super().__init__()
        self.alignment = defaultdict(
            lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0.0)))
        )
        self.alignment_for_any_i = defaultdict(
            lambda: defaultdict(lambda: defaultdict(lambda: 0.0))
        )

    def update_lexical_translation(self, count, s, t):
        self.t_given_s[t][s] += count
        self.any_t_given_s[s] += count

    def update_alignment(self, count, i, j, l, m):
        self.alignment[i][j][l][m] += count
        self.alignment_for_any_i[j][l][m] += count
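
# ---------------------------------------------------------------------------
# Usage sketch (not part of the original module): a minimal demonstration of
# training IBMModel2 on a toy bitext, mirroring the class doctest above.
# Assumes this file is run directly as a script.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    bitext = [
        AlignedSent(["klein", "ist", "das", "haus"], ["the", "house", "is", "small"]),
        AlignedSent(["das", "haus", "ist", "ja", "groß"], ["the", "house", "is", "big"]),
        AlignedSent(["das", "buch", "ist", "ja", "klein"], ["the", "book", "is", "small"]),
        AlignedSent(["das", "haus"], ["the", "house"]),
        AlignedSent(["das", "buch"], ["the", "book"]),
        AlignedSent(["ein", "buch"], ["a", "book"]),
    ]
    ibm2 = IBMModel2(bitext, iterations=5)

    # Lexical translation probability t(buch | book)
    print(round(ibm2.translation_table["buch"]["book"], 3))
    # Alignment probability a(i=1 | j=1, l=2, m=2)
    print(round(ibm2.alignment_table[1][1][2][2], 3))
    # Best alignment found for the third sentence pair, set by align_all()
    print(bitext[2].alignment)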