JL iF:0ddlZddlZddlZddlZddlmZddlmZmZm Z ddl m Z m Z ddl mZmZdZdZdZd Zd Zd Zd Zd ZdZdZdZdZ ddZdZddZeddgZegdZ dZ!e"dk(reyy)N)treebank)BrillTaggerTrainer RegexpTagger UnigramTagger)PosWord)Template error_listcty)z Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions. NpostagS/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/nltk/tbl/demo.pydemors  Hrctdy)N Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose")) repr ruleformatNr rrrdemo_repr_rule_formatrs  frctdy)rstrrNr rrrdemo_str_rule_formatr$s  erctdy)z* Exemplify Rule.format("verbose") verboserNr rrrdemo_verbose_rule_formatr+s  i rcFtttgdgy)a The feature/s of a template takes a list of positions relative to the current word where the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right. For contiguous ranges, a 2-arg form giving inclusive end points can also be used: Pos(-3, -1) is the same as the arg below. ) templatesN)r r rrrrdemo_multiposition_featurer$2s hs<0123rc \tttdgtddggy)z8 Templates can have more than a single feature. rr r!r"N)r r rrrrrdemo_multifeature_templater&As$ htQCy#r2h-89:rctddy)ah Show aggregate statistics per template. Little used templates are candidates for deletion, much used templates may possibly be refined. Deleting unused templates is mostly about saving time and/or space: training is basically O(T) in the number of templates T (also in terms of memory usage, which often will be the limiting factor). T)incremental_statstemplate_statsNr rrrdemo_template_statisticsr*Hs T$7rctjgdddgd}tjgdddgd}tt j||gd }t d j t|t|dd y ) a  Template.expand and Feature.expand are class methods facilitating generating large amounts of templates. See their documentation for details. Note: training with 500 templates can easily fill all available even on relatively small corpora )r!rr,F) excludezero)r r!rr,T)r,) combinationsz8Generated {} templates for transformation-based learning)r#r(r)N) rexpandrlistr printformatlenr )wordtplstagtplsr#s rdemo_generated_templatesr8Tsu{{:1v5AHjj!QTBGX__h%8vNOI BII  N   Y$tLrc tdddy)z Plot a learning curve -- the contribution on tagging accuracy of the individual rules. Note: requires matplotlib Tzlearningcurve.png)r(separate_baseline_datalearning_curve_outputNr rrrdemo_learning_curver<hs  #1rctdy)zW Writes a file with context for each erroneous word after tagging testing data z errors.txt) error_outputNr rrrdemo_error_analysisr?us  %rctdy)zm Serializes the learned tagger to a file in pickle format; reloads it and validates the process. z tagger.pcl)serialize_outputNr rrrdemo_serialize_taggerrB|s  L)rc tdddy)z Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules which are more interesting read to a human. i gQ? ) num_sentsmin_acc min_scoreNr rrrdemo_high_accuracy_rulesrHs  T426rc |xst}|ddlm}m}|}t |||||\}}}}|rt j j|sRt||}t|d5}tj||dddtdj|t|5}tj|}td|dddnt||}td|r)td jj|t!j }t#||| }td |j%||||}td t!j |z d d|rtd|j|z|dk(rNtdt'|j)dD]&\}}t|dd|j| d(| r{td|j+||\} }!td|s td|j-}"| r|j/|!|rLt1||!|"|td|n.td|j3|} | r|j/| st| d5}#|#j5d| z|#j5dj7t9|| j;ddzdddtd| | |j3|} t| d5}tj||dddtd| t| 5}tj|}$dddtd| |j3|}%| |%k(r td ytd!yy#1swYJxYw#1swYxYw#1swYxYw#1swYxYw#1swYxxYw)"a Brill Tagger Demonstration :param templates: how many sentences of training and testing data to use :type templates: list of Template :param tagged_data: maximum number of rule instances to create :type tagged_data: C{int} :param num_sents: how many sentences of training and testing data to use :type num_sents: C{int} :param max_rules: maximum number of rule instances to create :type max_rules: C{int} :param min_score: the minimum score for a rule in order for it to be considered :type min_score: C{int} :param min_acc: the minimum score for a rule in order for it to be considered :type min_acc: C{float} :param train: the fraction of the the corpus to be used for training (1=all) :type train: C{float} :param trace: the level of diagnostic tracing output to produce (0-4) :type trace: C{int} :param randomize: whether the training data should be a random subset of the corpus :type randomize: C{bool} :param ruleformat: rule output format, one of "str", "repr", "verbose" :type ruleformat: C{str} :param incremental_stats: if true, will tag incrementally and collect stats for each rule (rather slow) :type incremental_stats: C{bool} :param template_stats: if true, will print per-template statistics collected in training and (optionally) testing :type template_stats: C{bool} :param error_output: the file where errors will be saved :type error_output: C{string} :param serialize_output: the file where the learned tbl tagger will be saved :type serialize_output: C{string} :param learning_curve_output: filename of plot of learning curve(s) (train and also test, if available) :type learning_curve_output: C{string} :param learning_curve_take: how many rules plotted :type learning_curve_take: C{int} :param baseline_backoff_tagger: the file where rules will be saved :type baseline_backoff_tagger: tagger :param separate_baseline_data: use a fraction of the training data exclusively for training baseline :type separate_baseline_data: C{bool} :param cache_baseline_tagger: cache baseline tagger to this file (only interesting as a temporary workaround to get deterministic output from the baseline unigram tagger between python versions) :type cache_baseline_tagger: C{string} Note on separate_baseline_data: if True, reuse training data both for baseline and rule learner. This is fast and fine for a demo, but is likely to generalize worse on unseen data. Also cannot be sensibly used for learning curves on training data (the baseline will be artificially high). Nr)brill24describe_template_sets)backoffwz)Trained baseline tagger, pickled it to {}zReloaded pickled tagger from zTrained baseline taggerz! Accuracy on test set: {:0.4f}rzTraining tbl tagger...zTrained tbl tagger in z0.2fz secondsz Accuracy on test set: %.4fr,z Learned rules: 4d szJIncrementally tagging the test data, collecting individual rule statisticsz Rule statistics collectedzbWARNING: train_stats asked for separate_baseline_data=True; the baseline will be artificially high)takez Wrote plot of learning curve to zTagging the test datazErrors for Brill Tagger %r  zutf-8z)Wrote tagger errors including context to zWrote pickled tagger to z4Reloaded tagger tried on test set, results identicalz;PROBLEM: Reloaded tagger gave different results on test set) REGEXP_TAGGERnltk.tag.brillrJrK_demo_prepare_dataospathexistsropenpickledumpr3r4loadaccuracytimertrain enumeraterulesbatch_tag_incremental train_statsprint_template_statistics _demo_plot tag_sentswritejoinr encode)&r# tagged_datarE max_rulesrGrFr_trace randomizerr(r)r>rAr;learning_curve_takebaseline_backoff_taggerr:cache_baseline_taggerrJrK training_data baseline_data gold_data testing_databaseline_tagger print_rulestbrilltrainer brill_taggerrulenorule taggedtest teststats trainstatsfbrill_tagger_reloadedtaggedtest_reloadeds& rr r sp6FB I >PUIy2H?;]M9lww~~34+'>O+S1 :[ O[9 : ;BB)  ' ( KK$kk+6O 12G1HI J K K( ?VW '( / 6 6((3  YY[F EjG "#== 9gNL "499;#7"=X FG .1F1Fy1QQR z !"%l&8&8&:A> >LFD VBKqZ!8 ;< = >  X #/"D"D )# Y -.% , "--/   2 29 = %y*CV  45J4KL M %&!++L9   2 2 4 , $ Y GG47GG H GGDIIjJ?@GGPSWW X Y 9,HI#!++L9 "C ( 3K KK k 2 3 ()9(:;< " # ={$*KK $< ! = -.>-?@A*44\B , , H I O P$U : : K Kz Y Y 3 3 = =s=*O .$O0AO#6O/.O; OO #O,/O8;Pc |tdtj}|t||kr t|}|r3t j t|t j |t||z}|d|}|||}|D cgc]}|D cgc]} | d c} } }} |s|} nt|dz} |d| || d}} t|\} }t| \}}t| \}}td|dd|ddtd| dd|ddtd j|||rd nd || || fScc} wcc} }w) Nz%Loading tagged data from treebank... rr/zRead testing data (dz sents/z wds)zRead training data (z-Read baseline data ({:d} sents/{:d} wds) {:s}z[reused the training set]) r3r tagged_sentsr5randomseedshuffleint corpus_sizer4)rjr_rErmr:cutoffrqrssenttrtrr bl_cutoff trainseqs traintokenstestseqs testtokens bltrainseqs bltraintokenss rrUrUQs|  56++- C , 9 $  C $%{# U" #F(MF9-I5>?T4(aQqT(?L? !%  &!+ *9 % )* %& +=9Y (6Xz#.}#= [- |7:a. FG 1 W[O5 IJ 7>>  (B.I  =)\ BB+)?s E E $EEc|dg}|dD]}|j|d|z |d|Dcgc] }d||dz z }}|dg}|dD]}|j|d|z |d|Dcgc] }d||dz z }}ddlm}tt t |} |j | || ||jgd|j|ycc}wcc}w)N initialerrors rulescoresr!r, tokencountr)NNNg?) appendmatplotlib.pyplotpyplotr2ranger5plotaxissavefig) r;r}r~rQ testcurve rulescorex traincurvepltrs rrereys?+,I|,4 2234:CET:JKQQ<000KIK_-.J -6 *R.9456 "#AHHQ 1j)HH $%KK%&L Os C!1C&z^-?[0-9]+(\.[0-9]+)?$CDz.*NN) r)z(The|the|A|a|An|an)$AT)z.*able$JJ)z.*ness$r)z.*ly$RB)z.*s$NNS)z.*ing$VBG)z.*ed$VBDrc<t|td|DfS)Nc32K|]}t|yw)N)r5).0rs r zcorpus_size..s0a3q60s)r5sum)seqss rrrs Is0400 11r__main__)NNi,r/Ng?r/FrFFNNNrNFN)NN)#rVrZrr^ nltk.corpusrnltk.tagrrrrTrrnltk.tblr r rrrrr$r&r*r8r<r?rBrHr rUre NN_CD_TAGGERrSr__name__rrrrs DD$) ! 4; 8M( &*7    'BQJ%CP'&=}MN    2 zr