Transformation-Based Tagging
PA154 Jazykové modelování (7.1)
Pavel Rychlý
pary@fi.muni.cz
April 13, 2021
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Task, Again
■ Recall:
► tagging ~ morphological disambiguation
► tagset VT ⊂ (C1, C2, ..., Cn)
► Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
► mapping w → {t ∈ VT} exists
► restriction of Morphological Analysis: A+ → 2^(L, C1, C2, ..., Cn), where A is the language alphabet, L is the set of lemmas
► extension to punctuation, sentence boundaries (treated as words)

Setting
■ Not a source-channel view
■ Not even a probabilistic model (no "numbers" used when tagging a text after a model is developed)
■ Statistical, yes:
► uses training data (combination of supervised [manually annotated data available] and unsupervised [plain text, large volume] training)
► learning [rules]
► criterion: accuracy (that's what we are interested in in the end, after all!)

The General Scheme
[Diagram: the training phase and the tagging phase]

The Learner
[Diagram: annotated training data → remove tags → data without annotation → assign initial tags → iteration 1 ... iteration n, each selecting one rule → RULES]

The I/O of an Iteration
■ In (iteration i):
► intermediate data (initial, or the result of the previous iteration)
► the TRUTH (the annotated training data)
► pool of possible rules
■ Out:
► one rule r_selected(i) to enhance the set of rules learned so far
► intermediate data (input data transformed by the rule learned in this iteration, r_selected(i))

The Initial Assignment of Tags
■ One possibility:
► NN
■ Another:
► the most frequent tag for a given word form
■ Even:
► use an HMM tagger for the initial assignment
■ Not particularly sensitive

The Criterion
■ Error rate (or accuracy):
► beginning of an iteration: some error rate E_in
► each possible rule r, when applied at every data position:
► makes an improvement somewhere in the data (c_improved(r))
► makes it worse at some places (c_worsened(r))
► and, of course, does not touch the remaining data
■ Rule contribution to the improvement of the error rate:
► contrib(r) = c_improved(r) - c_worsened(r)
■ Rule selection at iteration i:
► r_selected(i) = argmax_r contrib(r)
■ New error rate: E_out = E_in - contrib(r_selected(i))

The Stopping Criterion
■ Obvious:
► no improvement can be made (contrib(r) ≤ 0)
► or improvement too small (contrib(r) < Threshold)
■ NB: prone to overtraining!
► therefore, setting a reasonable threshold is advisable
■ Heldout data?
► maybe: remove rules which degrade performance on the heldout set H

The Pool of Rules (Templates)
■ Format: change tag at position i from a to b / condition
■ Context rules (condition definition - "template"):
w_{i-3} w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} w_{i+3}
t_{i-3} t_{i-2} t_{i-1} t_i t_{i+1} t_{i+2} t_{i+3}
■ Instantiation: any w / t position from the window permitted

Lexical Rules
■ Other type: lexical rules - "look inside the word" w_i, over the same window of positions:
w_{i-3} w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} w_{i+3}
t_{i-3} t_{i-2} t_{i-1} t_i t_{i+1} t_{i+2} t_{i+3}
■ Example:
► w_i has suffix -ied
► w_i has prefix ge-
(An illustrative sketch of such rule representations follows below.)
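The rule format above ("change tag at position i from a to b / condition", with context and lexical templates) can be pictured concretely. The following is a minimal Python sketch, not taken from the slides; the Rule class and the preceded_by / has_suffix helpers, as well as the example instantiations, are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the slides) of an instantiated rule
# from the pool of templates: "change tag at position i from a to b / condition",
# where the condition inspects the word/tag window around i (context rule)
# or looks inside the word form itself (lexical rule).

from dataclasses import dataclass
from typing import Callable, List, Tuple

Token = Tuple[str, str]   # (word form, current tag)

@dataclass
class Rule:
    from_tag: str                                   # tag a to be replaced
    to_tag: str                                     # tag b to replace it with
    condition: Callable[[List[Token], int], bool]   # instantiated template

    def applies_at(self, data: List[Token], i: int) -> bool:
        # The rule fires only where the current tag is a and the condition holds.
        return data[i][1] == self.from_tag and self.condition(data, i)

# Context-rule template "preceded by tag1 tag2" (tags at positions i-2, i-1):
def preceded_by(tag1: str, tag2: str) -> Callable[[List[Token], int], bool]:
    return lambda data, i: i >= 2 and data[i - 2][1] == tag1 and data[i - 1][1] == tag2

# Lexical-rule template "w_i has suffix s" ("look inside the word"):
def has_suffix(s: str) -> Callable[[List[Token], int], bool]:
    return lambda data, i: data[i][0].endswith(s)

# Hypothetical instantiations:
r1 = Rule("NN", "NNS", preceded_by("NN", "VBP"))   # NN -> NNS / preceded by NN VBP
r2 = Rule("NN", "VBD", has_suffix("ied"))          # NN -> VBD / w_i has suffix -ied
```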
Rule Application
■ Two possibilities:
► immediate consequences (left-to-right):
► data: DT NN VBP NN VBP NN ...
► rule: NN → NNS / preceded by NN VBP
► apply the rule at position 4: DT NN VBP NN VBP NN ... → DT NN VBP NNS VBP NN ...
► ...then the rule cannot apply at position 6 (context is now NNS VBP, not NN VBP)
► delayed ("fixed input"):
► use the original input for the context
► the above rule then applies twice

In Other Words
1. Strip the tags off the truth, keep the original truth.
2. Initialize the stripped data by some simple method.
3. Start with an empty set of selected rules S.
4. Repeat until the stopping criterion applies:
► compute the contribution of each rule r: contrib(r) = c_improved(r) - c_worsened(r)
► select the rule r with the biggest contribution contrib(r), add it to the final set of selected rules S, and apply it to the intermediate data
5. Output the set S.

The Tagger
■ Input:
► untagged data
► rules (S) learned by the learner
■ Tagging:
► use the same initialization as the learner did
► for i = 1..n (n - the number of rules learned):
► apply rule i to the whole intermediate data, changing (some) tags
► the last intermediate data is the output
(An illustrative sketch of the learner loop and the tagger follows at the end of these slides.)

N-best & Unsupervised Modifications
■ N-best modification
► allow adding tags by rules
► criterion: optimal combination of accuracy and the number of tags per word (we want: close to 1!)
■ Unsupervised modification
► use only unambiguous words for the evaluation criterion
► works extremely well for English
► does not work for languages with few unambiguous words
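The learner loop ("In Other Words") and the tagger can likewise be sketched in code. The sketch below is not from the slides: it assumes the illustrative Rule class from the earlier sketch, and the names contrib, apply_rule, learn_rules, tag and the threshold default are assumptions. It uses the delayed ("fixed input") application mode; the immediate left-to-right variant would check conditions against the already-updated data instead.

```python
# Illustrative sketch (not from the slides) of the greedy TBL learner and the
# tagger. Assumes the Rule class with applies_at() from the earlier sketch.

from typing import List, Tuple

Token = Tuple[str, str]   # (word form, current tag)

def contrib(rule, data: List[Token], truth: List[str]) -> int:
    """contrib(r) = c_improved(r) - c_worsened(r) on the intermediate data."""
    improved = worsened = 0
    for i, (word, tag) in enumerate(data):
        if rule.applies_at(data, i):
            if rule.to_tag == truth[i] and tag != truth[i]:
                improved += 1        # a wrong tag would become correct
            elif tag == truth[i] and rule.to_tag != truth[i]:
                worsened += 1        # a correct tag would become wrong
    return improved - worsened

def apply_rule(rule, data: List[Token]) -> List[Token]:
    """Delayed ('fixed input') application: conditions are checked against the
    original data, all changes are written to a fresh copy."""
    out = list(data)
    for i in range(len(data)):
        if rule.applies_at(data, i):
            out[i] = (data[i][0], rule.to_tag)
    return out

def learn_rules(data: List[Token], truth: List[str], pool, threshold: int = 1):
    """Greedily pick argmax_r contrib(r) and transform the intermediate data,
    until the best contribution drops below the threshold (stopping criterion)."""
    selected = []
    while True:
        scored = [(contrib(r, data, truth), r) for r in pool]
        best_score, best_rule = max(scored, key=lambda x: x[0])
        if best_score < threshold:
            break
        selected.append(best_rule)
        data = apply_rule(best_rule, data)   # transform the intermediate data
    return selected

def tag(words: List[str], rules, initial_tag: str = "NN") -> List[Token]:
    """The tagger: same initialization as the learner, then apply the learned
    rules to the whole intermediate data, in the order they were learned."""
    data = [(w, initial_tag) for w in words]
    for rule in rules:
        data = apply_rule(rule, data)
    return data
```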