Disambiguation strategies for Spanish used in PrADo
PrADo
PrADo was a joint project of two universities in Catalonia, the Autonomous University of
Barcelona and Pompeu Fabra University. Its aim was to create two grammar checker prototypes,
one for Catalan and one for Spanish, that operate on texts annotated by a morphosyntactic
tagger. The project ran from 2001 to 2003, and its results are summarised in "PrADo:
Preparación Automatizada de Documentos" [1], dated March 2004.
Module for Spanish language
Before any work on the tools started, a corpus was created. It contains 238,766 words drawn
from general texts (web pages, newspapers, literature and e-mails) and from specialised texts
on law and linguistics. The corpus was built with a specific user profile in mind: trilingual
writers (Spanish, Catalan and English; all texts come from the Iberian Peninsula),
contemporary usage (no text is older than 1 January 2000) and a high level of writing
proficiency. The goal was twofold: to create tools that work effectively for these users and
their texts, and to build a corpus of errors for developing the correction tools.
Preprocessing
To preprocess the text, an external tool named MACO+ from the Polytechnic University of
Catalonia was used. MACO+ is in principle a tagger, but it incorporates its own text
preprocessing. This made preprocessing the Spanish text easier, but its output had to be
altered to comply with the Constraint Grammar format.
First, abbreviations had to be detected. MACO+ gives every abbreviation a tag, but these tags
cannot be used directly in grammar rules. Moreover, MACO+ cannot assign a morphological
category to a particular abbreviation.
MACO+ also cannot mark sentence boundaries, and it does not distinguish between simple verbs
and verbs with clitics. Additional modules were therefore created to mark sentences with SGML
and to preserve information about clitics in the format used by Constraint Grammar.
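The sentence-marking step can be illustrated with a minimal sketch. It is not the project's
module: the `<s>` element name, the abbreviation list and the rule "a word ending in . ? ! closes
a sentence unless it is a known abbreviation" are all assumptions made for this example. It does
show why abbreviations have to be detected before sentence boundaries can be marked.

```python
# Hypothetical abbreviation list; the list used in PrADo is not given in the source.
ABBREVIATIONS = {"Sr.", "Sra.", "Dr.", "etc."}


def mark_sentences(text):
    """Split whitespace-separated text into sentences and wrap each in <s>...</s>.

    A word ending in '.', '?' or '!' closes the sentence unless it is a known
    abbreviation (assumed heuristic, for illustration only).
    """
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".?!" and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(" ".join(current))
    return [f"<s>{s}</s>" for s in sentences]


for sentence in mark_sentences("El Sr. Pérez llegó. ¿Vienes mañana?"):
    print(sentence)
```

Without the abbreviation check, the period in "Sr." would be taken as a sentence boundary.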
Morphology
As stated above, MACO+ was used for morphological tagging. The tags it produced were then
automatically converted to the Constraint Grammar format.
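As an illustration of such a conversion, the sketch below turns one line of tagger output into
a Constraint Grammar cohort (a word form followed by one indented reading per analysis). The
input layout "form lemma1 tag1 lemma2 tag2 ..." and the EAGLES-style tags in the example are
assumptions; the source does not show the actual formats used in PrADo.

```python
def to_cg_cohort(line):
    """Convert 'form lemma1 tag1 lemma2 tag2 ...' into a CG cohort string.

    The input layout is an assumption made for this sketch.
    """
    form, *rest = line.split()
    if len(rest) % 2 != 0:
        raise ValueError(f"expected lemma/tag pairs: {line!r}")
    readings = zip(rest[0::2], rest[1::2])
    lines = [f'"<{form}>"']                                     # word form
    lines += [f'\t"{lemma}" {tag}' for lemma, tag in readings]  # one reading per line
    return "\n".join(lines)


# "cuentas" is ambiguous between a noun (cuenta) and a verb (contar).
print(to_cg_cohort("cuentas cuenta NCFP000 contar VMIP2S0"))
```

Each reading sits on its own line, so a disambiguation rule can remove a single reading
without touching the others.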
The grammar for Spanish was structured in a similar manner to the one for Catalan, in two
blocks. The first block consists of rules that eliminate ambiguity introduced by tagging. To
eliminate it correctly, the specific user profile mentioned above is taken into account (for
instance, how these users employ anachronisms and Americanisms).
The remaining rules are organised by morphological category. The first rules deal with closed
categories (determiners, prepositions, conjunctions, pronouns, adverbs), because these make it
easier to categorise other words and to detect their characteristics (grammatical number,
gender, etc.).
Apart from the corpus mentioned above, additional sources of data were used: dictionaries
(especially DRAE [2]) and grammars (GDLE [3]).
Disambiguation
Disambiguation strategies for nouns and verbs
Distinguishing between nouns and verbs is one of the most important disambiguation problems;
it affects almost 6 % of the corpus.
• Missing agreement: if, for example, the gender of a preceding article does not match the
word, the word is probably a verb.
• Presence of another verb in the phrase: if the phrase contains another verb, the word is
treated as a noun.
• Other restrictions, based on whether a particular reading is possible in verbal and
non-verbal contexts.
These techniques reduced the number of ambiguous cases by 81 %.
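The first two strategies can be sketched as follows. This is an illustrative simplification,
not the PrADo rule set: the `Token` structure, the coarse reading labels ("N", "V", "DET",
...), the order in which the rules are tried and the use of the whole token list as "the
phrase" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    form: str
    readings: set                 # coarse categories, e.g. {"N", "V"}
    gender: Optional[str] = None  # gender of the article or of the noun reading


def disambiguate_noun_verb(tokens, i):
    """Return the readings of tokens[i] after applying two noun/verb heuristics."""
    tok = tokens[i]
    if tok.readings != {"N", "V"}:
        return tok.readings
    prev = tokens[i - 1] if i > 0 else None
    # Rule 1: the preceding article does not agree in gender -> probably a verb.
    if (prev is not None and "DET" in prev.readings
            and prev.gender and tok.gender and prev.gender != tok.gender):
        return {"V"}
    # Rule 2: another unambiguous verb is present -> treat the word as a noun.
    if any(t.readings == {"V"} for j, t in enumerate(tokens) if j != i):
        return {"N"}
    return tok.readings  # still ambiguous


la_cuento = [Token("la", {"DET", "PRON"}, "f"), Token("cuento", {"N", "V"}, "m")]
el_cuento = [Token("el", {"DET"}, "m"), Token("cuento", {"N", "V"}, "m"),
             Token("es", {"V"}), Token("largo", {"ADJ"})]
print(disambiguate_noun_verb(la_cuento, 1))  # verb reading kept
print(disambiguate_noun_verb(el_cuento, 1))  # noun reading kept
```

In "la cuento" the feminine "la" cannot be the article of the masculine noun "cuento", so the
verb reading is kept. In "el cuento es largo" the unambiguous verb "es" favours the noun
reading.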
Disambiguation strategies for pronouns and determiners
Nearly 3 % of cases involve a word that can be either a pronoun or a determiner and has to be
resolved.
• If the following word is not a verb and is not an ambiguous noun, the word is not a pronoun
(las flores vs. las cantas, las niñas vs. las cuentas).
• If the following word is an unambiguous verb, the word is a pronoun.
These techniques reduced the number of ambiguous cases by 84.7 %.
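The two rules above can be written as a small decision function. The sketch assumes that "not
a verb and not an ambiguous noun" means the following word has no verb reading at all. The
toy lexicon is hypothetical and covers only the four examples from the text.

```python
# Hypothetical toy lexicon for the examples in the text.
LEXICON = {
    "flores": {"N"},
    "niñas": {"N"},
    "cantas": {"V"},
    "cuentas": {"N", "V"},  # noun (cuenta) or verb (contar)
}


def disambiguate_pron_det(next_readings):
    """Decide between pronoun and determiner from the readings of the next word."""
    if next_readings == {"V"}:
        return {"PRON"}         # followed by an unambiguous verb
    if "V" not in next_readings:
        return {"DET"}          # no verb reading: pronoun reading discarded
    return {"PRON", "DET"}      # next word is itself ambiguous: leave unresolved


for noun_or_verb in ("flores", "cantas", "niñas", "cuentas"):
    print("las", noun_or_verb, "->", sorted(disambiguate_pron_det(LEXICON[noun_or_verb])))
```

"las cuentas" stays ambiguous here, because "cuentas" itself still has a noun and a verb
reading.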
Disambiguation strategies for prepositions
• Preposition and adverb: in the majority of cases the ambiguity between these two categories
can be resolved, because a preposition is always followed by some nominal structure.
• Preposition and noun: possible combinations are considered here, for instance two
prepositions in a row or a determiner before the preposition.
• Preposition and verb: for cases of this kind (such as "bajo" and "entre"), more ad hoc rules
were introduced, which are difficult to generalise.
These techniques reduced the number of ambiguous cases by 99.7 %.
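The preposition/adverb case can be sketched in the same style. The set of categories that may
open a nominal structure is an assumption and deliberately simplified (infinitives after a
preposition, for instance, are ignored); it is not taken from the PrADo grammar.

```python
# Assumed categories that can open a nominal structure (simplified).
NOMINAL_START = {"DET", "N", "PRON", "ADJ", "NUM"}


def disambiguate_prep_adv(next_readings):
    """Decide between preposition and adverb from the readings of the next word."""
    if next_readings and next_readings <= NOMINAL_START:
        return {"PREP"}        # every reading of the next word is nominal
    if not (next_readings & NOMINAL_START):
        return {"ADV"}         # nothing nominal follows (or end of sentence)
    return {"PREP", "ADV"}     # next word is ambiguous: leave unresolved


print(disambiguate_prep_adv({"DET", "PRON"}))  # "bajo la mesa"
print(disambiguate_prep_adv({"PUNCT"}))        # "habla bajo."
```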
Disambiguation strategies for conjunctions
For conjunctions, strategies similar to those for prepositions were used; they reduced
ambiguity by 82.14 %.
Disambiguation strategies between verbs
Many Spanish verb forms are homographs: the same word form can correspond to a different
person, tense or verb. This ambiguity affects 10.38 % of the corpus. To resolve it, agreement
in number and person is again used. Grammatical mood is also considered, and some verb forms
can be ruled out given the specific type of user who wrote the text. The effectiveness of
these rules was 87.41 %.
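Agreement in person and number can be illustrated with a form such as "cantaba", which is
both first and third person singular of the imperfect. The sketch below keeps only the
readings that agree with an explicit subject. The (person, number) tuples are an assumed
representation. Falling back to all readings when none agrees is also an assumption, chosen
so that a word is never left without a reading.

```python
def filter_by_subject(readings, subject):
    """Keep the verb readings that agree with the subject in person and number.

    readings: list of (person, number) tuples, e.g. [(1, "sg"), (3, "sg")]
    subject:  (person, number) of an explicit subject, or None if there is none
    """
    if subject is None:
        return readings
    agreeing = [r for r in readings if r == subject]
    return agreeing or readings  # never discard every reading


cantaba = [(1, "sg"), (3, "sg")]
print(filter_by_subject(cantaba, (3, "sg")))  # "ella cantaba"
print(filter_by_subject(cantaba, None))       # no explicit subject: still ambiguous
```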
References:
[1] http://mutis.upf.es/glicom/Papers/inftecn/pradov2.pdf (accessed January 31, 2013)
[2] http://lema.rae.es/drae/ (accessed January 31, 2013)
[3] http://es.wikipedia.org/wiki/Gramática_descriptiva_de_la_lengua_española (accessed January 31, 2013)
Tools & Resources
Petra Tag - Spanish POS Tagger. http://petrapostagger.sourceforge.net/ (accessed January
31, 2013)
FreeLing - library providing language analysis services (including POS tagging) for various
languages (including Spanish). http://nlp.lsi.upc.edu/freeling/ (accessed January 31, 2013)
TreeTagger - a language independent part-of-speech tagger developed at University of
Stuttgart (also for Spanish). http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
(accessed January 31, 2013)
Corpus de Referencia del Español Actual (CREA) – corpus of contemporary Spanish by
Real Academia Española (The Royal Spanish Academy). http://corpus.rae.es/creanet.html
(accessed January 31, 2013)
Corpus Diacrónico del Español (CORDE) – historical corpus of Spanish by Real
Academia Española (The Royal Spanish Academy). http://corpus.rae.es/cordenet.html
(accessed January 31, 2013)
Corpus del Español - free online Spanish (historical) corpus with 100 million words.
http://www.corpusdelespanol.org/x.asp (accessed January 31, 2013)
CRATER - Multilingual Aligned Annotated Corpus for English, French and Spanish
http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html (accessed January 31, 2013)
Various Spanish corpora from the Laboratorio de Lingüística Informática:
http://www.lllf.uam.es/ING/Recursos.html (accessed January 31, 2013)
A Universal Part-of-Speech Tagset including Spanish http://code.google.com/p/universalpos-tags/
(accessed January 31, 2013)