Simple queries - regular expressions
Using structures - within part
Meet/Union queries

Pavel Rychlý
pary@fi.muni.cz
24 March 2014
IB047

Corpus Query Language
Test it from http://ske.fi.muni.cz/
Use CQL query type
■ Query - a pattern matching a set of single tokens or token sequences
■ Each token consists of attributes (depending on corpus configuration): word, lemma, tag, lempos, lc
■ Use [attribute="value"] for each token sub-pattern.

Very simple queries
  [word="dream"]
  [word="Dream"]
  [lc="dream"]
  [lemma="dream"]
  [lempos="dream-n"]
  [word="The"] [word="dream"]
  [word="the"] [lemma="dream"]
  [tag="AJ0"] [lempos="dream-n"]

Regular Expression in Attributes
The value in a [attribute="value"] expression is a regular expression.
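As a rough illustration of how an attribute value behaves as a regular expression, the sketch below evaluates a single [attribute="value"] condition over a few tokens. It assumes the value must match the whole attribute, as the queries in this deck suggest, and it uses Python's re module as a stand-in for the regular expression engine. The token list and the helper names match_token and query are hypothetical and are not part of CQL.

```python
import re

# Hypothetical tokens: one dict of attributes per token (not a real corpus).
tokens = [
    {"word": "The", "lc": "the", "lemma": "the"},
    {"word": "Dreams", "lc": "dreams", "lemma": "dream"},
    {"word": "dreamed", "lc": "dreamed", "lemma": "dream"},
    {"word": "12345", "lc": "12345", "lemma": "12345"},
]


def match_token(token, attribute, pattern):
    """Return True if the attribute value matches the pattern in full."""
    value = token.get(attribute)
    # Assumption: the pattern is anchored to the whole value, so "dream"
    # does not match "dreamed" unless the pattern is "dream.*".
    return value is not None and re.fullmatch(pattern, value) is not None


def query(tokens, attribute, pattern):
    """Return the word forms of tokens satisfying [attribute="pattern"]."""
    return [t["word"] for t in tokens if match_token(t, attribute, pattern)]


print(query(tokens, "word", "dream.*"))   # word forms starting with "dream"
print(query(tokens, "lc", "dream.*"))     # same, on the lowercased attribute
print(query(tokens, "lemma", "dream"))    # every form of the lemma "dream"
```

Under these assumptions, the word attribute is case-sensitive, while lc and lemma let one pattern cover several surface forms.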
  [word="dream.*"]
  [word="[dD]ream"]
  [word="[0-9]*"]
  [lc="dreams"]
  [tag="NN."]
  [lempos="dream-v"]
  [word="[0-9]{5,}"]
  [word="\."]
  [word="\("]
  [word="0[0-9]{3}"]
  [word="\)"]
  [word=")"]
  [word="."]
  [word="[A-Z][0-9A-Z]{2,3}"]
  [word="[0-9][0-9A-Z]{2}

Regular Expressions
The PCRE library is used to evaluate regular expressions.
Several useful special sequences:
■ \d - any decimal digit
■ \D - any character that is not a decimal digit
■ \w - any "word" character
■ \W - any "non-word" character
■ (?i) - ignore case
  [word="\d\d\W"]

Logical combinations of attributes
Boolean combinations (AND, OR and NOT) of [attribute="value"] expressions.
Use: &, |, !=, ()
  [word="dream" & tag="NN1"]
  [lemma="dream" & tag="VV."]
  [word="dream" | word="Dream"]
  [word="the" | tag="DPS"] [lempos="dream-n" & tag="NN2"]
  [word="the" | (tag="DPS" & lemma!="my")] [lemma="dream"]

Regular expressions of tokens
Regular expressions on token level:
■ ? - optional token
■ * - any number of repetitions
■ + - at least one repetition
■ {N} - exact number of repetitions
■ {M,N} - from M to N repetitions
■ [] - any token
  [tag="DPS"] [] [lemma="dream"]
  [tag="DPS"] [tag="AJ0"]? [lemma="dream"]
  [tag="AJ0"]{2} [lemma="dream"]
  [word="the"] []{0,3} [lempos="dream-n"]

Within
The within keyword goes at the end of a query.
■ within a structure restricts the result to one sentence
■ within can also restrict the result to a subcorpus
  [lemma="dream"] within <...>
  [word="the"] []{3,5} [lemma="dream"]
  [word="the"] []{3,5} [lemma="dream"] within <...>

Within
More within combinations: Boolean combinations of regular expressions
  [lemma="dream"] within <...>
  [lemma="dream"] within <...>
  [word="the"] []{3,5} [lemma="dream"] within <...> within <...>
  [word="the"] []{3,5} [lemma="dream"] within <...>

Within
within can be inverted with !
  [word="THE"] within <...>
  [word="THE"] within ! <...>

Structure boundaries
Structure boundaries: start/end of a structure, whole structure
  [lemma="dream"]
  [word=="?"]

Within
A query can limit the search to segments whose aligned parts contain hits of a subquery.
  [lemma="hrad"] within kacen: [word="castle"]
  [lemma="hrad"] within ! kacen: [word="castle"]

Meet/Union queries
■ combining and nesting simple queries
■ not a sequence of tokens
■ meet and union operators

Union
Union operator:
■ union Q1 Q2
  (union [word="dream"] [word="dreams"])
  [word="dream" | word="dreams"]

Meet
Meet operator:
■ meet Q1 Q2 W-BEG W-END
■ find Q1 with Q2 in the window from W-BEG to W-END
■ W-BEG and W-END default to 1
  (meet [word="my"] [word="dream"])
  [word="my"] [word="dream"]
  (meet [word="my"] [word="dream"] 1 3)
  [word="my"] []{0,2} [word="dream"]
  (meet [word="black"] [word="white"] -3 3)

Meet/union combination
Use a meet/union operator in place of a simple query.
  (meet [word="and"] (meet [word="black"] [word="white"] -3 3) -2 2)

Within
The within keyword works with any subquery, not only a structure.
  [lemma="dream"] within ([word="my"] [lemma="dream"])
  (meet [lemma="dream"] [word="my"] -1 -1)
  [word="the"] []{0,3} [lemma="dream"] within ([tag="AT."] [tag="AJ."]{0,4} [tag="NN."])

containing keyword
■ inverts the within keyword
■ matches results of the first subquery which contain matches of the second subquery
  <...> containing [lemma="dream"]
  (meet [lemma="dream"] [word="my"] -1 -1)
  [word="the"] []{1,3} [lemma="dream"] containing [lemma="wild"]

Combinations of containing/within
Both keywords form a query that can itself be used as a subquery, so they can be nested.
  [lemma="break"] within (<...> containing [lemma="rul
  [lemma="student"] within (<...> containing [lemma="break"] containing [lemma="rule"])
  [lemma="break"] within ([]{5} containing [lemma="ru
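
The nesting of within and containing can be pictured as filtering by span containment. The sketch below is a minimal model, assuming every query result is a list of (start, end) token spans with inclusive positions. The helper names covers, within and containing and the sample spans are hypothetical illustrations, not the query engine's actual implementation.

```python
def covers(outer, inner):
    """True if the inner span lies entirely inside the outer span."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def within(hits, regions):
    """Keep the hits that lie inside at least one region."""
    return [h for h in hits if any(covers(r, h) for r in regions)]


def containing(regions, hits):
    """Keep the regions that contain at least one hit."""
    return [r for r in regions if any(covers(r, h) for h in hits)]


# Hypothetical data: three sentences and the positions of two lemmas.
sentences = [(0, 4), (5, 9), (10, 14)]
break_hits = [(2, 2), (11, 11)]
rule_hits = [(3, 3), (7, 7)]

# Model of: [lemma="break"] within (SENTENCE containing [lemma="rule"])
sentences_with_rule = containing(sentences, rule_hits)
result = within(break_hits, sentences_with_rule)
print(result)

# Nesting two containing filters keeps sentences that hold both lemmas.
both = containing(containing(sentences, break_hits), rule_hits)
print(both)
```

Because each function takes and returns a list of spans, the output of one filter can be passed to the next, which mirrors how a within or containing expression can serve as a subquery.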