Simple queries - regular expressions
Using structures - within part
Meet/Union queries

Pavel Rychlý, pary@fi.muni.cz
IB047, 24 March 2014

Corpus Query Language

Test it from http://ske.fi.muni.cz/
Use CQL query type
■ Query - pattern matching a set of single tokens or token sequences
■ Each token consists of attributes (depending on corpus configuration): word, lemma, tag, lempos, lc
■ Use [attribute="value"] for each token sub-pattern.

Very simple queries

    [word="dream"]
    [word="Dream"]
    [lc="dream"]
    [lemma="dream"]
    [lempos="dream-n"]
    [word="The"] [word="dream"]
    [word="the"] [lemma="dream"]
    [tag="AJ0"] [lempos="dream-n"]

Regular Expression in Attributes

Value is a regular expression in [attribute="value"]

    [word="dream.*"]
    [word="[dD]ream"]
    [word="[0-9]*"]
    [lc="dreams"]
    [tag="NN."]
    [lempos="dream-v"]
    [word="[0-9]{5,}"]
    [word="\."]
    [word="\("] [word="0[0-9]{3}"] [word="\)"]
    [word="[A-Z][0-9A-Z]{2,3}"] [word="[0-9][0-9A-Z]{2}"]

Regular Expressions

PCRE library used for evaluation of REs
Several useful special sequences
■ \d - any decimal digit
■ \D - any character that is not a decimal digit
■ \w - any "word" character
■ \W - any "non-word" character
■ (?i) - ignore case

    [word="\d\d\W"]
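The way an attribute value acts as a regular expression can be tried out with a minimal Python sketch. It assumes that the pattern has to match the whole attribute value, which is what the slides suggest by writing [word="dream.*"] to reach longer word forms. The token list and the helper match_attribute are hypothetical, and Python's re module stands in for PCRE; the two agree on the simple patterns used here.

```python
import re

# Hypothetical toy tokens; attribute names follow the slides (word, lc, lemma, tag).
TOKENS = [
    {"word": "The", "lc": "the", "lemma": "the", "tag": "AT0"},
    {"word": "Dream", "lc": "dream", "lemma": "dream", "tag": "NN1"},
    {"word": "dreams", "lc": "dreams", "lemma": "dream", "tag": "NN2"},
    {"word": "12345", "lc": "12345", "lemma": "12345", "tag": "CRD"},
    {"word": ".", "lc": ".", "lemma": ".", "tag": "PUN"},
]


def match_attribute(tokens, attribute, pattern):
    """Return the values of `attribute` that `pattern` matches as a whole.

    Emulates a one-token query [attribute="pattern"] under the assumption
    that the regular expression is anchored at both ends of the value.
    """
    regex = re.compile(pattern)
    return [tok[attribute] for tok in tokens if regex.fullmatch(tok[attribute])]


if __name__ == "__main__":
    for attribute, pattern in [
        ("word", "[dD]ream"),
        ("word", "dream.*"),
        ("lc", "dream.*"),
        ("tag", "NN."),
        ("word", "(?i)dream"),
    ]:
        print(f'[{attribute}="{pattern}"]', match_attribute(TOKENS, attribute, pattern))
```

Comparing the word and lc rows shows why a lowercased attribute is convenient: the same pattern finds both case variants without a character class or (?i).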

Logical combinations of attributes

Boolean combinations (AND, OR and NOT) of [attribute="value"] expressions.
Use: &, |, !=, ()

    [word="dream" & tag="NN1"]
    [lemma="dream" & tag="VV."]
    [word="dream" | word="Dream"]
    [word="the" | tag="DPS"][lempos="dream-n" & tag="NN2"]
    [word="the" | (tag="DPS" & lemma!="my")][lemma="dream"]

Regular expressions of tokens

Regular expressions on token level:
    ?      optional token
    *      any number of repetitions
    +      at least one
    {N}    exact number of repetitions
    {M,N}  from M to N repetitions
    []     any token

    [tag="DPS"] [] [lemma="dream"]
    [tag="DPS"] [tag="AJ0"]? [lemma="dream"]
    [tag="AJ0"]{2} [lemma="dream"]
    [word="the"] []{0,3} [lempos="dream-n"]

within keyword at the end of a query
■ within restricts the result to one sentence
■ within restricts the result to a subcorpus

    [lemma="dream"] within
    [word="the"] []{3,5} [lemma="dream"]
    [word="the"] []{3,5} [lemma="dream"] within

More within combinations: Boolean combinations of regular expressions

    [lemma="dream"] within
    [lemma="dream"] within
    [word="the"] []{3,5} [lemma="dream"] within within
    [word="the"] []{3,5} [lemma="dream"] within

Within

within could be inverted

    [word="THE"] within
    [word="THE"] within !
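The token-level repetition and the within restriction described above can be illustrated with a small Python sketch. It emulates [word="the"] []{0,3} [lemma="dream"] on a toy token list, then keeps only matches that lie inside one sentence, or only those that do not when the restriction is inverted with "!". The data, the sentence spans and the helpers find_gap_matches and within are hypothetical; this is not how the query engine is implemented, and a real engine may report overlapping matches differently.

```python
# Hypothetical toy corpus: (word, lemma) pairs and sentence spans (first, last index).
TOKENS = [
    ("the", "the"), ("big", "big"), ("dream", "dream"), (".", "."),
    ("the", "the"), ("end", "end"), (".", "."),
    ("I", "I"), ("dream", "dream"), (".", "."),
]
SENTENCES = [(0, 3), (4, 6), (7, 9)]


def find_gap_matches(tokens, first, last, min_gap, max_gap):
    """Spans (start, end) where first(tokens[start]) and last(tokens[end]) hold
    and between min_gap and max_gap arbitrary tokens lie in between."""
    spans = []
    for start, tok in enumerate(tokens):
        if not first(tok):
            continue
        for gap in range(min_gap, max_gap + 1):
            end = start + gap + 1
            if end < len(tokens) and last(tokens[end]):
                spans.append((start, end))
    return spans


def within(spans, structures, invert=False):
    """Keep spans lying entirely inside some structure; invert=True mimics 'within !'."""
    def inside(span):
        return any(s <= span[0] and span[1] <= e for s, e in structures)
    return [span for span in spans if inside(span) != invert]


# [word="the"] []{0,3} [lemma="dream"]
matches = find_gap_matches(
    TOKENS,
    first=lambda tok: tok[0] == "the",
    last=lambda tok: tok[1] == "dream",
    min_gap=0,
    max_gap=3,
)

if __name__ == "__main__":
    print("all matches:       ", matches)
    print("within a sentence: ", within(matches, SENTENCES))
    print("not within:        ", within(matches, SENTENCES, invert=True))
```

The second match starts in one sentence and ends in another, so it survives only the inverted restriction.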

Structure boundaries

Structure boundaries: start/end of a structure, whole structure

    [lemma="dream"]
    [word="\?"] within

Global conditions

Global condition
■ numeric labels of tokens
■ testing agreement or disagreement of attribute values

    [tag="NN."] [word="and"] [tag="NN."]
    1:[tag!="NN."] [word="and"] 2:[tag!="NN."]
    1:[] [word="and"] 2:[] & 1.k=2.k & 1.c=2.c & 1.tag 2.tag

Parallel corpora

Parallel corpora - a separate corpus for each language, 1-to-1 alignment using a tag.
A query can limit the search to segments whose aligned parts contain hits of a subquery.

    [lemma="hrad"] within kacen: [word="castle"]
    [lemma="hrad"] within ! kacen: [word="castle"]

Meet/Union queries
■ combining and nesting simple (one-token) queries
■ not a sequence of tokens
■ meet and union operators

Union

Union operator:
■ union Q1 Q2

    (union [word="dream"] [word="dreams"])
    [word="dream" | word="dreams"]

Meet

Meet operator:
■ meet Q1 Q2 W-BEG W-END
■ find Q1 with Q2 in window from W-BEG to W-END
■ W-BEG, W-END default to 1

    (meet [word="my"] [word="dream"])
    [word="my"] [word="dream"]
    (meet [word="my"] [word="dream"] 1 3)
    [word="my"] []{0,2} [word="dream"]
    (meet [word="black"] [word="white"] -3 3)

Meet/union combination

Use a meet/union operator in place of a simple query

    (meet [word="and"] (meet [word="black"] [word="white"] -3 3) -2 2)

Within keyword

within works with any subquery, not only a structure

    [lemma="dream"] within ([word="my"] [lemma="dream"])
    (meet [lemma="dream"] [word="my"] -1 -1)
    [word="the"] []{0,3} [lemma="dream"] within ([tag="AT."] [tag="AJ."]{0,4} [tag="NN."])

containing keyword

containing keyword
■ inverts the within keyword
■ matches results of the first subquery which contain matches of the second subquery

    containing [lemma="dream"]
    (meet [lemma="dream"] [word="my"] -1 -1)
    [word="the"] []{1,3} [lemma="dream"] containing [lemma="wild"]

Combinations of containing/within

Both keywords form a query which can be used as a subquery; they can be nested.

    [lemma="break"] within ( containing [lemma="rule"])
    [lemma="student"] within ( containing [lemma="break"] containing [lemma="rule"])
    [lemma="break"] within ([]{5} containing [lemma="rule"])
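
To make the nesting concrete, here is a minimal Python sketch of the last query, [lemma="break"] within ([]{5} containing [lemma="rule"]), treating every subquery result as a list of (first, last) token spans. The lemma list and the helpers lemma_spans, windows, containing and within are hypothetical names chosen for illustration, and plain span containment is an assumption about the semantics, not a description of the engine.

```python
# Hypothetical toy corpus, given as lemmas only.
LEMMAS = ["student", "break", "the", "rule", "and",
          "then", "take", "a", "long", "break"]


def lemma_spans(lemmas, lemma):
    """One-token spans for [lemma="..."]."""
    return [(i, i) for i, value in enumerate(lemmas) if value == lemma]


def windows(n_tokens, size):
    """All spans of exactly `size` tokens, as produced by []{size}."""
    return [(i, i + size - 1) for i in range(n_tokens - size + 1)]


def covers(outer, inner):
    """True if span `inner` lies entirely inside span `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def containing(outer_spans, inner_spans):
    """Outer spans that contain at least one inner span."""
    return [o for o in outer_spans if any(covers(o, i) for i in inner_spans)]


def within(inner_spans, outer_spans):
    """Inner spans that lie inside at least one outer span."""
    return [i for i in inner_spans if any(covers(o, i) for o in outer_spans)]


# []{5} containing [lemma="rule"]
rule_windows = containing(windows(len(LEMMAS), 5), lemma_spans(LEMMAS, "rule"))

# [lemma="break"] within ([]{5} containing [lemma="rule"])
result = within(lemma_spans(LEMMAS, "break"), rule_windows)

if __name__ == "__main__":
    print("windows containing 'rule':", rule_windows)
    print("'break' inside such a window:", result)
```

Only the first "break" is kept: it shares a five-token window with "rule", while the second one is too far away.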