2014/11/23

Corpus Linguistics and Japanese Language
コーパス言語学と日本語
Takehiko Maruyama 丸山 岳彦
National Institute for Japanese Language and Linguistics / University of Oxford
18 November 2014
SEMINÁŘ JAPONSKÝCH STUDIÍ, Masarykova Univerzita

NINJAL
National Institute for Japanese Language and Linguistics (“NINJAL”) 国立国語研究所
Established in 1948
Scientific surveys of the Japanese language
Creation of Japanese corpora

Contents (10:50 – 11:50)
Introduction: Japan and Japanese Language
Japanese Corpus: History
  What is a “Corpus”?
  History of Japanese Corpus
Japanese Corpus: Present situation
  Spoken Corpus: CSJ
  Written Corpus: BCCWJ

Introduction: Japan and Japanese Language

Where is Japan?
Japan / Nihon 日本
Tokyo, Kyoto, Mt. Fuji, Hokkaido, Japan Alps, Okinawa

Dialects in Japan
Dialect surveys by NINJAL since the 1940s
Fukushima pref. 1949, Hachijo island 1949, Tokyo, Iwate pref. 1980, Tottori pref. 1984, Okinawa pref. 1978

Dialects in Japan
Linguistic Atlas of Japan (NINJAL, 1966)

Japanese Writing System
Three types of characters
  Kanji: 教科書 玉子
  Hiragana: ほん たまご
  Katakana: テキスト タマゴ
Other types of characters
  Punctuation marks: 、 。 ? ! 「 ( / &
  Alphabet: NINJAL
  Arabic numerals: 1,234
  Roman numerals: I IV XIII

Japanese Corpus: History
  What is a “Corpus”?
  History of Japanese Corpus

What is a “Corpus”?
A “corpus” is… a collection of language from the “real world”.
“a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis”
(Francis 1982) Various corpora Text (written) corpus, Speech (spoken) corpus, Historical corpus, Learner corpus, Dialect corpus… “Corpus linguistics” is… a methodology of linguistic study using corpora Corpus collection/creation started in 1960s 1959 UK The Survey of English Usage (1 million words) 1964 US Brown Corpus (1 million wds) 1991 UK Bank of English (BOE) (500 million wds) 1994 UK British National Corpus (BNC) (100 million wds) 2000 CZ Czech National Corpus (CNC) (100 million wds) 2004 JP Corpus of Spontaneous Japanese (CSJ) (7.5 million wds) 2011 JP Balanced Corpus of Contemporary Written Japanese (BCCWJ) (100 million wds) Various corpora in the world Where is the origin of Japanese corpus? 2014/11/23 3 History of Japanese Corpus 1 Surveys of daily vocabulary at NINJAL 1953 Research on vocabulary in women's magazines 1957-1958 Research in vocabulary in cultural reviews 1962-1964 Vocabulary and Chinese characters in ninety magazines of today (I, II, III) 0.5 million words Real text Sampling Vocabulary list Origin of Japanese written Corpus! History of Japanese Corpus 2 Surveys of colloquial speech at NINJAL 1955 Research in the colloquial Japanese 30 hours of colloquial speech were recorded, 83,620 words 1960, 1963 A research for making sentence patterns in colloquial Japanese (1: dialog) (2: monolog) History of Japanese Corpus 2 Surveys of colloquial speech at NINJAL 1955 Research in the colloquial Japanese 30 hours of colloquial speech were recorded, 83,620 words 1960, 1963 A research for making sentence patterns in colloquial Japanese (1: dialog) (2: monolog) Origin of Japanese spoken Corpus! 
History of Japanese Corpus 3
Vocabulary surveys using computers
1970-1973  Studies on the vocabulary of modern newspapers 1-4
  2 million words from three major newspapers in 1966
Origin of the Japanese electronic corpus!

Japanese Corpora in 2000s
NINJAL started creating large-sized corpora
Corpus of Spontaneous Japanese (CSJ), 2004
  651 hours, 7.52 million words of spontaneous speech
Balanced Corpus of Contemporary Written Japanese (BCCWJ), 2011
  100 million words of various written texts (well balanced)
Corpus of Historical Japanese (CHJ), 2013~
  14 literary works of the Heian period, 0.79 million words
Ultra Large-sized Corpus (ULC), under construction
  10 billion words of Japanese text extracted from the web

Japanese Corpus: Present situation
  Spoken Corpus: CSJ
  Written Corpus: BCCWJ

Keywords
Knowledge and Behavior / Znalosti a chování / 知識と行動

CSJ: Corpus of Spontaneous Japanese
『日本語話し言葉コーパス』

Question 1
Which one is the correct spelling of “communication” in Japanese?
1. コミニケーション
2. コミュニケーション
3. コミニュケーション
4. コミュニュケーション

Variable forms in speech

Question 2
How do you read this word?
自転車  じ てん しゃ  ji ten sya  Yes!

Question 2-2
How do Japanese people pronounce the word “自転車” in real life?
自転車  じ でん しゃ  ji den sya

Question 3
Guess the percentage of each pronunciation in real Japanese.
コミニケーション ( )%
コミュニケーション ( )%
コミニュケーション ( )%
コミュニュケーション ( )%
じてんしゃ ( )%
じでんしゃ ( )%
How should we get the answers?

Question 4
When you hesitate while speaking, you might use FPs (filled pauses): hm... er... uh...
What type of FP do you use most frequently in your daily Czech? How about in Japanese?
How should we get the answers?

How should we get the answers?
Think about it in your head! (intuition)
  Your answer may be wrong! Who guarantees your answer?
Ask the speech corpus!
(survey) Everyone can get the same answer ! Of course you need a reliable corpus. We have knowledge about (at least) a language. But we don’t know how we behave with it. CSJ Corpus of Spontaneous Japanese (2004) Japanese spontaneous speech (mainly monolog) 651 hours, 7.52 million words 3,302 lectures by 1,418 different speakers Rich annotations 18 DVDs Aims Automatic Speech Recognition (ASR) system Linguistic study of spontaneous speech 2014/11/23 6 Basic Form & Pronounced Form FP Frag- ment Repair Variable pronunciation Elongation Two Ways of Transcription Answer to “Communication” How do Japanese pronounce “communication” ? Corpus : CSJ, 651 hours, 7.52 million words Frequency of the word “communication” : 601 times コミニケーション 296 コミニュケーション 136 コミュニケーション 123 コミュニュケーション 36 misc 10 Total 601 49% 23% 20% 6% 2% コミニケーション コミニュケーション コミュニケーション コミュニュケーション その他 Answer to “自転車” How do Japanese pronounce 自転車 ? Corpus : CSJ, 651 hours, 7.52 million words Frequency of the word 自転車 : 483 times ジテンシャ 349 ジデンシャ 116 misc 18 Total 483 72% 24% 4% ジテンシャ ジデンシャ misc Answer to Filled Pauses (JP) What FP do Japanese use most frequently? Corpus : CSJ, 651 hours, 7.52 million words Frequency of Filled Pauses : 430,472 times ☞(F えー) (F e:) 116,772 27.1% (F え) (F e) 45,665 10.6% (F ま) (F ma) 44,549 10.4% (F あのー) (F ano:) 40,695 9.5% (F あの) (F ano) 33,330 7.7% (top 5) Answer to the Filled Pauses (JP) What FP do Japanese use most frequently? Corpus : CSJ, 651 hours, 7.52 million words Frequency of Filled Pauses : 430,472 times Male (F e:) 95,359 30.2% (F e) 36,078 11.4% (F ma) 34,643 11.0% (F ma:) 24,369 7.7% (F ano:) 21,302 6.8% Female (F e:) 21,413 18.7% (F ano:) 19,393 16.9% (F ano) 15,954 13.9% (F ma) 9,906 8.6% (F e) 9,587 8.4% えー e: あのー ano: Answer to the Filled Pauses (CZ) What FP do Czech use most frequently? 
… and tell me the result.

Annotations to speech signals
Annotations: Two-way Transcription, Segment Labels, Intonation Labels, Morphological Analysis, Clause Boundary Labels, Dependency Structure, Discourse Structure, Impression Rating, Speaker Info
Related fields: Phonetics / Phonology, Morphology / Lexicon, Syntax, Discourse analysis, Metadata / bibliography

Morphological Analysis
All transcriptions were segmented into words (manually/automatically) with rich information:
Transcription ID, Utterance time, Orthographic form, Pronunciation form, Part-of-Speech, Conjugation type, Conjugation form

XML Encoding
The various annotations were encoded into XML files.

Concordancer “Himawari”

What CSJ offers
Variations in spontaneous speech
  Pronunciation, Accent, Intonation, Grammar…
Disfluency in spontaneous speech
  FP, Word Fragments, Elongation, Self-repair…
A resource for analyzing behavior in spontaneous Japanese
Future work: to create a large dialog corpus
Linguistic knowledge never tells us our behavior!

BCCWJ: Balanced Corpus of Contemporary Written Japanese
『現代日本語書き言葉均衡コーパス』

BCCWJ
Contents: balanced corpus for general purposes
Corpus size: 100 million words
Period: 1976 - 2005 (-2009)
Media: Books, Magazines, Newspapers, Whitepapers, Textbooks, Web Documents, Law, Verse, Diet minutes...
Method: Stratified random sampling
Aim: Vocabulary survey, Grammatical study, Lexicography, Natural language processing...

Structure of BCCWJ
Publication sub-corpus
  Books, Magazines, Newspapers
  35 million words, 2001-2005
Library sub-corpus
  Books stored in many public libraries
  30 million words, 1986-2005
Special-purpose sub-corpus
  Whitepapers, Textbooks, Public Relations, Best-Seller books, Web documents, Verse, Law, Diet minutes
  40 million words, 1976-2005

Publication Sub-corpus
Population: all the books, magazines, and newspapers published from 2001 to 2005, defined by the number of characters.
Actual state of Publication
Population (# of chars): Books, Magazines, Newspapers → Sample (35M words)

Definition of Population
Investigated number of chars in 2001-2005
Media        Titles     Pages        Chars
Books        317,117    74,911,520   48,539,925,351
Magazines    55,779     10,414,955   10,515,681,636
Newspapers   49,625     1,198,189    6,416,070,114
Data provided by: National Diet Library, Japan Magazine Publishers Association, Japan Newspaper Publishers Association

Stratification and Each Ratio
Media        # chars          Ratio     Genres
Books        48,539,925,351   74.138%   × 11
Magazines    10,515,681,636   16.063%   × 6
Newspapers   6,416,070,114    9.800%    × 3
TOTAL        65,471,677,100   100%

Media       Strata               # of chars       Ratio
Book        0. General works     1,636,414,548    2.50%
Book        1. Philosophy        2,597,610,813    3.97%
Book        2. General history   4,301,204,340    6.57%
Book        3. Social sciences   12,408,321,943   18.95%
Book        4. Natural sciences  5,069,594,034    7.74%
Book        5. Technology        4,615,929,967    7.05%
Book        6. Industry          2,196,387,437    3.35%
Book        7. The arts          3,258,432,447    4.98%
Book        8. Language          888,800,128      1.36%
Book        9. Literature        9,341,275,486    14.27%
Book        n. Unclassified      2,225,954,208    3.40%
Magazine    General              7,421,447,806    11.34%
Magazine    Education            877,875,592      1.34%
Magazine    Politics             456,459,405      0.70%
Magazine    Industry             110,640,958      0.17%
Magazine    Technology           1,468,293,360    2.24%
Magazine    Medical              180,964,513      0.28%
Newspaper   National             2,417,622,461    3.69%
Newspaper   Block                1,296,592,154    1.98%
Newspaper   Local                2,701,855,499    4.13%
Total                            65,471,677,100   100%
Distribution of # chars = Compositional Ratio

Extracting sample
A character is chosen at random on a page, and the sample starts there.
Figures and old Japanese text are omitted.

Compilation of BCCWJ
Sampling (as shown above)
Copyright clearance
  We identified almost 30,000 copyright holders.
  70-80% of them approved the request.
Text digitization and XML tagging
  Logical structure of text
Annotation of Part-of-Speech information
  98% accuracy with the electronic dictionary UniDic
  99.9% with annotators' corrections for 1 million words

Compilation of BCCWJ
やがて、後燕は漢人のに乗っ取られてしまいます。 西暦四〇九年のことですが、この翌年前記の南燕が東晋のによって、ほろぼされてしまいました。 四〇九年には、いろいろなことがおこっています。 さしもの拓跋珪も、この年、思わぬことで、あろうことか息子の一人、 によって殺されました。

Release of BCCWJ
The completed BCCWJ was released in 2011.
少納言 Shonagon  http://www.kotonoha.gr.jp/shonagon/
  Character-based concordance on the web
  Free, max 500 examples (randomly chosen)
中納言 Chunagon  https://chunagon.ninjal.ac.jp/
  Word-based concordance on the web
  Registration is needed; all the examples are downloadable
DVD
  All the morphologically analyzed text and bibliographic data
  Academic use, 52,500 yen

Collocations in BCCWJ
NINJAL-LWP for BCCWJ  http://nlb.ninjal.ac.jp/
Shows collocations (common word combinations)

Question 5
How do Japanese people write in daily life?
tamago: katakana / hiragana / kanji / kanji
Which is the most frequent?

What BCCWJ offers
The first balanced corpus of written Japanese
  The actual state of published and circulated written text
  Various types of written text
Easy access to a 100-million-word corpus
  Everybody can use a large-sized corpus
  Objective tests for linguistic analyses
Infrastructure for Japanese corpus linguistics

Conclusion before Lunch
Japanese corpora
  NINJAL has been rapidly creating a series of large corpora since 2000.
  Infrastructure for Japanese corpus linguistics
Knowledge and Behavior
  There are many linguistic questions we cannot answer with our linguistic knowledge alone.
  Linguists need reliable corpora to investigate linguistic behavior in actual life.
Use corpora!
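As a recap of the Publication sub-corpus sampling, the compositional ratios can be recomputed from the character counts reported for books, magazines, and newspapers. The sketch below is only illustrative: the function names are hypothetical, and applying the character-based ratios directly to the 35-million-word budget is an assumption made here, not a procedure stated on the slides. The recomputed ratios agree with the slide figures to about two decimal places.

```python
# Character counts per medium for 2001-2005, as reported on the
# "Definition of Population" slide.
POPULATION_CHARS = {
    "Books": 48_539_925_351,
    "Magazines": 10_515_681_636,
    "Newspapers": 6_416_070_114,
}

# Size of the Publication sub-corpus given on the slides (in words).
SAMPLE_WORDS = 35_000_000


def compositional_ratios(chars):
    """Return each stratum's share of the total number of characters."""
    total = sum(chars.values())
    return {medium: n / total for medium, n in chars.items()}


def proportional_allocation(chars, sample_size):
    """Split sample_size across strata in proportion to their character share.

    Assumption: the word budget is divided by the character-based ratios.
    """
    ratios = compositional_ratios(chars)
    return {medium: round(sample_size * r) for medium, r in ratios.items()}


ratios = compositional_ratios(POPULATION_CHARS)
allocation = proportional_allocation(POPULATION_CHARS, SAMPLE_WORDS)

for medium in POPULATION_CHARS:
    print(f"{medium}: {ratios[medium]:.3%} -> about {allocation[medium]:,} words")
```

The same two functions could be applied to the 20 genre strata (11 book, 6 magazine, 3 newspaper) by passing their character counts instead.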
Workshop after Lunch BCCWJ demonstrations 少納言 Shonagon 中納言 Chunagon NINJAL-LWP for BCCWJ CSJ demonstration ひまわり Himawari Other resources 青空文庫 Aozora Bunko on ひまわり Himawari Corpus Linguistics and Japanese Language (2) Workshop Takehiko Maruyama National Institute for Japanese Language and Linguistics / University of Oxford 18 November 2014 SEMINÁŘ JAPONSKÝCH STUDIÍ Masarykova Univerzita BCCWJ demonstrations 少納言 Shonagon 中納言 Chunagon NINJAL-LWP for BCCWJ CSJ demonstration ひまわり Himawari Other resources 青空文庫 Aozora Bunko on ひまわり Himawari Contents (14:10 – 15:45) BCCWJ : demonstrations 『現代日本語書き言葉均衡コーパス』 What is this ? すいか 西瓜 スイカ 2014/11/23 11 すいか スイカ 西瓜 How do they write it? すいか スイカ 西瓜 How do they write it? Question 6 すいか / スイカ / 西瓜 Which is the most frequent in Newspapers ? Ask 少納言 Shonagon! http://www.kotonoha.gr.jp/shonagon/ Question 7 Give an example of writing variation like すいか, and ask 少納言 Shonagon! For example… バイオリン・ヴァイオリン ダイヤモンド・ダイアモンド 買い物・買物 打ち合わせ・打合わせ・打合せ にんじん・ニンジン・人参 ひふ科・ヒフ科・皮ふ科・皮フ科・皮膚科 Question 5 How do Japanese write in daily life? tamago katakana hiragana kanji kanji Which is most frequent? Question 5 たまご タマゴ 玉子 卵 - Which is the most frequent in BCCWJ? Is it a good way to ask 少納言 Shonagon ? Example of search result “卵” 「バター、黒糖、卵黄をよくすり混ぜる。」 (Butter, brown sugar, yolk, mix them well.) 卵黄 (yolk) らん おう (ran o: ) It’s not the case of 卵 ! たまご 2014/11/23 12 Ask 中納言 Chunagon, in which Part-of-Speech information can be used. https://chunagon.ninjal.ac.jp/login Registration is needed to log in ! Question 5 Settings for the corpus search 『語彙素』が『卵』 ← Lemma AND 『語彙素読み』が『タマゴ』 ← Reading Question 5 Question 8 Give an example of writing variation like たまご, and ask 中納言 Chunagon! For example… 買い物・買物 ねこ・ネコ・猫 いぬ・イヌ・犬 Collocations in BCCWJ NINJAL-LWP for BCCWJ http://nlb.ninjal.ac.jp/ Shows collocation (common word combinations) Ask NLB about Japanese collocations ! 「 X を飲む」 (to drink X) What is the most frequent word for X in BCCWJ? 
Question 9

Question 10
Give an example of a collocation like 「X を飲む」, and ask NLB!
For example…
「X を食べる」 eat X
「X を聞く」 listen to X
「X を読む」 read X
「X を書く」 write X
「X を話す」 speak X

CSJ: Corpus of Spontaneous Japanese
『日本語話し言葉コーパス』

Distribution of CSJ
CSJ (on 18 DVDs) is distributed by the Center for Corpus Development, NINJAL.
http://www.ninjal.ac.jp/corpus_center/csj/

Himawari
Himawari is a character-based concordance system for Japanese linguistics.
http://goo.gl/nBcPO

Aozora Bunko
『青空文庫』

Aozora Bunko
Aozora Bunko (青空文庫) is a Japanese digital library.
This online collection has several thousand works of Japanese-language fiction and non-fiction.
Aozora Bunko has digital copies of many out-of-copyright books.
http://www.aozora.gr.jp/

Aozora Bunko on Himawari
The Aozora Bunko package can be downloaded.
http://goo.gl/Re73C

Instead of Conclusion…
ありがとう  7085
有難う  419
ありがと  337
有り難う  102
アリガト  26
アリガトウ  24
あリがとう  3
ア・リ・ガ・ト  2
あリがと  2
アリヽ(*^▽^*)ノガトゥ  1
Total (総計)  8001
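The variant counts on the closing slide, and the 卵 search from Question 5, both come down to counting the written forms of one word. The following is a minimal sketch of that idea, not the way Shonagon or Chunagon are implemented. The token list is hand-made for illustration (it is not BCCWJ data), and the function names are hypothetical. Only the ありがとう counts are taken from the slide.

```python
from collections import Counter

# Hypothetical tokens: (surface form, lemma, lemma reading).
# Hand-made for illustration; they only mimic the fields used in the search.
TOKENS = [
    ("卵", "卵", "タマゴ"),
    ("たまご", "卵", "タマゴ"),
    ("玉子", "卵", "タマゴ"),
    ("タマゴ", "卵", "タマゴ"),
    ("卵黄", "卵黄", "ランオウ"),
    ("卵", "卵", "タマゴ"),
]


def character_hits(tokens, query):
    """Character-based search: surface forms that contain the query string."""
    return [surface for surface, _, _ in tokens if query in surface]


def lemma_hits(tokens, lemma, reading):
    """Word-based search: surface forms whose lemma AND reading both match."""
    return [s for s, l, r in tokens if l == lemma and r == reading]


def variant_shares(counts):
    """Percentage of the total for each variant, rounded to one decimal."""
    total = sum(counts.values())
    return {form: round(100 * n / total, 1) for form, n in counts.items()}


# A character search for 卵 also returns 卵黄 (yolk), which is not "tamago".
print(character_hits(TOKENS, "卵"))

# A lemma + reading search keeps all four spellings and drops 卵黄.
print(Counter(lemma_hits(TOKENS, "卵", "タマゴ")))

# Variant counts from the closing slide.
ARIGATOU = {
    "ありがとう": 7085,
    "有難う": 419,
    "ありがと": 337,
    "有り難う": 102,
    "アリガト": 26,
    "アリガトウ": 24,
    "あリがとう": 3,
    "ア・リ・ガ・ト": 2,
    "あリがと": 2,
    "アリヽ(*^▽^*)ノガトゥ": 1,
}

for form, share in variant_shares(ARIGATOU).items():
    print(f"{form}: {share}%")
```

Passing the “communication” or 自転車 counts from the CSJ slides to `variant_shares` gives the percentages shown there.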