PA153: Authorship Identification Using Stylometry and Machine Learning
Jan Rygl, rygl@fi.muni.cz
Dec 14, 2015
Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic

History and motivation

History: why do we care about authorship?
• Antiquity: Homer, Demosthenes vs. Anaximenes
• The Bible: the Pentateuch
• England, 1694 (end of pre-publication censorship): pseudonyms come into use
• England, 1887: the first algorithm addressing authorship
• England, 1976: accepted as evidence in court
• Today: analysis of anonymous documents and of texts with forged authorship, ...

Approaches to authorship identification
1. Ideological and thematic analysis: historical documents, literary works
2. Documentary and factual evidence: the Inquisition, libraries
3. Linguistic and stylistic analysis (stylometry): the present day

Authorship verification
• Were two texts written by the same author? (1v1)
• Was a document written by the author who signed it? (1vN)
Examples:
• The Shakespeare authorship question
• Verifying the authenticity of a will

Authorship verification: the Shakespeare authorship question
Mendenhall, T. C. 1887. The Characteristic Curves of Composition. Science, Vol. 9: 237-49.
• The first algorithmic analysis
• Calculating and comparing histograms of word lengths
Candidates pictured: Oxford, Bacon, Derby, Marlowe (image: http://en.wikipedia.org/wiki/File:ShakespeareCandidatesl.jpg)

Authorship attribution
Definition:
• find out the author of a document
• the candidate authors can be known
Examples:
• False reviews

Authorship attribution: judiciary
• The police falsify testimonies
Morton, A. Q. Word Detective Proves the Bard wasn't Bacon. Observer, 1976.
• Evidence in courts of law in Britain, the U.S. and Australia

Authorship clustering
Definition:
• cluster documents or text paragraphs according to their authors (Author A, Author B)
Examples:
• The Bible
• Analysis of anonymous documents

Authorship clustering: the Bible
K. Grayston and G. Herdan. The authorship of the pastorals in the light of statistical linguistics. New Testament Studies, VI:1-15, 1959-1960.
Gustav Herdan, statistician and linguist:
• born 1897 in Brno
• author of Quantitative Linguistics
• mathematical language laws, e.g. the number of distinct words in a document as a function of the document length

Computational stylometry
• Online social networks: predicting age and gender
• Plagiarism: co-authorship
• Supportive authentication, biometrics (e.g. in e-learning)
• Native language prediction
• Anonymous documents, threats, ...
• Ministry of the Interior of the CR, within the project VF20102014003

Research for the Ministry of the Interior of the CR
• authorship detection for Czech
• new author characteristics, and adaptation of existing ones for inflective free-word-order languages
• new techniques for "Internet documents"
• software: Authorship Recognition Tool (ART)

Outline
1. History and motivation
2. Techniques
3. Results

Techniques

Computational stylometry
Definition: techniques that allow us to find out information about the authors of texts on the basis of an automatic linguistic analysis.
Motivation: stylometric analysis is used for
• Linguistic expertise
• Stylome: the set of an author's characteristic features
• Machine learning: stylometric features serve as attributes for machine learning

Stylometry: preprocessing
• document crawling
• text and metadata extraction (detect author's la…)
• text cleaning
• deduplication
• boilerplate removal
• remove markup tags
• language and encoding detection
• tokenize
• morphological analysis (word, lemma, tag):
    je        být     k5eAaImIp3nS
    spor      spor    k1gInSc1
    mezi      mezi    k7c7
    Severem   sever   k1gInSc7
• syntactic analysis (figure: parse-tree fragment)

Authorship recognition through stylometry
For each text:
1. preprocess the text
2. compute the values of the stylometric features (the text is represented by a vector of feature values)
Depending on the task:
1. compare two documents: subtract one feature-value vector from the other
2. characterize a label (author): analyze the feature-value vectors that share the same label (author)

Stylometry-feature categories
• Morphological
• Syntactic
• Vocabulary: semantic words, stop-words
• Technical (text formatting, publishing time)
• Other

Author's characteristic features

Word length statistics
• Count and normalize the frequencies of selected word lengths (e.g. 1-15 characters)
• Modification: word-length frequencies are influenced by adjacent frequencies in the histogram, e.g.
  1: 30 %, 2: 70 %, 3: 0 % is more similar to 1: 70 %, 2: 30 %, 3: 0 % than to 1: 0 %, 2: 60 %, 3: 40 %

Sentence length statistics
• Count and normalize the frequencies of
  • words per sentence
  • characters per sentence
• Detect sentences written in the first person
• Extract the author's gender if possible, e.g. "včera jsem byla v Brně a viděla" ("yesterday I was in Brno and saw", with feminine verb forms)

Wordclass (bigrams) statistics
• Count and normalize the frequencies of wordclasses (wordclass bigrams)
• A verb is followed by a noun with the same frequency in five selected texts of Karel Čapek

Morphological tags statistics
• Count and normalize the frequencies of selected morphological tags
• The gender (genus) value for family and the archaic marker have the most consistent frequencies in five selected texts of Karel Čapek

Word repetition
• Analyse which words or wordclasses are frequently repeated within a sentence

Stopwords
• Count the normalized frequency of each word from a stopword list
• A stopword is a general word whose semantic meaning is not important, e.g. prepositions, conjunctions, ...
• The stopwords ten, by, člověk, že are the most frequent in five selected texts of Karel Čapek

Syntactic analysis
• Extract features using SET (Syntactic Engineering Tool)
• Figure: syntactic tree of the sentence "Verifikujeme autorství se syntaktickou analýzou" ("We verify authorship with syntactic analysis")
• Syntactic trees have a similar depth in five selected texts of Karel Čapek

Other stylometric features
• typography
• formatting richness
• emoticons
• errors
• vocabulary richness

Document comparison
Example: comparison between two different authors, by characteristic feature (bar chart; labels and values in the order extracted from the figure):
  0.0183               global
  Wordclass bigrams    0.0003
  Wordclasses          0.04
  Morph. tags          0.0735
  Word repetition      0.1105
  Syntactic analysis   0.1358
  Stopword             0.3538
  Punctuation          0.3929
  Chars per sentence   0.7667

Author writeprint/stylome
• Built from a collection of the author's documents (figure: per-feature chart, e.g. Morph. tags)

Document comparison: word-length example
  word length   doc. A   doc. B   diff.
  1             0        2        2
  2             0        2        2
  3             2        1        1
  4             6        1        5
  5             0        1        1
  6             1        2        1
  7             2        0        2
New ML layer (replaces the linguist's heuristics with empirical evidence): the vector of differences between A and B for word lengths 1..7 is passed to a classifier, classifier(vector).

Results

Performance (Czech texts)
Balanced accuracy; text types: books, essays, newspapers, blogs, letters, e-mails, discussions
Verification:
• books, essays: 90 % -> 99 %
• blogs, articles: 70 % -> 99 %
• tweets: 70 % -> 99 % (given enough tweets)
Attribution (depends on the number of candidates; comparison on blogs):
• up to 4 candidates: 80 % -> 95 %

Current work: machine translation detection
• Recognize texts translated by Google, Bing and other machine translators
• Remove translations from corpora
• Detect texts falsely submitted as translated by a human expert
• http://nlp.fi.muni.cz/sir
Screenshot: analysis result showing the probability that the text was written by a human and the probability that it is a translation, with per-feature bars (word-class unigrams, word-class bigrams, words per sentence, typography, word length, word richness, word count)

Current work: web structure detection
• Create stylometric corpora
• Detect the web structure and download documents with metadata (author, gender, age, title, topic)
Diagram: Web Expert supporting system
• Expert: takes an anonymous document, explains the choice of the author, returns the selected author with an explanation
• Fact Extraction: extracts facts from the document, returns a summary of the document
• Intelligent Web Monitor: downloads all new documents from a selected website, analyses the structure of the web, extracts information from documents

Current work
• Gender detection
  • Use data from dating services
  • Detect advertisements with a falsely submitted gender
• Authorship detection consultations

Thank you for your attention
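
Sketch: word-length statistics with neighbour smoothing

The word-length slide says that frequencies should be influenced by adjacent bins of the histogram, but the deck does not state the exact rule. The sketch below is one possible reading, not the author's method: each bin keeps half of its own frequency and borrows a quarter from each neighbour (the `weight=0.25` kernel and all function names are assumptions). On the slide's three histograms, a plain sum of absolute differences cannot separate the two comparisons, while the smoothed version ranks them as the slide describes.

```python
import re
from collections import Counter


def word_length_histogram(text, max_len=15):
    """Percentage of words of each length 1..max_len (longer words are skipped)."""
    lengths = [len(w) for w in re.findall(r"\w+", text) if len(w) <= max_len]
    counts = Counter(lengths)
    total = len(lengths) or 1  # avoid division by zero on empty input
    return [100.0 * counts[n] / total for n in range(1, max_len + 1)]


def smooth(hist, weight=0.25):
    """Assumed kernel: each bin borrows `weight` of each neighbour's frequency.

    Edge bins have only one neighbour, so the result is not renormalized.
    """
    out = []
    for i, value in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0.0
        right = hist[i + 1] if i + 1 < len(hist) else 0.0
        out.append((1 - 2 * weight) * value + weight * (left + right))
    return out


def l1(a, b):
    """Sum of absolute per-bin differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# The three histograms from the slide (percentages for word lengths 1, 2, 3).
a = [30, 70, 0]
b = [70, 30, 0]
c = [0, 60, 40]

print(l1(a, b), l1(a, c))                                   # 80 80
print(l1(smooth(a), smooth(b)), l1(smooth(a), smooth(c)))   # 30.0 37.5
```

Without smoothing both distances are 80; with smoothing the first histogram is closer to the second (30.0) than to the third (37.5).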
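
Sketch: difference vector and ML layer

The document-comparison table reduces two documents to a vector of per-feature differences and hands it to a classifier. The sketch below recomputes the table's `diff.` column from its `doc. A` and `doc. B` columns, then feeds the result to a nearest-centroid classifier. The classifier choice, the function names and the four training pairs are all hypothetical; the deck does not say which learner or data were used.

```python
import re
from collections import Counter


def word_length_counts(text, max_len=7):
    """Number of words of each length 1..max_len in a text."""
    lengths = Counter(len(w) for w in re.findall(r"\w+", text))
    return [lengths.get(n, 0) for n in range(1, max_len + 1)]


def difference_vector(vec_a, vec_b):
    """Per-feature absolute difference between two feature-value vectors."""
    return [abs(x - y) for x, y in zip(vec_a, vec_b)]


def train_centroids(samples):
    """Mean difference vector per label; samples are (vector, same_author) pairs."""
    sums, counts = {}, Counter()
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(centroids, vec):
    """Return the label whose centroid is nearest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(vec, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Word-length counts for lengths 1..7, copied from the table.
doc_a = [0, 0, 2, 6, 0, 1, 2]
doc_b = [2, 2, 1, 1, 1, 2, 0]
diff = difference_vector(doc_a, doc_b)
print(diff)  # [2, 2, 1, 5, 1, 1, 2]

# Made-up training pairs, for illustration only: (difference vector, same author?).
toy_pairs = [
    ([0, 1, 0, 1, 0, 0, 1], True),
    ([1, 0, 1, 0, 1, 0, 0], True),
    ([2, 3, 1, 4, 2, 1, 2], False),
    ([3, 2, 2, 5, 1, 2, 1], False),
]
centroids = train_centroids(toy_pairs)
print(classify(centroids, diff))  # False with this toy data
```

In practice the vector would concatenate the differences of all stylometric features, not only word lengths, and the training pairs would come from documents of known authorship.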