Scalability of Semantic Analysis in Natural Language Processing

ŘEHŮŘEK, Radim. Scalability of Semantic Analysis in Natural Language Processing. 2011.

Other formats: BibTeX LaTeX RIS

TY  - ART
ID  - 959018
AU  - Řehůřek, Radim
PY  - 2011
TI  - Scalability of Semantic Analysis in Natural Language Processing
KW  - latent semantic analysis, latent dirichlet allocation, digital libraries, natural language processing
UR  - http://radimrehurek.com/phd_rehurek.pdf
ER  -

Basic information
Original title	Scalability of Semantic Analysis in Natural Language Processing
Czech title	Škálovatelnost semantické analýzy ve zpracování přirozeného jazyka
Authors	ŘEHŮŘEK, Radim.
Edition	2011.

Additional information
Original language	English
Result type	Original artistic works
Field	10201 Computer sciences, information science, bioinformatics
Publisher country	Czech Republic
Confidentiality	not subject to state or trade secrecy
WWW	PhD thesis reviews
Organizational unit	Faculty of Informatics
Keywords (English)	latent semantic analysis, latent dirichlet allocation, digital libraries, natural language processing
Flags	International significance, Peer-reviewed
Changed by	RNDr. Radim Řehůřek, Ph.D., učo 39672. Changed: 24. 11. 2011 18:52.

Abstract

Data mining applications that work over input of very large scale (web-scale problems) pose challenges that are new and exciting both academically and commercially. Any web-scale algorithm must be robust (dealing gracefully with the inevitable data noise), scalable (capable of efficiently processing large input) and reasonably automated (as human intervention is very costly and often impossible on such scales). This thesis consists of two parts. In the first part, I explore scalability of methods that derive a semantic representation of plain text documents. The focus will be entirely on unsupervised techniques, that is, on methods that don’t make use of manually annotated resources or human input. I develop and present scalable algorithms for Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), two general-purpose statistical methods for semantic analysis that serve as building blocks for more concrete, applied algorithms. Scalability is achieved by building the semantic models in a constant amount of memory and distributing the computation over a cluster of autonomous computers, connected by a high-latency network. In addition, the novel LSA training algorithm operates in a single pass over the training data, allowing continuous online training over infinite-sized training streams. The second part of the thesis deals with possible applications of these general semantic algorithms. I present my research in the field of Information Retrieval (IR), including work on topic segmentation of plain-text documents, on document-document similarities (“semantic browsing”) in digital libraries and on language segmentation of documents written in multiple languages.

Abstract (translated from Czech)

The thesis deals with data mining from large corpora. It focuses on robust statistical methods that can automatically build a compact semantic representation of free text, i.e. without using metadata or manual human input. The first part of the thesis addresses the scalability of Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). I present new algorithms for scalable construction of these semantic models. Scalability is achieved by 1) distributing the computation over multiple machines, 2) using only a constant amount of memory with respect to the size of the training data, and 3) training the model in a limited number of passes over the training data (a single pass in the case of LSA, which allows training on an infinite, non-stationary stream of training data). The second part of the thesis describes several possible applications of these general semantic algorithms. I present the results of my research in Information Retrieval (IR), such as topic segmentation of free text, semantic similarity of documents in digital libraries, and efficient segmentation of text by language. The thesis also includes open-source software that implements these methods.
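
The idea of training LSA in a single pass with memory that does not grow with the number of documents can be illustrated with a small sketch. It is not the algorithm from the thesis: it is a generic incremental truncated SVD that folds each new chunk of documents into the current decomposition and then truncates back to a fixed rank. The function names `update_svd` and `stream_lsa`, the dense NumPy chunks, and the terms-by-documents layout are all assumptions made for illustration.

```python
import numpy as np

def update_svd(u, s, chunk, rank):
    """Fold one chunk (terms x docs, dense) into a truncated left SVD.

    Illustrative helper, not taken from the thesis. Only the left singular
    vectors `u` and singular values `s` are kept, so memory depends on the
    vocabulary size, the rank and the chunk size, not on how many documents
    have been seen so far.
    """
    if u is None:
        stacked = chunk
    else:
        # u * s scales each column of u by its singular value, so
        # stacked @ stacked.T equals the Gram matrix of everything seen so far
        # (up to the truncation error of earlier steps).
        stacked = np.hstack([u * s, chunk])
    u_new, s_new, _ = np.linalg.svd(stacked, full_matrices=False)
    return u_new[:, :rank], s_new[:rank]

def stream_lsa(chunks, rank):
    """Consume an iterable of document chunks exactly once."""
    u, s = None, None
    for chunk in chunks:
        u, s = update_svd(u, s, chunk, rank)
    return u, s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic rank-3 "term-document" matrix: 50 terms, 200 documents.
    a = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 200))
    chunks = (a[:, i:i + 20] for i in range(0, 200, 20))
    u, s = stream_lsa(chunks, rank=5)
    print(s)
```

When the data has rank no larger than the requested rank, the streamed singular values agree with a batch SVD up to floating-point error; otherwise each truncation discards some information, so the result is an approximation. A distributed variant would additionally need a way to merge decompositions computed on different machines, which this sketch does not cover.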
