Other formats:
BibTeX
LaTeX
RIS
@misc{959018,
  author = {Řehůřek, Radim},
  keywords = {latent semantic analysis, latent dirichlet allocation, digital libraries, natural language processing},
  language = {eng},
  title = {Scalability of Semantic Analysis in Natural Language Processing},
  url = {http://radimrehurek.com/phd_rehurek.pdf},
  year = {2011}
}
TY  - ART
ID  - 959018
AU  - Řehůřek, Radim
PY  - 2011
TI  - Scalability of Semantic Analysis in Natural Language Processing
KW  - latent semantic analysis, latent dirichlet allocation, digital libraries, natural language processing
UR  - http://radimrehurek.com/phd_rehurek.pdf
N2  - Data mining applications that work over input of very large scale (web-scale problems) pose challenges that are new and exciting both academically and commercially. Any web-scale algorithm must be robust (dealing gracefully with the inevitable data noise), scalable (capable of efficiently processing large input) and reasonably automated (as human intervention is very costly and often impossible on such scales). This thesis consists of two parts. In the first part, I explore scalability of methods that derive a semantic representation of plain text documents. The focus will be entirely on unsupervised techniques, that is, on methods that don’t make use of manually annotated resources or human input. I develop and present scalable algorithms for Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), two general-purpose statistical methods for semantic analysis that serve as building blocks for more concrete, applied algorithms. Scalability is achieved by building the semantic models in a constant amount of memory and distributing the computation over a cluster of autonomous computers, connected by a high-latency network. In addition, the novel LSA training algorithm operates in a single pass over the training data, allowing continuous online training over infinite-sized training streams. The second part of the thesis deals with possible applications of these general semantic algorithms. I present my research in the field of Information Retrieval (IR), including work on topic segmentation of plain-text documents, on document-document similarities (“semantic browsing”) in digital libraries and on language segmentation of documents written in multiple languages.
ER  - 
ŘEHŮŘEK, Radim. \textit{Scalability of Semantic Analysis in Natural Language Processing}. 2011.