Other formats:
BibTeX
LaTeX
RIS
@inproceedings{769142,
  author    = {Pomikálek, Jan and Rychlý, Pavel},
  address   = {Marrakech, Morocco},
  booktitle = {Proceedings of the Sixth International Language Resources and Evaluation (LREC'08)},
  keywords  = {Detecting; Large Text Collections},
  language  = {eng},
  location  = {Marrakech, Morocco},
  isbn      = {2-9517408-4-0},
  pages     = {132--135},
  publisher = {European Language Resources Association (ELRA)},
  title     = {Detecting Co-Derivative Documents in Large Text Collections},
  url       = {http://www.lrec-conf.org/lrec2008/},
  year      = {2008}
}
TY  - JOUR
ID  - 769142
AU  - Pomikálek, Jan
AU  - Rychlý, Pavel
PY  - 2008
TI  - Detecting Co-Derivative Documents in Large Text Collections
PB  - European Language Resources Association (ELRA)
CY  - Marrakech, Morocco
SN  - 2951740840
KW  - Detecting
KW  - Large Text Collections
UR  - http://www.lrec-conf.org/lrec2008/
N2  - We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative documents using duplicate n-grams. Although we totally agree with the claim that not using unique n-grams can greatly increase the efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. While the memory requirements for computing co-derivative documents can be reduced to up to 1% by only using duplicate n-grams, SPEX needs about 40 times more memory for computing the list of duplicate n-grams itself. Therefore the memory requirements of the whole process are not reduced enough to make the algorithm practical for very large collections. We propose a solution for this problem using an external sort with the suffix array in-memory sorting and temporary file compression. The proposed algorithm for computing duplicate n-grams uses a fixed amount of memory for any input size.
ER  -
POMIKÁLEK, Jan and Pavel RYCHLÝ. Detecting Co-Derivative Documents in Large Text Collections. In \textit{Proceedings of the Sixth International Language Resources and Evaluation (LREC'08)}. Marrakech, Morocco: European Language Resources Association (ELRA), 2008, pp.~132--135, 3 pp. ISBN~2-9517408-4-0.