česky
Jan Kasprzak, Michal Brandejs, Jitka Brandejsová
Presentation for ITA 2009 conference (8. – 11. September 2009)
Abstract: With wide deployment of e-learning methods such as computer-mediated communication between the students and teachers, including papers and essays submission and evaluation, it has become much easier for students to base those works on electronic resources, including the plagiarization of the work of other people. In this paper we will briefly present a system for discovering similarities in a large base of documents, which has been in production use inside the Czech National Archive of Graduate Theses since January 2008. We will then focus on the distributed aspects of such a system, especially on the task of creating and maintaining the index for discovering the similarities on a cluster of commodity computers.
Keywords: Theses, Archive, Plagiarism, Similar documents, Distributed computing.
Since 2006, the universities in the Czech Republic are required by law to make the graduate theses of their students publicly available. Having the results of publicly funded education and research available for the general public means that the quality of the education can be closely scrutinized by any interested party.
Some universities have decided that the most transparent way of fulfilling the law is to make the texts of the theses available on the web. In late 2007, the effort of several universities and university libraries resulted in a project of the Czech National Archive of Graduate Theses [1], sponsored by the Czech Ministry of Education. The system has been developed and deployed in 2008 by the team from the Masaryk University in Brno. As of February 2009, there are about 20 universities participating in the theses.cz system.
Having a central archive of graduate theses definitely lowers the barrier for accessing these texts (which would otherwise require to physically visit or at least communicate with the library of some remote faculty or university). On the other hand, having the full texts of theses available for downloading and reading also means that it is easier to copy the work from other texts. Thus a necessary component of the theses archive is the system for discovering similarities in a given document base.
The theses.cz archive shares a big part of code base with the Masaryk University Information System (IS MU, [2], [3]). In the e-learning subsystem of IS MU, there are tools for submitting students' essays, evaluating them, and also searching and finding similarities in them. The system for finding similar documents in IS MU has been deployed since 2006. For the theses.cz archive, it has undergone a major rewrite, which allows faster response to newly added documents, as well as distributed processing of the documents.
The crucial part of both IS MU e-learning subsystem and the theses.cz archive is the document storage subsystem. It is an in-house developed networked data storage, running on a cluster of commodity computers with the Linux operating system, storing data on cheap commodity hard disks. The storage subsystem provides some features similar to the features of general-purpose file systems (such as tree-organized structure, file names, hard links, etc.).
Apart from that, it also has some features unique to it, like alternative versions of documents, most of which are created automatically (e.g. when user imports a Word or OpenOffice.org document, it is automatically converted also to PDF, and the plain text is extracted from it; for the imported PDF files, the plain text extraction can use the PDF file properties, or - should the PDF file be bitmap-based - the text can be extracted by the means of the OCR software). Having a reliable plain text version of all documents is a necessary prerequisite for being able to search these documents, and also being able to find similarities in the document base.
Amongst other special features we can mention replicating, periodically verified checksums, or rich system of access rights (such as "students of such and such course in this semester" or "student which had enrolled in such and such course in any past or present semester", etc.).
The document storage is used not only for the e-learning agendas in IS MU, but also for other purposes like the back-end for user e-mail boxes, or a bulletin board of the university documents. As of February 2009, the storage subsystem of the IS MU and theses.cz has a raw storage capacity of about 80 terabytes, and currently hosts about 13 millions of objects, amongst them about 1,200,000 documents which are indexed for the purpose of finding the document similarities (theses, seminar works, articles, etc.).
For finding the similarities in the document base, we use a chunk-based approach to the finding similarities problem: roughly speaking, the plain text form of the document is split into short, overlapping sequences of words (chunks), and we look up those sequences in other documents. The similarity of the document A to the document B is defined as a number of chunks from the document A, which are also present in the document B (note that the similarity is not symmetric - e.g. a shorter document A can be included as a whole in the larger document B - then A is 100 % similar to the document B, while the document B can be only a few percent similar to the document A).
The non-distributed version of the algorithm we use has been described in detail in our earlier work [4], including the performance analysis on a real-world set of documents. The algorithm works the following way:
We have verified that even though we use hash-based chunk IDs which are not always unique because of the hash function collisions, the results are not affected in a significant way (we have observed a variance within one percent of the exact value).
Using a fixed width chunk ID (we have tested 28, 30, and 32 bits) allows us to split the chunk ID space in the step 1f in preparation for a bucket sort.
Also note that in steps 1f and 2a-c, we essentially do a bucket sort, which computes an inverted index mapping a chunk ID to the list of document IDs from the original mapping of a document ID to the list of chunk IDs.
The resulting invented index is kept in two separate files/arrays:
For example, in Figure 1, the chunk with ID 0 is present in documents number 243 and 5039, the chunk with ID 1 is not present anywhere in the base of documents, and the chunk with ID 2n-1 is present in documents 649 and 2108.
The size of the array of document IDs is equal to the number of different (chunk ID, document ID) pairs in the whole base of documents (for our document base, it has about 8.9 GB with 32-bit document IDs).
The size of the array of offsets is (2n+1) * size of the pointer. For 32-bit pointers and n=30, we have 4 GB plus 4 bytes.
This data structure needs to be stored permanently for incremental runs of the algorithm, when only a few documents are added/modified/deleted. The list of document similarities as needed in steps 3a, 3b-ii, and 3d can be kept in memory. Any tree-based data structure can be sufficient, we use a Judy array library as described in [5].
One of the task the system has to fulfil is to have a fast response time to a newly imported documents. Users who import their documents should be able to search for similar documents to the just-imported documents soon after the import is done. It is not feasible to run the whole Algorithm 1 periodically. We have designed the data structures to make an incremental variant of Algorithm 1 possible. Here are the modifications required for incremental runs:
For the step 1, we need an in-memory data structure which can be easily searched for the presence of the key. Again, we use the 1-bit Judy array for this.
For the incremental version of this algorithm, we in addition to the inverted index (array of offsets, and array of document IDs) need to save the mapping of the document ID to the number of chunk IDs in this document, which is needed in step 5 of the incremental run.
In order to keep a document ID range low, we recycle the IDs of the previously deleted documents, instead of having forever increasing IDs taken from, for example, a database server sequence object.
The algorithm and the data structure described in the previous section were designed in order to be easily parallelized to a cluster of computers. In short, we can easily split the chunk ID space equally between the computing nodes.
The method of computing an inverted index on a cluster of computers has been described in the paper on the MapReduce system [7]. We use a similar approach, but in addition to the inverted index, we have to perform additional computations with the newly computed index (step 3 of the non-distributed algorithm).
In the above algorithm, many parts of it can be run in parallel, especially:
We have implemented the previously described algorithm in the IS MU and the theses.cz systems. We use a cluster of 45 dual-core systems (some Athlon 64 X2, some Core2 Duo) as a computational nodes, and a master server with the Oracle 10g database for the table of documents and for the database of document similarities. We run only a single computational process per host, because the servers have also other tasks such as serving the IS MU or theses.cz web application pages, and we do not want to increase their interactive latency times for those applications.
To avoid extremely large table of similarities, we keep the highest 100 similarities for each document only. With this limitation, we have about 30 millions of (document 1 ID, document 2 ID, similarity) rows in the table of similarities.
The computation of all the document similarities in a given document base takes about three hours, of which more than two are occupied by inserting the similarities to the Oracle database as described in the step 10.
This can be improved by creating the file with raw data, and then importing them directly to the database on the database server. However, we do not do this for practical purposes (one reason is that no process runs on the database server itself, and raw imports in Oracle cannot be done remotely, the other one is that the full recomputation is usually done only for testing purposes).
The incremental run is usually finished in 15 to 20 minutes. This time is largely dominated by steps 3 and 8, which we do no yet run on background, and we do not do the caching suggested in the section 3.3. The next longest part of the algorithm is again inserting the results to the Oracle database and deleting the old rows for the documents which are being deleted/modified/added.
For practical purposes the time taken by the incremental run is sufficient, because there are other significant delays inside the document handling system (such as converting DOC to PDF and plain text, running the bitmap-based documents through the OCR software, etc.).
The real-world implementation, which has been running for about a year inside the IS MU and theses.cz systems, has still some shortcomings, which we would like to address in future:
We have described a new generation of the system for finding similar documents in a large document base. This implementation has been in practical use for about a year, with previous generations being available since the August 2006. The system is almost fully distributed, can tolerate node failures, and can run on a commodity hardware.
The theses.cz system contains about 35,000 theses with full texts available, and together with seminar works, essays and other documents in IS MU compares the base of 1,200,000 documents for similarities. So far we are not aware of any other nation-wide theses archive with the built-in system for discovering similar documents. The biggest archive - Theses Canada [7] - claims to have about 50,000 theses available in electronic form, but has no system for discovering similarities.
We have proposed the algorithm which can be easily distributed, we have described its shortcomings and advantages. The most non-obvious part of the algorithm is using a hash-based chunk IDs, which can greatly reduce the size of data we need to handle. We have discovered that even though this approach is not exact, hash function collisions do not introduce significant inaccuracy to the system.
The authors of this article want to thank all the members of the IS MU and theses.cz development team system for their support. Special thanks to Lucie Pekárková for careful proofreading and importing the text to MS Word.
Kontakty:
fi
muni
cz