Source Retrieval for Plagiarism Detection

Šimon Suchomel and Michal Brandejs
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Email: {suchomel, brandejs}@fi.muni.cz
Manuscript received December 1, 2014; revised April 11, 2015.

Abstract—Plagiarism has become a serious problem, mainly because of the wide availability of electronic documents. Online document retrieval is a weighty part of a modern anti-plagiarism tool. This paper describes the architecture and concepts of a real-world document retrieval system which is part of general anti-plagiarism software. Up-to-date systems for plagiarism detection are discussed from the source retrieval perspective, and the key approaches to source retrieval are compared. The system recommendations stem from the design, implementation, and several years of operation of a nationwide plagiarism solution at Masaryk University in the Czech Republic. The design can be adapted to many situations. Proper usage of such systems contributes to the gradual improvement of the quality of student theses.

Index Terms—plagiarism detection, plagiarism, source document retrieval, candidate document retrieval, system design, system architecture

I. INTRODUCTION

Plagiarism is usually defined as passing off someone else's work as one's own. It is a moral offence which can appear in many forms. It is well known especially in the form of text plagiarism, for example in journalism, where it breaches ethics; in academia it is referred to as academic dishonesty. If such an offence is proven, it discredits the person who plagiarized and sometimes leads to resignations or expulsions. A later appearance of plagiarism in academia also discredits the institution where it originated, because it passed unnoticed and should have been detected and dealt with accordingly at the time of submission. Higher educational institutions are not usually fond of making such cases public; on the contrary, they try to conceal them and resolve the issue internally as much as possible. Generally, acquiring an academic degree by deception has a bad influence on contemporary society. In some countries it is also legislatively impossible to revoke an academic degree even if serious problems with the defended thesis are proven later on. Prevention and early detection are the best ways of solving plagiarism issues. These are some of the reasons why the issue of plagiarism is not only complicated, but also very sensitive. There have been many publicly known cases of plagiarism among high-ranking politicians, journalists, artists or professors in the last decade, which have proven that plagiarism cannot be taken lightly.

Not all plagiarism is actual cheating. Much of it arises from a lack of text-using skills. Students sometimes do not know how to cite correctly, how to work with other text sources, or which actions lead to breaching the honour code. Therefore, they should be taught such skills as early as possible. All plagiarism should ideally be detected and handled accordingly. If it is found to have been undertaken on purpose, i.e. as cheating, it should be dealt with without delay, according to the honour codes or the law. However, detecting plagiarism may be quite a difficult, tedious and time-consuming process, and so, when marking papers such as theses or seminar works, an automated computer system facilitating the task of checking for plagiarism may prove to be very helpful.
However, such systems never detect actual plagiarism; they cannot decide what is right and what is not, because the issue of plagiarism is too complex for a computer to decide. Automated systems can detect similarities among documents, mostly textual similarities, but in principle any kind of similarity which is somehow calculable. Today's computers cannot discover all forms of plagiarism in the real world, simply by its definition. For example, they would hardly detect that someone stole another person's idea. It is always up to a human specialist, a reviewer, a supervisor, or an authorized person to decide this.

The main goal of plagiarism detection systems is to improve the quality of textual works. The mere existence of such systems puts pressure on students to have their texts correct, knowing the texts have to pass some control. Formative feedback with assistance from an automated text reuse detection system also has a positive impact on students' final submissions [1]. Prevention of plagiarism is the best outcome which education about the problem, together with tools that help to reveal plagiarism, can achieve. If someone wants to cheat, they will do it anyway. Attackers try to deceive the system with various techniques (some of them are discussed later); once a technique is discovered, its detection is implemented and the process continues with other techniques all over again. Plagiarism detection cannot be perfect, and cheaters will always be one step ahead of the automated systems. However, a good trade-off of a modern plagiarism detection system is that the cheater must put more effort into deceiving the system than into writing an original text.

Standard systems for plagiarism detection, which operate on the basis of detailed document comparison, cannot detect similarities unless they possess both the source and the plagiarized document. An algorithm to evaluate document similarity must build inner indices for a detailed document comparison. A modern plagiarism detection process can be divided into two main tasks: the source retrieval and the detailed document comparison, which can be reduced to pairwise document comparison. The pairwise document comparison is very computationally demanding, especially for real-time plagiarism solutions which must evaluate millions of documents [2]. Fig. 1 shows the approach of modern plagiarism detection. For an input suspicious document, the output of the plagiarism revealing software is a set of annotated passages of that document which may have been plagiarized. This paper discusses the source retrieval part of the whole anti-plagiarism detection process in detail.

Figure 1. A global view of a modern anti-plagiarism software.

Source retrieval is a process which anti-plagiarism software performs for each suspicious document, before it computes pairwise document similarities, in order to find potential sources of plagiarism. The goal is to enlarge the document database of the anti-plagiarism system with relevant documents only. More relevant documents mean a better opportunity to discover a specific similarity. On the other hand, since the similarity computation is very demanding, it is not wise to maintain a uselessly vast document database, for instance by crawling the Web, unless one possesses really high computational capabilities like modern web search engines.
A. Source Retrieval

Having a suspicious document $d_{susp}$ and a very large document collection $D$ of potential source documents, the source retrieval task is to select a small subset $D_{ret} \subseteq D$ of documents $d_{ret} \in D_{ret}$ which probably served as a source of plagiarism; for example, documents for which $sim(d_{susp}, d_{ret}) > 0$ with sufficient probability, where $sim$ can represent any measurable inter-document similarity. In a realistic scenario, $D$ would contain all documents on the Web, and the access method would be the standard interface of a modern search engine, where we do not have direct access to its internal index. The retrieved candidate documents are subsequently indexed for the purpose of the detailed document comparison.

II. STATE OF THE ART

There are several types of plagiarism, such as submitting another's work, which may also be bought; copying a text without citing the source; paraphrasing other texts; copying document structure; reusing one's own texts; or translations. There are many tools that deal with uncovering plagiarism in one way or another, and different approaches are applied for detecting different types of plagiarism. Nonetheless, in order to be successful, a system must be aware of the original document; therefore the general usage of tools that compare documents only with a local corpus is limited, and a sophisticated extension of the document base is an essential part of a successful tool for detecting plagiarism. The most straightforward document source is the Web. In this section we mention selected tools for plagiarism detection that somehow extend their document bases from the Web.

A. CheckForPlagiarism.net

It is the name and the address of a commercial web-based service for scanning documents for reused text and showing potential plagiarism. The system assesses sentence structure in the suspicious document and makes a "digital snapshot" of each paper. The structure of paragraphs and sentences is cross-referenced to a database of collected publications and it is "simultaneously sent to crawlers who scour the World Wide Web for possible matches" (http://www.checkforplagiarism.net/service-features/sentence-structure). However, it is unclear what structure it evaluates and how the online sources are looked up and retrieved. The developed technology is not published in more detail.

B. DOC Cop

It is a commercial web application (https://www.doccop.com) for detecting plagiarism. It also offers a web check of form-submitted suspicious text. It uses a simple exhaustive online search. A string length $n \in [6, 12]$ is selected by the user prior to the text check. The submitted text is divided into strings of the selected length $n$, created sequentially from the input text. Each string is shifted by one word from the previous one, thus two neighbouring strings share $n - 1$ overlapping words. The first string is created from the first $n$ words of the submitted text. These strings are passed to the Bing search engine and the results are compared to the suspicious text. One submission to DOC Cop is limited to 1100 words of input text.
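To illustrate this exhaustive querying scheme, the following minimal Python sketch generates the overlapping n-word query strings described above. It is an illustration of the principle only, not DOC Cop's actual implementation; the sample text and the default value of n are arbitrary.

def sliding_queries(text: str, n: int = 8) -> list:
    """Return n-word query strings, each shifted by one word from the previous one."""
    if not 6 <= n <= 12:
        raise ValueError("the string length n is expected to lie in [6, 12]")
    words = text.split()
    # consecutive n-word windows; two neighbouring strings share n - 1 words
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

if __name__ == "__main__":
    sample = ("Plagiarism detection systems can only detect similarities among "
              "documents and never decide about plagiarism itself")
    for query in sliding_queries(sample, n=6)[:3]:
        print(query)   # each string would be submitted to the search engine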
C. Masaryk University's Anti-Plagiarism Solution

The Information System of Masaryk University (IS MU, http://is.muni.cz) provides study administration and supports university e-learning. It also provides plagiarism detection among its documents. It is mainly designed for checking university theses prior to their defence. The document database and the plagiarism detection are also interconnected with papers of other schools [3], to which the IS MU is being outsourced; with the Czech National Archive of Graduate Theses (Theses.cz, http://theses.cz); with the project for seminar works and papers (Odevzdej.cz, http://odevzdej.cz); and with the project for storing academic publications (Repozitar.cz) [4]. All documents respect a rich variety of access permissions. The candidate documents are retrieved from online sources for each document entering the database for plagiarism detection, according to the principles discussed in this paper.

D. PlagScan

It is commercial software (http://www.plagscan.com) for plagiarism detection. It can be accessed via a web browser and is also offered for on-site installation on a dedicated server in one's own data processing centre. PlagScan claims to search for thematically related documents. It offers to include its own document base and also other local databases in the similarity calculations, with respect to document permissions. For the detailed document comparison it uses similarities based on chunks of three consecutive words of the texts. It utilizes Yahoo search as a means of obtaining relevant online sources. How the queries are constructed is not explained; the results are, however, displayed in a sentence-based manner.

E. Turnitin; iThenticate; Ephorus

Turnitin (http://turnitin.com) and iThenticate are commercial tools developed for originality checking and plagiarism detection. Turnitin is designed for teachers and educational institutions; it helps them to organize and control the process and the quality of student papers. iThenticate is designed for individuals such as academic workers and writers; it allows them to check a submitted document for unintentional plagiarism and to verify its originality. The search methodology behind the systems is common to both. Concerning the growth of the document database for text comparison from online sources, they adopted the Internet crawling methodology of contemporary search engines (Turnitin's web crawler is called TurnitinBot, previously known as SlySearch; see https://turnitin.com/robot/crawlerinfo.html), meaning that the company has sufficient resources to maintain crawlers for downloading and indexing the Web for detailed comparison. They index as much content as possible; no selection of relevant sources is made at the stage of source retrieval. The source retrieval task is shifted to the whole index of retrieved documents, where standard methods of information retrieval can be utilized.

Figure 2. The generic steps of text reuse from the Web [5].

Ephorus is another major commercial software for preventing plagiarism. It also administers a large document database which stems from the involved institutions. Online sources are also retrieved broadly via automated crawling. In autumn 2014, Ephorus was acquired by Turnitin. Ephorus provides integrations with many various e-learning systems, and others can be integrated via an API.

F. Viper

It is an anti-plagiarism scanner (http://gateway.scanmyessay.com) and a free Windows-based desktop application. It is intended for individual use and it offers a real-time Internet scan for plagiarism detection. To fulfil this task, it divides the input text into non-overlapping chunks of about twenty consecutive words.
It does not bother to remove any special characters or punctuation. Such text chunks are URL-encoded and passed to the Yahoo search engine. Similarities between the search results and the input text are, after evaluation, displayed in the application. Due to its inner encoding methods it does not support texts written in languages other than English.

III. CONCEPTUAL SYSTEM CHARACTERISTICS

The outer behaviour of the source retrieval system should resemble as much as possible the behaviour of a student who searches for documents on the Web and reuses text from them. Martin Potthast depicts the standard process of text reuse from the Web as shown in Fig. 2.

Considering a standard plagiarism detection tool, we suppose that suspicious documents are single-themed. That is the most common situation; such documents are, for example, theses or seminar papers, and the majority of documents which are expected to be checked for plagiarism are single-themed. This assumption makes it possible to extract keywords from the whole document without significantly lowering the performance of automated keyword extraction. Keyword extraction is one of the fundamental features of the source retrieval system (see the following section for more details). Under this assumption, an example of unsuitable use of an anti-plagiarism tool would be the checking of one diverse-themed document, such as a newspaper uploaded in a single file. Such a document should be divided into separate documents according to the articles which the newspaper contains, which again results in single-themed documents.

From a single document point of view, the system preprocesses the input document and creates appropriate queries which are submitted to a search engine interface. The search results must also be processed accordingly. The system should follow several considerations: i) maximize precision and recall; ii) minimize the overall cost; iii) be scalable and robust.

A. Retrieval Performance

The demand to maximize the recall and precision of retrieved documents is obvious. However, it is usually balanced against an acceptable computational load of the system. It is also very difficult to measure the precision and recall of a real-world web document retrieval system. Let $D_{src}$ denote the set of documents that served as a source of plagiarism for document $d_{susp}$, and let $D_{ret}$ denote the set of retrieved documents. Then the precision and recall can be defined as $prec = |D_{src} \cap D_{ret}| / |D_{ret}|$ and $rec = |D_{src} \cap D_{ret}| / |D_{src}|$, respectively. However, this standard information retrieval calculation is far from being applicable, namely because of the existence of so-called near-duplicate documents on the Web [6]. The source document retrieval system can select a near-duplicate document $d_{ret}$ which certainly is a true positive detection, yet it does not have to be the same source document $d_{src}$ from which the text was plagiarized. In order to measure near duplicity, some characteristics must be defined. For an anti-plagiarism system, a positive value of similarity, $sim(d_{susp}, d_{ret}) > 0$, can be sufficient to consider the retrieved document $d_{ret}$ a true positive. The similarity can be any kind of likeness between two documents which is computed by the detailed document comparison subsystem of the anti-plagiarism software.
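As a concrete illustration, the following sketch computes one possible sim: the Jaccard similarity over word 5-gram shingles. The shingle length, and the use of Jaccard similarity itself, are illustrative assumptions of this sketch; the production comparison subsystem may compute a different kind of likeness.

def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles of a document, used as the comparison units."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}

def sim_jaccard(d_susp: str, d_ret: str, n: int = 5) -> float:
    """Jaccard similarity of the two shingle sets; 0.0 means no shared passage."""
    a, b = shingles(d_susp, n), shingles(d_ret, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# A retrieved document d_ret would be treated as a true positive candidate
# whenever sim_jaccard(d_susp, d_ret) > 0, i.e. at least one shingle is shared.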
The organizers of PAN (http://pan.webis.de/), a competition on plagiarism detection, determine whether a document $d_{ret}$ is a near-duplicate of any document from the set of source documents $D_{src}$ by three characteristics: i) whether they are actually equal, $d_{ret} = d_{src}$; ii) whether they are similar according to an empirically set threshold on the Jaccard similarity $sim_{jac}(d_{ret}, d_{src})$; iii) or whether the passages in a suspicious document $d_{plag}$ that are known to be reused from $d_{src}$ are also contained in $d_{ret}$ [7]. We can now observe that one document can be a near-duplicate of more than one source document, and one source document can have more than one near-duplicate. Next, they denote by $D_{dup}$ the set of all near-duplicates of all source documents $D_{src}$ of $d_{plag}$, and by $D'_{src} \subseteq D_{src}$ the subset of source documents having at least one positive detection in $D_{ret}$. Then the precision and recall of the set $D_{ret}$ with respect to a suspicious document $d_{plag}$ are defined as follows:

$precision = \frac{|D_{dup} \cap D_{ret}|}{|D_{ret}|}, \qquad recall = \frac{|D'_{src}|}{|D_{src}|}$   (1)

This results in the fact that retrieving more than one near-duplicate of a single source document does not increase recall and does not decrease precision either, whereas retrieving the first near-duplicate of a single source document increases both recall and precision.

It is worth mentioning that in order to evaluate all near-duplicates we need to build an index of the whole corpus of all potential source documents which can be searched via a given search engine. Therefore, such evaluation is infeasible in a real-world situation where the corpus of source documents is the Web.

In a real-world scenario, recall is much more important than precision. If the precision is low, it can affect the time performance of the retrieval algorithm, since the system would process a lot of documents needlessly. It can also excessively extend the index for detailed document comparison, which is not a problem as long as the detailed document comparison remains feasible according to user expectations. On the contrary, if the recall is low, the anti-plagiarism system may simply not be able to detect a plagiarized passage, since it may not have the source document retrieved and indexed in its database.

In addition to documents that contain passages similar to the suspicious document, we consider as a true positive retrieved result a document following the same theme as the source document. Themed documents are considered relevant. A theme can be detected by overlapping sets of equal keywords or keyphrases [8]. Existing themes are therefore defined by the characteristics of the suspicious documents within the database of the anti-plagiarism system.
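The following sketch computes precision and recall following the definitions in (1). For simplicity it assumes the near-duplicate relation is given as a mapping from each source document to its known near-duplicates; in the PAN setting this relation is determined by the three criteria above, which is omitted here.

def source_retrieval_scores(retrieved: set, duplicates_of: dict) -> tuple:
    """Precision and recall of D_ret as in (1).

    retrieved     -- D_ret, identifiers of the retrieved documents
    duplicates_of -- maps each d_src in D_src to the set of its near-duplicates
                     (a source counts as a near-duplicate of itself, criterion i)
    """
    d_dup = set().union(*duplicates_of.values()) if duplicates_of else set()
    # D'_src: source documents with at least one positive detection in D_ret
    d_src_detected = {s for s, dups in duplicates_of.items() if dups & retrieved}

    precision = len(d_dup & retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(d_src_detected) / len(duplicates_of) if duplicates_of else 0.0
    return precision, recall

# Retrieving a second near-duplicate of an already detected source leaves the
# recall unchanged and cannot decrease the precision, as observed above.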
B. Retrieval Cost

In the standard way, the cost of the system consists of the time and space requirements of all algorithms and data needed. Apart from that, the most costly component is the number of executed search queries, and secondly the number of Internet document downloads. In any information retrieval system there is always a correlation between the retrieval performance and the cost. Consider a system using an exhaustive search approach: for example, querying every sentence from a suspicious document would result in high recall, but it is simultaneously too costly to be applicable anywhere else than in an experimental environment. On the contrary, in real-world systems the number of search queries should be narrowed as much as possible, which in certain situations can result in executing only a single query per suspicious document.

It is crucial to reduce the number of queries, since a real anti-plagiarism system must utilize modern search engines like Bing, Google or Yahoo. Furthermore, each search engine has strict rules about the number of queries which can be submitted from one IP address, which prevents the use of exhaustive search. The query execution is usually not particularly time consuming, yet the time consumption is not negligible. Automated querying is also often subject to additional fees. In the document retrieval system design, the querying represents the most expensive part.

The second significant part of the system cost is the number of document downloads. The download alone is, in today's systems, a cheap operation, but it can be very time and space consuming when a huge number of downloads is considered. The post-processing of the downloaded documents is also a very time consuming operation. The number of downloads must be tuned according to the system's computational possibilities and expectations.

C. System Scalability and Robustness

The purpose of the system determines its scalability. Modern anti-plagiarism systems maintain databases of millions of documents and are able to process new documents within hours or even minutes. The complete processing of a new document means that all results of candidate document retrieval, together with the suspicious document, must already be indexed for detailed document comparison. Afterwards, the evaluation of similarities of that document is usually real-time (within seconds). The design of the source retrieval system, which is recommended below, can scale easily by adding more computational nodes. The need for robustness stems mainly from the huge diversity of Internet documents; it is discussed together with the detailed design of the system in the following sections.

IV. SYSTEM ARCHITECTURE

The source retrieval should run as several independent tasks in order to be highly parallelizable and scalable. The tasks can share data via a transactional relational database. The database represents a central point for process control. If it is accessible over a network, the computational power can be increased by adding more computer nodes. The database should be utilized to keep detailed information about the progress of document processing. The tasks can be divided according to the following functions into four main groups: 1) parsing an input document; 2) searching the Internet; 3) downloading the results; 4) post-processing the results.
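A minimal sketch of such database-driven process control follows. The table layout, the stage names and the use of the sqlite3 module are illustrative assumptions only; a real deployment would use a networked transactional database shared by all computational nodes and an atomic claiming statement (e.g. SELECT ... FOR UPDATE).

import sqlite3

STAGES = ("parse", "search", "download", "postprocess", "done")

def init_db(conn: sqlite3.Connection) -> None:
    # one row per suspicious document, recording its processing progress
    conn.execute("""CREATE TABLE IF NOT EXISTS doc_progress (
                        doc_id TEXT PRIMARY KEY,
                        stage  TEXT NOT NULL DEFAULT 'parse',
                        worker TEXT)""")
    conn.commit()

def claim_document(conn: sqlite3.Connection, stage: str, worker: str):
    """Pick one unclaimed document waiting in the given stage (single writer assumed here)."""
    row = conn.execute("SELECT doc_id FROM doc_progress "
                       "WHERE stage = ? AND worker IS NULL LIMIT 1", (stage,)).fetchone()
    if row is None:
        return None
    with conn:  # transaction
        conn.execute("UPDATE doc_progress SET worker = ? WHERE doc_id = ?",
                     (worker, row[0]))
    return row[0]

def finish_stage(conn: sqlite3.Connection, doc_id: str, stage: str) -> None:
    """Move the document to the next stage and release it for the next task group."""
    next_stage = STAGES[STAGES.index(stage) + 1]
    with conn:
        conn.execute("UPDATE doc_progress SET stage = ?, worker = NULL WHERE doc_id = ?",
                     (next_stage, doc_id))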
A. Parsing an Input Document

Let us assume that the input of this stage is a textual representation of a suspicious document $d_{plag}$. Since the anti-plagiarism system needs to build data structures for the detailed document comparison, the plaintext format is needed anyway; therefore the conversion of the input document into plaintext must generally also be considered. The output of this stage is a set of queries prepared for execution. Textual processing and keyword extraction algorithms may become quite time consuming, and standard algorithm optimization should be considered when needed. This stage, however, does not represent the most time consuming part of the overall source retrieval process. Each suspicious document is processed independently, thus the system may scale by processing suspicious documents simultaneously.

The matters to consider at this stage include: i) document cleaning and preprocessing; ii) language detection; iii) chunking; iv) keyword extraction; v) query formulation; vi) permanent storage of the extracted queries and the input document information in the database. Cleaning of the document may comprise the removal of special characters, breaking the original document structure, detection of existing citations, or extraction of in-text URLs for a direct download.

1) Language detection

A modern anti-plagiarism system should also be able to detect similarities among and across multiple languages. This must be borne in mind during the system implementation; many off-the-shelf tools for lingual processing or keyphrase extraction could not be utilized. Current effective automated language identification methods are based on frequency analysis, such as utilizing the principle of language-characteristic sequences of n-grams. For the usage of such methods, one needs to construct a referential vector language model for each supported language. Another beneficial and less computationally demanding method is language detection by stop-word matching: only lists of language-specific stop words are kept, and the language with the highest number of matches is selected. This method works reliably for longer textual parts. Next, a method based on word relevancy can be utilized for shorter texts; it is also applicable to web page language identification [9]. When multiple languages are supported, the automated language identification must also be applied to every retrieved web document. Note that many theses contain small parts of text written in other languages, such as the abstract or summary. The detection method should detect the major language of the text or identify those language-different parts.

2) Chunking and keywords extraction

The purpose of chunking is to distribute the focus of text processing algorithms evenly across the document and thus lower the possibility that unexpected characteristics of the text influence the efficiency of those algorithms. Chunking is also applied in order to detect textual differences in cases where one cannot pre-set the exact boundary in a document at which the textual characteristics change. For this purpose, the principle of a sliding window is usually used, where two primary parameters must be determined: the first stands for the size of the window and the second represents the size of the overlap between two neighbouring windows during sliding. The size of the overlapping part also influences the detected characteristic differences between two chunks. If the overlap is too big, the difference would probably pass unnoticed. On the other hand, using a small overlap sharpens the algorithm's detection edges, but poses a risk of placing the window's centre on the textual characteristic boundary, accidentally resulting in no detection. It may also be worth processing more than one pass of the algorithm with different sliding window settings, a type of cross-validation. Other chunking approaches worth considering are no chunking, paragraph-based chunking [10], chapter-based chunking, or sentence chunking. Some approaches also use chunks of a pre-set size [11], which is applicable in all situations but correlates little with the structure of the document.
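The sliding-window chunking can be sketched as follows; the window of 100 words and the overlap of 50 words are arbitrary example settings, not values recommended by the cited approaches.

def sliding_window_chunks(text: str, window: int = 100, overlap: int = 50) -> list:
    """Split a document into word windows; neighbouring windows share `overlap` words."""
    if not 0 <= overlap < window:
        raise ValueError("the overlap must be non-negative and smaller than the window")
    words = text.split()
    step = window - overlap
    return [" ".join(words[i:i + window])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Several passes with different settings can be combined, as suggested above,
# e.g. sliding_window_chunks(doc, 200, 100) together with sliding_window_chunks(doc, 50, 25).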
Keyword or keyphrase extraction is the most straightforward process for subsequent query formulation, and high-quality keyword extraction is crucial for proper query formulation. Modern keyword extraction methods are based on word repetition allied to a statistical estimate of likelihood. The most widely used method in the source retrieval subtask of the PAN competitions on plagiarism detection (since 2012) has been keyword scoring by tf-idf (term frequency-inverse document frequency) [12], [7], [13]. Keyphrases can also be extracted from the selected chunks. However, in the real-world scenario it appears to be more beneficial to extract global keywords from the whole document. Such keywords are fully related to the document theme and should suitably describe the individual document. More descriptive keywords can also be obtained from a longer textual part than from a shorter one.

3) Query formulation

The query formulation is the most important part of the source retrieval system, since it has the highest impact on the overall performance and costs. In order to control the cost, a maximum number of queries submitted per document should be set. The total number of executed queries directly influences not only the cost, but also the time demands of processing the input document and the number of Internet documents to be processed.

Suchomel et al. [14] propose a methodology based on the combination of three different types of queries. The first type of query is constructed from keywords or keyphrases extracted from the whole document; they suggest using keyword-based queries five words long. The query length is important, since it directly influences the number and the relevance of retrieved results. If the query is too long, it may be too specific, which will probably lead to no retrieved results. On the other hand, if the query is too short, it will be too general, resulting in the retrieval of many irrelevant documents. The purpose of the keyword-based queries is to retrieve theme-bounded documents. The other types of the proposed queries are extracted from different chunks of the suspicious document. This deepens the search for those specific parts and aims at the retrieval of more text-related documents. They also suggest detecting suspicious passages of the document by evaluating textual characteristics with intrinsic plagiarism detection methods and deepening the search in those passages. Queries constructed from small text parts of the source document can be characterized as phrasal queries and are usually longer (up to 10 words).

The proposed methodology is applicable in a real-world document retrieval system, and it also performed best in terms of the total system workload, while maintaining good retrieval performance, in the PAN 2012 competition on plagiarism detection [12]. In the following runs, this methodology was improved with enhanced download control, and the third type of queries was changed from header-based to paragraph-based queries; however, the main idea remains the same [15], [10]. This methodology is also expected to perform better in real-world scenarios utilizing modern search engines than in the PAN competition environment, because the search engine used during the competition did not support phrasal search, which influenced a significant part of the queries of the proposed methodology. The methodology also scales down to a single query per document.
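A minimal sketch of the keyword-based part of such a methodology is given below: keywords are scored by tf-idf against a background collection and grouped into five-word queries. The tokenization, the background collection and the query limit are illustrative assumptions; this is not the exact scoring used in [14].

import math
import re
from collections import Counter

def tokenize(text: str) -> list:
    return re.findall(r"[^\W\d_]+", text.lower())   # letters only, any alphabet

def tfidf_keywords(doc: str, background: list, top_k: int = 20) -> list:
    """Keywords of `doc` scored by tf-idf against a background document collection."""
    tf = Counter(tokenize(doc))
    df = Counter()
    for other in background:
        df.update(set(tokenize(other)))
    n_docs = len(background) + 1
    scores = {word: tf[word] * math.log(n_docs / (1 + df[word])) for word in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def keyword_queries(keywords: list, words_per_query: int = 5, max_queries: int = 3) -> list:
    """Group the highest-scoring keywords into queries of five words each."""
    queries = [" ".join(keywords[i:i + words_per_query])
               for i in range(0, len(keywords), words_per_query)]
    return queries[:max_queries]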
The first query is constructed from the keywords which obtained the highest score. In the next step, further keyword-based queries are formulated from the consecutively extracted keywords sorted by their score, down to a score threshold or up to a pre-set maximum number of queries of this type. After that, the search can be deepened by the prepared phrasal queries.

A multilingual search can be accomplished by query translation, which is generally easier than full sentence translation. It is sufficient to translate all the query terms consecutively, especially if the query is constructed from keywords only. The translation can be done by dictionary associations; still, quality disambiguation may pose a problem for successful translation. It is therefore better to use words in their canonical forms in keyword-based queries, since the search engine will not distinguish between different forms of one word. The situation is different for phrasal queries: they should remain unchanged, since modern search engines attempt to appraise the meaning of the search, for example the newer Google search algorithm called Hummingbird.

B. Searching the Internet

During this task, the prepared queries are submitted to the search engine. A search control should be implemented in order to minimize the total number of submitted queries. As a consequence of the limited query budget, the queries should be processed stepwise, and the search results should be evaluated after each query in terms of basic feedback for the search controller. During the search, the search controller can reschedule the submission of prepared queries, which may include query skipping or reformulation. The basic search control is the submission of queries according to their priority, up to a specific number of submissions. Haggag and El-Beltagy [16] check subsequent queries, through simple token matching, against all documents previously downloaded based on the analysis of one suspicious document; they skip queries which show a 60% or higher token match. Suchomel et al. [15] submit the document-global keyword-based queries first. Next, according to the similarities obtained between the suspicious and the retrieved documents, they skip queries whose positions cover portions of the suspicious document which have already been mapped to a source.

Issues to consider during this task include: i) the search control; ii) the feedback from retrieved documents; iii) the storage of query records and retrieved results, with result filtering. The storage of the query record prevents executing the same query again if it is prepared for a different input document. The date and time of each query execution can also be decisive for an eventual query resubmission.
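One possible shape of such a search control is sketched below: queries are submitted in priority order up to a query budget, and a query is skipped when most of its tokens already occur in a document downloaded for the same suspicious document. The 60% threshold follows [16]; the submit and fetch callables stand for the search engine interface and the downloader and are placeholders of this sketch.

def should_skip(query: str, downloaded_texts: list, threshold: float = 0.6) -> bool:
    """Skip the query if some downloaded document already covers most of its tokens."""
    q_tokens = set(query.lower().split())
    if not q_tokens:
        return True
    for text in downloaded_texts:
        doc_tokens = set(text.lower().split())
        if len(q_tokens & doc_tokens) / len(q_tokens) >= threshold:
            return True
    return False

def run_search(queries: list, submit, fetch, budget: int = 10) -> list:
    """Submit prepared queries in priority order until the query budget is spent."""
    downloaded = []
    submitted = 0
    for query in queries:
        if submitted >= budget:
            break
        if should_skip(query, downloaded):
            continue                        # feedback-based query skipping
        submitted += 1
        for url in submit(query):           # submit(): wrapper around the search engine
            downloaded.append(fetch(url))   # fetch(): returns the document's text
    return downloaded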
This leads to having to leave the decision making about the quality of the search results to the post-processing phase. The Web is a very wild and volatile environment, thus more emphasis must be put on the robustness of the downloading subsystem. Addition to standard timeouts, other techniques should also be considered. For example, more attempts to download a document should be carried out if the previous one was unsuccessful. A maximum file size limit should be set for html files, otherwise the downloader can get stuck in endless web files. On the other hand, certain types of documents (like pdf files) have to be downloaded completely in order to extract the text from them properly. The request for header can help to set the maximum file size of such types of documents. Unfortunately, Internet files headers do not always provide veracious information. Various metadata of downloaded documents need to be stored permanently. The date and time of the last download and the original internet document are needed also to be able to ascertain plagiarism. Assuming database driven data exchange, the time demands of the hypothetical simple downloader consist of database operations; establishing connections to the target web server; downloading the actual data; and saving the data. Our tests show that beneficial speedup of download is favourable by the process parallelization only. The connection establishment can be sped up, for example, by the DNS record caching. Not any crawlerlike optimization can be performed, since the search results generally contain various web sites. All downloads can be sorted according to the hosting site in order to use cached DNS records. TABLE I. PERFORMANCE OF VARIOUS WEB DOCUMENT DOWNLOAD TECHNIQUES. Δt [min.] domain ordering no domain ordering one process 1:49 1:44 db dedicated thread 1:56 1:48 4 download dedicated threads 0:31 0:34 Table I presents averaged times of 2 passes of downloads of 137 different Internet documents obtained from searches based on different queries. 91 of those 137 documents were downloadable at the time of the tests. Others ended with various HTTP errors among which the HTTP 404 (Not Found) was the most abundant. The tests ran in homogenous network and hardware conditions. The domain ordering column shows times, when the downloader tried to optimize Internet requests by accessing the same sites consecutively. The times shown in the second column were obtained while accessing Internet documents in the order as they were added to the database for download by the search algorithm. The second data line of the table shows times of the changed downloader differing from the first line of the table in a threading approach. A dedicated thread was used for downloading the documents, and database operations together with the other logic remained in the main thread of the programme. The third line represents multithreaded download process—4 threads were used for the documents download and the main thread remained unchanged. The results show that the database operations are negligible when compared to time demands for downloads. Also, the site ordering will probably not pay off, since it burdens the algorithm with the additional sorting. Here the downloads are certainly the most time consuming operations, but they are also easily parallelizable. The third row of the table shows that n additional threads will almost linearly n times decrease the total time for download. D. 
D. The Results Post-Processing

In the post-processing phase, the task is to evaluate the quality of each downloaded document and, if the quality is sufficient, to pass the document to the indexer. Similarities can be calculated only among the indexed documents, which represents the subsequent stage of the plagiarism detection process. A plaintext needs to be extracted from every downloaded document in order to evaluate similarities.

Several pieces of information must be obtained before the actual text conversion. The first is file type identification: the file type given by an HTTP response header cannot be trusted, therefore the file type must be identified by MIME detection tools; sometimes the file type must be identified according to the file extension, if any, or according to other heuristics. The encoding of the file needs to be determined for correct tokenization and indexing. The language of the document should also be known, assuming appropriate lingual classification is in place.

A modern source retrieval system should be able to convert the most common web file formats, which include html and other markup-language document formats, the Microsoft Office family of file formats, the Open Office file formats, and probably the most common, pdf files. From the nature of the MS Office and Open Office formats it follows that plaintext conversion is possible, because those files carry the text source information, hidden in the proprietary structure of the files. Standard tools for creating those formats can extract the text; however, the extraction must be fully automated. Not all documents are generally convertible, since those formats allow the text to be locked or password protected. There are also publicly available programme modules and extraction tools for Open Office (http://freecode.com/projects/oo2txt) and MS Office (http://www.adelton.com/perl/Docserver/) documents.

The plaintext conversion of a pdf file is more complicated, since pdf is not an easily editable file format. There are many tools for creating various versions of pdf files, thus the issue of pdf text conversion is far from being a smooth and errorless process. Firstly, the standard methods of text extraction from the pdf text layer, together with a check of the text correctness, should be performed. If the text is not well formed or if the extraction fails, other conversion methods should be applied. A further possibility of text extraction is to pass the document to OCR (optical character recognition, http://en.wikipedia.org/wiki/Optical_character_recognition). The check for text correctness is important even if the extraction from the pdf layer was errorless: typically, non-ASCII characters can be damaged, and the profile of the text must align with one of the supported languages. If a plagiarist obfuscates plagiarism by breaking the textual layer while keeping the document displaying correctly, the use of OCR is also inevitable. For example, a student creates a plagiarized text and replaces every space in the text with some letter in white colour, which will not be seen by a human reader; for the text extractor, the text will have the character of a single huge meaningless word. Another known cheater's approach to confusing computers is replacing some types of characters with characters from a different alphabet using certain fonts, so that they look very similar and pass unnoticed by the reader; for example, the Roman character o can be replaced by the Greek omicron if one uses a font in which they look the same. The use of OCR will recreate the text correctly, since it looks at the document in the same way as the reader does.
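A cheap pre-check for this kind of mixed-alphabet obfuscation can be sketched as follows. The heuristic and its threshold are assumptions of this sketch, not a method of the described system; documents that trigger it would be routed to OCR instead of trusting the extracted text layer.

import unicodedata

def word_scripts(word: str) -> set:
    """Scripts (LATIN, GREEK, CYRILLIC, ...) of the letters in a word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # e.g. 'GREEK' from the character name 'GREEK SMALL LETTER OMICRON'
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts

def looks_obfuscated(text: str, max_mixed_ratio: float = 0.01) -> bool:
    """Flag a text in which suspiciously many words mix letters from several scripts."""
    words = text.split()
    if not words:
        return False
    mixed = sum(1 for word in words if len(word_scripts(word)) > 1)
    return mixed / len(words) > max_mixed_ratio

print(looks_obfuscated("An οbviοusly brοken wοrd"))   # True: the 'ο' is a Greek omicron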
The issue of text extraction from html family files is even more complicated. The majority of web pages include, together with the main content, so-called boilerplate content. Boilerplate content is, for example, navigation links, advertisements, headers or footers; it is meaningless content for document comparison. Indexing all text from web pages unchanged would result in many false positives in the evaluation of document similarities, because the Internet documents would be spoiled with repeated parts of text which do not carry the needed information. An example of this would be pages from Wikipedia, which all contain the same footer. Therefore, the text extraction from html files should be accompanied by boilerplate removal, for example by means of context-based approaches to identify and remove the boilerplate [17].

For example, Masaryk University runs proprietary servers for plaintext conversion inside the network document storage of the university's Information System. It includes dedicated client-server, network-hosted applications for converting MS Office, Open Office and pdf documents, including OCR, to plaintext. The plaintext conversion is generally very computationally demanding, and it also takes a lot of tools and technologies to convert the many document types. Of all of them, pdf files are possibly the most computationally demanding to convert, and, except for html files, pdf represents the most preferred file format for publishing on the Internet.

Having extracted a plaintext from the downloaded documents allows for subsequent document evaluation. A decision whether to actually index the document for the plagiarism detection must be made. A straightforward evaluation is to compare the retrieved document $d_{ret}$ with the suspicious document $d_{plag}$ for document similarity. However, considering real-world plagiarism detection with many input documents, the source retrieval system can retrieve a theme-bounded document based on a query created from a certain suspicious document, while the retrieved document could serve as the source of plagiarism for another document which is also in the anti-plagiarism system database. Then the retrieved document is valuable, but evaluating it only against the document from which the query was constructed can result in no similarities. In such situations, all retrieved good-quality texts should also be indexed for all document similarities. The text extraction can be parallelized on the document level, like downloading, but especially the optical recognition can still be very time consuming; it may currently take tens of minutes to complete, depending on the hardware used.

V. CONCLUSION

This paper discussed the main points of the source retrieval system architecture. The candidate document retrieval is an indispensable part of a modern anti-plagiarism detection system. A quality document base for detecting document similarities is of critical importance for the anti-plagiarism system. Such a system should retrieve potential sources of plagiarism from the Web for each document entering the plagiarism detection, and its main purpose is to retrieve a relatively small subset of similar documents which may have been plagiarized from.
Firstly, such a methodology leads to retrieving textually similar documents, and secondly, which is particularly beneficial when done for academic papers, it retrieves thematically related documents. Consequently, the anti-plagiarism system evaluates document similarities among all documents which the system operates with, together with the newly retrieved documents. A user is usually provided with the percentage of document similarity between pairs of similar documents; the overall percentage can also be provided. However, the system does not decide about plagiarism, it only selects similar passages. The issue of plagiarism is far more complicated, and it is always up to the user to judge whether the given text is plagiarized or not. The system simplifies the tedious work of finding the sources of similar texts. For example, a page in a thesis can be copied from another text source, which would not be considered cheating if cited correctly. There are also many types of plagiarism, for example paraphrasing, copying the structure of the document, copying the results, or copying the texts. The overall quality of a plagiarism detection system can be evaluated by measuring which kinds of reused-text obfuscation it can detect.

This paper summarized experience from the real-world operation of the anti-plagiarism system used at Masaryk University as a part of a nationwide plagiarism solution in the Czech Republic. It provided ideas, concrete methods, and concepts for the construction of a candidate document retrieval system. It also discussed concepts used in the PAN competition on plagiarism detection; the research behind the competition provides additional topic-related information.

ACKNOWLEDGMENT

The authors would like to thank the Information System of Masaryk University for creating an opportunity to improve the plagiarism situation in Europe.

REFERENCES

[1] M. Davis and J. Carroll, "Formative feedback within plagiarism education: Is there a role for text-matching software?" International Journal for Educational Integrity, vol. 5, pp. 58-70, December 2009.
[2] J. Kasprzak, M. Brandejs, M. Křipač, and P. Šmerk, "Distributed system for discovering similar documents," Brno: Faculty of Informatics, Masaryk University, p. 4, 2008.
[3] L. Lunter, D. Jakubík, Š. Suchomel, and M. Brandejs, "Interuniversity cooperation on plagiarism detection systems in Czech Republic," in J. Rybička (ed.), Plagiarism across Europe and Beyond, 1st ed., Brno: Mendel University in Brno, 2013, pp. 216-224.
[4] D. Jakubík, L. Lunter, M. Brandejs, and J. Brandejsová, "A central repository of publication results, implemented as a part of systems for revealing plagiarism," Seminar on Providing Access to Grey Literature 2011: 4th Year, The National Library of Technology in Prague, 2011.
[5] M. Potthast, "Technologies for reusing text from the web," Dissertation, Bauhaus-Universität Weimar, December 2011.
[6] D. Fetterly, M. Manasse, and M. Najork, "On the evolution of clusters of near-duplicate web pages," in Proc. 1st Latin American Web Congress (LA-WEB 2003), IEEE, 2003.
[7] M. Potthast, T. Gollub, M. Hagen, M. Tippmann, J. Kiesel, E. Stamatatos, P. Rosso, and B. Stein, "Overview of the 5th international competition on plagiarism detection," in Proc. CLEF 2013 Evaluation Labs and Workshop, 2013.
[8] Š. Suchomel, "Systems for online plagiarism detection," Thesis, Faculty of Informatics, Masaryk University, Brno, 2012.
[9] R. Řehůřek and M. Kolkus, "Language identification on the web: Extending the dictionary method," in Proc. 10th International Conference on Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico: Springer-Verlag, 2009.
[10] Š. Suchomel and M. Brandejs, "Heterogeneous queries for synoptic and phrasal search," Notebook for PAN at CLEF 2014, in L. Cappellato, N. Ferro, M. Halvey, and W. Kraaij (eds.), CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers, 15-18 September, Sheffield, UK, CEUR Workshop Proceedings, CEUR-WS.org, 2014.
[11] V. Elizalde, "Using noun phrases and tf-idf for plagiarized document retrieval," Notebook for PAN at CLEF 2014, in L. Cappellato, N. Ferro, M. Halvey, and W. Kraaij (eds.), CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers, 15-18 September, Sheffield, UK, CEUR Workshop Proceedings, CEUR-WS.org, 2014.
[12] M. Potthast, T. Gollub, M. Hagen, J. Kiesel, M. Michel, A. Oberländer, M. Tippmann, A. Barrón-Cedeño, P. Gupta, P. Rosso, and B. Stein, "Overview of the 4th international competition on plagiarism detection," in CLEF 2012 Evaluation Labs and Workshop, 2012.
[13] M. Potthast, M. Hagen, A. Beyer, M. Busse, M. Tippmann, P. Rosso, and B. Stein, "Overview of the 6th international competition on plagiarism detection," in Proc. Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014.
[14] Š. Suchomel, J. Kasprzak, and M. Brandejs, "Three way search engine queries with multi-feature document comparison for plagiarism detection," in CLEF (Online Working Notes/Labs/Workshop), Rome, 2012.
[15] Š. Suchomel, J. Kasprzak, and M. Brandejs, "Diverse queries and feature type selection for plagiarism discovery," in CLEF 2013 Evaluation Labs and Workshop, Valencia, 2013.
[16] O. Haggag and S. El-Beltagy, "Plagiarism candidate retrieval using selective query formulation and discriminative query scoring," Notebook for PAN at CLEF 2013, Valencia, 2013.
[17] J. Pomikálek, jusText: Heuristic based boilerplate removal tool, 2013. [Online]. Available: http://code.google.com/p/justext/

Šimon Suchomel studied computer and information systems at the Faculty of Informatics, Masaryk University, in the Czech Republic. He is currently a PhD student in applied informatics, working on plagiarism detection. The author is a member of the Masaryk University Information System development team, where he designs and builds the online source retrieval system for the Information System's plagiarism detection tool. He also works at the faculty as an administrator of Windows-based networks. In his free time, he plays tennis and usually spends time with his family or doing sports. He is also a devotee of dogs and South Moravian white wine.