Thursday, June 2, 2011

Plagiarism Detection Software and zu Guttenberg's Thesis

We* couldn't resist. Here we had a thesis, a very large one (475 pages), for which a group of collaborators, the GuttenPlag Wiki people, had already determined which bits were plagiarized from which sources. We decided to see how plagiarism detection software would fare on the same material.

We wrote to the five systems that we rated "partially useful" (a top score) in our 2010 test: PlagAware, turnitin, Ephorus, PlagScan, and Urkund. All of the companies were glad to provide us with a test account. Turnitin did, however, suggest that we use iThenticate, one of their products that uses the same engine and backend as turnitin but offers an easier-to-use interface for the task at hand: testing individual files rather than modeling classes.

We obtained a copy of the thesis in PDF format and started in.

The first problem was the size of the file. At 7.3 MB and about 190,000 words, it was a heavyweight and not easily digestible for the systems. PlagAware gave up after just 159 pages; iThenticate chopped it into 13 pieces of about 15,000 words each; Ephorus first tested it themselves, then reluctantly let us upload our copy, because we wanted all systems to work on the same copy; PlagScan chugged away overnight on the results; Urkund ran into trouble with the number of hits, and we appear to have taken down their entire system over the weekend.
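For readers who wonder what such chopping amounts to: here is a minimal sketch in Python of splitting a text into pieces of roughly 15,000 words. This is purely our illustration of the idea, not what iThenticate actually runs, and the file name is a placeholder.

    # Minimal sketch: split a very large text into pieces of roughly
    # 15,000 words, the kind of chunking needed when a system cannot
    # digest the whole file at once.
    def chunk_words(text, size=15000):
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    # "thesis.txt" is a placeholder for a plain-text export of the PDF.
    with open("thesis.txt", encoding="utf-8") as handle:
        pieces = chunk_words(handle.read())
    print(len(pieces))  # about 13 pieces for a 190,000-word text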

The next problem was the PDF of the thesis itself. It was formatted nicely with ligatures and different-sized spaces for a very professional look. Many systems struggled with this, assuming these were stray characters and just discarding them instead of replacing them with the corresponding letters (fi, fl) or a plain blank. This caused some plagiarism to be missed.
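Such characters can be normalized before comparison. A minimal sketch in Python using Unicode compatibility normalization, again our illustration and not what any of the tested systems actually do:

    import unicodedata

    # Minimal sketch: map ligature glyphs such as U+FB01 (fi) and U+FB02 (fl)
    # back to their component letters, and typographic spaces to plain
    # blanks, before handing the text to a matching engine.
    def normalize_pdf_text(text):
        text = unicodedata.normalize("NFKC", text)
        # NFKC already converts most special spaces; this catches the rest.
        return "".join(" " if unicodedata.category(ch) == "Zs" else ch
                       for ch in text)

    print(normalize_pdf_text("ﬁne\u2009ﬂow"))  # prints "fine flow"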

Then we had the result reports. Since many copies of the same text are available online, many systems inflated their results by counting the same source numerous times. We made every effort to disregard anything published on the GuttenPlag site; iThenticate had a nice function for excluding that site. On the other hand, over 40% of the links that iThenticate returned led to 404s, pages no longer available at that URL, because the sites had reorganized their material and URLs. We researched a number of them; the sources are, indeed, still online. But dead links are unconvincing for a dissertation board, which needs to be able to verify both the source and the copy.
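Weeding out the dead links was pure hand work. A small sketch of the kind of check involved (the URL is a placeholder; some servers reject HEAD requests, so this is only a rough filter):

    import urllib.request
    import urllib.error

    # Minimal sketch: check whether a reported source URL still resolves.
    def check_url(url, timeout=10.0):
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.getcode()
        except urllib.error.HTTPError as err:
            return err.code  # e.g. 404 for a vanished page

    # Placeholder URL for illustration only.
    print(check_url("http://example.com/source.pdf"))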

The GuttenPlag people have determined that 94% of the pages contain plagiarism, affecting 63% of the lines. The following are the results for the individual systems:
PlagAware: Initially 28% on the first 159 pages; however, this included a lot of garbage such as pastebin material. After we removed this and the GuttenPlag links, the amount rose to 68% before the report disappeared completely. We have not been able to resubmit; it breaks off with an error.
iThenticate: 40%
Ephorus: 5%! Only 10 possible sources found; of these, 3 were GuttenPlag and one was a duplicate
PlagScan: 15.9%
Urkund: 21%

One general problem that we had with all of the reports was that we could not click on a plagiarized passage and discover the page number in the PDF source. This is something we would need when preparing a case for the dissertation board, as the side-by-side documentation needs to be marked in copies of the original text and not online. Finding the pages by hand for such a documentation would be a lot of work.
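The lookup itself would not be hard to automate. A minimal sketch using the pypdf package; the file name and phrase are placeholders, and this is our illustration, not a feature of any of the systems:

    from pypdf import PdfReader

    # Minimal sketch: find which pages of the PDF contain a matched
    # phrase, the step the reports left us to do by hand.
    def pages_containing(pdf_path, phrase):
        reader = PdfReader(pdf_path)
        phrase = phrase.lower()
        return [number
                for number, page in enumerate(reader.pages, start=1)
                if phrase in (page.extract_text() or "").lower()]

    # Placeholders for illustration only.
    print(pages_containing("thesis.pdf", "some matched phrase"))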

Another problem was the presentation of the footnotes. These were generally not recognized by any of the systems, and often the footnote number and text were just inserted in the middle of a paragraph. This often got in the way of marking a larger block of one-to-one plagiarism. The Wiki has found an interesting type of plagiarism we now call “Cut & Slide”: from a large block of plagiarized material, a portion is cut out and demoted to a footnote.

Reports

The PlagAware report disappeared from the database after a software update. We were rather upset that reports that had been produced (and would have been paid for, had we been real customers) were suddenly gone. The side-by-side view that we have always liked turns out to have an extreme problem: the system often marked just 3-4 words, then noted an elision ("[...]") that sometimes spanned a page or so, and then another 3-4 words. Since this happened at the very beginning, one had to scroll a good bit to get to the first real plagiarism, the Zehnpfennig one (from the FAZ). A law professor might have considered the system broken, reporting minimal plagiarism, and just broken off the check.

iThenticate drove us batty with the 404s and the inability to copy and paste from the online report. Many of the sources were large online PDFs containing numerous papers, so we had to retype some words and phrases in order to search for the plagiarism source in the link given. It also reports proper quotes as plagiarism; for example, the appendix, which contains material available completely online, is reported to be a 60% plagiarism.

Ephorus has a nice side-by-side, but included a COMPLETE COPY of the entire dissertation for every source found. We assume that this is why it broke off after just 10 sources: the report had already reached 54 MB! The reports don't need to repeat material that is not plagiarized, just the plagiarized passages, but preferably with the page numbers, please.

PlagScan also irritated us with its little dropdown list of sources. We had to open every source and then look to see where the plagiarism was; this took an enormous amount of time.

Urkund was extremely slow; we understand that this had to do with the number of hits found. The navigation, as we had found in the 2010 test, was difficult to use, and the numbers given had no real meaning. This report now cannot be loaded at all: it counts up to 90, stalls at 89 of 90, and after some minutes gives up and tells us to come back another day.

38 out of the currently known 131 sources were found by at least one system. Overall, PlagAware found 7 (5%), iThenticate 30 (23%), Ephorus 6 (5%), PlagScan 19 (15%) and Urkund 16 (12%).

Out of the top 20 sources, however, only 8 were “findable”, i.e. available online. Many were books that are available on Google Books (and are reported by Google through a normal Google search); 7 of the top 20 were papers prepared by the research services of the German Bundestag and were thus not available to the public.

The top 43 sources identified by GuttenPlag include the top 11 sources found by the systems (3 found by all five systems, 3 by four out of five, and 5 by three out of five) as well as the top 20 sources that were available online. For those top 20 online sources, the results are as follows: PlagAware and Ephorus each found 6 (30%), iThenticate 16 (80%), PlagScan 12 (60%), and Urkund 13 (65%).

Since iThenticate found so many sources, we wanted to go back and look at where in the results these were to be found. iThenticate had reported 1156 sources, of which only just over 400 were longer than 20 words. We only looked at the 117 reported sources that had a match of 100 words or longer. Of the 13 top sources reported by iThenticate (one for each portion), 6 were 404s (“File not found”), one was a correctly quoted portion in the thesis, one was a correspondence between the bibliography and the bibliography of another source, 3 were from the same source (Volkmann-Schluck) that is indeed the top source for the thesis, and 2 were from the second most used source that was available online.

No easy answer

There is no easy answer to the question as to whether the professors at the University of Bayreuth would have been able to discover the plagiarism in this thesis with the help of software. The usability problems are very serious: people who are not computer scientists have a hard time interpreting the results, and iThenticate's links to 404s will lead people to disregard other sources found. But it would have been possible for the university to at least suspect a problem, if not to see the abysmal magnitude of the plagiarism.

Our suggestion stands: If a teacher or examiner has doubts about a thesis, they should be able to use software systems, preferably two or three, to examine the material. However, they need to be trained in interpreting the results, or have a trained person such as a librarian go over the report with them. We do not find it generally useful to put all papers through such a system with a simple threshold set for alerting the teacher. Allowing students to “test” their own papers will just encourage them to use one of the many online synonymization tools available until their paper “passes”. Writing, especially scientific writing, is concerned with intensive work on the text itself, not just superficial attention to word order and word choice.

The German report on this research can be found in the June 2011 edition of iX. (Update: Now available online at http://www.heise.de/ix/artikel/Kopienjaeger-1245288.html)

[*] Katrin Köhler is my co-author for this research and report.
