Powerful plagiarism and collusion detection

Used by the professions, where data security is a prime concern. Uses ‘fuzzy-matching’ to detect re-writing as well as direct copying.

Sophisticated searching

A completely new way of searching large numbers of documents: no keywords, no proximity settings, no ANDs or ORs. Just use a whole document or set of documents as the search entry. Investigator does the rest by identifying similar sentences in the documents and presenting the results ordered by the strongest sentence links between documents.

Fast
Scalable
GUI or automatic
Multi-threaded, multi-processor capability
Multi-platform - written in Java.
CopyCatch Investigator Search Screen



Purpose

CopyCatch Investigator looks for similarity between sentences in documents without using keywords or any other user-entered search patterns. This makes it different from most search engines and databases in three ways.

  • Whole documents or sets of documents are used as the search data instead of keywords, phrases or Boolean operators.
  • The program uses the level of similarity required by the user to examine each document against the index selected.
  • It looks for similarity, not identity, so it is insensitive to changes in word order, the use of a thesaurus to change some words, and the insertion or deletion of material. It finds identity as well, of course, as the extreme case of similarity.

Despite the much more detailed comparison required (every word in the document, not just the words in a query), CopyCatch Investigator delivers fully marked-up document pairs for a single document in around one second.
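
This page does not describe the matching algorithm itself. As a rough illustration of what "similarity, not identity" can mean, the sketch below compares two sentences by the overlap of their word sets, which ignores word order. The function names and the choice of Jaccard overlap are assumptions made for this example, not CopyCatch Investigator's actual method.

```python
import re

def words(sentence):
    """Lower-cased set of words in a sentence (order and repeats are discarded)."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

def sentence_similarity(a, b):
    """Jaccard overlap of the two word sets, from 0.0 (nothing shared) to 1.0."""
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

original = "The committee rejected the proposal after a long debate"
reordered = "After a long debate the committee rejected the proposal"
reworded = "The committee dismissed the proposal after a lengthy debate"

# Reordering leaves the score at the maximum; swapping two words lowers it
# but still leaves a clear link between the sentences.
print(sentence_similarity(original, reordered))
print(sentence_similarity(original, reworded))
```

In this toy measure, identical sentences score 1.0, which matches the idea of identity as the extreme case of similarity.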

Interface

The interface has been designed in consultation with users, and is extremely simple to operate, with only two main screens.

  • The Searching tab allows you to choose the files and indexes, set parameters and review the results.
  • The Indexing tab allows you to create indexes, switch languages and use other word lists.

Two further screens give you information about the current document pair:

  • The Content Words tab shows how many words are shared and how many appear in only one file or the other.
  • The Statistics tab gives a summary of the amount of sharing of words and sentences in the pair.
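
The kind of counts the Content Words tab reports can be pictured with plain set operations. This is only a sketch: the function name, the dictionary keys and the sample word lists are hypothetical, and the program's own calculation is not documented here.

```python
def content_word_counts(words_a, words_b):
    """Count words shared by both files and words found in only one of them."""
    a, b = set(words_a), set(words_b)
    return {
        "shared": len(a & b),
        "only_in_a": len(a - b),
        "only_in_b": len(b - a),
    }

# Made-up content words for two files
file_a = ["contract", "breach", "damages", "clause"]
file_b = ["contract", "breach", "penalty"]

counts = content_word_counts(file_a, file_b)
print(counts)
```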

Presentation

Sentences identified as similar are shown side by side, in the order of the document being used as a query. The similar sentences are cross-referenced to the position in the current indexed file which has been found to share material. You have the option of seeing both files side by side fully marked up, so that related sentences can be seen in the context of the different or less similar sentences. In the screenshot above, sentence 17 on the right is a modified cut and paste of sentence 46 on the left (or vice versa), whereas 19 is almost certainly a contraction of 48. You can also see that neither example has long successive runs of words in common. The program does not take account of word order, either, so substantial re-writing can be identified.

Indexing

Investigator is built with the recognition that different users have different requirements:

  • Forensic Analysts and Plagiarism Investigators might need to index every word and get reports at a low similarity level.
  • Lawyers might have very long and complex documents, and only need to index longer sentences, with some common terminology ignored.

Investigator allows the user to set the levels of indexing and the words which should be ignored. The user can also choose the number of indexes, and so can index in larger or smaller units, or have different levels of indexing on the same set of documents.

Reporting

Levels of reporting are also chosen by the user.

  • The minimum number of sentences in common can be selected, when limited use of source material is known or expected.
  • The minimum number of words which must match in a sentence can also be set. If you only want longer sentences, this can be set high; if you want all sentences you set it low.
  • The level of sentence similarity can also be set. Do you want at least 50% matching or do you need 30%? This depends on what you are looking for and what you know or find out about the way the indexed material is being used, so you just move the slider to the required level.

All the sliders which set the limits are interactive, so you just need to move them up if you have got too many results or down if you have got fewer than you expected or need.
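
The three reporting limits can be thought of as a simple filter over candidate sentence pairs. The sketch below is illustrative only: `SentenceMatch`, `filter_matches`, the default values and the sample scores are all hypothetical, and the program's internal logic may differ.

```python
from dataclasses import dataclass

@dataclass
class SentenceMatch:
    query_sentence: int     # sentence number in the query document
    indexed_sentence: int   # sentence number in the indexed document
    shared_words: int       # number of matching words in the pair
    similarity: float       # 0.0 to 1.0

def filter_matches(matches, min_sentences=1, min_words=3, min_similarity=0.5):
    """Apply the word-count and similarity limits, then the sentence-count limit."""
    kept = [m for m in matches
            if m.shared_words >= min_words and m.similarity >= min_similarity]
    # Report the document pair only if enough sentences survive the limits.
    return kept if len(kept) >= min_sentences else []

# Made-up candidate pairs for one document pair
matches = [
    SentenceMatch(1, 4, shared_words=9, similarity=0.82),
    SentenceMatch(2, 7, shared_words=5, similarity=0.41),
    SentenceMatch(3, 9, shared_words=2, similarity=0.66),
]

print(len(filter_matches(matches, min_similarity=0.5)))  # stricter limit
print(len(filter_matches(matches, min_similarity=0.3)))  # looser limit
```

Lowering `min_similarity` from 0.5 to 0.3 lets the second pair through, which mirrors moving a slider down when fewer results appear than expected.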

Search

Both web search engines and database search engines are very fast at delivering answers once you have formulated the questions. What users and the suppliers of the search software tend to overlook is that the total search time involves:

  • Constructing an appropriate query.
  • Searching.
  • Considering the answers returned.

The fastest bit by far is the searching. You, as the user, have most of the work to do. CopyCatch Investigator removes the first stage altogether. All you need to do is decide how many results you think you need. It also accelerates stage three considerably, because all the matching is immediately visible, and the results are sorted to help you find the documents most similar to the query document.

Multilingual

The program uses lists of function words to assist the discrimination process, so if you have such a list in a plain text file then you can switch languages with a couple of mouse clicks. We have a number of such lists available on request. Note: it can’t find similarities between documents written in two different languages.
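
To show how a plain-text function word list could separate function words from content words, here is a minimal sketch. The one-word-per-line file layout, the function names and the short French list are assumptions for this example; the format the program actually expects is not specified on this page.

```python
import re
from pathlib import Path

def parse_function_words(text):
    """Build a function word set from plain text, assuming one word per line."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def load_function_words(path):
    """Read a function word list from a UTF-8 plain text file."""
    return parse_function_words(Path(path).read_text(encoding="utf-8"))

def content_words(sentence, function_words):
    """Words of the sentence, in order, with function words removed."""
    return [w for w in re.findall(r"\w+", sentence.lower())
            if w not in function_words]

# A tiny, incomplete French list, used here only for illustration
french = parse_function_words("le\nla\nles\nde\net\nun\nune\n")
print(content_words("Le chat et la souris", french))
```

Switching language in this sketch means nothing more than passing a different word set, for example one returned by `load_function_words` for another file.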