In November 2006, CFL received a call from Steven Harrop, the IT Strategy Manager for UCAS (Universities and Colleges Admissions Service). UCAS handles applications to UK higher education institutions. What he wanted to know was whether CFL could confirm his suspicions that some applicants were copying personal statements from web advisory sites, and if so could they come up a system to check all half-million applications as they came in.
CFL were contacted because they have a background in large scale electronic plagiarism and collusion detection, with customers such as the Open University with over 200,000 students and Leeds Metropolitan University with well over 30,000.
CFL was able to confirm that there was indeed evidence of copying in around 5% of the 50,000 document sample set provided. They were awarded the contract in June 2007 with a September 2007 implementation deadline. The required solution would monitor incoming applications in real-time with minimal impact on existing operations and leveraging existing hardware and be fully automatic.
The brief
UCAS wanted each of the 500,000 personal statements in an annual cycle to be checked a) against each other, b) against known web sources and c) against at least two years prior data held in their database. Applicants are restricted to 4000 characters, which approximates to 650 words or 30 sentences. Most use all the allocation.
UCAS needed to know whether any portion of the personal statements had an identifiable source in the previous data, so comparison needed to be done at sentence level.
And they needed to be able to find modified rather than exact sentences.
The size of the problem
The formula for working out how many pairwise document comparisons are needed is N*((N-1)/2) where N is the number of documents. That means that 5 documents need 10 comparisons, 500 need 124,750, so 500,000 need almost 124 billion. Comparing at sentence level with earlier years to check as well means the number of comparisons required gets truly enormous.
CFL needed to find a way of handling this volume in real time at an acceptable delivery speed.
Leveraging hardware and software investment
UCAS had available a Sun T2000, which has 8 cores each capable of running 4 threads, giving a 32 thread capacity, so the clear answer was to provide a multithreaded solution. CFL has used Java as its programming language for the last 8 years, so this provided a solid foundation. But they hadn’t written a multi-threaded version at that point. UCAS also use Oracle databases to control their side of the operation, so the program was required to interface seamlessly with such a database, achieved through JDBC.
Where to start
CFL had a clear starting point. The existing UCAS procedures cleared new applications in batches of 125, so this determined the input size. These needed to be checked against all existing files and among themselves, and this was going to need indexes. They could check 32 at a time, so the question was, how many?
Where to go next
We are entering the world of parallel universes. The fascination of this concept for science fiction writers is that putting the same character into different universes means that different things will happen in the ‘timelines’. Think Dr. Who, Back to the Future or Sliding Doors.
In the UCAS case we have a character, the incoming index, dropped into 32 worlds at once. In some worlds they will meet old friends and stop to share common experiences, book read etc. They will remember the same things in different ways. This is fuzzy sentence matching. In others they won’t find any common ground, so will be home almost as soon as they left. We need reports back from each world before a new adventure can start.
Back in the program design, multi-threading allows all the threads to be started within nanoseconds of each other, but when each will finish will depend on the number of interesting friends our ‘character’ has met.
If we provide indexes with too many possible places to look up old friends, one or two of the threads might provide much more matching and hold the rest up, so giving the CPU a rest that we don’t want. So we don’t do that. Experimentally, we discovered that if we had stored indexes 5 times the size of the incoming index, we could keep the CPU running at over 95% user allocation most of the time.
And thanks to the elasticity of the number of parallel universes both the hardware and the software allow, UCAS actually allocate 90 threads at a time, providing a steady stream of work. UCAS were able to experiment with this setting themselves because CFL had provided a means for them to adjust all the parameters to tweak the performance to their satisfaction.
Both during development and during live running, it is possible to watch the behaviour of the program in its use of both memory and CPU time, using the vmstat and prstat tools in Solaris. The end users see nothing of this, of course, and once optimum levels have been found, the system can run fully automatically.
How does it do the comparisons?
Actually how it does it is the trade secret at the heart of the CFL system, but in outline, the program compares indexes not files. The incoming index is built differently from the stored index, but in such a way that direct comparison is possible. Both types of index contain sufficient information for the program to know with absolute certainty the level of similarity between any pair of sentences in the incoming set and the store. This means that at the end of each cycle, the number of sentences matching with the store for each document in the incoming data is also known, and those that meet or exceed the criteria set can be recorded and processed further.
The levels used in development were minimum 3 sentences matched across documents, minimum of 4 non-grammar words in a sentence before checking, and minimum overall sentence similarity of 40%. These can all be adjusted at runtime in the same way as the threads. UCAS report at much lower levels than are going to require action, to ensure that no false negatives can occur. False positives can’t occur in terms of the search criteria, because everything that meets or exceeds them will be reported by the program. The decision on whether to take action is a human one.
Index and file locations
This area required a bit of thought. What we have ended up with is a structure that stores the actual files as plain text in directories named with the application year. Since September 2008 UCAS have been in 2009. The files are uniquely identified by application number, and personal ID. (If an applicant is unsuccessful in 2009 they can re-apply in 2010 with a new application number but they keep their personal ID).
The stored indexes are kept in a single folder. There are two indexes for each stored batch of around 625, one storing the file references and one storing the word/file/sentence information. These indexes are dynamically updated as the initial phase of each program run, ensuring that all new data will be compared with everything, including the current batch.
When matches are found, and the detailed comparison of actual file contents is required, the historical files can be recovered simply by reading the first 4 digits of the unique file reference, giving the directory location year, allowing the relevant file to be recovered.
Operational Control
External control is provided by a scheduled batch file, which runs every minute. This checks to see if the program is running and starts it if it isn’t.
The internal control is via an Oracle database which hold all the file names and the current state of processing. The program checks to see what is pending and grabs any filenames that are waiting. It then indexes them and kicks off the multi-threaded part.
Collecting the results
What happens when each thread finishes? Java has a useful class called ConcurrentMap. If three threads arrive with news of old friends met at the same time, ConcurrentMap only lets one of them in at a time to report this. Since each of them is guaranteed to have met a different person, each match will be unique, and is held as the key value in the map. And what they have found they have in common will often be different too, and the common sentences are held as the value for that key. It is obviously important that the correct values get recorded with the right key, otherwise we won’t be able to find them again when we go to check up. This is what the ConcurrentMap ensures.
What happens next?
Once all 900 indexes have been visited, the complete ConcurrentMap can be read and the Oracle database updated for each entry. This too could be done concurrently, as Oracle has multithreading capability, but it has not been implemented at UCAS, as the numbers found in each batch are low.
How do you test this sort of thing?
UCAS fed known examples, with different levels of similarity, into the control database spread across the years. We could then test whether they got found every time at the appropriate levels and with different thread settings.
Once we were happy with that, UCAS ran the whole of 2006 against the whole of 2007 to ensure its robustness. This took 12 days of 24/7 running, which was a pretty good test of both hardware and software. The output from this was sampled and some adjustments made.
Finally, before full activation, the live output on incoming data in September 2007 was subject to stringent User Acceptance Testing, which the program passed without further adjustment
Doesn’t this all take a lot of time?
Well, it certainly isn’t like a Google search, but Google doesn’t have to look for any more than 10 words, nor is it required to check everything and find everything. And a plagiarism detector can’t report results in terms of ‘about x’, it has to report ‘x’, and ‘x’ has to be correct.
Having said that, the longest time for a search at the end of a 125 batch cycle is 7 minutes, including checking which of the possible matches is most likely, (there are inevitably a lot of near duplicates of popular sources in 500,000 personal statements), and marking up the file with detailed matching and changes for use by the Verification Unit.
This is well within what the Verification Unit need as real-time, and produces an acceptable performance, even at the submission deadline peak times, when nearly 40,000 applications are received in a single day. The hardware/software combination clears any day end backlog by around 2am, so the Verification Unit come in to a completely accurate pending situation in the database.
What do the end users see?
Just a database screen, allowing them to filter at different levels of matching. You can find an example screen shot here. http://www.copycatchgold.com/UCASScreen.html They can click on a Link field to see the whole personal statement marked up in different colours, identifying the different sources and can navigate to those sources for comparison purposes if they need to.
Do they like it?
The user satisfaction level is very high. The IT team like it because it just runs. The Verification Unit like it because the results reported are those that they would find themselves if they were capable of checking a million or so prior statements in a few minutes. In addition, the design of the output, jointly agreed by UCAS and CFL, means that they can reach decisions very quickly and their workload and workflow has not been greatly affected.
Steven Harrop, UCAS’ head of technology and strategy, says: “I have been very impressed with the performance of the CFL solution both in terms of the quality of the similarity detection and the ability to handle such a sizeable task on a small entry level server. The clear design of the similarity transcripts has enabled us to provide these to both the University and the Applicant; this has allowed all parties to quickly see the level of similarity detected.”