Student-Coursework Plagiarism Detector

This computer program was originally written in 1987, based on ideas by David Barnes (now at the University of Kent at Canterbury), which, in turn, were based on ideas by Rees (1982) and Halstead (1977). Many of the ideas were subsequently used in a program called Bandit.

When students hand in their computer programming course work, their source files are collected together in a single directory. The pfing program takes each one in turn, and characterises it by a string of 20 (or so) integer parameters. This is the so-called fingerprint of each student's program (hence the name, pfing).

Parameter Collection

As pfing parses each student's program, the it builds up a simple symbol table, which has already been pre-loaded with the built-in primitives and keywords of the language in which the student coursework is written (the Pascal programming language, for example).

Two of the parameters that pfing generates, E and F, are simple counts of the number of blank and non-blank ASCII characters in the original source file (with two others, O and C, counting the number that occur within program comments, and another pair, G and H, counting the number of upper and lower case characters). Another pair, J and L, counts the number of blank and non-blank lines in the file; and another, T and S, the number of empty and non-empty programming statements.

One parameter, P, counts the size of the vocabulary of primitive identifiers (pids) used by the student, and another, R, counts the total number of times that those primitives are instantiated. One integer, V, does a similar thing for the vocabulary of user-defined identifiers (uids), and U, counts the total usage of these. Yet another parameter, N, indicates the cumulative length, as a total number of ASCII characters, of the names of all the uids put together (thus, N/V is the mean length of a uid name).

One parameter, M, counts the number of programming modules started (functions and procedures in Pascal), and another, B, the number of program blocks (the number of BEGIN statements in Pascal). Another, A, contains the cumulative sum of the block depths of all the active statements (thus, A/S gives the mean depth of the non-empty statements in the program).

One parameter, W, gives the coefficient of correlation of "block indentation" against "block depth", and X and Y give the "gradient" and "y-axis intercept" of this line of correlation.

Automatic Marking Aid

As well as looking for signs of possible plagiarism between coursework, the program is able to offer some marking advice to the teacher. Some syntax errors (imbalances between BEGINs and ENDs of blocks, or starts and ends of comments, for example) are evidence that the student's program does not even compile, and possibly that the student is in need of help from the teacher. Some aspects of good programming style, such as appropriately long uid names, use of comments and mixed case, and consistency of indentation for programming blocks, are given an automatically generated mark. Of course, this mark can then be taken or ignored, at the discretion of the teacher.

Plagiarism Detection

The main purpose of pfing, though, was to collect characteristic parameters that tend to be invariant for each programmer, but that vary from one programmer to another. In short, to act as a plagiarism detector.

Many experiments were conducted, and a number of fairly good characteristic parameters were found. None of them was sufficiently good as a stand-alone parameter, though.

Two parameters, (U+R)/S and U/V, were identified as being poor indicators in isolation, but good when used in combination. The former, also called Q, is the mean number of symbols per programming statement, and the latter, the mean usage per uid. To aid the detection of a similarity in both of the parameters together, the two parameters were used as the Cartesian coordinates on a 2D plot, with the teacher being alerted, visually, to any student programs that clustered together, and hence would then be on the look-out for evidence of plagiarism when marking those programs in the usual, manual, way.

One somewhat contrived parameter, (Q-4)/S, was found, empirically, to be an even better indicator. It was good enough to be used as a sort key, causing dissimilar programs to be sorted far apart, and similar ones to be sorted close together. This can be used to advise on the best order in which to mark the coursework. In this way, any plagiarism becomes apparent just by virtue of the fact that related programs are still fresh in the teacher's mind when they are marked close together.

It is fairly easy to see how a plagiarising student could set out to mask any similarity in the (U+R)/S and U/V parameters. The (Q-4)/S parameter was the result of looking for characteristics that were less easily forged. It is not perfect, and more work would be needed to find better parameters. However, one of the aims of pfing was for it to be used as a deterrent for plagiarism, rather than just as a detector. By knowing that such a program was being used by the teacher, and not knowing exactly what parameters were being collected, the students are discouraged from even trying to cheat; and, in any case, the skill involved in forging the values of even the (U+R)/S and U/V parameters, in some non-blatant way, is perhaps greater, and more deserving of marks, than that involved in doing the coursework the intended way.

Further Work

Parameters might also be sought that analyse how different the current year's class is from those of previous years, as an experiment to see how programming style varies over the years. In any case, it is necessary to run pfing with previous years' coursework, to check for coursework that has been plagiarised from previous years' students.

Another experiment that was never tried would be to characterise other things, other than the student programmers. For instance, to characterise the students as a group, and to identify the style of the teacher: a parameter that would give a different value for students taught by a different teacher.

Pfing was written so that it could be configured, by loading the appropriate initialisation file, for coursework written in any computer programming language (Pascal, C, Fortran, Lisp, etc). It was never tried, but could, in principle, have been used to as the seed of a program for scanning ASCII text written in English. The idea of finding characteristics that are invariant for each author, but different for different authors, was at that time being widely discussed for identifying the authorship of Shakespearean plays and sonnets, for example (albeit, to detect the opposite situation, of authors fraudulently attempting to copy the style of someone else).

Details

The program, itself, was extremely simple. Most of the hard work went into devising parameters to collect, and how to combine them.

First, there was a shell script (written for Primos, in fact) to do the work of compiling the directory listing of student coursework files, and to pass the list to the main program. The main program then just appended a single line of output (the file name, followed by each of the parameters, A, B, C, E, F, etc) for each time round the loop (once for each student program). Then, the shell script called a second program that took the output file, and wrote out a second file with some extra, compound parameters (such as (U+R)/S, U/V and (Q-4)/S) inserted into each line. Then the shell script called the sort utility to sort into a third file (for example on the (Q-4)/S column). Finally, this was read in by a final pascal program, that allocated an identifier to each student file (a single letter identifier, A to Z for the first 26 files, a to z for the next 26, and 0 to 9 for the next 10) and placed this letter in a suitable position in a two dimensional grid, to be output as a plot. This final program first asked the user which two columns would be used for the plot, and then worked out the appropriate scaling and offset to ensure that the interesting part of the plot was scaled to fit the whole plot area, which in turn was designed to fit on a single page of line-printer paper.

As for the first program, all it did was to open each coursework file in turn, and to read it in character by character. A sequence of alphanumeric characters, such as count33, would be treated as a single word, and have a new entry appended to the symbol table (unless that word was already in the symbol table). Similarly, a sequence of non-alphanumeric characters, such as <=, would be treated as a symbol, and added to the symbol table. Numerical constants were also treated as user symbols, and added to the table.

The parameters E, F, O, C, G and H were just integers that were incremented appropriately as each character was read in from the student file.

Each line in the symbol table had an integer for how many times that symbol had been encountered in the file. Counting up the number of non-zero occurrences gave the values of P and V, while totalling them gave the values of R and U.

Before opening the student file, the program first opened an initialisation file. It chose which initialisation file to open according to the file extension on the filename of the first student coursework. Thus, if it was to analyse smith105.pas, it first read in the Pascal initialisation file.

This initialisation file started with a line that listed the alphabet of alphanumeric characters, and an indication of whether case was to be considered. Then there were lines to say how to recognise certain constructs in the language ( ; as the statement separator in pascal, and also in C; BEGIN as the block begin in pascal, but { in C; similarly for END and }, /* and */ or // for comments in C, etc.). There was even had a suitable initialisation file for Fortran, triggering switches for recognising "C" in the first column as the start of a comment, and a character in the sixth column as indicating an extension of the same line.

Finally, there was a list of primitive symbols in the chosen language ( if then else while do := + - * / <= => etc.). It is, of course, for this reason that some of the primitives in the symbol table might have been encountered zero times in the student coursework file, and so not be counted in the P value.

Thus, there were all sorts of ad hoc counters that were added, triggered by appropriate events (J counted the number of empty lines, that is two newline characters, separated possibly only by blanks and tabs; and T did something similar for empty statements, that is two statement separators without any primitive or user symbols between them. L and S were then just the counters that were incremented in the complementary cases).

The coefficient of correlation for the number of spaces of indent of statement against the block depth was also simple enough (involving simple accumulators for sigma X, sigma Y, sigma XX, sigma YY and sigma XY, and perform a simple bit of arithmetic at the end).

Réalisations précédentes

Contact