I have just been reading the news on Slashdot (yes, I still read Slashdot) and found a very interesting article about a new technique for detecting GPL violations in obfuscated proprietary code. The technique is called birthmarking, and basically it ‘observes short sequences of method calls received by individual objects from the Java Platform Standard API, which is part of the Java Runtime Environment. By aggregating sets of short call sequences the otherwise overwhelming volume of trace data becomes manageable.’
In other words, every piece of Java code (and a library is no exception) moves data to and from the stack at the bytecode level following a recognisable pattern. So they identify and classify the patterns of some key objects in a specific piece of GPL code, and then they check how the proprietary code uses the stack. If it matches… gotcha!
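To make the idea of ‘aggregating sets of short call sequences’ a bit more concrete, here is a minimal sketch in Python. It is not the algorithm from the paper: the window length `k`, the `containment` score and the example traces are all my own assumptions, chosen only to show how a long trace collapses into a small set of short sequences that can be compared.

```python
def call_ngrams(trace, k=3):
    """Return the set of all windows of k consecutive calls in a trace."""
    return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}


def containment(library_trace, suspect_trace, k=3):
    """Fraction of the library's k-call sequences also seen in the suspect."""
    lib = call_ngrams(library_trace, k)
    sus = call_ngrams(suspect_trace, k)
    if not lib:
        return 0.0
    return len(lib & sus) / len(lib)


# Hypothetical traces: method calls received by a single API object.
gpl_trace = ["<init>", "append", "append", "insert", "toString"]
suspect_trace = ["<init>", "append", "append", "insert", "toString", "length"]
unrelated_trace = ["<init>", "put", "get", "remove", "size"]

print(containment(gpl_trace, suspect_trace))
print(containment(gpl_trace, unrelated_trace))
```

Because only the set of short sequences is kept, renaming classes or variables in the suspect program does not change the score, as long as the objects still receive the same calls in the same order.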
But I will tell you something… I saw this concept implemented 15 years ago! Not for Java, obviously. Where? At the Faculty of Computer Science of the Polytechnic University of Madrid. There were two courses called Computer Architecture and Computer Fundamentals, with lab assignments in which we learned to code in assembler and microcode. It seems that students copied assignments from each other, but they were smart enough to copy only pieces of another student's code, or to rename the variables (or labels; it was low-level programming). The core functionality, however, was copied intact. So the department in charge developed a program called ‘El Corrector’. This program checked the inputs and outputs of each assignment automatically (just as we do today with automated unit testing) and also checked how the program moved data to and from the stack. All the assignments were compared with each other (the patterns were the other students' submissions). If the department detected similarities in how two programs behaved with the stack, they examined the source code and tried to identify a possible copy by eye.
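I never saw the internals of ‘El Corrector’, so the following Python sketch is only my guess at the general shape of such a check. The representation of a submission as a list of `push`/`pop` operations, the use of `difflib.SequenceMatcher` and the 0.9 threshold are all assumptions of mine. The point is simply that every submission is compared against every other one, and only the suspicious pairs are passed on for manual review.

```python
from difflib import SequenceMatcher
from itertools import combinations


def stack_similarity(ops_a, ops_b):
    """Similarity in [0, 1] between two sequences of stack operations."""
    return SequenceMatcher(None, ops_a, ops_b).ratio()


def flag_suspicious_pairs(submissions, threshold=0.9):
    """Compare every pair of submissions; return those at or above threshold."""
    flagged = []
    for (name_a, ops_a), (name_b, ops_b) in combinations(sorted(submissions.items()), 2):
        score = stack_similarity(ops_a, ops_b)
        if score >= threshold:
            flagged.append((name_a, name_b, score))
    return flagged


# Hypothetical recorded stack behaviour; labels and variable names are
# irrelevant here because only the operations themselves are recorded.
submissions = {
    "student_a": ["push", "push", "pop", "push", "pop", "pop"],
    "student_b": ["push", "push", "pop", "push", "pop", "pop"],
    "student_c": ["push", "pop", "push", "pop", "push", "pop", "push", "pop"],
}

for name_a, name_b, score in flag_suspicious_pairs(submissions):
    print(f"{name_a} and {name_b} look alike ({score:.2f}): check the source by hand")
```

A flagged pair is not proof of copying, only a hint about where to look, which is why the final decision was left to a person reading the source code.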
A friend of mine was getting a bit desperate with his assignment, so he asked another friend to let him have a look at his ‘to get some inspiration’. The second friend gave him his source code, and my friend copied parts of it. Both of them were caught by ‘El Corrector’, and both of them had to repeat the course.
So the ideas behind this paper are not that new, but they are being applied to a real-world problem.