A magical fraud finder
Posted: October 17, 2011
I hope you’ll indulge me moving a little away from normal dashboard topics today, to a neat rule called Benford’s Law. This is handy if you’re trying to see whether someone’s data cleaning exercise has made the data a little *ahem* too clean. (See the argument for the good kind of data cleaning here.)
The basic insight of this law is that, in most naturally generated lists of numbers, the first digits tend to follow a specific distribution, with about 30% of leading digits being 1, down to about 5% of leading digits being 9. The law works best with numbers that are unbounded and distributed across multiple orders of magnitude – for example, line items in company accounts, which could go from £1 up past £1,000,000. In contrast, when people invent data, they tend to go for an even distribution of first digits, which really sticks out against a Benford distribution.
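To put numbers on that distribution: Benford's Law gives the expected share of leading digit d as log10(1 + 1/d). Here is a minimal Python sketch of that formula, plus a helper for pulling the first significant digit out of a number. The function names `benford_expected` and `leading_digit` are made up for illustration and are not taken from any library.

```python
import math


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First non-zero digit of x, or None if x is zero.

    Works on the text form of the number, so 0.0052 gives 5 and -87.5 gives 8.
    """
    return next((int(ch) for ch in str(abs(x)) if ch in "123456789"), None)


if __name__ == "__main__":
    for digit, share in benford_expected().items():
        print(f"{digit}: {share:.1%}")
```

Running it prints the nine expected shares, which always add up to 100% because the logarithms telescope to log10(10) = 1.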
This gives us an easy way to see if books may have been cooked: chart the actual distribution of first digits in the data against the distribution expected under Benford’s Law, and see how close the two lines run. Even quicker, you can do something called a chi-square goodness-of-fit test (of which, more another time – no need to worry about the technical detail for today), which gives you a single numerical measure of how good the fit is.
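For the curious, the single number in question can be sketched in a few lines of Python. This is an illustration only, not the Incanter implementation: the names `first_digit` and `benford_chi_square` are hypothetical, and the cut-off of 15.507 is the standard 5% critical value for a chi-square statistic with 8 degrees of freedom (nine digit categories minus one). Smaller values mean a closer fit to Benford.

```python
import math
from collections import Counter

# Standard 5% critical value for chi-square with 8 degrees of freedom.
CHI2_CRITICAL_DF8_P05 = 15.507


def first_digit(x):
    """First non-zero digit of x, or None if x is zero."""
    return next((int(ch) for ch in str(abs(x)) if ch in "123456789"), None)


def benford_chi_square(values):
    """Chi-square statistic of observed first digits against Benford's Law."""
    digits = [d for d in map(first_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one non-zero value")
    observed = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # expected count for digit d
        chi2 += (observed[d] - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    # Powers of two are a classic example of Benford-like data.
    natural = [2 ** k for k in range(100)]
    # Invented-looking data: every first digit equally common.
    too_even = [d * 100 + 7 for d in range(1, 10) for _ in range(100)]
    for label, data in [("powers of two", natural), ("even digits", too_even)]:
        stat = benford_chi_square(data)
        verdict = "plausible" if stat < CHI2_CRITICAL_DF8_P05 else "suspicious"
        print(f"{label}: chi-square = {stat:.2f} ({verdict})")
```

A statistic above the cut-off only says the first digits are unlikely to have come from a Benford distribution; it does not say why, and small samples make the test unreliable.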
Here is a great article explaining some real world uses, from spotting financial fraud to spotting fiddled murder statistics.
Of course, this law is not going to tell you where the iffy numbers actually came from, but if you already suspect the data of being ‘off’, it’s a great first step towards confirming or allaying those suspicions.
We implemented an easy Benford’s Law checker in Incanter this week; hopefully people will apply it to lots of new datasets and spot some interesting things going on.