# Correlation and its discontents

**Posted:** October 22, 2011

**Filed under:** Uncategorized

**Tags:** maths

Another powerful addition to the business statistics toolbox today: correlation coefficients. These should be used in the context of two golden rules:

**1. Draw a graph**

**2. Keep using your brain** (I know it shouldn’t need saying, but honestly, people don’t always manage this one).

Here’s what you need to know: most of the time when people talk about correlation in a business context, they’re talking about linear correlation between two variables – i.e. how neatly they fit on a line when plotted against one another. There’s usually an input variable (e.g. advertising spend per month) and an output (e.g. sales per month). The correlation coefficient is somewhere between -1 and 1. The absolute value tells you how well the two variables correlate, with 0 being no linear relationship at all, and 1 or -1 being a perfect line, so that if you know the input, the output is perfectly predictable. The sign tells you whether the line slopes up (positive) or down (negative).
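If you want to see what goes into that number, here is a minimal sketch of the standard Pearson correlation calculation in plain Python. The function name `pearson_r` and the monthly figures are made up purely for illustration; they are not real data.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, and how much each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

# Hypothetical monthly figures, invented for illustration only.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]

r = pearson_r(ad_spend, sales)
print(round(r, 3))
```

A value close to 1 here means the invented points sit almost exactly on an upward-sloping line.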

Here are some lovely well-correlated points.

These have a correlation coefficient of 0.99, which is not very surprising considering how tidily they line up.

Looking at correlation can be helpful in business, as it measures how well you can predict outputs from inputs. The other essential piece of the puzzle is a regression line: this is the ‘line of best fit’ between your points. The correlation coefficient tells you how good that ‘best fit’ is. Here’s a regression line (in red) for that suspiciously neat set of points that we just looked at.
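For the curious, a line of best fit can be sketched with the usual least-squares formulas. This is a rough illustration, not necessarily how the chart in this post was produced; `fit_line` and the numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    slope = co_variation / spread_x
    # The best-fit line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical monthly figures, invented for illustration only.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]

slope, intercept = fit_line(ad_spend, sales)

# Predict the output for an input we have not seen yet.
predicted_sales = intercept + slope * 60
print(slope, intercept, predicted_sales)
```

The prediction is only as trustworthy as the fit, which is exactly what the correlation coefficient is there to tell you.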

If you end up with a clear-cut relationship like this, you’re home and dry: whenever you know ‘input’ you can predict ‘output’, so for example you can predict your sales from your advertising spend. You’ll often see linear correlation being assumed by business managers in the form of ‘rules of thumb’. For example, when breaking into new markets, businesses often guess at a very simple linear relationship between population and revenue: if they make £1m in the UK, they assume they can make £1m in France, as it has a similar population, and £1.3m in Germany, as it has roughly a third more people, all other things being equal.
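That rule of thumb is just proportional scaling. A tiny sketch, using relative population units (3 for the home market, 4 for a market with a third more people) rather than real census figures; `scale_revenue` is a hypothetical helper.

```python
def scale_revenue(base_revenue, base_population, target_population):
    """Assume revenue scales in direct proportion to population."""
    return base_revenue * target_population / base_population

uk_revenue_m = 1.0  # £1m, as in the example above

# Similar population: same revenue.
france_m = scale_revenue(uk_revenue_m, 3, 3)
# Roughly a third more people: roughly a third more revenue.
germany_m = scale_revenue(uk_revenue_m, 3, 4)

print(round(france_m, 2), round(germany_m, 2))
```

It is worth remembering how much ‘all other things being equal’ is doing in that calculation.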

Now, those golden rules:

**1. Draw a graph**

Why? First off, because correlation is heavily affected by rogue and outlier data. If you draw a graph, you’ll see the rogue points, and then you can decide whether they are valid or whether you should exclude them.

For example, if I add one dodgy point into the data above, like this:

then the correlation coefficient plummets from 0.99 to 0.87. You can imagine the mess that you get if your data includes several dodgy points. (Of course, don’t remove data if it really is valid – otherwise your model will just be wishful thinking, created by removing everything that doesn’t suit you.)
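You can try this for yourself. The sketch below uses invented points (not the data from the charts in this post) and compares the correlation before and after one rogue point is added; `pearson_r` is the same illustrative helper as before.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

# Invented, tidy data: roughly y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 15.9, 18.2, 20.0]
r_clean = pearson_r(xs, ys)

# Add a single dodgy point that sits well away from the line.
r_dirty = pearson_r(xs + [5], ys + [20.0])

print(round(r_clean, 3), round(r_dirty, 3))
```

One point out of eleven is enough to pull the coefficient down noticeably, which is why it pays to look at the plot first.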

Furthermore, this kind of correlation only tells you how well your data fits along a straight line. It could be a terrible fit for a straight line, but you might see immediately on plotting it that it fits a curve, zigzag, or something else. For a cautionary tale, see Anscombe’s quartet: all four of its datasets have a correlation coefficient of about 0.82 and the same line of best fit, even though when you look at them it is completely obvious that they express different situations.
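Here is a small made-up illustration of the same trap: points that sit perfectly on a curve, so the output is completely predictable from the input, yet the linear correlation coefficient comes out as zero. The data is invented and `pearson_r` is an illustrative helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

# A perfect U-shaped curve: y is exactly x squared.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(r_curve)
```

The number alone would suggest there is nothing to see; a graph would show the pattern at a glance.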

**2. Keep using your brain**

Once you’ve plotted your data, you need to keep thinking. For example, is there any ‘signal’ which could be masking the straight-line relationship you’re looking for, and which you could easily remove? (see here for an example of an easy-to-remove signal). Also, how are you going to interpret the relationship if you find one? Does one thing actually *cause* the other, or do they just happen at the same time – for example, does the CMO always gee up the sales team at the same time as a big advertising push? In that case it’s hard to know how much revenue is driven by advertising spend and how much by the sales team trying extra hard.

This ‘using your brain’ part is extra important when using clever-sounding measures like correlation coefficients – it’s tempting to just assume they must tell you something important because they seem clever and complicated. As we’ve seen above, there are a few ways in which this can catch you out (rogue data, weird patterns, coincidences), so it’s critical to keep a common-sense eye on what you seem to be finding, to ensure that it really does make sense for your business.