Yak shavingPosted: August 24, 2011
I just learned the term ‘yak shaving’. It’s not really clear to me if it’s usually a positive or negative thing, but today I want to talk about a good kind.
The definition of yak shaving from Urban Dictionary is:
“Any seemingly pointless activity which is actually necessary to solve a problem which solves a problem which, several levels of recursion later, solves the real problem you’re working on.”
Apparently it probably originated from Ren and Stimpy.
I’m not sure what they were planning to do with the yak afterwards.
Anyway, the type of yak shaving I want to talk about is data cleansing. And what I want to say is, CLEAN YOUR DATA! Yes, it’s not very exciting, but if your data is not clean, your life will be full of little irritations.
Why clean your data?
- Flexibility. If your data is not clean – for example, if there are odd data points with a different date format – then you can’t roll data up by different key fields like date. Your reporting is going to be brittle and hard to change or to check
- Accuracy. If data is dirty, either in the values it contains or in the formatting of those values, you’ll never be sure you can trust your reports and analyses, and you’ll probably have at least one incident where you make the wrong decision because of corrupt data.
- Freedom and business continuity. When data is dirty, the data owner is locked into it: they are the only one who knows how to use this brittle stuff carefully and make sure that only the good parts get used. If they fall under the hypothetical bus, then the company has a problem. If your data is good throughout, then the original owner is free to go.
How can you make data clean?
Basically, make everything as standard, simple, well-labelled, and atomic as possible. Keep the structure very rigid, and deal with missing or odd data points by adding notes fields or extending the structure rather than by just trying to remember or adjust for something that’s wrong.
Dates: use year-month-day format. This is understood by every person and every computer. Day-month-year creates ambiguity as soon as you talk to an American person or computer.
Number formats: store in the most basic format possible (e.g. 12.34). You can make the formatting pretty in the final report, but until then keep away from the $, %, integer-rounding, and so on.
Week numbers etc: don’t use these in your basic storage; every company does them a bit differently, and there’s every chance that one day your CEO will decide they want Wednesday-Tuesday instead of Monday-Sunday weeks, so you’re setting yourself up for a world of pain if you don’t store using a basic date format. You can store it at the basic level and then add a reference table for which date maps to which week number/ lunar cycle/ anything else.
Record sources and definitions: especially if you change data source or definition in the middle of a period, poor labelling can come back and bite you on the ass, hard. In the same place as you record the data itself, record where it comes from and what it is. Example: if a field says ‘revenue’, is it pounds, dollars, other?
One immutable law of business analysis is that eventually you’ll need to use the data in some way you never imagined. Past data is never totally obsolete – there’s always a good chance you’ll want to go back and use it for something. And at that point, nice, clean, well-structured data will suddenly be an awfully helpful thing to have. And dirty data will make you very unhappy. So get going, and shave that yak right away! You’ll thank yourself for it later.