Reading: Tidy Data
As part of the course Stat 585X I am taking this semester, I will be posting a series of responses to assigned course readings. Mostly these will be my rambling thoughts as I skim papers.
This week we are reading a pre-print of a paper called Tidy Data by Hadley Wickham. This paper lays out a framework for a standardized data structure to facilitate data analysis: tidy data. In the paper, Hadley lays out the ideas that underly his incredibly useful packages reshape2, plyr, and dplyr and calls for even better tools and data storage strategies to be built on top of the tidy data framework.
What is tidy data?
The ideas of tidy and messy data are somewhat difficult to define because we don’t currently have the language to precisely describe variables and individuals. It’s easy to identify when you see it, but writing out a definition can be squishy. Hadley uses the term tidy data to describe a data format in which each variable is a column, each row is an individual, and each kind of observational unit has its own table. This structure roughly corresponds to the 3rd normal form used in relational databases. The contribution of the paper, however, is to define 3rd normal form using the vocaulary of statistics (variables, observations, etc.).
What is messy data?
What’s the point?
The point of tidy data (and normal forms) is to house data in a way that is easy to analyze. There are many different ways to store the same data, but some ways are better than others. Some data organizations reduce redundancy (see the DRY principle). Some structures help identify missing data and data entry issues by maintaining consistency. Tidy data is a step in the direction of standardizing data storage in order to ensure data quality. As it was said in the paper, 80% of data analysis is spent on the process of cleaning and preparing the data. And while the 3rd normal form has been around since before the turn of the century, getting people to consistently use it and in the process help themselves out is another story.
Sometimes it takes a new name for an old standard to be adopted, and I’m all for data being easier to use when it’s handed off to me.
blog comments powered by Disqus