As part of the course Stat 585X I am taking this semester, I will be posting a series of responses to assigned course readings. Mostly these will be my rambling thoughts as I skim papers.
This week we are reading documentation on the datadr package by Ryan Hafen from February 2014. The tutorial lays out an introduction to the new datadr package as well as some background on the Divide and Recombine strategy that the package is built upon.
Divide and Recombine (or is it S-A-C?)
The divide and recombine (D-R) strategy is very similar to something we have talked about recently: the Split-Apply-Combine paradigm. In fact, the two seem almost identical. The D-R strategy is to take a dataset, split it in a logical way into many chunks (divide) and then fit them back together after applying some transformation or summary (recombine). Sound familiar? This is the same idea as the S-A-C paradigm, except recombine encompasses both the apply and combine steps.
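To make the analogy concrete, here’s a minimal base R sketch of split-apply-combine on the built-in mtcars data (my own toy example, not from the tutorial):

```r
# Split-apply-combine in base R: split a data frame into chunks,
# apply a summary to each chunk, then combine the results.
chunks <- split(mtcars, mtcars$cyl)                # split / divide
means  <- lapply(chunks, function(d) mean(d$mpg))  # apply
do.call(rbind, means)                              # combine / recombine
```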
Birds of a feather…
So we have a package (datadr) based on the D-R strategy and multiple packages (plyr, dplyr) based on the S-A-C paradigm, which are essentially the same thing. What’s the difference between the packages? I would say that in some ways datadr is more similar to dplyr than to plyr, for a couple of reasons.
First, in datadr, divisions are data objects that persist. The authors designed it this way because, with very large data, computing a division is expensive compared to recombination. This is similar to dplyr, where group_by() creates a grouped object that persists. Often we will chain operations with the %>% operator, but if splitting is an expensive operation and we may want to reuse the splits with different apply steps, keeping the splits around makes sense.
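Here’s a quick sketch of what I mean, again on toy data rather than the census data below:

```r
library(dplyr)

# The grouped object persists, so the (potentially expensive) split
# happens once and can be reused with different summaries.
by_cyl <- group_by(mtcars, cyl)

summarise(by_cyl, avg_mpg = mean(mpg))
summarise(by_cyl, avg_hp  = mean(hp))
```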
Second, both datadr and dplyr expose the same interface over different backends. For example, dplyr is easily used with data in R memory or with data in a MySQL, SQLite, or PostgreSQL database. Similarly, datadr can be used with data in R memory or, through RHIPE, with data stored in the Hadoop Distributed File System (HDFS) and processed with Hadoop MapReduce. Since datadr is already compatible with Hadoop, it would appear to have a leg up in being able to handle truly huge datasets.
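For instance, pointing dplyr at a SQLite database looks almost the same as working in memory. A sketch, assuming a hypothetical database file and table (and noting that dplyr’s database interface may change between versions):

```r
library(dplyr)

# Same verbs, different backend. "my_db.sqlite3" and the "adult"
# table are hypothetical placeholders here.
db        <- src_sqlite("my_db.sqlite3")
adult_tbl <- tbl(db, "adult")

adult_tbl %>%
  group_by(education) %>%
  summarise(avg_hours = mean(hours_per_week))
```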
However, datadr uses key-value pairs as its underlying data structure, which can be arranged in list form, while dplyr is limited to data frames for input and output. In this way datadr has some similarity to the plyr package, which also handles lists.
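Roughly, a datadr division is a list of key-value pairs, where the key identifies a subset and the value holds that subset’s data. A sketch on the built-in iris data (my own example):

```r
library(datadr)

# A division is conceptually a list of key-value pairs.
kv <- list(
  list("setosa",     subset(iris, Species == "setosa")),
  list("versicolor", subset(iris, Species == "versicolor"))
)

# ddo() wraps a list of key-value pairs as a distributed data object.
irisDdo <- ddo(kv)
```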
Side-by-side example
Let’s take a look at an example of using both datadr and dplyr to accomplish the same task. The data is adult income from the 1994 census database, pulled from the [UCI machine learning repository](http://archive.ics.uci.edu/ml/datasets/Adult).
We’ll start with the datadr package.
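Something along these lines; note that the raw UCI file (adult.data) has no header row, so the column names below are ones I’ve assigned:

```r
library(datadr)

# Read the raw census extract; supply column names manually.
adult <- read.csv("adult.data", header = FALSE, strip.white = TRUE,
  col.names = c("age", "workclass", "fnlwgt", "education",
    "education_num", "marital_status", "occupation", "relationship",
    "race", "sex", "capital_gain", "capital_loss", "hours_per_week",
    "native_country", "income"))
```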
Let’s group the grade levels together where appropriate and remove Preschool. We can make these changes to the education variable within the divide() function. Of course, in this case we could make them to the original data directly, but this is an illustration: for very large data, it would make more sense (read: be possible) to apply any transformations during the division.
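A sketch of how that might look; divide() takes a preTransFn argument that is applied to each chunk before the division, and the exact recombine() idiom may differ across datadr versions:

```r
# Collapse the individual grade levels into one category and drop
# Preschool as part of the division itself, via preTransFn.
grades <- c("1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th")

regroup <- function(x) {
  x$education <- as.character(x$education)
  x$education[x$education %in% grades] <- "grade-school"
  subset(x, education != "Preschool")
}

byEdu <- divide(adult, by = "education", preTransFn = regroup)

# Summarise each division, then stitch the pieces back together
# row-wise with combRbind.
meanHours  <- addTransform(byEdu, function(x) mean(x$hours_per_week))
eduSummary <- recombine(meanHours, combine = combRbind)
```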
Now let’s do the same thing with dplyr.
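A sketch of the dplyr equivalent, using the same hypothetical grade-school regrouping as above:

```r
library(dplyr)

eduSummaryDplyr <- adult %>%
  mutate(education = as.character(education)) %>%
  filter(education != "Preschool") %>%
  mutate(education = ifelse(education %in% grades,
                            "grade-school", education)) %>%
  group_by(education) %>%
  summarise(mean_hours = mean(hours_per_week))
```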
As you can see, the two are pretty similar in idea and in execution. We can also check the speed of each method.
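A crude way to do this is to wrap each pipeline in system.time() (wall-clock numbers will of course vary by machine):

```r
# Time the datadr version.
system.time({
  byEdu <- divide(adult, by = "education", preTransFn = regroup)
  recombine(addTransform(byEdu, function(x) mean(x$hours_per_week)),
            combine = combRbind)
})

# Time the dplyr version.
system.time({
  adult %>%
    mutate(education = as.character(education)) %>%
    filter(education != "Preschool") %>%
    mutate(education = ifelse(education %in% grades,
                              "grade-school", education)) %>%
    group_by(education) %>%
    summarise(mean_hours = mean(hours_per_week))
})
```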
datadr took 0.9 seconds, while dplyr took 0.22 seconds. That’s a 75.56% speedup over datadr in this example.
Takeaways
It seems like both datadr and dplyr have their benefits. Of course, this example is almost a joke because everything is small enough to fit in memory. I’d like to test these packages on a much bigger dataset and see what I think after that. The ability to use Hadoop with datadr is attractive, although I don’t anticipate ever having my hands on data big enough to use it to its fullest capability. Both are interesting packages built on a sound principle, and I look forward to using both more.