dplyr looks like a fantastic initiative. Congratulations to Hadley and Romain.
The recent CRAN release of dplyr showcased a benchmarking vignette using baseball data. Following that, we noticed this exchange in the comments section of the RStudio blog post:
Comment from a reader: From the introduction vignette: “dplyr also provides data.table methods for all verbs. While data.table is extremely, fast, the current benchmarks suggest that dplyr is 2-3x faster for most single operations, and up to 10x faster for grouped summaries (see the benchmark-baseball vignette for more details).” This is really really really exciting. I can’t wait to try this package out!! Yay for Hadley and collaborators!!!!
Hadley: Ooops, I forgot to remove that statement from the vignette. That was based on some initial benchmarks when I didn’t understand data.table that well. I’d now claim that dplyr and data.table are pretty similar for most operations. Sometimes data.table is faster, sometimes dplyr is faster.
Hadley had asked us in advance what we thought of the baseball vignette and we had several rounds of emails to correct it. But unfortunately due to an oversight the vignette went to CRAN unchanged.
Hadley: It was an oversight, and I’ve filed a bug to remind me to fix it for the next version.
Since the vignette is on CRAN, though, and the comparison is to data.table, and since our users have been asking us about that comparison, we feel it is appropriate to respond. Therefore, for the sake of completeness, here are the benchmarks for the relevant cases.
Note that we perform the benchmarks with data.table 1.8.10, as currently on CRAN. Many improvements are in 1.8.11, which we ignore here.
Note 1: An important point is that dplyr's group_by should be included in the benchmarks, as it is an essential precursor for grouped operations in dplyr, whereas setkey is not mandatory for grouping in data.table. If setkey is to be used in the analysis, of course it must be timed as well.
Note 2: A common method to benchmark big data is to run system.time 3 times and choose the minimum (to report timings without effects of cache). However, the Batting data size here is so small (10MB) that it almost fits in cache! Relative speed at such small granularity seems utterly pointless to us, but people may believe it is informative. We'll therefore use microbenchmark as well on the same 10MB dataset. A short untimed sketch of both notes follows the setup code below.
So, let's start with the initial setup as follows. Note that we don't pre-calculate group_by operations here.
require(Lahman)
require(dplyr)
require(data.table)
require(microbenchmark)
batting_df = tbl_df(Batting) # for dplyr's data.frame
batting_dt = tbl_dt(Batting) # for dplyr's data.table
DT = data.table(Batting) # for native data.table
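To make the two notes concrete, here is a short untimed sketch (illustration only; none of this is part of the reported timings) using the objects just created:
# Note 1: data.table groups directly in j with by, no setkey() needed, whereas
# dplyr's summarise() requires a prior group_by().
DT[, list(ab = mean(AB)), by=playerID]
summarise(group_by(batting_df, playerID), ab = mean(AB))
# Note 2: the "minimum of 3 system.time() runs" convention; we report
# microbenchmark results below instead.
min(replicate(3, system.time(DT[, list(ab = mean(AB)), by=playerID])["elapsed"]))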
Compute the average number of at bats (AB) for each player:
microbenchmark(
dplyr_df = group_by(batting_df, playerID) %.% summarise(ab = mean(AB)),
dplyr_dt = group_by(batting_dt, playerID) %.% summarise(ab = mean(AB)),
dt_raw = DT[, list(ab=mean(AB)), by=playerID],
times = 100)
Unit: milliseconds
expr min lq median uq max neval
dplyr_df 37.51670 38.61942 39.36357 41.69157 109.1514 100
dplyr_dt 68.30773 75.87789 79.31867 135.32441 162.7321 100
dt_raw 38.22706 39.66757 43.07817 45.80393 117.2201 100
It shows a very different picture now. The time to group_by was excluded in the vignette, but summarise can't run without group_by having been run first.
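For reference, here is a hypothetical reconstruction (our sketch, not the vignette's exact code) of what the vignette effectively timed, with group_by run once beforehand so that its cost falls outside the measurement:
# Hypothetical reconstruction: the grouped object is created outside the
# timing, so only summarise() is measured.
players_df <- group_by(batting_df, playerID)
microbenchmark(
  dplyr_df = summarise(players_df, ab = mean(AB)),
  times = 100)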
Arrange by year within each player:
microbenchmark(
dplyr_df = arrange(batting_df, playerID, yearID),
dplyr_dt = arrange(batting_dt, playerID, yearID),
dt_raw = setkey(copy(DT), playerID, yearID),
times = 100
)
Unit: milliseconds
expr min lq median uq max neval
dplyr_df 58.35734 63.06669 64.62387 66.86399 133.6722 100
dplyr_dt 610.86968 627.17217 676.60503 689.93259 723.6959 100
dt_raw 29.16627 32.27533 33.64683 36.75697 106.7013 100
Once again, a very different picture. This is because in the benchmarks from the dplyr vignette, the ordering by playerID was already done in players_dt, so the ordering there only had to be done on yearID for the dplyr_* cases! Also, dplyr_dt performs much worse here than data.table run directly; Hadley has said that he considers such differences to be bugs in dplyr.
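For reference, here is a hypothetical reconstruction (our sketch, not the vignette's exact code) of the setup that benchmark relied on, assuming players_df and players_dt were created up front so that the data behind players_dt is already ordered by playerID:
# Hypothetical reconstruction: the grouped objects are built outside the
# timing; group_by on the data.table backend keys the data by playerID, so
# the timed arrange() effectively only has to order by yearID within playerID.
players_df <- group_by(batting_df, playerID)
players_dt <- group_by(batting_dt, playerID)
microbenchmark(
  dplyr_df = arrange(players_df, playerID, yearID),
  dplyr_dt = arrange(players_dt, playerID, yearID),
  times = 100
)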
Find the year for which each player played the most games:
microbenchmark(
dplyr_df = group_by(batting_df, playerID) %.% filter(G == max(G)),
dplyr_dt = group_by(batting_dt, playerID) %.% filter(G == max(G)),
dt_raw = DT[DT[, .I[G==max(G)], by=playerID]$V1],
times = 100
)
Unit: milliseconds
expr min lq median uq max neval
dplyr_df 84.75893 88.29261 90.86771 96.03549 181.3004 100
dplyr_dt 105.82635 112.10389 116.55127 124.82832 202.0836 100
dt_raw 68.33252 73.67968 74.88609 77.91639 165.2599 100
Because the data.table expression for dt_raw is not a single command, Hadley chose not to benchmark it. Once again, it shows a different picture.
The benchmark in the dplyr vignette is incorrect here because it doesn't add a column (it doesn't use := at all); rather, it aggregates (the contrast is sketched below). We'll highlight just one case here: rank.
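To make that distinction concrete, here is a minimal untimed contrast (our illustration, using the DT object from the setup above):
# Aggregation, which is what the vignette's benchmark does per the text above,
# collapses to one row per group; := (like dplyr's mutate) adds a column and
# keeps all rows.
DT[, list(mean_ab = mean(AB)), by=playerID]   # aggregate: one row per playerID
DT[, mean_ab := mean(AB), by=playerID]        # add a column by reference: all rows kept
DT[, mean_ab := NULL]                         # remove the added column again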
microbenchmark(
dplyr_df = group_by(batting_df, playerID) %.% mutate(rank = rank(desc(AB))),
dplyr_dt = group_by(batting_dt, playerID) %.% mutate(rank = rank(desc(AB))),
dt_raw = DT[, rank := rank(desc(AB)), by=playerID],
times = 10
)
# remove rank col from DT
DT[, rank := NULL]
Unit: seconds
expr min lq median uq max neval
dplyr_df 1.173205 1.183997 1.195302 1.266843 1.518752 10
dplyr_dt 1.205231 1.255513 1.262523 1.279185 1.328591 10
dt_raw 1.101572 1.111967 1.120616 1.130896 1.155168 10
This is a bit hard to benchmark because the philosophies of dplyr and data.table are quite different. data.table is designed for adding columns by reference in place (therefore columns are over-allocated during data.table creation), but dplyr doesn't like doing this. However, when benchmarking, one should benchmark the equivalent of an operation in each tool, not how one thinks the design should be.
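As a small aside on the over-allocation mentioned above, here is an illustrative check (see ?truelength in data.table for details):
# data.table over-allocates column pointer slots when a table is created, so
# that := can later add columns by reference without copying the whole table.
length(DT)       # number of columns currently in DT
truelength(DT)   # column slots allocated, typically larger than length(DT)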
But if you insist, this can be accomplished the dplyr way, using data.table's shallow function which, like dplyr's mutate, shallow copies the data. Here's the benchmark:
microbenchmark(
dplyr_df = group_by(batting_df, playerID) %.% mutate(rank = rank(desc(AB))),
dplyr_dt = group_by(batting_dt, playerID) %.% mutate(rank = rank(desc(AB))),
dt_raw = data.table:::shallow(DT)[, rank := rank(desc(AB)), by=playerID],
times = 10
)
Unit: seconds
expr min lq median uq max neval
dplyr_df 1.165308 1.179277 1.214480 1.269659 1.300810 10
dplyr_dt 1.183548 1.210675 1.257438 1.270460 1.275579 10
dt_raw 1.106987 1.109709 1.120783 1.137172 1.192628 10
So, this illustrates once again the methods that are available in data.table.
The microbenchmark results are not that different between dplyr and data.table here; the 2-3x faster and 10x faster figures were an oversight. We'll propose a range of benchmarks on large data in a future note, once the next version of data.table has been released to CRAN.