Jan 28, 2016

Simple aggregation performance comparison: R data.tables vs Python pandas

Motivation

Both R and python are popular tools for data scientist and statisticians. The well-known package data.tables in R and pandas in Python make the data manipulation task much easier and more efficient. Both of them are nowadays well adopted by more and more statistics and data science tools and packages. Most of popular machine learning algorithms are available in both R version and python version.

There is really no way to compare these two tools because the choices that we make really depend our own projects, coding environment, coding habits, tools availability, and even personal preference.
This comparison is just for my own curiosity.

Particularly, I am more interested in their aggregation performance. This is because I often have to run some model to provide predictions for each individual, and the prediction is typically based on the data of each individual. So group and aggregation would be typical operations for me:
  1. Split the big data into multiple pieces each of which presents an individual
  2. Run my own model against each small piece of data
  3. Put the prediction for each individual together in a large data set.
 Both data.tables and pandas can make such job easy, and I observe there is a significant performance difference between the two tools when performing such task. Here I just provide a very simple comparison to highlight the difference.

Test

The test procedure is pretty straightforward:
  1. Create a table with 100K row, and each row has a unique ID (this is not part of test but the data preparation).
  2. Expend each record to 100 rows (totally 10m), and create the 2nd column (called seq)
  3. Calculate the average value of seq for each ID.
The 3rd operation is a typical aggregation to a grouped data. The reason I create the 2nd test step is that, quite often, the output of a individual prediction is not only a value or a flag, but also series of statistics, such as scores, p-values, confidence intervals, and even predicted time series. In other words, the output is another data frame, and the program should be able to concatenate them into a large data set efficiently.

Here is the testing environment:
  • data.tables 1.9.6 with R 3.2.2 (64 bits)
  • pandas 0.16.2 with Python 2.7.10 |Anaconda 2.3.0 (64-bit)

Code and Result

R/data.tables:
> a = data.table(ID = 1:100e3)
> 
> system.time({ b = a[, .(Seq = ID*100+1:100), keyby = ID] })
   user  system elapsed 
   0.28    0.08    0.36 
> system.time({ b = a[, .(Seq = ID*100+1:100), keyby = ID] })
   user  system elapsed 
   0.27    0.03    0.29 
> system.time({ b = a[, .(Seq = ID*100+1:100), keyby = ID] })
   user  system elapsed 
   0.20    0.01    0.21 
> 
> system.time({ c = b[, .(m = mean(Seq)), keyby = ID] })
   user  system elapsed 
   0.10    0.02    0.11 
> system.time({ c = b[, .(m = mean(Seq)), keyby = ID] })
   user  system elapsed 
   0.09    0.00    0.09 
> system.time({ c = b[, .(m = mean(Seq)), keyby = ID] })
   user  system elapsed 
   0.10    0.00    0.09 

Python/pandas:

import pandas as pd
import numpy as np
import math

def f(d, k):

    return(pd.DataFrame({'Seq': np.arange(k)}))

a = pd.DataFrame({'ID': np.arange(100e3)})
b = a.groupby(a['ID']).apply(f, 100).reset_index()
c = b.groupby(b['ID'])[['Seq']].mean()



In [25]: %timeit b = a.groupby(a['ID']).apply(f, 100).reset_index()
...: %timeit c = b.groupby(b['ID'])[['Seq']].mean()
1 loops, best of 3: 26.7 s per loop

1 loops, best of 3: 421 ms per loop


Clearly, data.tables is more efficient.

One interesting thing is that when I do the 3rd test step (c = b.groupby(b['ID'])[['Seq']].mean()), I use ID as a pandas Series column instead of as an index object. This is because, I found that as the number of operating rows increases, using columns/Series is more efficient than using index. Here is the same test using index:

In [28]: b = a.groupby(a['ID']).apply(f, 100)
In [29]: %timeit c = b.groupby(b.index).mean()
1 loops, best of 3: 29 s per loop



Other Notes

If coding the 2nd testing step in python, we can actually output a 1D series, and do the transpose (stack). This is more efficient than the code above:

def f2(d, k):
    return pd.Series(np.arange(k))
In [32]: %timeit b = a.groupby(a['ID']).apply(f2, 100).stack()
1 loops, best of 3: 11 s per loop


However, we can do this only if the output for each individual record can be written in a 1D array object during the aggregation. If we have to use data frame to store the output for each split of data, the efficiency would be significantly impacted.