Motivation
Both R and Python are popular tools for data scientists and statisticians. The well-known data.table package in R and pandas in Python make data manipulation much easier and more efficient, and both are now widely adopted by statistics and data science tools and packages. Most popular machine learning algorithms are available in both an R and a Python version. There is really no general way to compare the two tools, because the choice depends on our own projects, coding environment, coding habits, tool availability, and even personal preference.
This comparison is just for my own curiosity.
In particular, I am most interested in their aggregation performance. I often have to run a model that produces a prediction for each individual, and each prediction is typically based on that individual's own data. So grouping and aggregation are typical operations for me:
- Split the big data set into multiple pieces, each of which represents an individual.
- Run my own model against each small piece of data.
- Put the predictions for all individuals together in a large data set (a minimal sketch of this workflow follows the list).
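For illustration, here is a minimal sketch of that split-apply-combine workflow in pandas. The model function and column names below (predict_for_individual, value, prediction) are hypothetical placeholders, not part of the tests that follow:
import pandas as pd

def predict_for_individual(d):
    # Placeholder "model": a real model would fit/score on d's columns.
    return pd.Series({'prediction': d['value'].mean()})

# Toy data: a few observations per individual ID.
dat = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3],
                    'value': [0.5, 1.5, 2.0, 4.0, 6.0, 3.0]})

# Split by ID, run the model on each piece, and combine the predictions.
pred = dat.groupby('ID').apply(predict_for_individual).reset_index()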
Test
The test procedure is pretty straightforward:
- Create a table with 100K rows, each with a unique ID (this is data preparation, not part of the test).
- Expand each record to 100 rows (10M rows in total), and create a 2nd column (called Seq).
- Calculate the average value of Seq for each ID.
Here is the testing environment:
- data.table 1.9.6 with R 3.2.2 (64-bit)
- pandas 0.16.2 with Python 2.7.10 | Anaconda 2.3.0 (64-bit)
Code and Result
R/data.table:
> a = data.table(ID = 1:100e3)
>
> system.time({ b = a[, .(Seq = ID*100+1:100), keyby = ID] })
   user  system elapsed
   0.28    0.08    0.36
> system.time({ b = a[, .(Seq = ID*100+1:100), keyby = ID] })
   user  system elapsed
   0.27    0.03    0.29
> system.time({ b = a[, .(Seq = ID*100+1:100), keyby = ID] })
   user  system elapsed
   0.20    0.01    0.21
>
> system.time({ c = b[, .(m = mean(Seq)), keyby = ID] })
   user  system elapsed
   0.10    0.02    0.11
> system.time({ c = b[, .(m = mean(Seq)), keyby = ID] })
   user  system elapsed
   0.09    0.00    0.09
> system.time({ c = b[, .(m = mean(Seq)), keyby = ID] })
   user  system elapsed
   0.10    0.00    0.09
Python/pandas:
import pandas as pd
import numpy as np

def f(d, k):
    # For each group, return a k-row DataFrame holding the new Seq column.
    return pd.DataFrame({'Seq': np.arange(k)})

a = pd.DataFrame({'ID': np.arange(100e3)})           # 100K unique IDs
b = a.groupby(a['ID']).apply(f, 100).reset_index()   # expand each ID to 100 rows
c = b.groupby(b['ID'])[['Seq']].mean()               # mean of Seq for each ID
In [25]: %timeit b = a.groupby(a['ID']).apply(f, 100).reset_index()
...: %timeit c = b.groupby(b['ID'])[['Seq']].mean()
1 loops, best of 3: 26.7 s per loop
1 loops, best of 3: 421 ms per loop
Clearly, data.table is more efficient for this task.
One interesting thing: in the 3rd test step (c = b.groupby(b['ID'])[['Seq']].mean()), I use ID as a pandas Series (a column) instead of as an index object. This is because I found that, as the number of rows being operated on increases, grouping by a column/Series is more efficient than grouping by the index. Here is the same test using the index:
In [28]: b = a.groupby(a['ID']).apply(f, 100)
In [29]: %timeit c = b.groupby(b.index).mean()
1 loops, best of 3: 29 s per loop
Other Notes
If we code the 2nd test step in Python, we can instead output a 1D Series for each group and then stack the result. This is more efficient than the code above:
def f2(d, k):
    return pd.Series(np.arange(k))
In [32]: %timeit b = a.groupby(a['ID']).apply(f2, 100).stack()
1 loops, best of 3: 11 s per loop
However, we can only do this if the output for each individual can be represented as a 1D array during the aggregation. If we have to use a DataFrame to store the output for each split of the data, efficiency is significantly impacted.
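If the stacked 1D output does need to end up back in a flat table (as in the last step of the workflow in the Motivation section), one option is to reset the MultiIndex after stacking. This is only a minimal sketch, assuming the same a and f2 as above and that pandas expands the per-group Series as shown earlier; the column name pos for the within-ID position is chosen purely for illustration:
b = a.groupby(a['ID']).apply(f2, 100).stack().reset_index()  # flatten the MultiIndex into columns
b.columns = ['ID', 'pos', 'Seq']                             # illustrative column names
c = b.groupby(b['ID'])[['Seq']].mean()                       # same 3rd step as before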