Jul 12, 2015

A summary of dplyr and data.table

In R, both the dplyr package and the data.table package are powerful tools for manipulating data. dplyr is an enhanced version of the plyr package, and data.table is becoming more and more popular for handling large datasets. Several comparisons can be found online, and they all indicate that when there is a small number of groups in the data, dplyr and data.table achieve comparable efficiency; when there is a large number of groups, data.table outperforms dplyr.

Here, I would like to give a quick summary of the commonly used functions in the two packages. My personal feeling is that I like dplyr better, because its interface is more consistent and it provides some convenience functions for picking the columns to which I want to apply functions (I will talk about this in detail later).



The comparison below is organized by task: for each task, the dplyr code is shown first, followed by the data.table equivalent.
Select columns
Select columns by name

dplyr:
data = select(mtcars, mpg, cyl, disp, wt)

var_name = c('mpg', 'cyl', 'disp', 'wt')
data = select_(mtcars, .dots = var_name)

data = select(mtcars, starts_with("m"))
data = select(mtcars, ends_with("t"))
data = select(mtcars, contains("s"))
# select columns whose names start with, end with, or contain some keyword. This is the part where I like dplyr slightly better. You can do such things in data.table as well, but it takes a little more coding effort.

data.table:
mtcars.dt = data.table(mtcars)
var_name = c('mpg', 'cyl', 'disp', 'wt')

data.dt = mtcars.dt[, .(mpg, cyl, disp, wt)]
data.dt = mtcars.dt[, var_name, with = FALSE]
# with = FALSE allows you to do normal data.frame operations in data.table.
Select columns by index

dplyr:
data = mtcars[, 2:5]

data.table:
data.dt = mtcars.dt[, 2:5, with = FALSE]
Derived columns

dplyr:
data = mutate(mtcars,
  x = mpg/cyl,
  mpg_new = x*cyl,
  mpg_mean = mean(mpg)
)
# a newly added column can be used in later calculations

data.table:
data.dt = mtcars.dt[, x := mpg/cyl]
# add a new column called x

data.dt = mtcars.dt[, `:=`(x = mpg/cyl, mpg_mean = mean(mpg))][, mpg_new := x*cyl]
# add two new columns, and use the chaining operator to add a third
Select rows
Select rows by condition

Comment: selecting rows by condition in data.table is sometimes not straightforward.

dplyr:
data = filter(mtcars, cyl==4, mpg>=30) # AND relation
data = filter(mtcars, cyl==4 | mpg>=30) # OR relation
data = filter(mtcars, cyl %in% c(4, 6))

data.table:
data.dt = mtcars.dt[cyl==4 & mpg>=30]
data.dt = mtcars.dt[cyl==4 & vs==1]
# you don't have to put a comma for the second dimension

setkey(mtcars.dt, cyl, vs)
data.dt = mtcars.dt[list(4, 1)]
data.dt = mtcars.dt[.(4, 1)]
# you have to use a list to select the rows with cyl==4 and vs==1; .() is the same as list()

data.dt = mtcars.dt[cyl %in% c(4,6)]
data.dt2 = mtcars.dt[J(c(4,6))]
# the same as above: cyl is either 4 or 6. You must use J() or .(); without it, the expression selects the 4th and 6th rows instead (see below)

data.dt = mtcars.dt[cyl %in% c(4,6) & vs==1]
# cyl is 4 or 6, and vs==1
data.dt2 = mtcars.dt[.(c(6,4), 1)]
# the same as above; .() is the same as J()

Select rows by index

dplyr:
# straightforward

data.table:
data.dt = mtcars.dt[c(6,4)]
# the 6th and 4th rows, NOT cyl==4 or 6
Aggregation functions
Aggregate n values to 1 value

dplyr:
myfun = function(x, y) return(mean(x/y))

data = mtcars %>% group_by(vs) %>%
  summarise(
    avg_mpg = mean(mpg),
    avg_mpg_per_cyl = mean(mpg/cyl),
    avg_mpg_per_cyl2 = myfun(mpg, cyl),
    row_cnt = n()
  )

data.table:
myfun = function(x, y) return(mean(x/y))

data.dt = mtcars.dt[, list(
  avg_mpg = mean(mpg),
  avg_mpg_per_cyl = mean(mpg/cyl),
  avg_mpg_per_cyl2 = myfun(mpg, cyl),
  row_cnt = .N), by = "vs"]

# you have to use list() to wrap all the aggregation functions.
Window function (n values to n values)

dplyr:
data = mtcars %>% group_by(cyl) %>%
  mutate(
    lag_mpg = lag(mpg),
    mpg_rnk = min_rank(desc(mpg))
  ) %>% arrange(cyl, mpg)

var_name = c('mpg', 'cyl', 'disp', 'wt')

data = mtcars %>% group_by(cyl) %>%
  mutate_each_(funs(lag), var_name) %>%
  arrange(cyl)
# apply one function to multiple columns; this replaces the original columns with their lag values.

data = mtcars %>% group_by(cyl) %>%
  mutate_each_(funs(lag, mean), var_name) %>% arrange(cyl)
# apply two functions to multiple columns; this CREATES additional columns for lag and mean, and the original columns are kept.

data.table:
mtcars.dt = data.table(mtcars)
setkey(mtcars.dt, cyl)
# you don't have to setkey before using 'by='

mtcars.dt = mtcars.dt[, mpg_lag := c(NA, mpg[-.N]), by = cyl]
# the data.table way to write a lag function

mtcars.dt = data.table(mtcars)
lag_var_name = paste(var_name, "_lag", sep = "")
mtcars.dt = mtcars.dt[, (lag_var_name) := lapply(.SD, function(x){c(NA, x[-.N])}), by = cyl, .SDcols = var_name]
# you must use (lag_var_name) instead of lag_var_name; with lag_var_name := xxxx, it would create a single column called "lag_var_name"
More general (n to m values)

dplyr:
# I don't think there is a direct way to do that. With mutate() or summarise(), the output must have n values or one value per group, respectively. However, you can write a function that returns a list, and dplyr will create a column in which each cell contains a list.

half_list = function(l, na.last = TRUE, ties.method = "first"){
    r = rank(l, na.last = na.last,
             ties.method = ties.method)
    return(l[r <= length(r)/2])
}

data = mtcars %>% group_by(cyl) %>%
  do(min.50pct.mpg = half_list(.$mpg)) %>%
  mutate(list.len = length(min.50pct.mpg))

data.table:
half_list = function(l, na.last = TRUE, ties.method = "first"){
    r = rank(l, na.last = na.last,
             ties.method = ties.method)
    return(l[r <= length(r)/2])
}

data.dt = mtcars.dt[,
    .(min.50pct.mpg = half_list(mpg),
      max.50pct.mpg = -half_list(-mpg)),
    by = cyl
]

# data.table is "smarter" here: it merges the newly created lists back to the by-variables of the original data table. If multiple lists are created, they must have the same length; otherwise, the shorter lists are recycled.
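To see the merging and recycling behavior described above, here is a minimal sketch; the toy table dt and its columns are made up for illustration:

```r
library(data.table)

# toy table, invented for this example: one group "a" with four rows
dt = data.table(g = c("a", "a", "a", "a"), x = 1:4)

# 'long' has 4 values per group, 'short' has 2; data.table recycles the
# shorter vector to match the longer one (its length must divide evenly)
out = dt[, .(long = x, short = x[1:2]), by = g]
```

Here out has four rows, with short recycled to c(1, 2, 1, 2) alongside long = 1:4.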

Jun 17, 2015

group_by() and grouped_df

The dplyr package is really a must-have tool for manipulating data. It provides lots of functions close to traditional SQL. If you are a SQL person, you may like this package very much after getting familiar with functions such as inner_join, left_join, semi_join, and anti_join.
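As a quick sketch of those join verbs, using two toy tables invented for this example (orders and payments are not from any real dataset):

```r
library(dplyr)

# toy tables, invented for illustration
orders   = data.frame(id = c(1, 2, 3), cust = c("a", "b", "c"))
payments = data.frame(id = c(1, 3, 4), amt  = c(10, 30, 40))

inner_join(orders, payments, by = "id")  # ids 1 and 3: matched in both tables
left_join(orders, payments, by = "id")   # all orders; amt is NA for id 2
semi_join(orders, payments, by = "id")   # orders that have a payment; no amt column added
anti_join(orders, payments, by = "id")   # orders without a payment (id 2)
```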

group_by is another commonly used function in dplyr. Combining group_by with n_distinct(), n(), min(), max(), and median() achieves the GROUP BY functionality of SQL.

One point worth mentioning is that group_by() returns a data type that is essentially a data frame (or data table, depending on your input) but behaves slightly differently when combined with other dplyr functions; it is most likely a grouped_df or grouped_dt. The difference is that when you apply a dplyr function to a grouped_df (or grouped_dt), the function applies to each group, i.e., within the group. For example, applying arrange() to a grouped_df sorts the rows within each group, instead of over the whole dataset. If you want to sort the whole dataset, you need to convert the grouped_df to a regular data frame using ungroup().
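A minimal sketch of that difference, assuming the grouped arrange() behavior of the dplyr version current at the time of writing:

```r
library(dplyr)

g = mtcars %>% group_by(cyl)
class(g)  # a grouped_df, still essentially a data frame

# arrange() on a grouped_df sorts within each cyl group
g %>% arrange(desc(mpg))

# ungroup() first to sort over the whole dataset
g %>% ungroup() %>% arrange(desc(mpg))
```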

You can find a good intro about dplyr here.

Jul 1, 2014

Aggregate() in R: what if the applied function returns a list?

In R, one of the keys to improving efficiency in data manipulation is to avoid for loops. When we need to apply the same function to all the columns in a data frame, functions like lapply, by, and aggregate are very useful for eliminating for loops. Here is a good summary of these different 'apply-like' functions.
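For a flavor of these functions, the same per-group mean can be computed several ways; the tiny data frame d below is made up for illustration:

```r
# toy data frame, invented for this example
d = data.frame(G = c('F', 'F', 'M'), X1 = c(1, 3, 5))

lapply(split(d$X1, d$G), mean)   # a list: F = 2, M = 5
by(d$X1, d$G, mean)              # a by object with the same values
aggregate(X1 ~ G, d, mean)       # a data frame with columns G and X1
```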

The aggregate function computes user-specified summary statistics for each column in the data frame, by segment. Here is a simple example:

Dat = data.frame(G = c(rep('F', 4), rep('M', 6)), X1 = 10:1, X2 = 1:10)
Dat.mean = aggregate(.~G, Dat, mean)

> Dat.mean
  G  X1  X2
1 F 8.5 2.5
2 M 3.5 7.5 

It gives the mean of X1 and X2 by the G variable, and the output is a data frame. The applied function (mean) in this case returns a scalar value for each group of X1 and X2 respectively. This is much like an aggregation in SQL:

select G, avg(X1) as X1_mean, avg(X2) as X2_mean
from Dat group by G

This works well as long as the function passed to aggregate returns a scalar value. What if the applied function returns a list? For example, suppose we have a very simple moving average function:

ma = function(x, n) list(as.numeric(filter(x, rep(1/n, n), sides = 1)))
# note: filter() here is stats::filter (a moving-average filter), not dplyr::filter

Dat.ma = aggregate(.~G, Dat, ma, 3)

Let's see the output data.

> class(Dat.ma)
[1] "data.frame"
> dim(Dat.ma)
[1] 2 3

The output is still a 2*3 data frame. Our moving-average time series actually sit, as lists, in the entries of the output data frame:

> Dat.ma[1,]
  G           X1           X2
1 F NA, NA, 9, 8 NA, NA, 2, 3
> Dat.ma[[1,2]]
[1] NA NA  9  8
> class(Dat.ma[1,2])
[1] "list"
> class(Dat.ma[[1,2]])
[1] "numeric" 
> length(Dat.ma[[1,2]])
[1] 4

However, the nested data frame is sometimes not convenient for further data manipulation. Can we reorganize it into a format close to the original (non-nested) data frame? The do.call and by functions can help us!

Let's take a look at the following function, which reorganizes the output data frame (agg.out) from the aggregate function:

agg.output.df = function(agg.out){
  df = function(x){
    r = do.call(data.frame, x)  # expand the list entries of one row into a data frame
    colnames(r) = colnames(x)
    r
  }
  r = by(agg.out, 1:nrow(agg.out), df)  # apply df to each row, giving a list of data frames
  do.call(rbind, r)                     # stack them into one data frame
}

The do.call function takes one row of agg.out and organizes it into a data frame. The by function applies this to each row, generating a list of data frames, and do.call(rbind, r) then combines all the data frames into one. Here is the output:

> Dat.ma.df = agg.output.df(Dat.ma)
> Dat.ma.df
    G X1 X2
1.1 F NA NA
1.2 F NA NA
1.3 F  9  2
1.4 F  8  3
2.1 M NA NA
2.2 M NA NA
2.3 M  5  6
2.4 M  4  7
2.5 M  3  8
2.6 M  2  9
> dim(Dat.ma.df)
[1] 10  3

Combining these 'apply-like' functions with do.call can make our data manipulation jobs easier and more efficient.