Jul 1, 2014

aggregate() in R: what if the applied function returns a list?

In R, one of the keys to efficient data manipulation is avoiding for loops. When we need to apply the same function to every column (each column is a list element) of a data frame, functions like lapply, by, and aggregate are very useful for eliminating for loops. Here is a good summary on these different 'apply-like' functions.
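As a small illustration of the point above, the sketch below computes column means twice, once with an explicit loop and once with sapply. The data frame df and the names means_loop and means_apply are hypothetical and exist only for this example.

```r
# Hypothetical toy data frame, used only for illustration.
df <- data.frame(a = 1:5, b = 6:10)

# Loop version: preallocate the result, then fill it one column at a time.
means_loop <- numeric(ncol(df))
names(means_loop) <- names(df)
for (j in seq_along(df)) {
  means_loop[j] <- mean(df[[j]])
}

# Apply-style version: one call, no index bookkeeping.
means_apply <- sapply(df, mean)
```

Both versions give the same named vector; the sapply form simply leaves the iteration to R.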

The aggregate function splits a data frame into groups and computes a user-specified summary statistic for each column within each group. Here is a simple example:

Dat = data.frame(G = c(rep('F',4),rep('M',6)),X1 = 10:1, X2 = 1:10)
Dat.mean = aggregate(.~G, Dat, mean)

> Dat.mean
  G  X1  X2
1 F 8.5 2.5
2 M 3.5 7.5 

It gives the mean of X1 and X2 for each level of G, and the output is a data frame. The applied function (mean) in this case returns a scalar value for each group of X1 and X2. This is much like an aggregate function in SQL:

select G, avg(X1) as X1_mean, avg(X2) as X2_mean
from Dat group by G

This works well as long as the function passed to aggregate returns a scalar value. What if the applied function returns a list? For example, suppose we have a very simple moving average function:

# stats::filter is named explicitly so that a loaded dplyr does not mask it
ma = function(x, n) list(as.numeric(stats::filter(x, rep(1/n,n),sides=1)))

Dat.ma = aggregate(.~G, Dat, ma, 3)

Let's see the output data.

> class(Dat.ma)
[1] "data.frame"
> dim(Dat.ma)
[1] 2 3

The output is still a 2 x 3 data frame. Our moving average series are stored as lists inside the cells of the output data frame:

> Dat.ma[1,]
  G           X1           X2
1 F NA, NA, 9, 8 NA, NA, 2, 3
> Dat.ma[[1,2]]
[1] NA NA  9  8
> class(Dat.ma[1,2])
[1] "list"
> class(Dat.ma[[1,2]])
[1] "numeric" 
> length(Dat.ma[[1,2]])
[1] 4
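To check the whole list column at once rather than one cell at a time, something like the following sketch may help. It rebuilds Dat, ma, and Dat.ma so that it runs on its own; the names is.nested and group.lengths are hypothetical.

```r
Dat <- data.frame(G = c(rep("F", 4), rep("M", 6)), X1 = 10:1, X2 = 1:10)
ma <- function(x, n) list(as.numeric(stats::filter(x, rep(1 / n, n), sides = 1)))
Dat.ma <- aggregate(. ~ G, Dat, ma, 3)

# TRUE when the column is a list column rather than an atomic vector.
is.nested <- is.list(Dat.ma$X1)

# Length of the series stored in each cell of X1, one value per group.
group.lengths <- sapply(Dat.ma$X1, length)
```

The lengths match the group sizes in Dat (4 rows for F, 6 for M), since the moving average keeps one value per input row.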

However, a nested data frame is sometimes inconvenient for further data manipulation. Can we reorganize it into a format close to the original, non-nested data frame? The do.call and by functions can help!

Let's take a look at the following function, which reorganizes the output data frame (agg.out) of the aggregate function:

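agg.output.df = function(agg.out){
  # Expand one row of the nested data frame into an ordinary data frame
  df = function(x){
    r = do.call(data.frame, x)
    colnames(r) = colnames(x)
    r
  }

  # Apply df to each row, then stack the pieces; returning the result
  # of do.call directly makes the function return it visibly
  r = by(agg.out, seq_len(nrow(agg.out)), df)
  do.call(rbind, r)
}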

Inside df, do.call(data.frame, x) takes one row of agg.out and expands it into an ordinary data frame, recycling the group label to match the length of the series. The by function applies df to each row and produces a list of data frames, and do.call(rbind, r) then combines them into one data frame. Here is the output:
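To make the first step concrete, here is a sketch that traces a single row through the same calls the helper uses. It rebuilds Dat, ma, and Dat.ma so that it runs on its own; the names row1 and one.group are hypothetical.

```r
Dat <- data.frame(G = c(rep("F", 4), rep("M", 6)), X1 = 10:1, X2 = 1:10)
ma <- function(x, n) list(as.numeric(stats::filter(x, rep(1 / n, n), sides = 1)))
Dat.ma <- aggregate(. ~ G, Dat, ma, 3)

# Step 1: take a single row; X1 and X2 are one-element list columns.
row1 <- Dat.ma[1, ]

# Step 2: pass the row's columns to data.frame() as separate arguments.
# The single G value is recycled to the length of the unpacked series.
one.group <- do.call(data.frame, row1)

# Step 3: data.frame() generates its own names for the unpacked columns,
# so restore the original ones, as the helper does.
colnames(one.group) <- colnames(row1)
```

After these steps one.group holds one ordinary row per observation of group F.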

> Dat.ma.df = agg.output.df(Dat.ma)
> Dat.ma.df
    G X1 X2
1.1 F NA NA
1.2 F NA NA
1.3 F  9  2
1.4 F  8  3
2.1 M NA NA
2.2 M NA NA
2.3 M  5  6
2.4 M  4  7
2.5 M  3  8
2.6 M  2  9
> dim(Dat.ma.df)
[1] 10  3
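For comparison, the same flattening can be sketched with lapply in place of by. This is only one possible alternative, not the approach used above: the function name agg.output.long is hypothetical, and the column names G, X1, and X2 are hard-coded for this example.

```r
Dat <- data.frame(G = c(rep("F", 4), rep("M", 6)), X1 = 10:1, X2 = 1:10)
ma <- function(x, n) list(as.numeric(stats::filter(x, rep(1 / n, n), sides = 1)))
Dat.ma <- aggregate(. ~ G, Dat, ma, 3)

# Hypothetical alternative: build one data frame per group with lapply,
# then stack them. Column names are hard-coded for this example only.
agg.output.long <- function(agg.out) {
  pieces <- lapply(seq_len(nrow(agg.out)), function(i) {
    data.frame(G  = agg.out$G[i],
               X1 = agg.out$X1[[i]],
               X2 = agg.out$X2[[i]])
  })
  do.call(rbind, pieces)
}

Dat.ma.long <- agg.output.long(Dat.ma)
```

Because the list passed to rbind is unnamed, the rows come out numbered consecutively instead of carrying the group.index labels shown above.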

Combining these 'apply-like' functions with do.call can make our data manipulation work easier and more efficient.
