Jun 17, 2015

group_by() and grouped_df

package dplyr is really a must tool in manipulating data. It provide lots of functions close to traditional SQL. If you are a SQL person, you may like this package very much after getting familiar with these functions such as inner_join, left_join, semi_join, and anti_join.

group_by is another commonly used function in dplyr. Combining group_by with n_distinct(), n(), min(), max(), median(), would achieve the those group by functions in SQL.

One point worth to mention is that group_by() return a data type that is essentially a data frame (or data table, depending on your input) but behaves slightly different when combining with other dplyr functions. They are most likely to be grouped_df or grouped_dt. The difference is that when you apply a dplyr function to the grouped_df (or grouped_dt), the function applies to each group, i.e., within the group. For example, apply arrange() to a grouped_df would sort the rows within each group, instead of over the whole dataset. If you want to sort the whole dataset, you need to convert the grouped_df to a regular data frame, using ungroup().

You can find a good intro about dplyr here.