
Block-processing a data frame with plyr

August 12, 2009
David Smith at the REvolutions blog shows how to split a data frame by the values of a variable and perform an operation on each segment, using the isplit function from the iterators package in combination with the foreach package. The example below creates three PDF files in the working directory, one per site.

As it happens, I was doing something similar when I read David’s post, so here is an alternative way to accomplish the same task using plyr. Both the isplit and plyr versions are given below for easy comparison; as you can see, the syntax is very similar. With large datasets I expect isplit to be faster on multiprocessor machines, since paired with foreach it can use the latter's parallel computing capabilities.


Load data

> site.data <- structure(list(site = structure(c(1L, 1L, 1L, 1L,
+     1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
+     3L), .Label = c("ALBEN", "ALDER", "AMERI"), class = "factor"),
+     year = c(5L, 10L, 20L, 50L, 100L, 200L, 500L, 5L, 10L, 20L,
+         50L, 100L, 200L, 500L, 5L, 10L, 20L, 50L, 100L, 200L),
+     peak = c(101529.6, 117483.4, 132960.9, 153251.2, 168647.8,
+         184153.6, 204866.5, 6561.3, 7897.1, 9208.1, 10949.3,
+         12287.6, 13650.2, 15493.6, 43656.5, 51475.3, 58854.4,
+         68233.3, 75135.9, 81908.3)), .Names = c("site", "year",
+     "peak"), class = "data.frame", row.names = c(NA, -20L))

iterators & foreach

> require(foreach)
> require(iterators)  # provides isplit()
> sites <- isplit(site.data, site.data$site)
> foreach(site = sites) %dopar% {
+     pdf(paste(site$key[[1]], ".pdf", sep = ""))
+     plot(site$value$year, site$value$peak, main = site$key[[1]])
+     dev.off()
+ }
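
A note on the parallel part: %dopar% only runs in parallel once a parallel backend has been registered; otherwise foreach runs the loop sequentially. The sketch below is one way to set that up, not something taken from David's post. It assumes the doParallel package is installed (doMC or doSNOW would play the same role) and falls back to foreach's sequential backend if it is not. The worker count of 2, the small demo.data stand-in, and the temporary output directory are all illustrative choices.

```r
library(foreach)
library(iterators)

# Small stand-in for site.data: the first two rows of two sites from the post
demo.data <- data.frame(
    site = factor(c("ALBEN", "ALBEN", "ALDER", "ALDER")),
    year = c(5L, 10L, 5L, 10L),
    peak = c(101529.6, 117483.4, 6561.3, 7897.1))

# Assumption: doParallel is available; if not, register the sequential backend
if (requireNamespace("doParallel", quietly = TRUE)) {
    cl <- parallel::makeCluster(2)       # 2 workers, chosen for illustration
    doParallel::registerDoParallel(cl)
} else {
    cl <- NULL
    registerDoSEQ()
}

out.dir <- tempdir()                     # write here, not the working directory
files <- foreach(site = isplit(demo.data, demo.data$site),
                 .combine = c) %dopar% {
    f <- file.path(out.dir, paste(site$key[[1]], ".pdf", sep = ""))
    pdf(f)
    plot(site$value$year, site$value$peak, main = site$key[[1]])
    dev.off()
    f                                    # return the file name to the caller
}

# Shut the workers down and restore the sequential backend
if (!is.null(cl)) {
    parallel::stopCluster(cl)
    registerDoSEQ()
}
```

Returning the file name from each iteration is optional; it is there so the caller gets back a vector of the files that were written.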

plyr

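> require(plyr)
> # Plot one site's peaks into a PDF named after that site
> pr <- function(df) {
+     pdf(paste(df$site[1], ".pdf", sep = ""))
+     plot(df$year, df$peak, main = df$site[1])
+     dev.off()
+ }
> # d_ply: split site.data by site, call pr on each piece, discard the results
> d_ply(site.data, .(site), pr)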

Or if one wanted all the plots in one pdf-file instead of separate files:

> pdf("sites.pdf")
> d_ply(site.data, .(site), function(df) {
+     plot(df$year, df$peak, main = df$site[1])
+ })
> dev.off()
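
d_ply suits the plotting examples because it discards whatever the function returns. When a per-site result is wanted instead, plyr's ddply collects the return values into a data frame, so no assignment to an outside variable is needed. A minimal sketch, using a small stand-in for site.data; the summary columns max.peak and n are made up for illustration:

```r
library(plyr)

# Small stand-in for site.data: the first two rows of two sites from the post
demo.data <- data.frame(
    site = factor(c("ALBEN", "ALBEN", "ALDER", "ALDER")),
    year = c(5L, 10L, 5L, 10L),
    peak = c(101529.6, 117483.4, 6561.3, 7897.1))

# ddply: data frame in, data frame out; the function returns one row per site
peaks <- ddply(demo.data, .(site), function(df) {
    data.frame(max.peak = max(df$peak), n = nrow(df))
})
print(peaks)
```

The result has one row per level of site, with the grouping column added back by ddply.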


Comments
  1. Hadley Wickham
    August 12, 2009 5:49 pm

    I’m planning to use iterators internally in a future release of plyr, so hopefully you’ll be able to have the best of both worlds.

  2. dggoldst
    August 12, 2009 6:09 pm

    Nice use of dput! http://bit.ly/2iAZbS

  3. August 17, 2009 7:36 pm

    Nice post, I found it useful, thank you.

    One thing I have noticed with plyr is that, because you are using a function to iterate over a data frame, you need the “<<-” assignment operator if you want to return results to a variable outside the loop.

  4. Ilya
    October 19, 2009 4:39 am

    I wonder, can we somehow recreate the example “Or if one wanted all the plots in one pdf-file instead of separate files:” with isplit/foreach, generating ggplot2 layers inside foreach? I'm not sure about global/local scope here,
    e.g.
    p = ggplot()

    foreach(i=iiter) {

    p <- p+geom_line()
    }
