Block-processing a data frame with plyr
David Smith at the REvolutions blog shows how to split a data frame by the values of a variable and perform an operation on each segment, using the isplit function from the iterators package in combination with the foreach package. The example below creates three PDF files in the working directory.
As it happens, I was doing something similar when I read David's post, so here is an alternative way to accomplish the same task using plyr. Below are both the isplit and plyr versions for easy comparison; as you can see, the syntax is very similar. With large datasets I expect the isplit version to be faster on computers with multiple processors, because foreach can run the iterations in parallel once a parallel backend has been registered.
Load data
> site.data <- structure(list(site = structure(c(1L, 1L, 1L, 1L,
+ 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
+ 3L), .Label = c("ALBEN", "ALDER", "AMERI"), class = "factor"),
+ year = c(5L, 10L, 20L, 50L, 100L, 200L, 500L, 5L, 10L, 20L,
+ 50L, 100L, 200L, 500L, 5L, 10L, 20L, 50L, 100L, 200L),
+ peak = c(101529.6, 117483.4, 132960.9, 153251.2, 168647.8,
+ 184153.6, 204866.5, 6561.3, 7897.1, 9208.1, 10949.3,
+ 12287.6, 13650.2, 15493.6, 43656.5, 51475.3, 58854.4,
+ 68233.3, 75135.9, 81908.3)), .Names = c("site", "year",
+ "peak"), class = "data.frame", row.names = c(NA, -20L))
iterators & foreach
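> require(foreach)
> require(iterators)  # isplit() is defined in iterators
> sites <- isplit(site.data, site.data$site)
> # %dopar% runs sequentially (with a warning) unless a parallel backend is registered
> foreach(site = sites) %dopar% {
+ pdf(paste(site$key[[1]], ".pdf", sep = ""))
+ plot(site$value$year, site$value$peak, main = site$key[[1]])
+ dev.off()
+ }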
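The speed advantage mentioned above only appears once a parallel backend is registered. A minimal sketch of how that might look, assuming the doParallel package as the backend (the post does not name one) and a small stand-in data frame holding the first three rows per site:

```r
library(foreach)
library(iterators)
library(doParallel)  # assumption: any foreach backend could be used instead

# Stand-in for site.data: first three rows of each site from the post
site.data <- data.frame(
  site = factor(rep(c("ALBEN", "ALDER", "AMERI"), each = 3)),
  year = rep(c(5L, 10L, 20L), times = 3),
  peak = c(101529.6, 117483.4, 132960.9,
           6561.3, 7897.1, 9208.1,
           43656.5, 51475.3, 58854.4)
)

registerDoParallel(cores = 2)  # two workers; adjust to the machine

sites <- isplit(site.data, site.data$site)
# Each iteration writes one PDF and returns its file name
files <- foreach(site = sites, .combine = c) %dopar% {
  file <- paste(site$key[[1]], ".pdf", sep = "")
  pdf(file)
  plot(site$value$year, site$value$peak, main = site$key[[1]])
  dev.off()
  file
}

stopImplicitCluster()  # release workers if an implicit cluster was created
```

For three small plots the overhead of starting workers will likely outweigh any gain; the benefit should only show up with many or expensive segments.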
plyr
> require(plyr)
> pr <- function(df) {
+ pdf(paste(df$site[1], ".pdf", sep = ""))
+ plot(df$year, df$peak, main = df$site[1])
+ dev.off()
+ }
> d_ply(site.data, .(site), pr)
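For readers new to plyr, d_ply splits the data frame by the given variable, calls the function on each piece, and discards the return values. A rough base R equivalent, shown only as a sketch on a small stand-in for site.data (the first three rows per site), might be:

```r
# Stand-in for site.data: first three rows of each site from the post
site.data <- data.frame(
  site = factor(rep(c("ALBEN", "ALDER", "AMERI"), each = 3)),
  year = rep(c(5L, 10L, 20L), times = 3),
  peak = c(101529.6, 117483.4, 132960.9,
           6561.3, 7897.1, 9208.1,
           43656.5, 51475.3, 58854.4)
)

# Same plotting function as in the post, with the site label made explicit
pr <- function(df) {
  label <- as.character(df$site[1])
  pdf(paste(label, ".pdf", sep = ""))
  plot(df$year, df$peak, main = label)
  dev.off()
}

# split() returns a named list with one data frame per site
pieces <- split(site.data, site.data$site)

# Apply pr to each piece and ignore the results, as d_ply does
invisible(lapply(pieces, pr))
```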
Or if one wanted all the plots in one pdf-file instead of separate files:
> pdf("sites.pdf")
> d_ply(site.data, .(site), function(df) {
+ plot(df$year, df$peak, main = df$site[1])
+ })
> dev.off()
I’m planning to use iterators internally in a future release of plyr, so hopefully you’ll be able to have the best of both worlds.
Nice use of dput! http://bit.ly/2iAZbS
Nice post, I found it useful, thank you.
One thing I have noticed with plyr is that because you are using a function to iterate over a data frame, if you want to return results to a variable outside the loop you need to use the "<<-" assignment operator.
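An alternative to "<<-" that may be worth considering: the plyr variants that return something (ddply for a data frame, dlply for a list) collect whatever the function returns for each piece. A minimal sketch, using a small stand-in for site.data (the first three rows per site) and the illustrative names max_peak, peaks and ranges:

```r
library(plyr)

# Stand-in for site.data: first three rows of each site from the post
site.data <- data.frame(
  site = factor(rep(c("ALBEN", "ALDER", "AMERI"), each = 3)),
  year = rep(c(5L, 10L, 20L), times = 3),
  peak = c(101529.6, 117483.4, 132960.9,
           6561.3, 7897.1, 9208.1,
           43656.5, 51475.3, 58854.4)
)

# ddply: one row per site, gathered into a data frame
peaks <- ddply(site.data, .(site), summarise, max_peak = max(peak))

# dlply: one list element per site, holding whatever the function returns
ranges <- dlply(site.data, .(site), function(df) range(df$peak))
```

Here the results arrive as ordinary return values, so no assignment into an enclosing environment is needed.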
I wonder, can we somehow recreate the "all the plots in one pdf-file instead of separate files" example with isplit/foreach, generating ggplot2 layers inside foreach? I'm not sure about global/local scope here.
For example:
p = ggplot()
foreach(i = iiter) %do% {
p <- p+geom_line()
}
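For the base graphics case, the single-file variant can be sketched with isplit/foreach by opening the device once and using the sequential %do% operator, so that every plot is drawn in the same R process that owns the device. This is a sketch on a small stand-in for site.data (the first three rows per site); it covers base graphics only, and with %dopar% the workers may not share the open device.

```r
library(foreach)
library(iterators)

# Stand-in for site.data: first three rows of each site from the post
site.data <- data.frame(
  site = factor(rep(c("ALBEN", "ALDER", "AMERI"), each = 3)),
  year = rep(c(5L, 10L, 20L), times = 3),
  peak = c(101529.6, 117483.4, 132960.9,
           6561.3, 7897.1, 9208.1,
           43656.5, 51475.3, 58854.4)
)

pdf("sites.pdf")  # one device for all sites, one page per plot
invisible(
  foreach(site = isplit(site.data, site.data$site)) %do% {
    plot(site$value$year, site$value$peak, main = site$key[[1]])
    NULL  # nothing useful to collect from each iteration
  }
)
dev.off()
```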