Skip to content

Data Manipulation in R to Create Football League Table

March 19, 2009

Introduction

Every once in a while I like to keep myself up-to-date with the latest developments in the Barclays Premier League football. It is possible to see the latest table on many websites, but very often only the historical scores are available and not the standings on that date.

football-data.co.uk provides a lot of historical match data which after some manipulation can be converted to a league table. The datafiles along with the field descriptions can be found here.

I decided to build my own application in R so that I would be able to view how the standings table looked like on any given date, as well as to visualise the relative performance of teams during the season.


Loading the data

After loading the raw match data into R, all the unnecessary information is excluded, some of the columns renamed, and the dates converted into date format. We are interested in the match results, so include only date, home team, away team, and the match result.

> library(plyr)
> library(reshape)
> df <- read.csv("E0.csv")
> df <- df[, c(2, 3, 4, 7)]
> df <- rename(df, c(Date = "date", HomeTeam = "H",
     AwayTeam = "A"))
> df$date <- as.Date(df$date, format = "%d/%m/%y")

The data structure now looks like this, one row per match includes all the required stats:

> head(df)
        date             H         A FTR
1 2008-08-16       Arsenal West Brom   H
2 2008-08-16        Bolton     Stoke   H
3 2008-08-16       Everton Blackburn   A
4 2008-08-16          Hull    Fulham   H
5 2008-08-16 Middlesbrough Tottenham   H
6 2008-08-16    Sunderland Liverpool   A

Preparing the data

This setup is still not good for making summaries needed to compile the league table, so further manipulation is needed. We convert it to a long format and sort to ease later calculations.

> dfm <- melt(df, measure.vars = c("H", "A"),
     variable_name = "loc")
> dfm <- rename(dfm, c(value = "team"))
> dfm <- dfm[order(dfm$date, dfm$team), c("date",
     "team", "loc", "FTR")]

The data structure has now changed, each row containing the match date, team name, whether it was a home or away game, and the winning team:

> head(dfm)
          date      team loc FTR
1   2008-08-16   Arsenal   H   H
262 2008-08-16 Blackburn   A   A
2   2008-08-16    Bolton   H   H
3   2008-08-16   Everton   H   A
263 2008-08-16    Fulham   A   H
4   2008-08-16      Hull   H   H

Point Calculations

Now that the data is properly structured, we need to calculate the points awarded to the teams for every match. Besides that, a cumulative point score is also added.

> dfm$pts <- as.integer(with(dfm, ifelse((loc ==
     "H" & FTR == "H"), 3, ifelse((loc == "A" &
     FTR == "A"), 3, ifelse(FTR == "D", 1, 0)))))
> dfm1 <- ddply(dfm, .(team), transform, cumpts = cumsum(pts))

Final Adjustments

Usually teams play two times a week, but there are exceptions to this rule. As league standings would need to be calculated after every matchday, then next we need to obtain the cumulative point score of each team after every matchday. Three steps are needed to accomplish this:

A) create a dataframe with all the possible match date and team combinations

> cdate <- unique(dfm1$date)
> cteam <- unique(dfm1$team)
> combs <- merge(cdate, cteam, all = TRUE)
> combs <- rename(combs, c(x = "date", y = "team"))

B) merge the dataframe combs with the other dataframe dfm1 containing the match results

> dfm2 <- merge(combs, dfm1, by = c("date", "team"),
     all.x = TRUE)
> dfm2 <- dfm2[order(dfm2$team, dfm2$date), ]

C) for days when the team did not play use the cumulative score from their previous matchday. Package zoo provides a very convenient function na.locf() to do this. Also add the team ranking using the rank function. The beauty of this is the proper handling of ties.

> require(zoo)
> dfm3 <- ddply(dfm2, .(team), transform, cumpts1 = na.locf(cumpts,
     na.rm = FALSE))
> dfm3 <- ddply(dfm3, .(date), transform, position = rank(-cumpts1,
     ties.method = "min"))

Leaguetable Function

The end goal was the ability to see the standing at my chosen date, so a function where a date could be specified was needed. Again, the task was split into first creating two dataframes and then merging these to get the final leaguetable output.

> leaguetable <- function(df, fdate) {
     dfm4a <- cast(dfm3, team ~ ., c(sum, length),
         subset = (!is.na(FTR) & date <= fdate),
         value = "pts")
     dfm4a <- rename(dfm4a, c(sum = "P", length = "G"))
     dfm4b <- cast(dfm3, team ~ pts, length,
         subset = (!is.na(FTR) & date <= fdate),
         value = "position")
     dfm4b <- rename(dfm4b, c(`0` = "L", `1` = "D",
         `3` = "W"))
     dfm4b <- dfm4b[, c(1, 4, 3, 2)]
     dfm4 <- merge(dfm4a, dfm4b)
     dfm4$rank <- rank(-dfm4$P, ties.method = "min")
     dfm4[order(dfm4$rank), c(7, 1:6)]
 }

Leaguetable

Finally it is time to find out who was in lead at the end of 2008…

> leaguetable(dfm3, "2008-12-31")
   rank          team  P  G  W D  L
9     1     Liverpool 45 20 13 6  1
5     2       Chelsea 42 20 12 6  2
2     3   Aston Villa 38 20 11 5  4
11    3    Man United 38 18 11 5  2
1     5       Arsenal 35 20 10 5  5
6     6       Everton 32 20  9 5  6
20    7         Wigan 28 19  8 4  7
8     8          Hull 27 20  7 6  7
7     9        Fulham 26 19  6 8  5
19   10      West Ham 25 20  7 4  9
4    11        Bolton 23 20  7 2 11
14   11    Portsmouth 23 20  6 5  9
10   13      Man City 22 20  6 4 10
13   13     Newcastle 22 20  5 7  8
16   13    Sunderland 22 20  6 4 10
12   16 Middlesbrough 20 20  5 5 10
15   16         Stoke 20 20  5 5 10
17   16     Tottenham 20 20  5 5 10
3    19     Blackburn 18 20  4 6 10
18   19     West Brom 18 20  5 3 12

Job accomplished. In the next post I will try to visualise the performance of the teams during the season.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: