Data Manipulation in R to Create Football League Table
Introduction
Every once in a while I like to keep myself up-to-date with the latest developments in the Barclays Premier League football. It is possible to see the latest table on many websites, but very often only the historical scores are available and not the standings on that date.
football-data.co.uk provides a lot of historical match data which after some manipulation can be converted to a league table. The datafiles along with the field descriptions can be found here.
I decided to build my own application in R so that I would be able to view how the standings table looked like on any given date, as well as to visualise the relative performance of teams during the season.
Loading the data
After loading the raw match data into R, all the unnecessary information is excluded, some of the columns renamed, and the dates converted into date format. We are interested in the match results, so include only date, home team, away team, and the match result.
> library(plyr) > library(reshape) > df <- read.csv("E0.csv") > df <- df[, c(2, 3, 4, 7)] > df <- rename(df, c(Date = "date", HomeTeam = "H", AwayTeam = "A")) > df$date <- as.Date(df$date, format = "%d/%m/%y") |
The data structure now looks like this, one row per match includes all the required stats:
> head(df) date H A FTR 1 2008-08-16 Arsenal West Brom H 2 2008-08-16 Bolton Stoke H 3 2008-08-16 Everton Blackburn A 4 2008-08-16 Hull Fulham H 5 2008-08-16 Middlesbrough Tottenham H 6 2008-08-16 Sunderland Liverpool A |
Preparing the data
This setup is still not good for making summaries needed to compile the league table, so further manipulation is needed. We convert it to a long format and sort to ease later calculations.
> dfm <- melt(df, measure.vars = c("H", "A"), variable_name = "loc") > dfm <- rename(dfm, c(value = "team")) > dfm <- dfm[order(dfm$date, dfm$team), c("date", "team", "loc", "FTR")] |
The data structure has now changed, each row containing the match date, team name, whether it was a home or away game, and the winning team:
> head(dfm) date team loc FTR 1 2008-08-16 Arsenal H H 262 2008-08-16 Blackburn A A 2 2008-08-16 Bolton H H 3 2008-08-16 Everton H A 263 2008-08-16 Fulham A H 4 2008-08-16 Hull H H |
Point Calculations
Now that the data is properly structured, we need to calculate the points awarded to the teams for every match. Besides that, a cumulative point score is also added.
> dfm$pts <- as.integer(with(dfm, ifelse((loc == "H" & FTR == "H"), 3, ifelse((loc == "A" & FTR == "A"), 3, ifelse(FTR == "D", 1, 0))))) > dfm1 <- ddply(dfm, .(team), transform, cumpts = cumsum(pts)) |
Final Adjustments
Usually teams play two times a week, but there are exceptions to this rule. As league standings would need to be calculated after every matchday, then next we need to obtain the cumulative point score of each team after every matchday. Three steps are needed to accomplish this:
A) create a dataframe with all the possible match date and team combinations
> cdate <- unique(dfm1$date) > cteam <- unique(dfm1$team) > combs <- merge(cdate, cteam, all = TRUE) > combs <- rename(combs, c(x = "date", y = "team")) |
B) merge the dataframe combs with the other dataframe dfm1 containing the match results
> dfm2 <- merge(combs, dfm1, by = c("date", "team"), all.x = TRUE) > dfm2 <- dfm2[order(dfm2$team, dfm2$date), ] |
C) for days when the team did not play use the cumulative score from their previous matchday. Package zoo provides a very convenient function na.locf() to do this. Also add the team ranking using the rank function. The beauty of this is the proper handling of ties.
> require(zoo) > dfm3 <- ddply(dfm2, .(team), transform, cumpts1 = na.locf(cumpts, na.rm = FALSE)) > dfm3 <- ddply(dfm3, .(date), transform, position = rank(-cumpts1, ties.method = "min")) |
Leaguetable Function
The end goal was the ability to see the standing at my chosen date, so a function where a date could be specified was needed. Again, the task was split into first creating two dataframes and then merging these to get the final leaguetable output.
> leaguetable <- function(df, fdate) { dfm4a <- cast(dfm3, team ~ ., c(sum, length), subset = (!is.na(FTR) & date <= fdate), value = "pts") dfm4a <- rename(dfm4a, c(sum = "P", length = "G")) dfm4b <- cast(dfm3, team ~ pts, length, subset = (!is.na(FTR) & date <= fdate), value = "position") dfm4b <- rename(dfm4b, c(`0` = "L", `1` = "D", `3` = "W")) dfm4b <- dfm4b[, c(1, 4, 3, 2)] dfm4 <- merge(dfm4a, dfm4b) dfm4$rank <- rank(-dfm4$P, ties.method = "min") dfm4[order(dfm4$rank), c(7, 1:6)] } |
Leaguetable
Finally it is time to find out who was in lead at the end of 2008…
> leaguetable(dfm3, "2008-12-31") rank team P G W D L 9 1 Liverpool 45 20 13 6 1 5 2 Chelsea 42 20 12 6 2 2 3 Aston Villa 38 20 11 5 4 11 3 Man United 38 18 11 5 2 1 5 Arsenal 35 20 10 5 5 6 6 Everton 32 20 9 5 6 20 7 Wigan 28 19 8 4 7 8 8 Hull 27 20 7 6 7 7 9 Fulham 26 19 6 8 5 19 10 West Ham 25 20 7 4 9 4 11 Bolton 23 20 7 2 11 14 11 Portsmouth 23 20 6 5 9 10 13 Man City 22 20 6 4 10 13 13 Newcastle 22 20 5 7 8 16 13 Sunderland 22 20 6 4 10 12 16 Middlesbrough 20 20 5 5 10 15 16 Stoke 20 20 5 5 10 17 16 Tottenham 20 20 5 5 10 3 19 Blackburn 18 20 4 6 10 18 19 West Brom 18 20 5 3 12 |
Job accomplished. In the next post I will try to visualise the performance of the teams during the season.
Hi really interesting! I tried to implement your Leaguetable adding GoalForHome, GoalForAway, GoalAgainst….but I can’t really sort it out how to include them to your ‘leaguetable function’. Could you help me?
Hi, This code is really helpful but I am not able to understand why my execution of leaguetable function gives an error that the length is null.
I am taking data for 2017 league.
My function execution command is: leaguetable(dfm3, “2008-12-31”)
This is the error: Error in if (lengths == 1) return(data) : argument is of length zero
Sorry.
My function execution command is: leaguetable(dfm3, “2018-12-31”)