Import a Text File with Repeating Titles

May 10, 2009

Kelly O’Day from the Charts & Graphs blog asked the following question:

I’ve been trying to figure out how to read this file: The repeating titles every 20 years has me stumped. Do you have a trick for reading this type of source data file into R?

This post shows how to import the text file, and quickly plot the monthly data from 1880-2009.

Data Import

> library(ggplot2)
> url <- c("")
> file <- c("GLB.Ts.txt")
> download.file(url, file)

The first 7 rows and the last 12 rows of the text file contain instructions and additional information that need to be stripped out before the data becomes usable.

Find out the number of rows in a file, and exclude the last 12

> rows <- length(readLines(file)) - 12

Read the file in as a character vector, one line per row.

> lines <- readLines(file, n = rows)[8:rows]
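The same trimming logic can be tried without downloading anything. The sketch below uses a made-up character vector (`sample_lines`, with 2 header lines and 1 footer line instead of the 7 and 12 in the real file); all names in it are hypothetical and only illustrate the indexing.

```r
# Hypothetical stand-in for the downloaded file: 2 header lines,
# 3 data lines, 1 footer line (the real file has 7 and 12).
sample_lines <- c("Title line", "Instructions",
                  "Year Jan Feb", "1880 -30 -21", "1881 -10 -14",
                  "Footer note")
n_head <- 2
n_foot <- 1

# Keep everything between the header and the footer
keep <- seq(n_head + 1, length(sample_lines) - n_foot)
body_lines <- sample_lines[keep]
print(body_lines)
```

Keeping the header and footer counts in named variables makes it easier to adjust them if the layout of the source file changes.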

Data Manipulation

Now we have an R vector with 143 lines. The instructions (that were just stripped out) note that missing data is marked by four asterisks (****). read.table throws an error when it encounters the asterisks, so they need to be replaced with NAs. It also appears that, contrary to the instructions, the number of asterisks is not consistent: in 1880, for example, both four and five asterisks mark missing values. A regular expression that matches 3-5 consecutive asterisks and replaces them with NA takes care of this problem.

Use a regular expression to replace every run of 3-5 asterisks with NA

> lines2 <- gsub("\\*{3,5}", " NA", lines, perl = TRUE)
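To see what the substitution does, here is a small sketch on a single made-up line written in the style of the data file (the values are invented for illustration, not taken from the real data).

```r
# Made-up line: missing values appear as runs of four or five asterisks
x <- "1880  **** -21 *****  -18"

# Same pattern as above: 3 to 5 consecutive asterisks become " NA"
cleaned <- gsub("\\*{3,5}", " NA", x, perl = TRUE)
print(cleaned)

# Splitting on whitespace shows the fields read.table would see
fields <- strsplit(trimws(cleaned), "\\s+")[[1]]
print(fields)
```

The leading space in the replacement string guards against an asterisk run that sits directly against the previous value, which would otherwise fuse two fields into one.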

The next step is to convert the character vector to a dataframe. The colClasses are specified explicitly; otherwise all the variables would be converted to factors, which we don’t want.

Convert the character vector to a dataframe.

> df <- read.table(textConnection(lines2), header = TRUE,
     colClasses = "character")
> closeAllConnections()
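As a possible alternative, reasonably recent versions of R let read.table take the lines directly through its `text` argument, so no connection has to be opened or closed. The sketch below assumes such a version and uses a made-up vector (`lines_demo`) that includes one repeated title row.

```r
# Made-up input with a repeated title row in the middle
lines_demo <- c("Year Jan Feb",
                "1880 -30 NA",
                "Year Jan Feb",
                "1881 -10 -14")

# text = ... reads from the character vector itself;
# colClasses = "character" keeps every column as text
df_demo <- read.table(text = lines_demo, header = TRUE,
                      colClasses = "character")
str(df_demo)
```

The repeated title survives as an ordinary row of text at this stage, which is exactly what the numeric conversion later takes advantage of.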

We are only interested in the monthly data, which is in the first 13 columns (Year plus the twelve months)

> df <- df[, 1:13]

Next the numeric information will be converted from character to numeric format. In the process the repeating titles are coerced to NAs, making it easy to eliminate them in the next step.

Convert all variables (columns) to numeric format

> df <- colwise(as.numeric)(df)

Remove rows where Year=NA from the dataframe

> df <- df[!is.na(df$Year), ]
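The coercion trick can be checked on a tiny example. The sketch below is a base R variant that uses lapply in place of colwise (which comes from the plyr package); the data frame `df_demo` is made up for illustration.

```r
# Made-up character data frame with a repeated title row in the middle
df_demo <- data.frame(Year = c("1880", "Year", "1881"),
                      Jan  = c("-30", "Jan", "-10"),
                      stringsAsFactors = FALSE)

# as.numeric() turns non-numeric text such as "Year" into NA;
# the coercion warning is expected here, so it is suppressed
to_num <- function(col) suppressWarnings(as.numeric(col))
df_num <- as.data.frame(lapply(df_demo, to_num))

# Rows that used to hold the repeated titles now have Year == NA
df_clean <- df_num[!is.na(df_num$Year), ]
print(df_clean)
```

Genuinely missing monthly values stay in the data as NA, because only the Year column is used to decide which rows to drop.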

Convert from wide format to long format

> dfm <- melt(df, id.var = "Year", variable_name = "Month")
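For readers who want to see what the wide-to-long step produces, the following is a rough base R sketch of the same reshaping on a made-up two-year data frame. It is not how melt is implemented, just an illustration of the resulting layout; `wide` and `long` are hypothetical names.

```r
# Made-up wide data: one row per year, one column per month
wide <- data.frame(Year = c(1880, 1881),
                   Jan  = c(-30, -10),
                   Feb  = c(-21, -14))

month_cols <- setdiff(names(wide), "Year")

# Long format: one row per Year/Month combination
long <- data.frame(
  Year  = rep(wide$Year, times = length(month_cols)),
  Month = factor(rep(month_cols, each = nrow(wide)), levels = month_cols),
  value = unlist(wide[month_cols], use.names = FALSE)
)
print(long)
```

Setting the factor levels explicitly keeps the months in calendar order rather than alphabetical order, which matters for the order of the facets in the plot.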


> ggplot(dfm, aes(Year, value, group = Month)) +
     geom_line() + facet_grid(Month ~ .)

9 Comments
  1. May 11, 2009 5:36 pm


    Very good. Thanks, I needed that.

    Your script shows the power and flexibility of R.

    Kelly O’Day

  2. June 1, 2009 1:45 am

    Hello. Could someone please help:

    > require("ggplot2")
    Loading required package: ggplot2
    Error in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
    ‘ggplot2’ is not a valid package -- installed < 2.0.0?

  3. June 1, 2009 1:46 am

    I have also tried the following command:

    > library(ggplot2)
    Error in library(ggplot2) :
    ‘ggplot2’ is not a valid package -- installed < 2.0.0?

    • learnr permalink
      June 1, 2009 9:25 am

      It seems that there might be a problem with your R installation.
      Two things you could try:
      1) Reinstall R, make sure you are using the latest version.
      2) after you have reinstalled R, reinstall ggplot2 – install.packages("ggplot2")

      It should work then.

  4. Matt permalink
    June 16, 2009 8:32 pm

    Thank you.

  5. Jan permalink
    June 17, 2009 11:43 am

    Thanks for a great site! I am just starting to find my way around R.
    Like you, I used another program for all my data analysis (SAS) but got tired of the appalling graphing options so wanted to try R.
    Although not quite related to this post, I also “suffer” with datasets that are quirky.
    E.g. I have dates in three different formats. Two are ok, but the third causes me great headache. The date in question is reported as mmddyy (character string) but is riddled with errors due to a lack of logical testing of its validity when it is reported. I developed routines in SAS that did this fairly easily but haven’t found a way of doing this in R (or rather, my brain hasn’t gotten around to “think in R” yet).
    Documentation on R functions I find cryptic and not very helpful either.
    My problem is as follows: I want to test whether ddmmyy is a valid date, by testing whether yy is within a range that makes sense, mm is between 1 and 12, dd is between 1 and 31 and so on. If not, I want the value set to NA (not reported). After this is done, I want to check the combinations of data and another variable in the dataset against combinations in another table. For this I will use sqldf, with which I am already familiar.
    How do I proceed? I force ColClasses= character when importing.

    The result of this will be a plot like the one shown in this post, i.e. month-by-year.

    • learnr permalink
      June 27, 2009 9:20 pm

      ISOdate function makes the day/month validation easier – see below:

      > ISOdate(year = 2000, month = 14, day = 30)
      [1] NA
      > ISOdate(year = 2000, month = 11, day = 31)
      [1] NA
      > ISOdate(year = 2000, month = 11, day = 30)
      [1] "2000-11-30 12:00:00 GMT"
      > year <- 45
      > ISOdate(year = ifelse(year < 50, NA, year + 1900), month = 11, day = 30)
      [1] NA
      > year <- 60
      > ISOdate(year = ifelse(year < 50, NA, year + 1900), month = 11, day = 30)
      [1] "1960-11-30 12:00:00 GMT"
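
Building on the ISOdate calls above, a minimal sketch of how a whole vector of mmddyy strings might be validated could look like this. The helper name `parse_mmddyy` and the `pivot` argument are hypothetical, and the rule that two-digit years below 50 are invalid is simply carried over from the example above; a real dataset would need its own range.

```r
# Hypothetical helper: parse "mmddyy" strings, returning NA for
# anything that is not a valid date within the accepted year range.
parse_mmddyy <- function(x, pivot = 50) {
  ok <- grepl("^[0-9]{6}$", x)                  # exactly six digits
  mm <- suppressWarnings(as.integer(substr(x, 1, 2)))
  dd <- suppressWarnings(as.integer(substr(x, 3, 4)))
  yy <- suppressWarnings(as.integer(substr(x, 5, 6)))
  year <- ifelse(yy < pivot, NA, yy + 1900)     # assumed valid range
  out <- ISOdate(year, mm, dd)                  # NA for impossible dates
  out[!ok] <- NA
  as.Date(out)
}

dates <- parse_mmddyy(c("113060", "143060", "113160", "113045", "11x060"))
print(dates)
```

Only the first string should survive as a date; the others fail on the month, the day, the year range, and the non-digit character respectively.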

