Data Profiling in R

December 17, 2009

At the useR! 2006 conference Jim Porzak gave a presentation on data profiling with R. He showed how to draw summary panels of the data using a combination of grid and base graphics.

data_profiling_porzak.png

Unfortunately, the code has not (yet) been released as a package, so when I recently needed to review several datasets quickly at the beginning of an analysis project, I started to look for alternatives. A quick search revealed two options that offer similar functionality: the r2lUniv package and the describe() function in the Hmisc package.


r2lUniv

The r2lUniv package performs a quick analysis of either a single variable or a data frame: it computes several statistics (frequency, centrality, dispersion, graph) for each variable and writes the results in LaTeX format. The output varies depending on the variable type.
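
The idea of a summary that depends on the variable type can be sketched in base R. This is only an illustration of the principle, not how r2lUniv is implemented; the helper name profile_variable is made up for this example and the choice of statistics is an assumption.

```r
# Hypothetical helper, not part of r2lUniv: returns a different
# summary depending on the type of the variable.
profile_variable <- function(x) {
  if (is.numeric(x)) {
    # centrality and dispersion for numeric variables
    c(n      = sum(!is.na(x)),
      mean   = mean(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      sd     = sd(x, na.rm = TRUE))
  } else {
    # frequency table for everything else (factors, characters, logicals)
    table(x, useNA = "ifany")
  }
}

# Apply the helper to every column of a data frame; the result is a named list.
profiles <- lapply(mtcars, profile_variable)
print(round(profiles$mpg, 2))

# A factor gets a frequency table instead of numeric statistics.
cyl_table <- profile_variable(factor(mtcars$cyl))
print(cyl_table)
```

All columns of mtcars are numeric, so cyl is converted to a factor here purely to show the second branch.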

> library(r2lUniv)

One can specify the text to be inserted in front of each variable's section; here it is a subsection heading named after the variable.

> textBefore <- paste("\\subsection{", names(mtcars),
+     "}", sep = "")
> rtlu(mtcars, "fileOut.tex", textBefore = textBefore)

The function rtluMainFile generates a skeleton for the main LaTeX document, which allows the report to be customised further.

> text <- "\\input{fileOut.tex}"
> rtluMainFile("r2lUniv_report.tex", text = text)

The resulting tex-file can then be converted into a pdf.

> library(tools)
> texi2dvi("r2lUniv_report.tex", pdf = TRUE, clean = TRUE)

A sample output for the mpg-variable:

data_profiling_r2lUniv.png

The final pdf-output can be seen here: r2lUniv_report.pdf.


Hmisc

The describe function in the Hmisc package determines whether each variable is character, factor, category, binary, discrete numeric, or continuous numeric, and prints a concise statistical summary appropriate to that type. The LaTeX report also includes a spike histogram displaying the frequency counts.

> library(Hmisc)
> db <- describe(mtcars, size = "normalsize")

The easiest and fastest way to view the results is to print them to the console.

> db$mpg
mpg
      n missing  unique    Mean     .05     .10     .25     .50
     32       0      25   20.09   12.00   14.34   15.43   19.20
    .75     .90     .95
  22.80   30.09   31.30

lowest : 10.4 13.3 14.3 14.7 15.0
highest: 26.0 27.3 30.4 32.4 33.9
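
To make it clearer what these numbers mean, the same quantities can be computed by hand in base R. This is a rough sketch rather than a description of how describe works internally; in particular, the use of quantile with its default settings is an assumption, so the quantiles may differ slightly from the output above for other data.

```r
x <- mtcars$mpg

# Counts shown at the start of the describe() output
n             <- sum(!is.na(x))
missing       <- sum(is.na(x))
unique_values <- length(unique(x[!is.na(x)]))

# The same set of quantiles, using quantile()'s default method
probs <- c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)
q <- quantile(x, probs = probs, na.rm = TRUE)

# Five lowest and five highest distinct values
sorted  <- sort(unique(x))
lowest  <- head(sorted, 5)
highest <- tail(sorted, 5)

print(c(n = n, missing = missing, unique = unique_values))
print(round(mean(x, na.rm = TRUE), 2))
print(round(q, 2))
print(lowest)
print(highest)
```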

Alternatively, one can convert the describe object into a LaTeX file.

> x <- latex(db, file = "describe.tex")

A small wrapper document that includes describe.tex is written with cat and then compiled to pdf.

> text2 <- "\\documentclass{article}\n\\usepackage{relsize,setspace}\n\\begin{document}\n\\input{describe.tex} \n\\end{document}"
> cat(text2, file = "Hmisc_describe_report.tex")
> library(tools)
> texi2dvi("Hmisc_describe_report.tex", pdf = TRUE)
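
The long escaped string above is easy to mistype. As a possible alternative, the wrapper can be built line by line and written with writeLines. The sketch below writes to a temporary directory so that it does not overwrite any existing report; the variable names wrapper and report_file are made up for this example, and the check for pdflatex is only a convenience, not something texi2dvi requires you to do.

```r
# Build the wrapper document line by line instead of as one long string.
wrapper <- c(
  "\\documentclass{article}",
  "\\usepackage{relsize,setspace}",
  "\\begin{document}",
  "\\input{describe.tex}",
  "\\end{document}"
)

# Write to a temporary location; use a path in the working directory
# (next to describe.tex) for a real report.
report_file <- file.path(tempdir(), "Hmisc_describe_report.tex")
writeLines(wrapper, report_file)

# Sys.which() returns an empty string when the program is not on the PATH.
if (nzchar(Sys.which("pdflatex"))) {
  message("pdflatex found, the report can be compiled with texi2dvi()")
} else {
  message("pdflatex not found, install a LaTeX distribution before compiling")
}
```

The texi2dvi call itself stays the same as above once the file sits next to describe.tex.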

A sample output for the mpg-variable:

data_profiling_describe.png

The final pdf-report can be seen here: Hmisc_describe_report.pdf.


Conclusion

Both functions provide similar snapshots of the data; however, I prefer the describe function for its more concise output and for the option to print the analysis to the console. Whilst I like the summary plots generated by r2lUniv, I find them hard to read in the pdf-report because of the small font size of the labels.

8 Comments
  1. Jan Vandermeer
    January 28, 2010 12:04 am

    Hi;

    Still learning the whole R and latex thing. I ran the code that you gave to produce the mpg plots. I see the two latex documents that were produced and the subdirectory graphUniv. What I am missing is the insertion of the graphics into the two right boxes.

    The latex error is:

    Running ‘texi2dvi’ on ‘r2lUniv_report.tex’ failed.
    LaTeX errors:
    ! LaTeX Error: File `graphUniv/V1-boxplot’ not found.

    The graphUniv/V1-boxplot is referenced in

    \includegraphics[width=3cm]{graphUniv/V1-boxplot} of fileOut.tex.

    How do I get includegraphics to find the subdirectory?

    Jan

    • learnr
      January 28, 2010 1:23 am

      Have you checked whether you actually have any png-files in the graphUniv subdirectory?

    • learnr
      January 28, 2010 1:24 am

      Have you checked whether you actually have any png-files in the graphUniv subdirectory, and that they can be opened?
      This error seems to suggest otherwise.

  2. Jan Vandermeer
    January 28, 2010 2:05 am

    Ooops! Should have mentioned that as well. The directory contains 22 png files named variously V(#) box, hist or bar.

    They open in a graphics view just fine.

    Jan

  3. Simon H
    November 11, 2010 4:00 pm

    Hi, it looks like the r2lUniv package has been removed from CRAN, does anyone know the reasons for this?

    • learnr
      November 11, 2010 9:31 pm

      You’re right, it is not available any more. Maybe an email to the author could clarify the situation?

  4. jack
    August 9, 2011 3:12 pm

    still not available…what is the story on that?

Trackbacks

  1. Generating PDF from R | Noah's blog
