Creating Publication-Quality Graphics

Overview

Teaching: 60 min
Exercises: 20 min
Questions
  • How can I create publication-quality graphics in R?

Objectives
  • To be able to use ggplot2 to generate publication quality graphics.

  • To understand the basic grammar of graphics, including the aesthetics and geometry layers, adding statistics, transforming scales, and coloring or panelling by groups.

Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.

There are three main plotting systems in R, the base plotting system, the lattice package, and the ggplot2 package.

Today we’ll be learning about the ggplot2 package, because it is the most effective for creating publication quality graphics.

ggplot2 is built on the grammar of graphics, the idea that any plot can be expressed from the same set of components: a data set, a coordinate system, and a set of geoms–the visual representation of data points.

The key to understanding ggplot2 is thinking about a figure in layers. This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.

Let’s start off with an example:

library("ggplot2")
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

So the first thing we do is call the ggplot function. This function lets R know that we’re creating a new plot, and any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.

We’ve passed in two arguments to ggplot. First, we tell ggplot what data we want to show on our figure, in this example the gapminder data we read in earlier. For the second argument we passed in the aes function, which tells ggplot how variables in the data map to aesthetic properties of the figure, in this case the x and y locations. Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]), this is because ggplot is smart enough to know to look in the data for that column!

By itself, the call to ggplot isn’t enough to draw a figure:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

We need to tell ggplot how we want to visually represent the data, which we do by adding a new geom layer. In our example, we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatter plot of points:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()+
  scale_x_log10()

There are also built-in axis transformation methods, for example Log10-transformations. Unfortunately not all transformations are built in like a simple log-transformation, in which case we can do this:

ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point()

Challenge 1

Modify the example so that the figure visualise how life expectancy has changed over time:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point()

Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis.

Solution to challenge 1

Modify the example so that the figure visualise how life expectancy has changed over time:

ggplot(data = gapminder, aes(x = year, y = lifeExp)) + geom_point()

Aesthetics

We can also colour points based on different factors, such as by continent

ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp, colour = continent)) +
  geom_point()

Linear Regression

Let’s add a linear regression line to this plot.

ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp, colour = continent)) +
  geom_point()+
  geom_smooth(method = "lm")

Here we see a linear regression line for each continent, but what is the over all global trend? Let’s redraw that regression line to represent the global trend.

ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(colour = continent))+
  geom_smooth(method = "lm", colour="black")

So, what is the actual equation of this trend line? What is the slope? correlation? p-value? We can calculate the linear model using the lm function. The format is y-axis ~ x-axis, data, and show to show the summary of the model we use the summary function of the lm output.

fit <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
summary(fit)
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.778  -4.204   1.212   4.658  19.285 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.1009     1.2277  -7.413 1.93e-13 ***
## log(gdpPercap)   8.4051     0.1488  56.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared:  0.6522,	Adjusted R-squared:  0.652 
## F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

Looking at the p-value, this shows that there is significant positive relationship between life expectancy and log-transformed GDP per capita. Let’s add the linear regression line equation and its significance to the plot!

To do this we need to extract this information from the summary(), and print out $y=a+bx$.

# Extract information to build an equation
a <- signif(fit$coefficients[[1]], 2)
b <- signif(fit$coefficients[[2]], 2)
r2 <- signif(summary(fit)$adj.r.squared,2)
pval <- signif(summary(fit)$coefficients[2,4], 2)
eq <- paste("y = ",a," + ", b, "x , r^2 = ", r2,", pval = ",pval)
eq
## [1] "y =  -9.1  +  8.4 x , r^2 =  0.65 , pval =  0"
# Create plot and annotate with equation
ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(colour = continent))+
  geom_smooth(method = "lm", colour="black")+
  annotate("text", x=7, y=80,label = eq, colour = "black")

Layers

Using a scatter plot probably isn’t the best for visualizing change over time. Instead, let’s tell ggplot to visualize the data as a line plot:

ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) +
  geom_line()

Instead of adding a geom_point layer, we’ve added a geom_line layer. We’ve added the by aesthetic, which tells ggplot to draw a line for each country.

We can also combine geom_point and geom_line

ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) +
  geom_line()+
  geom_point()

We see two countries, one in Asia and one in Africa, that show a noticeable big drop in life expectancy. Let’s try to figure out which countries and years they are:

We can set logical criteria based on what is shown in the graph the year, continent, and life expectancy.

gapminder[gapminder$year > 1975 & gapminder$continent %in% c("Asia", "Africa") & gapminder$lifeExp < 35,]
##       country continent year lifeExp     pop gdpPercap
## 222  Cambodia      Asia 1977  31.220 6978607  524.9722
## 1293   Rwanda    Africa 1992  23.599 7290203  737.0686

These correspond to the years when the Cambodian genocide and the Rwandan genocide occurred. These are ways in which visualizing plots leads to insights in the data.

Box plots

The most popular plot besides a scatter plot might be a bar graph, but bar graphs do not show distribution of data within a group very well… Instead, using box plots is a great alternative. Let’s make one here:

ggplot(gapminder, aes(x= continent, y = log(gdpPercap), fill = continent))+
  geom_boxplot()

One common misconception is the middle bar that cuts across each group. This is the median of the group distribution, NOT the average. The edge of the boxes in a box plots show the quartile ranges.

Violin plots

A variation of a box plot is a violin plot. Violin plots show full distribution and density of data.

ggplot(gapminder, aes(x= continent, y = log(gdpPercap), fill = continent))+
  geom_violin()

We can also combine this with a box plot to get quartile information and make the box plot smaller so its within the violin curves.

ggplot(gapminder, aes(x= continent, y = log(gdpPercap), fill = continent))+
  geom_violin()+
  geom_boxplot(width = 0.1)

Now that we can see the distribution of log-transformed GDP per capita varies between continents, let’s see if any of the continents are significantly different from another. To test this we will run an one-way Analysis of Variance (ANOVA) model as shown below.

fit_anov <- aov(log(gdpPercap)~ continent, data = gapminder)
summary(fit)
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.778  -4.204   1.212   4.658  19.285 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.1009     1.2277  -7.413 1.93e-13 ***
## log(gdpPercap)   8.4051     0.1488  56.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared:  0.6522,	Adjusted R-squared:  0.652 
## F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

We have a p-value less than $2*10^-16^$ and we reject the null hypothesis. As a result we can move onto a post-hoc test to see which continents are significant from each other using TukeyHSD to compute the Tukey Honest Significant Differences.

TukeyHSD(fit_anov, "continent", conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = log(gdpPercap) ~ continent, data = gapminder)
## 
## $continent
##                        diff        lwr        upr     p adj
## Americas-Africa   1.3669838  1.1884331  1.5455345 0.0000000
## Asia-Africa       0.8234016  0.6601195  0.9866838 0.0000000
## Europe-Africa     2.0961215  1.9279190  2.2643241 0.0000000
## Oceania-Africa    2.5299717  2.0013218  3.0586215 0.0000000
## Asia-Americas    -0.5435821 -0.7381069 -0.3490574 0.0000000
## Europe-Americas   0.7291378  0.5304649  0.9278107 0.0000000
## Oceania-Americas  1.1629879  0.6238687  1.7021070 0.0000000
## Europe-Asia       1.2727199  1.0876480  1.4577918 0.0000000
## Oceania-Asia      1.7065700  1.1723134  2.2408267 0.0000000
## Oceania-Europe    0.4338501 -0.1019308  0.9696310 0.1759429

Multi-panel figures

Earlier we visualized the change in life expectancy over time across all countries in one plot. Alternatively, we can split this out over multiple panels by adding a layer of facet panels. Focusing only on those countries with names that start with the letter “A” or “Z”.

Tip

We start by subsetting the data. We use the substr function to pull out a part of a character string; in this case, the letters that occur in positions start through stop, inclusive, of the gapminder$country vector. The operator %in% allows us to make multiple comparisons rather than write out long subsetting conditions (in this case, starts.with %in% c("A", "Z") is equivalent to starts.with == "A" | starts.with == "Z")

starts_with <- substr(gapminder$country, start = 1, stop = 1)
AZcountries <- gapminder[starts_with %in% c("A", "Z"), ]
ggplot(data = AZcountries, aes(x = year, y = lifeExp, color=continent)) +
  geom_line() + facet_wrap( ~ country)

Changing axis labels and adding a title

Let’s create a simple scatter plot again and change the axis labels and add a title

ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point()+
  labs(title = "Scatter plot of life expectancy v. GDP per capita",
      subtitle = paste(nrow(gapminder), "data points"))+
  xlab("Log-transformed GDP per capita")+
  ylab("Life expectancy")

Heatmap

Ggplot is able to create heat maps, but it has major limitations… To circumvent this we will install a new package called corrplot using install.packages.

install.packages("corrplot")

Once installation is finished, lets load it into our environment.

library(corrplot)
## corrplot 0.84 loaded

For this example we’re going to use the mtcars data set that is default available on R. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). See ?mtcars for more information regarding this data set.

Let’s take a look at mtcars and calculate the Pearson correlation to use in our heat map.

# Quick look at mtcars.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Pearson correlation is default. See ?cor for other available methods
cor_mtcars <- cor(mtcars) 
head(cor_mtcars)
##             mpg        cyl       disp         hp       drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.4487591  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.0000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.7124406  1.0000000
##             qsec         vs         am       gear       carb
## mpg   0.41868403  0.6640389  0.5998324  0.4802848 -0.5509251
## cyl  -0.59124207 -0.8108118 -0.5226070 -0.4926866  0.5269883
## disp -0.43369788 -0.7104159 -0.5912270 -0.5555692  0.3949769
## hp   -0.70822339 -0.7230967 -0.2432043 -0.1257043  0.7498125
## drat  0.09120476  0.4402785  0.7127111  0.6996101 -0.0907898
## wt   -0.17471588 -0.5549157 -0.6924953 -0.5832870  0.4276059
# Plot heatmap
corrplot(cor_mtcars, method = "circle")

In this heat map both size and colour corresponds to the Pearson coefficient. There are also other methods we can explore:

corrplot(cor_mtcars, method = "square")

corrplot(cor_mtcars, method = "number")

Heat maps are redundant, and we can eliminate this by taking a slice of the map.

corrplot(cor_mtcars, type = "upper")

corrplot(cor_mtcars, type = "lower")

We can also mix together shapes and numbers (or any two combinations of methods) to increase the amount of information we can represent in a heat map.

corrplot.mixed(cor_mtcars, lower = "number", upper = "circle")

The correlation matrix can be reordered according to the correlation coefficient. This is important to identify the hidden structure and pattern in the matrix. There are multiple methods avaialble for clustering, but we will only use the k-means hierarchical clustering method here:

corrplot(cor_mtcars, order = "hclust")

We can also direct corrplot to identify the two distinct groups we can see by eye.

corrplot(cor_mtcars, order = "hclust", addrect = 2)

And also groups that might not be immediately clear to us.

corrplot(cor_mtcars, order = "hclust", addrect = 3)

This is a taste of what you can do with ggplot2 and other plotting resources. R Studio provides a really useful [cheat sheet][cheat] of the different layers available, and more extensive documentation is available on the [ggplot2 website][ggplot-doc]. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify! [cheat]: http://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf [ggplot-doc]: http://docs.ggplot2.org/current/

sample gallery of ggplots!

Key Points

  • Use ggplot2 to create plots.

  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.