---
title: "2. Introduction to {ggplot2}"
author: "Britta Schumacher"
date: "Last updated: `r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    theme: yeti
    toc: yes
    toc_float: true
---
*FIRST*: Many thanks to Allison Horst for her gorgeous [aRt](https://github.com/allisonhorst/stats-illustrations).
# Data Visualization with `ggplot2`
### What is `ggplot2`?
`ggplot2` is the bread and butter of data visualization; it takes clean, organized, manipulated data, and (with your direction) builds beautiful, & more importantly, *communicative*, plots. `ggplot2` is based on the *grammar of graphics*, which asserts that we can build every graph from the same few components: 1) a clean and tidy **data set**; 2) a set of **geoms** that map out data points; and 3) a **coordinate system** (in the broadest sense; i.e., how will the data be mapped onto a graphical surface).
To actually display data values, we map out our variables to aesthetic things like **size**, **color**, and **x** and **y** locations; we also tell `ggplot2` the type of visualization we are interested in building (e.g., bar graph, box plot, line graph, density plots, etc.). `ggplot2` opens up [data science](https://r4ds.had.co.nz/introduction.html) to broader audiences and helps all of us communicate our science. Download [this](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) cheat sheet and save it somewhere accessible--it's an incredible tool to refer back to!
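To make those components concrete, here is a minimal sketch. It assumes `mtcars`, a data set that ships with R and is *not* part of this workshop's data; the column choices are for illustration only.

```{r}
library(ggplot2)

# 1) data: mtcars, a built-in data.frame (an assumption, used only for illustration)
# 2) geom: points, with variables mapped to the x, y, and color aesthetics
# 3) coordinate system: Cartesian (the default, written out here to make it visible)
p_sketch <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  coord_cartesian()

p_sketch
```

Swapping `geom_point()` for another geom changes the type of plot without touching the data or the aesthetic mappings.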
`ggplot2` works with `data.frames`, the data type we built in our previous `tidyverse` workshop. The data we feed into `ggplot2` consists of rows and columns that "live peacefully" together in a tidy `data.frame`. This means, often, we need to do some data cleaning up front to get our data into the **tidy** format we need for visualization.
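As a rough sketch of what that up-front cleaning can look like, the chunk below reshapes a small "wide" table into tidy format with `tidyr::pivot_longer()`. The table and its values are made up for illustration; they are not from the gapminder data.

```{r}
library(tidyr)

# A made-up "wide" table: one column per year (values are illustrative only)
wide_df <- data.frame(
  country = c("A", "B"),
  `2002` = c(50, 60),
  `2007` = c(52, 63),
  check.names = FALSE # keep the year column names as-is
)

# Tidy it: one row per country-year observation, one column per variable
long_df <- pivot_longer(wide_df,
                        cols = c(`2002`, `2007`),
                        names_to = "year",
                        values_to = "lifeExp")
long_df
```

In the tidy version, `year` and `lifeExp` are ordinary columns, so they can be mapped straight to aesthetics like `x` and `y`.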
# Now let's practice on some REAL*ish* data!
### Load the tidyverse and any additional required packages:
```{r}
library(tidyverse)
g <- read.csv("data/gapminder_data.csv")
```
### Explore:
We should first familiarize ourselves with the data.
```{r, eval = FALSE}
dim(g) # view dimensions of the df
head(g) # view first 6 rows of df (the default)
tail(g) # view last 6 rows of df (the default)
str(g) # view data structure of df
colnames(g) # view the columns of df
```
### Wrangle:
This dataset is **pretty big**--we'll want to wrangle it so that it only includes the information that we're interested in. We will:
a. filter for the African continent
b. select relevant columns of data
c. rename columns
d. create new columns.
Let's create an efficient workflow by combining **ALL** of these data wrangling steps into one; i.e., let's review harnessing the great POWERS of `tidyverse`!
```{r}
africa_simple <- g %>%
  filter(continent %in% c("Africa")) %>%
  select(1:3, lifeExp, gdpPercap) %>%
  rename(gdp = gdpPercap) %>%
  mutate(income_class = case_when(
    gdp < 1000 ~ "low",
    (gdp >= 1000 & gdp <= 6000) ~ "middle",
    gdp > 6000 ~ "high"))

# recall how we save data
saveRDS(africa_simple, "./out-data/africa-tidy.RDS")
```
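To check how `case_when()` treats values right at the cut-offs, here is a minimal sketch on a few made-up GDP values (not real data). It reuses the same three conditions as the chunk above.

```{r}
library(dplyr)

# Made-up GDP values chosen to sit on and around the 1000 and 6000 cut-offs
gdp_check <- tibble(gdp = c(500, 1000, 6000, 6000.01)) %>%
  mutate(income_class = case_when(
    gdp < 1000 ~ "low",
    (gdp >= 1000 & gdp <= 6000) ~ "middle", # both cut-offs fall in "middle"
    gdp > 6000 ~ "high"))

gdp_check
```

Any row that matches none of the conditions (e.g., a missing `gdp`) gets `NA` for `income_class`.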
With this simplified and cleaned data set, we're ready to explore! Let's first isolate data we want to visualize by:
a. selecting Malawi, Tanzania, Mozambique, Zimbabwe, and Zambia
b. grouping observations by country
c. finding the average life expectancy and per capita GDP for each country
```{r}
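# despite its name, `malawi` holds one summary row for each of the five countries
malawi <- africa_simple %>%
  filter(country %in% c("Malawi", "Tanzania", "Mozambique", "Zimbabwe", "Zambia")) %>%
  group_by(country) %>%
  summarize(ave_life = mean(lifeExp), # mean() returns one value per group
            ave_GDP = mean(gdp))

# the same five countries, keeping every yearly observation
malawi_long <- africa_simple %>%
  filter(country %in% c("Malawi", "Tanzania", "Mozambique", "Zimbabwe", "Zambia"))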
```
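If the group-then-summarize pattern is new, this small sketch shows it on made-up numbers (the countries "A" and "B" and their values are hypothetical): each group collapses to a single row holding that group's mean.

```{r}
library(dplyr)

# Made-up observations: two hypothetical countries, two rows each
toy <- tibble(
  country = c("A", "A", "B", "B"),
  lifeExp = c(40, 50, 60, 70)
)

# One row per country, holding the mean of that country's rows
toy_means <- toy %>%
  group_by(country) %>%
  summarize(ave_life = mean(lifeExp))

toy_means
```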
### Column graph
Now that we have our data summarised and in tidy format, we're ready to make a plot! We want to:
a. create a column graph showing the summarised data by country
b. make it pretty
**Note:** Only the first two lines of the following code (the `ggplot()` call and `geom_col()`) are necessary to make the plot. Everything else simply modifies the appearance and makes it a bit more presentable. There are *tons* of ways to customize plots -- we explore only a few options below.
```{r, fig.align = 'center', fig.width = 15, fig.height = 10}
ggplot(malawi, aes(x = country, y = ave_life, fill = country)) + # map country to x and fill, average life expectancy to y
  geom_col(position = "dodge") + # draw one column per country
  ggtitle("Average Life Expectancy") + # add a title
  labs(x = "Country", y = "Average Life Expectancy", fill = "") + # add labels on the axes and legend
  scale_fill_manual(values = c("darkseagreen3", "cadetblue", "orange", "grey", "black")) + # set the fill colors manually
  theme(legend.position = "none", # hide the legend
        axis.text = element_text(size = 17),
        axis.title = element_text(size = 22),
        plot.title = element_text(size = 34),
        axis.text.x = element_text(angle = 45, hjust = 0.9)) # change the angle of the x axis text, change the horizontal adjustment
```
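To see how little code a bare-bones column graph needs, here is a sketch with a made-up three-row data frame (hypothetical countries and values, not the workshop data):

```{r}
library(ggplot2)

# Made-up summary data: one row per hypothetical country
toy_cols <- data.frame(
  country  = c("A", "B", "C"),
  ave_life = c(45, 52, 58)
)

# The minimal plot: data + aesthetic mappings + one geom
p_min <- ggplot(toy_cols, aes(x = country, y = ave_life, fill = country)) +
  geom_col()

p_min
```

Titles, labels, colors, and theme tweaks can then be layered on one `+` at a time.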
### Scatter plot
Let's visualize this data in a different way. Let's:
a. create a scatter plot that shows how GDP per capita and life expectancy relate, by country
b. make it pretty
```{r fig.align = 'center', fig.width = 15, fig.height = 10}
ggplot(malawi_long) + # use every yearly observation, not the summary
  geom_point(aes(x = lifeExp, y = gdp, color = country), size = 7) + # one point per country-year, colored by country
  ggtitle("Life Expectancy and GDP Per Capita, by Country") + # add a title
  labs(x = "Life Expectancy (years)", y = "GDP Per Capita ($)", color = "Country") + # add labels on the axes and legend
  scale_color_manual(values = c("#44AA99", "#88CCEE", "#DDCC77", "#CC6677", "#117733")) + # set the point colors manually
  theme_bw(base_size = 22) # change the theme
```
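The chunk above uses `geom_point()`, so it draws points rather than lines. When we want to show change over time, `geom_line()` connects each country's observations instead. Here is a sketch on made-up values (the countries, years, and numbers are hypothetical):

```{r}
library(ggplot2)

# Made-up values for two hypothetical countries over three years
toy_years <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1997, 2002, 2007), times = 2),
  lifeExp = c(45, 47, 50, 52, 51, 55)
)

# Mapping color to country also groups the data, so we get one line per country
p_line <- ggplot(toy_years, aes(x = year, y = lifeExp, color = country)) +
  geom_line() +
  geom_point() + # mark the individual observations on each line
  labs(x = "Year", y = "Life Expectancy (years)", color = "Country")

p_line
```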
### Box plot
Let's visualize this data in yet another way. Let's:
a. create a box plot that shows the distribution of life expectancy by country
b. make it pretty
```{r}
malawi_long <- malawi_long %>% mutate(country = as_factor(country)) # convert country to a factor
ggplot(malawi_long, aes(x = country, y = lifeExp, fill = country)) +
  geom_boxplot() + # make a boxplot
  labs(x = "", y = "Life Expectancy", fill = "Country") + # give the graph labels
  theme_minimal() + # change the theme
  scale_fill_brewer(palette = "Dark2") + # use a preset palette
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # adjust the angle of the x axis text
  theme(legend.position = "none") + # hide the legend (use "bottom" to show it below the plot)
  guides(fill = guide_legend(nrow = 2)) # two legend rows; no visible effect while the legend is hidden
```
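The chunk above uses `geom_boxplot()`, which summarizes each country's values as a box. A histogram instead bins a single continuous variable and counts the observations in each bin. Here is a sketch with `geom_histogram()` on made-up life expectancy values (hypothetical numbers; the bin width is an arbitrary choice):

```{r}
library(ggplot2)

# Made-up life expectancy values (illustrative only)
toy_life <- data.frame(lifeExp = c(38, 41, 44, 45, 47, 49, 52, 53, 57, 61))

# 5-year-wide bins, with a bin edge at 35
p_hist <- ggplot(toy_life, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, boundary = 35, fill = "cadetblue", color = "white") +
  labs(x = "Life Expectancy (years)", y = "Count")

p_hist
```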
### Plots, plots, plots, plots, plots, plots, plots!
There are MANY different ways to visualize the same data. In the end, we just have to choose the method that makes the most sense for our audience, our research questions, and our data!
Additionally, aRtists everywhere are using R to make ridiculously cool pieces of [generative](https://en.wikipedia.org/wiki/Generative_art) and patterned art. Check out some talented RLadies [here](https://ijeamaka-anyene.netlify.app/) and [here](https://djnavarro.net/projects/)! Plus, [#TidyTuesday](https://github.com/rfordatascience/tidytuesday) submissions can be [absolutely](https://tanyadoesscience.com/project/tidytuesday/), [wildly](https://martindevaux.com/2021/01/tidytuesday-art-collections/), [brilliantly](https://codingwithavery.com/posts/2020-08-12-tidy-tuesday-avatar/), good.
# Bonus
What's *especially* neat about `ggplot2` is that it interfaces seamlessly with `sf`, a package for spatial data wrangling, analysis, and viz. `sf` treats its spatial objects as `data.frames` meaning.... we can use all of the magic & power of the `tidyverse` to play with the data! Plus, these data look just like any other data.frame, except attached to them is a column `geometry` that houses the coordinate information for the observation!
This is how we build the maps in our OSDS reports!
There are many, many ways to make beautiful maps in `ggplot2` using `sf`. Anyone who says R isn't built for spatial viz just hasn't tried hard enough--check out [this](https://timogrossenbacher.ch/2019/04/bivariate-maps-with-ggplot2-and-sf/) if you don't believe me (just about the most gorgeous, creative, and well-built map I've ever seen).
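As a minimal sketch of that workflow, the chunk below maps the North Carolina counties shapefile that ships with the `sf` package. This example file is an assumption for illustration; it is not the data behind our reports, and `AREA` is simply one of its columns.

```{r}
library(ggplot2)
library(sf)

# Read the example shapefile bundled with sf (North Carolina counties)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# An sf object is a data.frame with an extra `geometry` column
class(nc)

# geom_sf() finds the geometry column on its own; other columns map to aesthetics as usual
p_map <- ggplot(nc) +
  geom_sf(aes(fill = AREA)) +
  scale_fill_viridis_c() +
  theme_minimal()

p_map
```

Because `nc` is still a `data.frame`, verbs like `filter()` and `mutate()` can be piped in before plotting.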
-----------
# Resources
* Emily Burchfield's data viz materials, [here](https://www.emilyburchfield.org/courses/data_viz/intro_to_ggplot_tutorial).
* [Secrets of a happy graphing life](https://stat545.com/secrets.html).
* [The ggplot2 cheatsheet!](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
* Chapters 5 - 12 in [this online textbook](https://clauswilke.com/dataviz/visualizing-amounts.html) provide a great overview of when to use different visualization strategies.