Bio-1B/Basic_Primate_Phylogenetics_Lab.Rmd at master · ds-modules/Bio-1B · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
---
title: "Primate Phylogenetics Lab Code"
author: "Isaac Krone"
date: "2025-06-18"
output: word_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width=7.5, fig.height=5)
```

# Part A: Morphological Analysis

We'll use two software packages (Phangorn and pytools) to build phylogenetic trees and measure their properties. Both of these software packages are written in a coding language called "R" that is widely used by biologists.

## 1) Software installations and loading

If you are using the RStudio application or the RStudio environment in Juypter notebooks, run the code block below by clicking the green play button in the upper right-hand corner. This step will take a moment, and you'll see some normal code ouptut.

```{r install & load packages, message = FALSE}
install.packages("phangorn", quiet = TRUE)
install.packages("phytools", quiet = TRUE)
library(phytools)
library(phangorn)

```

## 2) Data import and explorations

Now that we have installed the proper software, we can load a character matrix. The code below reads the file "Morphological_Data.csv", which is the same character matrix that you see in your lab worksheet.

```{r load data}
#read character matrix
M_characters<- as.matrix(read.csv("Morphological_Data.csv", row.names = 1))
```

To see that this character data matrix is correct, we'll print it out.

```{r show matrix}
M_characters
```

Next, we'll convert this matrix into a format that the `phangorn` package can use. It's still the same data, just arranged a bit differently.

```{r convert to phydat}
M_phydata <- phyDat(M_characters,type = "USER", levels = 0:1)
```

## 3) Tree building and plotting

Now we're ready to build a phylogenetic tree. The `bab` function ("**B**ranch **A**nd **B**ound") finds all of the most parsimonius trees for a given character matrix. All of these trees will be stored with the name `morpho_trees`.

```{r build parsimony tree}
morpho_trees <- bab(M_phydata)
```

Since we are using *Solenodon* as an outgroup, we'll designate it as such using the `root` function.

```{r root tree}
morpho_trees <- root(morpho_trees, "Solenodon", resolve.root = TRUE)
```

Now we'll plot all of our most parsimonious trees.

```{r plot parsimony tree, out.width="100%", fig.cap="Morphological phylogenies of primates"}
par(mai = c(0.2,0.2,0.2,0.2))

plot(morpho_trees)
```

We found 2 most parsimonious trees. But how many trees are possible?

The number of possible phylogenetic trees for any number of tips $n$ is given by the formula below.

$$ ntrees = \frac{(2n-3)!}{2^{n-2}(n-2)!} $$

We can use the function `howmanytrees` to solve this for any given n. We have 10 taxa, but one of them, the outgroup, we already know the position of. So really, we are solving from n=9, since we want to know all of the relationships between the ingroup taxa.

```{r howmanytrees}
howmanytrees(9)
```

Use this to answer Question 4A on your lab handout.

The code below measures the parsimony of our two most parsimonious morphological trees.

```{r measure parsimony}
paste("Tree 1 parsimony score:", parsimony(morpho_trees[[1]],M_phydata))
paste("Tree 2 parsimony score:", parsimony(morpho_trees[[2]],M_phydata))

```

Use this to answer question 4B on your lab handout.

**Work on questions 4 - 13 on your lab handout before moving to part B.**

# Part B: Molecular Phylogeny

## 1) Data import and explorations

Using the same method we used for the morphological data, we can build a parsimony tree for the Epsilon Hemoglobin Gene.

First, load in genetic data. These data are stored a bit differently.

```{r load Epsilon Hemoglobin}
Hemoglobin <- read.phyDat(file = "Epsilon_Cooked.fasta", format = "fasta")
```

Since we are working with DNA data, we have 4 character states corresponding to the 4 bases in DNA, "A","G","C", and "T".

Let's take a look at what these data look like. The code below plots out a visual representation of the character matrix. Think of it as several DNA strands stretched out and laid next to each other. On the x axis, we have **sites**. A **site** is a specific base on a DNA strand.

```{r plot alignment, out.width="100%", fig.cap="Epsilon Hemoglobin gene alignment."}
#set up the plot space
par(mai = c(1, 1, 0.5, 0.1), cex.lab = 0.1)
#plot out our alignment
image(Hemoglobin)
mtext(text="site", side=1, line=2)
```

Notice all the black space: this corresponds to sites where genetic data are missing for certain taxa. It could be missing because of problems sequencing the DNA. But it's more likely that those data are missing because the animals don't have that chunk of DNA that some of the others do. Grey sites, marked "N" are places where we know there is DNA, but it's not clear which base is there. There is only one "N" site in our dataset, so we don't need to worry about this.

Looking closely at this plot, you can see that there are a lot of sites at which every one of our ten taxa has the same character state. For instance, just before base 500, we can see several uninterrupted vertical bars of color. Since all of the taxa have the same base at these sites, these **invariant sites** can't give us any useful phylogenetic information. Instead, we need to look at **variable sites**.

```{r inspect Hemoglobin}
paste("Number of sites:", length(attr(Hemoglobin,"index")))
paste("Number of variable sites:", length(Hemoglobin[[1]]))
```

What does this mean?

-   **the number of sites** is just how many sites are present in our dataset. This is the same as the length of the x-axis in the plot above.

-   the number of **variable sites** is the number of sites from the plot above at which at least one taxon has a different base than all of the other taxa.

Pause here and answer Question 14 in your lab worksheet.

## 2) Tree building, plotting, and character mapping

Once again, we'll use `bab` to find all the most parsimonious trees. Now, we find only one most parsimonious tree.

```{r parsimony hemoglobin tree,out.width="100%", fig.cap="Epsilon-Hemoglobin tree of primates."}
Hemoglobin_tree <- bab(Hemoglobin)

Hemoglobin_tree <- root(Hemoglobin_tree, "Goat", resolve.root = TRUE)

par(mai = c(0.2,0.2,0.2,0.2))
plot(Hemoglobin_tree)
```

Just as before, we can plot character state changes on the tree to understand the course of character evolution. For instance, for character #160, we can map each time the character state has changed. We can use our program to find the character states of #160 for each of the taxa we study. As before, assume that the character state in the outgroup (here, a goat) is the ancestral state.

```{r character states for molecular data}
matrix(c("a","c","g","t")[sapply(1:10, function(x) Hemoglobin[[x]][160])],
       dimnames = list(names(Hemoglobin),"character 160"),
       nrow = 10, ncol = 1, byrow = FALSE
       )
```

Just like with our morphological data, we can map every single molecular character onto this tree, and we would find the number of state changes across the tree is equal to the tree's parsimony score.

Why would we want to do this with molecular data, though? We usually aren't interested in how individual molecular characters have evolved, but we might be interested in *how much evolution has happened* on a tree. By mapping all of our molecular character state changes onto our tree, we can see how many characters differentiate different taxa, and because we have a large number of characters, this can give us an absolute measure of how different different species are. This measurement is called a **branch length**.

Rather than map all of the characters ourselves, we can do this computationally. The `acctran` function will count the number of character changes across the phylogeny. In the next tree, we will plot the number of character changes on each branch to indicate the **branch length**.

```{r add branch lengths, out.width="100%", fig.cap="Epsilon hemoglobin tree with numeric branch lengths."}
Hemoglobin_tree_branchlengths <- acctran(Hemoglobin_tree, Hemoglobin)
par(mai = c(0.2,0.2,0.1,0.2))
plot(Hemoglobin_tree_branchlengths,
           use.edge.length = FALSE)
edgelabels(round(Hemoglobin_tree_branchlengths[[1]]$edge.length,0),
           adj = c(0.5, 0.5),
           frame = "rect",
           col = "white", bg = "#333333")
```

Look at the branch leading to the Gorilla. This branch has a length of 5, meaning that there are 5 characters that changed between the gorilla and the most recent common ancestor of gorillas, humans, and chimpanzees. Likewise, the 6 on the branch leading from that ancestor to the common ancestor of humans and chimpanzees means that there were 6 molecular changes in between these two ancestors. That means that there are a total of 11 (5+6) differences between the gorilla and the most recent common ancestor of humans and chimpanzees.

Use this to answer Question 16 on your worksheet.

Note: often, when you see published phylogenetic trees, the figured branch lengths will all be to scale; so a branch of length 10 will be twice as long on the page as a branch of length 5. The tree above would look like this:

```{r plot scaled branch lengths, out.width="100%", fig.cap="Epsilon Hemoglobin tree with scaled branch lengths"}
Hemoglobin_tree_branchlengths <- acctran(Hemoglobin_tree, Hemoglobin)
par(mai = c(0.2,0.2,0.1,0.2))
plot(Hemoglobin_tree_branchlengths,
           use.edge.length = TRUE)
```

**Complete all the questions including the synthesis questions on your worksheet as a group. Sign the contribution sheet and turn in your work before you leave the lab.**