Second Week
Major Goals
This week focuses on DNA sequences: how to work with them, where to obtain them if you prefer not to sequence organisms yourself, and how to infer phylogenetic trees.
Lectures and videos provide a step-by-step guide on handling sequence data and conducting phylogenetic analyses. Each day, tutorials will cover a different method, and you must complete each tutorial and its exercises to progress to the next day’s topic.
The week begins with assembling, checking, and exporting your raw sequences (PCR products sequenced last week) to generate high-quality consensus sequences. We start with your DNA sequences to help you become familiar with (1) Sanger DNA sequencing, (2) sequence evaluation (what is the difference between a bad and a good DNA sequence?), (3) ambiguous (wobble) DNA positions and where they come from, and (4) DNA sequences derived from public data repositories.
Note
At the end of the week, you will know…
Different kinds of sequence file types.
How to use public databases.
How to edit sequences.
How to check if sequencing results are correct.
What a multiple sequence alignment is.
What a model of sequence evolution is and why it is important for phylogenetic analysis.
What the difference between clustering algorithms and search algorithms is when constructing phylogenetic trees.
What ML (Maximum Likelihood) and BI (Bayesian Inference) mean.
Monday
Today we will start by recapping what you learned last week and discussing the method of Sanger sequencing. After that, you will start processing your sequencing results in Geneious Prime, i.e., you will assemble, check, and correct the raw reads that have been assigned to you (see sequence assignment list) and export the respective consensus sequences. Then you can start reading the sections about Geneious Prime and GenBank (see Database and Search Strategies), which introduce you to handling sequence data and to using GenBank, a public sequence data repository.
By doing the exercises in this tutorial, you will generate a toy dataset, which you will use for the whole week. All following tutorials and exercises are based on this toy dataset.
The basic idea is that all of you work with the same toy dataset, which makes it easier to compare results. However, it is also fine if you add some of your own sequences (those you checked and exported earlier today).
Tasks of the Day
Read section Geneious Prime and check out the Geneious Prime User Manual.
See the sequence assignment list.
Check which raw reads have been assigned to you.
Open Geneious Prime and create the folder Monday/Tutorial_1 and subfolders for each gene.
Local
├── Monday
│   └── Tutorial_1
│       ├── 18S
│       ├── 28S
│       └── COI
└── …
Download and import your raw sequences to Geneious Prime.
Find the matching raw reads, i.e., the forward and the reverse sequence(s) of the same sample (note that 18S consists of more than two sequences).
Assemble the matching read pairs (Align/Assemble -> De Novo Assemble) and store them in separate subfolders (check the box Save in sub-folder).
Name your consensus sequences in the following format:
<Sample number>_<Genus>_<Species>_<Gene>_<Initials> (e.g., 1_Acrogalumna_longisetosa_18S_BH).
Local
├── Monday
│   └── Tutorial_1
│       ├── 18S
│       │   └── 1_Acrogalumna_longisetosa_18S_BH
│       ├── 28S
│       └── COI
└── …
Check the consensus sequence and correct ambiguous positions.
Export the consensus sequences as FASTA files to your PC.
Upload the consensus files here.
Attention
Never use spaces or special characters (e.g., ä, ., :) in sequence or file names; always separate words with underscores _. Most sequence editors and phylogenetic programs are very sensitive when it comes to sequence names and file formats. You will save a lot of time if your file names are compatible right from the start.
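The naming rule above can also be checked automatically. The following is a minimal Python sketch, not part of the course material: the helper names `is_safe_name` and `consensus_name` are hypothetical, and the allowed character set (ASCII letters, digits, underscores) is an assumption derived from the rule above.

```python
import re

# Assumption: "safe" means ASCII letters, digits and underscores only.
# Individual programs may be stricter or more tolerant than this.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_safe_name(name: str) -> bool:
    """Return True if the name contains no spaces or special characters."""
    return SAFE_NAME.fullmatch(name) is not None


def consensus_name(sample: int, genus: str, species: str, gene: str, initials: str) -> str:
    """Build <Sample number>_<Genus>_<Species>_<Gene>_<Initials>."""
    name = "_".join([str(sample), genus, species, gene, initials])
    if not is_safe_name(name):
        raise ValueError(f"unsafe characters in name: {name!r}")
    return name


print(consensus_name(1, "Acrogalumna", "longisetosa", "18S", "BH"))
print(is_safe_name("Acrogalumna longisetosa.18S"))  # contains a space and a period
```

Running the check before exporting files catches problematic names early, when they are still easy to fix.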
Read sections Database and Search Strategies and Downloading and Saving.
Open NCBI GenBank and select the ‘Nucleotide’ database in your web browser of choice.
Bookmark the page.
Open the form and answer the question. Click here for the form. Please make a copy for yourself! Do not forget to enter your name!
Download the sequences from NCBI GenBank with the accession numbers given in the form as separate sequence files in FASTA format.
Draw a phylogenetic tree of the six major Oribatida groups.
Write the names of the major groups on the branches and the species names from Tutorial 3 at the tips.
Take a picture of your drawing and upload it here.
For all species from Tutorial 3 (you can find the species names again here), download the 18S rDNA sequences from NCBI GenBank (just as you did for EF).
Use the Clipboard option to save all sequences in FASTA format as a single file (name the file Tutorial_5_Oribatida_18S.fas).
Attention
There is no 18S rDNA sequence available for Carabodes femoralis; use Carabodes subarcticus instead. For Platynothrus peltifer, four 18S rDNA sequences are available; download the one with the accession number EF091422 (it is the longest of the four).
Hint
A rule of thumb: If two or more sequences are available for a species, always choose the longest sequence.
What do you consider the key benefits of an online database?
Write down your answer on a sheet of paper.
Take the sequences from Tutorial 3 and copy them to subfolder Tutorial 6.
Local
├── Monday
│   ├── Tutorial_1
│   └── Tutorial_6
└── …
Change all sequence names from GenBank to:
<GENUS>_<SPECIES>_<ACCESSION NUMBER>_<GENE> (e.g., Archegozetes_longisetosus_EF081321_EF).
Local
├── Monday
│   ├── Tutorial_1
│   └── Tutorial_6
│       ├── Archegozetes_longisetosus_EF081321_EF
│       └── …
└── …
Open the file Tutorial_5_Oribatida_18S.fas from Tutorial 5 with your local text editor of choice (e.g., Notepad++, Editor).
Change the sequence names from GenBank just as in Tutorial 6 (<GENUS>_<SPECIES>_<ACCESSION NUMBER>_<GENE>).
Import the file to Geneious Prime in a new subfolder Tutorial_7 (as separate sequences).
Local
├── Monday
│   ├── Tutorial_1
│   ├── Tutorial_5
│   ├── Tutorial_6
│   └── Tutorial_7
│       ├── Archegozetes_longisetosus_EF081321_18S
│       └── …
└── …
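Renaming headers by hand in a text editor works well for a small file; for larger files the same step can be scripted. The Python sketch below is only an illustration: it assumes that each header has the layout `>ACCESSION.VERSION Genus species ...`, the function names are hypothetical, and the description text and the four-base sequence in the example are made-up placeholders.

```python
def rename_header(header: str, gene: str) -> str:
    """Turn a GenBank-style FASTA header into >GENUS_SPECIES_ACCESSION_GENE.

    Assumes the header looks like '>ACCESSION.VERSION Genus species ...';
    headers with a different layout need to be checked by hand.
    """
    fields = header.lstrip(">").split()
    accession = fields[0].split(".")[0]  # drop the version suffix such as '.1'
    genus, species = fields[1], fields[2]
    return f">{genus}_{species}_{accession}_{gene}"


def rename_fasta(text: str, gene: str) -> str:
    """Rename every header line of a (multi-)FASTA text; sequence lines are kept."""
    renamed = []
    for line in text.splitlines():
        renamed.append(rename_header(line, gene) if line.startswith(">") else line)
    return "\n".join(renamed) + "\n"


# Placeholder record: only the accession and species name come from the tutorial.
example = ">EF091422.1 Platynothrus peltifer placeholder description\nACGT\n"
print(rename_fasta(example, "18S"), end="")
```

Always compare a few renamed headers with the originals, because species names with more than two words or unusual header layouts are not handled by this sketch.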
Note
You now have two datasets with +/- identical taxon sampling but with two different genes. Awesome!
Now you can add (import) some of your own 18S rDNA sequences.
Your own sequences should be named following the same logic as the sequences from NCBI.
Since no accession numbers are available for your sequences, you may replace the accession number with own to quickly identify your own sequences among the others, for example: Archegozetes_longisetosus_own_18S.
Important
Do not add more than four of your own sequences, please. It is helpful to keep the dataset small, because larger datasets will require longer running times (i.e., longer waiting time for you). It will also be more difficult to focus on the most relevant information.
Tip
Just in case, you can read about Geneious Prime again under Sections.
Tuesday
Today, we focus on sequence alignments and their significance in analyzing genetic data. In this tutorial, you will perform sequence alignments using your toy datasets with Geneious Prime.
Remember, sequence files—whether aligned or not—can be saved in various file formats, and the required input format may vary depending on the software you use. If the format is incorrect, the software will not function as expected. Understanding the correct input file format is essential to overcoming initial challenges when working with phylogenetic software.
Note
At the end of the day, you will know…
How an alignment is generated by the Needleman-Wunsch algorithm.
How computer algorithms work (in basic terms).
The meaning of penalty values and their effects on alignments.
How to find criteria that will help you to decide if an alignment is good or not.
The difference between sequence file formats, and the difference between multifasta and alignment files and how to recognize them.
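To make the role of penalty values more tangible, here is a minimal Python sketch of a Needleman-Wunsch global alignment. It is a simplification and not the Geneious implementation: it uses one linear gap penalty instead of separate gap opening and extension penalties, and the scores (match = 1, mismatch = -1, gap = -2) are arbitrary example values.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


aligned_a, aligned_b, total = needleman_wunsch("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print(total)
```

Changing the `gap` value and re-running the function shows how a harsher or milder penalty shifts the balance between gaps and mismatches, which is the same effect you explore with the penalty settings in Geneious Prime.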
Important
The different properties of coding and non-coding sequences will not be explained explicitly and we assume that you already know what reading frames are. However, if you are lost, do not hesitate to ask one of the tutors or me.
Tasks of the Day
Use your DNA sequences from Monday, namely from Tutorial 6 and Tutorial 7, to generate alignments in Geneious Prime using the parameters below (keep all other parameters at their default values).
To do this, select all sequences in the respective folder and click on Align/Assemble -> Multiple Align -> Geneious Alignment.
Local
├── Monday
│   ├── …
│   ├── Tutorial_6
│   │   ├── Archegozetes_longisetosus_EF081321_EF
│   │   └── …
│   └── Tutorial_7
│       ├── Archegozetes_longisetosus_EF081321_18S
│       └── …
└── …
Attention
Use a period . not a comma , when typing the penalty values!
Rename the alignments (F2) following the pattern <GENE>_<TUTORIAL>_<ALIGNMENT LETTER>_aln, for example 18S_Tutorial_1_a_aln, and drag or move them to a new subfolder called Tuesday/Tutorial_1.
Local
├── Monday
└── Tuesday
    └── Tutorial_1
        ├── EF_Tutorial_1_a_aln
        ├── EF_Tutorial_1_b_aln
        ├── EF_Tutorial_1_c_aln
        ├── 18S_Tutorial_1_d_aln
        ├── 18S_Tutorial_1_e_aln
        └── 18S_Tutorial_1_f_aln
Open the form and answer the questions. Click here.
Compare your results with your neighbour.
Read section Sequence Editing.
Download the zip file.
Open each file in your local text editor of choice (e.g., Editor or Notepad++ on Windows) and answer the questions given in the form. Click here.
Open the form and answer the questions. Click here.
Wednesday
Today, we have three learning modules:
Note
By the end of the day, you will:
Understand how phylogenetics accounts for evolutionary changes in DNA sequences, including past changes that are not immediately visible.
Grasp the concept of clustering algorithms, their limitations, and their advantages over search algorithms.
Have constructed four phylogenetic trees using your toy dataset.
Experience the process of a clustering algorithm by manually calculating and drawing a UPGMA tree.
Have practiced drawing phylogenetic trees by hand.
Tasks of the Day
Download and install jModelTest 2 on your PC (you may use this direct download link: jmodeltest-2.1.10).
Read section Models of Sequence Evolution.
Use jModelTest to calculate the best fitting model of sequence evolution, for both AIC and BIC calculations (see section Models of Sequence Evolution for how to work with jModelTest).
Use your best trimmed (cut) alignments for EF and 18S, respectively, from Tuesday/Tutorial_1.
Save the HTML log file from jModelTest 2 and answer the questions here.
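The two criteria used in this exercise follow the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters, ln L the log-likelihood, and n the sample size (here taken as the alignment length). The Python sketch below only illustrates the arithmetic: the log-likelihoods, parameter counts and alignment length are invented example values, and the parameter counts include only substitution-model parameters, whereas model selection software may also count branch lengths.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical example values, not results from the course dataset.
models = {
    "JC": {"lnL": -5230.0, "k": 0},
    "GTR+I+G": {"lnL": -5100.0, "k": 10},
}
n_sites = 1000  # hypothetical alignment length

for name, m in models.items():
    print(name, round(aic(m["lnL"], m["k"]), 2), round(bic(m["lnL"], m["k"], n_sites), 2))
```

Because BIC multiplies the parameter count by ln(n), it penalises complex models more strongly than AIC for all but very short alignments, which is one reason the two criteria can prefer different models.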
Read section How to Infer Phylogenetic Trees.
Read section How To Draw Phylogenetic Trees. Don’t be confused—this section primarily focuses on the standalone version of FigTree. However, all the settings explained here are also available in the Geneious Prime plugin.
See also the guide on viewing and formatting trees in Geneious Prime.
Create two subfolders named Wednesday/Tutorial_2/EF and Wednesday/Tutorial_2/18S.
Local
├── Monday
├── Tuesday
│   └── Tutorial_1
└── Wednesday
    └── Tutorial_2
        ├── 18S
        └── EF
Copy your best trimmed alignments from EF and 18S (from Tuesday/Tutorial_1) into their respective subfolders.
For both alignments, calculate an NJ tree using the Jukes-Cantor model of sequence evolution (Tree -> Geneious Tree Builder -> Genetic Distance Model: Jukes-Cantor) with 1000 bootstrap replicates (Resample tree -> Resampling Method: Bootstrap + Number of Replicates: 1000).
Root the tree using Zercon sp. (Click on the end of the branch leading to Zercon sp. and hit Root in the subpanel). Why Zercon sp. again?
Indicate in the file name that this tree uses the Jukes-Cantor model, for example, EF_JC_model.
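For orientation, the Jukes-Cantor model converts the observed proportion of differing sites p between two aligned sequences into a corrected distance d = -3/4 ln(1 - 4p/3), which accounts for multiple substitutions at the same site. The Python sketch below illustrates this formula only; it is not how Geneious Prime computes its distances internally, and the two short sequences are made-up examples.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of differing sites, ignoring gaps and ambiguous bases."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (same length)")
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance; undefined for p >= 0.75."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTACGTAA")  # made-up sequences, one difference in ten sites
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the p-distance, and the gap between the two grows as sequences become more divergent.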
For both alignments calculate a NJ tree using the Tamura-Nei model of sequence evolution and 1000 bootstrap replicates.
Root the tree using Zercon sp.
Indicate in the file name that this tree uses the Tamura-Nei model, for example, EF_TN_model.
Present the trees from Exercise 1 and Exercise 2 as phylograms in PowerPoint.
Display the trees with increasing node order (see the right panel and click on Formatting -> Order branches -> Ordering: increasing) and export them as JPEG (File -> Save as Image File).
Display the NJ trees of EF and 18S on separate slides/pages in PowerPoint (or any other presentation software).
Open the form and answer the questions. Click here. Do not forget to enter your name (but only after you answered all questions – otherwise you know)!
Attention
Complete all exercises and questions by hand with pen and paper!
We will discuss them either in the afternoon or tomorrow morning.
Draw by hand all unrooted tree topologies that are possible for four taxa (A, B, C, D).
In one of the trees, use arrows to indicate where the tree might be rooted.
Draw all possible combinations for a rooted tree with four taxa (A, B, C, D).
How many topologies are possible for a rooted tree with four taxa (A, B, C, D)?
Attention
Some topologies might be redundant!
Draw the following tree, given in Newick format, by hand: ((((A,(B,(C,D))),E),(F,G)),H);
Check your topology using FigTree in Geneious Prime.
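If you want to see how the nested parentheses of a Newick string map onto a tree, the following Python sketch may help. It is a deliberately small parser that only handles taxon names and parentheses (no branch lengths, support values or quoted labels), and the function names are hypothetical. Each `+` in the printed outline marks an internal node, and indentation shows nesting depth.

```python
def parse_newick(text: str):
    """Parse a simple Newick string into nested lists of taxon names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip '('
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # skip ','
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # skip ')'
            return children
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def outline(node, depth=0):
    """Return the tree as indented lines; '+' marks an internal node."""
    if isinstance(node, str):
        return ["  " * depth + node]
    lines = ["  " * depth + "+"]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


tree = parse_newick("((((A,(B,(C,D))),E),(F,G)),H);")
print("\n".join(outline(tree)))
```

The outline is only a reading aid for the bracket structure; drawing the tree by hand remains the actual exercise.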
Why are trees with four taxa interesting to mathematicians compared to trees with two or three taxa?
What is the difference between a cladogram, a phylogram, and a chronogram?
Phylogeography is the study of the genetic structure of species within or between geographic regions. If populations are geographically distant from each other, gene flow is usually reduced and both populations accumulate mutations independently, which increases genetic distance between taxa. If gene flow continues between geographically distant populations, or if they share a common ancestor from which they recently separated, their genetic distance is comparatively small.
Note
In the course of a Master’s thesis, a student investigates the relationships of two populations of the oribatid mite Steganacarus magnus (SM) from Germany (D) and France (F). To understand the relationships between the two populations, the student sequenced the COI mitochondrial gene of seven individuals and generated a matrix that shows the genetic distances between all individuals (see distance matrix under Exercise).
Attention
Do it all by hand with pen and paper!
To infer whether the two populations have a recent common ancestor, draw a UPGMA tree and calculate the length of all tree branches.
Write down the tree with all distance calculations and intermediate distance matrices.
Interpret the tree in a phylogeographic context.
Are both populations genetically separated or are there any indications for gene flow or dispersal?
| | SM_D1 | SM_D2 | SM_D3 | SM_D4 | SM_F1 | SM_F2 | SM_F3 |
|---|---|---|---|---|---|---|---|
| SM_D1 | | | | | | | |
| SM_D2 | 5 | | | | | | |
| SM_D3 | 6 | 1 | | | | | |
| SM_D4 | 42 | 39 | 40 | | | | |
| SM_F1 | 5 | 2 | 3 | 39 | | | |
| SM_F2 | 67 | 68 | 71 | 70 | 68 | | |
| SM_F3 | 72 | 73 | 74 | 72 | 73 | 6 | |
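The UPGMA procedure can be summarised in three repeated steps: join the two closest clusters, place their common node at half their distance, and recompute the distance from the new cluster to all others as a size-weighted average. The Python sketch below illustrates these steps on a hypothetical four-taxon matrix (taxa A to D with invented distances), deliberately not on the exercise data, so that you can compare your hand calculation technique without being given the result.

```python
def upgma(names, distances):
    """Cluster taxa with UPGMA.

    distances maps frozenset({a, b}) to the pairwise distance.
    Returns the merge steps as (cluster_a, cluster_b, node_height).
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of taxa
    dist = dict(distances)
    merges = []
    while len(sizes) > 1:
        # 1. Pick the pair of clusters with the smallest distance.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        # 2. The new node sits at half that distance above the tips.
        height = dist[pair] / 2
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # 3. Distance to every remaining cluster: size-weighted average.
        for other in sizes:
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            dist[frozenset((merged, other))] = (
                (size_a * d_a + size_b * d_b) / (size_a + size_b)
            )
        # Drop all entries that still involve the two merged clusters.
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        sizes[merged] = size_a + size_b
        merges.append((a, b, height))
    return merges


# Hypothetical toy matrix, not the Steganacarus magnus data.
toy = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("B", "C")): 6,
    frozenset(("A", "D")): 10, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
for step in upgma(["A", "B", "C", "D"], toy):
    print(step)
```

A branch length is the height of the parent node minus the height of the child node (tips have height 0), so in the toy example the branch from the node joining A and B up to the node that adds C has length 3.0 - 1.0 = 2.0.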
Thursday
Today, it’s all about search algorithms. You will learn the basics of the two most common methods for calculating phylogenetic trees – Maximum Likelihood in the morning and Bayesian Inference in the afternoon.
Both methods are widely used because they are more thorough than clustering algorithms (such as UPGMA or NJ), and they approach the mathematical part of inferring phylogenetic trees from different angles. You will hear more about this in the lectures that accompany the two sections.
Both programs can be installed as plugins in Geneious Prime. See Tutorial 1 and Tutorial 2 for doing so.
Note
Both programs can also be controlled via the command line – you may use this approach during the third week to improve the computing performance of both RAxML (download here) and MrBayes (download here).
While working through the exercises, many topics you have been dealing with earlier this week will come up again, such as input file format or Models of Sequence Evolution.
Note
At the end of the day you will…
Know the difference between cluster and search algorithms.
Know why search algorithms take so much longer for analysing genetic data than cluster algorithms.
Know that ML uses likelihoods, and MrBayes uses posterior probabilities.
Know what MCMC is and for which type of analysis it is used.
Be able to interpret the different statistics MrBayes provides.
Understand the meaning of prior and posterior probabilities.
Understand the difference between bootstrap support and posterior probabilities and why they are not directly comparable.
Tasks of the Day
Read section RAxML. Don’t be confused—this section primarily focuses on the command-line version of RAxML. However, all the settings explained here are also available in the Geneious Prime plugin.
Install the RAxML plugin in Geneious Prime (Tools -> Plugins -> Available Plugins).
Create two new subfolders for the RAxML analyses of EF and 18S in Geneious Prime.
Local
├── Monday
├── Tuesday
├── Wednesday
└── Thursday
    └── Tutorial_1
        ├── 18S
        └── EF
Copy your best alignments from EF and 18S (from Tuesday/Tutorial_1) into their respective subfolders.
Start the ML analyses (Tree -> RAxML) with the following parameters:
Nucleotide Model: GTR GAMMA I
Algorithm: Rapid bootstrapping and search for best-scoring ML tree (command line: -f a -x 1)
Number of starting trees or bootstrap replicates: 500
Leave all other parameters at their default settings.
Write down how long the analyses took (in seconds).
Read section MrBayes. Don’t be confused—this section primarily focuses on the command-line version of MrBayes. However, all the settings explained here are also available in the Geneious Prime plugin.
Install the MrBayes plugin in Geneious Prime (Tools -> Plugins -> Available Plugins).
Create two new subfolders for the MrBayes analyses of EF and 18S in Geneious Prime.
Local
├── Monday
├── Tuesday
├── Wednesday
└── Thursday
    ├── Tutorial_1
    └── Tutorial_2
        ├── 18S
        └── EF
Copy your best alignments from EF and 18S (from Tuesday/Tutorial_1) into their respective subfolders.
Start the Bayesian Inference with MrBayes (Tree -> MrBayes) using the following parameters:
Use GTR+G+I as the model of sequence evolution (Substitution Model: GTR; Rate Variation: invgamma).
Set the outgroup (Outgroup: Zercon sp.).
Use 1 million generations (Chain Length: 1,000,000) and sample every 100th generation (Subsampling Freq: 100).
Use a burn-in of 10% (Burn-in Length: 1000).
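The relationship between chain length, sampling frequency and burn-in can be cross-checked with a little arithmetic. The Python sketch below assumes that the 10% refers to the sampled trees; whether the plugin's Burn-in Length field counts sampled trees or generations is not stated here, so compare the numbers with what the dialog expects.

```python
chain_length = 1_000_000   # Chain Length from the task
subsampling_freq = 100     # Subsampling Freq from the task
burn_in_percent = 10       # burn-in of 10% from the task

# One tree is stored every `subsampling_freq` generations.
n_samples = chain_length // subsampling_freq
# Assumption: the burn-in percentage is applied to the stored samples.
burn_in_samples = n_samples * burn_in_percent // 100
# The same burn-in expressed in generations.
burn_in_generations = burn_in_samples * subsampling_freq

print(n_samples, burn_in_samples, burn_in_generations)
```

With these settings, 10,000 trees are sampled, and a 10% burn-in corresponds to 1,000 sampled trees or 100,000 generations.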
Write down how long the analysis took (minutes + seconds).
Which parameter-settings deviate from the default settings?
What is the Average standard deviation of split frequencies of your analyses? Use EF_Tutorial_1_b_aln - Posterior output and look for the tab Raw Posterior Output in the lower panel. There you will find a column StdDev(s). Click on Show entire ### bytes (may be very slow) to show the whole output.
Note
The Average standard deviation of split frequencies is a measure used in Bayesian phylogenetic inference to assess convergence and stability of the MCMC (Markov Chain Monte Carlo) chains during the analysis. The split frequency measures how often a particular split appears across all sampled trees from an MCMC chain.
< 0.01 — generally accepted as indicating good convergence; the two MCMC runs are sampling from the same distribution
< 0.05 — acceptable for a preliminary analysis, but ideally you should run the chains longer
> 0.05 — indicates the runs have not converged and results should not be trusted; chains need to run longer
Note
The choice of priors (setting of parameters prior to the analysis) is important for Bayesian Inferences, as they influence the computing time and the search efficiency in the tree space. However, priors are usually unknown, so we will use flat priors instead!
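To illustrate what the average standard deviation of split frequencies measures, the Python sketch below compares the frequencies of the same splits in two independent runs and applies the thresholds from the note on convergence. It is a simplified illustration, not the MrBayes implementation: the split labels and frequencies are invented, the use of the sample standard deviation is an assumption, and no filtering of rare splits is applied.

```python
from statistics import mean, stdev


def asdsf(run_frequencies):
    """Average standard deviation of split frequencies across runs.

    run_frequencies is a list with one dict per run, mapping a split label
    to the fraction of sampled trees that contain this split.
    """
    splits = set().union(*run_frequencies)
    # A split missing from a run has frequency 0 in that run.
    deviations = [stdev([run.get(s, 0.0) for run in run_frequencies]) for s in splits]
    return mean(deviations)


def verdict(value):
    """Apply the convergence thresholds given in the note."""
    if value < 0.01:
        return "good convergence"
    if value < 0.05:
        return "acceptable for a preliminary analysis, run longer if possible"
    return "not converged, run the chains longer"


# Hypothetical split frequencies from two runs.
run1 = {"AB|CD": 0.98, "AC|BD": 0.02}
run2 = {"AB|CD": 0.96, "AC|BD": 0.04}
value = asdsf([run1, run2])
print(round(value, 4), verdict(value))
```

The closer the two runs agree on how often each split occurs, the smaller the value becomes, which is why it is used as an indicator that independent chains sample from the same distribution.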
Open the form and answer the questions. Click here. Do not forget to enter your name!
Import all trees you made into PowerPoint.
Separate the trees according to gene, ML and BI analyses, respectively.
Save them on a DIN A4 page.
Label the nodes with corresponding bootstrap values and posterior probabilities.
What are the main differences between the ML and MrBayes trees?
Friday
Now you know all the essential steps and methods for calculating a phylogenetic tree from sequence data. You may have realized that you had to use different file formats for different programs and different programs for different analyses.
You should also know that you can work with sequence data and make phylogenetic trees in R. One big advantage of using R is that you can do all analyses in one software environment, without reformatting the input files.
The other big advantage of R is that you can do awesome downstream analyses with your phylogenetic tree, such as analysing trait evolution when you have trait data for your taxa, or analysing community data. But this is another story.
This day is dedicated to introducing you to the basic R commands that enable you to calculate a phylogenetic tree. Of course, R walks along the analytical path from sequence to tree in its very own way. However, this may help you to better remember, or even understand, the individual steps involved in building a phylogenetic tree from scratch.
Depending on your current R skills, you may only need to skim some of the sections. You will see which ones are relevant for you.
Note
At the end of the day, you will…
Be more versatile and confident when working with genetic data in R.
Tasks of the Day
Read section Ape package.
Read section Getting Started with R.
Download the R script and the example files here.
Export your sequences from Monday/Tutorial_6 and Monday/Tutorial_7 as FASTA files to your PC. Name them Oribatida_EF.fas and Oribatida_18S.fas, respectively.
Open R or RStudio and set the folder containing the files as the working directory.
Remember to (download and) activate all required packages.
Align the multifasta sequences Oribatida_EF.fas and Oribatida_18S.fas using the msa() function in R.
Use the CLUSTAL algorithm and set 10 and 0.1 as gap opening and gap extension penalties, respectively.
Save the alignments as EF_aln1.fas and 18S_aln1.fas.
Open the alignments in Geneious Prime, check them, and trim them to the shortest sequence.
Export the trimmed alignments as EF_aln2.fas and 18S_aln2.fas to your PC, preferably in the same folder as your other files.
How long (bp) is the untrimmed alignment for 18S and EF?
How long (bp) is the trimmed alignment for 18S and EF?
Calculate a Neighbor Joining tree based on p-distances for EF_aln2.fas and 18S_aln2.fas.
Save the distance matrix for each alignment as csv to your PC; name them distance_EF.csv and distance18S.csv.
Calculate 1000 bootstraps for each tree.
Plot each tree neatly (ladders.right = FALSE, cex = 0.7), displaying bootstrap values as percentages in lightblue text color, enclosed by circles with a white background.
Save the NJ trees with node labels as JPG: 1) NJ_EF.tre with red tip labels and 2) NJ_18S.tre with lightblue tip labels.
Calculate the model of sequence evolution in R for the trimmed alignments EF_aln2.fas and 18S_aln2.fas.
What is the best fitting model for EF and 18S?
Calculate ML trees for EF_aln2.fas and 18S_aln2.fas, respectively.
Root both trees with the outgroup Zercon.
Plot both trees in one graphic, with facing tip labels: EF with green and 18S with yellowgreen tip labels.
Display bootstrap values enclosed in red circles with a pink1 background.
Save both trees in one plot as PDF to your PC and name it ML_EF_18S.pdf.
Are the NJ and ML trees calculated in R similar to the trees calculated in Geneious Prime?
Can you see fundamental differences?
Do you consider both ways (R and MrBayes/RAxML in Geneious Prime) as comparable?