Phylogenetics articles from across Nature Portfolio

Phylogenetics is the attempt to reconstruct the evolutionary relationships between species. Historically, this was done using quantitative morphological data, but modern methods rely more heavily on DNA sequence data.

Frequent jumps from human hosts

An analysis of publicly available viral genomes explores the evolutionary dynamics of host jumps and shows that humans are as much a source of viral spillover events to other animals as they are recipients.

  • Christophe Dessimoz

Latest Research and Reviews

Wolbachia supergroup A in Enoplognatha latimana (Araneae: Theridiidae) in Poland as an example of possible horizontal transfer of bacteria

  • Edyta Konecka
  • Paweł Szymkowiak

Characterization of tumor evolution by functional clonality and phylogenetics in hepatocellular carcinoma

Clonality study in HCC finds diverse evolution patterns. Linear HCC is less aggressive, with GTF2IRD2B driver mutations. Non-linear has shallow/deep branching patterns with frequent TP53 driver mutations.

  • Zeynep Kacar
  • Xin Wei Wang

Anaerobic fungi in the tortoise alimentary tract illuminate early stages of host-fungal symbiosis and Neocallimastigomycota evolution

Here, Pratt et al identify a community of deep-branching anaerobic fungi in tortoise feces. Multiple characteristics underpinning their success in tortoise, as opposed to scarcity in mammalian alimentary tracts, are presented.

  • Carrie J. Pratt
  • Casey H. Meili
  • Noha H. Youssef

Revealing uncertainty in the status of biodiversity change

This study presents an approach to deal with spatial, temporal and phylogenetic non-independence in large-scale analyses of biodiversity change, improving trend estimation and inference across scales.

  • T. F. Johnson
  • A. P. Beckerman
  • R. P. Freckleton

The complete mitochondrial genome of Chibiraga houshuaii (Lepidoptera, Limacodidae) and its phylogenetic implications

  • Yanpeng Cai

Evolution, types, and distribution of flight control devices on wings and elytra in bark beetles

  • Jakub Białkowski
  • Robert Rossa
  • Jakub Goczał

News and Comment

Comparative genomics uncover the evolutionary history of butterfly and moth chromosomes

Using over 200 chromosomal genomes to reconstruct 250 million years of evolutionary history, we define the 32 linkage groups (Merian elements) that were present in the ancestor of Lepidoptera. We chart the dynamics of chromosome fusion and fission that accompanied the global diversification of Lepidoptera.

A decaploid pitcher plant genome reveals a novel role for recessive subgenomes

Genomic polyploidy is prevalent in the plant kingdom, giving rise to dominant and recessive subgenomes. We show that the recessive subgenomes of the pitcher plant Nepenthes gracilis had a crucial role in the acquisition of novel genes with species-specific function such as dioecy and carnivory.

Evolutionary genomic analysis for ALL

The advantage of genomic monitoring over cytogenetics for clinical assessment of leukemia is illustrated by a case of pediatric acute lymphoblastic leukemia in which a lesion underlying lethal end-stage myeloid disease could be detected by whole-genome sequencing years before the risk manifested cytogenetically.

  • Lauren M. Harmon
  • Timothy J. Triche Jr

The origin of animals and fungi

By studying the closest extant unicellular relatives of animals and fungi, a study in Nature contrasts the evolutionary trajectories leading to the origin of these groups.

John Macfarlane was the first to recognize Eukaryota as a group

  • Yegor Shɨshkin

Open Access

Peer-reviewed

Research Article

A workflow with R: Phylogenetic analyses and visualizations using mitochondrial cytochrome b gene sequences

Contributed equally to this work with: Emine Toparslan, Kemal Karabag, Ugur Bilge

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft

Current address: Department of Agricultural Biotechnology, Faculty of Agriculture, Akdeniz University, Antalya, Turkey

Affiliation Institute of Natural and Applied Sciences, Akdeniz University, Antalya, Turkey

Roles Conceptualization, Investigation, Resources, Validation, Visualization, Writing – review & editing

Affiliation Department of Agricultural Biotechnology, Faculty of Agriculture, Akdeniz University, Antalya, Turkey

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – review & editing

* E-mail: [email protected]

Affiliation Department of Biostatistics and Medical Informatics, Faculty of Medicine, Akdeniz University, Antalya, Turkey

  • Emine Toparslan
  • Kemal Karabag
  • Ugur Bilge

PLOS

  • Published: December 15, 2020
  • https://doi.org/10.1371/journal.pone.0243927

Abstract

Phylogenetic analyses can provide a wealth of information about the past demography of a population and the level of genetic diversity within and between species. Special computer programs developed in recent years have produced large amounts of data in molecular genetics. To analyze these data, powerful new computation-intensive methods have been implemented in various software packages and programs. However, these programs have their own specific input and output formats, and users need to create a different input format for almost every program. R is an open-source software environment that supports open contribution and modification to its libraries, and it makes it possible to perform several analyses using a single input file format. In this article, using a multiple-sequence FASTA format file (.fas extension), we demonstrate and share a workflow for extracting haplotypes and performing phylogenetic analyses and visualizations in R. As an example dataset, we used 120 Bombus terrestris dalmatinus mitochondrial cytochrome b gene (cyt b) sequences (373 bp) collected from eight different beehives in Antalya. This article presents a short guide on how to perform phylogenetic analyses using R and RStudio.

Citation: Toparslan E, Karabag K, Bilge U (2020) A workflow with R: Phylogenetic analyses and visualizations using mitochondrial cytochrome b gene sequences. PLoS ONE 15(12): e0243927. https://doi.org/10.1371/journal.pone.0243927

Editor: Michael Scott Brewer, East Carolina University, UNITED STATES

Received: July 20, 2020; Accepted: November 30, 2020; Published: December 15, 2020

Copyright: © 2020 Toparslan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting information files.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Phylogenetic relationships are mostly calculated using computer programs with several mathematical models. Although there are many software packages to estimate parameters, they do not work together in a common workflow that can compute these parameters in one task [ 1 ]. Each package expects its dataset in a different input file format, so users have to prepare a different input file for almost every program. These formats include .fas, .mas, .meg, .arp, .gtx, .str, .nwk, .tree and .txt files. This way of working increases workload and wastes time [ 2 ]. There is therefore a need for a platform where analyses can be performed in a single framework. Although R is a software environment for statistical computing and graphics, it is increasingly used in bioinformatics and phylogenetic data analysis thanks to advanced packages and libraries [ 2 – 11 ].

R is an environment for linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering [ 10 ] and graphics. It is free and it enables static and dynamic program analyses [ 12 ]. RStudio, in turn, is an open-source Integrated Development Environment (IDE) for the R programming language. It combines various components of R, such as the console, source editing, graphics, history, and help, in one workbench [ 13 ].

With R and RStudio, an R Markdown document can be created through the rmarkdown package; it can store all code snippets, analyses, results, and images in one document [ 14 ]. With the knitr package, a new Markdown file is created and converted into different file formats such as PDF, HTML, and Word by using the pandoc() function [ 15 ]. Additionally, journal articles and multi-part books can be written, and websites and blogs can be generated, with R Markdown [ 14 ]. These tools therefore provide a wide range of options and are quite practical.

Mitochondrial DNA (mtDNA) is one of the strongest biomarkers used to estimate phylogenetic relationships. It is frequently used in phylogenetic research, and individuals can be grouped into haplotypes by defining the variations in the mtDNA of every population. Moreover, a haplotype network based on nucleotide differences between haplotypes can be created [ 16 ].

We demonstrate how to perform phylogenetic analyses and produce graphics in a single workflow using R for mtDNA sequences. We also share all the R commands used in these analyses. Some of these commands can be used directly, and some are written in a modifiable form for users whose samples differ in sequence length and number.

Packages, libraries and commands

Because R is open-source software, a huge number of packages have been created to date [ 10 ]. R packages, stored as a library in the R environment, are a combination of functions, commands and sample data. The packages and commands used in this study are described below. The R code and mitochondrial cyt b sequences used in our article are shared as S1 and S2 Appendices.

msa: Multiple alignment analysis package.

The msa package (version 1.18.0) is a unified R/Bioconductor interface that implements three multiple sequence alignment methods (ClustalW, ClustalOmega and MUSCLE). These methods do not need any external software tools because they are all integrated in the package. The type of sequences contained in the FASTA format file (.fa, .fas, or .fasta) or FASTQ file is specified by a single string, which can be "B", "DNA", "RNA" or "AA". The readDNAStringSet() function and its family ( readBStringSet() , readRNAStringSet() , readAAStringSet() ) load sequences from an input file (or multiple input files) into an XStringSet object [ 4 , 17 ]. Results are stored as objects provided by the Biostrings package, so the multiple sequence alignment objects inherit from Biostrings classes [ 4 ].

The msaConvert() command converts multiple sequence alignment objects to formats used in other analysis packages. It supports six target formats: "seqinr::alignment" , "bios2mds::align" , "ape::AAbin" , "ape::DNAbin" , "phangorn::phyDat" , and "bio3d::fasta" [ 4 ].

bios2mds: From biological sequences to multidimensional scaling.

This package (version 1.2.3) analyzes biological sequences by metric Multidimensional Scaling (MDS). It provides functions for reading multiple sequence alignments, exporting aligned objects, calculating distance matrices, performing MDS analysis, and visualizing results [ 8 ].

adegenet: Exploratory analysis of genetic and genomic data.

The adegenet package (version 2.1.3) is a toolset for exploring genetic and genomic data. It provides the class "genind" for hierarchical population structure, the class "genpop" for allele counts by population, and the class "genlight" for genome-wide SNP data [ 5 ].

The fasta2DNAbin() command reads FASTA format files and outputs a DNAbin object containing either the full alignments or only SNPs. It is memory-efficient, which allows it to process massive datasets [ 5 ].

ape: Analyses of phylogenetics and evolution.

ape (version 5.3) is a package that can read, write, plot and manipulate phylogenetic data. It also offers many functions for comparing these data in a phylogenetic framework, analyzing ancestral characters, and reading and writing nucleotide sequences [ 9 ]. The dist.dna() function computes a matrix of pairwise distances from DNA sequences [ 9 ]. The nj() function performs the neighbor-joining tree estimation of Saitou and Nei [ 18 ]. The boot.phylo() function performs the bootstrap analysis, and it can be used with the ggtree() command to visualize the phylogenetic tree with bootstrap values. Additionally, ape can extract data from Bioconductor and work together with the adegenet and pegas packages [ 2 , 7 ].

ggtree: An R package for visualization of tree and annotation data.

ggtree, an R/Bioconductor package (version 2.0.4), was created for visualizing phylogenetic analyses, including annotation of the phylogenetic tree and other phylogenetic relationship structures. The ggtree() command visualizes the phylogenetic tree from a tree object or from a phylo object obtained with the as.phylo() command [ 11 ].

ggplot2: Create elegant data visualizations using the grammar of graphics.

The ggplot2 package (version 3.3.1), which ggtree extends, generates graphics declaratively based on the Grammar of Graphics [ 11 , 19 ]. The user specifies how to map variables to aesthetics and which graphical primitives to use. Graphics obtained with the ggtree package can then be refined with a set of ggplot2-style layers such as geom_tiplab() or geom_treescale() [ 11 , 20 ].

stats-package: The R stats package.

This package (version 3.6.3), which includes commands for statistical calculations and random number generation, provides methods for hierarchical cluster analysis based on a set of dissimilarities. The dist() command calculates the distances between the rows of a data matrix. It can use six different distance measures: "euclidean" , "maximum" , "manhattan" , "canberra" , "binary" or "minkowski" . The heatmap() function, also provided by this package, draws a heat map from such a distance matrix [ 10 ].
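As a minimal sketch of how dist() and heatmap() fit together, the following uses a small made-up numeric matrix; the values and the object names ( m , d_euc , d_man , full ) are illustrative assumptions and are not part of the study data.

```r
# Toy numeric matrix (hypothetical values, not from the study): 3 rows, 2 columns.
m <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("x", "y")))

# dist() compares rows; the method argument selects the distance measure.
d_euc <- dist(m, method = "euclidean")
d_man <- dist(m, method = "manhattan")

# as.matrix() expands the compact "dist" object into a full symmetric matrix,
# which is the form heatmap() expects.
full <- as.matrix(d_euc)
# heatmap(full, symm = TRUE)  # uncomment to draw; opens a graphics device
```

The heatmap() call is left commented out so that the sketch runs without opening a graphics device.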

haplotypes: Manipulating DNA sequences and estimating unambiguous haplotype network with statistical parsimony.

The haplotypes package (version 1.1.2) reads and manipulates aligned DNA sequences, supports indel coding methods, shows base substitutions and indels, and calculates absolute pairwise distances between DNA sequences. It infers haplotypes from identical DNA sequences or from an absolute pairwise character difference matrix. Furthermore, this package estimates genealogical relationships among haplotypes using statistical parsimony and plots the resulting networks [ 3 ].

pegas: Population and evolutionary genetics analysis system.

The pegas package (version 0.13) provides commands for reading and writing different DNA sequence file formats, including VCF files. It generates plots and analyzes and manipulates allelic and haplotypic data. It requires the ape and adegenet packages, forming an integrated environment for population genetic data analysis. Additionally, it performs analyses of basic statistics, linkage disequilibrium, Fst and AMOVA, HWE, haplotype networks, minimum spanning trees and networks, and median-joining networks [ 2 ].

Material and method

As an example, we used a dataset of mitochondrial cyt b sequences (373 bp) from 120 Bombus terrestris dalmatinus individuals belonging to 8 different populations (Aksu = 15, Bayatbadem = 15, Demre = 15, Phaselis = 15, Geyikbayir = 15, Kumluca = 15, Termessos = 15, Firm = 15). Populations were grouped according to the regions from which they were collected: the Aksu, Demre and Kumluca populations belong to greenhouse regions; the Bayatbadem, Phaselis, Geyikbayir and Termessos populations belong to natural areas; and the commercial population was obtained from a firm located in Antalya. We show how to obtain multiple sequence alignments, haplotype networks, a heat map and phylogenetic trees from a FASTA format input file using R (version 4.0.3) [ 10 ]. For all these analyses and graphics, we used both R packages and short R commands.

Preparing the dataset

The file with the .fas extension obtained from the sequencing process was used as the input file. The sample names in the data were tagged with their population name and sample number. Names and numbers were separated by underscores or spaces, for example, “Kumluca_6” or “Bayatbadem 24”. This naming method allows the unique population names to be extracted from the sample names with a short command. Thus, the population names do not need to be entered again in any of the analyses.
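The naming scheme can be exploited with a few lines of base R. The sketch below is one possible way to do it, not necessarily the command used by the authors; the sample labels and the object names ( samples , pop_of_sample , populations , pop_counts ) are hypothetical.

```r
# Hypothetical sample labels following the naming scheme described in the text.
samples <- c("Kumluca_6", "Kumluca_7", "Bayatbadem 24", "Aksu_1")

# Drop the separator (underscore or space) and the trailing sample number.
pop_of_sample <- sub("[_ ][0-9]+$", "", samples)

# Unique population names, in order of first appearance.
populations <- unique(pop_of_sample)

# Number of samples per population.
pop_counts <- table(pop_of_sample)
```

Anchoring the pattern at the end of the string keeps population names intact even if they contain an underscore or a space themselves.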

Multiple sequence alignment and plotting aligned FASTA format file

The readDNAStringSet() command provided by the Biostrings package (version 2.54.0) was used to read the FASTA format file [ 17 ]. With the msa() function implemented in the msa package, all samples were aligned to the same length by the ClustalW algorithm and stored as a DNAStringSet object [ 4 , 17 ]. The as.DNAbin() function provided by the ape package (version 5.3) was used to store the multiple sequence alignment as a DNAbin object [ 7 ]. At this stage, the trimEnds() function implemented in the ips package (version 0.0.11) can be used for trimming the sequences [ 21 ]. The msaplot() command provided by the ggtree package was used together with ggplot2 to display the aligned sequences alongside the phylogenetic tree [ 11 ]. Layers from the ggtree and ggplot2 packages ( geom_tiplab() , scale_color_continuous() , geom_treescale() ) were used for detailing the tree [ 11 ]. To construct the phylogenetic tree, the pairwise distances between the DNA sequences were computed with the dist.dna() function implemented in the ape package [ 7 ], using the K80 model derived by Kimura [ 22 ]. The tree was then estimated using the nj() function implemented in the ape package [ 7 ]. The branches of the tree were colored to represent genetic distance: as stated in the commands below, "lightskyblue1" was used for the longest branch of the tree and "coral4" for the shortest branch. Each nucleotide was represented by a different color: A, C, G, and T were colored with "rosybrown" , "sienna1" , "lightgoldenrod1" , and "lightskyblue1" , respectively.

# Tree with branches colored by length
ggt <- ggtree(tree, cex = 0.8, aes(color = branch.length)) +
  scale_color_continuous(high = "lightskyblue1", low = "coral4") +
  geom_tiplab(align = TRUE, size = 2) +
  geom_treescale(y = -5, color = "coral4", fontsize = 4)
# Alignment drawn next to the tree, one color per nucleotide
msaplot(ggt, nbin, offset = 0.009, width = 1, height = 0.5,
  color = c(rep("rosybrown", 1), rep("sienna1", 1),
    rep("lightgoldenrod1", 1), rep("lightskyblue1", 1)))
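To make the K80 distance mentioned in the text concrete, the following base R sketch computes Kimura's two-parameter distance for a single pair of aligned sequences. It is an illustration only: the workflow itself relies on dist.dna() from ape, the helper name k80 is hypothetical, and the sketch assumes sequences of equal length containing only A, C, G and T (no gaps or ambiguity codes).

```r
# Kimura two-parameter (K80) distance between two aligned sequences.
# Assumes equal length and only the characters A, C, G, T.
k80 <- function(s1, s2) {
  a <- strsplit(toupper(s1), "")[[1]]
  b <- strsplit(toupper(s2), "")[[1]]
  stopifnot(length(a) == length(b))
  purine <- c("A", "G")
  differs <- a != b
  # A transition stays within a base class (purine or pyrimidine);
  # a transversion switches class.
  same_class <- (a %in% purine) == (b %in% purine)
  P <- sum(differs & same_class) / length(a)   # proportion of transitions
  Q <- sum(differs & !same_class) / length(a)  # proportion of transversions
  -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))
}

# One transition (A/G) and one transversion (A/C) over ten sites.
d <- k80("AAAAAAAAAA", "GAAAAAAAAC")
```

Identical sequences give a distance of zero, and the distance grows faster than the raw proportion of differing sites because the model corrects for multiple substitutions at the same site.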

Extraction of haplotypes

We wrote dynamic short R commands to find out information about haplotypes and sequence variations. Firstly, we converted the DNAbin object to the DNA matrix (120x373) using the as.matrix() command provided by R base package [ 10 ]. Secondly, by comparing the sequences, we extracted the haplotype number, haplotype frequency and variable regions. Thirdly, we identified unique haplotype sequences by ignoring common nucleotides between haplotypes and by printing variable regions.

The number of haplotypes per population was calculated using the haplotypes package and short R commands [ 3 ]. First, the DNAbin object was converted to an object of class "DNA" using the as.dna() command provided by the haplotypes package. Then haplotypes were extracted and grouped using the haplotype() and grouping() commands, respectively [ 3 ]. Finally, the population frequency matrix was extracted.
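The steps described in this section can be sketched in base R on a toy alignment. This is a simplified stand-in for the authors' commands and for the haplotypes package, not a reproduction of them; the sequences and all object names ( seqs , mat , hap_label , variable_sites , pop_hap ) are hypothetical.

```r
# Toy aligned sequences (hypothetical), one per sample.
seqs <- c(Aksu_1 = "ACGTAC", Aksu_2 = "ACGTAC",
          Demre_1 = "ACGTTC", Demre_2 = "ACGTAC",
          Firm_1 = "GCGTTC")

# Character matrix: one row per sample, one column per site.
mat <- do.call(rbind, strsplit(seqs, ""))

# Identical sequences collapse to one haplotype, numbered by first appearance.
hap_id <- match(seqs, unique(seqs))
hap_label <- paste0("H", hap_id)
hap_freq <- table(hap_label)

# Variable sites: columns holding more than one distinct nucleotide.
variable_sites <- which(apply(mat, 2, function(col) length(unique(col)) > 1))

# Haplotype-by-population frequency matrix.
pops <- sub("[_ ][0-9]+$", "", names(seqs))
pop_hap <- table(population = pops, haplotype = hap_label)
```

Printing mat[, variable_sites] shows only the columns that distinguish the haplotypes, which is the idea behind reporting haplotypes by their variable regions.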

Haplotype distance matrix and heat map

The distance between haplotypes was calculated using the dist.hamming() function from the pegas package [ 2 ]. The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ [ 23 ]. Our data set consisted of haplotype sequences stored as strings 41 base pairs long. We first split each string into an array of nucleotides with the strsplit() function to form a (20x41) haplotype sequence matrix, and then called the dist.hamming() function to compute the Hamming distance matrix.

To construct the heat map, we derived the symmetric distance matrix (20x20) from the haplotype sequence matrix (20x41) using simple R commands. For this calculation, we compared the haplotype sequences in pairs, counted the nucleotide differences between them, and wrote the counts into a symmetric matrix. We then visualized this matrix as a heat map with the heatmap() command provided by the stats package [ 10 ].
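The pairwise counting described here can be sketched in base R as follows. The three short haplotype strings and the object names ( haps , hap_mat , hd ) are illustrative assumptions; the study used 20 haplotypes of 41 sites each.

```r
# Toy haplotype strings (hypothetical); each holds only the variable sites.
haps <- c(H1 = "ACGT", H2 = "ACGA", H3 = "TCGA")

# Split each string into nucleotides and stack into a haplotype-by-site matrix.
hap_mat <- do.call(rbind, strsplit(haps, ""))

# Count mismatching sites for every pair and fill a symmetric matrix.
n <- nrow(hap_mat)
hd <- matrix(0L, n, n, dimnames = list(names(haps), names(haps)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    hd[i, j] <- sum(hap_mat[i, ] != hap_mat[j, ])
  }
}
# heatmap(hd, symm = TRUE)  # uncomment to draw the heat map
```

The double loop fills both triangles of the matrix, so the result is symmetric with zeros on the diagonal, which is the shape a heat map of pairwise distances needs.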

Haplotype network

The haplotype() and haploNet() functions implemented in the pegas package were used to construct the haplotype network [ 2 ]. In this section, we wanted to show that data in R can be modified quickly and easily, creating multiple options for analysis. For this reason, we drew three haplotype networks from the same data set at different hierarchical levels, each with its own color scheme: the first is represented by individuals, the second by populations, and the third by groups. This provides options both for those working with individual datasets and for those working with larger populations or groups.

The haplotype network represented by individuals was colored using the default rainbow colors, and the names and colors of the samples were given through the fill argument of the legend() command, as below.

plot(net, size = attr(net, "freq"), scale.ratio = 2, cex = 0.6,
  labels = TRUE, pie = ind.hap, show.mutation = 1, font = 2,
  fast = TRUE)
legend(x = 57, y = 15, colnames(ind.hap), fill = rainbow(ncol(ind.hap)),
  cex = 0.52, ncol = 6, x.intersp = 0.2, text.width = 11)

We chose specific colors for the haplotype network represented by populations. The desired colors were defined as a vector and passed to the bg argument of the plot() command, as below.

bg <- c(rep("dodgerblue4", 15), rep("olivedrab4", 15),
  rep("royalblue2", 15), rep("red", 15), rep("olivedrab3", 15),
  rep("skyblue1", 15), rep("olivedrab1", 15),
  rep("darkseagreen1", 15))
plot(net, size = attr(net, "freq"), bg = bg, scale.ratio = 2, cex = 0.7,
  labels = TRUE, pie = ind.hap, show.mutation = 1, font = 2, fast = TRUE)

The population names and their colors were defined as vectors and passed to the legend() command, the colors through its fill argument, as below.

hapcol <- c("Aksu", "Demre", "Kumluca", "Firm", "Bayatbadem",
  "Geyikbayir", "Phaselis", "Termessos")
ubg <- c(rep("dodgerblue4", 1), rep("royalblue2", 1),
  rep("skyblue1", 1),
  rep("red", 1), rep("olivedrab4", 1), rep("olivedrab3", 1),
  rep("olivedrab1", 1), rep("darkseagreen1", 1))
legend(x = -35, y = 45, hapcol, fill = ubg, cex = 0.8, ncol = 1, bty = "n",
  x.intersp = 0.2)

To construct the haplotype network represented by groups, each individual was renamed with the name of the group to which it belongs. The sample names in the DNAbin object were replaced with the corresponding group names using a few simple commands, and the haplotype network represented by groups was plotted. The desired colors for the network diagram were defined in the gbg vector, which was passed to the bg argument of the plot() command, as below.

gbg <- c(rep("red"), rep("blue"), rep("green"))
plot(netg, size = attr(netg, "freq"), bg = gbg, scale.ratio = 2, cex = 0.7,
  labels = TRUE, pie = ind.hapg, show.mutation = 1, font = 2, fast = TRUE)

The colors of the groups were passed as a vector to the fill argument of the legend() command, as below.

legend(x = -35, y = 45, colnames(ind.hapg), fill = c("red", "blue", "green"),
  cex = 0.8, ncol = 1, bty = "n", x.intersp = 0.2)
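The renaming of samples by group can be sketched with a named lookup vector in base R. This is one possible approach under the grouping stated in the Material and method section, not necessarily the authors' commands; the sample labels and the object names ( group_of_pop , groups , group_col , sample_col ) are hypothetical.

```r
# Lookup from population to group, following the grouping described in the text.
group_of_pop <- c(Aksu = "greenhouse", Demre = "greenhouse", Kumluca = "greenhouse",
                  Bayatbadem = "nature", Phaselis = "nature",
                  Geyikbayir = "nature", Termessos = "nature",
                  Firm = "firm")

# Hypothetical sample labels.
samples <- c("Aksu_1", "Firm_3", "Termessos 12")

# Strip the sample number, then translate population names into group names.
pops <- sub("[_ ][0-9]+$", "", samples)
groups <- unname(group_of_pop[pops])

# One color per group: blue for greenhouse, red for firm, green for nature.
group_col <- c(greenhouse = "blue", firm = "red", nature = "green")
sample_col <- unname(group_col[groups])
```

Indexing a named vector by name keeps the mapping in one place, so adding a population or changing a color requires editing only the lookup.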

Phylogenetic trees

We plotted circular phylogenetic trees using the ggtree, ggplot2, ape, and stats packages [ 7 , 10 , 11 ]. To construct the phylogenetic tree, the dist.dna() and nj() commands provided by the ape package were used. We show two circular phylogenetic trees. In the first tree, populations were colored by adding the aes(color = Populations) mapping to the ggtree() object, and the tree was drawn using the ggplot2 package, as below.

# Circular tree; xlim() leaves room at the center
emos <- ggtree(tree, layout = "circular",
  branch.length = "branch.length", lwd = 0.5) +
  xlim(-0.1, NA)
# Group the tips with krp, then color by that grouping
groupOTU(emos, krp, "Populations") +
  aes(color = Populations) +
  theme(legend.position = "right") +
  geom_tiplab(names(nbin), cex = 1.7, offset = 0.002) +
  guides(color = guide_legend(override.aes = list(size = 2.5))) +
  geom_treescale(x = -0.1, color = "coral4", fontsize = 3, offset = 9)

In the second tree, the phylogenetic tree was colored according to branch lengths representing genetic distance. The aes(color = branch.length) command was used for coloring branches. Colors were defined using the scale_color_continuous() command. As stated in the commands below, "lightskyblue1" color was used for the longest branch and "coral4" color was used for the shortest branch.

ggtree(tree, layout = "circular", branch.length = "branch.length",
  aes(color = branch.length), lwd = 0.5) +
  xlim(-0.1, NA) +
  geom_tiplab(names(nbin), size = 1.7, offset = 0.002) +
  scale_color_continuous(high = "lightskyblue1", low = "coral4") +
  geom_treescale(x = -0.1, color = "coral4", fontsize = 3, offset = 9)

We also reconstructed the phylogenetic relationship between haplotypes from the haplotype sequences. At this stage, the treeio package (version 1.10.0) [ 24 ] was used together with the ggtree package. We calculated the genetic distance with the dist.hamming() function provided by the pegas package [ 2 ]. The nj() function was used for neighbor-joining tree estimation. Branch support was calculated from 100 bootstrap replicates with the boot.phylo() function implemented in the ape package [ 20 ]. Support levels were defined according to the criteria of Kress et al. [ 25 ]: strong for 85% and above, moderate for 70–85%, weak for 50–70%, and poor for 50% and below. We colored the node points using the scale_fill_manual() command added to the ggtree() object. As stated below, the colors "black" , "red" , "pink1" , and "white" were assigned to these four support levels, respectively.

D <- dist.hamming(mat7)  # pegas package
htre <- nj(D)
# 100 bootstrap replicates, rebuilding the NJ tree each time
bp <- boot.phylo(htre, mat7, B = 100, function(x) nj(dist.hamming(x)))
# Internal nodes are numbered after the tips
bp2 <- data.frame(node = 1:Nnode(htre) + Ntip(htre), bootstrap = bp)
htree <- full_join(htre, bp2, by = "node")
boothap <- ggtree(htree, size = 1, branch.length = "branch.length") +
  geom_tiplab(size = 4) +
  geom_nodepoint(aes(fill = cut(bootstrap, c(0, 50, 70, 85, 100))),
    shape = 21, size = 4) +
  theme_tree(legend.position = c(0.85, 0.2)) +
  scale_fill_manual(values = c("black", "red", "pink1", "white"),
    guide = "legend",
    name = "Bootstrap Percentage (BP)",
    breaks = c("(85,100]", "(70,85]", "(50,70]", "(0,50]"),
    labels = expression(BP >= 85, 70 <= BP * "<85",
      50 <= BP * "<70", BP < 50))
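The binning of bootstrap values into support classes relies on cut() from base R. The sketch below shows how the breaks used in the commands map values to intervals; the bootstrap values are made up, and the class labels simply follow the four levels named in the text.

```r
# Hypothetical bootstrap percentages for five internal nodes.
bootstrap <- c(100, 85, 72, 50, 12)

# cut() builds right-closed bins by default: (0,50], (50,70], (70,85], (85,100].
support_bin <- cut(bootstrap, breaks = c(0, 50, 70, 85, 100))

# Same bins with descriptive labels instead of interval names.
support <- cut(bootstrap, breaks = c(0, 50, 70, 85, 100),
               labels = c("poor", "weak", "moderate", "strong"))
```

Because the bins are closed on the right, a value of exactly 85 falls in the (70,85] bin and a value of exactly 0 falls outside all bins and becomes NA. Passing right = FALSE together with include.lowest = TRUE to cut() is one way to change that behavior, in which case the breaks given to scale_fill_manual() would need to match the new interval names.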

The mitochondrial cyt b sequences (373 base pairs) of 120 Bombus terrestris dalmatinus individuals were aligned with the ClustalW method using the msa package [ 4 ]. The aligned sequences were visualized alongside the phylogenetic tree by using the ggtree() and msaplot() commands [ 11 ]. In Fig 1 , the branches of the tree are colored to represent genetic distance. Each nucleotide is represented by a different color, so color changes on the plot reveal nucleotide differences between the samples.

Fig 1. The phylogenetic tree of the mitochondrial cyt b gene of 120 Bombus terrestris dalmatinus individuals and their nucleotide differences. Colors both on the phylogenetic tree and on the branch scale represent genetic distance. mtDNA cyt b sequences are colored with four different colors, shown in the ‘seq’ column, one for each nucleotide. Color changes on the aligned sequences represent nucleotide differences.

https://doi.org/10.1371/journal.pone.0243927.g001

The distance scale shows 0.6% genetic variation per nucleotide substitution. Each row of the multiple sequence alignment, corresponding to a name at a tip of the tree, represents a complete sequence.

Haplotype sequences were obtained from the DNA matrix object by comparing rows with one another and columns with one another. As shown in Table 1 , nucleotides shared between haplotypes are shown as dots, while differing nucleotides are written out.

https://doi.org/10.1371/journal.pone.0243927.t001

The number of haplotypes per population was extracted using the haplotypes package and short R commands. The most common haplotype was H11, which was submitted to NCBI (GenBank: MH221884.1), while the H16 and H14 haplotypes were each observed in only one sample. While the Firm population had seven different haplotypes, only two haplotypes were detected in the Demre, Kumluca, and Termessos populations ( Table 2 ).

https://doi.org/10.1371/journal.pone.0243927.t002

The haplotype distance matrix was computed from the haplotype sequence matrix (20x41) using the dist.hamming() function. The distance matrix shows the number of nucleotide differences between each pair of haplotypes. In Table 3 , the smallest difference between haplotype pairs was one nucleotide, found between the H9-H11, H10-H11, H12-H15 and H15-H16 haplotypes. The largest difference was 35 nucleotides, between the H5 and H14 haplotypes.

https://doi.org/10.1371/journal.pone.0243927.t003

A heat map with accompanying phylogenetic trees was constructed from the haplotype sequence matrix using the heatmap() function [ 10 ] ( Fig 2 ). The haplotype distance matrix was derived from the haplotype sequence matrix (20x41) with our own code. The resulting matrix is a symmetric version of the Hamming distance matrix and was used to construct the heat map. The heat map is fully consistent with the haplotype distance matrix given in Table 3 .

Fig 2. Each branch of the phylogenetic tree represents the corresponding haplotype in the matrix. Close relationships are shown in "darkred" and distant relationships in "white".

https://doi.org/10.1371/journal.pone.0243927.g002

Haplotype networks were constructed using the pegas package [2]. Circle sizes were passed to the plot() command through the size = attr(x, "freq") argument, so that each circle is scaled by the number of individuals carrying that haplotype [10]. Each link between two haplotypes represents the distance between them, that is, the number of nucleotide differences. Every dash on a link represents one nucleotide. We present three haplotype networks, partitioned by individuals (Fig 3), populations (Fig 4a) and groups (Fig 4b).

Each slice in the circles represents an individual. Branch lengths between the circles represent the genetic distance between haplotypes. Every dash on the lines represents one nucleotide. The rainbow() command was used to color the individuals. The haplotype network was drawn using the plot() command, and the names of individuals were added with the legend() command.

https://doi.org/10.1371/journal.pone.0243927.g003

(a) Populations are colored in blue, red, and green according to the group (greenhouse, firm and nature) they belong to. (b) Groups are colored consistently with the population haplotype network. Circle sizes were calculated from the number of individuals in each circle. Each slice in the circles represents an individual. Every dash on the lines represents one nucleotide. To visualize the two plots at the same time, the plot(new, TRUE) command was used after the first plot command, followed by the second plot command.

https://doi.org/10.1371/journal.pone.0243927.g004

In Fig 3, individuals have been colored in rainbow colors, as is the default for the plot() command. In Fig 4a, the populations of the greenhouse group have been colored in 3 different shades of blue, the Firm population, belonging to the firm group, in red, and the populations of the nature group in 4 different shades of green. In Fig 4b, samples belonging to the greenhouse, firm, and nature groups have been colored blue, red and green, respectively. Each pie chart there therefore represents the individuals belonging to that group as a whole.

In Figs 3 and 4a, the largest circle in the center represents H11. Because the circles are drawn as pie charts sized by the number of individuals, the name of the H11 haplotype does not appear in small images. However, it is clearly visible in high-resolution, large-scale images or in the larger pie chart (Fig 4b). Twenty haplotypes with 71 links were determined. The closest haplotypes were H9-H10-H11, H12-H15, and H15-H16, while the most distant haplotypes were H5 and H14 (Figs 3 and 4). Every dash on the lines represents one nucleotide. The circles have been sized according to their number of samples using the size = attr(net, "freq") argument. As a result, the dashes around the largest circle, H11 (H11-H7, H11-H8, H11-H9, H11-H10), cannot be seen. If this argument is removed and the command is run again, all the mutational differences can be seen as dashes on the lines.

Phylogenetic trees were constructed using the ggtree and ggplot2 packages. The tree was estimated by the neighbor-joining method as implemented in the ape package. While the phylogenetic tree in Fig 5 was colored to represent populations, the tree in Fig 6 was colored to represent genetic distance. The largest clade in both Figs 5 and 6 consisted of samples from Kumluca (12), Geyikbayir (6), Firm (4), Phaselis (7), Demre (13), Bayatbadem (6), and Aksu (7). The Aksu and Firm samples mostly fell into more than one phylogenetic clade.

The scale bar corresponds to a genetic distance of 0.6%.

https://doi.org/10.1371/journal.pone.0243927.g005

https://doi.org/10.1371/journal.pone.0243927.g006
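For readers unfamiliar with the neighbor-joining method used to estimate the trees in Figs 5 and 6, the following Python sketch implements the classic Saitou and Nei procedure from scratch. It is an illustration under stated assumptions, not the ape implementation used by the authors: the function name `neighbor_joining`, the internal node names, and the five-taxon toy matrix are hypothetical and unrelated to the cyt b data.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    names: list of taxon labels; dist: dict of dicts with dist[a][b] for a != b.
    Returns a list of edges (child, parent_or_neighbor, branch_length).
    """
    d = {a: dict(dist[a]) for a in names}  # working copy of the distances
    nodes = list(names)
    edges = []
    next_id = 1
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = d[a][b] / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(a, u, len_a), (b, u, len_b)]
        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in nodes:
            if k not in (a, b):
                d[u][k] = d[k][u] = (d[a][k] + d[b][k] - d[a][b]) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # connect the last two nodes
    return edges


# Hypothetical additive toy matrix for five taxa.
taxa = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8], [5, 0, 10, 10, 9], [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3], [8, 9, 7, 3, 0]]
toy_dist = {p: {q: rows[i][j] for j, q in enumerate(taxa)} for i, p in enumerate(taxa)}
tree = neighbor_joining(taxa, toy_dist)
for edge in tree:
    print(edge)
```

In practice the input would be a pairwise distance matrix computed from the aligned sequences, and a maintained library implementation is preferable to hand-written code; the sketch is meant only to make the joining criterion and branch length formulas explicit.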

We demonstrated the phylogenetic relationships between haplotypes using the bootstrap method (Fig 7). Distances were estimated with the Hamming distance method, which counts the nucleotide differences between two sequences [2]. The bootstrap confidence level was defined as strong for 85% and above, moderate for 70–85%, weak for 50–70%, and poor for 50% and below [25]. The bootstrap values were indicated by coloring the nodes according to these confidence levels.

Colored internal nodes represent the bootstrap confidence level.

https://doi.org/10.1371/journal.pone.0243927.g007
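The support classes used to color the internal nodes in Fig 7 can be written as a small lookup. The Python sketch below is illustrative rather than the authors' code; the function name `support_class` is hypothetical, and because the stated ranges share their endpoints, the sketch assumes that 85 counts as strong, 70 as moderate, and 50 as poor.

```python
def support_class(bootstrap_percent):
    """Map a bootstrap percentage to the support class used for node coloring.

    Boundary handling is an assumption of this sketch:
    >= 85 strong, >= 70 moderate, > 50 weak, otherwise poor.
    """
    if not 0 <= bootstrap_percent <= 100:
        raise ValueError("bootstrap support must be between 0 and 100")
    if bootstrap_percent >= 85:
        return "strong"
    if bootstrap_percent >= 70:
        return "moderate"
    if bootstrap_percent > 50:
        return "weak"
    return "poor"


# Hypothetical node support values.
for value in (97, 85, 72, 51, 50, 12):
    print(value, support_class(value))
```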

Many software packages are available today for computing and visualizing population genetic statistics and phylogenies. Among the most widely used are MEGA, DnaSP, SplitsTree, TASSEL and Arlequin [1, 2]. With the rapid technological advances of recent years, large amounts of molecular genetic data have been generated, and processing and analyzing these data requires powerful computer software and hardware. Almost every software environment has its own input and output file formats. While some analyses can be completed within a single program or workflow, most require a combination of software packages. Such multi-program workflows make it necessary to convert between input and output formats, which is sometimes difficult and usually time-consuming. Here we show phylogenetic relationships and statistical findings in a single workflow using R, on a sample mtDNA dataset. By sharing all the commands we used, we aim to provide a ready-made template for researchers working in this field. Thus, using only one FASTA format file as input, we were able to output multiple sequence alignments, haplotype sequences, a heat map, haplotype networks, and phylogenetic trees. We have demonstrated the use of R for phylogenetic analysis and plotting, using both packages and our own R code, taking advantage of the fact that R is a free language with a free license.

We tried to make some commands as modifiable as possible. One of these was the creation of a haplotype network [2]. We shared three haplotype networks representing individuals, populations, and groups. Arguments such as coloring or plot scaling can be modified to suit other data sets. We demonstrated three differently colored phylogenetic trees representing the populations, DNA distance, and the haplotype bootstrap tree [11, 20]. Likewise, by changing the commands we used for the phylogenetic trees, arguments such as colors, tree type, and graphic scaling can be modified. We wanted to show that R can be used by researchers who study mtDNA and are interested in phylogenetics. While phylogenetic analysis software frequently requires intermediate input and output files, we showed that 8 different outputs can be obtained from a single .fas file, and we shared all the packages, libraries and commands used in these analyses. We also recommend RStudio for visualization work, as its source pane makes code easier to edit, its plot pane makes figures easier to manipulate, and it offers reminders for arguments and definitions. The haplotype sequences, heat map, haplotype networks, and phylogenetic trees gave results that are fully consistent with each other. We believe that R and RStudio will be increasingly used in phylogenetic analysis and visualization because they form an open-source, continually updated environment that is free for all and open to researcher contributions such as this one.

Supporting information

S1 Appendix. R codes.

R code for phylogenetic analyses using a FASTA format file.

https://doi.org/10.1371/journal.pone.0243927.s001

S2 Appendix. 120 mitochondrial cyt b sequences of B. terrestris dalmatinus.

An example dataset provided as a FASTA format sequence file.

https://doi.org/10.1371/journal.pone.0243927.s002

Acknowledgments

The data used in this project were provided by the project numbered FYL-2016-1502, supported by the Scientific Research Projects Coordination Unit of Akdeniz University. We would like to thank Philippa Price for proofreading this article.

  • 3. Aktas C. haplotypes: Manipulating DNA Sequences and Estimating Unambiguous Haplotype Network with Statistical Parsimony. [Internet]. 2020. https://cran.r-project.org/web/packages/haplotypes/index.html
  • 5. Jombart T, Kamvar ZN, Collins C, Lustrik R, Beugin M, Knaus BJ, et al. adegenet: Exploratory Analysis of Genetic and Genomic Data. The Comprehensive R Archive Network. 2020 [cited 2020 Apr 10]. https://cran.r-project.org/web/packages/adegenet/adegenet.pdf
  • 10. R Core Team. 2019. R: A language and environment for statistical computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2019. http://www.R-project.org
  • 12. Morandat F, Hill B, Osvald L, Vitek J. Evaluating the design of the R language. In: Noble J, editor. Objects and Functions for Data Analysis. Berlin: Springer; 2012. pp. 104–131.
  • 14. Xie Y, Allaire JJ, Grolemund G. R markdown: The definitive guide [Internet]. Boca Raton, Florida: Chapman and Hall/CRC; 2018. https://bookdown.org/yihui/rmarkdown
  • 15. Xie Y. knitr: A General-Purpose Package for Dynamic Report Generation in R [Internet]. 2020. https://cran.r-project.org/web/packages/knitr/index.html
  • 17. Pagès H, Gentleman R, Aboyoun P, DebRoy S. Biostrings: String objects representing biological sequences, and matching algorithms [Internet]. 2020. https://bioconductor.org/packages/release/bioc/html/Biostrings.html
  • 21. Heibl C. ips: Interfaces to phylogenetic software in R [Internet]. 2019. https://cran.r-project.org/web/packages/ips/index.html

Phylogenetic Analysis

  • First Online: 25 September 2021

  • Manoj Kumar Gupta 3 ,
  • Gayatri Gouda 3 ,
  • S. Sabarinathan 3 , 4 ,
  • Ravindra Donde 3 ,
  • N. Rajesh 5 ,
  • Pallabi Pati 6 ,
  • Sushil Kumar Rathore 7 ,
  • Lambodar Behera 3 &
  • Ramakrishna Vadde 5  

In this chapter, the authors examine the principles underlying phylogeny and how researchers apply diverse methods to infer an appropriate phylogeny. The results show that phylogenetic trees reflect the evolutionary past as a canonical framework. Building a phylogenetic tree essentially comprises five steps: (a) selecting molecular markers; (b) multiple sequence alignment; (c) determining the best evolutionary model; (d) choosing a tree-building method; and (e) assessing tree reliability. Phylogenetic trees have various uses in different biological fields, such as conservation biology, epidemiology, forensics, cancer evolution, HIV transmission, gene expression prediction, protein structure prediction, and drug design. However, researchers face several challenges in generating more accurate trees, such as memory efficiency and the implementation and optimization of the likelihood function. The authors believe that, in the near future, the development of new algorithms that dramatically reduce the number of likelihood evaluations required, combined with better knowledge of previously described high-performance computing problems, is likely to yield more accurate phylogenetic trees that include 10,000–20,000 sequences. It will also permit tree inference on a medium-sized PC.

  • Molecular markers
  • Bootstrapping
  • Character-based
  • Distance-based
  • Jackknifing

Abbreviations

Bayesian inference

Chloroplast DNA

Non-synonymous

Genotyping-by-sequencing

Hypothetical taxonomic units

Internal transcribed spacer

Jukes and Cantor

Last common ancestor

Last universal common ancestor

Maximum-like

Multiple sequence alignment

Operational taxonomic units

Polymerase chain reaction

Ultra-conserved elements

Yang Z, Rannala B. Molecular phylogenetics: principles and practice. Nat Rev Genet. 2012 May;13(5):303–14.

Article   CAS   PubMed   Google Scholar  

Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. PNAS. 1977 Nov 1;74(11):5088–90.

Article   CAS   PubMed   PubMed Central   Google Scholar  

Crick FH. On protein synthesis. Symp Soc Exp Biol. 1958;12:138–63.

CAS   PubMed   Google Scholar  

Zuckerkandl E, Pauling L. Molecules as documents of evolutionary history. J Theor Biol. 1965 Mar 1;8(2):357–66.

Sanger F. Chemistry of insulin: determination of the structure of insulin opens the way to greater understanding of life processes. Science. 1959 May 15;129(3359):1340–4.

Doolittle RF, Feng DF. Reconstructing the evolution of vertebrate blood coagulation from a consideration of the amino acid sequences of clotting proteins. Cold Spring Harb Symp Quant Biol. 1987 Jan 1;52:869–74.

Fitch WM, Margoliash E. Construction of phylogenetic trees. Science. 1967 Jan 20;155(3760):279–84.

Edwards SV. Is a new and general theory of molecular systematics emerging? Evolution. 2009 Jan;63(1):1–19.

Mäser P, Thomine S, Schroeder JI, Ward JM, Hirschi K, Sze H, et al. Phylogenetic relationships within cation transporter families of Arabidopsis. Plant Physiol. 2001 Aug 1;126(4):1646–67.

Article   PubMed   PubMed Central   Google Scholar  

Marra MA, Jones SJM, Astell CR, Holt RA, Brooks-Wilson A, Butterfield YSN, et al. The genome sequence of the SARS-associated coronavirus. Science. 2003 May 30;300(5624):1399–404.

Gray RD, Drummond AJ, Greenhill SJ. Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science. 2009 Jan 23;323(5913):479–83.

Salipante SJ, Horwitz MS. Phylogenetic fate mapping. Proc Natl Acad Sci U S A. 2006 Apr 4;103(14):5448–53.

Baumann J. Use of homeoplastic auditory ossicles for chain defects within the scope of tympanoplasty. Z Laryngol Rhinol Otol. 1971 Feb;50(2):95–102.

Kuzuya T, Kimura Y, Hoshida S, Kodama K, Nakamura N, Hamanaka Y, et al. The effect of CV-4151, a selective inhibitor of thromboxane synthetase, on prostanoid formation and platelet aggregation in humans. Cardiovasc Drugs Ther. 1988 Dec;2(5):693–700.

Gronau I, Hubisz MJ, Gulko B, Danko CG, Siepel A. Bayesian inference of ancient human demography from individual genome sequences. Nat Genet. 2011 Sep 18;43(10):1031–4.

Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, et al. Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res. 2008 Nov;18(11):1829–43.

Roy SS, Dasgupta R, Bagchi A. A review on phylogenetic analysis: a journey through modern era. Comput Mol Biosci. 2014 Sep 30;4(3):39–45.

Article   Google Scholar  

Scott AD, Baum DA. Phylogenetic tree. In: Kliman RM, editor. Encyclopedia of evolutionary biology [Internet]. Oxford: Academic Press; 2016. p. 270–6. [cited 2020 Oct 21]. Available from: http://www.sciencedirect.com/science/article/pii/B9780128000496002031 .

Chapter   Google Scholar  

Choudhuri S. Chapter 9 - Phylogenetic analysis**The opinions expressed in this chapter are the author’s own and they do not necessarily reflect the opinions of the FDA, the DHHS, or the Federal Government. In: Choudhuri S, editor. Bioinformatics for beginners [Internet]. Oxford: Academic Press; 2014. p. 209–18. [cited 2018 Nov 6]. Available from: http://www.sciencedirect.com/science/article/pii/B9780124104716000098 .

Ding G, Yu Z, Zhao J, Wang Z, Li Y, Xing X, et al. Tree of life based on genome context networks. PLoS One. 2008 Oct 9;3(10):e3357.

Article   PubMed   PubMed Central   CAS   Google Scholar  

Xiong J. Essential bioinformatics. Cambridge: Cambridge University Press; 2006. 360 p.

Book   Google Scholar  

Munjal G, Hanmandlu M, Srivastava S. Phylogenetics algorithms and applications. Ambient Communications and Computer Systems. 2018 Dec 10;904:187–94.

Article   PubMed Central   Google Scholar  

Kapli P, Yang Z, Telford MJ. Phylogenetic tree building in the genomic age. Nat Rev Genet. 2020 Jul;21(7):428–44.

El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics. 2015 Jun 15;31(12):i62–70.

Paradis E, Claude J, Strimmer K. APE: analyses of Phylogenetics and evolution in R language. Bioinformatics. 2004 Jan 22;20(2):289–90.

Lord E, Leclercq M, Boc A, Diallo AB, Makarenkov V. Armadillo 1.1: an original workflow platform for designing and conducting phylogenetic analysis and simulations. PLoS One. 2012;7(1):e29903.

Suchard MA, Redelings BD. BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006 Aug 15;22(16):2047–8.

Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian Phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012 Aug;29(8):1969–73.

Jiang Y, Qiu Y, Minn AJ, Zhang NR. Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc Natl Acad Sci U S A. 2016 Sep 13;113(37):E5528–37.

Huson DH, Scornavacca C. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Syst Biol. 2012 Dec 1;61(6):1061–7.

Article   PubMed   Google Scholar  

Jeon Y-S, Lee K, Park S-C, Kim B-S, Cho Y-J, Ha S-M, et al. EzEditor: a versatile sequence alignment editor for both rRNA- and protein-coding genes. Int J Syst Evol Microbiol. 2014 Feb;64(Pt 2):689–91.

Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One. 2010 Mar 10;5(3):e9490.

Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015 Jan;32(1):268–74.

Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007 Aug;24(8):1586–91.

Hellmuth M, Wieseke N, Lechner M, Lenhof H-P, Middendorf M, Stadler PF. Phylogenomics with paralogs. Proc Natl Acad Sci U S A. 2015 Feb 17;112(7):2058–63.

Lanfear R, Frandsen PB, Wright AM, Senfeld T, Calcott B. PartitionFinder 2: new methods for selecting partitioned models of evolution for molecular and morphological phylogenetic analyses. Mol Biol Evol. 2017 Mar 1;34(3):772–3.

Thomas GH, Hartmann K, Jetz W, Joy JB, Mimoto A, Mooers AO. PASTIS: an R package to facilitate phylogenetic assembly with soft taxonomic inferences. Methods Ecol Evol. 2013;4(11):1011–7.

Schliep KP. Phangorn: phylogenetic analysis in R. Bioinformatics. 2011 Feb 15;27(4):592–3.

Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 2015 Feb 13;16(1):35.

Brown JW, Walker JF, Smith SA. Phyx: phylogenetic tools for unix. Bioinformatics. 2017 Jun 15;33(12):1886–8.

Wheeler WC, Lucaroni N, Hong L, Crowley LM, Varón A. POY version 5: phylogenetic analysis using dynamic homologies under multiple optimality criteria. Cladistics. 2015;31(2):189–96.

Darriba D, Taboada GL, Doallo R, Posada D. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 2011 Apr 15;27(8):1164–5.

Knight R, Maxwell P, Birmingham A, Carnes J, Caporaso JG, Easton BC, et al. PyCogent: a toolkit for making sense from sequence. Genome Biol. 2007 Aug 21;8(8):R171.

Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006 Nov 1;22(21):2688–90.

Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019 Nov 1;35(21):4453–5.

Sun Z, Zhu Q, Xiong Y, Sun Y, Mou L, Zhang L. TreeGen: A Tree-Based Transformer Architecture for Code Generation. arXiv:191109983 [cs] [Internet]. 2019 Nov 28. [cited 2020 Dec 14]; Available from: http://arxiv.org/abs/1911.09983 .

Boc A, Diallo AB, Makarenkov V. T-REX: a web server for inferring, validating and visualizing phylogenetic trees and networks. Nucleic Acids Res. 2012 Jul;40(Web Server issue):W573–9.

Dong W, Liu J, Yu J, Wang L, Zhou S. Highly Variable Chloroplast Markers for Evaluating Plant Phylogeny at Low Taxonomic Levels and for DNA Barcoding. PLoS One [Internet]. 2012 Apr 12;7(4). [cited 2020 Oct 22]; Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3325284/ .

Gielly L, Taberlet P. The use of chloroplast DNA to resolve plant phylogenies: noncoding versus rbcL sequences. Mol Biol Evol. 1994 Sep;11(5):769–77.

Zhu W-D, Nie Z-L, Wen J, Sun H. Molecular phylogeny and biogeography of Astilbe (Saxifragaceae) in Asia and eastern North America. Bot J Linn Soc. 2013 Feb 1;171(2):377–94.

Akhani H, Malekmohammadi M, Mahdavi P, Gharibiyan A, Chase MW. Phylogenetics of the Irano-Turanian taxa of Limonium (Plumbaginaceae) based on ITS nrDNA sequences and leaf anatomy provides evidence for species delimitation and relationships of lineages. Bot J Linn Soc. 2013 Mar 1;171(3):519–50.

Townsend TM, Alegre RE, Kelley ST, Wiens JJ, Reeder TW. Rapid development of multiple nuclear loci for phylogenetic analysis using genomic resources: an example from squamate reptiles. Mol Phylogenet Evol. 2008 Apr;47(1):129–42.

Smith DR. Mutation rates in plastid genomes: they are lower than you might think. Genome Biol Evol. 2015 Apr 13;7(5):1227–34.

Small RL, Cronn RC, Wendel JF. Use of nuclear genes for phylogeny reconstruction in plants. Aust Systematic Bot. 2004;17(2):145–70.

Article   CAS   Google Scholar  

Boekhorst J, Snel B. Identification of homologs in insignificant blast hits by exploiting extrinsic gene properties. BMC Bioinformatics. 2007 Sep 21;8(1):356.

Tekaia F. Inferring Orthologs: open questions and perspectives. Genomics Insights. 2016;9:17–28.

Sang T. Utility of low-copy nuclear gene sequences in plant phylogenetics. Crit Rev Biochem Mol Biol. 2002;37(3):121–47.

Bragg JG, Potter S, Bi K, Moritz C. Exon capture phylogenomics: efficacy across scales of divergence. Mol Ecol Resour. 2016 Sep;16(5):1059–68.

Rubin BER, Ree RH, Moreau CS. Inferring phylogenies from RAD sequence data. PLoS One. 2012;7(4):e33394.

Peñalba JV, Smith LL, Tonione MA, Sass C, Hykin SM, Skipwith PL, et al. Sequence capture using PCR-generated probes: a cost-effective method of targeted high-throughput sequencing for nonmodel organisms. Mol Ecol Resour. 2014 Sep;14(5):1000–10.

PubMed   Google Scholar  

Bi K, Vanderpool D, Singhal S, Linderoth T, Moritz C, Good JM. Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales. BMC Genomics. 2012 Aug 17;13:403.

Li C, Ortí G, Zhang G, Lu G. A practical approach to phylogenomics: the phylogeny of ray-finned fish (Actinopterygii) as a case study. BMC Evol Biol. 2007 Mar 20;7:44.

Portik DM, Smith LL, Bi K. An evaluation of transcriptome-based exon capture for frog phylogenomics across multiple scales of divergence (class: Amphibia, order: Anura). Mol Ecol Resour. 2016 Sep;16(5):1069–83.

Lemmon EM, Lemmon AR. High-throughput genomic data in systematics and Phylogenetics. Annu Rev Ecol Evol Syst. 2013 Nov 23;44(1):99–121.

Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Syst Biol. 2012 Oct;61(5):717–26.

Weitemier K, SCK S, Cronn RC, Fishbein M, Schmickl R, McDonnell A, et al. Hyb-Seq: Combining target enrichment and genome skimming for plant phylogenomics. Appl Plant Sci. 2014 Sep;2(9):1400042.

Schmickl R, Liston A, Zeisek V, Oberlander K, Weitemier K, Straub SCK, et al. Phylogenetic marker development for target enrichment from transcriptome and genome skim data: the pipeline and its application in southern African Oxalis (Oxalidaceae). Mol Ecol Resour. 2016 Sep;16(5):1124–35.

Kadlec M, Bellstedt DU, Le Maitre NC, Pirie MD. Targeted NGS for species level phylogenomics: “made to measure” or “one size fits all”? PeerJ. 2017;5:e3569.

Yue F, Shi J, Tang J. Simultaneous phylogeny reconstruction and multiple sequence alignment. BMC Bioinformatics. 2009 Jan 30;10(Suppl 1):S11.

Higgins DG, Sharp PM. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene. 1988 Dec 15;73(1):237–44.

Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994 Nov 11;22(22):4673–80.

Notredame C. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics. 2002 Jan;3(1):131–44.

Notredame C, Higgins DG, Heringa J. T-coffee: a novel method for fast and accurate multiple sequence alignment11Edited by J. Thornton Journal of Molecular Biology. 2000 Sep 8;302(1):205–17.

Ashkenazy H, Sela I, Levy Karin E, Landan G, Pupko T. Multiple sequence alignment averaging improves phylogeny reconstruction. Syst Biol. 2019 Jan 1;68(1):117–30.

Blackshields G, Wallace IM, Larkin M, Higgins DG. Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol. 2006;6(4):321–39.

Chang J-M, Di Tommaso P, Notredame C. TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol Biol Evol. 2014 Jun;31(6):1625–37.

Collingridge PW, Kelly S. MergeAlign: improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments. BMC Bioinformatics. 2012 May 30;13:117.

Lake JA. The order of sequence alignment can bias the selection of tree topology. Mol Biol Evol. 1991 May;8(3):378–85.

Penn O, Privman E, Landan G, Graur D, Pupko T. An alignment confidence score capturing robustness to guide tree uncertainty. Mol Biol Evol. 2010 Aug;27(8):1759–67.

Lutzoni F, Wagner P, Reeb V, Zoller S. Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. Syst Biol. 2000 Dec;49(4):628–51.

Lücking R, Hodkinson BP, Stamatakis A, Cartwright RA. PICS-Ord: unlimited coding of ambiguous regions by pairwise identity and cost scores ordination. BMC Bioinformatics. 2011 Jan 7;12:10.

Wheeler WC. Sequence alignment, parameter sensitivity, and the phylogenetic analysis of molecular data. Syst Biol. 1995 Sep 1;44(3):321–31.

Privman E, Penn O, Pupko T. Improving the performance of positive selection inference by filtering unreliable alignment regions. Mol Biol Evol. 2012 Jan;29(1):1–5.

Sela I, Ashkenazy H, Katoh K, Pupko T. GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res. 2015 Jul 1;43(W1):W7–14.

Arenas M. Trends in substitution models of molecular evolution. Front Genet [Internet]. 2015;6. [cited 2018 Nov 6]; Available from: https://www.frontiersin.org/articles/10.3389/fgene.2015.00319/full#B99 .

Jukes TH, Cantor CR. Chapter 24 - Evolution of protein molecules. In: Munro HN, editor. Mammalian protein metabolism [Internet]. New York: Academic Press; 1969. p. 21–132. [cited 2020 Oct 23]. Available from: http://www.sciencedirect.com/science/article/pii/B9781483232119500097 .

Collins DW, Jukes TH. Rates of transition and Transversion in coding sequences since the human-rodent divergence. Genomics. 1994 Apr 1;20(3):386–96.

Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980 Dec;16(2):111–20.

Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981 Nov 1;17(6):368–76.

Zharkikh A. Estimation of evolutionary distances between nucleotide sequences. J Mol Evol. 1994 Sep 1;39(3):315–29.

Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985;22(2):160–74.

Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994 Sep 1;39(3):306–14.

Shoemaker JS, Fitch WM. Evidence from nuclear sequences that invariable sites should be considered when sequence divergence is calculated. Mol Biol Evol. 1989 May;6(3):270–89.

Darriba D, Taboada GL, Doallo R, Posada D. jModelTest 2: more models, new heuristics and parallel computing. Nat Methods. 2012 Aug;9(8):772.

Arenas M, Posada D. Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol Biol Evol. 2014 May;31(5):1295–301.

Sumner JG, Jarvis PD, Fernández-Sánchez J, Kaine BT, Woodhams MD, Holland BR. Is the general time-reversible model bad for molecular Phylogenetics? Syst Biol. 2012 Dec 1;61(6):1069–74.

Gatto L, Catanzaro D, Milinkovitch MC. Assessing the applicability of the GTR nucleotide substitution model through simulations. Evol Bioinformatics Online. 2007 Feb 4;2:145–55.

Google Scholar  

Jayaswal V, Jermiin LS, Poladian L, Robinson J. Two stationary nonhomogeneous Markov models of nucleotide sequence evolution. Syst Biol. 2011 Jan;60(1):74–86.

Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994 Sep;11(5):715–24.

Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994 Sep;11(5):725–36.

Gupta MK, Vadde R. Genetic basis of adaptation and maladaptation via balancing selection. Zoology. 2019 Jul;10:125693.

Gupta MK, Vadde R. Divergent evolution and purifying selection of the Type 2 diabetes gene sequences in Drosophila: a phylogenomic study. Genetica [Internet]. 2020 Aug 17 . [cited 2020 Aug 29]; https://doi.org/10.1007/s10709-020-00101-7 .

Kosakovsky Pond SL, Frost SDW. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol. 2005 May 1;22(5):1208–22.

Article   PubMed   CAS   Google Scholar  

Pond SLK, Frost SDW. A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol Biol Evol. 2005 Mar;22(3):478–85.

Yang Z, Nielsen R, Goldman N, Pedersen AM. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000 May;155(1):431–49.

Wong WSW, Sainudiin R, Nielsen R. Identification of physicochemical selective pressure on protein encoding nucleotide sequences. BMC Bioinformatics. 2006 Mar 16;7:148.

Schneider A, Cannarozzi GM, Gonnet GH. Empirical codon substitution matrix. BMC Bioinformatics. 2005 Jun 1;6(1):134.

Yang Z, Nielsen R. Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol Biol Evol. 2008 Mar;25(3):568–79.

Misawa K. A codon substitution model that incorporates the effect of the GC contents, the gene density and the density of CpG islands of human chromosomes. BMC Genomics. 2011 Aug 6;12:397.

Perez-Jimenez R, Inglés-Prieto A, Zhao Z-M, Sanchez-Romero I, Alegre-Cebollada J, Kosuri P, et al. Single-molecule paleoenzymology probes the chemistry of resurrected enzymes. Nat Struct Mol Biol. 2011 May;18(5):592–6.

Alvarez-Ponce D, Fares MA. Evolutionary rate and duplicability in the Arabidopsis thaliana protein–protein interaction network. Genome Biol Evol. 2012 Dec 1;4(12):1263–74.

Fares MA, Barrio E, Sabater-Muñoz B, Moya A. The evolution of the heat-shock protein GroEL from Buchnera, the primary endosymbiont of aphids, is governed by positive selection. Mol Biol Evol. 2002 Jul 1;19(7):1162–70.

Dayhoff MO, Schwartz RM, Orcutt BC. 22 a model of evolutionary change in proteins. In: Atlas of protein sequence and structure. Silver Spring: National Biomedical Research Foundation; 1978. p. 345–52.

Adachi J, Waddell PJ, Martin W, Hasegawa M. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol. 2000;50(4):348–58.

Kosiol C, Goldman N. Different versions of the Dayhoff rate matrix. Mol Biol Evol. 2005 Feb 1;22(2):193–9.

Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, Bornberg-Bauer E, et al. The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci. 2012;21(6):769–85.

Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 1998;15(7):910–7.

Taverna DM, Goldstein RA. The distribution of structures in evolving protein populations. Biopolymers. 2000;53(1):1–8.

Arenas M, Sánchez-Cobos A, Bastolla U. Maximum-likelihood phylogenetic inference with selection on protein folding stability. Mol Biol Evol. 2015 Aug 1;32(8):2195–207.

Pardi F, Gascuel O. Combinatorics of distance-based tree inference. PNAS. 2012 Oct 9;109(41):16443–8.

Felsenstein J. Inferring phylogenies, vol. 2. Sunderland: Sinauer Associates; 2004.

Yang Z. Computational molecular evolution. Oxford: OUP; 2006. 375 p.

Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–25.

Roch S. Toward extracting all phylogenetic information from matrices of evolutionary distances. Science. 2010 Mar 12;327(5971):1376–9.

Steel M. A basic limitation on inferring phylogenies by pairwise sequence comparisons. J Theor Biol. 2009 Feb 7;256(3):467–72.

Huelsenbeck JP. Is the Felsenstein zone a fly trap? Syst Biol. 1997 Mar;46(1):69–74.

Whelan S, Liò P, Goldman N. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet. 2001 May;17(5):262–72.

Huelsenbeck JP, Larget B, Miller RE, Ronquist F. Potential applications and pitfalls of Bayesian inference of phylogeny. Syst Biol. 2002;51(5):673–88.

Holder M, Lewis PO. Phylogeny estimation: traditional and Bayesian approaches. Nat Rev Genet. 2003 Apr;4(4):275–84.

Challa S, Neelapu NRR. Phylogenetic trees: applications, construction, and assessment. In: Hakeem KR, Shaik NA, Banaganapalli B, Elango R, editors. Essentials of bioinformatics, In silico life sciences: agriculture [Internet], vol. III. Cham: Springer International Publishing; 2019. p. 167–92. https://doi.org/10.1007/978-3-030-19318-8_10 . [cited 2020 Oct 24].

Kishino H, Hasegawa M. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J Mol Evol. 1989 Aug 1;29(2):170–9.

Shimodaira H, Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol. 1999 Aug 1;16(8):1114.

Stamatakis A. Phylogenetics: applications, software and challenges. Cancer Genomics Proteomics. 2005 Sep 1;2(5):301–5.

Charalambous M, Trancoso P, Stamatakis A. Initial experiences porting a bioinformatics application to a graphics processor. In: Bozanis P, Houstis EN, editors. Advances in informatics, Lecture notes in computer science. Berlin: Springer; 2005. p. 415–25.

Stamatakis A, Ott M, Ludwig T. RAxML-OMP: an efficient program for phylogenetic inference on SMPs. In: Malyshkin V, editor. Parallel computing technologies. Berlin: Springer; 2005. p. 288–302. (Lecture Notes in Computer Science).

Kosakovsky Pond SL, Muse SV. Column sorting: rapid calculation of the phylogenetic likelihood function. Syst Biol. 2004 Oct;53(5):685–92.

Zmasek CM, Eddy SR. ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics. 2001 Apr 1;17(4):383–4.

Hughes T, Hyun Y, Liberles DA. Visualising very large phylogenetic trees in three dimensional hyperbolic space. BMC Bioinformatics. 2004 Apr 29;5(1):48.

Plaisant C, Grosjean J, Bederson BB. Spacetree: supporting exploration in large node link tree, design evolution and empirical evaluation. In: IEEE Symposium on Information Visualization, 2002 INFOVIS 2002; 2002. p. 57–64.

Arvelakis A, Reczko M, Stamatakis A, Symeonidis A, Tollis IG. Using treemaps to visualize phylogenetic trees. In: Oliveira JL, Maojo V, Martín-Sánchez F, Pereira AS, editors. Biological and medical data analysis. Berlin: Springer; 2005. p. 283–93. (Lecture Notes in Computer Science).

Stolk B, Abdoelrahman F, Koning A, Wielinga P, Neefs J-M, Stubbs A, et al. Mining the human genome using virtual reality. In: Proceedings of the Fourth Eurographics Workshop on Parallel Graphics and Visualization. Goslar, DEU: Eurographics Association; 2002. p. 17–21. (EGPGV ‘02).

Carrizo SF. Phylogenetic trees: an information visualisation perspective. In: Proceedings of the second conference on Asia-Pacific bioinformatics, vol. 29. Darlinghurst: Australian Computer Society, Inc.; 2004. p. 315–20. (APBC ‘04).


Conflicts of Interest

Additional information.

Figure 9.1b has been used under the terms of the Creative Commons Attribution License [ 20 ].

Author information

Authors and affiliations.

Crop Improvement Division, ICAR-National Rice Research Institute, Cuttack, Odisha, India

Manoj Kumar Gupta, Gayatri Gouda, S. Sabarinathan, Ravindra Donde & Lambodar Behera

Department of Seed Science and Technology, College of Agriculture, Odisha University of Agriculture and Technology, Bhubaneswar, Odisha, India

S. Sabarinathan

Department of Biotechnology and Bioinformatics, Yogi Vemana University, Kadapa, Andhra Pradesh, India

N. Rajesh & Ramakrishna Vadde

District Headquarter Hospital, Ganjam, Odisha, India

Pallabi Pati

Department of Zoology, Khallikote Autonomous College, Ganjam, Odisha, India

Sushil Kumar Rathore


Editor information

Editors and affiliations.

Manoj Kumar Gupta

Lambodar Behera


Copyright information

© 2021 The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.

About this chapter

Gupta, M.K. et al. (2021). Phylogenetic Analysis. In: Gupta, M.K., Behera, L. (eds) Bioinformatics in Rice Research. Springer, Singapore. https://doi.org/10.1007/978-981-16-3993-7_9


DOI : https://doi.org/10.1007/978-981-16-3993-7_9

Published : 25 September 2021

Publisher Name : Springer, Singapore

Print ISBN : 978-981-16-3992-0

Online ISBN : 978-981-16-3993-7

eBook Packages : Biomedical and Life Sciences (R0)


Phylogenetics Algorithms and Applications

Geetika Munjal

18 The NorthCap University, Gurugram, India

Madasu Hanmandlu

19 IIT Delhi, New Delhi, India

Sangeet Srivastava

Phylogenetics is a powerful approach for tracing the evolution of present-day species. By studying phylogenetic trees, scientists gain a better understanding of how species have evolved and can explain the similarities and differences among them. Phylogenetic study can also help in analysing the evolution of, and similarities among, diseases and viruses, which in turn can inform the development of vaccines against them. This paper explores computational solutions for building the phylogeny of species and highlights the benefits of alignment-free methods. It also discusses the application of phylogenetic study to disease diagnosis and evolution.

Introduction

Phylogenetics is one of the most useful tools for understanding the spread of contagious diseases, for example the transmission of the human immunodeficiency virus (HIV) and the origin and subsequent evolution of the severe acute respiratory syndrome (SARS) associated coronavirus (SCoV) [ 1 ]. Earlier, morphological traits were used to assess similarities between species and to build phylogenetic trees. Presently, phylogenetics relies on information extracted from genetic material such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA) or protein sequences [ 2 ]. Methods of phylogenetic inference have changed drastically during the past two decades, moving from alignment-based to alignment-free methods [ 3 ]. This paper reviews methods of phylogenetic tree construction, from character-based to distance-based methods and from alignment-based to alignment-free methods. A brief review of the applications of phylogenetic trees in cancer studies is also given.

Literature Review

A phylogenetic tree can be unrooted or rooted. A rooted tree implies a direction corresponding to evolutionary time: the leaves of the tree represent present-day species. Species can be represented as DNA strings formed from the four nucleotides A (adenine), T (thymine), C (cytosine) and G (guanine). The literature reports various string processing algorithms that can quickly analyse DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. High similarity between two sequences usually implies significant functional or structural likeness, and such sequences are placed close together in the phylogenetic tree. To measure how similar a sequence is to other sequences stored in a database, we must be able to compare it quickly against a set of sequences, which requires multiple sequence comparison. Alignment methods perform this comparison using dynamic programming, but at a high computational cost, and their iterative computational steps limit their utility for long sequences [ 3 ]. Alignment-free methods overcome this limitation because they use alternative metrics, such as word frequency or sequence entropy, to measure similarity between sequences.

Methods of Phylogenetic Tree Construction

Phylogenetic tree generation typically begins with sequence alignment, and the resulting tree reflects how the alignment influences tree formation. Alignment-based methodologies are probably the most widely used tools in sequence analysis [ 4 ]. They arrange two sequences one above the other to highlight their common symbols and substrings. An alignment method depends on alignment parameters, including insertions, deletions and gaps, which play a pivotal role in the construction of the phylogenetic tree. A phylogenetic tree is the outcome of sequence analysis performed on DNA or RNA strings [ 5 ]. Sequence comparison reveals patterns of shared history between species and helps in predicting ancestral states. It also helps in understanding the biology of living organisms, which is needed to find similarities and relationships among species. Sequence comparison can follow alignment-based or alignment-free methods [ 3 , 6 , 7 ].

Sequence Alignment

Sequence alignment is a method for identifying homologous sequences. In pairwise alignment only two sequences are compared at a time, whereas in multiple sequence alignment more than two sequences are compared. An alignment can be global or local [ 8 , 9 ]. Alignment-based algorithms can also be combined with distance methods to express the similarity between two sequences in terms of the number of changes separating them. Figure  1 gives a hierarchical view of various methods for phylogenetic tree building.

Fig. 1

Hierarchical view of phylogenetic methods
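To make the dynamic programming behind pairwise global alignment concrete, the following is a minimal sketch of the Needleman-Wunsch recurrence. The paper does not name a specific algorithm or scoring scheme, so the function name and the match, mismatch and gap values are illustrative assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))  # three matches and one gap
```

The table has (len(a) + 1) × (len(b) + 1) cells, which is why the cost grows quickly for long sequences.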

Character-Based Methods

Character-based methods compare all sequences simultaneously, considering one character (site) at a time. The two main methods are maximum parsimony and maximum likelihood, and both account for the variation across a set of sequences [ 10 ]. Each selects the tree with the best score: maximum parsimony prefers the tree that requires the smallest number of changes, while maximum likelihood prefers the tree with the highest probability of producing the observed sequences under a substitution model. Maximum parsimony suffers badly from long-branch attraction and gives the least information about branch lengths [ 10 ]. In particular, when two long external branches are separated by a short internal branch, parsimony tends to group them together and returns an incorrect tree. Some salient features of character-based methods are listed in Table  1 .

Table 1

Comparison of different phylogenetic tree construction methods
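As an illustration of how maximum parsimony scores a candidate tree one site at a time, here is a small sketch using Fitch's algorithm, which the text does not name. It assumes a rooted binary tree written as nested tuples of taxon names; the toy alignment and function names are hypothetical.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):          # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                         # children agree: no extra change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical toy alignment: A and B share states, as do C and D.
alignment = {"A": "AAG", "B": "AAG", "C": "CTG", "D": "CTG"}
print(parsimony_score((("A", "B"), ("C", "D")), alignment))  # 2 changes
print(parsimony_score((("A", "C"), ("B", "D")), alignment))  # 4 changes
```

Parsimony would prefer the first tree because it explains the data with fewer changes. A full search would have to score many candidate topologies in this way.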

Distance-Based Methods

Distance-based methods use the dissimilarity (the distance) between pairs of sequences to construct trees. They are much less computationally intensive than character-based methods and are mostly accurate, as they take mutations into account. Trees are generally generated by hierarchical clustering, which produces dendrograms (nested clusters). Table  1 briefly compares various phylogenetic tree construction methods.
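The clustering step can be sketched with UPGMA, an average-linkage form of hierarchical clustering. The paper only says that hierarchical clustering is used, so the choice of UPGMA, the p-distance, and the toy sequences below are assumptions made for illustration. Branch lengths are omitted and only the topology is returned.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(names, dist):
    """Average-linkage clustering; returns the tree topology as nested tuples.

    dist maps pairs (a, b) of names to their distance.
    """
    size = {name: 1 for name in names}            # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Merge the two closest clusters.
        x, y = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged, merged_size = (x, y), size[x] + size[y]
        for z in size:
            if z not in (x, y):
                # Size-weighted average of the distances to the merged clusters.
                d[frozenset((merged, z))] = (
                    size[x] * d[frozenset((x, z))] + size[y] * d[frozenset((y, z))]
                ) / merged_size
        del size[x], size[y]
        size[merged] = merged_size
    return next(iter(size))


# Hypothetical aligned sequences.
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAAGGA", "D": "TCGAAGGA"}
dist = {(a, b): p_distance(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}
print(upgma(list(seqs), dist))  # (('A', 'B'), ('C', 'D'))
```

UPGMA implicitly assumes a constant rate of change along all lineages. Neighbor-joining relaxes that assumption but needs a slightly longer implementation.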

Alignment-Based Versus Alignment-Free Sequence Comparison

A multiple alignment of related sequences often yields the most helpful information on their phylogeny. However, it can produce incorrect results when applied to more divergent or rearranged sequences [ 3 ]. Some multiple alignment methods, although computationally intensive, align sequences strictly in the order in which they are received. Other methods align the most closely related sequences first. As a result, sequences that share a common ancestor but are less related to one another may be clustered separately [ 11 ]. The closely related sequences are therefore aligned more accurately, but the resulting phylogeny may be incorrect. Alignment can yield an optimized tree if a recursive approach is followed, but this increases the complexity of the problem. If the sequences differ greatly in length, alignment performance significantly affects tree generation.

The use of dynamic programming makes alignment computationally expensive, and its iterative steps limit its utility for large datasets. Consistent efforts have therefore been made to develop and improve multiple sequence alignment methods so that they support variable-length sequences with high accuracy and can align a larger number of sequences simultaneously. For the reasons discussed above, alignment quality affects the relationships recovered in a phylogenetic tree, and these problems with alignment-based phylogeny make the importance of alignment-free methods apparent [ 3 ].

Alignment-Free Methods for Sequence Analysis

The alignment-free methods proposed in recent years can be classified into various categories, as shown in Fig.  1 . These include k-tuple methods based on word frequencies and methods that represent the sequence without using word frequencies, i.e. compression algorithms, probabilistic methods and information theory-based methods. In the k-tuple method, a genetic sequence is represented by a frequency vector of fixed-length subsequences, and similarity or dissimilarity measures are computed from these frequency vectors. Probabilistic methods represent each sequence by the transition matrix of a Markov chain [ 12 ] of a pre-specified order, and two sequences are compared by computing the distance between their transition matrices. Graphical representations in 2D, 3D or even 20D provide an easy way to view, sort and compare sequences, and further help in recognizing major characteristics shared by similar biological sequences.
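The probabilistic representation described above can be sketched for the simplest case, a first-order Markov chain over the DNA alphabet. The order, the entry-wise absolute-difference distance and the function names are assumptions chosen for illustration, since the text does not fix a particular order or matrix distance.

```python
from collections import Counter

ALPHABET = "ACGT"


def transition_matrix(seq):
    """Estimate first-order transition probabilities from adjacent nucleotide pairs."""
    counts = Counter(zip(seq, seq[1:]))
    matrix = []
    for a in ALPHABET:
        row_total = sum(counts[(a, b)] for b in ALPHABET)
        # A nucleotide never followed by another symbol gets an all-zero row.
        matrix.append(
            [counts[(a, b)] / row_total if row_total else 0.0 for b in ALPHABET]
        )
    return matrix


def matrix_distance(m1, m2):
    """Sum of absolute differences between corresponding matrix entries."""
    return sum(abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


m1 = transition_matrix("ACGTACGTAC")
m2 = transition_matrix("AACCGGTTAA")
print(matrix_distance(m1, m2))
```

The resulting pairwise distances could then be passed to any distance-based tree-building method.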

As discussed, the k-tuple method uses k-words to characterize the compositional features of a sequence numerically. A biological sequence is converted into a vector or matrix of word frequencies. The k-word frequency is fast to compute and can be applied to full sequences. The drawbacks are that a large value of k is costly in computing time and space, and that k-word methods underestimate or even ignore the positions at which words occur. String-based distance measures instead use substring matches with k mismatches.
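A minimal sketch of the k-tuple representation follows. It builds the frequency vector over all 4^k possible DNA words and compares two sequences with the Euclidean distance. The choice of distance and the function names are assumptions, as the text does not prescribe a specific measure.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_vector(seq, k, alphabet="ACGT"):
    """Relative frequency of every possible k-word, in a fixed lexicographic order."""
    windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(windows))
    total = max(windows, 1)
    # The vector has len(alphabet) ** k entries, so it grows exponentially with k.
    return [counts["".join(word)] / total for word in product(alphabet, repeat=k)]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


print(len(kmer_vector("ACGTACGT", 3)))                        # 64 entries for k = 3
print(euclidean(kmer_vector("ACGT", 1), kmer_vector("TGCA", 1)))  # 0.0
```

The second call illustrates the limitation noted above: for k = 1 the two sequences have identical composition, so their distance is zero even though the order of the nucleotides is reversed.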

Application of Phylogenetics in Cancer Studies

Cancer research is one of the most significant areas of medical research. Mutations in genomic sequences are responsible for cancer development and for increased aggressiveness of the disease in patients [ 13 , 14 ]. The combinations of such gene mutations, or progression pathways, across a population can be summarized in a phylogeny describing the different evolutionary pathways [ 9 ]. Phylogenetic trees can be used to find similarities among breast cancer subtypes based on gene data [ 14 , 15 ]. The discovery of genes associated with cancer subtypes helps researchers to map different pathways and to classify cancer subtypes according to their mutations. Methods of phylogenetic tree inference have proliferated in cancer genome studies, such as those on breast cancer [ 13 ]. Phylogenetics can capture important mutational events among different cancer types; a network approach can also capture tumour similarities.

The literature shows that in cancer, driver genes alter disease progression and also affect the participation of other genes, thus generating a gene interaction network. Phylogenetic methods can address the problem of class prediction by using a classification tree, and they give a deeper understanding of the biological heterogeneity among cancer subtypes.

This research focuses on the various methods of sequence analysis used to generate phylogenetic trees. The limitations of sequence alignment methods led to the development of alignment-free sequence analysis. However, most existing alignment-free methods are unable to build an accurate tree, so they require further refinement. Phylogenetic study is not limited to the evolution of species; it applies to the evolution of diseases as well. Extending phylogenetics to disease diagnosis can open up new treatment options and improve understanding of disease progression.

Acknowledgements

This research is funded by the Department of Science and Technology, Delhi, under the sanction number SR/WOS-A/ET-1015/2015.

Contributor Information

Yu-Chen Hu, Phone: +886886886426328001, Email: wt.ude.up@uhcy .

Shailesh Tiwari, Email: [email protected] .

Krishn K. Mishra, Email: ni.ca.tinnm@mkk .

Munesh C. Trivedi, Email: [email protected] .

Geetika Munjal, Email: ude.aidniucn@akiteeg , Email: [email protected] .

Madasu Hanmandlu, Email: moc.liamg@uldnamhm .

Sangeet Srivastava, Email: ude.aidniucn@avatsavirsteegnas .
