The 16S rRNA gene is widely used for philogenetic analysis due to its evolutionary conservation, however, it also shows insufficient resolution at the species level. To remedy this problem, it has been introduced the multiple gene-based phylogenies. The most used genes for this type of analysis are the ones coding for subunits of ubiquitous enzymes, in this case, we’ll be using the β-subunit of DNA gyrase (gyrB), the β-subunit of RNA polymerase (rpoB), the β-subunit of ATP synthase F0F1 (atpD), and the translation initiation factor IF-2 (infB) (GLAESER; KÄMPFER, 2015).

Obtaining gene sequences

The sequences of the 4 genes were obtained from NCBI, using the id provided in the supplement to the study by Glaeser and Kämpfer (2015). The Enterobacter mori LMG 25706, Leclercia adecarboxylata LMG 2803, Enterobacter massiliensis JC163 and Erwinia tasmaniensis Et1/99 sequences were not included due to their unavailability at NCBI.

How do I get the sequences using biopython?

There’s a ‘LISTofGENEids.txt’ with the list of ids for each one of the four genes

from Bio import Entrez
Entrez.email = "YOUR_EMAIL@gmail.com " #Replace with your email

with open('LISTofGENEids.txt') as file:
  gene_codes = file.read().splitlines() #Create a list with the gene codes in the file
for actualGene in gene_codes:
  #Do the research on the NCBI
  handle = Entrez.efetch(db="nucleotide", id=actualGene, rettype="fasta", retmode="text")
  labelFasta = handle.readline().strip()
  fasta_data = handle.read()
  if fasta_data != "": #Check if the file isn't empty
    with open(f"YOURfolder/{actualGene}.fasta", "w") as fasta_file:
      fasta_file.write(labelFasta+'\n'+fasta_data) #Write a FASTA file with the gene sequences, named after each gene code
  else:
    print((f"Gene {actualGene} not find."))

The Xenorhabdus nematophila used as outgroup to root the tree, was obtained directly from NCBI through the link: https://www.ncbi.nlm.nih.gov/nuccore/NC_014228.1. The sequences of those 4 genes were obtained manually and added to the dataset.

The raw data can be found (here). Download the MLSAEnterobacteriaceae folder in your local machine, and provide the path='pathto/MLSAEnterobacteriaceae/seqs/' to the folder.

library(rSWeeP)
# supply here the path to MLSA data
folder = 'pathto/MLSAEnterobacteriaceae/seqs/'

Sequences analisys

SWeeP

As a first step, vectorize the sequences using SWeeP.

library(rSWeeP)
sw = SWeePlite(input = folder, psz = 1000, extension = '.fasta',seqtype='NT', mask = c(5,5,5), bin=FALSE, verbose=FALSE)
## Starting projection. Please wait.

Phylogenetic Tree

We present a tree rooted Xenorhabdus nematophila ATCC 19061.

library(ape)
library(ggtree)
mdist = dist(sw$proj,method='euclidean')
tree = nj(mdist)
tree = root(tree,outgroup='Xenorhabdus_nematophila_ATCC_19061')
p<- ggtree(tree,branch.length="none")+ geom_tiplab()+ xlim(0,50)
p

Metrics

Metrics can be implemented to check the quality of the trees. The metrics are available:

  • PCCI: PhyloTaxonomic Consistency Cophenetic Index. Phylogenetic tree evaluation function, estimate of how grouped the samples of the same taxon are in the phylogenetic tree.
  • PMPG: Percentage of Mono or Paraphyletic Groups. Phylogenetic tree evaluation function, returns the percentage of Mono/Paraphyletic

First, create a genus list.

genus_sw =c()
for(i in 1:length(tree$tip.label)){
  genus_sw= c(genus_sw, stringr::str_split(tree$tip.label[i],'_')[[1]][1])
} 

Create a metadata: first column with label names, and second column with the corresponding taxon names. Provide the metadata to obtain the metrics.

mt = data.frame(sp=tree$tip.label, genus = genus_sw)

PCCI(tree,mt)
## $tab
##              taxa      cost
## 1       Brenneria 1.0000000
## 2     Citrobacter 0.7500000
## 3     Cronobacter 0.8472222
## 4         Dickeya 1.0000000
## 5    Enterobacter 0.7530864
## 6         Erwinia 0.8429752
## 7      Gibbsiella 1.0000000
## 8      Klebsiella 0.6400000
## 9        Kluyvera 1.0000000
## 10      Kosakonia 1.0000000
## 11     Lelliottia 1.0000000
## 12      Lonsdalea 1.0000000
## 13 Mangrovibacter 1.0000000
## 14        Pantoea 0.8789062
## 15 Pectobacterium 0.9387755
## 16  Pluralibacter 1.0000000
## 17       Rahnella 0.8611111
## 18     Raoultella 0.7500000
## 19       Samsonia 1.0000000
## 20       Serratia 0.8533333
## 21      Tatumella 0.7200000
## 22    Xenorhabdus 1.0000000
## 23      Yokenella 1.0000000
## 
## $mean
## [1] 0.9058874
PMPG(tree,mt)
## $tab
##              taxa  mono  para
## 1       Brenneria  TRUE FALSE
## 2     Citrobacter FALSE FALSE
## 3     Cronobacter  TRUE FALSE
## 4         Dickeya  TRUE FALSE
## 5    Enterobacter FALSE FALSE
## 6         Erwinia FALSE  TRUE
## 7      Gibbsiella  TRUE FALSE
## 8      Klebsiella FALSE FALSE
## 9        Kluyvera  TRUE FALSE
## 10      Kosakonia  TRUE FALSE
## 11     Lelliottia  TRUE FALSE
## 12      Lonsdalea  TRUE FALSE
## 13 Mangrovibacter  TRUE FALSE
## 14        Pantoea  TRUE FALSE
## 15 Pectobacterium  TRUE FALSE
## 16  Pluralibacter  TRUE FALSE
## 17       Rahnella FALSE FALSE
## 18     Raoultella FALSE  TRUE
## 19       Samsonia  TRUE FALSE
## 20       Serratia  TRUE FALSE
## 21      Tatumella FALSE FALSE
## 22    Xenorhabdus  TRUE FALSE
## 23      Yokenella  TRUE FALSE
## 
## $percMono
## [1] 69.56522
## 
## $percPara
## [1] 8.695652
## 
## $metric
## [1] 78.26087

Running MASH for comparison

First, sketch all the Fasta files, then generate the matrix

mash sketch -s 10000 -a -o sketches  *.fasta
mash dist sketches.msh sketches.msh > distances.tab

After that, we’ll need to shape the output matrix into the format of a distance matrix

import pandas as pd
distances = pd.read_csv('distances.tab', sep='\t', header=None, names=['seq1', 'seq2', 'distance', 'p_value', 'matching_hashes'])
distance_matrix = distances.pivot(index='seq1', columns='seq2', values='distance') #Reshape the mash dataframe in the format of a distance matrix
distance_matrix.to_csv('distance_matrix_MASH_final.csv')

The distance matrix can be found (here).

Phylogenetic tree

Tree rooted Xenorhabdus nematophila ATCC 19061

library(ape)
library(ggtree)
dt = read.csv('distance_matrix_MASH_final.csv')
rownames(dt) = dt[,1] #Rename the index by organism names
dt = dt[,-1] #Remove the first column with the organism names

mdist_mash= dist(dt,method='euclidean')
tree_mash = nj(mdist_mash)
tree_mash = root(tree_mash,outgroup='Xenorhabdus_nematophila_ATCC_19061.fasta')
p_mash<- ggtree(tree_mash,branch.length="none")+ geom_tiplab()+ xlim(0,50)
p_mash

Metrics

Obtaining the same metrics as we did for SWeeP

library(rSWeeP)
genus_mash = c()
for(i in 1:length(tree_mash$tip.label)){
genus_mash = c(genus_mash, stringr::str_split(tree_mash$tip.label[i],' ')[[1]][1])
}
mt_mash = data.frame(sp=tree_mash$tip.label, genus = genus_mash)
PMPG(tree_mash, mt_mash)
## $tab
##                                        taxa  mono  para
## 1                                 Brenneria FALSE FALSE
## 2                               Citrobacter FALSE FALSE
## 3                               Cronobacter FALSE FALSE
## 4                                   Dickeya FALSE FALSE
## 5                              Enterobacter FALSE FALSE
## 6                                   Erwinia FALSE FALSE
## 7                                Gibbsiella  TRUE FALSE
## 8                                Klebsiella FALSE FALSE
## 9                                  Kluyvera FALSE FALSE
## 10                                Kosakonia  TRUE FALSE
## 11                               Lelliottia  TRUE FALSE
## 12                                Lonsdalea  TRUE FALSE
## 13                           Mangrovibacter  TRUE FALSE
## 14                                  Pantoea  TRUE FALSE
## 15                           Pectobacterium FALSE FALSE
## 16                            Pluralibacter  TRUE FALSE
## 17                                 Rahnella FALSE FALSE
## 18                               Raoultella  TRUE FALSE
## 19                                 Samsonia  TRUE FALSE
## 20                                 Serratia FALSE FALSE
## 21                                Tatumella FALSE FALSE
## 22 Xenorhabdus_nematophila_ATCC_19061.fasta  TRUE FALSE
## 23                                Yokenella  TRUE FALSE
## 
## $percMono
## [1] 47.82609
## 
## $percPara
## [1] 0
## 
## $metric
## [1] 47.82609
PCCI(tree_mash, mt_mash)
## $tab
##                                        taxa      cost
## 1                                 Brenneria 0.7500000
## 2                               Citrobacter 0.7500000
## 3                               Cronobacter 0.8263889
## 4                                   Dickeya 0.7222222
## 5                              Enterobacter 0.7160494
## 6                                   Erwinia 0.6033058
## 7                                Gibbsiella 1.0000000
## 8                                Klebsiella 0.7600000
## 9                                  Kluyvera 0.7500000
## 10                                Kosakonia 1.0000000
## 11                               Lelliottia 1.0000000
## 12                                Lonsdalea 1.0000000
## 13                           Mangrovibacter 1.0000000
## 14                                  Pantoea 0.8398438
## 15                           Pectobacterium 0.7755102
## 16                            Pluralibacter 1.0000000
## 17                                 Rahnella 0.8611111
## 18                               Raoultella 1.0000000
## 19                                 Samsonia 1.0000000
## 20                                 Serratia 0.8266667
## 21                                Tatumella 0.8800000
## 22 Xenorhabdus_nematophila_ATCC_19061.fasta 1.0000000
## 23                                Yokenella 1.0000000
## 
## $mean
## [1] 0.8722217

References

Glaeser, S. P., & Kämpfer, P. (2015). Multilocus sequence analysis (MLSA) in prokaryotic taxonomy. Systematic and applied microbiology, 38(4), 237-245.