The 16S rRNA gene is widely used for philogenetic analysis due to its evolutionary conservation, however, it also shows insufficient resolution at the species level. To remedy this problem, it has been introduced the multiple gene-based phylogenies. The most used genes for this type of analysis are the ones coding for subunits of ubiquitous enzymes, in this case, we’ll be using the β-subunit of DNA gyrase (gyrB), the β-subunit of RNA polymerase (rpoB), the β-subunit of ATP synthase F0F1 (atpD), and the translation initiation factor IF-2 (infB) (GLAESER; KÄMPFER, 2015).
The sequences of the 4 genes were obtained from NCBI, using the id provided in the supplement to the study by Glaeser and Kämpfer (2015). The Enterobacter mori LMG 25706, Leclercia adecarboxylata LMG 2803, Enterobacter massiliensis JC163 and Erwinia tasmaniensis Et1/99 sequences were not included due to their unavailability at NCBI.
There’s a ‘LISTofGENEids.txt’ with the list of ids for each one of the four genes
from Bio import Entrez
Entrez.email = "YOUR_EMAIL@gmail.com " #Replace with your email
with open('LISTofGENEids.txt') as file:
gene_codes = file.read().splitlines() #Create a list with the gene codes in the file
for actualGene in gene_codes:
#Do the research on the NCBI
handle = Entrez.efetch(db="nucleotide", id=actualGene, rettype="fasta", retmode="text")
labelFasta = handle.readline().strip()
fasta_data = handle.read()
if fasta_data != "": #Check if the file isn't empty
with open(f"YOURfolder/{actualGene}.fasta", "w") as fasta_file:
fasta_file.write(labelFasta+'\n'+fasta_data) #Write a FASTA file with the gene sequences, named after each gene code
else:
print((f"Gene {actualGene} not find."))
The Xenorhabdus nematophila used as outgroup to root the tree, was obtained directly from NCBI through the link: https://www.ncbi.nlm.nih.gov/nuccore/NC_014228.1. The sequences of those 4 genes were obtained manually and added to the dataset.
The raw data can be found (here).
Download the MLSAEnterobacteriaceae folder in your local machine, and
provide the path='pathto/MLSAEnterobacteriaceae/seqs/'
to
the folder.
library(rSWeeP)
# supply here the path to MLSA data
folder = 'pathto/MLSAEnterobacteriaceae/seqs/'
As a first step, vectorize the sequences using SWeeP.
library(rSWeeP)
sw = SWeePlite(input = folder, psz = 1000, extension = '.fasta',seqtype='NT', mask = c(5,5,5), bin=FALSE, verbose=FALSE)
## Starting projection. Please wait.
We present a tree rooted Xenorhabdus nematophila ATCC 19061.
library(ape)
library(ggtree)
mdist = dist(sw$proj,method='euclidean')
tree = nj(mdist)
tree = root(tree,outgroup='Xenorhabdus_nematophila_ATCC_19061')
p<- ggtree(tree,branch.length="none")+ geom_tiplab()+ xlim(0,50)
p
Metrics can be implemented to check the quality of the trees. The metrics are available:
First, create a genus list.
genus_sw =c()
for(i in 1:length(tree$tip.label)){
genus_sw= c(genus_sw, stringr::str_split(tree$tip.label[i],'_')[[1]][1])
}
Create a metadata: first column with label names, and second column with the corresponding taxon names. Provide the metadata to obtain the metrics.
mt = data.frame(sp=tree$tip.label, genus = genus_sw)
PCCI(tree,mt)
## $tab
## taxa cost
## 1 Brenneria 1.0000000
## 2 Citrobacter 0.7500000
## 3 Cronobacter 0.8472222
## 4 Dickeya 1.0000000
## 5 Enterobacter 0.7530864
## 6 Erwinia 0.8429752
## 7 Gibbsiella 1.0000000
## 8 Klebsiella 0.6400000
## 9 Kluyvera 1.0000000
## 10 Kosakonia 1.0000000
## 11 Lelliottia 1.0000000
## 12 Lonsdalea 1.0000000
## 13 Mangrovibacter 1.0000000
## 14 Pantoea 0.8789062
## 15 Pectobacterium 0.9387755
## 16 Pluralibacter 1.0000000
## 17 Rahnella 0.8611111
## 18 Raoultella 0.7500000
## 19 Samsonia 1.0000000
## 20 Serratia 0.8533333
## 21 Tatumella 0.7200000
## 22 Xenorhabdus 1.0000000
## 23 Yokenella 1.0000000
##
## $mean
## [1] 0.9058874
PMPG(tree,mt)
## $tab
## taxa mono para
## 1 Brenneria TRUE FALSE
## 2 Citrobacter FALSE FALSE
## 3 Cronobacter TRUE FALSE
## 4 Dickeya TRUE FALSE
## 5 Enterobacter FALSE FALSE
## 6 Erwinia FALSE TRUE
## 7 Gibbsiella TRUE FALSE
## 8 Klebsiella FALSE FALSE
## 9 Kluyvera TRUE FALSE
## 10 Kosakonia TRUE FALSE
## 11 Lelliottia TRUE FALSE
## 12 Lonsdalea TRUE FALSE
## 13 Mangrovibacter TRUE FALSE
## 14 Pantoea TRUE FALSE
## 15 Pectobacterium TRUE FALSE
## 16 Pluralibacter TRUE FALSE
## 17 Rahnella FALSE FALSE
## 18 Raoultella FALSE TRUE
## 19 Samsonia TRUE FALSE
## 20 Serratia TRUE FALSE
## 21 Tatumella FALSE FALSE
## 22 Xenorhabdus TRUE FALSE
## 23 Yokenella TRUE FALSE
##
## $percMono
## [1] 69.56522
##
## $percPara
## [1] 8.695652
##
## $metric
## [1] 78.26087
First, sketch all the Fasta files, then generate the matrix
mash sketch -s 10000 -a -o sketches *.fasta
mash dist sketches.msh sketches.msh > distances.tab
After that, we’ll need to shape the output matrix into the format of a distance matrix
import pandas as pd
distances = pd.read_csv('distances.tab', sep='\t', header=None, names=['seq1', 'seq2', 'distance', 'p_value', 'matching_hashes'])
distance_matrix = distances.pivot(index='seq1', columns='seq2', values='distance') #Reshape the mash dataframe in the format of a distance matrix
distance_matrix.to_csv('distance_matrix_MASH_final.csv')
The distance matrix can be found (here).
Tree rooted Xenorhabdus nematophila ATCC 19061
library(ape)
library(ggtree)
dt = read.csv('distance_matrix_MASH_final.csv')
rownames(dt) = dt[,1] #Rename the index by organism names
dt = dt[,-1] #Remove the first column with the organism names
mdist_mash= dist(dt,method='euclidean')
tree_mash = nj(mdist_mash)
tree_mash = root(tree_mash,outgroup='Xenorhabdus_nematophila_ATCC_19061.fasta')
p_mash<- ggtree(tree_mash,branch.length="none")+ geom_tiplab()+ xlim(0,50)
p_mash
Obtaining the same metrics as we did for SWeeP
library(rSWeeP)
genus_mash = c()
for(i in 1:length(tree_mash$tip.label)){
genus_mash = c(genus_mash, stringr::str_split(tree_mash$tip.label[i],' ')[[1]][1])
}
mt_mash = data.frame(sp=tree_mash$tip.label, genus = genus_mash)
PMPG(tree_mash, mt_mash)
## $tab
## taxa mono para
## 1 Brenneria FALSE FALSE
## 2 Citrobacter FALSE FALSE
## 3 Cronobacter FALSE FALSE
## 4 Dickeya FALSE FALSE
## 5 Enterobacter FALSE FALSE
## 6 Erwinia FALSE FALSE
## 7 Gibbsiella TRUE FALSE
## 8 Klebsiella FALSE FALSE
## 9 Kluyvera FALSE FALSE
## 10 Kosakonia TRUE FALSE
## 11 Lelliottia TRUE FALSE
## 12 Lonsdalea TRUE FALSE
## 13 Mangrovibacter TRUE FALSE
## 14 Pantoea TRUE FALSE
## 15 Pectobacterium FALSE FALSE
## 16 Pluralibacter TRUE FALSE
## 17 Rahnella FALSE FALSE
## 18 Raoultella TRUE FALSE
## 19 Samsonia TRUE FALSE
## 20 Serratia FALSE FALSE
## 21 Tatumella FALSE FALSE
## 22 Xenorhabdus_nematophila_ATCC_19061.fasta TRUE FALSE
## 23 Yokenella TRUE FALSE
##
## $percMono
## [1] 47.82609
##
## $percPara
## [1] 0
##
## $metric
## [1] 47.82609
PCCI(tree_mash, mt_mash)
## $tab
## taxa cost
## 1 Brenneria 0.7500000
## 2 Citrobacter 0.7500000
## 3 Cronobacter 0.8263889
## 4 Dickeya 0.7222222
## 5 Enterobacter 0.7160494
## 6 Erwinia 0.6033058
## 7 Gibbsiella 1.0000000
## 8 Klebsiella 0.7600000
## 9 Kluyvera 0.7500000
## 10 Kosakonia 1.0000000
## 11 Lelliottia 1.0000000
## 12 Lonsdalea 1.0000000
## 13 Mangrovibacter 1.0000000
## 14 Pantoea 0.8398438
## 15 Pectobacterium 0.7755102
## 16 Pluralibacter 1.0000000
## 17 Rahnella 0.8611111
## 18 Raoultella 1.0000000
## 19 Samsonia 1.0000000
## 20 Serratia 0.8266667
## 21 Tatumella 0.8800000
## 22 Xenorhabdus_nematophila_ATCC_19061.fasta 1.0000000
## 23 Yokenella 1.0000000
##
## $mean
## [1] 0.8722217
Glaeser, S. P., & Kämpfer, P. (2015). Multilocus sequence analysis (MLSA) in prokaryotic taxonomy. Systematic and applied microbiology, 38(4), 237-245.