This tutorial presents the use of the rSWeeP tool on an example with bacterial proteome data (complete set of translated CDSs). The data includes the genera Azoarcus, Aromatoleum and related genera belonging to the Betaproteobacterial.
The data presented here refer to the study by Raittz et al. (2021) - (paper’s link)
By vectorizing the sequences using the rSWeeP
package,
we can carry out a series of analyses and visualizations of the
data:
The raw data can be found (here).
Download the Azoarcus folder in your local machine, and provide the
path='pathto/Azoarcus/seqs/'
to the folder.
As a first step, provide the path of the sequences to the rSWeePlite function. This function will vectorize the amino acid sequences according to the parameters provided. For more details on parameterizations, see the tutorial. This stage can take around 20-25 minutes.
library(rSWeeP)
# supply here the path to Azoarcus data
path = 'pathto/Azoarcus/seqs/'
sw = SWeePlite(path,seqtype='AA',bin=TRUE,mask=c(4),psz=1500,verbose=FALSE)
#> Starting projection. Please wait.
The parameters above can be modified.
A 4-mer or 5-mer mask such as mask=c(4)
(which
corresponds to [1111]) or mask = c(5)
is ideal for genomic
data.
The size of the projection can vary from psz=1000
to
psz=5000
, depending on the type and volume of data.
The bin=TRUE
parameter indicates that the HDV (see tutorial)
uses presence-absence of k-mer patterns information, and do not count
the number of k-mer patterns found.
The seqtype='AA'
parameter indicates that the dataset is
amino acid content.
The verbose=FALSE
parameter indicates that the process
will not indicate progress (use verbose=TRUE
to follow the
progress of the process).
Once you have the vectors, you can implement the aforementioned analyses. We demonstrate some possibilities below.
First, load the metadata. The metadata provided only indicates the genus to which each sample belongs.
pathMetadata = 'pathto/metaAzo.csv'
mt = read.csv(pathMetadata)
vunq = unique(mt$genus)
ncl = length(vunq)
cl = mt$genus
for (i in 1:ncl) {
cl[mt$genus==vunq[i]] = i
}
We obtain the PCA and visualize the first 4 components.
pca_output <- prcomp (sw$proj , scale = FALSE)
par(mfrow=c(1,2))
plot(pca_output$x[,1] , pca_output$x [,2] , xlab = 'PC-1' , ylab = 'PC-2' , pch =20 , col = cl)
legend("topleft" , vunq , col = as.character(c(1:ncl)) , pch =20)
plot(pca_output$x[,3] , pca_output$x [,4] , xlab = 'PC-3' , ylab = 'PC-4' , pch =20 , col = cl)
Note that the genera are clearly grouped together. Also note, for example, that genera Azospira and Dechloromonas overlap in PC-1 x PC-2, but move apart in PC-3 x PC-4, becoming separable.
We obtain the UMAP visualization.
library(umap)
umap_output <- umap(sw$proj)
plot(umap_output$layout[,1],umap_output$layout[,2] , xlab = 'UMAP-1' , ylab = 'UMAP-2' , pch =20, col=cl)
legend("bottomleft" , vunq , col = as.character(c(1:ncl)) , pch =20)
Note that the heatmap groups the genera together. On the scale, the darker, the greater the relationship between the organisms.
data = pca_output$x[,1:5]
rownames(data) = mt$genus
mdist = dist(data,method='euclidean')
heatmap(as.matrix(mdist),symm=T,col=colorRampPalette(c('#000000','#FFFFFF'))(10))
We present an unrooted tree. There are some inconsistencies with regard to genera, which are discussed in the study by Raittz et al. (2021).
library(ape)
library(ggtree)
mdist = dist(sw$proj,method='euclidean')
tr = nj(mdist)
p<- ggtree(tr)+ geom_tiplab()
p