This tutorial presents the use of the rSWeeP tool on an example with bacterial proteome data (complete set of translated CDSs). The data includes the genera Azoarcus, Aromatoleum and related genera belonging to the Betaproteobacterial.

The data presented here refer to the study by Raittz et al. (2021) - (paper’s link)

By vectorizing the sequences using the rSWeeP package, we can carry out a series of analyses and visualizations of the data:

  • obtaining the principal components (PCA)
  • low-dimensional visualizations such as tSNE and UMAP
  • building a phylogenetic tree
  • heatmap relating the organisms
  • clustering to group the organisms - specially useful for not classified samples

1. Obtain the data

The raw data can be found (here). Download the Azoarcus folder in your local machine, and provide the path='pathto/Azoarcus/seqs/' to the folder.

2. SWeeP Vectorisation

As a first step, provide the path of the sequences to the rSWeePlite function. This function will vectorize the amino acid sequences according to the parameters provided. For more details on parameterizations, see the tutorial. This stage can take around 20-25 minutes.

library(rSWeeP)
# supply here the path to Azoarcus data
path = 'pathto/Azoarcus/seqs/'
sw = SWeePlite(path,seqtype='AA',bin=TRUE,mask=c(4),psz=1500,verbose=FALSE)
#> Starting projection. Please wait.

The parameters above can be modified.

A 4-mer or 5-mer mask such as mask=c(4) (which corresponds to [1111]) or mask = c(5) is ideal for genomic data.

The size of the projection can vary from psz=1000 to psz=5000, depending on the type and volume of data.

The bin=TRUE parameter indicates that the HDV (see tutorial) uses presence-absence of k-mer patterns information, and do not count the number of k-mer patterns found.

The seqtype='AA' parameter indicates that the dataset is amino acid content.

The verbose=FALSE parameter indicates that the process will not indicate progress (use verbose=TRUE to follow the progress of the process).

3. Analysis

Once you have the vectors, you can implement the aforementioned analyses. We demonstrate some possibilities below.

First, load the metadata. The metadata provided only indicates the genus to which each sample belongs.

pathMetadata = 'pathto/metaAzo.csv'
mt = read.csv(pathMetadata)
  
vunq = unique(mt$genus)
ncl = length(vunq)
cl = mt$genus 
for (i in 1:ncl) {
  cl[mt$genus==vunq[i]] = i
}

2.1. PCA

We obtain the PCA and visualize the first 4 components.

pca_output <- prcomp (sw$proj , scale = FALSE)
par(mfrow=c(1,2))
plot(pca_output$x[,1] , pca_output$x [,2] , xlab = 'PC-1' , ylab = 'PC-2' , pch =20 , col = cl)
legend("topleft" , vunq , col = as.character(c(1:ncl)) , pch =20)
plot(pca_output$x[,3] , pca_output$x [,4] , xlab = 'PC-3' , ylab = 'PC-4' , pch =20 , col = cl)

Note that the genera are clearly grouped together. Also note, for example, that genera Azospira and Dechloromonas overlap in PC-1 x PC-2, but move apart in PC-3 x PC-4, becoming separable.

2.2. UMAP

We obtain the UMAP visualization.

library(umap)
umap_output <- umap(sw$proj)
plot(umap_output$layout[,1],umap_output$layout[,2] , xlab = 'UMAP-1' , ylab = 'UMAP-2' , pch =20, col=cl)
legend("bottomleft" , vunq , col = as.character(c(1:ncl)) , pch =20)

2.3. Heatmap

Note that the heatmap groups the genera together. On the scale, the darker, the greater the relationship between the organisms.

data = pca_output$x[,1:5] 
rownames(data) = mt$genus
mdist = dist(data,method='euclidean')
heatmap(as.matrix(mdist),symm=T,col=colorRampPalette(c('#000000','#FFFFFF'))(10))

2.4. Phylogenetic tree

We present an unrooted tree. There are some inconsistencies with regard to genera, which are discussed in the study by Raittz et al. (2021).

library(ape)
library(ggtree)
mdist = dist(sw$proj,method='euclidean')
tr = nj(mdist)
p<- ggtree(tr)+ geom_tiplab()
p

References

De Pierri, Camilla Reginatto, Ricardo Voyceik, Letı́cia Graziela Costa Santos de Mattos, Mariane Gonçalves Kulik, Josué Oliveira Camargo, Aryel Marlus Repula de Oliveira, Bruno Thiago de Lima Nichio, et al. 2020. “SWeeP: Representing Large Biological Sequences Datasets in Compact Vectors.” Scientific Reports 10 (1): 1–10. https://doi.org/10.1038/s41598-019-55627-4.
Raittz, Roberto Tadeu, Camilla Reginatto De Pierri, Marta Maluk, Marcelo Bueno Batista, Manuel Carmona, Madan Junghare, Helisson Faoro, et al. 2021. “Comparative Genomics Provides Insights into the Taxonomy of Azoarcus and Reveals Separate Origins of Nif Genes in the Proposed Azoarcus and Aromatoleum Genera.” Genes 12 (1): 71. https://doi.org/10.3390/genes12010071.