1. Installation

The package can installed via Bioconductor, as shown below:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("rSWeeP")

2. SWeeP package

For a brief introduction to using rSWeeP, visit the Quick Start Tutorial

For the original article on the technique see De Pierri et al. (2020).

A biological sequence (amino acid or nucleotide) supplied to the algorithm is first converted into a numerical vector for counting patterns of k-mers, called a High Dimensional Vector (HDV). The HDV represents the biological sequence vectorially, but it is very long. Thus, the HDV is projected onto a dimension of reduced length by the projection matrix (orthBase), and its output is the Low Dimensional Vector (LDV). The LDV corresponds to the final output of the algorithm, and each sequence is represented by a short numerical vector that is easy to analyze.

Start loading the library.

library(rSWeeP)

The rSWeeP package contains the following functions:

SWeeP: conventional SWeeP function, in which it is necessary to generate or load an orthonormal basis from a file.
orthBase: function to generate the orthonormal matrix to be used by SWeeP in the projection.
SWeePlite: SWeeP function with built-in orthonormal basis. Slower function but low RAM usage.
extractHDV: extracts only the High Dimensional Vector (HDV) from biological sequences.

From an input, a matrix is returned where each row of the matrix corresponds to a vectorized sample. The vector is a list of numbers with a parameterized length psz (projection length).

There are 4 main types of input accepted by the SWeeP, SWeePlite and extractHDV functions:

input = path: the entry is an address to a folder containing the FASTAS files to be analyzed;
input = BString: the input is an object of BString, AAString, DNAString or RNAString types loaded with the BioStrings package;
input = dgCMatrix: the input is an RNAseq expression matrix (counting matrix);
input = matrix: any matrix with generic structured data (sample versus features).

Included in the package are amino acid sequences obtained from NCBI. These are mitochondrial proteomes (translated CDSs) from some mammals and bacterial proteomes. These sets will be used to present the functions. Their metadata is also available for visualizing the results.

2.1 Indicated parameters

2.1.1. Masks

To make the use of the masks clearer, the table below explains the correspondences. The third and fourth columns correspond to the appropriate formats to insert into the rSWeeP functions.

data type	k-mer	mask	binary format	short format
protein and small proteomes	4-mer	[1111]	c(1,1,1,1)	c(4)
protein and small proteomes	4-mer	[11011]	c(1,1,0,1,1)	c(2,1,2)
bacterial proteomes	5-mer	[11111]	c(1,1,1,1,1)	c(5)
bacterial proteomes	5-mer	[111011]	c(1,1,1,0,1,1)	c(3,1,2)
nucleotides	10-mer	[1111100000011111]	c(1,1,1,1,1,0,0,0,0,0,1,1,1,1,1)	c(5,5,5)
nucleotides	12-mer	[111111000111111]	c(1,1,1,1,1,1,0,0,0,1,1,1,1,1,1)	c(6,3,6)

We recommend masks between 4 and 5-mer for amino acid analysis, for example: [212], [4], [3,1,2]. The 4-mer is suitable for analyzing proteins and small proteomes such as mitochondrial, plastidial and viral proteomes. The 5-mer is suitable for analyzing bacterial proteomes.

For nucleotides, larger masks are required, such as between 6 and 12-mer, for example: [555], [636].

2.1.2. Length of output vector or Projection size (`psz`)

For proteins and small proteomes, an output vector of approximately 1,000 lengths is recommended. For genomes, longer vectors of around 5,000 length are recommended.

2.1.3. Normalization (`norm`)

There are three options for normalization: norm='none', norm='log' and norm='logNeg'. The none option and simple log (applied to HDV matrix before projection) are enough to solve most problems. The logNeg option is ideal for analyzing genes and short sequences.

via log. For counting mode, norm converts 0 to -1 and applies log() to non-zeros. For the binary case, it only converts 0 to -

2.1.4. Binary/Counting Format (`bin`)

The HDV matrix can be binary (presence-absence of k-mer patterns) or count (number of k-mer patterns found). The counting option (bin=FALSE) tends to be more informative. For smaller, simpler problems, the binary option (bin=TRUE) is sufficient.

2.1.5. Output format details

The output of the SWeeP and SWeePlite functions presents a list format output with two main items:

output$proj a numeric matrix with psz columns and one line per sequence, each row corresponding to a compact vector
output$info aditional information of the process. This object is subdivided in:
- ProjectionSize: a integer corresponding to psz (the output vector length)
- bin: a boolean containing if binary (TRUE) or counting (FALSE)
- mask: a vector containing the mask used
- SequenceType: a character containing the type of the sequence (amino acid: ‘AA’, ou nucleotide: ‘NT’)
- concatenate : a boolean corresponding to the concatenation of sequences
- version : a character corresponding to the version of the package
- norm : a character containing the normalization used
- extension: a character containing the list of extensions considered
- timeElapsed: a double containing the elapsed time in seconds
- headers : list of headers for each analyzed sequence

2.1.6. Parameters from literature

According to previous studies carried out with SWeeP, we present the parameters indicated for each type of data.

Complete bacterial proteomes¹ - small sample size (few hundred samples) Silva Filho et al. (2021):

psz $\approx$ 1500, seqtype= ‘AA’, bin = FALSE, mask = c(2,1,2), norm=‘none’

Bacterial proteomes - big sample size (thousand samples) – protein core:

psz $\approx$ 7500, seqtype= ‘AA’, bin = FALSE, mask = c(5), norm=‘logNeg’

Viral proteomes of SARS-CoV-2 - 8720 sequences Perico et al. (2022)

psz $\geq$ 600, seqtype= ‘AA’, bin = FALSE, mask = c(2,1,2), norm=‘none’

Proteins (individual or few proteins)

psz $\geq$ 600, seqtype= ‘AA’, bin = FALSE, mask = c(2,1,2), norm=‘logNeg’

Nucleotides

psz $\geq$ 1000, seqtype= ‘NT’, bin = FALSE, mask = c(6,3,6), norm=‘none’

Raittz et al. (2021)

More information on parameters can be found in the studies of De Pierri et al. (2020), Raittz et al. (2021) The list of references can be found at the end of the document.

2.2. Function `SWeeP`

The original SWeeP function requires a projection matrix to be provided, the orthBase. This can be generated by supplying the mask to be used, the data type (‘AA’ for amino acid and ‘NT’ for nucleotide), and the desired projection size, as shown below:

mask = c(2,1,2)
psz = 50
base160k = orthBase(mask=mask,col=psz,seqtype='AA')
dim(base160k$mat)

## [1] 160000     50

The orthBase matrix can be generated only once and then stored, avoiding additional processing. For the same set of parameters (mask, projection size and sequence type) the matrix generated is always the same, which guarantees the comparability of vectors from different analyses.

Below, we present the use of SWeeP for the different types of inputs.

2.2.1. Data type BString

One of the possible inputs to the SWeeP function is the BString format, which can be loaded into RAM via the Biostrings package, as shown below.

# aastring data included in the package for example purposes
path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
input = Biostrings::readBStringSet(paste(path,'11_Pan_paniscus.faa',sep=''))
sw = SWeeP(input,base160k,seqtype='AA')

## Starting projection. Please wait.
## Projecting ...

In the case where only one sequence (BString) is supplied, the concatenate=TRUE parameter allows the supplied sequences to be joined into a single sequence. This option is useful, for example, when I want to analyze a complete genome/proteome rather than the individual genes/proteins of a MULTIFASTA. In other words, using concatenate=TRUE you get a single vector representing the organism, and using concatenate=FALSE you get N vectors, each representing a gene/protein.

path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
input = Biostrings::readBStringSet(paste(path,'11_Pan_paniscus.faa',sep=''))
sw = SWeeP(input,base160k,seqtype='AA',concatenate=T,bin=T,mask=mask)

## Note that the 'concatenate' option will concatenate all the samples in your input.
## Starting projection. Please wait.
## Projecting ...

sw$info

## $ProjectionSize
## [1] 50
## 
## $mask
## [1] 1 1 0 1 1
## 
## $SequenceType
## [1] "AA"
## 
## $bin
## [1] "binary (TRUE)"
## 
## $concatenate
## [1] TRUE
## 
## $version
## [1] "SWeeP v2.8"
## 
## $norm
## [1] "none"
## 
## $timeElapsed
## [1] 0.131

2.2.2. Data type path (address for folder with FASTAS)

Another possible input for the SWeeP function is the address of a directory with FASTAS files, each corresponding to a vector to be generated (gene or genome, protein or proteome). All the FASTAS are then loaded into memory and processed.

path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
sw = SWeeP(path,base160k,seqtype='AA',bin=T,mask=mask)

## Starting projection. Please wait.
## Projecting ...

If each FASTA contains a very long sequence or there are many sequences, and memory is limited, there is the option of using the parameter lowRAMmode=TRUE in which each sequence is processed individually, saving RAM.

path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
sw = SWeeP(path,base160k,seqtype='AA',bin=T,mask=mask,lowRAMmode=T)

## starting sequence  1 of 14Projecting ...
##  - complete
## starting sequence  2 of 14Projecting ...
##  - complete
## starting sequence  3 of 14Projecting ...
##  - complete
## starting sequence  4 of 14Projecting ...
##  - complete
## starting sequence  5 of 14Projecting ...
##  - complete
## starting sequence  6 of 14Projecting ...
##  - complete
## starting sequence  7 of 14Projecting ...
##  - complete
## starting sequence  8 of 14Projecting ...
##  - complete
## starting sequence  9 of 14Projecting ...
##  - complete
## starting sequence  10 of 14Projecting ...
##  - complete
## starting sequence  11 of 14Projecting ...
##  - complete
## starting sequence  12 of 14Projecting ...
##  - complete
## starting sequence  13 of 14Projecting ...
##  - complete
## starting sequence  14 of 14Projecting ...
##  - complete

To visualize the vectorized data, we provide the metadata:

pathmetadata <- system.file(package = "rSWeeP" , "examples" , "metadata_mitochondrial.csv")
mt = read.csv(pathmetadata,header=TRUE)
vunq = c('Hominidae','Psittacidae','Muridae','Bovidae','Apocynaceae')
cl = mt$id
ncl = length(vunq)

pca_output <- prcomp (sw$proj , scale = FALSE)
par(mfrow=c(1,2))
plot(pca_output$x[,1] , pca_output$x [,2] , xlab = 'PC-1' , ylab = 'PC-2' , pch =20 , col = cl)
legend("bottomright" , vunq , col = as.character(c(1:ncl)) , pch =20)
plot(pca_output$x[,3] , pca_output$x [,4] , xlab = 'PC-3' , ylab = 'PC-4' , pch =20 , col = cl)

We can also obtain the phylogenetic relationship between the vectorized organisms using the Neighbour-Joining (NJ) algorithm.

library(ape)
mdist = dist(sw$proj,method='euclidean')
tr = nj(mdist)
tr = root(tr,outgroup='14_Rhazya_stricta')
plot(tr)

2.2.3. Data type expression matrices for RNAseq

In addition to biological sequences, SWeeP is also able to perform dimension reduction of RNAseq data from a supplied expression matrix. This matrix can be in dgCMatrix format, the Seurat package standard, or matrix format. Note that, by default, expression matrices allocate the samples (cells) in the columns and the transcribed genes (features) in the rows. In this case it is necessary to use transpose=TRUE or RNAseqdata=TRUE as parameters.

As an example of data we can use the data provided by the Seurat package itself, pbmc3k, as available at link.

library(dplyr)
library(Seurat)
library(patchwork)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "~/BD/harmony/filtered_matrices_mex/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

pbmc

## An object of class Seurat 
## 16971 features across 3388 samples within 1 assay 
## Active assay: RNA (16971 features, 0 variable features)
##  1 layer present: counts

And now we can use pbmc.data as input for SWeeP. But first, we must scale the projection matrix according to the dimensions of the matrix to be projected, the pbmc

psz=10
base160k = orthBase(dim(pbmc.data)[1],psz)
sw = SWeeP(input=pbmc.data,orthbase=base160k,transpose=T,verbose=F)

## Starting projection. Please wait.
## Projecting ...

If the data is not transposed (each line corresponds to a sample), use FALSE for both parameters. A warning message will be displayed, just ignore it.

psz=10
base160k = orthBase(dim(pbmc.data)[2],psz) # note that if I don't transpose, the columns change
sw = SWeeP(input=pbmc.data,orthbase=base160k,transpose=F,norm='log',verbose=F)

## Caution! 
## If your input is of the RNAseq type, probably each column contains a sample, and each row a gene. 
## In this case, use the 'transpose=TRUE' option.
## 
## Starting projection. Please wait.
## Projecting ...

2.2.4. Data type generic matrix

Other types of matrices can be projected (dimension reduction). The only requirement for the analysis of this data to be valid is that the input matrix has structured features, i.e. each sample is represented by the same set of features. There is also the option of transposing the matrix if necessary.

psz=10
nfeat = 100
nsamples = 5
input = matrix(runif( (nsamples*nfeat) ,0,10), nrow=nsamples, ncol = nfeat)
base160k = orthBase(nfeat,psz)
sw = SWeeP(input=input,orthbase=base160k,transpose=F)

## Starting projection. Please wait.
## Projecting ...

2.3. Function `SWeePlite`

The lite version of the SWeeP tool does not require an orthBase matrix to be generated or supplied, as this version generates it internally. This version also allows the use of larger masks, without compromising the use of the computer’s RAM if it is limited.

Just like the SWeeP function, SWeePlite supports different types of input. Here are some examples of the supported formats. Consider the following parameters:

mask = c(2,1,2)
psz = 500

2.3.1. Data type BString

Firstly, the biological sequences in BString

path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
input = Biostrings::readBStringSet(paste(path,'11_Pan_paniscus.faa',sep=''))
sw = SWeePlite(input,seqtype='AA',mask=mask,psz=psz)

## Starting projection. Please wait.
## starting sequence  1 of 13 - complete
## starting sequence  2 of 13 - complete
## starting sequence  3 of 13 - complete
## starting sequence  4 of 13 - complete
## starting sequence  5 of 13 - complete
## starting sequence  6 of 13 - complete
## starting sequence  7 of 13 - complete
## starting sequence  8 of 13 - complete
## starting sequence  9 of 13 - complete
## starting sequence  10 of 13 - complete
## starting sequence  11 of 13 - complete
## starting sequence  12 of 13 - complete
## starting sequence  13 of 13 - complete

There is also the concatenation option.

path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
input = Biostrings::readBStringSet(paste(path,'11_Pan_paniscus.faa',sep=''))
sw = SWeePlite(input,seqtype='AA',mask=mask,psz=psz,concatenate=T)

## Starting projection. Please wait.
## starting sequence  1 of 1 - complete

2.3.2. Data type path (address for folder with FASTAS)

Address type input for directory with FASTAS files, each file corresponding to a vector to be generated. All the FASTAS are then loaded into memory and processed.

path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
sw = SWeePlite(path,seqtype='AA',bin=T,mask=mask,psz=500,lowRAMmode=F)

## Starting projection. Please wait.
## starting sequence  1 of 14 - complete
## starting sequence  2 of 14 - complete
## starting sequence  3 of 14 - complete
## starting sequence  4 of 14 - complete
## starting sequence  5 of 14 - complete
## starting sequence  6 of 14 - complete
## starting sequence  7 of 14 - complete
## starting sequence  8 of 14 - complete
## starting sequence  9 of 14 - complete
## starting sequence  10 of 14 - complete
## starting sequence  11 of 14 - complete
## starting sequence  12 of 14 - complete
## starting sequence  13 of 14 - complete
## starting sequence  14 of 14 - complete

And for large volumes of data, the lowRAMmode=TRUE option for processing each FASTA individually.

path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
sw = SWeePlite(path,seqtype='AA',bin=T,mask=mask,psz=500,lowRAMmode=T)

## Starting projection. Please wait.
## starting sequence  1 of 14 - complete
## starting sequence  2 of 14 - complete
## starting sequence  3 of 14 - complete
## starting sequence  4 of 14 - complete
## starting sequence  5 of 14 - complete
## starting sequence  6 of 14 - complete
## starting sequence  7 of 14 - complete
## starting sequence  8 of 14 - complete
## starting sequence  9 of 14 - complete
## starting sequence  10 of 14 - complete
## starting sequence  11 of 14 - complete
## starting sequence  12 of 14 - complete
## starting sequence  13 of 14 - complete
## starting sequence  14 of 14 - complete

To visualize the projected data, we provide the metadata and plot the first two components via PCA:

pathmetadata <- system.file(package = "rSWeeP" , "examples" , "metadata_mitochondrial.csv")
mt = read.csv(pathmetadata,header=TRUE)
vunq = c('Hominidae','Psittacidae','Muridae','Bovidae')
cl = mt$id
ncl = length(vunq)

pca_output <- prcomp (sw$proj , scale = FALSE)
par(mfrow=c(1,2))
plot(pca_output$x[,1] , pca_output$x [,2] , xlab = 'PC-1' , ylab = 'PC-2' , pch =20 , col = cl)
legend("bottomright" , vunq , col = as.character(c(1:ncl)) , pch =20)
plot(pca_output$x[,3] , pca_output$x [,4] , xlab = 'PC-3' , ylab = 'PC-4' , pch =20 , col = cl)

2.3.3. Data type expression matrices for RNAseq

For dimension reduction of RNAseq data from a supplied expression matrix format dgCMatrix (Seurat package) or matrix format. Use the parameter transpose=TRUE or RNAseqdata=TRUE if samples (cells) are in columns and the transcribed genes (features) in the rows.

As an example of data we can use the data provided by the Seurat package itself, pbmc3k, as available at link.

library(dplyr)
library(Seurat)
library(patchwork)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "~/BD/harmony/filtered_matrices_mex/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

pbmc

## An object of class Seurat 
## 16971 features across 3388 samples within 1 assay 
## Active assay: RNA (16971 features, 0 variable features)
##  1 layer present: counts

This step can take several minutes. Although slower, it is ideal for large data sets with limited memory usage.

psz=10
sw = SWeePlite(input=pbmc.data,transpose=T,psz=psz,verbose=F)

2.3.4. Data type generic matrix

Finally, a generic structured matrix is allowed. The format of the matrix must correspond to a sample in each row and a feature in each column. But if the matrix is transposed, use transpose=TRUE.

psz=10
nfeat = 100
nsamples = 5
input = matrix(runif( (nsamples*nfeat) ,0,10), nrow=nsamples, ncol = nfeat)
sw = SWeePlite(input=input,psz=10,transpose=F)

## starting sequence  1 of 5 - complete
## starting sequence  2 of 5 - complete
## starting sequence  3 of 5 - complete
## starting sequence  4 of 5 - complete
## starting sequence  5 of 5 - complete

2.4. Function `extractHDV`

The extractHDV returns a vector for each sequence, but not the compact one (LDV). The HDV corresponds to the high-dimensional vector representing the k-mer pattern count of the given sequence. The HDV is best described in Figure from section 2

path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
mask = c(2,1,2)
output = extractHDV(input=path,mask=mask,seqtype='AA',bin=FALSE,extension=c('.faa','.fas','.fasta'))

## starting sequence  1 of 14 - complete
## starting sequence  2 of 14 - complete
## starting sequence  3 of 14 - complete
## starting sequence  4 of 14 - complete
## starting sequence  5 of 14 - complete
## starting sequence  6 of 14 - complete
## starting sequence  7 of 14 - complete
## starting sequence  8 of 14 - complete
## starting sequence  9 of 14 - complete
## starting sequence  10 of 14 - complete
## starting sequence  11 of 14 - complete
## starting sequence  12 of 14 - complete
## starting sequence  13 of 14 - complete
## starting sequence  14 of 14 - complete

The dimension of the HDV matrix for a 4-mer mask applied to amino acids is

dim(output$HDV)

## [1]     14 160000

The information provided by the output

output$info

## $headers
##  [1] "01_Pan_troglodytes"               "02_Capra_aegagrus"               
##  [3] "03_Homo_sapiens"                  "04_Bos_taurus"                   
##  [5] "05_Ara_ararauna"                  "06_Mus_musculus"                 
##  [7] "07_Brotogeris_cyanoptera"         "08_Homo_sapiens_neanderthalensis"
##  [9] "09_Gazella_gazella"               "10_Rattus_norvegicus"            
## [11] "11_Pan_paniscus"                  "12_Psittacara_rubritorquis"      
## [13] "13_Apodemus_sylvaticus"           "14_Rhazya_stricta"               
## 
## $mask
## [1] 1 1 0 1 1
## 
## $SequenceType
## [1] "AA"
## 
## $extension
## [1] ".faa"   ".fas"   ".fasta"
## 
## $concatenate
## [1] FALSE
## 
## $bin
## [1] "counting (FALSE)"
## 
## $version
## [1] "SWeeP v2.8"
## 
## $saturation
##  [1] 0.01241875 0.01256875 0.01240625 0.01260000 0.01223125 0.01255000
##  [7] 0.01198125 0.01236250 0.01250000 0.01237500 0.01239375 0.01207500
## [13] 0.01238125 0.03448125
## 
## $timeElapsed
## [1] 0.293

The output presents:

output$HDV: a matrix containing the High Dimensional Vectors of the given FASTAS. The dimension of the vector corresponds to $20^k$ for amino acids and $4^k$ for nucleotides. $k$ corresponds to the k-mer of the mask used.
output$info aditional information of the process. This object is subdivided in:
- headers: a character containing the list of samples
- mask: a integer containing the mask used
- SequenceType: a character containing the type of the sequence (amino acid: AA, ou nucleotide: NT)
- extension: a character containing the list of extensions considered
- bin: a character containing if binary or counting
- saturation: a vector containing the filled (non-zero) percentage of the HDV for each sample
- timeElapsed: a double containing the elapsed time in seconds

Saturation is important information about the mask. When, for all sequences, the saturation is very high (up to 95%), the mask needs to have a higher k-mer.

3. Additional information

Other important parameters for optimizing the process of SWeeP and SweePlite functions:

extension provide the file extension if necessary, for example extension = c('.faa','.fasta'). To consider all the files in the folder, you can simply ignore the parameter or provide extension='' so that all the files in the folder are considered. [for SweeP and SWeePlite]
lowRAMmode = TRUE reduces memory consumption during processing by loading one sequence at a time from the file. Ideal when you want to analyze a large number of sequences (several FASTAS) or when the files are very large (long genomic FASTAS). [for SweeP and SWeePlite]
nk = 15000 by default. The higher the nk, the greater the memory consumption but the shorter the processing time. A vectorization with a 5-mer mask and nk=15000 can run on a machine with 8Gb of memory, with the use of 2 cores. [only for SWeePlite]
ncores=2 by default. Number of cores to be used by the process. [for SweeP and SWeePlite]

More information on how to use each function is available in the package manual or at ?rSWeeP.

sessionInfo()

## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: LMDE 5 (elsie)
## 
## Matrix products: default
## BLAS:   /usr/local/lib/R/lib/libRblas.so 
## LAPACK: /usr/local/lib/R/lib/libRlapack.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=pt_BR.UTF-8    
##  [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=pt_BR.UTF-8   
##  [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Sao_Paulo
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] patchwork_1.2.0     Seurat_5.0.3        SeuratObject_5.0.1 
##  [4] sp_2.1-4            dplyr_1.1.4         ape_5.8            
##  [7] rSWeeP_2.9          Biostrings_2.72.0   GenomeInfoDb_1.40.0
## [10] XVector_0.44.0      IRanges_2.38.0      S4Vectors_0.42.0   
## [13] BiocGenerics_0.50.0 doParallel_1.0.17   iterators_1.0.14   
## [16] foreach_1.5.2       knitr_1.46         
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.16.0       jsonlite_1.8.8         
##   [4] magrittr_2.0.3          spatstat.utils_3.0-4    rmarkdown_2.26         
##   [7] fs_1.6.4                zlibbioc_1.50.0         vctrs_0.6.5            
##  [10] ROCR_1.0-11             spatstat.explore_3.2-7  memoise_2.0.1          
##  [13] htmltools_0.5.8.1       usethis_2.2.3           sctransform_0.4.1      
##  [16] sass_0.4.9              parallelly_1.37.1       KernSmooth_2.23-22     
##  [19] bslib_0.7.0             htmlwidgets_1.6.4       desc_1.4.3             
##  [22] ica_1.0-3               plyr_1.8.9              plotly_4.10.4          
##  [25] zoo_1.8-12              cachem_1.0.8            igraph_2.0.3           
##  [28] mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3        
##  [31] Matrix_1.7-0            R6_2.5.1                fastmap_1.2.0          
##  [34] GenomeInfoDbData_1.2.12 fitdistrplus_1.1-11     future_1.33.2          
##  [37] shiny_1.8.1.1           digest_0.6.35           colorspace_2.1-0       
##  [40] tensor_1.5              rprojroot_2.0.4         RSpectra_0.16-1        
##  [43] irlba_2.3.5.1           pkgload_1.3.4           progressr_0.14.0       
##  [46] spatstat.sparse_3.0-3   fansi_1.0.6             polyclip_1.10-6        
##  [49] abind_1.4-5             httr_1.4.7              compiler_4.4.0         
##  [52] remotes_2.5.0           withr_3.0.0             fastDummies_1.7.3      
##  [55] pkgbuild_1.4.4          highr_0.10              R.utils_2.12.3         
##  [58] MASS_7.3-60.2           sessioninfo_1.2.2       tools_4.4.0            
##  [61] lmtest_0.9-40           httpuv_1.6.15           future.apply_1.11.2    
##  [64] goftest_1.2-3           R.oo_1.26.0             glue_1.7.0             
##  [67] nlme_3.1-164            promises_1.3.0          grid_4.4.0             
##  [70] Rtsne_0.17              reshape2_1.4.4          cluster_2.1.6          
##  [73] generics_0.1.3          spatstat.data_3.0-4     gtable_0.3.5           
##  [76] R.methodsS3_1.8.2       tidyr_1.3.1             data.table_1.15.4      
##  [79] utf8_1.2.4              spatstat.geom_3.2-9     RcppAnnoy_0.0.22       
##  [82] ggrepel_0.9.5           RANN_2.6.1              pillar_1.9.0           
##  [85] stringr_1.5.1           spam_2.10-0             RcppHNSW_0.6.0         
##  [88] later_1.3.2             splines_4.4.0           lattice_0.22-6         
##  [91] deldir_2.0-4            survival_3.6-4          tidyselect_1.2.1       
##  [94] miniUI_0.1.1.1          pbapply_1.7-2           gridExtra_2.3          
##  [97] scattermore_1.2         xfun_0.43               devtools_2.4.5         
## [100] matrixStats_1.3.0       stringi_1.8.3           UCSC.utils_1.0.0       
## [103] lazyeval_0.2.2          yaml_2.3.8              evaluate_0.23          
## [106] codetools_0.2-20        tibble_3.2.1            cli_3.6.3              
## [109] uwot_0.2.2              xtable_1.8-4            reticulate_1.38.0      
## [112] munsell_0.5.1           jquerylib_0.1.4         Rcpp_1.0.12            
## [115] spatstat.random_3.2-3   globals_0.16.3          png_0.1-8              
## [118] ellipsis_0.3.2          ggplot2_3.5.1           dotCall64_1.1-1        
## [121] profvis_0.3.8           urlchecker_1.0.1        listenv_0.9.1          
## [124] viridisLite_0.4.2       scales_1.3.0            ggridges_0.5.6         
## [127] leiden_0.4.3.1          purrr_1.0.2             crayon_1.5.2           
## [130] rlang_1.1.4             cowplot_1.1.3

References

De Pierri, Camilla Reginatto, Ricardo Voyceik, Letı́cia Graziela Costa Santos de Mattos, Mariane Gonçalves Kulik, Josué Oliveira Camargo, Aryel Marlus Repula de Oliveira, Bruno Thiago de Lima Nichio, et al. 2020. “SWeeP: Representing Large Biological Sequences Datasets in Compact Vectors.” Scientific Reports 10 (1): 1–10. https://doi.org/10.1038/s41598-019-55627-4.

Fernandes, Danrley, Mariane G Kulik, Diogo JS Machado, Jeroniza N Marchaukoski, Fabio O Pedrosa, Camilla R De Pierri, and Roberto T Raittz. 2020. “rSWeeP: AR/Bioconductor Package Deal with SWeeP Sequences Representation.” bioRxiv. https://doi.org/10.1101/2020.09.09.290247.

Perico, Camila P, Camilla R De Pierri, Giuseppe Pasqualato Neto, Danrley R Fernandes, Fabio O Pedrosa, Emanuel M De Souza, and Roberto T Raittz. 2022. “Genomic Landscape of the SARS-CoV-2 Pandemic in Brazil Suggests an External p. 1 Variant Origin.” Frontiers in Microbiology 13: 1037455. https://doi.org/10.3389/fmicb.2022.1037455.

Raittz, Roberto Tadeu, Camilla Reginatto De Pierri, Marta Maluk, Marcelo Bueno Batista, Manuel Carmona, Madan Junghare, Helisson Faoro, et al. 2021. “Comparative Genomics Provides Insights into the Taxonomy of Azoarcus and Reveals Separate Origins of Nif Genes in the Proposed Azoarcus and Aromatoleum Genera.” Genes 12 (1): 71. https://doi.org/10.3390/genes12010071.

Silva Filho, Antonio Camilo da, Jeroniza Nunes Marchaukoski, Roberto Tadeu Raittz, Camilla Reginatto De Pierri, Diogo de Jesus Soares Machado, Cyntia Maria Telles Fadel-Picheth, and Geraldo Picheth. 2021. “Prediction and Analysis in Silico of Genomic Islands in Aeromonas Hydrophila.” Frontiers in Microbiology 12. https://doi.org/10.3389/fmicb.2021.769380.

Proteomes refer to the complete set of CDSs translated into proteins from the complete genome.↩︎

Tutorial on rSWeeP package functions and parametrization

Camila Pereira Perico

2024-05-14

1. Installation

2. SWeeP package

2.1 Indicated parameters

2.1.1. Masks

2.1.2. Length of output vector or Projection size (`psz`)

2.1.3. Normalization (`norm`)

2.1.4. Binary/Counting Format (`bin`)

2.1.5. Output format details

2.1.6. Parameters from literature

2.2. Function `SWeeP`

2.2.1. Data type BString

2.2.2. Data type path (address for folder with FASTAS)

2.2.3. Data type expression matrices for RNAseq

2.2.4. Data type generic matrix

2.3. Function `SWeePlite`

2.3.1. Data type BString

2.3.2. Data type path (address for folder with FASTAS)

2.3.3. Data type expression matrices for RNAseq

2.3.4. Data type generic matrix

2.4. Function `extractHDV`

3. Additional information

References

Tutorial on rSWeeP package functions and parametrization

Camila Pereira Perico

2024-05-14

1. Installation

2. SWeeP package

2.1 Indicated parameters

2.1.1. Masks

2.1.2. Length of output vector or Projection size (psz)

2.1.3. Normalization (norm)

2.1.4. Binary/Counting Format (bin)

2.1.5. Output format details

2.1.6. Parameters from literature

2.2. Function SWeeP

2.2.1. Data type BString

2.2.2. Data type path (address for folder with FASTAS)

2.2.3. Data type expression matrices for RNAseq

2.2.4. Data type generic matrix

2.3. Function SWeePlite

2.3.1. Data type BString

2.3.2. Data type path (address for folder with FASTAS)

2.3.3. Data type expression matrices for RNAseq

2.3.4. Data type generic matrix

2.4. Function extractHDV

3. Additional information

References

2.1.2. Length of output vector or Projection size (`psz`)

2.1.3. Normalization (`norm`)

2.1.4. Binary/Counting Format (`bin`)

2.2. Function `SWeeP`

2.3. Function `SWeePlite`

2.4. Function `extractHDV`