Seq2Vec - DNA sequence vectorization

🛑 A newer, faster, and more complete tool is on the way: https://github.com/anuradhawick/kmertools

This tool is intended for generating data for machine learning tasks in bioinformatics. Seq2Vec converts FASTA or FASTQ data sets into k-mer frequency vectors. It uses memory-mapped files for faster writes and a multi-worker pipeline to vectorise the sequences.
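To make the idea of a k-mer frequency vector concrete, here is a minimal Python sketch. It is not Seq2Vec's implementation: the lexicographic k-mer order, the normalisation to relative frequencies, and the skipping of windows with non-ACGT symbols are all assumptions made for illustration, and the function name `kmer_frequency_vector` is hypothetical.

```python
from itertools import product


def kmer_frequency_vector(sequence: str, k: int = 3) -> list[float]:
    """Return relative k-mer frequencies in lexicographic k-mer order.

    Illustrative only: ordering and normalisation are assumptions,
    not documented Seq2Vec behaviour.
    """
    # Enumerate all 4^k k-mers over the DNA alphabet in a fixed order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)

    sequence = sequence.upper()
    # Slide a window of width k over the sequence.
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        # Windows containing non-ACGT symbols (e.g. N) are skipped.
        if position is not None:
            counts[position] += 1

    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [count / total for count in counts]
```

For example, `kmer_frequency_vector("ACGT", k=2)` yields a 16-element vector in which only the entries for AC, CG, and GT are non-zero.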

Citation

@article{10.1093/bioinformatics/btaa441,
    author = {Wickramarachchi, Anuradha and Mallawaarachchi, Vijini and Rajan, Vaibhav and Lin, Yu},
    title = "{MetaBCC-LR: metagenomics binning by coverage and composition for long reads}",
    journal = {Bioinformatics},
    volume = {36},
    number = {Supplement_1},
    pages = {i3-i11},
    year = {2020},
    month = {07},
    abstract = "{Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyse metagenomic data, binning is considered a crucial step to characterize the different species of micro-organisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this article, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition. We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ∼13\% improvement in F1-score and ∼30\% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read-based metagenomics analyses to support a wide range of applications. The source code is freely available at: https://github.com/anuradhawick/MetaBCC-LR. Supplementary data are available at Bioinformatics online.}",
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa441},
    url = {https://doi.org/10.1093/bioinformatics/btaa441},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/36/Supplement\_1/i3/33488763/btaa441.pdf},
}

Downloading and Compiling

First, clone the repository. The build.sh script automates all the steps required for compilation.

git clone https://github.com/anuradhawick/seq2vec.git
cd seq2vec
./build.sh

Usage

The binary will be available at build/seq2vec. Run it with -h to show the help message:

Seq2Vec fast sequence vectorization:
  -h [ --help ]              show help message
  -f [ --file ] arg          input file path
  -o [ --output ] arg        output vectors path
  -x [ --preset ] arg (=csv) output type, should be one of csv, tsv, or json
  -k [ --k-size ] arg (=3)   set k-mer size
  -t [ --threads ] arg (=8)  set thread count
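
If you drive Seq2Vec from a Python pipeline, the options above can be assembled into a command line as in the following sketch. Only the flags listed in the help message are used. The helper names and the file names `reads.fasta` and `vectors.csv` are hypothetical, and the binary path assumes you run from the repository root after building.

```python
import subprocess

VALID_PRESETS = ("csv", "tsv", "json")


def build_seq2vec_command(input_path, output_path, preset="csv",
                          k_size=3, threads=8, binary="build/seq2vec"):
    """Assemble the argument list for a Seq2Vec run (nothing is executed)."""
    if preset not in VALID_PRESETS:
        raise ValueError(f"preset must be one of {VALID_PRESETS}")
    return [
        binary,
        "-f", str(input_path),   # input FASTA/FASTQ file
        "-o", str(output_path),  # where the vectors are written
        "-x", preset,            # output type
        "-k", str(k_size),       # k-mer size
        "-t", str(threads),      # thread count
    ]


def run_seq2vec(*args, **kwargs):
    """Run Seq2Vec and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_seq2vec_command(*args, **kwargs), check=True)
```

A call such as `run_seq2vec("reads.fasta", "vectors.csv", k_size=4)` would then invoke the compiled binary with the default preset and thread count.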

Output

A text file containing the vectors is written to the path given by the -o argument, in the format selected with -x (csv, tsv, or json).
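
The exact layout of the output file is not described here, so the following Python sketch rests on an assumption: that the csv and tsv presets write one sequence per line, with that sequence's vector as delimiter-separated numbers and no header row. Check a few lines of your own output before relying on it. The function name `read_vectors` is hypothetical.

```python
import csv


def read_vectors(handle, delimiter=","):
    """Parse delimiter-separated numeric rows into a list of float vectors.

    Assumes one vector per line and no header row (an assumption,
    not documented Seq2Vec behaviour).
    """
    vectors = []
    for row in csv.reader(handle, delimiter=delimiter):
        # Ignore empty fields, e.g. from a trailing delimiter.
        values = [float(value) for value in row if value.strip()]
        if values:
            vectors.append(values)
    return vectors
```

Typical use would be `with open("vectors.csv", newline="") as handle: vectors = read_vectors(handle)`, passing `delimiter="\t"` for tsv output.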

Notes

The default k-mer size is 3. It is usually best to keep k under 8, since the number of possible DNA k-mers grows as 4^k.
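
As a rough guide to why small k values are practical, this Python sketch tabulates the number of distinct DNA k-mers for each k. Treat 4^k as an upper bound on the vector length: whether Seq2Vec merges reverse-complement k-mers (which would shorten the vector) is not stated here. The function name is hypothetical.

```python
def max_vector_length(k: int) -> int:
    """Number of distinct k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


if __name__ == "__main__":
    # Print the upper bound on vector length for k = 1..8.
    for k in range(1, 9):
        print(f"k={k}: up to {max_vector_length(k)} dimensions")
```

At the default k = 3 this gives 64 dimensions, while k = 8 already allows up to 65536 values per sequence.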

Have a good one! 😃
