CSSSCL: a taxonomic classifier for DNA sequences.

Project Leader Vincent Ferretti

Author Ivan Borozan

About

CSSSCL is a python package that uses Combined Sequence Similarity Scores for accurate taxonomic CLassification of long and short reads.

Tested environments

Distributor ID: Debian/Ubuntu
Description: Debian GNU/Linux 8.1 (jessie) / Ubuntu 12.04.3 LTS / 14.04.1 / 19.04
Release: 8.1 64-bit / 12.04 64-bit / 14.04 64-bit
Codename: jessie / precise / trusty

Python = 2.7.9 (biopython==1.67, Cython==0.29.10, numpy==1.9.2, pymongo==2.9.5, pysam==0.15.2, python-dateutil==2.5.3, scikit-learn==0.17.1, scipy==0.15.1, six==1.12.0)

We have set up three ways to install the cssscl package:

Getting started

Installation Guide

Option A - Installing `cssscl` using the Python's Virtual Environment

We recommend installing the `cssscl` package with Python's Virtual Environment tool. This keeps the dependencies required by `cssscl` in a separate directory and keeps your global python dist- or site-packages directory clean and manageable, as shown below:
Note: if jellyfish, BLAST or plzip is already installed on your system, make sure it is in your executable search path (i.e. the PATH variable), as shown in the examples below:
BLAST

# e.g. PATH_TO_YOUR_BLAST=/home/user_x/blast/ncbi-blast-2.2.30+/bin
$ export PATH=$PATH:PATH_TO_YOUR_BLAST
jellyfish

# e.g. PATH_TO_YOUR_jellyfish=/home/user_x/jellyfish-1.1.12/bin
$ export PATH=$PATH:PATH_TO_YOUR_jellyfish
plzip

# e.g. PATH_TO_YOUR_plzip=/home/user_x/plzip-1.1/plzip
$ export PATH=$PATH:PATH_TO_YOUR_plzip
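As a quick sanity check after exporting the paths above, the following sketch looks for the three tools on the search path. It is only an illustration: the executable names `blastn`, `jellyfish` and `plzip` are assumptions about how the binaries are called on your system, and the helper names are hypothetical.

```python
import os

# Executable names are assumptions: blastn ships with BLAST+, while
# jellyfish and plzip are the usual binary names of those tools.
REQUIRED_TOOLS = ("blastn", "jellyfish", "plzip")


def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(tools=REQUIRED_TOOLS, path=None):
    """List the tools that cannot be found on the search path."""
    return [tool for tool in tools if find_on_path(tool, path) is None]


if __name__ == "__main__":
    for tool in missing_tools():
        print("not on PATH: " + tool)
```

An empty result from `missing_tools()` suggests that all three tools are visible to the shell that will run `cssscl`.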
Step 1. Install dependencies on Debian and Ubuntu
In order to compile `cssscl` on Debian GNU/Linux 8.1 and Ubuntu 12.04 LTS the following packages need to be installed:

$ sudo apt-get update
$ sudo apt-get install build-essential g++ libxml2-dev libxslt-dev gfortran libopenblas-dev liblapack-dev
Step 2. Download the `cssscl` package

# use wget
$ wget --no-check-certificate https://github.com/oicr-ibc/cssscl/archive/master.tar.gz
$ tar -zxvf master.tar.gz; mv cssscl-master cssscl
Or use git clone (note that `sudo apt-get install git` is required for git access):

# use git clone
$ git clone https://github.com/oicr-ibc/cssscl.git
Step 3. Check that all packages necessary to run `cssscl` are installed and available by running the `cssscl_check_pre_installation.sh` script (only for Ubuntu/Debian distributions).

$ cd cssscl
$ ./cssscl_check_pre_installation.sh
Note: when `source cssscl/scripts/export.sh` is shown on the screen, follow the instructions to export.
Note: for more information regarding the `cssscl_check_pre_installation.sh` script see here.
Step 4. In the `cssscl` directory, create a virtual environment (e.g. name it `csssclvenv`)

$ virtualenv csssclvenv
Step 5. To begin using the virtual environment, it first needs to be activated as shown below:

$ source csssclvenv/bin/activate
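To confirm that the activation took effect before installing anything, a small check like the one below can be run with the `python` found on your path. It is a sketch, not part of `cssscl`; the function name is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # virtualenv on Python 2 sets sys.real_prefix; venv and newer virtualenv
    # releases make sys.base_prefix differ from sys.prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("virtual environment active: %s" % in_virtualenv())
```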
Step 6. Install `cssscl` as root

$ sudo pip install .
Note: this will install all the python modules necessary for running the `cssscl` package in the `cssscl/csssclvenv/` directory.
Step 7. Configure `cssscl`

$ cssscl configure
Accept all the default values by pressing [ENTER] at each prompt.
Note: If you are done working in the virtual environment, you can deactivate it as shown below.

$ deactivate
If you would like to run the `cssscl` program again (and you have deactivated the python virtual environment) you will need to activate it again as shown above.
Option B - Install `cssscl` without using the Python's Virtual Environment

Install the `cssscl` package directly to your python global dist- or site-packages directory as shown below (CAUTION: some of the python packages on your system might be updated if required by the `cssscl` package):
Note: if jellyfish, BLAST or plzip is already installed on your system, make sure it is in your executable search path (i.e. the PATH variable), as shown in the examples below:
BLAST

# e.g. PATH_TO_YOUR_BLAST=/home/user_x/blast/ncbi-blast-2.2.30+/bin
$ export PATH=$PATH:PATH_TO_YOUR_BLAST
jellyfish

# e.g. PATH_TO_YOUR_jellyfish=/home/user_x/jellyfish-1.1.12/bin
$ export PATH=$PATH:PATH_TO_YOUR_jellyfish
plzip

# e.g. PATH_TO_YOUR_plzip=/home/user_x/plzip-1.1/plzip
$ export PATH=$PATH:PATH_TO_YOUR_plzip
Step 1. Install dependencies on Debian and Ubuntu
Python: Only Python 2.7.3+ is supported. No support for Python 3 at the moment.
In order to compile `cssscl` on Debian GNU/Linux 8.1 and Ubuntu 12.04 LTS the following packages need to be installed:

$ sudo apt-get update
$ sudo apt-get install build-essential python2.7 python2.7-dev g++ libxml2-dev libxslt-dev gfortran libopenblas-dev liblapack-dev
Step 2. Download the `cssscl` package

# use wget
$ wget --no-check-certificate https://github.com/oicr-ibc/cssscl/archive/master.tar.gz
$ tar -zxvf master.tar.gz; mv cssscl-master cssscl
Or use git clone (note that `sudo apt-get install git` is required for git access):

# use git clone
$ git clone https://github.com/oicr-ibc/cssscl.git
Step 3. Check that all packages necessary to run `cssscl` are installed and available by running the `cssscl_check_pre_installation.sh` script (only for Ubuntu/Debian distributions).

$ cd cssscl
$ ./cssscl_check_pre_installation.sh
Note: when `source cssscl/scripts/export.sh` is shown on the screen, follow the instructions to export.
Note: for more information regarding the `cssscl_check_pre_installation.sh` script please see here.
Step 4. Install `cssscl` as root

$ sudo pip install .
Step 5. Configure `cssscl`

$ cssscl configure
Accept all the default values by pressing [ENTER] at each prompt.
Additional instructions for non-automated installation of third party software necessary for running the `cssscl` package

If the cssscl_check_pre_installation.sh script (see the installation subsections above) fails, use the information below to install the individual third-party software manually:
Necessary Python modules:
- BioPython - Tools for biological computation.
- PyMongo - Python module needed for working with MongoDB (PyMongo = 2.8)
- Sklearn - Machine Learning in Python
- Numpy - NumPy is the fundamental package for scientific computing with Python
- Cython - Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex)
- SciPy - SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering
Installing python modules using pip manually:

$ pip install cython==0.29.10
$ pip install numpy==1.9.2
$ pip install pymongo==2.9.5
$ pip install biopython==1.67
$ pip install scikit-learn==0.17.1
$ pip install scipy==0.15.1
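After a manual installation it can help to compare what is installed against the pinned versions listed above. The sketch below parses `name==version` lines (the same shape that `pip freeze` prints for ordinary packages) and reports differences. The function names are hypothetical, and the `installed` dictionary is something you would supply yourself, for example by passing the output of `pip freeze` through `parse_pins`.

```python
# Pinned versions copied from the pip commands above.
PINNED = """
cython==0.29.10
numpy==1.9.2
pymongo==2.9.5
biopython==1.67
scikit-learn==0.17.1
scipy==0.15.1
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dictionary."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not a plain pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def mismatches(pins, installed):
    """Return {name: (wanted, found)} for modules that differ or are absent."""
    report = {}
    for name, wanted in sorted(pins.items()):
        found = installed.get(name)
        if found != wanted:
            report[name] = (wanted, found)
    return report
```

An empty dictionary from `mismatches(parse_pins(PINNED), installed)` means every pinned module is present at the listed version; a `None` in the second position marks a module that was not found.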
Third party software:
BLAST (version 2.2.30+ and higher)
Basic Local Alignment Search Tool.
http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download
JELLYFISH (version 1.1.+ but not 2.0.+)
JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA.
http://www.cbcb.umd.edu/software/jellyfish/
PLZIP (version 1.1+)
Plzip is a massively parallel (multi-threaded) lossless data compressor based on the lzlib compression library, with a user interface similar to the one of lzip, bzip2 or gzip.
http://download.savannah.gnu.org/releases/lzip/plzip/
Note that the classification results in the paper were obtained using Plzip 1.1 with Lzlib 1.5.
To compile Plzip 1.1 and Lzlib 1.5:
Step 1. Download lzlib-1.5.tar.gz

$ wget --no-check-certificate http://download.savannah.gnu.org/releases/lzip/lzlib/lzlib-1.5.tar.gz
Step 2. Install lzlib-1.5:

$ gunzip lzlib-1.5.tar.gz
$ tar -xvf lzlib-1.5.tar
$ cd lzlib-1.5
$ ./configure
$ make
$ make install
Step 3. Download Plzip 1.1

$ wget --no-check-certificate http://download.savannah.gnu.org/releases/lzip/plzip/plzip-1.1.tar.gz
Step 4. Install Plzip

$ gunzip plzip-1.1.tar.gz
$ tar -xvf plzip-1.1.tar
$ cd plzip-1.1
$ ./configure
$ make
$ make install
For more information about plzip consult:
http://www.nongnu.org/lzip/manual/plzip_manual.html
and for memory required to compress and decompress:
http://www.nongnu.org/lzip/manual/plzip_manual.html#Memory-requirements
Make sure that JELLYFISH, BLAST and Plzip are in your executable search path (see the examples below):

# for example
$ export PATH=$PATH:PATH_TO_BLAST/blast/ncbi-blast-2.2.30+/bin
$ export PATH=$PATH:PATH_TO_jellyfish/jellyfish-1.1.12/bin
$ export PATH=$PATH:PATH_TO_plzip/plzip-1.1/plzip
Install MongoDB
MongoDB should be installed using the following set of instructions:
Ubuntu 12.04.3 LTS /14.04.1
First add the 10gen GPG key, the public gpg key used for signing these packages. It should be possible to import the key into apt's public keyring with a command like this:

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
Add this line verbatim to your `/etc/apt/sources.list`:

deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen
In order to complete the installation of the packages, you need to update the sources and then install the desired package

$ sudo apt-get update
$ sudo apt-get install mongodb-10gen=2.4.14
Ubuntu 19.04

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
$ echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
$ sudo apt update
$ sudo apt-get install -y mongodb-org
Start mongo service
$ sudo service mongod start
Debian

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
$ echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | tee -a /etc/apt/sources.list
$ apt-get update
$ apt-get install mongodb-10gen=2.4.14
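Once MongoDB is installed and started, one rough way to see whether a server is listening is to try a TCP connection. The sketch below assumes MongoDB's default port 27017 on localhost; it only shows that something accepts connections on that port, not that it is MongoDB or that `cssscl` can use it. The function name is hypothetical.

```python
import socket

# 27017 is MongoDB's default port; adjust if your mongod listens elsewhere.
MONGODB_DEFAULT_PORT = 27017


def port_is_open(host="localhost", port=MONGODB_DEFAULT_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    connection.close()
    return True
```

Calling `port_is_open()` with no arguments checks the default local server.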
Uninstall `cssscl`

Note: this will only work if you installed cssscl with the cmd `sudo pip install .` as shown in the Installation section above.

$ cd cssscl/
$ ./cssscl_uninstall.sh
User Guide

Download taxon and test data

Download taxon data:

https://drive.google.com/open?id=1okbaJkv6IgvWf8R1A97CX9lq10wV_INY

$ tar -zxvf taxon.tar.gz

Download test/train data:

https://drive.google.com/open?id=1glzuBJAqf5MPuO5_ivaFFLnZxpjnamFc

$ tar -zxvf test_data.tar.gz

Example 1 - run the `cssscl` classifier without the optimization using the taxon data and the test set provided

Step 1. Build the necessary databases from the training set

$ cssscl build_dbs -btax -c -blast -nt 2 PATH_TO/test_data/TRAIN.fa PATH_TO/taxon/

(the whole process should take ~ 37 min using 2 CPUs)

By default all databases are written to the directory where TRAIN.fa resides (note that all paths in the examples above are absolute/full paths to the files/directories). The above command builds three databases (the blast, compression and kmer databases) for the sequences in the training set.

The cssscl's build_dbs module requires two positional arguments to be provided:

1. a file in the fasta format (e.g. TRAIN.fa as in the example above) that specifies the collection of reference genomes composing the training set.

2. a directory (taxon/ in the example above) that specifies the location where the taxon data is stored (more specifically the directory should contain the following files: gi_taxid_nucl.dmp, names.dmp and nodes.dmp, these files can be downloaded from the NCBI taxonomy database at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/).
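Before starting a long database build, it may be worth checking that the taxon directory holds the three files named above. A minimal sketch, with a hypothetical function name:

```python
import os

# File names as listed in the text above.
TAXON_FILES = ("gi_taxid_nucl.dmp", "names.dmp", "nodes.dmp")


def missing_taxon_files(taxon_dir):
    """Return the expected taxonomy files that are absent from taxon_dir."""
    return [name for name in TAXON_FILES
            if not os.path.isfile(os.path.join(taxon_dir, name))]
```

For example, `missing_taxon_files("PATH_TO/taxon/")` returns an empty list when all three files are in place.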

The information about the additional optional arguments used in the command line above is provided here.

For more information please consult the cssscl's build_dbs help page by typing:

$ cssscl build_dbs --help

Step 2. Perform the classification of the sequences in the test set

# use cssscl to classify sequences in TEST.fa 
$ cssscl classify -c -blast blastn -tax genus -nt 2 PATH_TO/test_data/test/TEST.fa PATH_TO/test_data/

(the whole process should take ~ 29 min using 2 CPUs)

Note that in the above example the output file cssscl_results_genus.txt with classification results will be located in the directory where the TEST.fa resides.

Note: For the test data provided above, the values of the parameters used in the model have already been optimized and are included as part of the test set (see the optimum_kmer directory in the test_set/ directory provided). The optimization therefore does not need to be performed before running the classifier on the test dataset. To run the classifier with the optimization stage first, see Example 2 below.

The cssscl's classify module requires two positional arguments to be provided:

1. a file with test data with sequences in the FASTA format for classification (e.g. TEST.fa as in the example above)

2. a directory where the databases (built using the training set) reside

Note: This will run the classifier with all the similarity measures (including the compression and the blast measure) as described in: Borozan I et al. "Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification." Bioinformatics. 2015 Jan 7. pii: btv006.

The information about the additional optional arguments used in the command line above is provided here.

For more information please consult the cssscl's classify help page by typing

$ cssscl classify --help

Example 2 - perform the classification by optimizing the `cssscl's` parameter values first

Step 1. Build the necessary databases from the training set

Note: Only do this if you have not already built the databases in Example 1 above.

$ cssscl build_dbs -btax -c -blast -nt 2 PATH_TO/test_data/TRAIN.fa PATH_TO/taxon/

(the whole process should take ~ 37 min using 2 CPUs)

Step 2. Perform the classification of the sequences in the test set by optimizing the cssscl's parameter values first

$ cssscl classify -c -blast blastn -opt -tax genus -nt 8 PATH_TO/test_data/test/TEST.fa PATH_TO/test_data/

More information about the optimization can be found here.

Note that the optimization phase will take considerably longer when the -c (compression) argument is used, as explained in the section "Note regarding the compression measure" below.

The information about the additional optional arguments used in the command line above is provided here.
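When the classifier is driven from a script, the command lines from Example 1 and Example 2 can be assembled programmatically. The sketch below uses only the flags that appear in those examples; the function name and its keyword arguments are hypothetical, and whether each flag can be left out independently is an assumption to verify with `cssscl classify --help`.

```python
def classify_command(test_fasta, db_dir, tax="genus", threads=2,
                     compression=True, blast="blastn", optimize=False):
    """Assemble the argument list for `cssscl classify`."""
    command = ["cssscl", "classify"]
    if compression:
        command.append("-c")            # compression measure
    if blast:
        command.extend(["-blast", blast])
    if optimize:
        command.append("-opt")          # optimize parameter values first
    command.extend(["-tax", tax, "-nt", str(threads), test_fasta, db_dir])
    return command
```

The returned list can be passed to `subprocess.call`; with `threads=8` and `optimize=True` it reproduces the arguments of the Example 2 command.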

Note regarding the compression measure

Using the compression measure considerably slows down both the optimization and the classification, because the running time grows as ~ O(n*n) for the optimization phase and ~ O(n*m) for the classification phase, where n and m are the numbers of sequences in the training and test sets, respectively. The compression measure should therefore only be used with smaller genome databases (e.g. viruses) and/or with smaller datasets (i.e. a smaller number of reads/contigs to classify).
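To get a feel for these growth rates on your own data, the sketch below counts the sequences in two FASTA files and multiplies the counts. It gives only the number of sequence pairs implied by the complexities quoted above, not a running-time prediction; the function names are hypothetical.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


def compression_workload(n_train, n_test):
    """Rough number of sequence pairs implied by the O(n*n) and O(n*m) costs."""
    return {"optimization": n_train * n_train,
            "classification": n_train * n_test}
```

For instance, `compression_workload(count_fasta_records("TRAIN.fa"), count_fasta_records("TEST.fa"))` shows how quickly the pair counts grow as either set gets larger.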

License and Copyright

Licensed under the GNU General Public License, Version 3.0. See LICENSE for more details.

Acknowledgement

This project is supported by the Ontario Institute for Cancer Research (OICR) through funding provided by the government of Ontario, Canada.
