CSSSCL: a taxonomic classifier for DNA sequences.

Project Leader Vincent Ferretti

Author Ivan Borozan

About

CSSSCL is a Python package that uses Combined Sequence Similarity Scores for accurate taxonomic CLassification of long and short reads.

Tested environments

Distributor ID: Debian/Ubuntu
Description: Debian GNU/Linux 8.1 (jessie) / Ubuntu 12.04.3 LTS / 14.04.1 / 19.04
Release: 8.1 64-bit / 12.04 64-bit / 14.04 64-bit
Codename: jessie / precise / trusty

Python = 2.7.9 (biopython==1.67, Cython==0.29.10, numpy==1.9.2, pymongo==2.9.5, pysam==0.15.2, python-dateutil==2.5.3, scikit-learn==0.17.1, scipy==0.15.1, six==1.12.0)

The cssscl package can be installed in the following ways:

  1. Quick deployment using Docker (small file).
  2. System wide installation from the source code (see the Installation Guide below).

Getting started

Installation Guide
Option A - Installing cssscl using the Python's Virtual Environment

We recommend installing the cssscl package with Python's Virtual Environment tool. This keeps the dependencies required by cssscl in a separate directory and keeps your global Python dist- or site-packages directory clean and manageable, as shown below:

Note: if jellyfish, BLAST or plzip is already installed on your system, make sure it is in your executable search path (i.e. the PATH variable), as shown in the examples below:

BLAST

# e.g. PATH_TO_YOUR_BLAST=/home/user_x/blast/ncbi-blast-2.2.30+/bin

$ export PATH=$PATH:PATH_TO_YOUR_BLAST

jellyfish

# e.g. PATH_TO_YOUR_jellyfish=/home/user_x/jellyfish-1.1.12/bin

$ export PATH=$PATH:PATH_TO_YOUR_jellyfish

plzip

# e.g. PATH_TO_YOUR_plzip=/home/user_x/plzip-1.1/plzip

$ export PATH=$PATH:PATH_TO_YOUR_plzip
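
Before continuing, it can help to confirm that the three tools are actually visible on the search path. The following is a minimal sketch, not part of the cssscl package: the helper name check_tools is hypothetical, and the executable names (blastn for BLAST, jellyfish, plzip) are assumptions about how the tools are installed on your system.

```shell
# Report which external tools are visible on PATH.
# check_tools is a hypothetical helper, not shipped with cssscl.
# Returns the number of tools that could not be found (0 means all found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Usage (executable names are assumptions):
#   check_tools blastn jellyfish plzip
```

Any tool reported as missing needs either to be installed or to have its directory appended to PATH as in the export lines above.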

Step 1. Install dependencies on Debian and Ubuntu

In order to compile cssscl on Debian GNU/Linux 8.1 and Ubuntu 12.04 LTS the following packages need to be installed:

$ sudo apt-get update

$ sudo apt-get install build-essential g++ libxml2-dev libxslt-dev gfortran libopenblas-dev liblapack-dev

Step 2. Download the cssscl package

# use wget

$ wget --no-check-certificate https://github.com/oicr-ibc/cssscl/archive/master.tar.gz

$ tar -zxvf master.tar.gz; mv cssscl-master cssscl

Alternatively, use git clone (git must be installed first, e.g. with sudo apt-get install git):

# use git clone

$ git clone https://github.com/oicr-ibc/cssscl.git

Step 3. Check that all packages necessary to run cssscl are installed and available by running the cssscl_check_pre_installation.sh script (only for Ubuntu/Debian distributions).

$ cd cssscl

$ ./cssscl_check_pre_installation.sh

Note: when prompted, follow the on-screen instructions to export the required paths; this applies when source cssscl/scripts/export.sh is shown on the screen.

Note: for more information regarding the cssscl_check_pre_installation.sh script see here.

Step 4. In the cssscl directory create a virtual environment (e.g. name it csssclvenv)

$ virtualenv csssclvenv

Step 5. To begin using the virtual environment, it first needs to be activated as shown below:

$ source csssclvenv/bin/activate

Step 6. Install cssscl as root

$ sudo pip install .

Note: this will install all the python modules necessary for running the cssscl package in the cssscl/csssclvenv/ directory.

Step 7. Configure cssscl

$ cssscl configure

Accept all the default values by pressing [ENTER] at each prompt.

Note: If you are done working in the virtual environment, you can deactivate it as shown below.

$ deactivate

If you would like to run the cssscl program again (and you have deactivated the python virtual environment) you will need to activate it again as shown above.
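
A quick way to tell whether the virtual environment is currently active is to look at the VIRTUAL_ENV variable, which the activate script exports and deactivate unsets. The sketch below assumes the environment is named csssclvenv as in Step 4; the function name in_virtualenv is hypothetical.

```shell
# Succeeds only when a virtual environment is active.
# Relies on VIRTUAL_ENV, which the activate script exports.
in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_virtualenv; then
    echo "virtual environment active: $VIRTUAL_ENV"
else
    echo "no virtual environment active; run: source csssclvenv/bin/activate"
fi
```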

Option B - Install cssscl without using the Python's Virtual Environment

Install the cssscl package directly to your python global dist- or site-packages directory as shown below (CAUTION: some of the python packages on your system might be updated if required by the cssscl package):

Note: if jellyfish, BLAST or plzip is already installed on your system, make sure it is in your executable search path (i.e. the PATH variable), as shown in the examples below:

BLAST

# e.g. PATH_TO_YOUR_BLAST=/home/user_x/blast/ncbi-blast-2.2.30+/bin

$ export PATH=$PATH:PATH_TO_YOUR_BLAST

jellyfish

# e.g. PATH_TO_YOUR_jellyfish=/home/user_x/jellyfish-1.1.12/bin

$ export PATH=$PATH:PATH_TO_YOUR_jellyfish

plzip

# e.g. PATH_TO_YOUR_plzip=/home/user_x/plzip-1.1/plzip

$ export PATH=$PATH:PATH_TO_YOUR_plzip

Step 1. Install dependencies on Debian and Ubuntu

Python: Only Python 2.7.3+ is supported. No support for Python 3 at the moment.
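
The version requirement can be checked from the shell before installing anything. This is an illustrative sketch only: the function python_version_ok is hypothetical and simply tests a version string against the "2.7.3 or a later 2.7.x" rule stated above.

```shell
# Succeeds when the given version string is Python 2.7.3 or a later 2.7.x.
# python_version_ok is a hypothetical helper, not part of cssscl.
python_version_ok() {
    case "$1" in
        2.7.*) patch=${1#2.7.} ;;
        *) return 1 ;;
    esac
    # Keep only the leading digits of the patch level (e.g. "9" from "9rc1").
    patch=${patch%%[!0-9]*}
    [ -n "$patch" ] && [ "$patch" -ge 3 ]
}

# Usage, assuming "python" is the interpreter you intend to install into:
#   python_version_ok "$(python -c 'import platform; print(platform.python_version())')" \
#       && echo "supported" || echo "unsupported"
```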

In order to compile cssscl on Debian GNU/Linux 8.1 and Ubuntu 12.04 LTS the following packages need to be installed:

$ sudo apt-get update

$ sudo apt-get install build-essential python2.7 python2.7-dev g++ libxml2-dev libxslt-dev gfortran libopenblas-dev liblapack-dev

Step 2. Download the cssscl package

# use wget

$ wget --no-check-certificate https://github.com/oicr-ibc/cssscl/archive/master.tar.gz

$ tar -zxvf master.tar.gz; mv cssscl-master cssscl

Alternatively, use git clone (git must be installed first, e.g. with sudo apt-get install git):

# use git clone

$ git clone https://github.com/oicr-ibc/cssscl.git

Step 3. Check that all packages necessary to run cssscl are installed and available by running the cssscl_check_pre_installation.sh script (only for Ubuntu/Debian distributions).

$ cd cssscl

$ ./cssscl_check_pre_installation.sh

Note: when prompted, follow the on-screen instructions to export the required paths; this applies when source cssscl/scripts/export.sh is shown on the screen.

Note: for more information regarding the cssscl_check_pre_installation.sh script please see here.

Step 4. Install cssscl as root

$ sudo pip install .

Step 5. Configure cssscl

$ cssscl configure

Accept all the default values by pressing [ENTER] at each prompt.

Additional instructions for non-automated installation of third party software necessary for running the cssscl package

If the cssscl_check_pre_installation.sh script (see the installation subsections above) fails, follow the instructions below to install each piece of third party software manually:

Necessary Python modules:

- BioPython - Tools for biological computation.
- PyMongo - Python module needed for working with MongoDB (PyMongo = 2.8)
- Sklearn - Machine Learning in Python
- Numpy - NumPy is the fundamental package for scientific computing with Python
- Cython - Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex)
- SciPy - SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering.

Installing python modules using pip manually:

$ pip install cython==0.29.10

$ pip install numpy==1.9.2

$ pip install pymongo==2.9.5

$ pip install biopython==1.67

$ pip install scikit-learn==0.17.1

$ pip install scipy==0.15.1
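
The same pinned installs can be run as a loop that stops at the first failure, which makes it easier to see which module did not build. This is a sketch under the assumption that pip refers to the Python 2.7 installation you are targeting; the function names cssscl_pins and install_pins are hypothetical.

```shell
# Pinned versions, in the same order as the manual pip commands above.
cssscl_pins() {
    cat <<'EOF'
cython==0.29.10
numpy==1.9.2
pymongo==2.9.5
biopython==1.67
scikit-learn==0.17.1
scipy==0.15.1
EOF
}

# Install each pin in order; a failed install aborts the loop and
# makes install_pins return a non-zero status.
install_pins() {
    cssscl_pins | while read -r pin; do
        echo "installing $pin"
        pip install "$pin" </dev/null || exit 1
    done
}

# Usage:
#   install_pins
```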

Third party software:

BLAST (version 2.2.30+ and higher)
Basic Local Alignment Search Tool.

http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download

JELLYFISH (version 1.1.+ but not 2.0.+)
JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA.

http://www.cbcb.umd.edu/software/jellyfish/

PLZIP (version 1.1+)
Plzip is a massively parallel (multi-threaded) lossless data compressor based on the lzlib compression library, with a user interface similar to the one of lzip, bzip2 or gzip.

http://download.savannah.gnu.org/releases/lzip/plzip/

Note that the classification results in the paper were obtained using Plzip 1.1 with Lzlib 1.5.

To compile Plzip 1.1 and Lzlib 1.5:

Step 1. Download lzlib-1.5.tar.gz

$ wget --no-check-certificate http://download.savannah.gnu.org/releases/lzip/lzlib/lzlib-1.5.tar.gz

Step 2. Install lzlib-1.5:

$ gunzip lzlib-1.5.tar.gz

$ tar -xvf lzlib-1.5.tar

$ cd lzlib-1.5

$ ./configure

$ make

$ make install

Step 3. Download Plzip 1.1

$ wget --no-check-certificate http://download.savannah.gnu.org/releases/lzip/plzip/plzip-1.1.tar.gz

Step 4. Install Plzip

$ gunzip plzip-1.1.tar.gz

$ tar -xvf plzip-1.1.tar

$ cd plzip-1.1

$ ./configure

$ make

$ make install

For more information about plzip consult:

http://www.nongnu.org/lzip/manual/plzip_manual.html

and for memory required to compress and decompress:

http://www.nongnu.org/lzip/manual/plzip_manual.html#Memory-requirements

Make sure that JELLYFISH, BLAST and Plzip are in your executable search path (see the examples below):

# for example

$ export PATH=$PATH:PATH_TO_BLAST/blast/ncbi-blast-2.2.30+/bin

$ export PATH=$PATH:PATH_TO_jellyfish/jellyfish-1.1.12/bin

$ export PATH=$PATH:PATH_TO_plzip/plzip-1.1/plzip

Install MongoDB

MongoDB should be installed using the following set of instructions:

Ubuntu 12.04.3 LTS /14.04.1

First add the 10gen GPG key, the public gpg key used for signing these packages. It should be possible to import the key into apt's public keyring with a command like this:

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10

Add this line verbatim to your /etc/apt/sources.list:

deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen

In order to complete the installation of the packages, you need to update the sources and then install the desired package

$ sudo apt-get update

$ sudo apt-get install mongodb-10gen=2.4.14

Ubuntu 19.04

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4

$ echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list

$ sudo apt update

$ sudo apt-get install -y mongodb-org

Start mongo service

$ sudo service mongod start

Debian

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10

$ echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | tee -a /etc/apt/sources.list

$ apt-get update

$ apt-get install mongodb-10gen=2.4.14
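
Once MongoDB is installed and the service is running, a simple ping confirms that the server is reachable. The sketch below assumes the mongo shell is on your PATH and that the server listens on the default localhost port; the function name mongo_ping is hypothetical.

```shell
# Ping the local MongoDB server through the mongo shell.
# Assumes the default host and port; mongo_ping is a hypothetical helper.
mongo_ping() {
    if ! command -v mongo >/dev/null 2>&1; then
        echo "mongo shell not found on PATH" >&2
        return 127
    fi
    mongo --quiet --eval 'printjson(db.runCommand({ ping: 1 }))'
}

# Usage:
#   mongo_ping && echo "MongoDB is reachable"
```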

Uninstall cssscl

Note: this will only work if you installed cssscl with the command sudo pip install . as shown in the Installation section above.

$ cd cssscl/

$ ./cssscl_uninstall.sh

User Guide

Download taxon and test data

Download taxon data:

https://drive.google.com/open?id=1okbaJkv6IgvWf8R1A97CX9lq10wV_INY

$ tar -zxvf taxon.tar.gz

Download test/train data:

https://drive.google.com/open?id=1glzuBJAqf5MPuO5_ivaFFLnZxpjnamFc

$ tar -zxvf test_data.tar.gz

Example 1 - run the cssscl classifier without the optimization using the taxon data and the test set provided

Step 1. Build the necessary databases from the training set

$ cssscl build_dbs -btax -c -blast -nt 2 PATH_TO/test_data/TRAIN.fa PATH_TO/taxon/

(the whole process should take ~ 37 min using 2 CPUs)

By default, all databases are written to the directory where TRAIN.fa resides (note that all paths in the examples above use absolute/full paths to the files/directories). The above command builds three databases (BLAST, compression and k-mer) for the sequences in the training set.

The cssscl's build_dbs module requires two positional arguments to be provided:

1. a file in the fasta format (e.g. TRAIN.fa as in the example above) that specifies the collection of reference genomes composing the training set.

2. a directory (taxon/ in the example above) that specifies the location where the taxon data is stored (more specifically the directory should contain the following files: gi_taxid_nucl.dmp, names.dmp and nodes.dmp, these files can be downloaded from the NCBI taxonomy database at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/).
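
Since build_dbs expects the three taxonomy dump files to be present, it can be worth checking the taxon directory before starting a run that takes tens of minutes. The following is a minimal sketch; the function name check_taxon_dir is hypothetical, and only the three file names listed above are assumed.

```shell
# Verify that a taxon directory holds the three NCBI taxonomy dump files.
# Prints each missing file to stderr; returns non-zero if any is absent.
check_taxon_dir() {
    dir=$1
    status=0
    for f in gi_taxid_nucl.dmp names.dmp nodes.dmp; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_taxon_dir PATH_TO/taxon && echo "taxon directory looks complete"
```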

The information about the additional optional arguments used in the command line above is provided here.

For more information please consult the cssscl's build_dbs help page by typing:

$ cssscl build_dbs --help

Step 2. Perform the classification of the sequences in the test set

# use cssscl to classify sequences in TEST.fa 
$ cssscl classify -c -blast blastn -tax genus -nt 2 PATH_TO/test_data/test/TEST.fa PATH_TO/test_data/

(the whole process should take ~ 29 min using 2 CPUs)

Note that in the above example the output file cssscl_results_genus.txt with classification results will be located in the directory where the TEST.fa resides.

Note: For the test data provided above, the values of the model parameters have already been optimized and are included with the test set (see the optimum_kmer directory in the test_set/ directory provided). Optimization is therefore not required before running the classifier on this dataset. To run the classifier with the optimization stage first, see Example 2 below.

The cssscl's classify module requires two positional arguments to be provided:

1. a file with test data with sequences in the FASTA format for classification (e.g. TEST.fa as in the example above)

2. a directory where the databases (built using the training set) reside

Note: This will run the classifier with all the similarity measures (including the compression and the blast measure) as described in: Borozan I et al. "Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification." Bioinformatics. 2015 Jan 7. pii: btv006.

The information about the additional optional arguments used in the command line above is provided here.

For more information please consult the cssscl's classify help page by typing

$ cssscl classify --help 
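
The two steps of Example 1 can be chained in a small wrapper so that classification only starts if the database build succeeds. This is a sketch, not part of cssscl: the function names run and example1 and the DRY_RUN variable are hypothetical, and the cssscl arguments are copied unchanged from the commands above.

```shell
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build the databases, then classify the test set (Example 1).
example1() {
    data_dir=$1       # directory holding TRAIN.fa and test/TEST.fa
    taxon_dir=$2      # directory holding the taxon dump files
    threads=${3:-2}   # value passed to -nt
    run cssscl build_dbs -btax -c -blast -nt "$threads" \
        "$data_dir/TRAIN.fa" "$taxon_dir/" &&
    run cssscl classify -c -blast blastn -tax genus -nt "$threads" \
        "$data_dir/test/TEST.fa" "$data_dir/"
}

# Usage (absolute paths, as in the examples above):
#   DRY_RUN=1 example1 /abs/path/to/test_data /abs/path/to/taxon
```

Running it first with DRY_RUN=1 prints the exact commands without executing them, which is a cheap way to check the paths.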

Example 2 - perform the classification by optimizing the cssscl's parameter values first

Step 1. Build the necessary databases from the training set

Note: Only do this if you have not already built the databases in Example 1 above.

$ cssscl build_dbs -btax -c -blast -nt 2 PATH_TO/test_data/TRAIN.fa PATH_TO/taxon/

(the whole process should take ~ 37 min using 2 CPUs)

Step 2. Perform the classification of the sequences in the test set by optimizing the cssscl's parameter values first

$ cssscl classify -c -blast blastn -opt -tax genus -nt 8 PATH_TO/test_data/test/TEST.fa PATH_TO/test_data/

More information about the optimization can be found here.

Note that the optimization phase will take considerably longer when the -c (compression) argument is used, as explained in the section Note regarding the compression measure below.

The information about the additional optional arguments used in the command line above is provided here.

Note regarding the compression measure

Using the compression measure considerably slows down both the optimization and the classification stages, because the running time scales as ~ O(n*n) for the optimization phase and ~ O(n*m) for the classification phase, where n and m are respectively the numbers of sequences in the training and test sets. The compression measure should therefore only be used with smaller genome databases (e.g. viruses) and/or with smaller datasets (i.e. fewer reads/contigs to classify).
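
To get a feel for these numbers before enabling -c, one can count the sequences in the training and test FASTA files and multiply. The sketch below is only an order-of-magnitude illustration of the n*n and n*m terms, not a timing prediction; the function names count_fasta_records and estimate_comparisons are hypothetical.

```shell
# Count FASTA records by counting header lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Print the approximate number of pairwise comparisons for each phase.
estimate_comparisons() {
    n=$(count_fasta_records "$1")   # sequences in the training set
    m=$(count_fasta_records "$2")   # sequences in the test set
    echo "optimization:   $((n * n)) comparisons (n=$n)"
    echo "classification: $((n * m)) comparisons (n=$n, m=$m)"
}

# Usage:
#   estimate_comparisons PATH_TO/test_data/TRAIN.fa PATH_TO/test_data/test/TEST.fa
```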

Licensed under the GNU General Public License, Version 3.0. See LICENSE for more details.

Copyright 2015 The Ontario Institute for Cancer Research.

Acknowledgement

This project is supported by the Ontario Institute for Cancer Research (OICR) through funding provided by the government of Ontario, Canada.