Releases: mtisza1/Cenote-Taker2

Cenote-Taker 2 Version 2.1.5

09 May 17:47

NOTE: Downloading the binaries will not help you to set up Cenote-Taker 2. If you haven't already installed Cenote-Taker 2, please follow installation/update instructions in README, including the database updates.

Update notes:

  1. Major changes have been made to make installation faster and easier, with a smaller data footprint (previously ~130 GB, now ~8 GB to ~75 GB depending on your database choices). Details:
  • The following tools (either tricky to install or out of date) were removed from the dependencies: krona, emboss suite, circlator, mummer.
  • The following tools were added to the dependencies: seqkit
  • The following tools were changed from stand-alone git clones to packages in the conda environment: lastal/lastdb, hhblits/hhsearch, phanotate.
  • The protein BLAST database of RefSeq etc sequences was updated to include ~3000 new RefSeq virus entries
  • The hhsuite databases (PDB, PFAM, CDD) are now optional.
  2. The tool now checks that your run_title is appropriately formatted.
  3. For contigs with DTRs (direct terminal repeats), the --wrap option lets users choose either to clip the repeat region and rotate the contig to an appropriate position, or to skip clipping and rotating and instead have the DTRs reported in the genome map. #29
  4. Certain rm commands were fixed. #21
  5. The taxonomy calling framework has been updated. NCBI Taxdump files are used for TaxIDs instead of the krona database. "tax_guide.blastx.out" files now show the taxid of the best hit, and have tab-separated hierarchical taxonomy info for that reference. Example:
example_ct1_1	gi|849254117|ref|YP_009150201.1| terminase [Propionibacterium phage PHL085N00]	45.575	9.81e-119	452
taxid: 1500812
10239	Viruses	superkingdom
2731341	Duplodnaviria	clade
2731360	Heunggongvirae	kingdom
2731618	Uroviricota	phylum
2731619	Caudoviricetes	class
28883	Caudovirales	order
10699	Siphoviridae	family
1982251	Pahexavirus	genus
1982275	Pahexavirus PHL037M02	species
  6. Protein-sequence-based taxonomy is now more flexible, with the following thresholds for genome taxon assignment:
Hallmark AAI to Reference    Taxonomic granularity from CT2
>90%                         Genus, e.g. "Ilzatvirus"
>40%                         Family, e.g. "Siphoviridae"
>25%                         Order, e.g. "Caudovirales"
<=25%                        Generic name, e.g. "phage"
  7. --hallmark_taxonomy option allows users to get hierarchical taxonomy information for all identified hallmark genes. This could be useful for more sophisticated downstream taxonomy assignments.
  8. -db virion is now the default setting. I think most people are inputting contigs assembled from WGS data, and this is the correct option for this data type.
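
For anyone post-processing these files, here is a rough Python sketch of how a "tax_guide.blastx.out" record like the example above could be read and combined with the AAI thresholds listed in these notes. It is not Cenote-Taker 2's own code. It assumes the third tab-separated field of the first line is the percent identity used as the hallmark AAI, and the names parse_tax_guide, granularity and taxon_name are made up for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

# Example record copied from the release notes; fields are tab-separated.
EXAMPLE = "\n".join([
    "example_ct1_1\tgi|849254117|ref|YP_009150201.1| terminase "
    "[Propionibacterium phage PHL085N00]\t45.575\t9.81e-119\t452",
    "taxid: 1500812",
    "10239\tViruses\tsuperkingdom",
    "2731341\tDuplodnaviria\tclade",
    "2731360\tHeunggongvirae\tkingdom",
    "2731618\tUroviricota\tphylum",
    "2731619\tCaudoviricetes\tclass",
    "28883\tCaudovirales\torder",
    "10699\tSiphoviridae\tfamily",
    "1982251\tPahexavirus\tgenus",
    "1982275\tPahexavirus PHL037M02\tspecies",
])


class TaxRecord(NamedTuple):
    query: str
    hit: str
    percent_identity: float  # ASSUMPTION: third column is the AAI to the reference
    taxid: int
    lineage: List[Tuple[int, str, str]]  # (taxid, name, rank), root first


def parse_tax_guide(text: str) -> TaxRecord:
    """Parse one record: hit line, 'taxid:' line, then one lineage row per rank."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    fields = lines[0].split("\t")
    taxid = int(lines[1].split(":", 1)[1])
    lineage = []
    for ln in lines[2:]:
        tid, name, rank = ln.split("\t")
        lineage.append((int(tid), name, rank))
    return TaxRecord(fields[0], fields[1], float(fields[2]), taxid, lineage)


def granularity(aai: float) -> Optional[str]:
    """Map AAI (%) to the rank from the threshold table; None means generic name."""
    if aai > 90:
        return "genus"
    if aai > 40:
        return "family"
    if aai > 25:
        return "order"
    return None


def taxon_name(record: TaxRecord, generic: str = "phage") -> str:
    """Pick the lineage name at the rank allowed by the AAI, else a generic name."""
    rank = granularity(record.percent_identity)
    for _, name, lineage_rank in record.lineage:
        if lineage_rank == rank:
            return name
    return generic


record = parse_tax_guide(EXAMPLE)
print(taxon_name(record))  # Siphoviridae (45.575% falls in the >40% family tier)
```

The generic fallback ("phage") is hard-coded here only to mirror the table; the real tool may choose the generic name differently.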

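To illustrate the DTR handling mentioned for the --wrap option above, here is a toy Python sketch of the "clip and rotate" behaviour. It is not how Cenote-Taker 2 detects DTRs: it assumes an exact prefix/suffix match, and the minimum repeat length and the rotation offset are hypothetical parameters, since the notes do not say how the "appropriate position" is chosen.

```python
def find_dtr(seq: str, min_len: int = 10) -> int:
    """Return the length of the longest exact direct terminal repeat, or 0.

    ASSUMPTION: a DTR is an identical prefix and suffix of at least min_len bases.
    """
    seq = seq.upper()
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def clip_and_rotate(seq: str, dtr_len: int, offset: int = 0) -> str:
    """Remove the trailing repeat copy, then rotate the contig start by offset."""
    clipped = seq[:-dtr_len] if dtr_len else seq
    if not clipped:
        return clipped
    offset %= len(clipped)
    return clipped[offset:] + clipped[:offset]


# Toy contig: a 10-base repeat at both ends around a 15-base unique region.
contig = "ACGTACGTAC" + "GGGTTTCCCAAATTT" + "ACGTACGTAC"
dtr = find_dtr(contig)
wrapped = clip_and_rotate(contig, dtr, offset=10)
```

With the other choice (no clipping or rotating), the sequence would be left as-is and only the repeat coordinates would be reported.
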
Good luck with all of your Cenotes :neckbeard: 💖

Mike

Cenote-Taker 2 Version 2.1.3

16 Jun 14:26

Downloading the binaries will not help you to set up Cenote-Taker 2. If you haven't already installed Cenote-Taker 2, please follow installation instructions in README. If you have already installed it and you are updating from v2.1.1 or earlier, please do:

conda activate cenote-taker2_env
conda install -c bioconda biopython bedtools
cd Cenote-Taker2
git pull

If you are updating from v2.1.2:

conda activate cenote-taker2_env
git pull

Anyone doing the update should also update the HMM database!
Thank you.

Update notes:

  1. ITR sequences are now annotated correctly.
  2. Problems with very large (many contigs) datasets should be resolved. There were previously some issues with find commands and argument list length.
  3. New HMMs for RNA-dependent RNA polymerase genes (7 new HMMs) have been added to the hallmark database. Thanks to Darren Obbard.

Best,

Mike

Cenote-Taker 2 Version 2.1.2

21 Apr 19:23
Compare
Choose a tag to compare

If you haven't already installed Cenote-Taker 2, please follow installation instructions in README. If you have already installed it, please do:

conda activate cenote-taker2_env
conda install -c bioconda biopython bedtools
cd Cenote-Taker2
git pull

Then update the HMM database.
Thank you.

This release improves a number of things regarding the annotation and outputs of Cenote-Taker 2. Here is a fairly comprehensive list:

  1. BLASTN can be used to determine if your sequence belongs to an extant virus species based on 95% Average Nucleotide Identity (ANI) and 85% Alignment Fraction (AF), per community standards. This module requires the GenBank nt database, the GenBank virus nucleotide database, or some subset thereof. If a sequence has at least 95% ANI and 85% AF to a virus, the taxonomy/organism name will be changed to match the GenBank entry. This module uses anicalc.py from CheckV; see the license and copyright in the anicalc directory.
  2. ORFs that overlap tRNAs are now removed to comply with GenBank guidelines. ORFs that are cut off by the end of a contig are now properly formatted per GenBank guidelines.
  3. "Messy" gene names are largely improved to comply with GenBank guidelines.
  4. Organism/Taxonomy and BLASTN info are now included in the summary .tsv file
  5. Cenote-Taker 2 uses more refined gene content searches to identify putative conjugative transposons. Also, genes that Cenote-Taker 2 flags as conjugative machinery are output as a .gtf file in the sequin_and_genome_maps directory.
  6. Cenote-Taker 2 will now take a CRISPR spacer hit table as an optional input, and will put CRISPR spacer hit info in the note of the genome output files. The format required is a tab-separated table:
    CONTIG_NAME HOST_NAME NUMBER_OF_HITS
    e.g.
    my_contig_1 bacteroides 9
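
If you are building the CRISPR spacer hit table from your own data, a small Python helper along these lines may be handy. It is only a sketch of the three-column, tab-separated format described above. The function name is made up, and it assumes the file has no header row (the notes do not say whether one is expected).

```python
from typing import Iterable, Tuple


def format_crispr_table(rows: Iterable[Tuple[str, str, int]]) -> str:
    """Render (contig, host, number_of_hits) rows as tab-separated lines.

    ASSUMPTION: no header line is written; only the three columns
    CONTIG_NAME, HOST_NAME, NUMBER_OF_HITS named in the release notes.
    """
    lines = []
    for contig, host, n_hits in rows:
        for field in (contig, host):
            # A tab or newline inside a field would corrupt the table.
            if "\t" in field or "\n" in field:
                raise ValueError(f"field contains a tab or newline: {field!r}")
        lines.append(f"{contig}\t{host}\t{int(n_hits)}")
    return "\n".join(lines) + "\n"


table = format_crispr_table([("my_contig_1", "bacteroides", 9)])
```

The resulting string can be written to a file and supplied as the optional CRISPR input.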

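The species-level rule in the BLASTN item above (at least 95% ANI and 85% AF) boils down to a simple filter. The Python sketch below shows that decision only; it does not call anicalc.py, the Hit fields are hypothetical, and breaking ties by highest ANI is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Hit:
    organism: str  # organism name of the GenBank entry (hypothetical field)
    ani: float     # average nucleotide identity, in percent
    af: float      # alignment fraction, in percent


def species_match(hits: Iterable[Hit],
                  min_ani: float = 95.0,
                  min_af: float = 85.0) -> Optional[Hit]:
    """Return the best hit meeting both thresholds, or None if none does."""
    passing = [h for h in hits if h.ani >= min_ani and h.af >= min_af]
    # ASSUMPTION: prefer the highest ANI, then the highest AF.
    return max(passing, key=lambda h: (h.ani, h.af), default=None)


def organism_name(current_name: str, hits: Iterable[Hit]) -> str:
    """Rename to the matching GenBank organism, else keep the current name."""
    best = species_match(hits)
    return best.organism if best else current_name


hits = [
    Hit("Virus A", ani=96.2, af=90.0),
    Hit("Virus B", ani=99.0, af=60.0),  # fails the AF threshold
    Hit("Virus C", ani=97.5, af=88.0),
]
new_name = organism_name("unclassified phage", hits)
```
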
Best,

Mike

Cenote-Taker 2 Version 2.1.1

04 Mar 19:36

If you haven't already installed Cenote-Taker 2, please follow installation instructions in README. If you have already installed it, please do: cd Cenote-Taker2 then git pull. Then update the HMM database.
Thank you.

The code was largely rewritten to increase parallelization, so runs with any number of contigs of any length finish much faster. I've improved the output file structure to make a more sensible summary (.tsv) file, and I've put all the genome maps (.gbf) and gene tables (.gtf) in a single directory (sequin_and_genome_maps/). I've added some options to make the user experience more intuitive, especially the -am True option, which makes Cenote-Taker 2 assume that all input sequences are viral and simply annotate them. I've also written code for Cenote Unlimited Breadsticks (unlimited_breadsticks.py) and included it in the Cenote-Taker 2 repo. The Unlimited Breadsticks tool consists ONLY of the discovery and pruning modules of Cenote-Taker 2. It runs dramatically faster than Cenote-Taker 2, as it skips all annotation steps. I still urge users to generate and examine genome maps for important virus sequences in order to manually inspect putative viruses.
Finally, I've removed about 100 HMMs from the hallmark gene database, and added about 50 new HMMs. The removed HMMs, while generated from virus sequences, were also found in non-virus regions of bacterial chromosomes, making them unsuitable for virus discovery.
Best of luck :)
Mike