g1ronn1mo/PARANOID

 
 

PARANOiD

Pipeline for Automated Read ANalysis Of iCLIP Data

PARANOiD is a versatile pipeline for the fully automated analysis of iCLIP and iCLIP2 data. It covers all steps needed for preprocessing and for determining cross-link locations, plus several optional analyses that detect specific characteristics, e.g. recurring distances between cross-link events or binding motifs. The cross-link sites are written as WIG files that can be visualized easily, e.g. with IGV, for which a config file is provided. Results are also provided as statistical plots for a quick overview and as standard bioinformatics file formats or TSV files that can be used in further analysis steps.

Overview

Basic usage
Inputs
Parameters
Additional analyses
Outputs

Basic usage

nextflow PARANOiD.nf --reads <reads.fastq> --reference <reference_sequence.fasta> --barcodes <barcodes.tsv>

Inputs

Reads (essential)

Reads generated by iCLIP experiments. Can be provided as one or more files. To provide more than one file, use a glob pattern (wildcards or brace alternatives) enclosed in quotation marks so that the shell passes it on unexpanded.
Format: FASTQ

Usage

--reads reads_file.fastq
--reads "reads_{1,2}.fastq"
--reads "*.fastq"
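
To check which files a quoted pattern would select before starting a run, the matching can be approximated in a few lines of Python. This is only a sketch of common glob semantics (`*` and `{a,b}` alternatives); the helper names `expand_braces` and `match_reads` are made up for this example, and Nextflow's own matcher may differ in details.

```python
import re
from fnmatch import fnmatchcase


def expand_braces(pattern):
    """Expand {a,b} alternatives into a list of plain wildcard patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def match_reads(pattern, file_names):
    """Return the file names selected by a glob-style pattern."""
    patterns = expand_braces(pattern)
    return [name for name in file_names
            if any(fnmatchcase(name, p) for p in patterns)]


if __name__ == "__main__":
    files = ["reads_1.fastq", "reads_2.fastq", "reads_3.fastq", "notes.txt"]
    print(match_reads("reads_{1,2}.fastq", files))
    print(match_reads("*.fastq", files))
```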

Reference (essential)

File containing the reference to which the reads will be mapped.
Format: FASTA

Usage

--reference reference_file.fasta

Barcodes (essential)

Barcode sequences are used to assign reads to their experiment. The file is provided as a TSV file (tab-separated values). The first column contains the experiment name and the second the nucleotide sequence of the barcode for that experiment. Each line describes one experiment, and the two columns are separated by a tab.
The experiment name should follow this pattern:
<experiment_name>_rep_<replicate-number>

Example:

experiment1_rep_1	GCATTG  
experiment1_rep_2	CAGTAA  
experiment1_rep_3	GGCCTA  
experiment2_rep_1	AATCCG  
experiment2_rep_2	CCGTTA  
experiment2_rep_3	GTCATT  

Usage

--barcodes barcode_file.tsv
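
A barcode file can be checked against the layout described above before a run. The following is a minimal sketch, not part of PARANOiD: the function name `parse_barcodes` and the specific checks (two columns, `_rep_<number>` suffix, only A/C/G/T, no duplicate barcodes) are assumptions made for this example.

```python
import csv
import io
import re

# <experiment_name>_rep_<replicate-number>
NAME_RE = re.compile(r"^(?P<experiment>.+)_rep_(?P<replicate>\d+)$")


def parse_barcodes(handle):
    """Return a dict mapping barcode sequence -> (experiment, replicate)."""
    barcodes = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated columns")
        name = row[0].strip()
        sequence = row[1].strip().upper()  # tolerate trailing spaces
        match = NAME_RE.match(name)
        if match is None:
            raise ValueError(f"line {line_no}: name {name!r} lacks _rep_<number>")
        if not re.fullmatch(r"[ACGT]+", sequence):
            raise ValueError(f"line {line_no}: invalid barcode {sequence!r}")
        if sequence in barcodes:
            raise ValueError(f"line {line_no}: duplicate barcode {sequence!r}")
        barcodes[sequence] = (match["experiment"], int(match["replicate"]))
    return barcodes


if __name__ == "__main__":
    example = "experiment1_rep_1\tGCATTG\nexperiment1_rep_2\tCAGTAA\n"
    print(parse_barcodes(io.StringIO(example)))
```

For a real file, pass an open file handle instead of the `io.StringIO` object, e.g. `open("barcode_file.tsv", newline="")`.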

Annotation

File containing annotations for the provided reference. Recommended when working with splicing-capable organisms. Required for RNA subtype analysis.
Formats: GFF, GTF

Usage

--annotation annotation_file.gff

Parameters

--barcode_pattern

A string that describes the barcode layout, so that barcode patterns other than the default (iCLIP2) can be used. N represents a position of the random barcode and X a position of the experimental barcode.
Default: NNNNNXXXXXXNNNN

Usage

Example for iCLIP:

--barcode_pattern NNXXXXNNN

Default:

--barcode_pattern NNNNNXXXXXXNNNN
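
How such a pattern maps onto a read can be illustrated with a short Python sketch. It assumes that the barcode occupies the first bases of the read, which this section does not state explicitly; `split_barcode` is a hypothetical helper and not the pipeline's own code.

```python
def split_barcode(read_prefix, pattern="NNNNNXXXXXXNNNN"):
    """Split the start of a read into (random barcode, experimental barcode).

    N positions of the pattern are collected into the random barcode,
    X positions into the experimental barcode.
    """
    if set(pattern) - {"N", "X"}:
        raise ValueError("pattern may only contain N and X")
    if len(read_prefix) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    pairs = list(zip(read_prefix, pattern))  # zip stops at the pattern length
    random_barcode = "".join(base for base, kind in pairs if kind == "N")
    experimental_barcode = "".join(base for base, kind in pairs if kind == "X")
    return random_barcode, experimental_barcode


if __name__ == "__main__":
    # iCLIP2 default layout: 5 random, 6 experimental, 4 random bases
    print(split_barcode("AAAAAGCATTGCCCCTTTT"))
    # iCLIP layout from the example above
    print(split_barcode("TTGGCCAATACGT", "NNXXXXNNN"))
```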

--domain

Selects the mapping tool. Use eu to map with the splice-aware aligner STAR when the organism is capable of splicing.
Options:
pro -> Bowtie2, for splicing-incapable organisms or spliced transcripts
eu -> STAR, for splicing-capable organisms
Default: pro

Usage

--domain eu

Default:

--domain pro

--output

Path to the output directory. Allows outputs to be saved to another location.
Default: ./output

Usage

--output /path/to/output

Default:

--output ./output

--min_length

Minimum length a read must have after adapter trimming to be retained. Reads that are trimmed below this length are removed.
Default: 30

Usage

--min_length 30

--min_qual

Minimum quality a base must have to be retained. Bases below that quality are cut off. Furthermore, reads in which too many bases fall below that quality are removed completely (see --min_percent_qual_filter). The value is a Phred score:

Quality score    Error probability    Accuracy
10               10%                  90%
20               1%                   99%
30               0.1%                 99.9%
40               0.01%                99.99%

Default: 20

Usage

--min_qual 20
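
The table follows from the standard Phred definition Q = -10 * log10(p), where p is the probability that a base call is wrong. A small Python sketch of the conversion (the function names are chosen for this example only):

```python
import math


def phred_to_error(quality):
    """Error probability for a Phred quality score."""
    return 10 ** (-quality / 10)


def error_to_phred(error_probability):
    """Phred quality score for an error probability."""
    return -10 * math.log10(error_probability)


if __name__ == "__main__":
    for quality in (10, 20, 30, 40):
        error = phred_to_error(quality)
        print(f"Q{quality}: error {error:.4%}, accuracy {1 - error:.4%}")
```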

--min_percent_qual_filter

Minimum percentage of bases that must reach the stated quality (see --min_qual) for a read to be retained after quality filtering.
Default: 90

Usage

--min_percent_qual_filter 90
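
The interplay of --min_qual and --min_percent_qual_filter can be sketched as follows. This illustrates the rule as described here, not the filtering tool the pipeline actually calls; it assumes Phred+33 encoded quality strings and treats both thresholds as inclusive, neither of which is stated in this section.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def passes_quality_filter(qualities, min_qual=20, min_percent=90):
    """True if at least min_percent of the bases have quality >= min_qual."""
    if not qualities:
        return False
    good_bases = sum(1 for quality in qualities if quality >= min_qual)
    return 100 * good_bases / len(qualities) >= min_percent


if __name__ == "__main__":
    # 'I' encodes Q40 and '+' encodes Q10 under Phred+33
    print(passes_quality_filter(decode_qualities("IIIIIIIII+")))  # 9 of 10 bases pass
    print(passes_quality_filter(decode_qualities("IIIIIIII++")))  # 8 of 10 bases pass
```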

--barcode_mismatches

Number of mismatches allowed in the experimental barcode sequence when assigning reads to experiments. This makes it possible to assign a read even if a sequencing error occurred within its barcode.
Default: 1

Usage

--barcode_mismatches 1

--mapq

Usage

--mapq 2

--split_fastq_by

Usage

--split_fastq_by 1000000

Additional analyses

Transcript analysis

--map_to_transcripts

Usage
--map_to_transcripts

--number_top_transcripts

Usage
--number_top_transcripts 10

Peak calling

--omit_peak_calling

Usage
--omit_peak_calling

Merging of replicates

--merge_replicates

Usage
--merge_replicates

RNA subtype distribution

--rna_subtypes

Usage
--rna_subtypes 3_prime_UTR,transcript,5_prime_UTR

--gene_id

Usage
--gene_id ID

--color_barplot

Usage
--color_barplot "#69b3a2"
Quote the value; otherwise the shell treats everything from # onwards as a comment.

Peak distance analysis

--omit_peak_distance

Usage
--omit_peak_distance

--percentile

Shared with sequence extraction

Usage
--percentile 90

--distance

Usage
--distance 50

Sequence extraction

--omit_sequence_extraction

Usage
--omit_sequence_extraction

--percentile

Shared with peak distance analysis

Usage
--percentile 90

--seq_len

Usage
--seq_len 20

--sequence_format_txt

Usage
--sequence_format_txt

TODO: document STREME parameters
params.max_motif_num = 50 // INT max number of motifs to search for
params.min_motif_width = 8 // INT minimum motif width to report, >= 3
params.max_motif_width = 15 // INT maximum motif width to report, <= 30

Outputs
