Available steps

Source steps

bcl2fastq_source

Connections:
  • Output Connection:
    • ‘out/configureBcl2Fastq_log_stderr’
    • ‘out/make_log_stderr’
    • ‘out/sample_sheet’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   bcl2fastq_source [style=filled, fillcolor="#fce94f"];
   out_0 [label="configureBcl2Fastq_log_stderr"];
   bcl2fastq_source -> out_0;
   out_1 [label="make_log_stderr"];
   bcl2fastq_source -> out_1;
   out_2 [label="sample_sheet"];
   bcl2fastq_source -> out_2;
}

Options:
  • adapter-sequence (str, optional)
  • adapter-stringency (str, optional)
  • fastq-cluster-count (int, optional)
  • filter-dir (str, optional)
  • flowcell-id (str, optional)
  • ignore-missing-bcl (bool, optional)
  • ignore-missing-control (bool, optional)
  • ignore-missing-stats (bool, optional)
  • input-dir (str, required) – file URL
  • intensities-dir (str, optional)
  • mismatches (int, optional)
  • no-eamss (str, optional)
  • output-dir (str, optional)
  • positions-dir (str, optional)
  • positions-format (str, optional)
  • sample-sheet (str, required)
  • tiles (str, optional)
  • use-bases-mask (str, optional) – Conversion mask characters:
    • Y or y: use
    • N or n: discard
    • I or i: use for indexing
    If not given, the mask will be guessed from the RunInfo.xml file in the run folder. For instance, in a 2x76 indexed paired end run, the mask Y76,I6n,y75n means: “use all 76 bases from the first end, discard the last base of the indexing read, and use only the first 75 bases of the second end”.
  • with-failed-reads (str, optional)

Required tools: configureBclToFastq.pl, make, mkdir, mv

This step provides input files which already exist and therefore creates no tasks in the pipeline.
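The use-bases-mask notation of this step can be made concrete with a small parser. The following is a minimal sketch in Python, not code taken from bcl2fastq or from this pipeline. The function name parse_bases_mask is hypothetical, and the sketch assumes that a mask letter without a number stands for exactly one cycle.

```python
import re

# Meaning of the mask letters as listed in the option description.
ACTIONS = {"y": "use", "n": "discard", "i": "index"}

# One read mask: one or more letters, each with an optional repeat count.
READ_MASK = re.compile(r"(?:[YyNnIi]\d*)+")
TOKEN = re.compile(r"([YyNnIi])(\d*)")


def parse_bases_mask(mask):
    """Turn a mask such as 'Y76,I6n,y75n' into one list per read,
    each holding (action, number_of_cycles) tuples."""
    reads = []
    for read_mask in mask.split(","):
        if not READ_MASK.fullmatch(read_mask):
            raise ValueError("unsupported mask element: %r" % read_mask)
        reads.append([
            # Assumption: a letter without a count covers exactly one cycle.
            (ACTIONS[letter.lower()], int(count) if count else 1)
            for letter, count in TOKEN.findall(read_mask)
        ])
    return reads
```

Applied to the mask Y76,I6n,y75n, the sketch reports three reads: 76 used cycles, then 6 index cycles followed by 1 discarded cycle, then 75 used cycles followed by 1 discarded cycle.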

fastq_source

The FastqSource class acts as a source for FASTQ files. This source creates a run for every sample.

Specify a file name pattern in pattern and define how sample names should be determined from file names by specifying a regular expression in group.

Sample index barcodes may be specified by providing the name of a CSV file containing the columns Sample_ID and Index, or directly by defining a dictionary which maps indices to sample names.
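As an illustration of the CSV variant, the sketch below reads a table with the columns Sample_ID and Index into a dictionary. It is a hedged example and not the step's actual implementation: the function name read_indices is hypothetical, and keying the result by sample ID follows the wording of the indices option and is an assumption.

```python
import csv
import io


def read_indices(csv_text):
    """Read 'Sample_ID' and 'Index' columns into a {sample_id: barcode} dict."""
    indices = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample_id = row["Sample_ID"]
        barcode = row["Index"]
        # Refuse contradictory entries instead of silently overwriting them.
        if indices.get(sample_id, barcode) != barcode:
            raise ValueError("conflicting indices for sample %s" % sample_id)
        indices[sample_id] = barcode
    return indices
```

The same dictionary could be written directly into the configuration instead of pointing to a CSV file.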

Connections:
  • Output Connection:
    • ‘out/first_read’
    • ‘out/second_read’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   fastq_source [style=filled, fillcolor="#fce94f"];
   out_0 [label="first_read"];
   fastq_source -> out_0;
   out_1 [label="second_read"];
   fastq_source -> out_1;
}

Options:
  • first_read (str, required) – Part of the file name that marks all files containing sequencing data of the first read. Example: ‘R1.fastq’ or ‘_1.fastq’
  • group (str, optional) – A regular expression which is applied to found files, and which is used to determine the sample name from the file name. For example, (Sample_\d+)_R[12].fastq.gz, when applied to a file called Sample_1_R1.fastq.gz, would result in a sample name of Sample_1. You can specify multiple capture groups in the regular expression.
  • indices (str/dict, optional) – path to a CSV file or a dictionary of sample_id: barcode entries.
  • paired_end (bool, required) – Specify whether the samples are paired end or not.
  • pattern (str, optional) – A file name pattern, for example /home/test/fastq/Sample_*.fastq.gz.
  • sample_id_prefix (str, optional) – This optional prefix is prepended to every sample name.
  • sample_to_files_map (dict/str, optional) – A listing of sample names and their associated files. This must be provided as a YAML dictionary.
  • second_read (str, required) – Part of the file name that marks all files containing sequencing data of the second read. Example: ‘R2.fastq’ or ‘_2.fastq’

This step provides input files which already exist and therefore creates no tasks in the pipeline.
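How the group and sample_id_prefix options interact can be sketched with Python's re module. This is only an illustration: the helper name sample_name is hypothetical, and joining multiple capture groups with an underscore is an assumption, since the option description does not say how several groups are combined.

```python
import re


def sample_name(file_name, group, sample_id_prefix=""):
    """Derive a sample name from a file name using the 'group' regex."""
    match = re.search(group, file_name)
    if match is None:
        raise ValueError("%r does not match %r" % (file_name, group))
    parts = [part for part in match.groups() if part]
    if not parts:
        raise ValueError("the regular expression needs a capture group")
    # Assumption: several capture groups are joined with '_'.
    return sample_id_prefix + "_".join(parts)
```

With the regular expression from the option description, both Sample_1_R1.fastq.gz and Sample_1_R2.fastq.gz map to the sample name Sample_1, so the two read files end up in the same run.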

fetch_chrom_sizes_source

Connections:
  • Output Connection:
    • ‘out/chromosome_sizes’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   fetch_chrom_sizes_source [style=filled, fillcolor="#fce94f"];
   out_0 [label="chromosome_sizes"];
   fetch_chrom_sizes_source -> out_0;
}

Options:
  • path (str, required) – directory to move file to
  • ucsc-database (str, required) – Name of UCSC database e.g. hg38, mm9

Required tools: cp, fetchChromSizes

This step provides input files which already exist and therefore creates no tasks in the pipeline.

raw_file_source

Connections:
  • Output Connection:
    • ‘out/raw’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   raw_file_source [style=filled, fillcolor="#fce94f"];
   out_0 [label="raw"];
   raw_file_source -> out_0;
}

Options:
  • group (str, optional) – A regular expression which is applied to found files, and which is used to determine the sample name from the file name. For example, (Sample_\d+)_R[12].fastq.gz, when applied to a file called Sample_1_R1.fastq.gz, would result in a sample name of Sample_1. You can specify multiple capture groups in the regular expression.
  • pattern (str, optional) – A file name pattern, for example /home/test/fastq/Sample_*.fastq.gz.
  • sample_id_prefix (str, optional) – This optional prefix is prepended to every sample name.
  • sample_to_files_map (dict/str, optional) – A listing of sample names and their associated files. This must be provided as a YAML dictionary.

This step provides input files which already exist and therefore creates no tasks in the pipeline.

raw_file_sources

The RawFileSources class acts as a temporary fix to get files into the pipeline. This source creates a run for every sample.

Specify a file name pattern in pattern and define how sample names should be determined from file names by specifying a regular expression in group.

Connections:
  • Output Connection:
    • ‘out/raws’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   raw_file_sources [style=filled, fillcolor="#fce94f"];
   out_0 [label="raws"];
   raw_file_sources -> out_0;
}

Options:
  • group (str, required) – This is a LEGACY step. Do NOT use it; use the raw_file_source step instead. A regular expression which is applied to found files, and which is used to determine the sample name from the file name. For example, (Sample_\d+)_R[12].fastq.gz, when applied to a file called Sample_1_R1.fastq.gz, would result in a sample name of Sample_1. You can specify multiple capture groups in the regular expression.
  • paired_end (bool, required) – Specify whether the samples are paired end or not.
  • pattern (str, required) – A file name pattern, for example /home/test/fastq/Sample_*.fastq.gz.
  • sample_id_prefix (str, optional) – This optional prefix is prepended to every sample name.

This step provides input files which already exist and therefore creates no tasks in the pipeline.

raw_url_source

Connections:
  • Output Connection:
    • ‘out/raw’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   raw_url_source [style=filled, fillcolor="#fce94f"];
   out_0 [label="raw"];
   raw_url_source -> out_0;
}

Options:
  • filename (str, optional) – local file name of downloaded file
  • hashing-algorithm (str, optional) – hashing algorithm to use
    • possible values: ‘md5’, ‘sha1’, ‘sha224’, ‘sha256’, ‘sha384’, ‘sha512’
  • path (str, required) – directory to move downloaded file to
  • secure-hash (str, optional) – expected secure hash of downloaded file
  • uncompress (bool, optional) – File is uncompressed after download
  • url (str, required) – Download URL

Required tools: compare_secure_hashes, cp, curl, dd, mkdir, pigz

This step provides input files which already exist and therefore creates no tasks in the pipeline.
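The hashing-algorithm and secure-hash options describe a checksum comparison. The step itself relies on the compare_secure_hashes tool, so the following Python sketch only illustrates the idea with the standard hashlib module. The function names secure_hash and hash_matches are hypothetical.

```python
import hashlib
import io

# The values accepted by the hashing-algorithm option.
ALGORITHMS = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")


def secure_hash(stream, algorithm, chunk_size=1 << 20):
    """Hash a binary stream chunk by chunk and return the hex digest."""
    if algorithm not in ALGORITHMS:
        raise ValueError("unsupported hashing algorithm: %s" % algorithm)
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def hash_matches(stream, algorithm, expected):
    """Compare the computed digest with the expected secure hash."""
    return secure_hash(stream, algorithm) == expected.strip().lower()


if __name__ == "__main__":
    # In practice the stream would be the downloaded file opened in 'rb' mode.
    print(secure_hash(io.BytesIO(b"example data"), "sha256"))
```

Reading the data in chunks keeps memory use constant, which matters for large downloads such as genome archives.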

raw_url_sources

Connections:
  • Output Connection:
    • ‘out/raw’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   raw_url_sources [style=filled, fillcolor="#fce94f"];
   out_0 [label="raw"];
   raw_url_sources -> out_0;
}

Options:
  • run-download-info (dict, required) – Dictionary of dictionaries. The keys are the names of the runs. The values are dictionaries whose keys are identical to the options of a ‘raw_url_source’ source step. An example:

    <name>:
        filename: <filename>
        hashing-algorithm: <hashing-algorithm>
        path: <path>
        secure-hash: <secure-hash>
        uncompress: <uncompress>
        url: <url>

Required tools: compare_secure_hashes, cp, curl, dd, mkdir, pigz

This step provides input files which already exist and therefore creates no tasks in the pipeline.
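The nested structure of run-download-info can be checked against the option names of the raw_url_source step. The sketch below is a hedged illustration: the function check_run_download_info, the example run name and the example URL are hypothetical, and the split into required and optional keys is copied from the raw_url_source option list.

```python
# Option names taken from the raw_url_source step.
REQUIRED_KEYS = {"path", "url"}
OPTIONAL_KEYS = {"filename", "hashing-algorithm", "secure-hash", "uncompress"}


def check_run_download_info(run_download_info):
    """Return a list of human-readable problems; empty if all runs look fine."""
    problems = []
    for run_name, options in run_download_info.items():
        keys = set(options)
        for key in sorted(REQUIRED_KEYS - keys):
            problems.append("%s: missing required option '%s'" % (run_name, key))
        for key in sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS):
            problems.append("%s: unknown option '%s'" % (run_name, key))
    return problems


# Hypothetical example with a single run called 'genome'.
EXAMPLE = {
    "genome": {
        "url": "https://example.org/genome.fa.gz",
        "path": "downloads",
        "filename": "genome.fa",
        "uncompress": True,
    }
}
```

Calling check_run_download_info(EXAMPLE) returns an empty list, while a run without path or with a misspelled key is reported by name.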

run_folder_source

This source looks for fastq.gz files in [path]/Unaligned/Project_*/Sample_* and pulls additional information from CSV sample sheets it finds. It also makes sure that index information for all samples is coherent and unambiguous.

Connections:
  • Output Connection:
    • ‘out/first_read’
    • ‘out/second_read’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   run_folder_source [style=filled, fillcolor="#fce94f"];
   out_0 [label="first_read"];
   run_folder_source -> out_0;
   out_1 [label="second_read"];
   run_folder_source -> out_1;
}

Options:
  • first_read (str, required) – Part of the file name that marks all files containing sequencing data of the first read. Example: ‘_R1.fastq’ or ‘_1.fastq’
    • default value: _R1
  • paired_end (bool, required)
  • path (str, required)
  • project (str, required)
    • default value: *
  • second_read (str, required) – Part of the file name that marks all files containing sequencing data of the second read. Example: ‘R2.fastq’ or ‘_2.fastq’
    • default value: _R2

This step provides input files which already exist and therefore creates no tasks in the pipeline.
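The directory layout and the first_read and second_read markers can be illustrated with a short Python sketch. It is an assumption-laden example rather than the step's real code: the helper names run_folder_pattern and split_by_read are hypothetical, and placing the project option directly after the Project_ prefix is a guess based on its default value of *.

```python
import posixpath


def run_folder_pattern(path, project="*"):
    """Build a glob pattern for the fastq.gz files below a run folder."""
    # Assumption: the 'project' option replaces the wildcard in Project_*.
    return posixpath.join(
        path, "Unaligned", "Project_" + project, "Sample_*", "*.fastq.gz"
    )


def split_by_read(file_names, first_read="_R1", second_read="_R2"):
    """Sort file names into first and second reads by their name markers."""
    reads = {"first_read": [], "second_read": []}
    for file_name in sorted(file_names):
        base_name = posixpath.basename(file_name)
        if first_read in base_name:
            reads["first_read"].append(file_name)
        elif second_read in base_name:
            reads["second_read"].append(file_name)
    return reads
```

The pattern returned by run_folder_pattern could be passed to glob.glob, and the resulting file list to split_by_read, to obtain the contents of the two output connections.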

Processing steps

bam_to_bedgraph_and_bigwig

Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/bedgraph’
    • ‘out/bigwig’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   bam_to_bedgraph_and_bigwig [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> bam_to_bedgraph_and_bigwig;
   out_1 [label="bedgraph"];
   bam_to_bedgraph_and_bigwig -> out_1;
   out_2 [label="bigwig"];
   bam_to_bedgraph_and_bigwig -> out_2;
}

Options:
  • chromosome-sizes (str, required)
  • temp-sort-dir (str, optional)

Required tools: bedGraphToBigWig, bedtools, sort

CPU Cores: 8

bam_to_genome_browser

Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   bam_to_genome_browser [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> bam_to_genome_browser;
   out_1 [label="alignments"];
   bam_to_genome_browser -> out_1;
}

Options:
  • bedtools-bamtobed-color (str, optional)
  • bedtools-bamtobed-tag (str, optional)
  • bedtools-genomecov-3 (bool, optional)
  • bedtools-genomecov-5 (bool, optional)
  • bedtools-genomecov-max (int, optional)
  • bedtools-genomecov-report-zero-coverage (bool, required)
  • bedtools-genomecov-scale (float, optional)
  • bedtools-genomecov-split (bool, required)
    • default value: True
  • bedtools-genomecov-strand (str, optional)
    • possible values: ‘+’, ‘-’
  • chromosome-sizes (str, required)
  • dd-blocksize (str, optional)
    • default value: 256k
  • output-format (str, required)
    • default value: bigWig
    • possible values: ‘bed’, ‘bigBed’, ‘bedGraph’, ‘bigWig’
  • trackline (dict, optional)
  • trackopts (dict, optional)

Required tools: bedGraphToBigWig, bedToBigBed, bedtools, dd, mkfifo, pigz

CPU Cores: 8

bowtie2

Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.

http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

typical command line:

bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} -S [<hit>]

Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/alignments’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   bowtie2 [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> bowtie2;
   in_1 [label="second_read"];
   in_1 -> bowtie2;
   out_2 [label="alignments"];
   bowtie2 -> out_2;
}

Options:
  • dd-blocksize (str, optional)
    • default value: 256k
  • index (str, required) – Path to bowtie2 index (not containing file suffixes).

Required tools: bowtie2, dd, mkfifo, pigz

CPU Cores: 6
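The typical command line of this step distinguishes paired-end input (-1 and -2) from unpaired input (-U). The following Python sketch assembles such an argument list. It is illustrative only: the function bowtie2_command and all file names are hypothetical, and the real step additionally pipes data through dd, mkfifo and pigz, which is not shown here.

```python
def bowtie2_command(index, first_read, second_read=None,
                    sam_output=None, threads=None):
    """Assemble a bowtie2 argument list for single-end or paired-end reads."""
    command = ["bowtie2", "-x", index]
    if threads is not None:
        command += ["-p", str(threads)]  # -p sets the number of threads
    if second_read is None:
        command += ["-U", first_read]    # unpaired reads
    else:
        command += ["-1", first_read, "-2", second_read]
    if sam_output is not None:
        command += ["-S", sam_output]    # without -S, SAM goes to stdout
    return command
```

The resulting list can be handed to subprocess.run. Keeping the arguments as a list avoids shell quoting problems with unusual file names.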

bowtie2_generate_index

bowtie2-build builds a Bowtie index from a set of DNA sequences. bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. In the case of a large index these suffixes will have a bt2l termination. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence FASTA files are no longer used by Bowtie 2 once the index is built.

http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer

typical command line:

bowtie2-build [options]* <reference_in> <bt2_index_base>

Connections:
  • Input Connection:
    • ‘in/reference_sequence’
  • Output Connection:
    • ‘out/bowtie_index’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   bowtie2_generate_index [style=filled, fillcolor="#fce94f"];
   in_0 [label="reference_sequence"];
   in_0 -> bowtie2_generate_index;
   out_1 [label="bowtie_index"];
   bowtie2_generate_index -> out_1;
}

Options:
  • bmax (int, optional) – The maximum number of suffixes allowed in a block. Allowing more suffixes per block makes indexing faster, but increases peak memory usage. Setting this option overrides any previous setting for --bmax or --bmaxdivn. Default (in terms of the --bmaxdivn parameter) is --bmaxdivn 4. This is configured automatically by default; use -a/--noauto to configure manually.
  • bmaxdivn (int, optional) – The maximum number of suffixes allowed in a block, expressed as a fraction of the length of the reference. Setting this option overrides any previous setting for --bmax or --bmaxdivn. Default: --bmaxdivn 4. This is configured automatically by default; use -a/--noauto to configure manually.
  • cutoff (int, optional) – Index only the first <int> bases of the reference sequences (cumulative across sequences) and ignore the rest.
  • dcv (int, optional) – Use <int> as the period for the difference-cover sample. A larger period yields less memory overhead, but may make suffix sorting slower, especially if repeats are present. Must be a power of 2 no greater than 4096. Default: 1024. This is configured automatically by default; use -a/--noauto to configure manually.
  • dd-blocksize (str, optional)
    • default value: 256k
  • ftabchars (int, optional) – The ftab is the lookup table used to calculate an initial Burrows-Wheeler range with respect to the first <int> characters of the query. A larger <int> yields a larger lookup table but faster query times. The ftab has size 4^(<int>+1) bytes. The default setting is 10 (ftab is 4MB).
  • index-basename (str, required) – Base name used for the bowtie2 index.
  • large-index (bool, optional) – Force bowtie2-build to build a large index, even if the reference is less than ~ 4 billion nucleotides long.
  • noauto (bool, optional) – Disable the default behavior whereby bowtie2-build automatically selects values for the --bmax, --dcv and --packed parameters according to available memory. Instead, the user may specify values for those parameters. If memory is exhausted during indexing, an error message will be printed; it is up to the user to try new parameters.
  • nodc (bool, optional) – Disable use of the difference-cover sample. Suffix sorting becomes quadratic-time in the worst case (where the worst case is an extremely repetitive reference). Default: off.
  • offrate (int, optional) – To map alignments back to positions on the reference sequences, it’s necessary to annotate (‘mark’) some or all of the Burrows-Wheeler rows with their corresponding location on the genome. -o/--offrate governs how many rows get marked: the indexer will mark every 2^<int> rows. Marking more rows makes reference-position lookups faster, but requires more memory to hold the annotations at runtime. The default is 5 (every 32nd row is marked; for human genome, annotations occupy about 340 megabytes).
  • packed (bool, optional) – Use a packed (2-bits-per-nucleotide) representation for DNA strings. This saves memory but makes indexing 2-3 times slower. Default: off. This is configured automatically by default; use -a/--noauto to configure manually.
  • seed (int, optional) – Use <int> as the seed for pseudo-random number generator.

Required tools: bowtie2-build, dd, pigz

CPU Cores: 6
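Several of the options of this step are defined by simple formulas: the ftab size is 4^(<int>+1) bytes, every 2^<int> rows are marked for offrate, and dcv must be a power of 2 no greater than 4096. The Python sketch below only restates this arithmetic. The function names are hypothetical and are not part of bowtie2-build.

```python
def ftab_size_bytes(ftabchars=10):
    """Size of the ftab lookup table: 4^(ftabchars + 1) bytes."""
    return 4 ** (ftabchars + 1)


def marked_row_interval(offrate=5):
    """With offrate n, every 2^n-th Burrows-Wheeler row is marked."""
    return 2 ** offrate


def is_valid_dcv(dcv):
    """dcv must be a power of 2 that is no greater than 4096."""
    return 0 < dcv <= 4096 and dcv & (dcv - 1) == 0
```

For the defaults this gives 4 ** 11 = 4194304 bytes, which is the 4 MB quoted for ftabchars, and an interval of 32 rows for offrate, matching the option descriptions.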

bwa_backtrack

bwa-backtrack is the bwa algorithm designed for Illumina sequence reads up to 100bp. The computation of the alignments is done by running ‘bwa aln’ first, to align the reads, followed by running ‘bwa samse’ or ‘bwa sampe’ afterwards to generate the final SAM output.

http://bio-bwa.sourceforge.net/

typical command line for single-end data:

bwa aln <bwa-index> <first-read.fastq> > <first-read.sai>
bwa samse <bwa-index> <first-read.sai> <first-read.fastq> > <sam-output>

typical command line for paired-end data:

bwa aln <bwa-index> <first-read.fastq> > <first-read.sai>
bwa aln <bwa-index> <second-read.fastq> > <second-read.sai>
bwa sampe <bwa-index> <first-read.sai> <second-read.sai> <first-read.fastq> <second-read.fastq> > <sam-output>

Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/alignments’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   bwa_backtrack [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> bwa_backtrack;
   in_1 [label="second_read"];
   in_1 -> bwa_backtrack;
   out_2 [label="alignments"];
   bwa_backtrack -> out_2;
}

Options:
  • aln-0 (bool, optional) – When aln-b is specified, only use single-end reads in mapping.
  • aln-1 (bool, optional) – When aln-b is specified, only use the first read in a read pair in mapping (skip single-end reads and the second reads).
  • aln-2 (bool, optional) – When aln-b is specified, only use the second read in a read pair in mapping.
  • aln-B (int, optional) – Length of barcode starting from the 5’-end. When INT is positive, the barcode of each read will be trimmed before mapping and will be written at the BC SAM tag. For paired-end reads, the barcode from both ends are concatenated. [0]
  • aln-E (int, optional) – Gap extension penalty [4]
  • aln-I (bool, optional) – The input is in the Illumina 1.3+ read format (quality equals ASCII-64).
  • aln-M (int, optional) – Mismatch penalty. BWA will not search for suboptimal hits with a score lower than (bestScore-misMsc). [3]
  • aln-N (bool, optional) – Disable iterative search. All hits with no more than maxDiff differences will be found. This mode is much slower than the default.
  • aln-O (int, optional) – Gap open penalty [11]
  • aln-R (int, optional) – Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).
  • aln-b (bool, optional) – Specify that the input read sequence file is in the BAM format. For paired-end data, two ends in a pair must be grouped together and options aln-1 or aln-2 are usually applied to specify which end should be mapped. Typical command lines for mapping paired-end data in the BAM format are:

    bwa aln ref.fa -b1 reads.bam > 1.sai
    bwa aln ref.fa -b2 reads.bam > 2.sai
    bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam

  • aln-c (bool, optional) – Reverse query but not complement it, which is required for alignment in the color space. (Disabled since 0.6.x)
  • aln-d (int, optional) – Disallow a long deletion within INT bp towards the 3’-end [16]
  • aln-e (int, optional) – Maximum number of gap extensions, -1 for k-difference mode (disallowing long gaps) [-1]
  • aln-i (int, optional) – Disallow an indel within INT bp towards the ends [5]
  • aln-k (int, optional) – Maximum edit distance in the seed [2]
  • aln-l (int, optional) – Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35 for ‘-k 2’. [inf]
  • aln-n (float, optional) – Maximum edit distance if the value is INT, or the fraction of missing alignments given 2% uniform base error rate if FLOAT. In the latter case, the maximum edit distance is automatically chosen for different read lengths. [0.04]
  • aln-o (int, optional) – Maximum number of gap opens [1]
  • aln-q (int, optional) – Parameter for read trimming. BWA trims a read down to argmax_x{sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]
  • aln-t (int, optional) – Number of threads (multi-threading mode) [1]
    • default value: 6
  • dd-blocksize (str, optional)
    • default value: 256k
  • index (str, required) – Path to BWA index
  • sampe-N (int, optional) – Maximum number of alignments to output in the XA tag for disconcordant read pairs (excluding singletons). If a read has more than INT hits, the XA tag will not be written. [10]
  • sampe-P (bool, optional) – Load the entire FM-index into memory to reduce disk operations (base-space reads only). With this option, at least 1.25N bytes of memory are required, where N is the length of the genome.
  • sampe-a (int, optional) – Maximum insert size for a read pair to be considered being mapped properly. Since 0.4.5, this option is only used when there are not enough good alignment to infer the distribution of insert sizes. [500]
  • sampe-n (int, optional) – Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
  • sampe-o (int, optional) – Maximum occurrences of a read for pairing. A read with more occurrences will be treated as a single-end read. Reducing this parameter helps faster pairing. [100000]
  • sampe-r (str, optional) – Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]
  • samse-n (int, optional) – Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
  • samse-r (str, optional) – Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]

Required tools: bwa, dd, mkfifo, pigz

CPU Cores: 8
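The trimming rule given for aln-q, argmax_x{sum_{i=x+1}^l(INT-q_i)} if q_l<INT, can be written out in a few lines. The following Python sketch implements exactly that formula and nothing more. The function name trimmed_length is hypothetical, and the actual BWA implementation may differ in details such as a minimum read length.

```python
def trimmed_length(qualities, threshold):
    """Return the read length after quality trimming.

    qualities holds the base qualities q_1 .. q_l of one read and
    threshold is the INT value passed to aln-q.
    """
    length = len(qualities)
    # Trimming only applies if the last base is below the threshold.
    if length == 0 or threshold <= 0 or qualities[-1] >= threshold:
        return length
    best_x, best_sum, running = length, 0, 0
    # Walk from the 3' end; 'running' is sum_{i=x+1}^{l} (INT - q_i).
    for x in range(length - 1, -1, -1):
        running += threshold - qualities[x]
        if running > best_sum:
            best_sum, best_x = running, x
    return best_x
```

For the qualities 30, 30, 30, 10, 5 and a threshold of 20, the partial sums from the 3' end are 15, 25, 15, 5 and -5, so the maximum is reached at x = 3 and the read is trimmed to its first three bases.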

bwa_generate_index

This step generates the index database from sequences in the FASTA format.

Typical command line:

bwa index -p <index-basename> <sequence.fasta>

Connections:
  • Input Connection:
    • ‘in/reference_sequence’
  • Output Connection:
    • ‘out/bwa_index’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   bwa_generate_index [style=filled, fillcolor="#fce94f"];
   in_0 [label="reference_sequence"];
   in_0 -> bwa_generate_index;
   out_1 [label="bwa_index"];
   bwa_generate_index -> out_1;
}

Options:
  • index-basename (str, required) – Prefix of the created index database

Required tools: bwa

CPU Cores: 6

bwa_mem

Align 70bp-1Mbp query sequences with the BWA-MEM algorithm. Briefly, the algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW).

http://bio-bwa.sourceforge.net/bwa.shtml

Typical command line:

bwa mem [options] <bwa-index> <first-read.fastq> [<second-read.fastq>] > <sam-output>

Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/alignments’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   bwa_mem [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> bwa_mem;
   in_1 [label="second_read"];
   in_1 -> bwa_mem;
   out_2 [label="alignments"];
   bwa_mem -> out_2;
}

Options:
  • A (int, optional) – score for a sequence match, which scales options -TdBOELU unless overridden [1]
  • B (int, optional) – penalty for a mismatch [4]
  • C (bool, optional) – append FASTA/FASTQ comment to SAM output
  • D (float, optional) – drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
  • E (str, optional) – gap extension penalty; a gap of size k cost ‘{-O} + {-E}*k’ [1,1]
  • H (str, optional) – insert STR to header if it starts with @; or insert lines in FILE [null]
  • L (str, optional) – penalty for 5’- and 3’-end clipping [5,5]
  • M (str, optional) – mark shorter split hits as secondary
  • O (str, optional) – gap open penalties for deletions and insertions [6,6]
  • P (bool, optional) – skip pairing; mate rescue performed unless -S also in use
  • R (str, optional) – read group header line such as ‘@RG\tID:foo\tSM:bar’ [null]
  • S (bool, optional) – skip mate rescue
  • T (int, optional) – minimum score to output [30]
  • U (int, optional) – penalty for an unpaired read pair [17]
  • V (bool, optional) – output the reference FASTA header in the XR tag
  • W (int, optional) – discard a chain if seeded bases shorter than INT [0]
  • Y (str, optional) – use soft clipping for supplementary alignments
  • a (bool, optional) – output all alignments for SE or unpaired PE
  • c (int, optional) – skip seeds with more than INT occurrences [500]
  • d (int, optional) – off-diagonal X-dropoff [100]
  • dd-blocksize (str, optional)
    • default value: 256k
  • e (bool, optional) – discard full-length exact matches
  • h (str, optional) – if there are <INT hits with score >80% of the max score, output all in XA [5,200]
  • index (str, required) – Path to BWA index
  • j (bool, optional) – treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
  • k (int, optional) – minimum seed length [19]
  • m (int, optional) – perform at most INT rounds of mate rescues for each read [50]
  • p (bool, optional) – smart pairing (ignoring in2.fq)
  • r (float, optional) – look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
  • t (int, optional) – number of threads [6]
    • default value: 6
  • v (int, optional) – verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
  • w (int, optional) – band width for banded alignment [100]
  • x (str, optional) – read type. Setting -x changes multiple parameters unless overridden [null]
    • pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
    • ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
    • intractg: -B9 -O16 -L5 (intra-species contigs to ref)
  • y (int, optional) – seed occurrence for the 3rd round seeding [20]

Required tools: bwa, dd, mkfifo, pigz

CPU Cores: 6
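Two of the option descriptions of this step contain small formulas: a gap of size k costs ‘{-O} + {-E}*k’, and internal seeds are searched inside seeds longer than {-k} * FLOAT. The Python sketch below evaluates both with the documented defaults. The function names are hypothetical, and the sketch uses a single open and extension penalty although the options accept separate values for deletions and insertions.

```python
def gap_cost(k, gap_open=6, gap_extend=1):
    """Cost of a gap of size k: {-O} + {-E} * k."""
    return gap_open + gap_extend * k


def reseed_length(min_seed_length=19, factor=1.5):
    """Seeds longer than {-k} * {-r} are searched for internal seeds."""
    return min_seed_length * factor
```

With the defaults, a gap of three bases costs 6 + 1 * 3 = 9, which is more than two mismatches at a penalty of 4 each, and re-seeding starts for seeds longer than 28.5 bases.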

chromhmm_binarizebam

This command converts coordinates of aligned reads into a binarized data form from which a chromatin state model can be learned. The binarization is based on a Poisson background model. If no control data is specified, the parameter of the Poisson distribution is the global average number of reads per bin. If control data is specified, the global average number of reads is multiplied by the local enrichment for control reads as determined by the specified parameters. Optionally, intermediate signal files can also be written, and these signal files can later be converted directly into binary form using the BinarizeSignal command.
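The Poisson background model can be illustrated with a short calculation. The sketch below is written in Python and is not ChromHMM code: the function names are hypothetical, and defining the threshold as the smallest read count whose upper tail probability does not exceed the chosen value is an assumption about how the tail probability is applied.

```python
import math


def poisson_tail(mean, count):
    """P(X >= count) for a Poisson distributed X with the given mean."""
    term = math.exp(-mean)  # P(X = 0)
    cdf = 0.0
    for k in range(count):
        cdf += term
        term *= mean / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cdf)


def binarization_threshold(mean, tail_probability=0.0001):
    """Smallest read count whose tail probability is <= tail_probability."""
    if mean < 0:
        raise ValueError("mean must not be negative")
    count = 0
    while poisson_tail(mean, count) > tail_probability:
        count += 1
    return count


def local_mean(global_mean, control_enrichment=1.0):
    """With control data, scale the global mean by the local enrichment."""
    return global_mean * control_enrichment
```

With a global average of one read per bin and a tail probability of 0.0001, this sketch marks a bin as present once it holds at least 7 reads. A locally enriched control raises the mean and therefore the threshold.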
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/cellmarkfiletable’
    • ‘out/chromhmm_binarization’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   chromhmm_binarizebam [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> chromhmm_binarizebam;
   out_1 [label="cellmarkfiletable"];
   chromhmm_binarizebam -> out_1;
   out_2 [label="chromhmm_binarization"];
   chromhmm_binarizebam -> out_2;
}

Options:
  • b (int, optional) – The number of base pairs in a bin determining the resolution of the model learning and segmentation. By default this parameter value is set to 200 base pairs.
  • cell_mark_files (dict, required) – A dictionary where the keys are the names of the runs and the values are lists of lists. The lists of lists describe the content of a ‘cellmarkfiletable’ file as used by ‘BinarizeBam’, but with the run ID in place of the file name for the mark and control on each line. A ‘cellmarkfiletable’ is a tab-delimited file where each row contains the cell type or other identifier for a group of marks, then the associated mark, then the name of a BAM file, and optionally a corresponding control BAM file. If a mark is missing in one cell type but not in others, it will receive a 2 for all entries in the binarization file and -1 in the signal file. If the same cell and mark combination appears on multiple lines, then the union of all the reads across entries is taken, except for control data, where each unique file is only counted once.
  • center (bool, optional) – If this flag is present, the center of the interval is used to determine the bin to which a read is assigned. This can make sense if the coordinates are based on already extended reads. If this option is selected, the strand information of a read and the shift parameter are ignored. By default a read is assigned to a bin based on the position of its 5’ end, as determined from the strand of the read, after shifting by the amount given by the -n shift option.
  • chrom_sizes_file (str, required)
  • e (int, optional) – Specifies the amount that should be subtracted from the end coordinate of a read so that both coordinates are inclusive and 0 based. The default value is 1 corresponding to standard bed convention of the end interval being 0-based but not inclusive.
  • f (int, optional) – This indicates a threshold for the fold enrichment over expected that must be met or exceeded by the observed count in a bin for a present call. The expectation is determined in the same way as the mean parameter for the Poisson distribution, that is, based on a uniform background unless control data is specified. This parameter can be useful when dealing with very deeply and/or unevenly sequenced data. By default this parameter value is 0, meaning that it is effectively not used.
  • g (int, optional) – This indicates a threshold for the signal that must be met or exceeded by the observed count in a bin for a present call. This parameter can be useful when desiring to directly place a threshold on the signal. By default this parameter value is 0 meaning effectively it is not used.
  • n (int, optional) – The number of bases a read should be shifted to determine a bin assignment. Bin assignment is based on the 5’ end of a read shifted this amount with respect to the strand orientation. By default this value is 100.
  • p (float, optional) – This option specifies the tail probability of the Poisson distribution that the binarization threshold should correspond to. The default value of this parameter is 0.0001.
  • s (int, optional) – The amount that should be subtracted from the interval start coordinate so the interval is inclusive and 0 based. Default is 0 corresponding to the standard bed convention.
  • strictthresh (bool, optional) – If this flag is present, then the Poisson threshold must be strictly greater than the tail probability; otherwise, by default, the largest integer count for which the tail includes the Poisson threshold probability is used.
  • u (int, optional) – An integer pseudocount that is uniformly added to every bin in the control data in order to smooth the control data from 0. The default value is 1.
  • w (int, optional) – This determines the extent of the spatial smoothing in computing the local enrichment for control reads. The local enrichment for control signal in the x-th bin on the chromosome after adding pseudocountcontrol is computed based on the average control counts for all bins within x-w and x+w. If no controldir is specified, then this option is ignored. The default value is 5.
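
The bin assignment and Poisson threshold rules described in the options above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ChromHMM's implementation: the function names assign_bin, poisson_tail and poisson_threshold are hypothetical, coordinates are taken to be 0-based and inclusive, and the threshold is taken to be the smallest count whose upper tail probability does not exceed p. ChromHMM's exact boundary convention (see strictthresh) may differ by one.

```python
import math


def assign_bin(start, end, strand, shift=100, bin_size=200, center=False):
    """Return the index of the bin a read is counted in (illustrative only)."""
    if center:
        # With 'center', strand and shift are ignored.
        pos = (start + end) // 2
    elif strand == "+":
        # The 5' end of a forward read is its start; shift downstream.
        pos = start + shift
    else:
        # The 5' end of a reverse read is its end; shift towards lower coordinates.
        pos = end - shift
    return max(pos, 0) // bin_size


def poisson_tail(lam, k):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = 0.0
    term = math.exp(-lam)  # P(X = 0)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def poisson_threshold(lam, p=0.0001):
    """Smallest count k with P(X >= k) <= p (assumed convention)."""
    k = 0
    while poisson_tail(lam, k) > p:
        k += 1
    return k
```

Here lam stands for the expected number of reads per bin, i.e. the global average, optionally scaled by the local control enrichment.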

Required tools: ChromHMM, ln, ls, mkdir, printf, tar, xargs

CPU Cores: 4
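
As a rough illustration of the cell_mark_files option, the sketch below renders one entry of such a dictionary as tab-delimited ‘cellmarkfiletable’ text. The function name render_cellmarkfiletable, the run name and all cell, mark and run ID values are hypothetical, and the sketch assumes each inner list holds the cell type, the mark, a run ID in place of the BAM file and, optionally, a control run ID. How the step itself resolves run IDs to file names is not shown here.

```python
def render_cellmarkfiletable(rows):
    """Render rows of [cell, mark, run_id(, control_run_id)] as tab-delimited text."""
    lines = []
    for row in rows:
        if len(row) not in (3, 4):
            raise ValueError("expected 3 or 4 fields, got %r" % (row,))
        lines.append("\t".join(str(field) for field in row))
    return "\n".join(lines) + "\n"


# Hypothetical option value: keys are run names, values are lists of lists.
cell_mark_files = {
    "example_run": [
        ["K562", "H3K4me3", "K562_H3K4me3_rep1", "K562_input"],
        ["K562", "H3K27ac", "K562_H3K27ac_rep1"],
    ],
}

table = render_cellmarkfiletable(cell_mark_files["example_run"])
```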

chromhmm_learnmodel

This command takes a directory with a set of binarized data files and learns a chromatin state model. Binarized data files have “_binary” in the file name. In each binarized data file, the first line contains the name of the cell and the name of the chromosome, separated by a tab. The second line contains the name of each mark in tab-delimited form. The remaining lines correspond to consecutive bins on the chromosome; each holds one tab-delimited entry per mark, with a “1” for a present call, a “0” for an absent call, and a “2” if the data is considered missing at that interval for the mark.
Connections:
  • Input Connection:
    • ‘in/cellmarkfiletable’
    • ‘in/chromhmm_binarization’
  • Output Connection:
    • ‘out/chromhmm_model’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   chromhmm_learnmodel [style=filled, fillcolor="#fce94f"];
   in_0 [label="cellmarkfiletable"];
   in_0 -> chromhmm_learnmodel;
   in_1 [label="chromhmm_binarization"];
   in_1 -> chromhmm_learnmodel;
   out_2 [label="chromhmm_model"];
   chromhmm_learnmodel -> out_2;
}

Options:
  • assembly (str, required) – Specifies the genome assembly. Overlap and neighborhood enrichments will be called with default parameters using this genome assembly. Assembly names are e.g. hg18, hg19, GRCh38.
  • b (int, optional) – The number of base pairs in a bin determining the resolution of the model learning and segmentation. By default this parameter value is set to 200 base pairs.
  • color (str, optional) – This specifies the color of the heat map. “r,g,b” are integer values between 0 and 255 separated by commas. By default this parameter value is 0,0,255 corresponding to blue.
  • d (float, optional) – The threshold on the change on the estimated log likelihood that if it falls below this value, then parameter training will terminate. If this value is less than 0 then it is not used as part of the stopping criteria. The default value for this parameter is 0.001.
  • e (float, optional) – This parameter is only applicable if the load option is selected for the init parameter. This parameter controls the smoothing away from 0 when loading a model. The emission value used in the model initialization is a weighted average of the value in the file and a uniform probability over the two possible emissions. The value in the file gets weight (1-loadsmoothemission) while uniform gets weight loadsmoothemission. The default value of this parameter is 0.02.
  • h (float, optional) – A smoothing constant away from 0 for all parameters in the information based initialization. This option is ignored if random or load are selected for the initialization method. The default value of this parameter is 0.02.
  • holdcolumnorder (bool, optional) – Including this flag suppresses the reordering of the mark columns in the emission parameter table display.
  • init (str, optional) – This specifies the method for parameter initialization. ‘information’ is the default method described in (Ernst and Kellis, Nature Methods 2012). ‘random’ randomly initializes the parameters from a uniform distribution. ‘load’ loads the parameters specified in ‘-m modelinitialfile’ and smooths them based on the value of the ‘loadsmoothemission’ and ‘loadsmoothtransition’ parameters. The default is ‘information’.
    • possible values: ‘information’, ‘random’, ‘load’
  • l (str, optional) – This file specifies the length of the chromosomes. It is a two column tab delimited file with the first column specifying the chromosome name and the second column the length. If this file is provided then no end coordinate will exceed what is specified in this file. By default BinarizeBed excludes the last partial bin along the chromosome, but if that is included in the binarized data input files then this file should be included to give a valid end coordinate for the last interval.
  • m (str, optional) – This specifies the model file containing the initial parameters which can then be used with the load option
  • nobed (bool, optional) – If this flag is present, then this suppresses the printing of segmentation information in the four column format. The default is to generate a four column segmentation file
  • nobrowser (bool, optional) – If this flag is present, then browser files are not printed. If -nobed is requested then browserfile writing is also suppressed.
  • noenrich (bool, optional) – If this flag is present, then enrichment files are not printed. If -nobed is requested then enrichment file writing is also suppressed.
  • numstates (int, required)
  • r (int, optional) – This option specifies the maximum number of iterations over all the input data in the training. By default this is set to 200.
  • s (int, optional) – This allows the specification of the random seed. Randomization is used to determine the visit order of chromosomes in the incremental expectation-maximization algorithm used to train the parameters and also used to generate the initial values of the parameters if random is specified for the init method.
  • stateordering (str, optional) – This determines whether the states are ordered based on the emission or transition parameters. See (Ernst and Kellis, Nature Methods) for details. Default is ‘emission’.
    • possible values: ‘emission’, ‘transition’
  • t (float, optional) – This parameter is only applicable if the load option is selected for the init parameter. This parameter controls the smoothing away from 0 when loading a model. The transition value used in the model initialization is a weighted average of the value in the file and a uniform probability over the transitions. The value in the file gets weight (1-loadsmoothtransition) while uniform gets weight loadsmoothtransition. The default value is 0.5.
  • x (int, optional) – This parameter specifies the maximum number of seconds that can be spent optimizing the model parameters. If it is less than 0, then there is no limit and termination is based on maximum number of iterations or a log likelihood change criteria. The default value of this parameter is -1.
  • z (int, optional) – This parameter determines the threshold at which extremely low transition probabilities are set to 0 during training. Setting extremely low transition probabilities to 0 makes model learning more efficient with essentially no impact on the final results. If a transition probability falls below 10^-zerotransitionpower during training, it is set to 0. Making this parameter too low, and thus the cutoff too high, can potentially cause some numerical instability. By default this parameter is set to 8.
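
The smoothing described for the e and t options and the cutoff described for the z option are simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes that the "uniform probability over the transitions" is 1/numstates; it illustrates the stated formulas and is not taken from the ChromHMM source.

```python
def smooth_emission(value, loadsmoothemission=0.02):
    """Weighted average of a loaded emission value and the uniform value 0.5."""
    return (1.0 - loadsmoothemission) * value + loadsmoothemission * 0.5


def smooth_transition(value, num_states, loadsmoothtransition=0.5):
    """Weighted average of a loaded transition value and 1/num_states (assumed uniform)."""
    return (1.0 - loadsmoothtransition) * value + loadsmoothtransition / num_states


def zero_small_transition(prob, zerotransitionpower=8):
    """Set a transition probability below 10^-zerotransitionpower to 0."""
    return 0.0 if prob < 10.0 ** (-zerotransitionpower) else prob
```

With the default e of 0.02, a loaded emission value of exactly 0 becomes 0.01, so no emission starts training at zero.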

Required tools: ChromHMM, ls, mkdir, rm, tar, xargs

CPU Cores: 8
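
The binarized data format described for this step can be read with a short parser. The following is a minimal sketch that assumes the layout given in the description (cell and chromosome on the first line, mark names on the second, one row of 0/1/2 calls per bin); the function name parse_binarized and the example content are hypothetical.

```python
def parse_binarized(text):
    """Parse the text of one '_binary' file into (cell, chromosome, marks, rows)."""
    lines = text.splitlines()
    cell, chromosome = lines[0].split("\t")
    marks = lines[1].split("\t")
    rows = []
    for line in lines[2:]:
        if not line:
            continue  # tolerate a trailing empty line
        calls = [int(value) for value in line.split("\t")]
        if len(calls) != len(marks) or any(c not in (0, 1, 2) for c in calls):
            raise ValueError("malformed bin line: %r" % line)
        rows.append(calls)
    return cell, chromosome, marks, rows


# Hypothetical file content: two marks, three consecutive bins.
example = "K562\tchr1\nH3K4me3\tH3K27ac\n0\t0\n1\t0\n1\t2\n"
cell, chromosome, marks, rows = parse_binarized(example)
```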

cuffcompare

CuffCompare is part of the ‘Cufflinks suite of tools’ for differential expression analysis of RNA-Seq data and their visualisation. This step compares a Cufflinks assembly to a known annotation. For details about cuffcompare, we refer to the authors’ webpage:

http://cole-trapnell-lab.github.io/cufflinks/cuffcompare/

Connections:
  • Input Connection:
    • ‘in/features’
  • Output Connection:
    • ‘out/features’
    • ‘out/loci’
    • ‘out/log_stderr’
    • ‘out/stats’
    • ‘out/tracking’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   cuffcompare [style=filled, fillcolor="#fce94f"];
   in_0 [label="features"];
   in_0 -> cuffcompare;
   out_1 [label="features"];
   cuffcompare -> out_1;
   out_2 [label="loci"];
   cuffcompare -> out_2;
   out_3 [label="log_stderr"];
   cuffcompare -> out_3;
   out_4 [label="stats"];
   cuffcompare -> out_4;
   out_5 [label="tracking"];
   cuffcompare -> out_5;
}

Options:
  • ref-gtf (str, optional) – A “reference” annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.

Required tools: cuffcompare

CPU Cores: 1

cuffmerge

CuffMerge is part of the ‘Cufflinks suite of tools’ for differential expression analysis of RNA-Seq data and their visualisation. This step applies the cuffmerge tool, which merges several Cufflinks assemblies. For details on cuffmerge, we refer to the authors’ webpage:

http://cole-trapnell-lab.github.io/cufflinks/cuffmerge/

Connections:
  • Input Connection:
    • ‘in/features’
  • Output Connection:
    • ‘out/assemblies’
    • ‘out/features’
    • ‘out/log_stderr’
    • ‘out/run_log’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   cuffmerge [style=filled, fillcolor="#fce94f"];
   in_0 [label="features"];
   in_0 -> cuffmerge;
   out_1 [label="assemblies"];
   cuffmerge -> out_1;
   out_2 [label="features"];
   cuffmerge -> out_2;
   out_3 [label="log_stderr"];
   cuffmerge -> out_3;
   out_4 [label="run_log"];
   cuffmerge -> out_4;
}

Options:
  • num-threads (int, optional) – Use this many threads to merge assemblies.
    • default value: 6
  • ref-gtf (str, optional) – A “reference” annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.
  • ref-sequence (str, optional) – This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig. If a multifasta file, all contigs should be present.
  • run_id (str, optional) – An arbitrary name of the new run (which is a merge of all samples).
    • default value: magic

Required tools: cuffmerge, mkdir, mv, printf

CPU Cores: 6

cutadapt

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

https://cutadapt.readthedocs.org/en/stable/

Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/first_read’
    • ‘out/log_first_read’
    • ‘out/log_second_read’
    • ‘out/second_read’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   cutadapt [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> cutadapt;
   in_1 [label="second_read"];
   in_1 -> cutadapt;
   out_2 [label="first_read"];
   cutadapt -> out_2;
   out_3 [label="log_first_read"];
   cutadapt -> out_3;
   out_4 [label="log_second_read"];
   cutadapt -> out_4;
   out_5 [label="second_read"];
   cutadapt -> out_5;
}

Options:
  • adapter-R1 (str, optional) – Adapter sequence to be clipped off of the first read.
  • adapter-R2 (str, optional) – Adapter sequence to be clipped off of the second read.
  • adapter-file (str, optional) – File containing adapter sequences to be clipped off of the reads.
  • adapter-type (str, optional) – a: 3’ adapter, b: 3’ or 5’ adapter, g: 5’ adapter
    • default value: -a
    • possible values: ‘-a’, ‘-g’, ‘-b’
  • dd-blocksize (str, optional)
    • default value: 256k
  • fix_qnames (bool, required) – If set to true, only the leftmost string without spaces of the QNAME field of the FASTQ data is kept. This might be necessary for downstream analysis.
  • use_reverse_complement (bool, required) – The reverse complement of adapter sequences ‘adapter-R1’ and ‘adapter-R2’ are used for adapter clipping.
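
The effect of the use_reverse_complement and fix_qnames options can be illustrated as follows. This is a sketch only: the function names are hypothetical, the adapter sequence is an arbitrary example, and the fix_qnames tool used by the step may differ in detail from the one-line rule shown here.

```python
# Translation table for DNA bases, including N and lower-case letters.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def fix_qname(header_line):
    """Keep only the leftmost space-free string of a FASTQ header line."""
    return header_line.split(" ", 1)[0]
```

For example, an adapter given as AGATCGGAAGAGC would be clipped as GCTCTTCCGATCT when use_reverse_complement is set.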

Required tools: cat, cutadapt, dd, fix_qnames, mkfifo, pigz

CPU Cores: 4

discardLargeSplitsAndPairs

discardLargeSplitsAndPairs reads SAM-formatted alignments of the mapped reads. It discards all split reads that skip more than N_splits nucleotides in their alignment to the reference genome. In addition, all read pairs that are mapped to distant regions such that the final template would exceed M_mates nucleotides are also discarded. All remaining reads are returned in SAM format. The discarded reads are also collected in a SAM-formatted file, and summary statistics are returned.
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’
    • ‘out/log’
    • ‘out/stats’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   discardLargeSplitsAndPairs [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> discardLargeSplitsAndPairs;
   out_1 [label="alignments"];
   discardLargeSplitsAndPairs -> out_1;
   out_2 [label="log"];
   discardLargeSplitsAndPairs -> out_2;
   out_3 [label="stats"];
   discardLargeSplitsAndPairs -> out_3;
}

Options:
  • M_mates (str, required) – Size of template (in nucleotides) that would arise from a read pair. Read pairs that exceed this value are discarded.
  • N_splits (str, required) – Size of the skipped region within a split read (in nucleotides). Split Reads that skip more nt than this value are discarded.
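
The two filter criteria can be sketched on single SAM records as shown below. This is an illustration under assumptions, not the tool's implementation: it treats the skipped region as the longest ‘N’ operation in the CIGAR string (column 6) and the template size as the absolute value of TLEN (column 9), and the function names are hypothetical. The options are declared as strings, so they would need to be converted to integers first.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def largest_skip(cigar):
    """Length of the longest skipped region (CIGAR operation 'N'), or 0."""
    return max((int(n) for n, op in CIGAR_OP.findall(cigar) if op == "N"), default=0)


def keep_alignment(sam_line, n_splits, m_mates):
    """Return True if the record passes both size filters."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar = fields[5]                        # column 6: CIGAR
    template_length = abs(int(fields[8]))    # column 9: TLEN
    if largest_skip(cigar) > n_splits:
        return False
    if template_length > m_mates:
        return False
    return True
```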

Required tools: dd, discardLargeSplitsAndPairs, pigz, samtools

CPU Cores: 4

fastqc

The fastqc step is a wrapper for the fastqc tool. It generates some quality metrics for fastq files. For this specific instance only the zip archive is preserved.

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/first_read_fastqc_report’
    • ‘out/first_read_fastqc_report_webpage’
    • ‘out/first_read_log_stderr’
    • ‘out/second_read_fastqc_report’
    • ‘out/second_read_fastqc_report_webpage’
    • ‘out/second_read_log_stderr’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   fastqc [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> fastqc;
   in_1 [label="second_read"];
   in_1 -> fastqc;
   out_2 [label="first_read_fastqc_report"];
   fastqc -> out_2;
   out_3 [label="first_read_fastqc_report_webpage"];
   fastqc -> out_3;
   out_4 [label="first_read_log_stderr"];
   fastqc -> out_4;
   out_5 [label="second_read_fastqc_report"];
   fastqc -> out_5;
   out_6 [label="second_read_fastqc_report_webpage"];
   fastqc -> out_6;
   out_7 [label="second_read_log_stderr"];
   fastqc -> out_7;
}

Required tools: fastqc, mkdir, mv

CPU Cores: 1

fastx_quality_stats

fastx_quality_stats generates a text file containing quality information of the input FASTQ data.

Documentation:

http://hannonlab.cshl.edu/fastx_toolkit/
Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/first_read_quality_stats’
    • ‘out/second_read_quality_stats’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   fastx_quality_stats [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> fastx_quality_stats;
   in_1 [label="second_read"];
   in_1 -> fastx_quality_stats;
   out_2 [label="first_read_quality_stats"];
   fastx_quality_stats -> out_2;
   out_3 [label="second_read_quality_stats"];
   fastx_quality_stats -> out_3;
}

Options:
  • dd-blocksize (str, optional)
    • default value: 256k
  • new_output_format (bool, optional)
    • default value: True
  • quality (int, optional)
    • default value: 33

Required tools: cat, dd, fastx_quality_stats, mkfifo, pigz

CPU Cores: 4

fix_cutadapt

This step takes FASTQ data and removes both reads of a paired-end read, if one of them has been completely removed by cutadapt (or any other software).
Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/first_read’
    • ‘out/second_read’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   fix_cutadapt [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> fix_cutadapt;
   in_1 [label="second_read"];
   in_1 -> fix_cutadapt;
   out_2 [label="first_read"];
   fix_cutadapt -> out_2;
   out_3 [label="second_read"];
   fix_cutadapt -> out_3;
}

Options:
  • dd-blocksize (str, optional)
    • default value: 256k

Required tools: cat, dd, fix_cutadapt, mkfifo, pigz

CPU Cores: 4
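
The pairing rule of this step can be sketched on in-memory records. The sketch assumes that both read lists are in the same order and of equal length and that an empty sequence marks a read that was completely removed; the function name drop_broken_pairs and the tuple layout are hypothetical and do not describe the fix_cutadapt tool itself.

```python
def drop_broken_pairs(first_reads, second_reads):
    """Keep only pairs in which both mates still have sequence.

    Each read is a (name, sequence, quality) tuple.
    """
    kept_first, kept_second = [], []
    for r1, r2 in zip(first_reads, second_reads):
        if r1[1] and r2[1]:
            kept_first.append(r1)
            kept_second.append(r2)
    return kept_first, kept_second
```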

htseq_count

The htseq-count script counts the number of reads overlapping a feature. Input needs to be a file with aligned sequencing reads and a list of genomic features. For more information see:

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

Connections:
  • Input Connection:
    • ‘in/alignments’
    • ‘in/features’
  • Output Connection:
    • ‘out/counts’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   htseq_count [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> htseq_count;
   in_1 [label="features"];
   in_1 -> htseq_count;
   out_2 [label="counts"];
   htseq_count -> out_2;
}

Options:
  • a (int, optional)
  • dd-blocksize (str, optional)
    • default value: 256k
  • feature-file (str, optional)
  • idattr (str, optional)
    • default value: gene_id
  • mode (str, optional)
    • default value: union
    • possible values: ‘union’, ‘intersection-strict’, ‘intersection-nonempty’
  • order (str, required)
    • possible values: ‘name’, ‘pos’
  • stranded (str, required)
    • possible values: ‘yes’, ‘no’, ‘reverse’
  • type (str, optional)
    • default value: exon

Required tools: dd, htseq-count, pigz, samtools

CPU Cores: 2

macs2

Model-based Analysis of ChIP-Seq (MACS) is an algorithm for the identification of transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions, and it improves the spatial resolution of binding sites by combining the information of both sequencing tag position and orientation. MACS can easily be used for ChIP-Seq data alone, or with control sample data to increase the specificity.

https://github.com/taoliu/MACS

Typical command line for single-end data:

macs2 callpeak --treatment <aligned-reads> [--control <aligned-reads>]
               --name <run-id> --gsize 2.7e9
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/broadpeaks’
    • ‘out/broadpeaks-xls’
    • ‘out/diagnosis’
    • ‘out/gappedpeaks’
    • ‘out/log’
    • ‘out/model’
    • ‘out/narrowpeaks’
    • ‘out/narrowpeaks-xls’
    • ‘out/summits’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   macs2 [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> macs2;
   out_1 [label="broadpeaks"];
   macs2 -> out_1;
   out_2 [label="broadpeaks-xls"];
   macs2 -> out_2;
   out_3 [label="diagnosis"];
   macs2 -> out_3;
   out_4 [label="gappedpeaks"];
   macs2 -> out_4;
   out_5 [label="log"];
   macs2 -> out_5;
   out_6 [label="model"];
   macs2 -> out_6;
   out_7 [label="narrowpeaks"];
   macs2 -> out_7;
   out_8 [label="narrowpeaks-xls"];
   macs2 -> out_8;
   out_9 [label="summits"];
   macs2 -> out_9;
}

Options:
  • bdg (bool, optional)
  • broad (bool, optional)
  • broad-cutoff (float, optional)
  • buffer-size (int, optional)
  • bw (int, optional)
  • call-summits (bool, optional)
  • control (dict, required)
  • down-sample (bool, optional)
  • extsize (int, optional)
  • format (str, required)
    • default value: AUTO
    • possible values: ‘AUTO’, ‘ELAND’, ‘ELANDMULTI’, ‘ELANDMULTIPET’, ‘ELANDEXPORT’, ‘BED’, ‘SAM’, ‘BAM’, ‘BAMPE’, ‘BOWTIE’
  • gsize (str, required)
    • default value: 2.7e9
  • keep-dup (int, optional)
  • llocal (str, optional)
  • mfold (str, optional)
  • nolambda (bool, optional)
  • nomodel (bool, optional)
  • pvalue (float, optional)
  • qvalue (float, optional)
  • read-length (int, optional)
  • shift (int, optional)
  • slocal (str, optional)
  • to-large (bool, optional)
  • verbose (int, optional)
    • possible values: ‘0’, ‘1’, ‘2’, ‘3’

Required tools: macs2, mkdir, mv, pigz

CPU Cores: 4
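
The typical single-end command line given in the description of this step can be assembled as an argument list, for example for use with Python's subprocess module. The helper name macs2_callpeak_command is hypothetical, and only the flags that appear in that command line are used; how the step itself builds its call is not shown here.

```python
def macs2_callpeak_command(treatment, name, gsize="2.7e9", control=None):
    """Build the argument list for the typical single-end macs2 call."""
    command = ["macs2", "callpeak", "--treatment", treatment]
    if control is not None:
        # The control sample is optional.
        command += ["--control", control]
    command += ["--name", name, "--gsize", gsize]
    return command
```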

merge_fasta_files

This step merges all .fasta(.gz) files belonging to a certain sample. The output files are gzipped.
Connections:
  • Input Connection:
    • ‘in/sequence’
  • Output Connection:
    • ‘out/sequence’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   merge_fasta_files [style=filled, fillcolor="#fce94f"];
   in_0 [label="sequence"];
   in_0 -> merge_fasta_files;
   out_1 [label="sequence"];
   merge_fasta_files -> out_1;
}

Options:
  • compress-output (bool, optional) – If set to true output is gzipped.
    • default value: True
  • dd-blocksize (str, optional)
    • default value: 256k
  • merge-all-runs (bool, optional) – If set to true sequences from all runs are merged
  • output-fasta-basename (str, optional) – Name used as prefix for FASTA output.

Required tools: cat, dd, mkfifo, pigz

CPU Cores: 4

merge_fastq_files

This step merges all .fastq(.gz) files belonging to a certain sample. First and second read files are merged separately. The output files are gzipped.
Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/first_read’
    • ‘out/second_read’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   merge_fastq_files [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> merge_fastq_files;
   in_1 [label="second_read"];
   in_1 -> merge_fastq_files;
   out_2 [label="first_read"];
   merge_fastq_files -> out_2;
   out_3 [label="second_read"];
   merge_fastq_files -> out_3;
}

Options:
  • dd-blocksize (str, optional)
    • default value: 256k

Required tools: cat, dd, mkfifo, pigz

CPU Cores: 4
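
What the merge amounts to can be sketched with Python's standard library. Note that this swaps in the gzip module for the cat, dd, mkfifo and pigz tools that the step actually requires, so it illustrates the result rather than the step's implementation; the function name merge_fastq and the file names are hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def merge_fastq(inputs, output):
    """Concatenate plain or gzipped FASTQ files into one gzipped file."""
    with gzip.open(output, "wb") as out:
        for path in inputs:
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rb") as handle:
                shutil.copyfileobj(handle, out)


# Small demonstration with two throw-away files.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "a.fastq")
    zipped = os.path.join(tmp, "b.fastq.gz")
    merged_path = os.path.join(tmp, "merged.fastq.gz")
    with open(plain, "wb") as handle:
        handle.write(b"@r1\nACGT\n+\nIIII\n")
    with gzip.open(zipped, "wb") as handle:
        handle.write(b"@r2\nTTGA\n+\nIIII\n")
    merge_fastq([plain, zipped], merged_path)
    with gzip.open(merged_path, "rb") as handle:
        merged = handle.read()
```

First and second read files would each be merged by a separate call, so that mates stay in matching order.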

picard_add_replace_read_groups

Replace read groups in a BAM file. This tool enables the user to replace all read groups in the INPUT file with a single new read group and assign all reads to this read group in the OUTPUT BAM file.

Documentation:

https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   picard_add_replace_read_groups [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> picard_add_replace_read_groups;
   out_1 [label="alignments"];
   picard_add_replace_read_groups -> out_1;
}

Options:
  • COMPRESSION_LEVEL (int, optional) – Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to “null” to clear the default value.
  • CREATE_INDEX (bool, optional) – Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to “null” to clear the default value.
  • CREATE_MD5_FILE (bool, optional) – Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to “null” to clear the default value.
  • GA4GH_CLIENT_SECRETS (str, optional) – Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to “null” to clear the default value.
  • MAX_RECORDS_IN_RAM (int, optional) – When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to “null” to clear the default value.
  • QUIET (bool, optional) – Whether to suppress job-summary info on System.err. Default value: false. This option can be set to “null” to clear the default value.
  • REFERENCE_SEQUENCE (str, optional) – Reference sequence file. Default value: null.
  • RGCN (str, optional) – Read Group sequencing center name. Default value: null.
  • RGDS (str, optional) – Read Group description. Default value: null.
  • RGDT (str, optional) – Read Group run date. Default value: null.
  • RGID (str, optional) – Read Group ID. Default value: 1. This option can be set to “null” to clear the default value.
  • RGLB (str, required) – Read Group library
  • RGPG (str, optional) – Read Group program group. Default value: null.
  • RGPI (int, optional) – Read Group predicted insert size. Default value: null.
  • RGPL (str, required) – Read Group platform (e.g. illumina, solid)
  • RGPM (str, optional) – Read Group platform model. Default value: null.
  • RGPU (str, required) – Read Group platform unit (e.g. run barcode)
  • SORT_ORDER (str, optional) – Optional sort order to output in. If not supplied OUTPUT is in the same order as INPUT. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate}
    • possible values: ‘unsorted’, ‘queryname’, ‘coordinate’, ‘duplicate’
  • TMP_DIR (str, optional) – A file. Default value: null. This option may be specified 0 or more times.
  • VALIDATION_STRINGENCY (str, optional) – Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to “null” to clear the default value.
    • possible values: ‘STRICT’, ‘LENIENT’, ‘SILENT’
  • VERBOSITY (str, optional) – Control verbosity of logging. Default value: INFO. This option can be set to “null” to clear the default value.
    • possible values: ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’

Required tools: picard-tools

CPU Cores: 6

picard_markduplicates

Identifies duplicate reads. This tool locates and tags duplicate reads (both PCR and optical/sequencing-driven) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA. Duplicates are identified as read pairs having identical 5’ positions (coordinate and strand) for both reads in a mate pair (and optionally, matching unique molecular identifier reads; see the BARCODE_TAG option). Optical, or more broadly sequencing, duplicates are duplicates that appear clustered together spatially during sequencing and can arise from optical/image-processing artifacts or from biochemical processes during clonal amplification and sequencing; they are identified using the READ_NAME_REGEX and OPTICAL_DUPLICATE_PIXEL_DISTANCE options. The tool’s main output is a new SAM or BAM file in which duplicates have been identified in the SAM flags field, or optionally removed (see REMOVE_DUPLICATES and REMOVE_SEQUENCING_DUPLICATES), and optionally marked with a duplicate type in the ‘DT’ optional attribute. In addition, it outputs a metrics file containing the numbers of READ_PAIRS_EXAMINED, UNMAPPED_READS, UNPAIRED_READS, UNPAIRED_READ_DUPLICATES, READ_PAIR_DUPLICATES, and READ_PAIR_OPTICAL_DUPLICATES.
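
The core criterion, identical 5’ positions and strands for both mates, can be sketched as a grouping key. This is a deliberately simplified illustration with a hypothetical function name and tuple layout; Picard's actual procedure covers more cases (unpaired reads, barcodes, optical duplicates, choosing which read to keep) that are not modelled here.

```python
from collections import defaultdict


def duplicate_groups(pairs):
    """Group read pairs by the 5' position and strand of both mates.

    Each pair is (name, (chrom1, pos1, strand1), (chrom2, pos2, strand2)),
    where pos is the 5' coordinate of the mate.
    """
    groups = defaultdict(list)
    for name, mate1, mate2 in pairs:
        # Sort the mates so that the key does not depend on mate order.
        key = tuple(sorted([mate1, mate2]))
        groups[key].append(name)
    return [names for names in groups.values() if len(names) > 1]
```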

Usage example:

java -jar picard.jar MarkDuplicates I=input.bam O=marked_duplicates.bam M=marked_dup_metrics.txt
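
The usage example above can also be assembled as an argument list, with further options appended in Picard's KEY=VALUE style. The helper name markduplicates_command is hypothetical, and the sketch only mirrors the command line shown here; it is not the code this step uses to call Picard.

```python
def markduplicates_command(input_bam, output_bam, metrics, options=None, jar="picard.jar"):
    """Build the MarkDuplicates argument list from the usage example."""
    command = [
        "java", "-jar", jar, "MarkDuplicates",
        "I=%s" % input_bam,
        "O=%s" % output_bam,
        "M=%s" % metrics,
    ]
    # Additional options are passed as KEY=VALUE, in a stable order.
    for key, value in sorted((options or {}).items()):
        command.append("%s=%s" % (key, value))
    return command
```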

Documentation:

https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’
    • ‘out/metrics’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   picard_markduplicates [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> picard_markduplicates;
   out_1 [label="alignments"];
   picard_markduplicates -> out_1;
   out_2 [label="metrics"];
   picard_markduplicates -> out_2;
}

Options:
  • ASSUME_SORTED (bool, optional)
  • COMMENT (str, optional)
  • COMPRESSION_LEVEL (int, optional) – Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to “null” to clear the default value.
  • CREATE_INDEX (bool, optional) – Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to “null” to clear the default value.
  • CREATE_MD5_FILE (bool, optional) – Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to “null” to clear the default value.
  • GA4GH_CLIENT_SECRETS (str, optional) – Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to “null” to clear the default value.
  • MAX_FILE_HANDLES (int, optional)
  • MAX_RECORDS_IN_RAM (int, optional) – When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to “null” to clear the default value.
  • OPTICAL_DUPLICATE_PIXEL_DISTANCE (int, optional)
  • PROGRAM_GROUP_COMMAND_LINE (str, optional)
  • PROGRAM_GROUP_NAME (str, optional)
  • PROGRAM_GROUP_VERSION (str, optional)
  • PROGRAM_RECORD_ID (str, optional)
  • QUIET (bool, optional) – Whether to suppress job-summary info on System.err. Default value: false. This option can be set to “null” to clear the default value.
  • READ_NAME_REGEX (str, optional)
  • REFERENCE_SEQUENCE (str, optional) – Reference sequence file. Default value: null.
  • SORTING_COLLECTION_SIZE_RATIO (float, optional)
  • TMP_DIR (str, optional) – A file. Default value: null. This option may be specified 0 or more times.
  • VALIDATION_STRINGENCY (str, optional) – Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to “null” to clear the default value.
    • possible values: ‘STRICT’, ‘LENIENT’, ‘SILENT’
  • VERBOSITY (str, optional) – Control verbosity of logging. Default value: INFO. This option can be set to “null” to clear the default value.
    • possible values: ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’

Required tools: picard-tools

CPU Cores: 12

picard_merge_sam_bam_files

Documentation:

https://broadinstitute.github.io/picard/command-line-overview.html#MergeSamFiles
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   picard_merge_sam_bam_files [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> picard_merge_sam_bam_files;
   out_1 [label="alignments"];
   picard_merge_sam_bam_files -> out_1;
}

Options:
  • ASSUME_SORTED (bool, optional) – If true, assume that the input files are in the same sort order as the requested output sort order, even if their headers say otherwise. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}
  • COMMENT (str, optional) – Comment(s) to include in the merged output file’s header. Default value: null.
  • COMPRESSION_LEVEL (int, optional) – Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to “null” to clear the default value.
  • CREATE_INDEX (bool, optional) – Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to “null” to clear the default value.
  • CREATE_MD5_FILE (bool, optional) – Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to “null” to clear the default value.
  • GA4GH_CLIENT_SECRETS (str, optional) – Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to “null” to clear the default value.
  • INTERVALS (str, optional) – An interval list file that contains the locations of the positions to merge. Assume bam are sorted and indexed. The resulting file will contain alignments that may overlap with genomic regions outside the requested region. Unmapped reads are discarded. Default value: null.
  • MAX_RECORDS_IN_RAM (int, optional) – When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to “null” to clear the default value.
  • MERGE_SEQUENCE_DICTIONARIES (bool, optional) – Merge the sequence dictionaries. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}
  • QUIET (bool, optional) – Whether to suppress job-summary info on System.err. Default value: false. This option can be set to “null” to clear the default value.
  • REFERENCE_SEQUENCE (str, optional) – Reference sequence file. Default value: null.
  • SORT_ORDER (str, optional) – Sort order of output file. Default value: coordinate. This option can be set to ‘null’ to clear the default value. Possible values: {unsorted, queryname, coordinate, duplicate}
    • possible values: ‘unsorted’, ‘queryname’, ‘coordinate’, ‘duplicate’
  • TMP_DIR (str, optional) – A file. Default value: null. This option may be specified 0 or more times.
  • USE_THREADING (bool, optional) – Option to create a background thread to encode, compress and write to disk the output file. The threaded version uses about 20% more CPU and decreases runtime by ~20% when writing out a compressed BAM file. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}
  • VALIDATION_STRINGENCY (str, optional) – Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to “null” to clear the default value.
    • possible values: ‘STRICT’, ‘LENIENT’, ‘SILENT’
  • VERBOSITY (str, optional) – Control verbosity of logging. Default value: INFO. This option can be set to “null” to clear the default value.
    • possible values: ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’

Required tools: ln, picard-tools

CPU Cores: 12

post_cufflinksSuite

The cufflinks suite can be used to assemble new transcripts and merge those with known annotations. However, the output .gtf files need to be reformatted in several aspects afterwards. This step can be used to reformat and filter the cufflinksSuite .gtf file.
Connections:
  • Input Connection:
    • ‘in/features’
  • Output Connection:
    • ‘out/features’
    • ‘out/log_stderr’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   post_cufflinksSuite [style=filled, fillcolor="#fce94f"];
   in_0 [label="features"];
   in_0 -> post_cufflinksSuite;
   out_1 [label="features"];
   post_cufflinksSuite -> out_1;
   out_2 [label="log_stderr"];
   post_cufflinksSuite -> out_2;
}

Options:
  • class_list (str, optional) – Class codes to be removed; possible values: ‘=,c,j,e,i,o,p,r,u,x,s,.’
  • filter_by_class (bool, required) – Remove GTF lines if any listed class is found in the class_code field; requires class_list
  • filter_by_class_and_gene_name (bool, required) – Combines remove-by-class and remove-by-gene-name
  • gene_name (str, optional) – String to match in gtf field gene_name for discarding
    • default value: ENS
  • remove_by_gene_name (bool, required) – Remove GTF lines that match ‘string’ in the gene_name field
  • remove_gencode (bool, required) – Hard removal of GTF lines which match ‘ENS’ in the gene_name field
  • remove_unstranded (bool, required) – Removes transcripts without strand specificity
  • run_id (str, optional) – An arbitrary name of the new run (which is a merge of all samples).
    • default value: magic
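The filtering options above can be pictured with a minimal sketch. It does not reproduce the post_cufflinks_merge tool: the keep_gtf_line helper is hypothetical, the option semantics are read off the descriptions in this list, and the GTF attribute layout (key "value"; pairs in the ninth column) is assumed.

```python
import re

# Matches attribute pairs such as: class_code "u";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def keep_gtf_line(line, class_list="", gene_name=None):
    """Return False if a GTF line should be discarded (hypothetical helper).

    class_list is a comma-separated string of class codes to remove,
    gene_name a substring that discards a line when found in gene_name.
    """
    if line.startswith("#"):
        return True  # keep comment lines untouched
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return True  # not a feature line; leave it alone
    attributes = dict(ATTRIBUTE.findall(fields[8]))

    codes = {code for code in class_list.split(",") if code}
    if attributes.get("class_code") in codes:
        return False
    if gene_name is not None and gene_name in attributes.get("gene_name", ""):
        return False
    return True


PREFIX = "chr1\tCufflinks\ttranscript\t100\t200\t.\t+\t.\t"
novel = PREFIX + 'gene_id "XLOC_1"; transcript_id "TCONS_1"; class_code "u";'
known = PREFIX + 'gene_id "XLOC_2"; gene_name "ENSG0001"; class_code "=";'
variant = PREFIX + 'gene_id "XLOC_3"; gene_name "MYGENE"; class_code "j";'

kept = [line for line in (novel, known, variant)
        if keep_gtf_line(line, class_list="u,x", gene_name="ENS")]
print(len(kept))
```

Passing both class_list and gene_name corresponds to the combined behaviour described for filter_by_class_and_gene_name.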

Required tools: cat, post_cufflinks_merge

CPU Cores: 6

preseq_complexity_curve

The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.

c_curve computes the expected yield of distinct reads for experiments smaller than the input experiment in a .bed or .bam file through resampling. The full set of parameters can be printed by simply typing the program name. If output.txt is the desired output file name and input.sort.bed is the sorted input .bed file, then simply type:

preseq c_curve -o output.txt input.sort.bed
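The idea behind a complexity curve can be sketched in a few lines. This is not preseq's algorithm: the complexity_curve helper is hypothetical, reads are represented by plain identifiers, and a single shuffled subsample stands in for the resampling that c_curve performs on mapped reads.

```python
import random


def complexity_curve(reads, step, seed=0):
    """Return (total_reads, distinct_reads) points for growing subsamples.

    The reads are shuffled once and distinct reads are counted in prefixes
    of increasing size, which mimics sequencing the same library less deeply.
    """
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)

    seen = set()
    curve = []
    for total, read in enumerate(shuffled, start=1):
        seen.add(read)
        # Report a point every `step` reads and once more at the very end.
        if total % step == 0 or total == len(shuffled):
            curve.append((total, len(seen)))
    return curve


reads = ["r1", "r1", "r2", "r3", "r3", "r3", "r4", "r5"]
for total, distinct in complexity_curve(reads, step=2):
    print(total, distinct, sep="\t")
```

A curve that flattens early indicates a low-complexity library, where additional sequencing mostly yields reads that were already observed.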
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/complexity_curve’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   preseq_complexity_curve [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> preseq_complexity_curve;
   out_1 [label="complexity_curve"];
   preseq_complexity_curve -> out_1;
}

Options:
  • hist (bool, optional) – input is a text file containing the observed histogram
  • pe (bool, required) – input is paired end read file
  • seg_len (int, optional) – maximum segment length when merging paired end bam reads (default: 5000)
  • step (int, optional) – step size in extrapolations (default: 1e+06)
  • vals (bool, optional) – input is a text file containing only the observed counts

Required tools: preseq

CPU Cores: 4

preseq_future_genome_coverage

The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.

gc_extrap computes the expected genomic coverage for deeper sequencing for single cell sequencing experiments. The input should be an mr or bed file. The tool bam2mr is provided to convert sorted bam or sam files to mapped read format.

Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/future_genome_coverage’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   preseq_future_genome_coverage [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> preseq_future_genome_coverage;
   out_1 [label="future_genome_coverage"];
   preseq_future_genome_coverage -> out_1;
}

Options:
  • bin_size (int, optional) – bin size (default: 10)
  • bootstraps (int, optional) – number of bootstraps (default: 100)
  • cval (float, optional) – level for confidence intervals (default: 0.95)
  • extrap (int, optional) – maximum extrapolation in base pairs (default: 1e+12)
  • max_width (int, optional) – max fragment length, set equal to read length for single end reads
  • quick (bool, optional) – quick mode: run gc_extrap without bootstrapping for confidence intervals
  • step (int, optional) – step size in bases between extrapolations (default: 1e+08)
  • terms (int, optional) – maximum number of terms

Required tools: preseq

CPU Cores: 4

preseq_future_yield

The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.

lc_extrap computes the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals.

Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/future_yield’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   preseq_future_yield [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> preseq_future_yield;
   out_1 [label="future_yield"];
   preseq_future_yield -> out_1;
}

Options:
  • bootstraps (int, optional) – number of bootstraps (default: 100)
  • cval (float, optional) – level for confidence intervals (default: 0.95)
  • dupl_level (float, optional) – fraction of duplicate to predict (default: 0.5)
  • extrap (int, optional) – maximum extrapolation (default: 1e+10)
  • hist (bool, optional) – input is a text file containing the observed histogram
  • pe (bool, required) – input is paired end read file
  • quick (bool, optional) – quick mode, estimate yield without bootstrapping for confidence intervals
  • seg_len (int, optional) – maximum segment length when merging paired end bam reads (default: 5000)
  • step (int, optional) – step size in extrapolations (default: 1e+06)
  • terms (int, optional) – maximum number of terms
  • vals (bool, optional) – input is a text file containing only the observed counts

Required tools: preseq

CPU Cores: 4

remove_duplicate_reads_runs

Duplicates are removed by Picard tools ‘MarkDuplicates’.

Typical command line:

MarkDuplicates INPUT=<SAM/BAM> OUTPUT=<SAM/BAM>
               METRICS_FILE=<metrics-out> REMOVE_DUPLICATES=true
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’
    • ‘out/metrics’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   remove_duplicate_reads_runs [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> remove_duplicate_reads_runs;
   out_1 [label="alignments"];
   remove_duplicate_reads_runs -> out_1;
   out_2 [label="metrics"];
   remove_duplicate_reads_runs -> out_2;
}

Required tools: MarkDuplicates

CPU Cores: 12

rgt_thor

THOR is an HMM-based approach to detect and analyze differential peaks in two sets of ChIP-seq data from distinct biological conditions with replicates. THOR performs genomic signal processing, peak calling and p-value calculation in an integrated framework. For differential peak calling without replicates use ODIN.

For more information, please refer to:

Allhoff, M., Sere K., Freitas, J., Zenke, M., Costa, I.G. (2016), Differential Peak Calling of ChIP-seq Signals with Replicates with THOR, Nucleic Acids Research, epub gkw680.

Feel free to post your question in our Google Group or write an e-mail: rgtusers@googlegroups.com

Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/chip_seq_bigwig’
    • ‘out/diff_narrow_peaks’
    • ‘out/diff_peaks_thor_bed’
    • ‘out/thor_config’
    • ‘out/thor_setup_info’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   rgt_thor [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> rgt_thor;
   out_1 [label="chip_seq_bigwig"];
   rgt_thor -> out_1;
   out_2 [label="diff_narrow_peaks"];
   rgt_thor -> out_2;
   out_3 [label="diff_peaks_thor_bed"];
   rgt_thor -> out_3;
   out_4 [label="thor_config"];
   rgt_thor -> out_4;
   out_5 [label="thor_setup_info"];
   rgt_thor -> out_5;
}

Options:
  • binsize (int, optional) – Size of bins for creating the signal.
  • chrom_sizes_file (str, required)
  • config_file (dict, required) – A dictionary with
  • deadzones (str, optional) – Define blacklisted genomic regions to be ignored by the peak caller.
  • exts (str, optional) – Read’s extension size for BAM files (comma separated list for each BAM file in config file). If option is not chosen, estimate extension sizes from reads.
  • factors-inputs (str, optional) – Normalization factors for input-DNA (comma separated list for each BAM file in config file). If option is not chosen, estimate factors.
  • genome (str, required) – FASTA file containing the complete genome sequence
  • housekeeping-genes (str, optional) – Define housekeeping genes (BED format) used for normalizing.
  • merge (bool, optional) – Merge peaks which have a distance less than the estimated mean fragment size (recommended for histone data).
  • name (str, optional) – Experiment’s name and prefix for all files that are created.
  • no-correction (bool, optional) – Do not use multiple test correction for p-values (Benjamini/Hochberg).
  • no-gc-content (bool, optional) – Do not normalize towards GC content.
  • pvalue (float, optional) – P-value cutoff for peak detection. Call only peaks with p-value lower than cutoff.
  • report (bool, optional) – Generate HTML report about experiment.
  • save-input (bool, optional) – Save input DNA bigwig (if input was provided).
  • scaling-factors (str, optional) – Scaling factor for each BAM file (not control input-DNA) as comma separated list for each BAM file in config file. If option is not chosen, follow normalization strategy (TMM or HK approach)
  • step (int, optional) – Stepsize with which the window consecutively slides across the genome to create the signal.

Required tools: printf, rgt-THOR

CPU Cores: 4

rseqc

The RSeQC step can be used to evaluate aligned reads in a BAM file. RSeQC does not only report raw sequence-based metrics, but also quality control metrics like read distribution, gene coverage, and sequencing depth.
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/bam_stat’
    • ‘out/infer_experiment’
    • ‘out/read_distribution’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   rseqc [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> rseqc;
   out_1 [label="bam_stat"];
   rseqc -> out_1;
   out_2 [label="infer_experiment"];
   rseqc -> out_2;
   out_3 [label="read_distribution"];
   rseqc -> out_3;
}

Options:
  • reference (str, required) – Reference gene model in BED format.

Required tools: bam_stat.py, cat, infer_experiment.py, read_distribution.py

CPU Cores: 1

s2c

s2c formats the output of segemehl mapping to be compatible with the cufflinks suite of tools for differential expression analysis of RNA-Seq data and their visualisation. For details on cufflinks we refer to the authors’ webpage:

http://cole-trapnell-lab.github.io/cufflinks/

Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’
    • ‘out/log’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   s2c [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> s2c;
   out_1 [label="alignments"];
   s2c -> out_1;
   out_2 [label="log"];
   s2c -> out_2;
}

Options:
  • tmp_dir (str, required) – Temp directory for ‘s2c.py’. This can be in the /work/username/ path, since it is only temporary.

Required tools: cat, dd, fix_s2c, pigz, s2c, samtools

CPU Cores: 6

sam_to_sorted_bam

The step sam_to_sorted_bam builds on ‘samtools sort’ to sort SAM files and output BAM files.

Sort alignments by leftmost coordinates, or by read name when -n is used. An appropriate @HD-SO sort order header tag will be added or an existing one updated if necessary.
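The two sort orders can be compared on a toy example. The sketch below is only an illustration: sort_sam_records is a hypothetical helper working on (read name, reference name, position) tuples, read names are compared as plain strings (samtools uses its own read-name comparison), and unmapped records are assumed to sort last.

```python
def sort_sam_records(records, reference_order, by_name=False):
    """Sort (qname, rname, pos) tuples by coordinate or by read name.

    reference_order lists the reference names in header order; records on
    references missing from it (e.g. '*') are placed at the end.
    """
    if by_name:
        return sorted(records, key=lambda record: record[0])
    rank = {name: index for index, name in enumerate(reference_order)}
    last = len(rank)
    return sorted(records,
                  key=lambda record: (rank.get(record[1], last), record[2]))


records = [
    ("read3", "chr2", 50),
    ("read1", "chr1", 700),
    ("read2", "chr1", 20),
    ("read4", "*", 0),
]
by_coordinate = sort_sam_records(records, ["chr1", "chr2"])
by_read_name = sort_sam_records(records, ["chr1", "chr2"], by_name=True)
print([record[0] for record in by_coordinate])
print([record[0] for record in by_read_name])
```

Coordinate order follows the reference order given in the header rather than alphabetical order of the reference names.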

Documentation:

http://www.htslib.org/doc/samtools.html
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   sam_to_sorted_bam [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> sam_to_sorted_bam;
   out_1 [label="alignments"];
   sam_to_sorted_bam -> out_1;
}

Options:
  • dd-blocksize (str, optional)
    • default value: 256k
  • genome-faidx (str, required)
  • sort-by-name (bool, required)
  • temp-sort-dir (str, required) – Intermediate sort files are stored in this directory.

Required tools: dd, pigz, samtools

CPU Cores: 8

samtools_faidx

Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.
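To make the content of the index concrete, the sketch below computes the five columns of a .fai record (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) for each sequence. It is an illustration only, not the samtools implementation; fai_records is a hypothetical helper and the input is assumed to be well-formed FASTA with uniform line lengths within each sequence.

```python
def fai_records(fasta_bytes):
    """Return (name, length, offset, linebases, linewidth) per sequence.

    offset is the byte position of the first base, linebases the number of
    bases per line and linewidth the number of bytes per line including the
    line terminator.
    """
    records = []
    current = None
    offset = 0
    for line in fasta_bytes.splitlines(keepends=True):
        if line.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            name = line[1:].split()[0].decode()
            current = [name, 0, offset + len(line), 0, 0]
        elif current is not None:
            bases = len(line.rstrip(b"\r\n"))
            if current[3] == 0 and bases:
                current[3] = bases      # bases on the first sequence line
                current[4] = len(line)  # same line including the newline
            current[1] += bases
        offset += len(line)
    if current is not None:
        records.append(tuple(current))
    return records


fasta = b">chr1 example\nACGT\nAC\n>chr2\nGG\n"
for record in fai_records(fasta):
    print(*record, sep="\t")
```

With these columns, the byte position of any base can be computed directly, which is what makes region extraction from an indexed FASTA file fast.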
Connections:
  • Input Connection:
    • ‘in/sequence’
  • Output Connection:
    • ‘out/indices’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   samtools_faidx [style=filled, fillcolor="#fce94f"];
   in_0 [label="sequence"];
   in_0 -> samtools_faidx;
   out_1 [label="indices"];
   samtools_faidx -> out_1;
}

Required tools: mv, samtools

CPU Cores: 4

samtools_index

Index a coordinate-sorted BAM or CRAM file for fast random access. (Note that this does not work with SAM files even if they are bgzip compressed; to index such files, use tabix(1) instead.)

Documentation:

http://www.htslib.org/doc/samtools.html
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’
    • ‘out/index_stats’
    • ‘out/indices’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   samtools_index [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> samtools_index;
   out_1 [label="alignments"];
   samtools_index -> out_1;
   out_2 [label="index_stats"];
   samtools_index -> out_2;
   out_3 [label="indices"];
   samtools_index -> out_3;
}

Options:
  • index_type (str, required)
    • possible values: ‘bai’, ‘csi’

Required tools: ln, samtools

CPU Cores: 4

samtools_stats

samtools stats collects statistics from BAM files and outputs in a text format. The output can be visualized graphically using plot-bamstats.

Documentation:

http://www.htslib.org/doc/samtools.html
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/stats’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   samtools_stats [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> samtools_stats;
   out_1 [label="stats"];
   samtools_stats -> out_1;
}

Options:
  • dd-blocksize (str, optional)
    • default value: 256k

Required tools: dd, pigz, samtools

CPU Cores: 1

segemehl

segemehl is software for mapping short sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to map primer- or polyadenylation-contaminated reads correctly.

This step first creates two FIFOs. The first is used to provide the genome data for segemehl and the second is used for the output of the unmapped reads:

mkfifo genome_fifo unmapped_fifo
cat <genome-fasta> > genome_fifo
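The FIFO hand-off can be tried out without segemehl. The following is a minimal sketch, assuming a POSIX system: stream_through_fifo is a hypothetical helper in which one thread plays the role of cat writing the genome into the named pipe, while the caller-supplied consume function stands in for the program reading from it.

```python
import os
import shutil
import tempfile
import threading


def stream_through_fifo(source_path, consume):
    """Feed source_path into a named pipe and pass its read end to consume().

    Returns whatever consume() returns. Works on POSIX systems only.
    """
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "genome_fifo")
    os.mkfifo(fifo_path)

    def producer():
        # Opening a FIFO for writing blocks until a reader has opened it.
        with open(source_path, "rb") as source, open(fifo_path, "wb") as fifo:
            shutil.copyfileobj(source, fifo)

    writer = threading.Thread(target=producer)
    writer.start()
    try:
        with open(fifo_path, "rb") as fifo:
            return consume(fifo)
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(workdir)


# Small demonstration with a throw-away FASTA file.
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as handle:
    handle.write(b">chr1\nACGTACGT\n")
    demo_path = handle.name

received = stream_through_fifo(demo_path, lambda fifo: fifo.read())
os.remove(demo_path)
print(len(received))
```

Because the data never lands in a second regular file, the reading program sees the genome as a stream while the file system is read only once.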

The executed segemehl command is this:

segemehl -d genome_fifo -i <genome-index-file> -q <read1-fastq>
         [-p <read2-fastq>] -u unmapped_fifo -H 1 -t 11 -s -S -D 0
         -o /dev/stdout |  pigz --blocksize 4096 --processes 2 -c

The unmapped reads are saved via these commands:

cat unmapped_fifo | pigz --blocksize 4096 --processes 2 -c > <unmapped-fastq>
Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/alignments’
    • ‘out/log’
    • ‘out/unmapped’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   segemehl [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> segemehl;
   in_1 [label="second_read"];
   in_1 -> segemehl;
   out_2 [label="alignments"];
   segemehl -> out_2;
   out_3 [label="log"];
   segemehl -> out_3;
   out_4 [label="unmapped"];
   segemehl -> out_4;
}

Options:
  • MEOP (bool, optional) – output MEOP field for easier variance calling in SAM (XE:Z:)
  • SEGEMEHL (bool, optional) – output SEGEMEHL format (needs to be selected for brief)
  • accuracy (int, optional) – min percentage of matches per read in semi-global alignment (default:90)
  • autoclip (bool, optional) – autoclip unknown 3prime adapter
  • bisulfite (int, optional) – bisulfite mapping with methylC-seq/Lister et al. (=1) or bs-seq/Cokus et al. protocol (=2) (default:0)
    • possible values: ‘0’, ‘1’, ‘2’
  • brief (bool, optional) – brief output
  • clipacc (int, optional) – clipping accuracy (default:70)
  • dd-blocksize (str, optional)
    • default value: 256k
  • differences (int, optional) – search seeds initially with <n> differences (default:1)
    • default value: 1
  • dropoff (int, optional) – dropoff parameter for extension (default:8)
  • evalue (float, optional) – max evalue (default:5.000000)
  • extensionpenalty (int, optional) – penalty for a mismatch during extension (default:4)
  • extensionscore (int, optional) – score of a match during extension (default:2)
  • fix-qnames (bool, optional) – The QNAMES field of the input will be purged from spaces and everything thereafter.
  • genome (str, required) – Path to genome file
  • hardclip (bool, optional) – enable hard clipping
  • hitstrategy (int, optional) – report only best scoring hits (=1) or all (=0) (default:1)
    • default value: 1
    • possible values: ‘0’, ‘1’
  • index (str, required) – Path to genome index for segemehl
  • jump (int, optional) – search seeds with jump size <n> (0=automatic) (default:0)
  • maxinsertsize (int, optional) – maximum size of the inserts (paired end) (default:5000)
  • maxinterval (int, optional) – maximum width of a suffix array interval, i.e. a query seed will be omitted if it matches more than <n> times (default:100)
  • maxsplitevalue (float, optional) – max evalue for splits (default:50.000000)
  • minfraglen (int, optional) – min length of a spliced fragment (default:20)
  • minfragscore (int, optional) – min score of a spliced fragment (default:18)
  • minsize (int, optional) – minimum size of queries (default:12)
  • minsplicecover (int, optional) – min coverage for spliced transcripts (default:80)
  • nohead (bool, optional) – do not output header
  • order (bool, optional) – sorts the output by chromosome and position (might take a while!)
  • polyA (bool, optional) – clip polyA tail
  • prime3 (str, optional) – add 3’ adapter (default:none)
  • prime5 (str, optional) – add 5’ adapter (default:none)
  • showalign (bool, optional) – show alignments
  • silent (bool, optional) – shut up!
    • default value: True
  • splicescorescale (float, optional) – report spliced alignment with score s only if <f>*s is larger than next best spliced alignment (default:1.000000)
  • splits (bool, optional) – detect split/spliced reads (default:none)
    • default value: True
  • threads (int, optional) – start <n> threads (default:10)
    • default value: 10

Required tools: cat, dd, fix_qnames, mkfifo, pigz, segemehl

CPU Cores: 10

segemehl_generate_index

The step segemehl_generate_index generates an index for the given reference sequences.

Documentation:

http://www.bioinf.uni-leipzig.de/Software/segemehl/
Connections:
  • Input Connection:
    • ‘in/reference_sequence’
  • Output Connection:
    • ‘out/log’
    • ‘out/segemehl_index’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   segemehl_generate_index [style=filled, fillcolor="#fce94f"];
   in_0 [label="reference_sequence"];
   in_0 -> segemehl_generate_index;
   out_1 [label="log"];
   segemehl_generate_index -> out_1;
   out_2 [label="segemehl_index"];
   segemehl_generate_index -> out_2;
}

Options:
  • dd-blocksize (str, optional)
    • default value: 256k
  • index-basename (str, required) – Basename for created segemehl index.

Required tools: dd, mkfifo, pigz, segemehl

CPU Cores: 4

sra_fastq_dump

SRA tools is a suite from NCBI for handling SRA (Sequence Read Archive) files. fastq-dump is an SRA tool that dumps the content of an SRA file in FASTQ format.

The following options cannot be set, as they would interfere with the pipeline implemented in this step:

  • -O|--outdir <path> – Output directory, default is working directory (‘.’)
  • -Z|--stdout – Output to stdout, all split data become joined into single stream
  • --gzip – Compress output using gzip
  • --bzip2 – Compress output using bzip2

Multiple File Options: Setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria.

  • --split-files – Dump each read into separate file. Files will receive suffix corresponding to read number
  • --split-3 – Legacy 3-file splitting for mate-pairs: First biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads and above are ignored.
  • -G|--spot-group – Split into files by SPOT_GROUP (member name)
  • -R|--read-filter <[filter]> – Split into files by READ_FILTER value, optionally filter by value: pass|reject|criteria|redacted
  • -T|--group-in-dirs – Split into subdirectories instead of files
  • -K|--keep-empty-files – Do not delete empty files
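
A restriction like the one above could be enforced before the command line is assembled. The following is a minimal sketch, not the step's actual code: the helper name `build_fastq_dump_flags` and the `RESERVED` set are hypothetical, and the set holds only an illustrative subset of the options listed above. Step-only options such as dd-blocksize or max_cores would have to be removed before calling it.

```python
# Hypothetical helper, not taken from the step's implementation.
# Illustrative subset of the fastq-dump options that the step reserves.
RESERVED = {
    "outdir", "stdout", "gzip", "bzip2", "split-files", "split-3",
    "spot-group", "group-in-dirs", "keep-empty-files",
}


def build_fastq_dump_flags(options):
    """Turn a dict of option names and values into fastq-dump flags."""
    clashes = sorted(RESERVED.intersection(options))
    if clashes:
        raise ValueError("options reserved by the step: " + ", ".join(clashes))
    flags = []
    for name, value in sorted(options.items()):
        if value is True:
            flags.append("--" + name)  # boolean switch without argument
        elif value is not False and value is not None:
            flags.extend(["--" + name, str(value)])  # option with argument
    return flags
```

For example, `build_fastq_dump_flags({"clip": True, "minReadLen": 50})` yields `["--clip", "--minReadLen", "50"]`, while passing `{"gzip": True}` raises a `ValueError`.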

Details on fastq-dump can be found at https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

To make I/O cluster-friendly, fastq-dump does not read the SRA file directly. Instead, dd with a configurable block size provides the SRA file to fastq-dump via a FIFO.

The executed calls look like this:

    mkfifo sra_fifo
    dd bs=4M if=<sra-file> of=sra_fifo
    fastq-dump -Z sra_fifo | pigz --blocksize 4096 --processes 2 > file.fastq
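
The FIFO pattern used in these calls can be illustrated without the external tools. The sketch below is an assumption-laden stand-in, not the step's code: a block-wise copy in a thread plays the role of dd, a plain reader plays the role of fastq-dump, and Python's gzip module replaces pigz. The function names are hypothetical, and `os.mkfifo` requires a POSIX system.

```python
import gzip
import os
import tempfile
import threading


def feed_fifo(src_path, fifo_path, blocksize):
    """Stand-in for `dd bs=... if=src of=fifo`: copy in fixed-size blocks."""
    with open(src_path, "rb") as src, open(fifo_path, "wb") as fifo:
        for block in iter(lambda: src.read(blocksize), b""):
            fifo.write(block)


def stream_through_fifo(src_path, out_path, blocksize=256 * 1024):
    """Read src_path via a FIFO and store what the consumer sees, gzipped."""
    if not os.path.isfile(src_path):
        raise FileNotFoundError(src_path)  # avoid waiting on a dead writer
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "sra_fifo")
    os.mkfifo(fifo_path)  # same role as `mkfifo sra_fifo`
    writer = threading.Thread(
        target=feed_fifo, args=(src_path, fifo_path, blocksize))
    writer.start()
    try:
        # The reader stands in for fastq-dump, gzip stands in for pigz.
        with open(fifo_path, "rb") as fifo, gzip.open(out_path, "wb") as out:
            for chunk in iter(lambda: fifo.read(blocksize), b""):
                out.write(chunk)
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
```

The source file is read in large sequential blocks whose size is set by `blocksize`, which is the property the dd-blocksize option controls in the step.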

Connections:
  • Input Connection:
    • ‘in/sequence’
  • Output Connection:
    • ‘out/first_read’
    • ‘out/log’
    • ‘out/second_read’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   sra_fastq_dump [style=filled, fillcolor="#fce94f"];
   in_0 [label="sequence"];
   in_0 -> sra_fastq_dump;
   out_1 [label="first_read"];
   sra_fastq_dump -> out_1;
   out_2 [label="log"];
   sra_fastq_dump -> out_2;
   out_3 [label="second_read"];
   sra_fastq_dump -> out_3;
}

Options:
  • accession (str, optional) – Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
  • aligned (bool, optional) – Dump only aligned sequences
  • aligned-region (str, optional) – Filter by position on genome. Name can either be accession.version (ex: NC_000001.10) or file specific name (ex: “chr1” or “1”). “from” and “to” are 1-based coordinates. <name[:from-to]>
  • clip (bool, optional) – Apply left and right clips
  • dd-blocksize (str, optional) - default value: 256k
  • defline-qual (str, optional) – Defline format specification for quality.
  • defline-seq (str, optional) – Defline format specification for sequence.
  • disable-multithreading (bool, optional) – disable multithreading
  • dumpbase (bool, optional) – Formats sequence using base space (default for other than SOLiD).
  • dumpcs (bool, optional) – Formats sequence using color space (default for SOLiD), “cskey” may be specified for translation.
  • fasta (int, optional) – FASTA only, no qualities, optional line wrap width (set to zero for no wrapping). <[line width]>
  • helicos (bool, optional) – Helicos style defline
  • legacy-report (bool, optional) – use legacy style “Written spots” for tool
  • log-level (str, optional) – Logging level as number or enum string. One of (fatal|sys|int|err|warn|info) or (0-5). Current/default is warn. <level>
  • matepair-distance (str, optional) – Filter by distance between matepairs. Use “unknown” to find matepairs split between the references. Use from-to to limit matepair distance on the same reference. <from-to|unknown>
  • maxSpotId (int, optional) – Maximum spot id to be dumped. Use with “minSpotId” to dump a range.
  • max_cores (int, optional) – Maximum number of cores available on the cluster
    • default value: 10
  • minReadLen (int, optional) – Filter by sequence length >= <len>
  • minSpotId (int, optional) – Minimum spot id to be dumped. Use with “maxSpotId” to dump a range.
  • ncbi_error_report (str, optional) – Control program execution environment report generation (if implemented). One of (never|error|always). Default is error. <error>
  • offset (int, optional) – Offset to use for quality conversion, default is 33
  • origfmt (bool, optional) – Defline contains only original sequence name
  • qual-filter (bool, optional) – Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N
  • qual-filter-1 (bool, optional) – Filter used in current 1000 Genomes data
  • read-filter (str, optional) – Split into files by READ_FILTER value, optionally filter by value: pass|reject|criteria|redacted
  • readids (bool, optional) – Append read id after spot id as “accession.spot.readid” on defline.
  • skip-technical (bool, optional) – Dump only biological reads
  • split-spot (str, optional) – Split spots into individual reads
  • spot-groups (str, optional) – Filter by SPOT_GROUP (member): name[,...]
  • suppress-qual-for-cskey (bool, optional) – suppress quality-value for cskey
  • table (str, optional) – Table name within cSRA object, default is “SEQUENCE”
  • unaligned (bool, optional) – Dump only unaligned sequences
  • verbose (bool, optional) – Increase the verbosity level of the program. Use multiple times for more verbosity.

Required tools: dd, fastq-dump, mkfifo, pigz

CPU Cores: 10

subsetMappedReads

subsetMappedReads selects a provided number of mapped reads from a file in .sam or .bam format. Depending on the set options the first N mapped reads and their mates (for paired end sequencing) are returned in .sam format. If the number of requested reads exceeds the number of available mapped reads, all mapped reads are returned.
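
The selection described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the step's implementation (which relies on samtools and shell tools): the function name is hypothetical, input is a list of SAM text lines, a record counts as mapped when SAM flag bit 0x4 is unset, and mates are identified by a shared read name.

```python
UNMAPPED = 0x4  # SAM flag bit: segment unmapped


def subset_mapped_reads(sam_lines, n_reads, paired_end=False):
    """Keep header lines plus records of the first n_reads mapped reads."""
    lines = list(sam_lines)
    names = set()  # read names of the selected mapped reads
    for line in lines:
        if line.startswith("@"):
            continue
        name, flag = line.split("\t", 2)[:2]
        if int(flag) & UNMAPPED or name in names:
            continue
        if len(names) == n_reads:
            break  # enough reads selected
        names.add(name)
    result = []
    for line in lines:
        if line.startswith("@"):
            result.append(line)  # headers pass through unchanged
            continue
        name, flag = line.split("\t", 2)[:2]
        if name not in names:
            continue
        # In paired-end mode the mate is kept even if it is unmapped.
        if paired_end or not int(flag) & UNMAPPED:
            result.append(line)
    return result
```

If `n_reads` exceeds the number of mapped reads, the first loop simply runs to the end and all mapped reads are returned, matching the behaviour stated above.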
Connections:
  • Input Connection:
    • ‘in/alignments’
  • Output Connection:
    • ‘out/alignments’
    • ‘out/log’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   subsetMappedReads [style=filled, fillcolor="#fce94f"];
   in_0 [label="alignments"];
   in_0 -> subsetMappedReads;
   out_1 [label="alignments"];
   subsetMappedReads -> out_1;
   out_2 [label="log"];
   subsetMappedReads -> out_2;
}

Options:
  • Nreads (str, required) – Number of reads to extract from input file.
  • genome-faidx (str, required)
  • paired_end (bool, required) – The reads are expected to have a mate, due to paired end sequencing.

Required tools: cat, dd, head, pigz, samtools

CPU Cores: 1

tophat2

TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

http://tophat.cbcb.umd.edu/

Typical command line:

    tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
Connections:
  • Input Connection:
    • ‘in/first_read’
    • ‘in/second_read’
  • Output Connection:
    • ‘out/align_summary’
    • ‘out/alignments’
    • ‘out/deletions’
    • ‘out/insertions’
    • ‘out/junctions’
    • ‘out/log_stderr’
    • ‘out/misc_logs’
    • ‘out/prep_reads’
    • ‘out/unmapped’

digraph foo {
   rankdir = LR;
   splines = true;
   graph [fontname = Helvetica, fontsize = 12, size = "14, 11", nodesep = 0.2, ranksep = 0.3];
   node [fontname = Helvetica, fontsize = 12, shape = rect];
   edge [fontname = Helvetica, fontsize = 12];
   tophat2 [style=filled, fillcolor="#fce94f"];
   in_0 [label="first_read"];
   in_0 -> tophat2;
   in_1 [label="second_read"];
   in_1 -> tophat2;
   out_2 [label="align_summary"];
   tophat2 -> out_2;
   out_3 [label="alignments"];
   tophat2 -> out_3;
   out_4 [label="deletions"];
   tophat2 -> out_4;
   out_5 [label="insertions"];
   tophat2 -> out_5;
   out_6 [label="junctions"];
   tophat2 -> out_6;
   out_7 [label="log_stderr"];
   tophat2 -> out_7;
   out_8 [label="misc_logs"];
   tophat2 -> out_8;
   out_9 [label="prep_reads"];
   tophat2 -> out_9;
   out_10 [label="unmapped"];
   tophat2 -> out_10;
}

Options:
  • index (str, required) – Path to genome index for tophat2
  • library_type (str, required) – The default is unstranded (fr-unstranded). If either fr-firststrand or fr-secondstrand is specified, every read alignment will have an XS attribute tag. Choose the library type that matches the RNA-seq protocol; see the TopHat manual (https://ccb.jhu.edu/software/tophat/manual.shtml).
    • possible values: ‘fr-unstranded’, ‘fr-firststrand’, ‘fr-secondstrand’
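
To show how these two options relate to the typical command line given earlier, here is a minimal sketch of assembling the argument list. It is not the step's actual code: the function name, the `output_dir` default and the way read files are passed in are assumptions made for illustration. The flags `--library-type`, `--num-threads` and `--output-dir` are standard TopHat options.

```python
LIBRARY_TYPES = ("fr-unstranded", "fr-firststrand", "fr-secondstrand")


def tophat2_command(index, library_type, first_reads, second_reads=(),
                    cores=6, output_dir="tophat_out"):
    """Build a tophat2 argument list (hypothetical helper)."""
    if library_type not in LIBRARY_TYPES:
        raise ValueError("unsupported library_type: %r" % (library_type,))
    cmd = [
        "tophat2",
        "--library-type", library_type,
        "--num-threads", str(cores),
        "--output-dir", output_dir,  # assumed default name
        index,
        ",".join(first_reads),  # comma-separated first-read files
    ]
    if second_reads:
        cmd.append(",".join(second_reads))  # only for paired-end data
    return cmd
```

The resulting list can be handed to `subprocess.run` as is, which avoids shell quoting issues with file names.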

Required tools: mkdir, mv, tar, tophat2

CPU Cores: 6